
Conversation

@hero78119 (Collaborator) commented Dec 4, 2025

Related to #1145
Built on top of #1202

design rationale

Make the init chip span shards, and ensure the initialized addresses are consecutive across shards without overlap. The mechanism for reading from a previous shard works as usual.
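
Below is a minimal sketch of that address split, under the assumption that each shard records a watermark of the highest address it touches; the names `ShardInitRange` and `split_init_ranges` are hypothetical and not the actual Ceno API.

```rust
/// Hypothetical sketch: shard i initializes exactly [prev_watermark, watermark_i),
/// so per-shard init ranges are consecutive and never overlap.
#[derive(Debug)]
struct ShardInitRange {
    shard_id: usize,
    start: u32, // inclusive
    end: u32,   // exclusive
}

/// `region_start` would be e.g. platform.heap.start; `watermarks[i]` is the
/// highest address (exclusive) touched up to and including shard i.
fn split_init_ranges(region_start: u32, watermarks: &[u32]) -> Vec<ShardInitRange> {
    let mut ranges = Vec::new();
    let mut prev = region_start;
    for (shard_id, &wm) in watermarks.iter().enumerate() {
        ranges.push(ShardInitRange { shard_id, start: prev, end: wm });
        prev = wm; // the next shard starts exactly where this one ends
    }
    ranges
}

fn main() {
    let ranges = split_init_ranges(0x2800_0000, &[0x2800_1000, 0x2800_1000, 0x2800_4000]);
    for pair in ranges.windows(2) {
        assert_eq!(pair[0].end, pair[1].start); // consecutive: no gap, no overlap
    }
}
```

A shard that touches no new addresses simply gets an empty range (start == end), as in the second entry above.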

remove pitfall of Platform clone

Wrap Platform's prog_data in an Arc; otherwise each Platform clone carries a massive, hidden prover cost that we were not aware of.
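
A minimal sketch of the fix (the field layout shown here is illustrative, not the exact struct in the codebase): putting the large program image behind an `Arc` makes `Platform::clone` copy a pointer instead of the whole byte buffer.

```rust
use std::sync::Arc;

// Before: every `Platform` clone deep-copied the whole program image.
// #[derive(Clone)]
// struct Platform { prog_data: Vec<u8>, /* ... */ }

// After: clones only bump a reference count; the bytes are shared.
#[derive(Clone)]
struct Platform {
    prog_data: Arc<Vec<u8>>, // hypothetical field layout
}

fn main() {
    let platform = Platform { prog_data: Arc::new(vec![0u8; 64 << 20]) };
    let cheap_copy = platform.clone(); // O(1), no 64 MiB memcpy
    assert!(Arc::ptr_eq(&platform.prog_data, &cheap_copy.prog_data));
}
```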

benchmark

Run on 23817600 with K=6, cache trace = none, 1x4090

| Component | Base Time | Init-Across-Shard Time | Custom Serde Time | Init-Across-Shard Improvement | Custom Serde Improvement |
|---|---|---|---|---|---|
| app.prove | 203 s | 175 s | 167 s | ↓ 13.8% | ↓ 17.7% |
| emulator.preflight-execute | 43.9 s | 35.8 s | 36.7 s | ↓ 18.5% | ↓ 16.4% |
| app_prove.inner | 158 s | 139 s | 130 s | ↓ 12.0% | ↓ 17.7% |
| create_proof_of_shard (shard_id = 0) | 14.9 s | 4.91 s | 8.02 s | ↓ 67.0% | ↓ 46.2% |
| create_proof_of_shard (shard_id = 1) | 16.4 s | 12.7 s | 3.56 s | ↓ 22.6% | ↓ 78.3% |

verifier soundness TODOs in next PR

This PR includes some verifier changes, but it is not fully sound yet. A refactor to constrain a specific chip's number of instances is on the way; in short, it will support pi[<chip_id>].num_instance == chip_proof.num_instance. This is needed to ensure the init heap/hint in each shard is constrained within [platform.heap.start, platform.heap.end) or [platform.hint.start, platform.hint.end). A rough sketch of these checks follows the list below.

  • constrain pi[heap_length] and pi[hint_length] to equal the number of instances of the respective init chip.
  • constrain pi[xxx_start_addr] + pi[xxx_length] to stay within range.
  • apply to both the Rust and recursion verifiers.
  • constrain the init chip across shards to allow only one chip proof.
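
As referenced above, here is a hedged sketch of what such checks could look like on the Rust-verifier side. All names and the public-input layout here are hypothetical, and the unit of `pi_length` (entries vs. bytes) is glossed over; this is not the actual verifier code.

```rust
use std::ops::Range;

/// Assumed per-chip proof metadata (illustrative).
struct ChipProof {
    num_instances: u64,
}

/// Sketch of the planned checks for one init chip (heap or hint):
/// pin the instance count to the public input and range-check start + length.
fn check_init_chip(
    pi_start_addr: u64,
    pi_length: u64,
    region: Range<u64>, // e.g. platform.heap.start..platform.heap.end
    chip_proof: &ChipProof,
) -> Result<(), String> {
    // pi[<chip_id>].num_instance == chip_proof.num_instance
    if pi_length != chip_proof.num_instances {
        return Err("init-chip instance count does not match public input".into());
    }
    // pi[xxx_start_addr] + pi[xxx_length] must stay inside the platform region.
    let end = pi_start_addr
        .checked_add(pi_length)
        .ok_or_else(|| "start + length overflows".to_string())?;
    if pi_start_addr < region.start || end > region.end {
        return Err("init range escapes the platform region".into());
    }
    Ok(())
}
```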

Another TODO for hint reads and rkyv

rkyv serializes data from hint_end -> hint_start, so right now hint reads still mostly happen in the first shard, because on the first read max_hint_addr jumps to the end address. This makes the first shard still need to write a bunch of records to the shard RAM circuit. To improve this further, we need to find a way to make rkyv go from lower addresses to higher ones, which would solve the first-shard issue entirely.
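
To illustrate the issue, here is a toy computation of the hint watermark (purely illustrative; the actual max_hint_addr tracking in the emulator may look different). A single early metadata read near hint_end drags the watermark to the top of the region, so the first shard has to cover almost the whole hint segment.

```rust
/// Toy illustration: the watermark is just the highest hint address read so far.
fn hint_watermark(reads: &[u32]) -> u32 {
    reads.iter().copied().max().unwrap_or(0)
}

fn main() {
    // Sequential reads keep the watermark near hint_start...
    let sequential = [0x2800_0004, 0x2800_0008, 0x2800_0010];
    // ...but rkyv's early metadata fetch near hint_end inflates it immediately.
    let with_meta_fetch = [0x2800_0004, 0x2800_0008, 0x280c_8770, 0x2800_0010];

    assert_eq!(hint_watermark(&sequential), 0x2800_0010);
    assert_eq!(hint_watermark(&with_meta_fetch), 0x280c_8770);
}
```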

@hero78119 hero78119 marked this pull request as draft December 4, 2025 02:02
@hero78119 hero78119 changed the title from "track heap watermark" to "OMC: init heap table across different shard" Dec 5, 2025
@hero78119 (Collaborator, Author):
need to figure out soundness #1178

@hero78119 hero78119 added the speed label Dec 5, 2025
@hero78119 (Collaborator, Author):
TODO

  • check whether anything remains to revert from debugging, e.g. debug logs or par -> seq changes

@hero78119 hero78119 marked this pull request as ready for review December 22, 2025 13:28
@hero78119 hero78119 changed the title from "OMC: init heap table across different shard" to "OMC: init heap/hint table across different shard" Dec 22, 2025
@kunxian-xia (Collaborator) left a comment:
1st round of review

} else {
    system_config
        .mmu_config
        .assign_init_table_circuit(
Collaborator:
In fact we can just avoid this step as we only assign static init tables in the 0th shard.

@hero78119 (Collaborator, Author), Dec 29, 2025:
Although we could technically omit this for shard id > 0, in some places we need every chip present with an empty RMM, so we still need it; e.g. removing this causes mock_prover to fail.

// │ │ └─ later rw? YES (rw in >0 exists) -> ShardRAM
// │ │
// │ └─ rw occurs in current shard (current shard may be >0)
// │ ├─ later rw? NO (no rw in later) -> ShardRAM + LocalFinalize
Collaborator:
typo? It should be "LocalFinalize".

@hero78119 (Collaborator, Author):
"ShardRAM + LocalFinalize" correct, for example 0th shard is previous shard, then need to global read from ShardRAM

// │ └─ later rw? YES -> ShardRAM
// │
// └─ NO: init in a previous shard
// ├─ later rw? NO -> LocalFinalize
@hero78119 (Collaborator, Author):
TODO: will update this to `later rw? NO -> ShardRAM + LocalFinalize`.
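
For reference, a hedged restatement of the visible branches of the quoted decision tree as code; the enum and function names are made up, and the real routing logic in the diff has more branches and inputs than shown here.

```rust
/// Hypothetical restatement of the quoted comment, not the actual routing code.
#[derive(Debug, PartialEq)]
enum MemChips {
    ShardRam,
    LocalFinalize,
    ShardRamAndLocalFinalize,
}

/// `rw_in_current` / `rw_in_later`: whether the address is read/written in the
/// current shard, resp. in any later shard.
fn route(rw_in_current: bool, rw_in_later: bool) -> MemChips {
    match (rw_in_current, rw_in_later) {
        // rw in the current shard, nothing later: expose the value via ShardRAM
        // and also finalize it locally ("ShardRAM + LocalFinalize").
        (true, false) => MemChips::ShardRamAndLocalFinalize,
        // rw in a later shard: ShardRAM only; finalization belongs to that shard.
        (_, true) => MemChips::ShardRam,
        // init in a previous shard, no rw afterwards: currently LocalFinalize,
        // to be updated to ShardRAM + LocalFinalize per the TODO above.
        (false, false) => MemChips::LocalFinalize,
    }
}
```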

@kunxian-xia (Collaborator) left a comment:
LGTM 👍

@kunxian-xia kunxian-xia added this pull request to the merge queue Dec 30, 2025
Merged via the queue into master with commit fe4b65d Dec 30, 2025
4 checks passed
@kunxian-xia kunxian-xia deleted the feat/shard_mem_init branch December 30, 2025 08:20
github-merge-queue bot pushed a commit that referenced this pull request Dec 31, 2025
Follow-up on #1171.
Replace `rkyv` with a custom serde for two reasons:
- Serialization to bytes requires `rkyv`-friendly structs, or an external library, e.g. "rkyv + bincode". `rkyv`-friendly structs are not friendly to existing applications, particularly when the structs are defined in third-party libraries. We therefore previously integrated bincode for the serialization to bytes, which adds extra effort for the guest program, since we need bincode to deserialize back into an owned struct.
- During deserialization, `rkyv` accesses a high address to retrieve some meta information before the sequential read. Below is an example read pattern:
 ```
 hint address 28000004
 hint address 28000008
 hint address 280c8770 <- high addr accessed to fetch meta data
 hint address 280c876c
 hint address 280c876c
 hint address 280c8770
 hint address 28000010
 hint address 28000010
 hint address 28000010
 hint address 28000010
 hint address 28000014
 hint address 28000014
 hint address 28000014
 ...
 ```
The high-address access is not friendly if we want to record the max accessed
address in each shard for memory regions initialized across shards.
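
A minimal sketch of the idea behind a word-oriented serde (inspired by the openvm serde linked below; the actual `ceno_serde` API may differ): values are encoded into a flat stream of `u32` words that the guest consumes strictly front-to-back, so every hint read touches a monotonically increasing address.

```rust
/// Illustrative word-stream reader: the guest only ever advances `cursor`,
/// so hint addresses are consumed from low to high with no look-ahead.
struct WordReader<'a> {
    words: &'a [u32],
    cursor: usize,
}

impl<'a> WordReader<'a> {
    fn new(words: &'a [u32]) -> Self {
        Self { words, cursor: 0 }
    }

    fn read_u32(&mut self) -> u32 {
        let w = self.words[self.cursor];
        self.cursor += 1;
        w
    }

    /// Length-prefixed byte vector: one length word, then ceil(len / 4) data words.
    fn read_bytes(&mut self) -> Vec<u8> {
        let len = self.read_u32() as usize;
        let mut out = Vec::with_capacity(len);
        while out.len() < len {
            let word = self.read_u32().to_le_bytes();
            let take = (len - out.len()).min(4);
            out.extend_from_slice(&word[..take]);
        }
        out
    }
}

fn main() {
    // "hi" encoded as: a length word (2), then one data word holding the bytes.
    let words = [2u32, u32::from_le_bytes(*b"hi\0\0")];
    let mut r = WordReader::new(&words);
    assert_eq!(r.read_bytes(), b"hi".to_vec());
}
```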


The new `ceno_serde` crate is credited to
https://github.com/openvm-org/openvm/tree/main/crates/toolchain/openvm/src/serde

## benchmark

Run on 23817600 with K=6, cache trace = none, 1x4090

| Component | Base Time | Init-Across-Shard Time | Custom Serde Time | Init-Across-Shard Improvement | Custom Serde Improvement |
|---|---|---|---|---|---|
| app.prove | 203 s | 175 s | 160 s | ↓ 13.8% | ↓ 21.2% |
| emulator.preflight-execute | 43.9 s | 35.8 s | 36.6 s | ↓ 18.5% | ↓ 16.6% |
| app_prove.inner | 158 s | 139 s | 123 s | ↓ 12.0% | ↓ 22.2% |
| create_proof_of_shard (shard_id = 0) | 14.9 s | 4.91 s | 8.46 s | ↓ 67.0% | ↓ 43.2% |
| create_proof_of_shard (shard_id = 1) | 16.4 s | 12.7 s | 3.74 s | ↓ 22.6% | ↓ 77.2% |

Overall cycles: `322620252` -> `315955342` (↓ 2.06%)

> Custom serde updated setting: max cells changed from `(1 << 30) * 8 / 4 / 2` to `(1 << 30) * 10 / 4 / 2`; keccak blowup K=6 unchanged.