
Partial zarr download/upload (stage: Design + Implementation)#1816

Draft
yarikoptic wants to merge 6 commits into master from partial-zarr

@yarikoptic yarikoptic commented Mar 2, 2026

Summary

Design and implementation for partial zarr download and upload support, addressing #1462, #1474, and related archive issues.

The PR covers five areas:

  1. --zarr TYPE:PATTERN filtering for dandi download — glob, path, and regex filters for selecting entries within zarr assets, with a metadata alias for common zarr metadata files
  2. URL parsing with zarr boundary detection: AssetZarrEntryURL to handle URLs like dandi://dandi/000108/.../file.ome.zarr/0/0/0
  3. --zarr-mode {full, patch} for dandi upload — patch mode uploads changed files without deleting remote files absent locally
  4. Checksums and manifests — documents that per-directory checksums are computed hierarchically by the zarr_checksum library but are NOT persisted (only the root digest is stored in the DB); legacy .checksum files exist on S3 at zarr-checksums/ for ~72% of older zarrs but are orphaned since Dec 2022
  5. dandi ls for zarr contents — listing files within a zarr asset when URL points into a zarr

Key findings from investigation

  • The zarr_checksum algorithm IS hierarchical (Merkle tree, bottom-up via ZarrChecksumTree)
  • The archive's ingest_zarr_archive task computes checksums entirely in memory and stores only the root digest
  • Per-directory .checksum files on S3 (zarr-checksums/ prefix) were written by ZarrChecksumFileUpdater, which was removed in dandi-archive PRs #1390 and #1395 (Dec 2022). Legacy files remain for older zarrs, but no API exposes them
  • Subtree checksum verification is not possible today without recomputation from file ETags
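
To make the "hierarchical" finding concrete, here is a minimal sketch of a bottom-up, Merkle-style directory digest. It is not the zarr_checksum algorithm or its digest format: the function name `tree_digest`, the use of SHA-256, and the way child digests are combined are all assumptions made for illustration. It only shows why a subtree digest can be recomputed from per-file digests (for example ETags) without reading anything outside that subtree.

```python
import hashlib
from collections import defaultdict


def tree_digest(file_digests: dict[str, str], root: str = "") -> str:
    """Return a Merkle-style digest for `root`, computed bottom-up.

    `file_digests` maps POSIX-style relative paths to per-file digests.
    Hypothetical scheme: a directory digest is the SHA-256 of its sorted
    "name:digest" child lines, where subdirectories contribute their own
    recursively computed digest.
    """
    prefix = f"{root}/" if root else ""
    files: dict[str, str] = {}
    subdirs: dict[str, dict[str, str]] = defaultdict(dict)
    for path, digest in file_digests.items():
        if not path.startswith(prefix):
            continue  # outside the requested subtree
        head, sep, _ = path[len(prefix):].partition("/")
        if sep:
            subdirs[head][path] = digest  # belongs to a child directory
        else:
            files[head] = digest  # direct child file
    lines = [f"{name}:{digest}" for name, digest in files.items()]
    for name, members in subdirs.items():
        lines.append(f"{name}/:{tree_digest(members, prefix + name)}")
    return hashlib.sha256("\n".join(sorted(lines)).encode()).hexdigest()
```

With such a scheme, `tree_digest(digests, "0/0")` changes only when a file under `0/0/` changes, while the root digest changes when any file changes.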

Implementation

New files

  • dandi/zarr_filter.py: ZarrFilter dataclass with glob/path/regex matching, parse_zarr_filter(), make_zarr_entry_filter(), ZARR_FILTER_ALIASES (includes metadata alias)
  • dandi/tests/test_zarr_filter.py: 52 unit tests covering all filter types, parsing, aliases, edge cases, invalid regex validation
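
For reviewers who want a feel for the TYPE:PATTERN design without opening the diff, the following is a rough sketch of how such a filter could be parsed and applied. The names `ZarrFilter`, `parse_zarr_filter`, and `ZARR_FILTER_ALIASES` mirror the PR, but the signatures and bodies here are guesses, not the code in dandi/zarr_filter.py. The semantics are assumptions too: `path` is treated as a directory prefix, `regex` uses a search, `**/` matches zero or more directories, and `matches_any` is a hypothetical helper for the OR composition.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Alias expansion as listed in the review checklist of this PR.
ZARR_FILTER_ALIASES = {
    "metadata": ["glob:**/.z*", "glob:**/zarr.json", "glob:**/.zmetadata"],
}


def _glob_to_regex(pattern: str) -> str:
    """Translate a glob to a regex; `*` and `?` stay within one segment."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


@dataclass
class ZarrFilter:
    kind: str  # "glob", "path", or "regex"
    pattern: str
    _regex: Optional[re.Pattern] = field(default=None, init=False, repr=False)

    def __post_init__(self) -> None:
        if self.kind not in ("glob", "path", "regex"):
            raise ValueError(f"unknown filter type: {self.kind!r}")
        if self.kind == "path":
            return
        source = _glob_to_regex(self.pattern) if self.kind == "glob" else self.pattern
        try:
            # Compile once at construction so bad patterns fail early.
            self._regex = re.compile(source)
        except re.error as exc:
            raise ValueError(f"invalid regex {self.pattern!r}: {exc}") from exc

    def matches(self, entry: str) -> bool:
        if self.kind == "path":
            prefix = self.pattern.strip("/")
            return entry == prefix or entry.startswith(prefix + "/")
        if self.kind == "glob":
            return self._regex.fullmatch(entry) is not None
        return self._regex.search(entry) is not None


def parse_zarr_filter(spec: str) -> list[ZarrFilter]:
    """Parse 'TYPE:PATTERN' or an alias name into one or more filters."""
    if spec in ZARR_FILTER_ALIASES:
        return [f for s in ZARR_FILTER_ALIASES[spec] for f in parse_zarr_filter(s)]
    kind, sep, pattern = spec.partition(":")
    if not sep or not pattern:
        raise ValueError(f"expected TYPE:PATTERN or an alias, got {spec!r}")
    return [ZarrFilter(kind, pattern)]


def matches_any(filters: list[ZarrFilter], entry: str) -> bool:
    """OR composition: an entry is selected if any filter matches it."""
    return any(f.matches(entry) for f in filters)
```

Under these assumptions, `matches_any(parse_zarr_filter("metadata"), "0/.zarray")` is true, while a chunk path such as `0/0/0` is not selected.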

Modified files

  • dandi/dandiarchive.py: split_zarr_location(), AssetZarrEntryURL class, updated parse_dandi_url() to detect zarr boundaries
  • dandi/download.py: zarr_filters parameter on download(), filter threading through Downloader and _download_zarr(), skip deletion/checksum when filter active
  • dandi/files/zarr.py: zarr_mode: Literal["full", "patch"] on iter_upload(), patch mode skips remote file deletion and client-side checksum verification
  • dandi/upload.py: zarr_mode parameter on upload(), conditional pass-through for ZarrAsset
  • dandi/cli/cmd_download.py: --zarr click option (multiple, OR logic)
  • dandi/cli/cmd_upload.py: --zarr-mode click option
  • dandi/cli/cmd_ls.py: list zarr entries when URL is AssetZarrEntryURL
  • dandi/cli/tests/test_download.py: updated mock expectations for new zarr_filters parameter
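
As an illustration of zarr boundary detection, here is one way a path could be split into the asset part and the entry part. This is a sketch only: the return shape, the `suffix` parameter, and the rule "first segment ending in .zarr" are assumptions, not the behavior of the PR's split_zarr_location(). The example paths in the usage note are made up.

```python
from typing import Optional


def split_zarr_location(path: str, suffix: str = ".zarr") -> tuple[str, Optional[str]]:
    """Split an asset path at the first segment that ends in `suffix`.

    Returns (asset_path, entry_path). entry_path is None when the path
    stops at the zarr itself or contains no zarr segment at all.
    """
    parts = path.split("/")
    for i, part in enumerate(parts):
        if part.endswith(suffix):
            asset = "/".join(parts[: i + 1])
            # Anything after the zarr segment addresses an entry inside it.
            entry = "/".join(parts[i + 1 :]).strip("/")
            return asset, entry or None
    return path, None
```

For a hypothetical path `sub-01/img.ome.zarr/0/0/0` this returns `("sub-01/img.ome.zarr", "0/0/0")`, and for `sub-01/file.nwb` it returns the path unchanged with `None`.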

Integration tests

  • dandi/tests/test_download.py — 6 tests (glob filter, metadata alias, no-delete, path filter, nonexistent filter, sync conflict)
  • dandi/tests/test_upload.py — 3 tests (patch no-delete, full delete, patch updates)
  • dandi/tests/test_dandiarchive.py — 3 URL parsing cases + 8 split_zarr_location cases

Review checklist

Please review the design at doc/design/partial-zarr.md and comment on:

  • --zarr TYPE:PATTERN syntax — is the filter approach right? Are glob/path/regex the right types?
  • metadata alias expansion — does glob:**/.z* + glob:**/zarr.json + glob:**/.zmetadata cover all cases?
  • --zarr-mode patch semantics — is "upload without deleting" the right default for patch? Should subtree cleanup happen?
  • URL parsing — is AssetZarrEntryURL with zarr boundary detection the right approach?
  • Checksum strategy — relying on per-file ETags for partial ops, deferring subtree checksums to future manifests
  • Open questions in the doc (OR composition chosen, --sync conflict raises error, server-side glob deferred)
  • Should legacy zarr-checksums/ files on S3 be cleaned up as part of this or separately?
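
To help frame the --zarr-mode question, the difference between the two modes can be sketched as a planning step over local and remote listings. The function name `plan_zarr_upload` and the use of plain digest strings are hypothetical and do not come from the PR; the sketch only encodes the semantics described above (patch uploads changed files and never deletes remote-only ones).

```python
def plan_zarr_upload(
    local: dict[str, str],
    remote: dict[str, str],
    mode: str = "full",
) -> tuple[list[str], list[str]]:
    """Return (paths_to_upload, paths_to_delete) for a zarr upload.

    `local` and `remote` map entry paths to digests. In both modes, new or
    changed local files are uploaded. Only "full" mode deletes remote files
    that are absent locally.
    """
    if mode not in ("full", "patch"):
        raise ValueError(f"unknown zarr mode: {mode!r}")
    to_upload = sorted(p for p, digest in local.items() if remote.get(p) != digest)
    to_delete = sorted(set(remote) - set(local)) if mode == "full" else []
    return to_upload, to_delete
```

If subtree cleanup were added to patch mode, it would amount to restricting the `to_delete` set to a chosen prefix rather than leaving it empty.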

TODO

  • Implement dandi/zarr_filter.py — filter parsing and matching
  • Implement AssetZarrEntryURL and split_zarr_location() in dandi/dandiarchive.py
  • Add --zarr option to dandi download CLI
  • Modify _download_zarr() for partial download support
  • Add --zarr-mode option to dandi upload CLI
  • Implement patch mode in iter_upload() (dandi/files/zarr.py)
  • Thread zarr_mode through dandi/upload.py
  • dandi ls zarr contents support
  • Unit tests for zarr_filter.py (52 tests)
  • Unit tests for URL parsing and split_zarr_location (11 test cases)
  • Integration tests for download filtering (6 tests)
  • Integration tests for upload patch mode (3 tests)
  • Fix existing CLI mock test expectations
  • Code review fixes (regex pre-compilation, Literal typing, ** collapse, redundant override removal)
  • Run integration tests against docker-compose archive instance
  • Coordinate with dandi-archive on manifest design (#2702) for future subtree checksum support

🤖 Generated with Claude Code

@yarikoptic yarikoptic marked this pull request as draft March 2, 2026 21:37
@yarikoptic yarikoptic requested review from kabilar and satra March 2, 2026 21:38
@yarikoptic yarikoptic changed the title from "Design: partial zarr download/upload" to "Partial zarr download/upload (stage: Design)" Mar 2, 2026
codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 91.64491% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.57%. Comparing base (5f03d9b) to head (e32fb86).

Files with missing lines         Patch %   Lines
dandi/dandiarchive.py            64.28%    10 Missing ⚠️
dandi/download.py                86.44%    8 Missing ⚠️
dandi/upload.py                  37.50%    5 Missing ⚠️
dandi/zarr_filter.py             95.31%    3 Missing ⚠️
dandi/cli/cmd_ls.py              60.00%    2 Missing ⚠️
dandi/files/zarr.py              90.00%    2 Missing ⚠️
dandi/tests/test_download.py     96.15%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1816      +/-   ##
==========================================
+ Coverage   75.12%   75.57%   +0.44%     
==========================================
  Files          84       86       +2     
  Lines       11930    12230     +300     
==========================================
+ Hits         8963     9243     +280     
- Misses       2967     2987      +20     
Flag        Coverage Δ
unittests   75.57% <91.64%> (+0.44%) ⬆️


@yarikoptic yarikoptic added the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Mar 3, 2026
yarikoptic and others added 3 commits March 6, 2026 08:10
Covers five areas:
- --zarr TYPE:PATTERN filtering for download (glob, path, regex)
- URL parsing with zarr boundary detection (AssetZarrEntryURL)
- --zarr-mode {full, patch} for upload
- Checksums and manifests (per-directory checksums are NOT
  persisted on the archive; legacy .checksum files exist on S3
  under zarr-checksums/ for ~72% of older zarrs but are orphaned)
- dandi ls for zarr contents

Related: #1462, #1474

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --zarr CLI option for download to filter entries within zarr assets
(glob/path/regex patterns with predefined 'metadata' alias), and
--zarr-mode option for upload to support 'patch' mode (upload/update
without deleting remote-only files).

Key changes:
- New dandi/zarr_filter.py: filter parsing, matching, and aliases
- URL parsing: AssetZarrEntryURL for URLs pointing into zarr assets
- Download pipeline: thread zarr_entry_filter through Downloader and
  _download_zarr, skip deletion and checksum when filter active
- Upload pipeline: zarr_mode='patch' skips remote file deletion and
  client-side checksum verification
- dandi ls: list zarr entries when URL points into a zarr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-compile regex patterns at ZarrFilter construction time, catching
  invalid patterns early instead of on every matches() call (B1)
- Remove redundant get_asset_download_path override in AssetZarrEntryURL
  that was identical to the inherited SingleAssetURL method (H1)
- Use Literal["full", "patch"] for zarr_mode parameter instead of bare
  str to prevent silent misbehavior on invalid values (H2)
- Collapse consecutive ** glob segments to avoid exponential
  backtracking in _glob_match_parts (H3)
- Simplify split_zarr_location to use str.split instead of
  PurePosixPath (M1)
- Add explanatory comment for type: ignore in parse_zarr_filter (M2)
- Yield {"status": "done"} when zarr filter matches zero entries (M4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
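
The `**` collapse mentioned in this commit can be illustrated with a small segment-wise matcher. This is a sketch under assumptions, not the PR's _glob_match_parts: the helper names are hypothetical, and per-segment matching is delegated to the standard library's `fnmatchcase`. Collapsing runs of `**` is safe because consecutive `**` segments match exactly the same paths as a single one, and it keeps the recursive search from branching once per redundant segment.

```python
from fnmatch import fnmatchcase


def collapse_double_stars(parts: list[str]) -> list[str]:
    """Replace each run of consecutive '**' segments with a single '**'."""
    out: list[str] = []
    for part in parts:
        if part == "**" and out and out[-1] == "**":
            continue  # redundant: the previous '**' already covers it
        out.append(part)
    return out


def glob_match_parts(pattern_parts: list[str], path_parts: list[str]) -> bool:
    """Match path segments against glob segments, with '**' spanning levels."""
    if not pattern_parts:
        return not path_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # '**' absorbs zero or more leading path segments.
        return any(
            glob_match_parts(rest, path_parts[i:])
            for i in range(len(path_parts) + 1)
        )
    return (
        bool(path_parts)
        and fnmatchcase(path_parts[0], head)
        and glob_match_parts(rest, path_parts[1:])
    )


def glob_match(pattern: str, path: str) -> bool:
    """Match a '/'-separated path against a '/'-separated glob pattern."""
    return glob_match_parts(collapse_double_stars(pattern.split("/")), path.split("/"))
```

In this sketch `glob_match("**/**/**/zarr.json", "a/b/zarr.json")` takes the same code path as `glob_match("**/zarr.json", "a/b/zarr.json")`.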
@yarikoptic yarikoptic changed the title from "Partial zarr download/upload (stage: Design)" to "Partial zarr download/upload (stage: Design + Implementation)" Mar 9, 2026
yarikoptic and others added 3 commits March 11, 2026 11:34
click.Choice returns str at runtime, but upload() expects
Literal["full", "patch"]. Add typing cast at the call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow existing codebase pattern (UploadExisting, DownloadExisting, etc.)
to define ZarrMode as a str Enum in upload.py.  Eliminates the duplicated
Literal["full", "patch"] across three files and the cast() workaround in
cmd_upload.py.

Uses TYPE_CHECKING guard in files/zarr.py to avoid circular import
(files/zarr.py -> upload.py -> .files).

Co-Authored-By: Claude Code 2.1.63 / Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Code 2.1.63 / Claude Opus 4.6 <noreply@anthropic.com>
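
The str Enum approach from the ZarrMode commit above can be sketched as follows. The member names `FULL` and `PATCH`, the `__str__` override, and the helper `should_delete_remote_only` are assumptions for illustration; the actual definition lives in upload.py in this PR. The point is that a CLI string converts to the enum in one validated step, which is what makes a cast() at the call site unnecessary.

```python
from enum import Enum


class ZarrMode(str, Enum):
    """Upload mode for zarr assets (member names are hypothetical)."""

    FULL = "full"
    PATCH = "patch"

    def __str__(self) -> str:
        return self.value


def should_delete_remote_only(mode: ZarrMode) -> bool:
    """Only full mode removes remote files that are absent locally."""
    return mode is ZarrMode.FULL


# Values that a CLI choice option could offer, derived from the enum itself.
ZARR_MODE_CHOICES = [mode.value for mode in ZarrMode]
```

Calling `ZarrMode("patch")` returns `ZarrMode.PATCH`, and an unknown string raises `ValueError` instead of silently falling through to a default.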