
Conversation

@ranger-ross
Member

@ranger-ross ranger-ross commented Oct 26, 2025

This PR adds fine-grained locking for the build cache using build-unit-level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282

Previous attempt: #16089

Design decisions / rationale

  • Still hold build-dir/<profile>/.cargo-lock
    • to protect against cargo clean (exclusive)
    • changed from exclusive to shared for builds
  • Using build-unit-level locking with a single lock per build unit (see the sketch after this list).
    • Before checking fingerprint freshness we take a shared lock. This prevents reading a fingerprint while another build is active.
    • For units that are dirty, when the jobserver queues the job we take an exclusive lock to prevent others from reading while we compile.
    • After the unit's compilation is complete, we downgrade back to a shared lock, allowing other readers.
    • All locks are released at the end of the entire build process.
  • artifact-dir was handled in Do not lock the artifact-dir for check builds + fix uplifting #16307.
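
A minimal sketch of that per-unit lock lifecycle, for illustration only (UnitLock, unit.lock, and the use of the fs2 crate's advisory locks are assumptions here, not cargo's actual API; the real logic lives in the locking module):

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

use fs2::FileExt;

/// Hypothetical per-build-unit lock backed by one advisory file lock.
struct UnitLock {
    file: File,
}

impl UnitLock {
    /// Taken before checking fingerprint freshness: blocks while another
    /// process holds the unit exclusively (i.e. is compiling it).
    fn acquire_shared(unit_dir: &Path) -> io::Result<Self> {
        let file = OpenOptions::new()
            .create(true)
            .read(true)
            .write(true)
            .open(unit_dir.join("unit.lock"))?;
        file.lock_shared()?;
        Ok(UnitLock { file })
    }

    /// Upgraded when the jobserver schedules the unit for compilation,
    /// so no other instance can read the unit while we write it.
    fn upgrade_exclusive(&self) -> io::Result<()> {
        self.file.lock_exclusive()
    }

    /// Downgraded once compilation finishes so other readers can proceed;
    /// the descriptor itself is only dropped at the end of the whole build.
    fn downgrade_shared(&self) -> io::Result<()> {
        self.file.lock_shared()
    }
}
```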

For the rationale for this design, see the Zulip discussion: #t-cargo > Build cache and locking design

Open Questions

  • Do we need rlimit checks and dynamic rlimits? (#16155 (comment))
  • Proper handling of the blocking message (#16155 (comment))
    • Update Dec 18 2025: With the updated implementation, we now get the blocking message when taking the initial shared lock, but no message when taking the exclusive lock right before compiling.
  • Reduce parallelism when blocking
  • How do we want to handle locking on the artifact directory?
    • We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    • One downside of locking/unlocking multiple times per invocation is that artifact-dir is touched many times across the compilation process (for example, there is a pre-rustc clean-up step). We also need to take into account other commands like cargo doc.
    • Another option would be to only take a lock on the artifact-dir for commands that we know will uplift files (e.g. cargo check would not lock the artifact-dir but cargo build would). This would mean that 2 cargo build invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
    • Solved by Do not lock the artifact-dir for check builds + fix uplifting #16307
  • What should our testing strategy for locking be?
    • My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
    • For the max file descriptor testing, I have been using the Zed codebase as a testbed, as it has over 1,500 build units, which is more than the default ulimit on my Linux system. (I am happy to test this on other large codebases that we think would be good to verify against.)
    • It's not immediately obvious to me how to create repeatable unit tests for this, or what those tests should be testing for.
    • For performance testing, I have been using hyperfine to benchmark builds with and without -Zbuild-dir-new-layout. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.

Original Design
  • Using build unit level locking instead of a temporary working directory.
    • After experimenting with multiple approaches, I am currently leaning towards build-unit-level locking.
    • The working directory approach introduces a fair bit of uplifting complexity, and the further along I pushed my prototype, the more I ran into unexpected issues.
      • mtime changes in fingerprints due to uplifting/downlifting order
      • tests/benches need to be run before being uplifted OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted because it is temporary.)
    • The trade-off here is that with build-unit-level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    • The reason I think this is a worthwhile trade-off is that the locking complexity can largely be contained to a single module, whereas the uplifting complexity would be spread throughout the cargo codebase anywhere we do uplifting. The increased lock count, while unavoidable, can be mitigated (see below for more details).
  • Risk of too many locks (file descriptors)
    • On Linux 1024 is a fairly common default soft limit. Windows is even lower at 256.
    • Having 2 locks per build unit makes it possible to hit these limits with a moderate number of dependencies.
    • There are a few mitigations I could think of for this problem (that are included in this PR)
      • Increasing the file descriptor limit based on the number of build units (if the hard limit is high enough)
      • Sharing file descriptors for shared locks across jobs (within a single process) using a virtual lock (see the sketch after this list)
        • This could be implemented using reference counting.
      • Falling back to coarse-grained locking if some heuristic is not met
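
A rough sketch of the reference-counting idea (SharedLockTable here is a hypothetical name standing in for the PR's FileLockInterner/RcFileLock; this is not the actual implementation):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// One open, shared-locked descriptor reused by every job in this process
/// that needs the same lock.
#[derive(Clone)]
struct RcFileLock(Arc<File>);

#[derive(Default)]
struct SharedLockTable {
    locks: Mutex<HashMap<PathBuf, RcFileLock>>,
}

impl SharedLockTable {
    /// `open_and_lock` opens the lock file and takes the shared lock;
    /// it only runs the first time a given path is requested.
    fn shared(
        &self,
        path: &Path,
        open_and_lock: impl FnOnce(&Path) -> io::Result<File>,
    ) -> io::Result<RcFileLock> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(existing) = locks.get(path) {
            // Another job already holds this shared lock; reuse its descriptor.
            return Ok(existing.clone());
        }
        let lock = RcFileLock(Arc::new(open_and_lock(path)?));
        // Kept in the table (and therefore open) for the rest of the process,
        // mirroring the interner behavior described under Implementation details.
        locks.insert(path.to_path_buf(), lock.clone());
        Ok(lock)
    }
}
```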

Implementation details

  • We have a stateful lock per build unit, made up of multiple file locks (primary.lock and secondary.lock); see the locking.rs module docs for more details on the states.
    • This is needed to enable pipelined builds.
  • We fall back to coarse-grained locking if fine-grained locking is determined to be unsafe (see determine_locking_mode()).
  • Fine-grained locking continues to take the existing .cargo-lock as a read-only shared lock, to keep working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  • Locking is disabled on network filesystems. (keeping existing behavior from Don't use flock on NFS mounts #2623)
  • cargo clean continues to use coarse-grained locking for simplicity.
  • File descriptors
    • I added functionality to increase the file descriptor limit if cargo detects that there will not be enough based on the number of build units in the UnitGraph.
    • If we aren't able to raise the limit to a threshold (currently 10× the number of build units), we automatically fall back to coarse-grained locking and display a warning to the user (see the sketch after this list).
      • I picked 10 times the number of build units as a conservative estimate for now. I think lowering this number may be reasonable.
    • While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately x2 the number of build units.
      • Without the RcFileLock I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
  • We use a global FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred, it would just be a bit more plumbing)
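
As a rough illustration of the file descriptor handling (Unix-only, using raw libc calls; LockingMode and choose_locking_mode are hypothetical names, and the real logic lives in determine_locking_mode()):

```rust
enum LockingMode {
    Fine,
    Coarse,
}

/// Raise the soft RLIMIT_NOFILE towards 10x the number of build units if the
/// hard limit allows it; otherwise fall back to coarse-grained locking.
#[cfg(unix)]
fn choose_locking_mode(unit_count: libc::rlim_t) -> LockingMode {
    let needed = unit_count.saturating_mul(10);
    let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: plain FFI calls operating on a local struct.
    unsafe {
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) != 0 {
            return LockingMode::Coarse;
        }
        if rl.rlim_cur < needed && rl.rlim_max >= needed {
            // The soft limit is too low but the hard limit lets us raise it.
            rl.rlim_cur = needed;
            if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
                return LockingMode::Coarse;
            }
        }
    }
    if rl.rlim_cur >= needed {
        LockingMode::Fine
    } else {
        // Hard limit too low: warn and use coarse-grained locking instead.
        LockingMode::Coarse
    }
}
```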

See #16155 (comment) for the proposal to transition away from this to the current scheme.

@rustbot
Collaborator

rustbot commented Oct 26, 2025

r? @ehuss

rustbot has assigned @ehuss.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added A-build-execution Area: anything dealing with executing the compiler A-filesystem Area: issues with filesystems A-layout Area: target output directory layout, naming, and organization Command-clean S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 26, 2025
Comment on lines 133 to 143
// Take the primary lock first, then the secondary lock for this unit.
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

// Hold both file locks in the guard so the unit stays locked.
self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
Contributor

Have we double checked if we run into problems like #15698?

Member Author

I took a closer look and it appears that there is a possibility that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would result in the build failing, so the lock would be released when the process exits.

But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
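
For reference, a sketch of the shape of that fix, reusing the open_file/lock names from the snippet above and assuming an unlock counterpart exists on the same lock type (illustrative only, not the exact code in this PR):

```rust
// If the secondary lock cannot be taken, release the already-held primary
// lock instead of leaving a partial lock behind.
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = match open_file(&self.secondary).and_then(|f| f.lock().map(|_| f)) {
    Ok(lock) => lock,
    Err(e) => {
        // Best-effort release; the original error is what gets reported.
        let _ = primary_lock.unlock();
        return Err(e);
    }
};

self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
```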

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from b234db5 to 1deca86 Compare October 28, 2025 07:17
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 4ace282 to 1a81371 Compare December 19, 2025 11:13
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch 2 times, most recently from 39d1cee to c75a617 Compare December 22, 2025 13:58
Ok(key)
}

pub fn lock(&self, key: &LockKey) -> CargoResult<()> {
Contributor

Would it work to also have a Blocking message for this?

Member Author

I looked into this and I think it should be possible. We don't have direct access to GlobalContext inside JobState.
However, we do have indirect access through DiagDedupe.

I think we can modify the Filesystem functions to accept a trait like ReportBlocking and implement that trait for Shell and DiagDedupe.

Though perhaps we defer that to a follow-up PR?
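
A rough sketch of the trait idea (ReportBlocking is a hypothetical name, and the local Shell/DiagDedupe structs below only stand in for cargo's real types):

```rust
use std::path::Path;

/// Hypothetical abstraction over "something that can print the Blocking
/// message", so filesystem lock helpers don't need a GlobalContext directly.
trait ReportBlocking {
    fn report_blocking(&self, what: &str, path: &Path);
}

// Stand-ins for the real cargo types mentioned above.
struct Shell;
struct DiagDedupe;

impl ReportBlocking for Shell {
    fn report_blocking(&self, what: &str, path: &Path) {
        eprintln!("Blocking: waiting for file lock on {what} ({})", path.display());
    }
}

impl ReportBlocking for DiagDedupe {
    fn report_blocking(&self, what: &str, path: &Path) {
        // In practice this would forward to the shared shell behind the dedupe.
        eprintln!("Blocking: waiting for file lock on {what} ({})", path.display());
    }
}
```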

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from c75a617 to eee8ef7 Compare December 27, 2025 14:49
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from eee8ef7 to 4235aa2 Compare December 30, 2025 09:08
@rustbot
Collaborator

rustbot commented Dec 30, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

Contributor

@epage epage left a comment

@epage epage added this pull request to the merge queue Dec 30, 2025
Merged via the queue into rust-lang:master with commit abaa830 Dec 30, 2025
29 checks passed
@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 30, 2025
@weihanglo
Member

The PR description is up-to-date, right?

@epage
Contributor

epage commented Dec 30, 2025

As far as I'm aware, yes. I did a pass yesterday and moved the historical section further down so it doesn't draw as much attention.

bors added a commit to rust-lang/rust that referenced this pull request Dec 31, 2025
Update cargo submodule

9 commits in 94c368ad2b9db0f0da5bdd8421cea13786ce4412..b54051b1505281ec7a45a250140a0ff25d33f319
2025-12-26 19:39:15 +0000 to 2025-12-30 20:35:52 +0000
- fix(log): add `dependencies` field to `UnitRegistered` (rust-lang/cargo#16448)
- Implement fine grain locking for `build-dir` (rust-lang/cargo#16155)
- feat(resolver): List features when no close match (rust-lang/cargo#16445)
- feat(report): new command `cargo report sessions` (rust-lang/cargo#16428)
- feat (patch): Display where the patch was defined in patch-related error messages (rust-lang/cargo#16407)
- test(build-rs): Reduce from 'build' to 'check' where possible (rust-lang/cargo#16444)
- feat(toml): TOML 1.1 parse support (rust-lang/cargo#16415)
- feat(report): support --manifest-path in `cargo report timings` (rust-lang/cargo#16441)
- fix(vendor): recursively filter git files in subdirectories (rust-lang/cargo#16439)

r? ghost
@rustbot rustbot added this to the 1.94.0 milestone Dec 31, 2025
@ranger-ross ranger-ross deleted the multi-locking-attempt branch January 5, 2026 12:03
rust-bors bot added a commit to rust-lang/rust that referenced this pull request Jan 7, 2026
Update cargo submodule

19 commits in 94c368ad2b9db0f0da5bdd8421cea13786ce4412..623444427f04292f3a3c7eac560cfa65ba0bb88e
2025-12-26 19:39:15 +0000 to 2026-01-06 21:05:05 +0000
- docs(unstable): expand docs for `-Zbuild-analysis` (rust-lang/cargo#16476)
- test: add `-Zunstable-options` with custom targets (rust-lang/cargo#16467)
- feat(report): add cargo report rebuilds  (rust-lang/cargo#16456)
- feat(test-support): Use test name for dir when running tests (rust-lang/cargo#16121)
- refactor: Migrate some cases to expect/reason (rust-lang/cargo#16461)
- docs(build-script): clarify OUT_DIR is not cleaned between builds (rust-lang/cargo#16437)
- chore: Update dependencies (rust-lang/cargo#16460)
- Update handlebars to 6.4.0 (rust-lang/cargo#16457)
- chore(deps): update alpine docker tag to v3.23 (rust-lang/cargo#16454)
- Any build scripts can now use cargo::metadata=KEY=VALUE (rust-lang/cargo#16436)
- fix(log): add `dependencies` field to `UnitRegistered` (rust-lang/cargo#16448)
- Implement fine grain locking for `build-dir` (rust-lang/cargo#16155)
- feat(resolver): List features when no close match (rust-lang/cargo#16445)
- feat(report): new command `cargo report sessions` (rust-lang/cargo#16428)
- feat (patch): Display where the patch was defined in patch-related error messages (rust-lang/cargo#16407)
- test(build-rs): Reduce from 'build' to 'check' where possible (rust-lang/cargo#16444)
- feat(toml): TOML 1.1 parse support (rust-lang/cargo#16415)
- feat(report): support --manifest-path in `cargo report timings` (rust-lang/cargo#16441)
- fix(vendor): recursively filter git files in subdirectories (rust-lang/cargo#16439)

r? ghost