(practical runbook + repeatable workflow)
- Profile on native Windows (real NTFS stack, real caching behavior, real syscalls).
- Move artifacts by USB (GBs OK).
- Analyze on macOS (M4) with a good UI (flame graphs, timelines), ideally without being “stuck” using Windows-only viewers.
- Optimize ruthlessly: you need repeatability, symbols, and the ability to compare runs.
This plan gives you two complementary pipelines:
- Fastest “works everywhere” sharing format (recommended):
PerfView → export SpeedScope JSON → analyze on macOS in speedscope.app
- Best interactive profiler UI + off-CPU insight (great during iteration):
samply on Windows (ETW) → move profile.json (+ symbols) → samply load on macOS
(samply uses the Firefox Profiler UI and works on Windows/macOS/Linux.)
- PerfView is Windows-native and ETW-backed; it can capture CPU stacks and more.
- PerfView can export to SpeedScope JSON, which is self-contained and cross-platform (open it on macOS in a browser).
- Great for moving profiles to your Mac without worrying about symbol servers or platform-specific analysis tools.
- samply is a cross-platform CLI CPU profiler and uses the Firefox Profiler as its UI.
- On Windows it uses ETW and can record both on-CPU and off-CPU samples (locks / waits show up).
- You can save the profile to disk (profile.json) and later open it via samply load.
- samply’s symbol stack (wholesym / samply-symbols) supports Windows formats (PDB / PE) and symbol servers, across platforms.
Reality check: for “ship a profile to another machine and get perfect symbols”, PerfView→SpeedScope is the least finicky. samply is awesome, but you must treat symbols as a first-class artifact (see below).
samply explicitly recommends compiling release mode with debug info for good stacks & source view.
Add a dedicated Cargo profile (project-local is easiest):
# Cargo.toml
[profile.profiling]
inherits = "release"
debug = true
(Alternative per samply docs: put this in ~/.cargo/config.toml.)
- Make stacks more reliable (especially in hot loops / LTO-heavy builds):
- Consider disabling full LTO for profiling builds (LTO can smear stacks and change inlining).
- Consider codegen-units = 1 to make profiles less “noisy” across builds.
- If you use the MSVC target, ensure the .pdb is produced and copied with the .exe.
You want: (exe + pdb) always travel together.
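A small pre-copy check can help enforce the exe + pdb pairing. The following is a minimal sketch, not part of any tool mentioned here: the helper name `missing_pdbs` is hypothetical, it assumes an MSVC build (the GNU target does not emit `.pdb` files), and it assumes each `.pdb` shares its binary's file stem, which may not hold for every crate name or third-party DLL.

```python
from pathlib import Path


def missing_pdbs(dist_dir):
    """Return names of .exe/.dll files in dist_dir that lack a sibling .pdb.

    Assumption: the .pdb has the same stem as the binary (uffs.exe -> uffs.pdb).
    """
    dist = Path(dist_dir)
    missing = []
    for binary in sorted(dist.iterdir()):
        if binary.suffix.lower() in {".exe", ".dll"}:
            if not binary.with_suffix(".pdb").exists():
                missing.append(binary.name)
    return missing
```

Running it against the folder you are about to copy to USB and refusing to continue when the list is non-empty is one way to keep symbols from being forgotten.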
On your Mac, create a deterministic staging folder per build:
bundles/
  <date>-<gitsha>-<scenario>/
    dist/
      uffs.exe
      uffs.pdb          (or relevant debug info)
      *.dll             (if you ship any)
    inputs/             (the exact dataset used)
    run_args.txt
    build_meta.json     (git sha, rustc version, flags, target)
    profiles/
      perfview.speedscope.json
      samply.profile.json   (optional)
    notes.md            (what you changed, what you’re testing)
Why it matters:
- When you find a win, you can reproduce it.
- When you regress, you can bisect with evidence.
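Creating the staging folder by script keeps the layout identical across builds. The sketch below is one possible way to do it; the function name `make_bundle` and its parameters are assumptions for illustration, and it only creates the empty skeleton (binaries, datasets, profiles and `build_meta.json` are still filled in by you).

```python
from pathlib import Path


def make_bundle(root, date, gitsha, scenario):
    """Create the per-build bundle skeleton under root/bundles and return its path."""
    bundle = Path(root) / "bundles" / f"{date}-{gitsha}-{scenario}"
    # Sub-folders from the layout above.
    for sub in ("dist", "inputs", "profiles"):
        (bundle / sub).mkdir(parents=True, exist_ok=True)
    # Empty placeholders for the free-form text files.
    for name in ("run_args.txt", "notes.md"):
        (bundle / name).touch(exist_ok=True)
    return bundle
```

For example, `make_bundle(".", "2026-01-20", "a1b2c3d", "mft_large")` would produce `bundles/2026-01-20-a1b2c3d-mft_large/` with the three sub-folders in place.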
- PerfView is a Windows performance analysis tool (CPU & memory focused) and produces ETL traces.
- Install samply (either via cargo install or prebuilt scripts).
Good stacks on Windows require symbols.
- PDBs are the symbol files for Windows builds.
- Windows debuggers and profiling tools can use symbol servers and symbol stores; Microsoft documents the concept + SymStore.
- Microsoft also documents how to use the public symbol server and configure symbol paths.
Create:
- C:\symcache (download/cache)
- C:\mysymbols (your own symbol store, optional but recommended)
Then use a symbol path like:
srv*C:\symcache*https://msdl.microsoft.com/download/symbols
If you maintain your own symbol store too:
srv*C:\symcache*C:\mysymbols*https://msdl.microsoft.com/download/symbols
This massively improves call stacks inside Windows libraries and lets you correlate time spent in kernel / filesystem / memory manager.
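The symbol path is just `srv`, the cache folder, any local stores and the upstream URL joined with `*`. A minimal sketch of building it programmatically follows; the helper name `build_symbol_path` is hypothetical, and the folder names are the ones suggested above.

```python
def build_symbol_path(cache_dir, upstream_url, local_store=None):
    """Join the pieces of a srv*cache*[store*]url symbol path."""
    parts = ["srv", cache_dir]
    if local_store:
        # Optional private symbol store, searched before the upstream server.
        parts.append(local_store)
    parts.append(upstream_url)
    return "*".join(parts)
```

The resulting string can then be pasted into a tool's symbol-path setting or exported through the `_NT_SYMBOL_PATH` environment variable that the Windows debugging tools read.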
- Launch PerfView.
- Use its collection workflow to record your run (CPU sampling / thread time).
- Stop collection after the operation of interest (e.g., “scan disk + parse MFT + build results”).
(PerfView is a well-known ETW-based collector and appears in Microsoft’s own guidance.)
PerfView has a SpeedScope export feature; it generates a JSON file you can load in speedscope.app.
In PerfView:
- Open the trace
- Go to CPU stack view / stack viewer
- Export using SpeedScope export (often via “Save View As” → SpeedScope)
Microsoft also calls out SpeedScope as a cross-platform analysis target.
- Copy *.speedscope.json to your Mac (USB is perfect).
- Open it in a browser using SpeedScope.
- Use:
- flame graph
- time order view
- left heavy and sandwich views (open the before and after profiles in separate tabs to compare changes)
- Single file artifact that’s easy to attach to PRs or drop into a “perf-results” folder.
- No dependency on Windows-only viewers for day-to-day iteration.
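For before/after comparison outside the UI, two exported files can also be diffed numerically. The sketch below sums self time per frame name and reports the largest changes. It assumes the speedscope file format's `shared.frames` table and its `sampled` (stacks listed root first, plus weights) and `evented` (`O`/`C` open/close events) profile types; which type a given PerfView export uses is not verified here, and the helper names are hypothetical. The numbers are only comparable if both files use the same unit.

```python
import json


def load(path):
    """Read a speedscope JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def self_time_by_frame(doc):
    """Sum self time per frame name across all profiles in a speedscope document."""
    names = [frame["name"] for frame in doc["shared"]["frames"]]
    totals = {}
    for profile in doc["profiles"]:
        if profile["type"] == "sampled":
            # Each sample is a stack of frame indices, root first, leaf last.
            for stack, weight in zip(profile["samples"], profile["weights"]):
                if stack:
                    leaf = names[stack[-1]]
                    totals[leaf] = totals.get(leaf, 0) + weight
        elif profile["type"] == "evented":
            # Time between two events belongs to the frame on top of the stack.
            stack, last_at = [], profile["startValue"]
            for event in profile["events"]:
                if stack:
                    top = names[stack[-1]]
                    totals[top] = totals.get(top, 0) + (event["at"] - last_at)
                last_at = event["at"]
                if event["type"] == "O":
                    stack.append(event["frame"])
                elif stack:
                    stack.pop()
    return totals


def compare_runs(before, after, top=10):
    """Return (frame name, self-time delta) pairs, largest absolute change first."""
    a, b = self_time_by_frame(before), self_time_by_frame(after)
    deltas = {name: b.get(name, 0) - a.get(name, 0) for name in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```

A typical call would be `compare_runs(load("before.speedscope.json"), load("after.speedscope.json"))`, where both file names are placeholders for your own exports.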
Baseline command:
# From the folder that contains uffs.exe (+ uffs.pdb)
samply record --save-only -o profiles\samply.profile.json -- .\uffs.exe <args>
- samply records and uses the Firefox Profiler UI.
- --save-only is used in real samply workflows to write the JSON profile to disk.
samply explicitly supports using the Microsoft Symbol Server on Windows.
Example:
samply record --save-only -o profiles\samply.profile.json `
  --windows-symbol-server https://msdl.microsoft.com/download/symbols `
  -- .\uffs.exe <args>
(If you’re profiling the whole system / many processes, samply also supports -a.)
Copy both:
- profiles/samply.profile.json
- dist/ containing your exe + pdb (or equivalent debug files)
On macOS:
# install once, if needed
cargo install --locked samply
# in the bundle directory
samply load profiles/samply.profile.json
Using samply load is the documented way to open a saved profile with working symbolication.
samply’s symbol stack supports:
- Windows symbols (PDB/PE)
- symbol servers
- local symbol directories
…across platforms.
If stacks show up without function names on macOS, it is usually because the symbol resolver can’t find your PDB/binary by its identifiers.
Most robust fix:
- Create a symbol store on Windows (using SymStore).
- Put it on the USB drive (e.g., USB:\symbols\...).
- Point your tooling at that symbol store + Microsoft symbol server.
If you don’t want to fight this today:
- fall back to Pipeline A (PerfView → SpeedScope JSON), which is specifically designed for portable viewing.
For low-level NTFS scanning, a pure CPU profile can be misleading:
- page cache effects
- synchronous reads
- readahead behavior
- file metadata calls
- kernel time
Windows Performance Analyzer can open ETL traces produced by WPR / Xperf.
Workflow:
- Record an ETL trace that includes CPU sampling + disk/file I/O providers (WPR scenario / custom profile).
- Analyze disk I/O, file I/O, CPU usage, context switches, etc.
- Export tables/charts from WPA as CSV for archiving and later analysis on macOS.
You won’t get WPA itself on macOS, but you can export data products (CSV) and correlate them with your CPU profile findings.
Define 3–5 fixed scenarios (and keep them forever), e.g.
- mft_small (tiny image)
- mft_medium (realistic)
- mft_large (worst-case)
- cold_cache vs warm_cache (explicitly note which)
Store the exact dataset hash in build_meta.json.
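One way to record the dataset hash is to compute a SHA-256 over the input file and write it into `build_meta.json` alongside the build details. The sketch below is illustrative only: the helper names and the JSON field names (`git_sha`, `rustc_version`, `target`, `flags`, `dataset`) are assumptions, not a fixed schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_build_meta(bundle_dir, dataset_path, git_sha, rustc_version, target, flags):
    """Write build_meta.json into the bundle and return the recorded metadata."""
    meta = {
        "git_sha": git_sha,
        "rustc_version": rustc_version,
        "target": target,
        "flags": flags,
        # Hypothetical field layout: file name plus content hash of the dataset.
        "dataset": {
            "file": Path(dataset_path).name,
            "sha256": sha256_file(dataset_path),
        },
    }
    out = Path(bundle_dir) / "build_meta.json"
    out.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

Comparing the stored hash before each run is a cheap guard against profiling a silently changed dataset.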
- Pick one:
- Warm cache (run once, discard, then profile)
- Cold cache (reboot or flush file cache; harder)
- Keep sampling rate stable (don’t change it between A/B runs).
- Use the same binary flags (profiling build).
- Use stable CPU power settings (“High Performance” plan) if possible.
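The warm-cache discipline (run once, discard, then measure) can be scripted so every A/B comparison follows the same steps. The following is a rough sketch using wall-clock timing only; the function names are hypothetical, and the command you pass in (for example the profiling build of `uffs.exe` with fixed arguments) is up to you. It does not replace a profiler, it only gives a repeatable baseline number to put next to each profile.

```python
import statistics
import subprocess
import time


def timed_runs(cmd, runs=5, warmup=1):
    """Run cmd warmup+runs times and return wall-clock seconds for the measured runs."""
    durations = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            # Warm-up runs only populate caches; their timings are discarded.
            durations.append(elapsed)
    return durations


def summarize(durations):
    """Condense measured runs into median, min and max."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }
```

A wide gap between `min` and `max` is a hint that cache state or power settings were not stable during the run.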
Example:
2026-01-20_a1b2c3d_mft_large_readonly.perfview.speedscope.json
2026-01-20_a1b2c3d_mft_large_readonly.samply.profile.json
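Generating and parsing these names in code keeps them consistent and lets scripts group runs by commit or scenario. The sketch below follows the pattern of the two example names above (date, git sha, scenario, tool suffix); the helper names and the `SUFFIXES` table are assumptions for illustration.

```python
import re

# Tool key -> file suffix, taken from the example names above.
SUFFIXES = {
    "perfview": "perfview.speedscope.json",
    "samply": "samply.profile.json",
}

PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<gitsha>[0-9a-f]{7,40})_(?P<scenario>.+?)"
    r"\.(?P<suffix>perfview\.speedscope\.json|samply\.profile\.json)$"
)


def profile_filename(date, gitsha, scenario, tool):
    """Build a profile file name such as <date>_<gitsha>_<scenario>.<suffix>."""
    return f"{date}_{gitsha}_{scenario}.{SUFFIXES[tool]}"


def parse_profile_filename(name):
    """Split a profile file name back into its parts, or return None if it does not match."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None
```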
If you want the highest ROI with minimal tool pain:
- Make PerfView→SpeedScope your canonical, shareable artifact. That becomes your “performance PR evidence” on macOS.
- Use samply when you’re actively iterating, especially if you care about:
- off-CPU waiting / lock contention
- fast visual scanning (Firefox Profiler UI)
- Treat symbols as build artifacts, not optional files. Always ship exe + pdb together; optionally maintain a symbol store.
- Add a profiling Cargo profile and keep it consistent across the team.
cargo build --profile profiling --target x86_64-pc-windows-gnu
# or: x86_64-pc-windows-msvc (if you have that toolchain working; only the MSVC target emits a .pdb)
samply record --save-only -o profile.json -- .\uffs.exe <args>
--save-only usage is referenced in samply issues and practice.
samply record -a --windows-symbol-server https://msdl.microsoft.com/download/symbols
Windows symbol server usage is shown in samply docs.
samply load profile.json
Loading saved profiles via samply load is the recommended way to get symbolication.
(URLs in code blocks so they’re easy to copy/paste.)
- samply (cross-platform, uses Firefox Profiler, Windows ETW, symbol servers)
https://github.com/mstange/samply
- “Need samply load to view saved profile with symbols” (discussion / rough edges)
https://github.com/mstange/samply/issues/83
- PerfView (Windows ETW tool)
https://github.com/microsoft/perfview
- PerfView SpeedScope export overview
https://deepwiki.com/microsoft/perfview/7.1-speedscope-export
- Microsoft: Symbol servers / symbol stores (SymStore)
https://learn.microsoft.com/en-us/windows/win32/debug/symbol-servers-and-symbol-stores
- Microsoft: public symbol server + symbol path configuration
https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols
https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/symbol-path
- Microsoft: WPA can open ETL traces produced by WPR/Xperf
https://learn.microsoft.com/en-us/windows-hardware/test/wpt/opening-and-analyzing-etl-files-in-wpa
- samply / wholesym cross-platform symbolication support (PDB/PE etc.)
https://docs.rs/wholesym