Fix mel spectrogram preprocessor allocating gigabytes of planned memory #18229
mergennachin merged 1 commit into main
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18229
Note: links to docs will display an error until the docs builds have completed. ⏳ No Failures, 9 Pending as of commit d6c31f6 with merge base 1e17e28. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview

This PR fixes excessive planned-memory allocation during torch.export of the Whisper mel-spectrogram preprocessor by correcting the bound used for the waveform's dynamic length dimension, with a tighter cap for streaming mode to keep STFT intermediate buffers small.

Changes:
- Fix the offline export dynamic max to `max_audio_len * sampling_rate` (seconds → samples) instead of mistakenly multiplying by `n_samples`.
- Add a streaming-specific dynamic max cap at `2 * sampling_rate` to prevent multi-GB memory plans.
The dynamic dimension max was computed as `max_audio_len * n_samples` (samples per 30 s chunk), not `max_audio_len * sampling_rate`. With `max_audio_len=300`, this produced 144M samples (150 minutes) instead of 4.8M (5 minutes), causing a ~3.3 GB planned buffer for STFT intermediates. For streaming mode, the max was even worse: 600 * 480K = 288M samples, producing a 6.6 GB planned buffer, even though streaming processes ~1,640 samples per step.

Fix both paths:
- Offline: use `max_audio_len * sampling_rate` (300 s → 4.8M samples, ~110 MB)
- Streaming: cap at 2 seconds (32K samples, ~0.7 MB)
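The size arithmetic above can be checked directly; this sketch just reproduces the numbers quoted in the fix description (the variable names mirror the PR's description, not necessarily its code):

```python
# Reproduce the bound arithmetic from the fix (values from the PR text).
sampling_rate = 16_000                       # Hz
n_samples = 30 * sampling_rate               # samples per 30 s chunk = 480,000
max_audio_len = 300                          # seconds (offline)

buggy_max = max_audio_len * n_samples        # 144,000,000 samples (150 min)
fixed_max = max_audio_len * sampling_rate    # 4,800,000 samples (5 min)

# Streaming: the old bound was 600 * 480K; the new cap is 2 s of audio.
streaming_buggy = 600 * n_samples            # 288,000,000 samples
streaming_fixed = 2 * sampling_rate          # 32,000 samples

print(buggy_max, fixed_max, streaming_buggy, streaming_fixed)
```

The ~30x offline and ~9000x streaming reductions in the declared bound are what shrink the planned STFT buffers from gigabytes to megabytes.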
Force-pushed from a08eec5 to d6c31f6
```python
if model.streaming:
    # Streaming processes small windows per step. 2 seconds gives
    # comfortable headroom while keeping the memory plan tight.
    max_samples = 2 * model.sampling_rate
```
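Both paths of the bound selection can be sketched as a small helper (a hypothetical function for illustration, not the PR's actual code):

```python
def export_max_samples(streaming: bool,
                       max_audio_len: int,
                       sampling_rate: int = 16_000) -> int:
    """Upper bound for the waveform's dynamic length dimension at export
    time. Illustrative helper; names follow the PR description."""
    if streaming:
        # Streaming steps are ~1,640 samples; 2 s is generous headroom.
        return 2 * sampling_rate
    # Offline: seconds * samples-per-second, NOT samples-per-30s-chunk.
    return max_audio_len * sampling_rate

print(export_max_samples(True, 300))   # 32000
print(export_max_samples(False, 300))  # 4800000
```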
Any performance issues with this? In streaming mode, does each inference take 2 s worth of samples and then start over for the next two seconds?
No performance issue. `max_samples = 2 * sampling_rate` is only the dynamic-shape upper bound at export time: it tells the memory planner the maximum buffer size to allocate. It doesn't affect how inference runs.
At runtime, the streaming preprocessor is called with ~1,640 samples per step (~0.1 s). The exported graph handles any input size from 1 up to the declared max.
The 2-second cap just means that if someone somehow passed more than 32,000 samples in a single call, it would fail. In practice the streaming window is fixed at 1,640 samples.
@pytorchbot cherry-pick --onto release/1.2 -c critical
…ry (#18229)

Peak RSS for voxtral runner: 9,556 MB before, 4,712 MB after.

(cherry picked from commit 776979f)
Cherry picking #18229: the cherry-pick PR is at #18238, and it is recommended to link a critical cherry-pick PR with an issue. Raised by workflow job.