Async DirectIO model loading on Linux #18012

JTischbein · 2025-12-13T22:56:27Z

Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.

While mmap is fast when the same model is loaded repeatedly, uncached reads give consistent load times at roughly the sequential read speed of the disk. On DGX Spark, loading GPT-OSS-120B-MXFP4 with mmap takes ~110s on the first load and ~67s on subsequent loads. With these changes it consistently takes ~10.5s. The speedup depends on the model size, the disk read speed and, for sequential loading, the available RAM.

I propose making uncached reads the default; Windows already has async uncached IO (PR).
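For readers unfamiliar with uncached reads, the following is a minimal sketch of the general technique, not the code from this PR. It assumes a 4096-byte alignment and a 1 MiB chunk size (both illustrative choices), and `read_file_uncached` is a hypothetical helper name. It opens the file with `O_DIRECT`, reads into an aligned buffer, and falls back to ordinary buffered reads if the filesystem rejects direct I/O. It is also synchronous, whereas the PR describes an asynchronous loader.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not taken from the PR): read a whole file, preferring O_DIRECT.
// *used_direct, if given, reports whether the direct path was used until the end.
static bool read_file_uncached(const std::string & path, std::vector<char> & out,
                               bool * used_direct = nullptr) {
    constexpr size_t kAlign = 4096;      // assumed alignment; real devices may need another value
    constexpr size_t kChunk = 1u << 20;  // 1 MiB per read, a multiple of kAlign
    static_assert(kChunk % kAlign == 0, "chunk must be a multiple of the alignment");

    int  fd     = -1;
    bool direct = false;
#if defined(__linux__) && defined(O_DIRECT)
    fd     = open(path.c_str(), O_RDONLY | O_DIRECT);
    direct = fd >= 0;
#endif
    if (fd < 0) {
        fd = open(path.c_str(), O_RDONLY);  // buffered fallback
    }
    if (fd < 0) {
        return false;
    }

    struct stat st;
    void * buf = nullptr;  // O_DIRECT needs an aligned buffer, hence posix_memalign
    if (fstat(fd, &st) != 0 || posix_memalign(&buf, kAlign, kChunk) != 0) {
        close(fd);
        return false;
    }
    const size_t size = static_cast<size_t>(st.st_size);

    out.clear();
    out.reserve(size);
    bool ok = true;
    // Stop at the known file size so no read is issued at an unaligned EOF offset.
    while (out.size() < size) {
        const ssize_t n = read(fd, buf, kChunk);
        if (n > 0) {
            assert(static_cast<size_t>(n) <= kChunk);
            const char * p = static_cast<const char *>(buf);
            out.insert(out.end(), p, p + n);
            continue;
        }
        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n < 0 && errno == EINVAL && direct) {
            // The direct read was rejected: restart from the beginning with buffered I/O.
            close(fd);
            fd     = open(path.c_str(), O_RDONLY);
            direct = false;
            out.clear();
            if (fd >= 0) {
                continue;
            }
        }
        ok = false;  // unexpected EOF or a hard error
        break;
    }

    free(buf);
    if (fd >= 0) {
        close(fd);
    }
    if (used_direct) {
        *used_direct = direct;
    }
    return ok;
}
```

Reading up to the size reported by `fstat` rather than until `read` returns 0 is a deliberate choice in this sketch: after a short final read the file offset is no longer aligned, and a further direct read at that offset may be rejected on some filesystems.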

common/arg.cpp

ggerganov

This results in a huge load speedup on DGX Spark, and at the end of the program it also leaves the memory in the free state instead of buff/cache.

Currently, the implementation is gated behind defined(__linux__). Is this functionality generally supported across all Linux platforms? If I am reading this correctly, it boils down to having O_DIRECT support for open().

Also, do we expect this change to have an effect on non-DGX Spark systems?
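One way to approach the portability question is to probe at runtime instead of relying only on the compile-time gate. The sketch below is an illustration under assumptions and is not part of the PR; `supports_direct_io` is a hypothetical name. It attempts to open the file with `O_DIRECT` and reports the `errno` value on failure. The open(2) man page documents `EINVAL` for filesystems that do not support the flag.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>

// Hypothetical probe (not from the PR): can `path` be opened with O_DIRECT?
// On failure, *err (if given) receives the errno value from open().
static bool supports_direct_io(const char * path, int * err = nullptr) {
    assert(path != nullptr);
#if defined(__linux__) && defined(O_DIRECT)
    const int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        close(fd);
        return true;
    }
    if (err) {
        *err = errno;  // e.g. EINVAL when the filesystem rejects O_DIRECT
    }
    return false;
#else
    // Platforms without O_DIRECT: report "not supported".
    if (err) {
        *err = ENOTSUP;
    }
    return false;
#endif
}
```

A successful open does not guarantee that later reads succeed, since alignment requirements are only enforced at read time, so a loader would still want a buffered fallback.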

lemmi · 2025-12-14T21:19:22Z

On my Strix Halo machine with btrfs, this is strictly worse than both master and mmap. mmap shows the highest throughput while loading the model (~6 GByte/s), master is around 3 GByte/s, and this patch reaches 2 GByte/s.

ehoogeveen-medweb · 2025-12-14T21:27:34Z

IIRC with Strix Halo and ROCm/HIP, loading a model into memory reserved for the GPU using mmap has a major performance issue, hanging basically indefinitely for larger models. Given that reserving memory for the GPU also means having less RAM available to the CPU, it would be great if this DirectIO doesn't have that issue as it would make ROCm/HIP more viable for larger models. Vulkan doesn't have this issue.

Uncached model read

3074b50

JTischbein requested a review from ggerganov as a code owner December 13, 2025 22:56

loci-dev mentioned this pull request Dec 14, 2025

UPSTREAM PR #18012: Async DirectIO model loading on Linux auroralabs-loci/llama.cpp#559

Open

taronaeo reviewed Dec 14, 2025

common/arg.cpp (outdated)

JTischbein added 2 commits December 14, 2025 09:41

Removing additional --mmap arg

26cc75f

Removing trailing whitespaces

ceccfb9

ggerganov reviewed Dec 14, 2025


jeffbolznv mentioned this pull request Dec 15, 2025

vulkan: Implement set_tensor_async and the event interfaces #18047

Open

5 participants
