Add HNSW ACE build method #1597
- Added `cuvsHnswAceParams` structure for ACE configuration.
- Implemented `cuvsHnswBuild` function to facilitate index construction using ACE.
- Updated HNSW index parameters to include ACE settings.
- Created new tests for HNSW index building and searching using ACE.
- Updated documentation to reflect the new ACE parameters and usage.
tfeher
left a comment
Thanks Julian for the PR! I think it is a nice improvement, but I would still recommend simplifying further along the following lines:
- Treat CAGRA parameters as implementation details. We want to keep the usage and the documentation simple for users familiar with HNSW index building.
- Still provide a way to adjust these parameters.
- Clarify that the result is not an exact equivalent of an HNSW graph, but in practice the graph is a good replacement that works well in HNSW search.
tfeher
left a comment
Thanks @julianmi for the updates! There are only two issues remaining:
- Redundancy with the `ef_construction` parameter (see comments below)
- Build - deserialize - search workflow
Considering the second point, currently the resulting index can be either in memory or stored in a file. (Although #1604 removes the in-memory ACE path, we might still decide later to enable other in-memory build methods.)

    auto hnsw_index = hnsw::build(res, hnsw_params, dataset);
    // The index is saved to disk at hnsw_index->file_path(); the hnsw_index structure is just a wrapper around the file name and index params.
    hnsw::search(res, search_params, *hnsw_index, queries, indices, distances);

This would fail with the following error message if the index does not fit in memory.
Searching HNSW index
terminate called after throwing an instance of 'raft::logic_error'
what(): RAFT failure at file=/home/scratch.tfeher_gpu_2/cuvs_1597/cpp/src/neighbors/detail/hnsw.hpp line=1140: Cannot search an HNSW index that is stored on disk. The index must be deserialized into memory first using hnsw::deserialize().
Obtained 7 stack frames
Instead, when the index is on disk, we need to add an explicit deserialize step:

    auto hnsw_index = hnsw::build(res, hnsw_params, dataset);
    hnsw::deserialize(res, hnsw_params, hnsw_index->file_path(), dataset.extent(1), hnsw_params.metric, &hnsw_index_deserialized);
    hnsw::search(res, search_params, *hnsw_index_deserialized, queries, indices, distances);

Do we have to dictate such usage patterns? Could we instead let `hnsw::search` load the index from file if needed?
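To make the suggestion concrete, here is a minimal, self-contained sketch of the "load on first search" pattern. It does not use the cuVS API: `toy_index`, `build_to_disk`, `ensure_loaded`, and `search_lazy` are hypothetical names invented for illustration, the "index" is just a vector of integers, and the file format is made up. It assumes C++17 (for `std::optional`).

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an index handle; NOT the cuVS hnsw::index type.
// The payload is either resident in memory or only available on disk.
struct toy_index {
  std::string file_path;                   // location of the serialized index
  std::optional<std::vector<int>> graph;   // in-memory payload, empty until loaded
};

// Hypothetical disk-based build: writes the payload to `path` and returns a
// handle that only remembers the file name (nothing is kept in memory).
inline toy_index build_to_disk(const std::string& path, const std::vector<int>& payload)
{
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  if (!out) { throw std::runtime_error("cannot open " + path + " for writing"); }
  const std::size_t n = payload.size();
  out.write(reinterpret_cast<const char*>(&n), sizeof(n));
  out.write(reinterpret_cast<const char*>(payload.data()),
            static_cast<std::streamsize>(n * sizeof(int)));
  if (!out) { throw std::runtime_error("failed to write " + path); }
  return toy_index{path, std::nullopt};
}

// Reads the payload from disk unless it is already in memory.
inline void ensure_loaded(toy_index& idx)
{
  if (idx.graph.has_value()) { return; }
  std::ifstream in(idx.file_path, std::ios::binary);
  if (!in) { throw std::runtime_error("cannot open " + idx.file_path); }
  std::size_t n = 0;
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  if (!in) { throw std::runtime_error("truncated index file"); }
  std::vector<int> data(n);
  in.read(reinterpret_cast<char*>(data.data()),
          static_cast<std::streamsize>(n * sizeof(int)));
  if (!in) { throw std::runtime_error("truncated index file"); }
  idx.graph = std::move(data);
}

// Toy "search": returns the position of `query` in the payload, or -1.
// Instead of throwing when the index is on disk, it loads the index first.
inline int search_lazy(toy_index& idx, int query)
{
  ensure_loaded(idx);
  assert(idx.graph.has_value());
  const std::vector<int>& g = *idx.graph;
  for (std::size_t i = 0; i < g.size(); ++i) {
    if (g[i] == query) { return static_cast<int>(i); }
  }
  return -1;
}
```

With this shape, a caller can go straight from `build_to_disk(...)` to `search_lazy(...)` without a separate deserialize call. One trade-off the sketch makes visible: the search function has to take the handle by non-const reference (or hide the load behind a `mutable` member and some synchronization), which a real implementation would need to decide on.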
- Renamed parameter `m` to `M` in HNSW structures and related functions for consistency.
- Removed `ef_construction` from `cuvsHnswAceParams` and related classes, as it is no longer needed.
- Load the HNSW index from file before search if needed.
tfeher
left a comment
Thanks @julianmi for the updates, the PR looks good to me!
divyegala
left a comment
Python approval only
This PR adds a direct `hnsw::build` API that uses the ACE (Augmented Core Extraction) algorithm to build HNSW indexes on the GPU. ACE enables building HNSW indexes for datasets too large to fit in GPU memory by partitioning the data and building sub-indexes.
CC @tfeher
C++ API
- `hnsw::build()` function with ACE parameters for direct HNSW index construction. This serializes an HNSW index to disk if `use_disk` is true.
- `hnsw::graph_build_params::ace_params` struct with configurable options:
  - `npartitions` - number of partitions for parallel build
  - `ef_construction` - index quality parameter
  - `build_dir` - directory for disk-based build artifacts
  - `use_disk` - force disk-based storage mode
- `ann_hnsw_ace.cuh`
C API
- `cuvsHnswBuild` function with ACE parameters
- `ann_hnsw_ace.cu`
Python
- `hnsw.AceParams` class for configuring ACE builds
- `test_hnsw_ace.py`
Java
- `HnswAceParams` class
- `HnswAceBuildAndSearchIT.java`
Documentation
- `cuvs_hnsw` section to the parameter tuning guide with ACE parameters
Example
- `hnsw_ace_example.cu` demonstrating the build → deserialize → search workflow