Simplify the DataProvider contract for graph search#1067
Draft
hildebrandmw wants to merge 5 commits into
Draft
Conversation
added 2 commits
May 13, 2026 17:49
Contributor
|
get_element is basically unused in garnet, except for implementing VEMB (return an vector) and other debugging-style functions. This sounds like a very promising direction, and I am very supportive. |
DataProvider contract for graph search
This was referenced May 15, 2026
hildebrandmw
added a commit
that referenced
this pull request
May 16, 2026
Remove `flat_search` from `DiskANNIndex` and the `IdIterator` trait from `diskann`. Since the only caller was from `diskann-disk`, add a new `flat_search` inherent method to `DiskIndexSearcher`. The flat search method is not compatible with the experimental direction in #1067 and with #983 on the horizon, this is safe to move for now.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Executive Summary: What happens if we start simplifying our trait hierarchy and rely on
ExpandBeamto do the heavy lifting?This is controversial and I have no intention (yet) of turning this into a full PR, but it demonstrates the feasibility of something I've come to realize recently. The whole
Accessor->BuildQueryComputer->ExpandBeamoriginated from earlier patterns in the library of building up indexing algorithms from tiny pieces, enabling a user to implement a small set of methods and get the whole indexing algorithm for free.This breaks down, though, when we consider performance: the more information we provide a backend database about the operation we want to perform, the better the database can optimize. This lead to
on_elements_unordered(bulk element retrieval, allowing for example prefetching),distances_unordered(bulk retrieval with a concrete distance computations), and finallyExpandBeam. Of these, the latter has proved by far the most useful, being used bydiskann-diskanddiskann-garnetto specialize the algorithm to their preferred data access pattern.I've been exploring how to make filtered search more of a first-class citizen to allow us to batch together predicate queries and potentially fuse predicate evaluation with data access. I was halfway through writing a parallel Accessor/Computer/ExpandBeam/Strategy hierarchy for predicate aware search (and struggling with expressing the more complicated type relationships required), when I realized that all we really needed were a few slightly different flavors of
ExpandBeam. Instead of building a complex nest of interrelated traits to express these subtly differentExpandBeammethods, why not just require users to implement the necessaryExpandBeam? The required semantics are not that complicated and it gives the implementor full control of the data access.This PR is an experiment in removing all but the
ElementRefGAT of theAccessortrait and instead makingExpandBeamthe most relevant method. There's more simplification that can be done: fusingSearchExtwithExpandBeam, getting rid of the peskyAsNeighborbound fromExpandBeam, simplifyingBuildQueryComputerto maybe not even need thePreprocessedDistanceFunctionbound. However, this is already a large reduction in code and has a couple extra benefits:ExpandBeamimplementations for inmem are actually more efficient than the previous provided implementations because we can more aggressively evaluate the visited predicate (and merge the neighbor buffer in the prefetchid_bufferto save an allocation on eachExpandBeamcall. This is not a significant increase in code because we simply move the oldon_elements_unordered.diskanntesting the provided implementations that are now no longer needed.ExpandBeamwithSearchExt, then all the relevant bits for search are grouped in a single place instead of spread out for easier discovery.diskann, with potentially more simplification if we continue down this path.ExpandBeamis perhaps a little more prone to error. This is a distinct negative, but if everything is collected into a single place, it's easier to audit.Now for some tradeoffs:
flat_searchfunction. I kind of view this as a bonus: the existing flat search relies on an ID iterator (not easy for most backends) and uses an inefficient loop. The call indiskann-diskwould likely be much more efficient if it was just done directly on the PQ dataset.diskann-label-filter. Coarsening the granularity of data access methods means the wrapping done by the label provider is less effective. Again, though, this is probably a feature: since real implementations are moving towardsExpandBeamfor performance, the label provider may not be a good approach anyways. We certainly had this experience in Remove the Caching Provider #1052.Accessor's get-element. This can be grouped into the largerdiskann-providerscleanup.Fillfordiskann::graph::workingset::Mapis no longer applicable. Users have to write their own relatively easyFillmethod. I don't think this is as big of a problem as it sounds because I don't think any serious implementation uses the provided implementation: it's too slow.Why is this relevant now? Aside from the filtered search work, #983 is working towards adding flat search and introduces an even finer granularity of traits in the hierarchy, when my gut tells me we should be moving the other direction towards slightly coarser traits. Also, if we have an opportunity to simplify our codebase, I think we should take it.
Why is this possible now? #859 moved one of the last big
get_elementrelated operations behind a bulk operation. Surprisingly, that removed all calls to the rawget_elementfrom insert or inplace delete.