Enhance heuristic pruning to handle duplicate clusters #282
DIlkhush00 wants to merge 2 commits into intel:main
Conversation
ibhati left a comment
Thanks @DIlkhush00 for working on this issue. The solution looks good overall. I left one comment below, and I was also wondering if we could add a test that specifically triggers this case.
Additionally, could you generate a synthetic dataset with a larger number of vectors (e.g., 1M) and measure the build-time impact? It would also be helpful to benchmark GIST-1M with and without this change to confirm that:
- Datasets without duplicates show no recall or build-time regression.
- In scenarios where duplicates do occur, this logic does not significantly impact build time, and recall remains consistent.
Thanks again for your contribution!
```cpp
            in_result = true;
            break;
        }
    }
```
I'm confused why cid could already be in result here. We just checked above that this candidate's distance differs from anchor_dist, so I wouldn't expect it to be present (as all the kept results have the same distance). Should this be an assert instead (i.e., this candidate must not already be in result)? Am I missing a scenario?
Ah, right... by that logic any remaining candidate shouldn't already be in result. I'll change this to an assert and update both strategies accordingly.
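For illustration, here is a minimal hedged sketch of what the assert-based version could look like. The container and ID types are placeholders, not the PR's actual code:

```cpp
// Hedged sketch, not the PR's actual code: since this candidate's
// distance was just shown to differ from anchor_dist, and every entry
// kept so far shares anchor_dist, membership in `result` is impossible
// and can be asserted rather than searched for.
#include <algorithm>
#include <cassert>
#include <vector>

void check_not_in_result(const std::vector<int>& result, int cid) {
    assert(std::find(result.begin(), result.end(), cid) == result.end() &&
           "candidate at a different distance must not already be in result");
}
```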
Sounds good. I'll add a test that triggers this case and profile the build to ensure my changes don't introduce any significant regressions.
Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>
Force-pushed from a06937f to af5a1f1
Hi @ibhati, as suggested, I tested the build-time impact using a synthetic 1M-vector dataset (128 dimensions). The indexing overhead appears negligible, under 0.5%. Let me know if you'd like me to share the script used to generate the datasets.
This looks great, thanks @DIlkhush00!
Force-pushed from 133b2dc to 765ad9d
Sure, here's the script:

```python
import numpy as np
import pathlib


def save_fvecs(path, data):
    """Save a numpy array to .fvecs format."""
    data = data.astype(np.float32)
    num_vectors, dim = data.shape
    dim_column = np.full((num_vectors, 1), dim, dtype=np.int32)
    with open(path, "wb") as f:
        for i in range(num_vectors):
            dim_column[i].tofile(f)
            data[i].tofile(f)


def main():
    np.random.seed(42)
    base = pathlib.Path("data/bench_1m")
    base.mkdir(parents=True, exist_ok=True)

    num_vectors = 1_000_000
    dim = 128

    print(f"Generating {num_vectors} random vectors...")
    data = np.random.random((num_vectors, dim)).astype(np.float32)
    save_fvecs(base / "db_random.fvecs", data)

    print("Creating duplicate-rich version (injecting clusters)...")
    data_duplicates = data.copy()
    # Inject 100 clusters of 1000 identical vectors each.
    for i in range(100):
        source_vector = data[i]
        start_idx = 10000 + (i * 1000)
        data_duplicates[start_idx : start_idx + 1000] = source_vector
    save_fvecs(base / "db_duplicates.fvecs", data_duplicates)

    print("Generating 1000 queries...")
    queries = np.random.random((1000, dim)).astype(np.float32)
    save_fvecs(base / "queries.fvecs", queries)
    print(f"Done. Files saved to {base}")


if __name__ == "__main__":
    main()
```
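To sanity-check the generated files, a small reader can round-trip them. Below is a hedged C++ sketch of a minimal .fvecs reader (not part of the PR); it assumes the file path produced by the script above:

```cpp
// Minimal .fvecs reader sketch (illustrative, not part of the PR):
// each record is an int32 dimension followed by `dim` float32 values.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data/bench_1m/db_duplicates.fvecs", std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "failed to open file\n");
        return 1;
    }
    std::size_t count = 0;
    std::int32_t dim = 0;
    while (in.read(reinterpret_cast<char*>(&dim), sizeof(dim))) {
        std::vector<float> vec(static_cast<std::size_t>(dim));
        in.read(reinterpret_cast<char*>(vec.data()),
                static_cast<std::streamsize>(dim) * sizeof(float));
        ++count;
    }
    std::printf("read %zu vectors of dimension %d\n", count, dim);
    return 0;
}
```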
This PR fixes #80. It adds a post-pruning step to both:
- IterativePruneStrategy
- ProgressivePruneStrategy

Approach: If a duplicate cluster is detected, the last (worst) slot in the result is replaced with the closest candidate from the pool that does not have the same distance. A sketch of this step appears below.
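To make the approach concrete, here is a hedged, self-contained C++ sketch of that replacement step. The names (Candidate, break_duplicate_cluster) are illustrative, not taken from the PR, and both result and pool are assumed sorted by ascending distance:

```cpp
// Minimal sketch of the described post-pruning step (illustrative).
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t id;
    float dist;
};

void break_duplicate_cluster(std::vector<Candidate>& result,
                             const std::vector<Candidate>& pool) {
    if (result.empty()) {
        return;
    }
    // A duplicate cluster means every kept neighbor sits at the same
    // distance, so the best and worst entries compare equal.
    float anchor_dist = result.front().dist;
    if (result.back().dist != anchor_dist) {
        return; // distances differ: no duplicate cluster to break
    }
    // Replace the worst slot with the closest pool candidate at a
    // different distance, if one exists.
    for (const Candidate& c : pool) {
        if (c.dist != anchor_dist) {
            result.back() = c;
            return;
        }
    }
}
```

Exact float comparison is deliberate here: duplicate vectors produce bit-identical distances, which is precisely the case being detected.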