Enhance heuristic pruning to handle duplicate clusters #282
DIlkhush00 wants to merge 2 commits into intel:main
Conversation
ibhati left a comment
Thanks @DIlkhush00 for working on this issue. The solution looks good overall. I left one comment below, and I was also wondering if we could add a test that specifically triggers this case.
Additionally, could you generate a synthetic dataset with a larger number of vectors (e.g., 1M) and measure the build-time impact? It would also be helpful to benchmark GIST-1M with and without this change to confirm that:
- Datasets without duplicates show no recall or build-time regression.
- In scenarios where duplicates do occur, this logic does not significantly impact build time, and recall remains consistent.
Thanks again for your contribution!
```cpp
            in_result = true;
            break;
        }
    }
```
I'm confused why cid could already be in result here. We just checked above that this candidate's distance differs from anchor_dist, so I wouldn't expect it to be present (as all the kept results have the same distance). Should this be an assert instead (i.e., this candidate must not already be in result)? Am I missing a scenario?
Ah, right... by that logic any remaining candidate shouldn't already be in result. I'll change this to an assert and update both strategies accordingly.
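For illustration, here is a minimal hedged sketch of what the assert-based version could look like. The container and ID types are placeholders, not the PR's actual code:

```cpp
// Hedged sketch, not the PR's actual code: since this candidate's
// distance was just shown to differ from anchor_dist, and every entry
// kept so far shares anchor_dist, membership in `result` is impossible
// and can be asserted rather than searched for.
#include <algorithm>
#include <cassert>
#include <vector>

void check_not_in_result(const std::vector<int>& result, int cid) {
    assert(std::find(result.begin(), result.end(), cid) == result.end() &&
           "candidate at a different distance must not already be in result");
}
```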
Sounds good. I'll add a test that triggers this case and profile the build to ensure my changes don't introduce any significant regressions.
Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>
Force-pushed from a06937f to af5a1f1
Hi @ibhati, as suggested, I tested the build-time impact using a synthetic 1M-vector dataset (128 dimensions). The indexing overhead appears negligible, under 0.5%. Let me know if you'd like me to share the script used to generate the datasets.
This looks great, thanks @DIlkhush00!
Force-pushed from 133b2dc to 765ad9d
Sure, here's the script:

```python
import numpy as np
import pathlib


def save_fvecs(path, data):
    """Save a numpy array to .fvecs format."""
    data = data.astype(np.float32)
    num_vectors, dim = data.shape
    dim_column = np.full((num_vectors, 1), dim, dtype=np.int32)
    with open(path, "wb") as f:
        for i in range(num_vectors):
            dim_column[i].tofile(f)
            data[i].tofile(f)


def main():
    np.random.seed(42)
    base = pathlib.Path("data/bench_1m")
    base.mkdir(parents=True, exist_ok=True)

    num_vectors = 1_000_000
    dim = 128

    print(f"Generating {num_vectors} random vectors...")
    data = np.random.random((num_vectors, dim)).astype(np.float32)
    save_fvecs(base / "db_random.fvecs", data)

    print("Creating duplicate-rich version (injecting clusters)...")
    data_duplicates = data.copy()
    # Inject 100 clusters of 1000 identical vectors each.
    for i in range(100):
        source_vector = data[i]
        start_idx = 10000 + (i * 1000)
        data_duplicates[start_idx : start_idx + 1000] = source_vector
    save_fvecs(base / "db_duplicates.fvecs", data_duplicates)

    print("Generating 1000 queries...")
    queries = np.random.random((1000, dim)).astype(np.float32)
    save_fvecs(base / "queries.fvecs", queries)
    print(f"Done. Files saved to {base}")


if __name__ == "__main__":
    main()
```
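To sanity-check the generated files, a small reader can round-trip them. Below is a hedged C++ sketch of a minimal .fvecs reader (not part of the PR); it assumes the file path produced by the script above:

```cpp
// Minimal .fvecs reader sketch (illustrative, not part of the PR):
// each record is an int32 dimension followed by `dim` float32 values.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data/bench_1m/db_duplicates.fvecs", std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "failed to open file\n");
        return 1;
    }
    std::size_t count = 0;
    std::int32_t dim = 0;
    while (in.read(reinterpret_cast<char*>(&dim), sizeof(dim))) {
        std::vector<float> vec(static_cast<std::size_t>(dim));
        in.read(reinterpret_cast<char*>(vec.data()),
                static_cast<std::streamsize>(dim) * sizeof(float));
        ++count;
    }
    std::printf("read %zu vectors of dimension %d\n", count, dim);
    return 0;
}
```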
This PR fixes #80. It adds a post-pruning step to both:
- IterativePruneStrategy
- ProgressivePruneStrategy

Approach: If a duplicate cluster is detected, the last (worst) slot in the result is replaced with the closest candidate from the pool that does not have the same distance. A sketch of this step appears below.
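To make the approach concrete, here is a hedged, self-contained C++ sketch of that replacement step. The names (Candidate, break_duplicate_cluster) are illustrative, not taken from the PR, and both result and pool are assumed sorted by ascending distance:

```cpp
// Minimal sketch of the described post-pruning step (illustrative).
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t id;
    float dist;
};

void break_duplicate_cluster(std::vector<Candidate>& result,
                             const std::vector<Candidate>& pool) {
    if (result.empty()) {
        return;
    }
    // A duplicate cluster means every kept neighbor sits at the same
    // distance, so the best and worst entries compare equal.
    float anchor_dist = result.front().dist;
    if (result.back().dist != anchor_dist) {
        return; // distances differ: no duplicate cluster to break
    }
    // Replace the worst slot with the closest pool candidate at a
    // different distance, if one exists.
    for (const Candidate& c : pool) {
        if (c.dist != anchor_dist) {
            result.back() = c;
            return;
        }
    }
}
```

Exact float comparison is deliberate here: duplicate vectors produce bit-identical distances, which is precisely the case being detected.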