Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)
Scaling
How It Works

Replaced scalar conditional mask creation with a single-instruction SIMD expansion:

// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);
// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64(); // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.
Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:
What Changed

Instead of emitting Call opcodes to mask helper methods, the mask creation is now emitted directly inline in the IL stream. This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in the hot path
- JIT optimization barriers at call boundaries
The kernel now processes 2,083 million elements per second - significantly faster than NumPy's ~540 million elements per second.
…n, x, y)

Add IL-generated kernels for np.where using runtime code generation:
- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:
- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:
- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)

Falls back to iterator path for:
- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:
- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
Replace scalar conditional mask creation with SIMD intrinsics.

V256 mask creation (for AVX2):
- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):
- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store, and ConditionalSelect methods that were failing because these are generic method definitions requiring special handling.

Performance (1M double elements):
- Kernel: 2.6ms @ 381 M elements/s
- NumPy baseline: ~1.86ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.
Instead of emitting Call opcodes to mask helper methods, now emit the AVX2/SSE4.1 instructions directly inline in the IL stream. This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:
- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):
- Previous (method call): 2.6 ms
- Inlined IL: 0.48 ms (5.4x faster)
- NumPy baseline: 1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt*, which are extension methods on the Vector128/Vector256 static classes, not instance methods.
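The inlining described above can be sketched with ILGenerator. This is a minimal illustration (not NumSharp's actual code) of emitting the 8-byte mask sequence into a DynamicMethod; it assumes AVX2 is available at runtime, and the class/delegate names are hypothetical:

```csharp
using System;
using System.Linq;
using System.Reflection.Emit;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class InlineMaskSketch
{
    public unsafe delegate Vector256<ulong> MaskFn(byte* cond);

    public static MaskFn Build()
    {
        var dm = new DynamicMethod("IL_Mask8Byte", typeof(Vector256<ulong>),
                                   new[] { typeof(byte*) }, typeof(InlineMaskSketch).Module);
        var il = dm.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldind_U4);                                    // 4 condition bytes -> uint
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod(
            "CreateScalar", new[] { typeof(uint) })!);                // Vector128<uint>
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod("AsByte")!
            .MakeGenericMethod(typeof(uint)));                        // Vector128<byte>
        il.Emit(OpCodes.Call, typeof(Avx2).GetMethod(
            "ConvertToVector256Int64",
            new[] { typeof(Vector128<byte>) })!);                     // vpmovzxbq -> Vector256<long>
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethod("AsUInt64")!
            .MakeGenericMethod(typeof(long)));                        // Vector256<ulong>
        il.Emit(OpCodes.Call, typeof(Vector256<ulong>)
            .GetProperty("Zero")!.GetMethod!);                        // zero vector
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethods()
            .First(m => m.Name == "GreaterThan" && m.IsGenericMethodDefinition)
            .MakeGenericMethod(typeof(ulong)));                       // all-ones lanes where cond != 0
        il.Emit(OpCodes.Ret);

        return (MaskFn)dm.CreateDelegate(typeof(MaskFn));
    }
}
```

Because the intrinsic calls are emitted as plain Call opcodes into the generated body, the JIT recognizes them and lowers them to the single hardware instructions, with no helper-method call boundary in between.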
Implements NumPy 2.x NEP50 "weak scalar" semantics for np.where, detecting scalar NDArrays via Shape.IsScalar for clean type promotion without requiring per-type overloads.

TYPE PROMOTION RULES:
1. Same-type scalars: preserve type
   - int + int → int32 (both same type, preserve)
   - byte + byte → byte
   - float + float → float32
2. Mixed-type scalars: use array-array promotion
   - int + long → int64
   - int + double → float64
   - byte + short → int16
3. NEP50 weak scalar: scalar + array → array dtype wins
   - int scalar + uint8 array → uint8
   - int scalar + float32 array → float32
4. Cross-kind promotion uses standard rules
   - float scalar + int32 array → float64

IMPLEMENTATION:
- Simplified to 4 overloads (NDArray, object+NDArray, NDArray+object, object+object)
- Detect scalar NDArrays via Shape.IsScalar (works for both implicit conversion and explicit np.array() calls)
- Input arrays converted to output dtype before kernel/iterator dispatch

NOTE: Unlike NumPy, where Python int literals widen to int64, C# int literals create int32 scalar NDArrays indistinguishable from explicit np.array(1, dtype=int32). We preserve same-type scalars rather than widening, which is consistent with C#'s typed literal semantics.
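The four rules above can be sketched as a small decision function. This is illustrative only: SameKind and FindCommonType are hypothetical placeholders for NumSharp's internal kind-check and array-array promotion helpers, and the real code operates on NPTypeCode rather than System.Type:

```csharp
using System;

static class PromotionSketch
{
    public static Type PromoteWhere(Type x, bool xIsScalar, Type y, bool yIsScalar,
                                    Func<Type, Type, bool> SameKind,
                                    Func<Type, Type, Type> FindCommonType)
    {
        if (x == y)
            return x;                                  // rule 1: same type, preserve

        if (xIsScalar ^ yIsScalar)                     // exactly one weak scalar
        {
            var (scalar, array) = xIsScalar ? (x, y) : (y, x);
            if (SameKind(scalar, array))
                return array;                          // rule 3: array dtype wins
            // rule 4: cross-kind falls through to standard promotion
        }

        return FindCommonType(x, y);                   // rules 2 and 4
    }
}
```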
Extended np.asanyarray to handle all common C# collection types.

Collections supported via IEnumerable<T> pattern matching:
- List<T>, IList<T>, ICollection<T>, IEnumerable<T>
- IReadOnlyList<T>, IReadOnlyCollection<T>
- ReadOnlyCollection<T>
- LinkedList<T>
- HashSet<T>, SortedSet<T>
- Queue<T>, Stack<T>
- ArraySegment<T> (implements IEnumerable<T>)
- ImmutableArray<T>, ImmutableList<T>, ImmutableHashSet<T>
- Any LINQ query result (IEnumerable<T>)

Special handling for types not implementing IEnumerable<T>:
- Memory<T> - uses direct cast and ToArray()
- ReadOnlyMemory<T> - uses direct cast and ToArray()

Implementation approach:
- Clean pattern matching on IEnumerable<T> for all 12 NumSharp types
- No method reflection (direct LINQ .ToArray() calls)
- Memory/ReadOnlyMemory handled via type switch with direct casts

Supported element types (NumSharp's 12 types): bool, byte, short, ushort, int, uint, long, ulong, char, float, double, decimal

Note: sbyte, IntPtr, UIntPtr are NOT supported (not in NPTypeCode)
Added fallback support for collections that don't implement generic IEnumerable<T> but still implement the non-generic interfaces.

Non-generic IEnumerable fallback:
- ArrayList, Hashtable.Keys/Values, BitArray, etc.
- Any legacy collection implementing only IEnumerable
- Element type detected from first non-null item

Non-generic IEnumerator fallback:
- Direct enumerator objects (e.g., from yield return methods)
- Element type detected from first non-null item

Implementation:
- Enumerate items into List<object>
- Detect element type from first item
- Convert to typed array via type switch (no reflection)
- Returns null for unsupported element types (falls through to error)

This completes the collection support hierarchy:
1. IEnumerable<T> - direct pattern matching (most efficient)
2. Memory<T>/ReadOnlyMemory<T> - special handling (no IEnumerable<T>)
3. IEnumerable (non-generic) - fallback with type detection
4. IEnumerator (non-generic) - fallback with type detection
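Levels 3-4 of the hierarchy above can be sketched roughly as follows. The helper name is illustrative and only two of the 12 dtypes are shown; the real type switch covers them all:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class NonGenericFallbackSketch
{
    public static Array? ConvertNonGeneric(IEnumerable source)
    {
        var items = new List<object?>();
        foreach (var item in source)
            items.Add(item);

        if (items.Count == 0)
            return Array.Empty<double>();          // NumPy float64 default for empty

        Type? elementType = items.Find(i => i != null)?.GetType();
        return elementType switch
        {
            var t when t == typeof(int)    => items.ConvertAll(Convert.ToInt32).ToArray(),
            var t when t == typeof(double) => items.ConvertAll(Convert.ToDouble).ToArray(),
            _ => null                              // unsupported: caller raises the error
        };
    }
}
```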
- ConvertMemory: single type switch with ternary for ReadOnly vs mutable
- ConvertNonGenericEnumerable: delegate to ConvertEnumerator via GetEnumerator()
…andling

NumPy parity fixes based on battletest comparison:

1. Tuple/ValueTuple support (NEW):
   - Both Tuple<> and ValueTuple<> now iterate their elements
   - Uses ITuple interface (available in .NET Core 2.0+)
   - NumPy: np.asanyarray((1,2,3)) -> dtype=int64, shape=(3,)
   - NumSharp now matches this behavior

2. Empty non-generic collections (FIX):
   - Empty ArrayList/IEnumerable now returns empty double[]
   - Matches NumPy's default of float64 for empty collections
   - Previously threw NotSupportedException

Tests added:
- ValueTuple_IsIterable, Tuple_IsIterable
- ValueTuple_MixedTypes_UsesFirstElementType
- EmptyTuple_ReturnsEmptyDoubleArray
- EmptyArrayList_ReturnsEmptyDoubleArray
- Misaligned tests documenting intentional NumPy differences
…ions

Adds proper type promotion when collections contain mixed numeric types:
- int + double -> double (matches NumPy float64 promotion)
- int + bool -> int
- float + any int -> double
- decimal wins if present

Implementation:
- Added FindCommonNumericType() to detect the widest compatible type
- Changed ConvertObjectListToNDArray to use Convert.To* methods instead of direct casts, enabling cross-type conversion
- Updated ConvertTuple and ConvertEnumerator to use type promotion

Tests added:
- ValueTuple_MixedTypes_PromotesToCommonType: (1, 2.5, 3) -> double
- ValueTuple_IntAndBool_PromotesToInt: (1, true, 3) -> int
Use pattern matching `is T v ? v : Convert.ToT(item)` instead of always calling Convert.ToT(). This gives direct unbox speed for homogeneous collections (the common case) while still handling mixed types correctly.

Benchmark results (100K iterations, size 1000):
- Convert.ToInt32 always: 4088 ns/op
- is int ? v : Convert: 1038 ns/op (3.9x faster)

This optimization affects:
- ArrayList and other non-generic IEnumerable
- Tuple/ValueTuple via ITuple interface
- Any path through ConvertObjectListToNDArray

No behavior change - mixed type collections still work via the Convert fallback.
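The fast path, shown here for the int case: a type test plus direct unbox when the box already holds an int, falling back to Convert for mixed-type elements (the helper name is illustrative):

```csharp
using System;

static class UnboxFastSketch
{
    public static int ToInt32Fast(object item)
        => item is int v ? v : Convert.ToInt32(item);
}
```

ToInt32Fast(42) unboxes directly; ToInt32Fast(true) takes the Convert path and yields 1, matching the bool-promotes-to-int behavior above.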
…mization

For IEnumerable<T>, use optimized extraction:
1. List<T>: CollectionsMarshal.AsSpan() + CopyTo (direct memory access)
2. ICollection<T>: CopyTo() (avoids enumerator overhead)
3. Other: fallback to LINQ ToArray()

Benchmark results (size 10000, List<int>):
- Old (ToArray): 14129 ns/op
- New (ToArrayFast): 11665 ns/op
- Speedup: 1.21x (21% faster)

The CollectionsMarshal.AsSpan approach gives direct access to List<T>'s internal array, avoiding the allocation and copy overhead of ToArray(). For ICollection<T>, CopyTo() is used, which avoids enumerator overhead.
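The tiered extraction can be sketched as below; ToArrayFast mirrors the name used in the commit, but the exact NumSharp signature may differ:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

static class ExtractionSketch
{
    public static T[] ToArrayFast<T>(IEnumerable<T> source)
    {
        switch (source)
        {
            case List<T> list:
                var fromList = new T[list.Count];
                CollectionsMarshal.AsSpan(list).CopyTo(fromList); // direct view of the backing array
                return fromList;
            case ICollection<T> coll:
                var fromColl = new T[coll.Count];
                coll.CopyTo(fromColl, 0);                         // bulk copy, no enumerator
                return fromColl;
            default:
                return source.ToArray();                          // LINQ fallback
        }
    }
}
```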
Replace `new T[n]` with `GC.AllocateUninitializedArray<T>(n)` in all array allocations within np.asanyarray. Since we immediately overwrite all elements, the default zeroing is wasted work.

Affected paths:
- ToArrayFast<T>: List<T> and ICollection<T> extraction
- ConvertObjectListToNDArray: all 12 dtype allocations

Benchmark (GC.AllocateUninitializedArray vs new T[]):
- Size 1000: 38 ns vs 156 ns (4x faster allocation)

This optimization compounds with the previous CollectionsMarshal and pattern-match optimizations for a significant cumulative improvement.
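A minimal sketch of the pattern (not the actual NumSharp code): GC.AllocateUninitializedArray skips zero-initialization, which is safe only because every element is written before any read.

```csharp
using System;

static class UninitAllocSketch
{
    public static int[] FillSequence(int n)
    {
        int[] buffer = GC.AllocateUninitializedArray<int>(n);
        for (int i = 0; i < n; i++)
            buffer[i] = i;      // must fully overwrite the uninitialized memory
        return buffer;
    }
}
```

If any element could be read before being written, `new int[n]` is the correct choice; stale heap contents would otherwise leak into results.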
Optimizations applied:

1. FindCommonNumericType:
   - Early exit when decimal found (highest priority)
   - Early exit when float/double found (promotes to double)
   - Use CollectionsMarshal.AsSpan for bounds-check-free iteration
   - Stackalloc for type code deduplication
   - Reuse existing _FindCommonType_Scalar for consistent promotion

2. ConvertObjectListToNDArray:
   - Use CollectionsMarshal.AsSpan(items) for ~10-15% speedup
   - Eliminates bounds checking in tight conversion loops

3. ConvertEnumerator:
   - Pre-size List<object> when the ICollection count is known
   - Eliminates resize allocations for known-size collections

4. ConvertTuple:
   - Pre-size List<object> with tuple.Length

Net: -27 lines while adding performance improvements
…to MSTest v3

Post-rebase cleanup after master migrated the test suite from TUnit to MSTest v3 (commits ac02033, e0db3c3). The 4 test files introduced on this branch still used TUnit's [Test] attribute and `using TUnit.Core;`, which broke the build.

Changes per file:
- Remove `using TUnit.Core;`
- Add the `[TestClass]` attribute to the test class
- Replace all `[Test]` attributes with `[TestMethod]`

Files migrated:
- test/NumSharp.UnitTest/Logic/np.where.Test.cs
- test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs
- test/NumSharp.UnitTest/Creation/np.asanyarray.Tests.cs

Verified: all 112 np.where tests and 62 np.asanyarray tests pass on net8.0 and net10.0.
np.asanyarray(new object[]{1, 2.5, 3}) threw NotSupportedException because
`case Array array` matched object[] first and `new NDArray(object[])` rejects
object as an element type. object[] has no fixed dtype, so routing through the
non-generic IEnumerable path (which applies NumPy-like type promotion) is the
correct behavior.
Added an explicit `case object[] objArr` branch that delegates to
ConvertNonGenericEnumerable, which already handles:
- Homogeneous object[]: detected via FindCommonNumericType, single dtype
- Mixed object[]: promoted to common type (e.g. int + double -> double)
- Empty object[]: returns empty double[] (matches NumPy float64 default)
- Bool+int mix: promotes bool to int via Convert.ToInt32 (True=1, False=0)
Regression tests added in np.asanyarray.Tests.cs covering all four cases.
All 66 np.asanyarray tests pass on net8.0 and net10.0.
Code review caught dead code paths and over-narrative comments. Net change is -293/+19 across three files.

ILKernelGenerator.Where.cs (-249 lines):
- Delete `GetNPTypeCode<T>` (use shared InfoOf<T>.NPTypeCode instead).
- Delete `GetMaskCreationMethod256/128` and the entire 200-line `Static Mask Creation Methods (fallback)` region (CreateMaskV256_*Byte and CreateMaskV128_*Byte). They were never called -- the inline IL emitter at EmitInlineMaskCreationV256/V128 handles the mask creation directly via the cached MethodInfo lookups. The static helpers existed as an early prototype fallback path that became unreachable.
- Delete the `_v256ZeroULong` field with the meaningless `IsStatic ? null! : null!` tautology (only `_v256GetZeroULong` is used).

np.where.cs (+2 lines):
- Add `default: throw NotSupportedException(...)` to the `WhereKernelDispatch` switch. The kernel path is currently only reached for the 12 supported NPTypeCodes, but the missing default would silently fall through and return an uninitialized result if a new NPTypeCode were ever added without updating this switch. The iterator-path switch (line 142) already has this guard.

np.asanyarray.cs (-43/+18 net):
- Cap `stackalloc NPTypeCode[span.Length]` at 12 (the max possible unique NPTypeCodes given the seenMask deduplication). The previous unbounded stackalloc could blow the stack for very large user lists.
- Remove the dead `hasDecimal` variable (set but never read; the early-exit for decimal returns immediately on first hit).
- Trim narrative/microbenchmark comments per CLAUDE.md guidance: removed "Optimized: ...3-7x faster", "optimization #4", "~4x faster than always using Convert", "Pre-sized list (optimization: ...)", and a handful of WHAT-the-code-does comments that restated obvious switch arms.
- Tighten Tuple/Enumerator helpers (collapse trivial if/else into ternary).

Verified: 178 np.where + np.asanyarray tests still pass on net8.0 + net10.0.
…-circuits
Second-round code review caught one real bug and several minor efficiency issues.
1. Fix: FindCommonNumericType promoted pure-float object[] to double
np.asanyarray(new object[]{1.5f, 2.5f}) returned Double instead of Single.
Root cause: the early-exit `if (hasDouble || hasFloat) return typeof(double)`
fired before the `uniqueCount == 1` check that preserves the original dtype.
Removing the hasFloat arm lets the general path handle it:
- Pure float32 -> uniqueCount == 1 -> returns firstType (Single) -- matches NumPy
- int + float32 -> _FindCommonType_Scalar -> returns Double -- matches NumPy NEP50
- Pure float64 -> unchanged (still Double)
- decimal-wins-everything early exit preserved.
Two regression tests added:
- ObjectArray_AllFloat_PreservesSingle
- ObjectArray_MixedIntAndFloat32_PromotesToDouble
2. Perf: skip type promotion in np.where when x.dtype == y.dtype
Previously _FindCommonType(x, y) always ran, even when both operands shared a
dtype. Short-circuit to x.GetTypeCode in that case, saving one dict lookup +
two astype traversals per call. The NEP50 lookup still runs when dtypes
differ, preserving scalar+array promotion semantics.
3. Perf: skip broadcast_arrays when all three shapes already match
broadcast_arrays allocates three fresh NDArrays plus helper Shape[]. For the
common case of np.where(mask, arr, other_arr) where all three arrays share a
shape, this is wasted. Skip it when condition.Shape == x.Shape == y.Shape
(Shape == compares by dimensions).
4. Perf: cache Vector256/Vector128 generic MethodInfo
EmitWhereV256BodyWithOffset and EmitWhereV128BodyWithOffset did
Array.Find(typeof(Vector*).GetMethods(), ...) three times per call, each
scanning ~100 methods. Per kernel generation (4-way unrolled + 1 remainder
call = 5 calls), that was 15 reflection scans per T, or ~180 on first use
across all 12 dtypes. Cached as six static readonly fields; only
MakeGenericMethod(typeof(T)) runs per call.
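The caching in point 4 can be sketched as below: resolve the generic method definition once into a static readonly field, so only MakeGenericMethod runs per kernel generation. The class and field names are illustrative, not NumSharp's actual ones:

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

static class VectorReflectionCacheSketch
{
    // Scans typeof(Vector256)'s ~100 methods exactly once, at type initialization.
    static readonly MethodInfo V256LoadDef = Array.Find(
        typeof(Vector256).GetMethods(),
        m => m.Name == "Load" && m.IsGenericMethodDefinition
             && m.GetParameters().Length == 1)!;

    // Per-call cost is just closing the definition over the element type.
    public static MethodInfo V256Load(Type elementType)
        => V256LoadDef.MakeGenericMethod(elementType);
}
```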
5. Polish: doc + error message
- where(NDArray) xmldoc was copy-pasted from the 3-arg overload ("Return
elements chosen from x or y"); rewritten to describe nonzero semantics.
- object[] NotSupportedException now names the actual problem ("element type
is not a supported NumSharp dtype") instead of just reporting the length.
Verified: 180 np.where + np.asanyarray tests pass on net8.0 + net10.0.
np.where's IL kernel had ~35 MethodInfo fields scattered across the file using
`Array.Find(...)!` null-forgiveness, which throws NullReferenceException at first
use if a framework method ever gets renamed/removed. The existing CachedMethods
nested class in ILKernelGenerator.cs follows a fail-fast `?? throw new
MissingMethodException(type, name)` pattern, keyed per MethodInfo, and is the
project convention for every other kernel partial.
Changes:
- Make `CachedMethods` a `partial` nested class so Where-specific reflection can
live alongside the kernel file it serves. (ILKernelGenerator.cs: 1 line.)
- Delete the 35 `_v128*/_v256*/_avx2*/_sse41*` private fields from
ILKernelGenerator.Where.cs and move them into a new "Where Kernel Methods"
region inside a partial `CachedMethods` declaration at the bottom of that
file. Renamed to PascalCase (e.g. _v256LoadByte -> V256LoadByte) to match the
existing CachedMethods naming convention.
- Introduce three small helpers inside CachedMethods:
- FindGenericMethod(Type, string name, int? paramCount) - wraps the
`Array.Find(GetMethods(), m => m.IsGenericMethodDefinition && ...)` pattern
with a MissingMethodException fail-fast throw. Handles the overload count
disambiguation for Load/Store.
- FindMethodExact(Type, string name, Type[] argTypes) - wraps GetMethod with a
fail-fast throw. Used for Avx2/Sse41 specific overloads.
- GetZeroGetter(Type vectorOfT) - wraps Property("Zero").GetMethod with a
fail-fast throw. Used for the 8 Vector*<T>.Zero getters.
- Update all 41 call sites in EmitInlineMaskCreationV256/V128 and
EmitWhereV256/V128BodyWithOffset to use CachedMethods.Xxx.
Behaviour unchanged; 180 np.where + np.asanyarray tests still pass on net8.0 +
net10.0. The single real benefit is earlier and clearer failure if any of the
~35 framework API names change in a future .NET release.
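The fail-fast helpers can be sketched as below; the signatures follow the description above, though the real CachedMethods handles more overload disambiguation:

```csharp
using System;
using System.Reflection;

static class FailFastReflectionSketch
{
    public static MethodInfo FindGenericMethod(Type type, string name, int? paramCount = null)
        => Array.Find(type.GetMethods(),
               m => m.IsGenericMethodDefinition && m.Name == name
                    && (paramCount is null || m.GetParameters().Length == paramCount))
           ?? throw new MissingMethodException(type.FullName, name);

    public static MethodInfo FindMethodExact(Type type, string name, Type[] argTypes)
        => type.GetMethod(name, argTypes)
           ?? throw new MissingMethodException(type.FullName, name);
}
```

The `?? throw new MissingMethodException(...)` replaces the `Array.Find(...)!` null-forgiveness, so a renamed framework method fails at type initialization with the missing name, rather than with a NullReferenceException at first kernel use.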
…tibility
CI failure on macos-latest (ARM64/Apple Silicon) reported 31 np.where tests
throwing PlatformNotSupportedException at runtime:
PlatformNotSupportedException: Operation is not supported on this platform.
at System.Runtime.Intrinsics.X86.Sse41.ConvertToVector128Int64(Vector128`1 value)
at IL_Where_Int64(...)
Root cause: the SIMD-emit path was gated only on `VectorBits >= 128`. On ARM64,
`Vector128.IsHardwareAccelerated` is true (maps to Neon), so VectorBits is 128,
and the kernel emits calls to Sse41/Avx2 byte-lane expansion intrinsics which
are x86-only.
Breakdown of the byte-mask expansion path by element size:
- 1-byte (byte): portable Vector*.Load/GreaterThan — safe on any SIMD platform
- 2-byte: Sse41.ConvertToVector128Int16 / Avx2.ConvertToVector256Int16
- 4-byte: Sse41.ConvertToVector128Int32 / Avx2.ConvertToVector256Int32
- 8-byte: Sse41.ConvertToVector128Int64 / Avx2.ConvertToVector256Int64
Fix: in GenerateWhereKernelIL, compute `useV256`/`useV128` with an additional
Sse41.IsSupported / Avx2.IsSupported guard — but only when elementSize > 1,
since the 1-byte path is portable. If neither x86 intrinsic set is available
for the required lane size, skip SIMD emission entirely; the scalar IL loop
that follows handles correctness.
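The gating logic in the fix can be sketched as follows; the variable names follow the description, and the real GenerateWhereKernelIL has more surrounding context:

```csharp
using System.Runtime.Intrinsics.X86;

static class SimdGateSketch
{
    public static (bool useV256, bool useV128) ChooseSimdPath(int elementSize, int vectorBits)
    {
        // vpmovzxb* byte-lane expansion is x86-only; the 1-byte path is portable SIMD.
        bool expandOk256 = elementSize == 1 || Avx2.IsSupported;
        bool expandOk128 = elementSize == 1 || Sse41.IsSupported;

        bool useV256 = vectorBits >= 256 && expandOk256;
        bool useV128 = !useV256 && vectorBits >= 128 && expandOk128;
        return (useV256, useV128);   // (false, false) => scalar IL loop only
    }
}
```

On ARM64/Neon, vectorBits is 128 but Sse41.IsSupported is false, so every element size except 1 returns (false, false) and the kernel emits only the scalar loop, matching the result described below.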
Also passes the useV256 decision to EmitWhereSIMDLoop explicitly instead of
recomputing it from VectorBits inside the loop, which was both duplicative and
ignored the IsSupported guard.
Result: on ARM64, byte-typed arrays still use Neon-backed SIMD; int/long/float/
double/short fall back to the scalar IL kernel. On x86 nothing changes.
Verified: 180 np.where + np.asanyarray tests pass on Windows x64 (net8.0 +
net10.0). ARM path awaits CI verification.
Summary
np.where(condition, x, y): uses DynamicMethod to generate type-specific kernels at runtime

Implementation
- WhereKernel<T>
- GetWhereKernel<T>()
- WhereExecute<T>()

Eligibility for SIMD Path
Falls back to iterator path for:
Test Plan
Closes #604