
[API] Support np.where via ILKernelGenerator #606

Merged
Nucs merged 19 commits into master from np_where
Apr 20, 2026
Conversation


@Nucs Nucs commented Apr 12, 2026

Summary

  • Add IL-generated SIMD optimization for np.where(condition, x, y)
  • Uses DynamicMethod to generate type-specific kernels at runtime
  • Vector256/Vector128.ConditionalSelect for SIMD element selection
  • 4x loop unrolling for instruction-level parallelism
  • Native long indexing for large arrays
  • Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)

Implementation

| Component | Description |
| --- | --- |
| `WhereKernel<T>` | Delegate type for IL-generated kernels |
| `GetWhereKernel<T>()` | Gets or generates the cached kernel |
| `WhereExecute<T>()` | Main entry point with automatic fallback |
| Mask creation | Grouped by element size (1/2/4/8 bytes) |
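The components in the table can be sketched as a minimal C# contract. The delegate signature matches the commit notes; the cache and the scalar body here are illustrative stand-ins, since the real implementation emits a SIMD kernel body via `DynamicMethod`:

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of the components above; not the actual ILKernelGenerator source.
public static unsafe class WhereKernelSketch
{
    // One IL-generated kernel per element type T.
    public delegate void WhereKernel<T>(bool* cond, T* x, T* y, T* result, long count)
        where T : unmanaged;

    static readonly ConcurrentDictionary<Type, Delegate> _cache = new();

    // GetWhereKernel<T>(): get or generate the cached kernel for T.
    public static WhereKernel<T> GetWhereKernel<T>() where T : unmanaged
        => (WhereKernel<T>)_cache.GetOrAdd(typeof(T), static _ =>
        {
            WhereKernel<T> kernel = ScalarKernel;  // real code: emit IL here
            return kernel;
        });

    // Stand-in kernel body; the generated IL additionally runs a 4x-unrolled
    // Vector256/Vector128.ConditionalSelect loop over the bulk of the array.
    static void ScalarKernel<T>(bool* cond, T* x, T* y, T* result, long count)
        where T : unmanaged
    {
        for (long i = 0; i < count; i++)
            result[i] = cond[i] ? x[i] : y[i];
    }
}
```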

Eligibility for SIMD Path

bool canUseKernel = ILKernelGenerator.Enabled &&
                    cond.typecode == NPTypeCode.Boolean &&
                    cond.Shape.IsContiguous &&
                    xArr.Shape.IsContiguous &&
                    yArr.Shape.IsContiguous;

Falls back to iterator path for:

  • Non-contiguous/broadcasted arrays
  • Non-bool conditions (need truthiness conversion)

Test Plan

  • 26 new WhereSimdTests for SIMD correctness
  • 36 existing np_where_Test pass
  • 21 battle tests pass
  • All 12 dtypes covered

Closes #604


Nucs commented Apr 12, 2026

Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)

| Metric | Value |
| --- | --- |
| Kernel time | 2.62 ms |
| Throughput | 381 M elements/s |
| NumPy baseline | ~1.86 ms |
| Ratio vs NumPy | ~1.4x slower |

Scaling

| Size | Kernel (ms) | Throughput |
| --- | --- | --- |
| 1K | 0.0024 | 416 M/s |
| 10K | 0.027 | 368 M/s |
| 100K | 0.28 | 356 M/s |
| 1M | 2.62 | 381 M/s |

How It Works

Replaced scalar conditional mask creation with single-instruction SIMD expansion:

// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
    bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
    bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);

// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64();  // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);

| Element Size | Intrinsic | Effect |
| --- | --- | --- |
| 8 bytes | vpmovzxbq | 4 bytes → 4 qwords |
| 4 bytes | vpmovzxbd | 8 bytes → 8 dwords |
| 2 bytes | vpmovzxbw | 16 bytes → 16 words |

All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.


Nucs commented Apr 12, 2026

Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:

| Version | Kernel Time | vs NumPy |
| --- | --- | --- |
| With method call | 2.6 ms | 1.4x slower |
| Inlined IL | 0.48 ms | 3.9x faster |
| NumPy | 1.86 ms | baseline |

What Changed

Instead of emitting Call opcodes to mask helper methods, the IL now emits the full AVX2 instruction sequence inline:

ldind.u4           ; Load 4 bool bytes
call CreateScalar  ; Put in Vector128
call AsByte        ; Reinterpret
call vpmovzxbq     ; AVX2 zero-extend bytes to qwords
call AsUInt64      ; Reinterpret  
call get_Zero      ; Vector256<ulong>.Zero
call GreaterThan   ; Create mask

This eliminates:

  • Method call overhead (~12%)
  • Runtime Avx2.IsSupported checks in hot path
  • JIT optimization barriers at call boundaries

The kernel now processes 2,083 million elements per second, significantly faster than NumPy's ~540 M elements/s.
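The effect can be sketched with a small `DynamicMethod` that emits the 8-byte mask sequence inline. The intrinsic and reflection APIs below are real .NET methods, but `BuildMaskExpander` and its lookup style are illustrative, not NumSharp's actual emitter (which emits this sequence inside the kernel loop):

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class InlineMaskEmitSketch
{
    // Builds a uint -> Vector256<ulong> delegate whose body is the inline
    // AVX2 mask expansion, with no helper-method call in the hot path.
    // Requires Avx2.IsSupported at run time (.NET 7+ for Vector256.GreaterThan).
    public static Func<uint, Vector256<ulong>> BuildMaskExpander()
    {
        var dm = new DynamicMethod("ExpandMask8", typeof(Vector256<ulong>), new[] { typeof(uint) });
        var il = dm.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);  // 4 packed bool bytes
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod("CreateScalar", new[] { typeof(uint) })!);
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod("AsByte")!.MakeGenericMethod(typeof(uint)));
        // vpmovzxbq: zero-extend 4 bytes to 4 qwords
        il.Emit(OpCodes.Call, typeof(Avx2).GetMethod("ConvertToVector256Int64", new[] { typeof(Vector128<byte>) })!);
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethod("AsUInt64")!.MakeGenericMethod(typeof(long)));
        il.Emit(OpCodes.Call, typeof(Vector256<ulong>).GetProperty("Zero")!.GetMethod!);
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethod("GreaterThan")!.MakeGenericMethod(typeof(ulong)));
        il.Emit(OpCodes.Ret);

        return (Func<uint, Vector256<ulong>>)dm.CreateDelegate(typeof(Func<uint, Vector256<ulong>>));
    }
}
```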

@Nucs Nucs marked this pull request as draft April 15, 2026 16:16
Nucs added 19 commits April 20, 2026 19:08
…n, x, y)

Add IL-generated kernels for np.where using runtime code generation:
- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:
- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:
- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)

Falls back to iterator path for:
- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:
- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
Replace scalar conditional mask creation with SIMD intrinsics:

V256 mask creation (for AVX2):
- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):
- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single
zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store,
and ConditionalSelect methods that were failing because these are
generic method definitions requiring special handling.

Performance (1M double elements):
- Kernel: 2.6 ms @ 381 M elements/s
- NumPy baseline: ~1.86ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.
Instead of emitting Call opcodes to mask helper methods, now emit
the AVX2/SSE4.1 instructions directly inline in the IL stream.

This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:
- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):
- Previous (method call): 2.6 ms
- Inlined IL:             0.48 ms (5.4x faster)
- NumPy baseline:         1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt* which are extension
methods on Vector128/Vector256 static classes, not instance methods.
Implements NumPy 2.x NEP50 "weak scalar" semantics for np.where, detecting
scalar NDArrays via Shape.IsScalar for clean type promotion without
requiring per-type overloads.

TYPE PROMOTION RULES:

1. Same-type scalars: preserve type
   - int + int → int32 (both same type, preserve)
   - byte + byte → byte
   - float + float → float32

2. Mixed-type scalars: use array-array promotion
   - int + long → int64
   - int + double → float64
   - byte + short → int16

3. NEP50 weak scalar: scalar + array → array dtype wins
   - int scalar + uint8 array → uint8
   - int scalar + float32 array → float32

4. Cross-kind promotion uses standard rules
   - float scalar + int32 array → float64

IMPLEMENTATION:

- Simplified to 4 overloads (NDArray, object+NDArray, NDArray+object, object+object)
- Detect scalar NDArrays via Shape.IsScalar (works for both implicit conversion
  and explicit np.array() calls)
- Input arrays converted to output dtype before kernel/iterator dispatch

NOTE: Unlike NumPy where Python int literals widen to int64, C# int literals
create int32 scalar NDArrays indistinguishable from explicit np.array(1, dtype=int32).
We preserve same-type scalars rather than widening, which is consistent with
C#'s typed literal semantics.
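The four promotion rules can be restated as a small, self-contained decision function. This is a hypothetical helper over a subset of types for illustration, not NumSharp's `_FindCommonType_Scalar`:

```csharp
using System;

static class Nep50Sketch
{
    static bool IsInteger(Type t) =>
        t == typeof(byte) || t == typeof(short) || t == typeof(int) || t == typeof(long);

    static int KindOf(Type t) => IsInteger(t) ? 0 : 1;  // 0 = integer, 1 = floating

    // Stand-in for standard array-array promotion (illustrative subset).
    static Type Common(Type a, Type b)
    {
        if (a == typeof(double) || b == typeof(double)) return typeof(double);
        if (a == typeof(float) || b == typeof(float)) return typeof(double);  // cross-kind widens
        if (a == typeof(long) || b == typeof(long)) return typeof(long);
        return typeof(int);
    }

    public static Type Promote(Type x, bool xScalar, Type y, bool yScalar)
    {
        if (x == y) return x;                        // rule 1: same type preserved
        if (xScalar == yScalar) return Common(x, y); // rule 2: scalar+scalar uses array-array rules
        Type s = xScalar ? x : y, arr = xScalar ? y : x;
        if (KindOf(s) <= KindOf(arr)) return arr;    // rule 3: weak scalar, array dtype wins
        return Common(s, arr);                       // rule 4: cross-kind promotion
    }
}
```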
Extended np.asanyarray to handle all common C# collection types:

Collections supported via IEnumerable<T> pattern matching:
- List<T>, IList<T>, ICollection<T>, IEnumerable<T>
- IReadOnlyList<T>, IReadOnlyCollection<T>
- ReadOnlyCollection<T>
- LinkedList<T>
- HashSet<T>, SortedSet<T>
- Queue<T>, Stack<T>
- ArraySegment<T> (implements IEnumerable<T>)
- ImmutableArray<T>, ImmutableList<T>, ImmutableHashSet<T>
- Any LINQ query result (IEnumerable<T>)

Special handling for types not implementing IEnumerable<T>:
- Memory<T> - uses direct cast and ToArray()
- ReadOnlyMemory<T> - uses direct cast and ToArray()

Implementation approach:
- Clean pattern matching on IEnumerable<T> for all 12 NumSharp types
- No method reflection (direct LINQ .ToArray() calls)
- Memory/ReadOnlyMemory handled via type switch with direct casts

Supported element types (NumSharp's 12 types):
bool, byte, short, ushort, int, uint, long, ulong,
char, float, double, decimal

Note: sbyte, IntPtr, UIntPtr are NOT supported (not in NPTypeCode)
Added fallback support for collections that don't implement generic
IEnumerable<T> but still implement the non-generic interfaces:

Non-generic IEnumerable fallback:
- ArrayList, Hashtable.Keys/Values, BitArray, etc.
- Any legacy collection implementing only IEnumerable
- Element type detected from first non-null item

Non-generic IEnumerator fallback:
- Direct enumerator objects (e.g., from yield return methods)
- Element type detected from first non-null item

Implementation:
- Enumerate items into List<object>
- Detect element type from first item
- Convert to typed array via type switch (no reflection)
- Returns null for unsupported element types (falls through to error)

This completes the collection support hierarchy:
1. IEnumerable<T> - direct pattern matching (most efficient)
2. Memory<T>/ReadOnlyMemory<T> - special handling (no IEnumerable<T>)
3. IEnumerable (non-generic) - fallback with type detection
4. IEnumerator (non-generic) - fallback with type detection
- ConvertMemory: single type switch with ternary for ReadOnly vs mutable
- ConvertNonGenericEnumerable: delegate to ConvertEnumerator via GetEnumerator()
…andling

NumPy parity fixes based on battletest comparison:

1. Tuple/ValueTuple support (NEW):
   - Both Tuple<> and ValueTuple<> now iterate their elements
   - Uses ITuple interface (available in .NET Core 2.0+)
   - NumPy: np.asanyarray((1,2,3)) -> dtype=int64, shape=(3,)
   - NumSharp now matches this behavior

2. Empty non-generic collections (FIX):
   - Empty ArrayList/IEnumerable now returns empty double[]
   - Matches NumPy's default of float64 for empty collections
   - Previously threw NotSupportedException

Tests added:
- ValueTuple_IsIterable, Tuple_IsIterable
- ValueTuple_MixedTypes_UsesFirstElementType
- EmptyTuple_ReturnsEmptyDoubleArray
- EmptyArrayList_ReturnsEmptyDoubleArray
- Misaligned tests documenting intentional NumPy differences
…ions

Adds proper type promotion when collections contain mixed numeric types:
- int + double -> double (matches NumPy float64 promotion)
- int + bool -> int
- float + any int -> double
- decimal wins if present

Implementation:
- Added FindCommonNumericType() to detect widest compatible type
- Changed ConvertObjectListToNDArray to use Convert.To* methods
  instead of direct casts, enabling cross-type conversion
- Updated ConvertTuple and ConvertEnumerator to use type promotion

Tests added:
- ValueTuple_MixedTypes_PromotesToCommonType: (1, 2.5, 3) -> double
- ValueTuple_IntAndBool_PromotesToInt: (1, true, 3) -> int
Use pattern matching `is T v ? v : Convert.ToT(item)` instead of always
calling Convert.ToT(). This gives direct unbox speed for homogeneous
collections (the common case) while still handling mixed types correctly.

Benchmark results (100K iterations, size 1000):
- Convert.ToInt32 always: 4088 ns/op
- is int ? v : Convert:   1038 ns/op  (3.9x faster)

This optimization affects:
- ArrayList and other non-generic IEnumerable
- Tuple/ValueTuple via ITuple interface
- Any path through ConvertObjectListToNDArray

No behavior change - mixed type collections still work via Convert fallback.
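The fast path described above, sketched for one dtype (the real code repeats this per supported type):

```csharp
using System;
using System.Collections.Generic;

static class UnboxFastSketch
{
    // Direct unbox for the homogeneous common case; Convert.ToInt32 only
    // runs for mixed-type items. Illustrative, not the NumSharp source.
    public static int[] ToInt32Array(List<object> items)
    {
        var result = new int[items.Count];
        for (int i = 0; i < items.Count; i++)
        {
            object item = items[i];
            result[i] = item is int v ? v : Convert.ToInt32(item);  // unbox hit = no Convert call
        }
        return result;
    }
}
```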
…mization

For IEnumerable<T>, use optimized extraction:
1. List<T>: CollectionsMarshal.AsSpan() + CopyTo (direct memory access)
2. ICollection<T>: CopyTo() (avoids enumerator overhead)
3. Other: fallback to LINQ ToArray()

Benchmark results (size 10000, List<int>):
- Old (ToArray):      14129 ns/op
- New (ToArrayFast):  11665 ns/op
- Speedup: 1.21x (21% faster)

The CollectionsMarshal.AsSpan approach gives direct access to List<T>'s
internal array, avoiding the allocation and copy overhead of ToArray().
For ICollection<T>, CopyTo() is used which avoids enumerator overhead.
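The tiered extraction the commit calls `ToArrayFast` can be sketched like this (body is illustrative; `CollectionsMarshal.AsSpan` and `ICollection<T>.CopyTo` are real .NET APIs):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

static class ToArrayFastSketch
{
    public static T[] ToArrayFast<T>(IEnumerable<T> source)
    {
        switch (source)
        {
            case List<T> list:
                // CollectionsMarshal.AsSpan: direct view over List<T>'s
                // backing array; one bulk copy, no enumerator.
                var fromList = new T[list.Count];
                CollectionsMarshal.AsSpan(list).CopyTo(fromList);
                return fromList;
            case ICollection<T> coll:
                var fromColl = new T[coll.Count];
                coll.CopyTo(fromColl, 0);       // bulk copy, no enumerator
                return fromColl;
            default:
                return source.ToArray();        // lazy sequences / LINQ queries
        }
    }
}
```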
Replace `new T[n]` with `GC.AllocateUninitializedArray<T>(n)` in all
array allocations within np.asanyarray. Since we immediately overwrite
all elements, the default zeroing is wasted work.

Affected paths:
- ToArrayFast<T>: List<T> and ICollection<T> extraction
- ConvertObjectListToNDArray: All 12 dtype allocations

Benchmark (GC.AllocateUninitializedArray vs new T[]):
- Size 1000: 38 ns vs 156 ns (4x faster allocation)

This optimization compounds with the previous CollectionsMarshal and
pattern-match optimizations for significant cumulative improvement.
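The allocation swap in miniature (a sketch; the shown helper is hypothetical):

```csharp
using System;

static class UninitAllocSketch
{
    // GC.AllocateUninitializedArray skips the CLR's zero-fill, which is
    // wasted work when every element is overwritten immediately after.
    public static int[] CopyInto(ReadOnlySpan<int> source)
    {
        int[] dest = GC.AllocateUninitializedArray<int>(source.Length);
        source.CopyTo(dest);  // every slot written before the array escapes
        return dest;
    }
}
```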
Optimizations applied:

1. FindCommonNumericType:
   - Early exit when decimal found (highest priority)
   - Early exit when float/double found (promotes to double)
   - Use CollectionsMarshal.AsSpan for bounds-check-free iteration
   - Stackalloc for type code deduplication
   - Reuse existing _FindCommonType_Scalar for consistent promotion

2. ConvertObjectListToNDArray:
   - Use CollectionsMarshal.AsSpan(items) for ~10-15% speedup
   - Eliminates bounds checking in tight conversion loops

3. ConvertEnumerator:
   - Pre-size List<object> when ICollection count is known
   - Eliminates resize allocations for known-size collections

4. ConvertTuple:
   - Pre-size List<object> with tuple.Length

Net: -27 lines while adding performance improvements
…to MSTest v3

Post-rebase cleanup after master migrated the test suite from TUnit to MSTest v3
(commits ac02033, e0db3c3). The 4 test files introduced on this branch still used
TUnit's [Test] attribute and `using TUnit.Core;`, which broke the build.

Changes per file:
- Replace `using TUnit.Core;` (removed)
- Add `[TestClass]` attribute to the test class
- Replace all `[Test]` attributes with `[TestMethod]`

Files migrated:
- test/NumSharp.UnitTest/Logic/np.where.Test.cs
- test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs
- test/NumSharp.UnitTest/Creation/np.asanyarray.Tests.cs

Verified: all 112 np.where tests and 62 np.asanyarray tests pass on net8.0 and net10.0.
np.asanyarray(new object[]{1, 2.5, 3}) threw NotSupportedException because
`case Array array` matched object[] first and `new NDArray(object[])` rejects
object as an element type. object[] has no fixed dtype, so routing through the
non-generic IEnumerable path (which applies NumPy-like type promotion) is the
correct behavior.

Added an explicit `case object[] objArr` branch that delegates to
ConvertNonGenericEnumerable, which already handles:
- Homogeneous object[]: detected via FindCommonNumericType, single dtype
- Mixed object[]: promoted to common type (e.g. int + double -> double)
- Empty object[]: returns empty double[] (matches NumPy float64 default)
- Bool+int mix: promotes bool to int via Convert.ToInt32 (True=1, False=0)

Regression tests added in np.asanyarray.Tests.cs covering all four cases.
All 66 np.asanyarray tests pass on net8.0 and net10.0.
Code review caught dead code paths and over-narrative comments. Net change is
-293/+19 across three files.

ILKernelGenerator.Where.cs (-249 lines):
- Delete `GetNPTypeCode<T>` (use shared InfoOf<T>.NPTypeCode instead).
- Delete `GetMaskCreationMethod256/128` and the entire 200-line
  `Static Mask Creation Methods (fallback)` region (CreateMaskV256_*Byte and
  CreateMaskV128_*Byte). They were never called -- the inline IL emitter at
  EmitInlineMaskCreationV256/V128 handles the mask creation directly via the
  cached MethodInfo lookups. The static helpers existed as an early prototype
  fallback path that became unreachable.
- Delete `_v256ZeroULong` field with the meaningless `IsStatic ? null! : null!`
  tautology (only `_v256GetZeroULong` is used).

np.where.cs (+2 lines):
- Add `default: throw NotSupportedException(...)` to `WhereKernelDispatch`
  switch. The kernel path is currently only reached for the 12 supported
  NPTypeCodes, but the missing default would silently fall through and return
  an uninitialized result if a new NPTypeCode were ever added without updating
  this switch. The iterator-path switch (line 142) already has this guard.

np.asanyarray.cs (-43/+18 net):
- Cap `stackalloc NPTypeCode[span.Length]` at 12 (max possible unique
  NPTypeCodes given the seenMask deduplication). The previous unbounded
  stackalloc could blow the stack for very large user lists.
- Remove dead `hasDecimal` variable (set but never read; the early-exit
  for decimal returns immediately on first hit).
- Trim narrative/microbenchmark comments per CLAUDE.md guidance:
  removed "Optimized: ...3-7x faster", "optimization #4", "~4x faster than
  always using Convert", "Pre-sized list (optimization: ...)", and a handful
  of WHAT-the-code-does comments that restated obvious switch arms.
- Tighten Tuple/Enumerator helpers (collapse trivial if/else into ternary).

Verified: 178 np.where + np.asanyarray tests still pass on net8.0 + net10.0.
…-circuits

Second-round code review caught one real bug and several minor efficiency issues.

1. Fix: FindCommonNumericType promoted pure-float object[] to double

   np.asanyarray(new object[]{1.5f, 2.5f}) returned Double instead of Single.
   Root cause: the early-exit `if (hasDouble || hasFloat) return typeof(double)`
   fired before the `uniqueCount == 1` check that preserves the original dtype.
   Removing the hasFloat arm lets the general path handle it:
   - Pure float32 -> uniqueCount == 1 -> returns firstType (Single)  -- matches NumPy
   - int + float32 -> _FindCommonType_Scalar -> returns Double -- matches NumPy NEP50
   - Pure float64 -> unchanged (still Double)
   - decimal-wins-everything early exit preserved.

   Two regression tests added:
   - ObjectArray_AllFloat_PreservesSingle
   - ObjectArray_MixedIntAndFloat32_PromotesToDouble

2. Perf: skip type promotion in np.where when x.dtype == y.dtype

   Previously _FindCommonType(x, y) always ran, even when both operands shared a
   dtype. Short-circuit to x.GetTypeCode in that case, saving one dict lookup +
   two astype traversals per call. The NEP50 lookup still runs when dtypes
   differ, preserving scalar+array promotion semantics.

3. Perf: skip broadcast_arrays when all three shapes already match

   broadcast_arrays allocates three fresh NDArrays plus helper Shape[]. For the
   common case of np.where(mask, arr, other_arr) where all three arrays share a
   shape, this is wasted. Skip it when condition.Shape == x.Shape == y.Shape
   (Shape == compares by dimensions).

4. Perf: cache Vector256/Vector128 generic MethodInfo

   EmitWhereV256BodyWithOffset and EmitWhereV128BodyWithOffset did
   Array.Find(typeof(Vector*).GetMethods(), ...) three times per call, each
   scanning ~100 methods. Per kernel generation (4-way unrolled + 1 remainder
   call = 5 calls), that was 15 reflection scans per T, or ~180 on first use
   across all 12 dtypes. Cached as six static readonly fields; only
   MakeGenericMethod(typeof(T)) runs per call.
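   The caching pattern from item 4 can be sketched as follows. Field and helper names here are illustrative, not NumSharp's; `Vector256.ConditionalSelect<T>` is a real .NET 7+ generic method definition:

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

static class CachedVectorMethods
{
    // Resolved once at type initialization instead of per kernel generation.
    public static readonly MethodInfo V256ConditionalSelect =
        FindGeneric(typeof(Vector256), "ConditionalSelect", paramCount: 3);

    static MethodInfo FindGeneric(Type type, string name, int paramCount) =>
        Array.Find(type.GetMethods(),
                   m => m.Name == name && m.IsGenericMethodDefinition &&
                        m.GetParameters().Length == paramCount)
        ?? throw new MissingMethodException(type.Name, name);  // fail fast

    // Closing the cached definition per T is cheap compared to rescanning
    // the ~100 methods on the Vector256 static class each time.
    public static MethodInfo ConditionalSelectFor<T>() where T : unmanaged
        => V256ConditionalSelect.MakeGenericMethod(typeof(T));
}
```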

5. Polish: doc + error message

   - where(NDArray) xmldoc was copy-pasted from the 3-arg overload ("Return
     elements chosen from x or y"); rewritten to describe nonzero semantics.
   - object[] NotSupportedException now names the actual problem ("element type
     is not a supported NumSharp dtype") instead of just reporting the length.

Verified: 180 np.where + np.asanyarray tests pass on net8.0 + net10.0.
np.where's IL kernel had ~35 MethodInfo fields scattered across the file using
`Array.Find(...)!` null-forgiveness, which throws NullReferenceException at first
use if a framework method ever gets renamed/removed. The existing CachedMethods
nested class in ILKernelGenerator.cs follows a fail-fast `?? throw new
MissingMethodException(type, name)` pattern, keyed per MethodInfo, and is the
project convention for every other kernel partial.

Changes:

- Make `CachedMethods` a `partial` nested class so Where-specific reflection can
  live alongside the kernel file it serves. (ILKernelGenerator.cs: 1 line.)

- Delete the 35 `_v128*/_v256*/_avx2*/_sse41*` private fields from
  ILKernelGenerator.Where.cs and move them into a new "Where Kernel Methods"
  region inside a partial `CachedMethods` declaration at the bottom of that
  file. Renamed to PascalCase (e.g. _v256LoadByte -> V256LoadByte) to match the
  existing CachedMethods naming convention.

- Introduce three small helpers inside CachedMethods:
  - FindGenericMethod(Type, string name, int? paramCount) - wraps the
    `Array.Find(GetMethods(), m => m.IsGenericMethodDefinition && ...)` pattern
    with a MissingMethodException fail-fast throw. Handles the overload count
    disambiguation for Load/Store.
  - FindMethodExact(Type, string name, Type[] argTypes) - wraps GetMethod with a
    fail-fast throw. Used for Avx2/Sse41 specific overloads.
  - GetZeroGetter(Type vectorOfT) - wraps Property("Zero").GetMethod with a
    fail-fast throw. Used for the 8 Vector*<T>.Zero getters.

- Update all 41 call sites in EmitInlineMaskCreationV256/V128 and
  EmitWhereV256/V128BodyWithOffset to use CachedMethods.Xxx.

Behaviour unchanged; 180 np.where + np.asanyarray tests still pass on net8.0 +
net10.0. The single real benefit is earlier and clearer failure if any of the
~35 framework API names change in a future .NET release.
…tibility

CI failure on macos-latest (ARM64/Apple Silicon) reported 31 np.where tests
throwing PlatformNotSupportedException at runtime:

    PlatformNotSupportedException: Operation is not supported on this platform.
      at System.Runtime.Intrinsics.X86.Sse41.ConvertToVector128Int64(Vector128`1 value)
      at IL_Where_Int64(...)

Root cause: the SIMD-emit path was gated only on `VectorBits >= 128`. On ARM64,
`Vector128.IsHardwareAccelerated` is true (maps to Neon), so VectorBits is 128,
and the kernel emits calls to Sse41/Avx2 byte-lane expansion intrinsics which
are x86-only.

Breakdown of the byte-mask expansion path by element size:
  - 1-byte (byte): portable Vector*.Load/GreaterThan — safe on any SIMD platform
  - 2-byte: Sse41.ConvertToVector128Int16 / Avx2.ConvertToVector256Int16
  - 4-byte: Sse41.ConvertToVector128Int32 / Avx2.ConvertToVector256Int32
  - 8-byte: Sse41.ConvertToVector128Int64 / Avx2.ConvertToVector256Int64

Fix: in GenerateWhereKernelIL, compute `useV256`/`useV128` with an additional
Sse41.IsSupported / Avx2.IsSupported guard — but only when elementSize > 1,
since the 1-byte path is portable. If neither x86 intrinsic set is available
for the required lane size, skip SIMD emission entirely; the scalar IL loop
that follows handles correctness.

Also passes the useV256 decision to EmitWhereSIMDLoop explicitly instead of
recomputing it from VectorBits inside the loop, which was both duplicative and
ignored the IsSupported guard.

Result: on ARM64, byte-typed arrays still use Neon-backed SIMD; int/long/float/
double/short fall back to the scalar IL kernel. On x86 nothing changes.

Verified: 180 np.where + np.asanyarray tests pass on Windows x64 (net8.0 +
net10.0). ARM path awaits CI verification.
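The corrected gating reduces to a small decision like the following sketch. The variable names follow the commit text; the surrounding generator is assumed:

```csharp
using System.Runtime.Intrinsics.X86;

static class SimdGateSketch
{
    // vectorBits comes from the generator's Vector128/Vector256
    // IsHardwareAccelerated probe; the x86 guard applies only when the
    // byte-lane expansion intrinsics (vpmovzx*/pmovzx*) are needed.
    public static (bool useV256, bool useV128) Decide(int vectorBits, int elementSize)
    {
        bool needsX86 = elementSize > 1;  // 1-byte masks use portable APIs (Neon-safe)
        bool useV256 = vectorBits >= 256 && (!needsX86 || Avx2.IsSupported);
        bool useV128 = !useV256 && vectorBits >= 128 && (!needsX86 || Sse41.IsSupported);
        return (useV256, useV128);        // both false => emit scalar IL loop only
    }
}
```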
@Nucs Nucs marked this pull request as ready for review April 20, 2026 18:52
@Nucs Nucs merged commit 6079495 into master Apr 20, 2026
7 checks passed
@Nucs Nucs deleted the np_where branch April 20, 2026 18:55