Performance Results: AVX2 Mask Expansion Optimization

After implementing AVX2/SSE4.1 intrinsics for mask expansion, here are the benchmark results:

Kernel Performance (double, 1M elements)
Scaling
How It Works

Replaced scalar conditional mask creation with a single-instruction SIMD expansion:

// Before: 4 scalar conditionals for 8-byte elements
Vector256.Create(
bools[0] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[1] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[2] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul,
bools[3] != 0 ? 0xFFFFFFFFFFFFFFFFul : 0ul
);
// After: 2-3 instructions using AVX2
var bytes128 = Vector128.CreateScalar(*(uint*)bools).AsByte();
var expanded = Avx2.ConvertToVector256Int64(bytes128).AsUInt64(); // vpmovzxbq
return Vector256.GreaterThan(expanded, Vector256<ulong>.Zero);
All 12 dtypes supported with scalar fallback for non-AVX2/SSE4.1 systems.
Update: Inlined IL - Now 3.9x FASTER than NumPy!

By inlining the mask creation directly in IL instead of calling helper methods:
What Changed

Instead of emitting Call opcodes to mask helper methods, the mask creation is now emitted directly inline in the IL stream. This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in the hot path
- JIT optimization barriers at call boundaries
The kernel now processes 2,083 million elements per second - significantly faster than NumPy's ~540 million elements per second.
…n, x, y)

Add IL-generated kernels for np.where using runtime code generation:
- Uses DynamicMethod to generate type-specific kernels at runtime
- Vector256/Vector128.ConditionalSelect for SIMD element selection
- 4x loop unrolling for better instruction-level parallelism
- Full long indexing support for arrays > 2^31 elements
- Supports all 12 dtypes (11 via SIMD, Decimal via scalar fallback)
- Kernels cached per type for reuse

Architecture:
- WhereKernel<T> delegate: (bool* cond, T* x, T* y, T* result, long count)
- GetWhereKernel<T>(): Returns cached IL-generated kernel
- WhereExecute<T>(): Main entry point with automatic fallback

IL Generation:
- 4x unrolled SIMD loop (processes 4 vectors per iteration)
- Remainder SIMD loop (1 vector at a time)
- Scalar tail loop for remaining elements
- Mask creation methods by element size (1/2/4/8 bytes)
- All arithmetic uses long types natively (no int-to-long casts)

Falls back to iterator path for:
- Non-contiguous/broadcasted arrays (stride=0)
- Non-bool conditions (need truthiness conversion)

Files:
- src/NumSharp.Core/APIs/np.where.cs: Kernel dispatch logic
- src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs: IL generation
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs: 26 tests

Closes #604
Replace scalar conditional mask creation with SIMD intrinsics.

V256 mask creation (for AVX2):
- 8-byte elements: Avx2.ConvertToVector256Int64 (vpmovzxbq)
- 4-byte elements: Avx2.ConvertToVector256Int32 (vpmovzxbd)
- 2-byte elements: Avx2.ConvertToVector256Int16 (vpmovzxbw)

V128 mask creation (for SSE4.1):
- 8-byte elements: Sse41.ConvertToVector128Int64 (pmovzxbq)
- 4-byte elements: Sse41.ConvertToVector128Int32 (pmovzxbd)
- 2-byte elements: Sse41.ConvertToVector128Int16 (pmovzxbw)

Each intrinsic replaces 4-16 scalar conditionals with a single zero-extend + compare instruction sequence.

Also fixes reflection lookups for Vector256/Vector128.Load, Store, and ConditionalSelect methods that were failing because these are generic method definitions requiring special handling.

Performance (1M double elements):
- Kernel: 2.6ms @ 381 M elements/s
- NumPy baseline: ~1.86ms
- Ratio: ~1.4x slower (down from ~3x before optimization)

All 12 dtypes supported with fallback for non-AVX2/SSE4.1 systems.
Instead of emitting Call opcodes to mask helper methods, now emit the AVX2/SSE4.1 instructions directly inline in the IL stream. This eliminates:
- Method call overhead (~12% per call)
- Runtime Avx2.IsSupported checks in hot path
- JIT optimization barriers at call boundaries

The IL now emits the full mask creation sequence:
- 8-byte: ldind.u4 → CreateScalar → AsByte → ConvertToVector256Int64 → AsUInt64 → GreaterThan
- 4-byte: ldind.i8 → CreateScalar → AsByte → ConvertToVector256Int32 → AsUInt32 → GreaterThan
- 2-byte: Load → ConvertToVector256Int16 → AsUInt16 → GreaterThan
- 1-byte: Load → GreaterThan (direct comparison)

Performance (1M double elements):
- Previous (method call): 2.6 ms
- Inlined IL: 0.48 ms (5.4x faster)
- NumPy baseline: 1.86 ms (NumSharp is now 3.9x FASTER)

Fixed reflection lookups for AsByte/AsUInt*, which are extension methods on the Vector128/Vector256 static classes, not instance methods.
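The inlining described above can be sketched with ILGenerator. This is a minimal illustration (not NumSharp's actual code) of emitting the 8-byte mask sequence into a DynamicMethod; it assumes AVX2 is available at runtime, and the class/delegate names are hypothetical:

```csharp
using System;
using System.Linq;
using System.Reflection.Emit;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class InlineMaskSketch
{
    public unsafe delegate Vector256<ulong> MaskFn(byte* cond);

    public static MaskFn Build()
    {
        var dm = new DynamicMethod("IL_Mask8Byte", typeof(Vector256<ulong>),
                                   new[] { typeof(byte*) }, typeof(InlineMaskSketch).Module);
        var il = dm.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldind_U4);                                    // 4 condition bytes -> uint
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod(
            "CreateScalar", new[] { typeof(uint) })!);                // Vector128<uint>
        il.Emit(OpCodes.Call, typeof(Vector128).GetMethod("AsByte")!
            .MakeGenericMethod(typeof(uint)));                        // Vector128<byte>
        il.Emit(OpCodes.Call, typeof(Avx2).GetMethod(
            "ConvertToVector256Int64",
            new[] { typeof(Vector128<byte>) })!);                     // vpmovzxbq -> Vector256<long>
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethod("AsUInt64")!
            .MakeGenericMethod(typeof(long)));                        // Vector256<ulong>
        il.Emit(OpCodes.Call, typeof(Vector256<ulong>)
            .GetProperty("Zero")!.GetMethod!);                        // zero vector
        il.Emit(OpCodes.Call, typeof(Vector256).GetMethods()
            .First(m => m.Name == "GreaterThan" && m.IsGenericMethodDefinition)
            .MakeGenericMethod(typeof(ulong)));                       // all-ones lanes where cond != 0
        il.Emit(OpCodes.Ret);

        return (MaskFn)dm.CreateDelegate(typeof(MaskFn));
    }
}
```

Because the intrinsic calls are emitted as plain Call opcodes into the generated body, the JIT recognizes them and lowers them to the single hardware instructions, with no helper-method call boundary in between.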
Implements NumPy 2.x NEP50 "weak scalar" semantics for np.where, detecting scalar NDArrays via Shape.IsScalar for clean type promotion without requiring per-type overloads.

TYPE PROMOTION RULES:
1. Same-type scalars: preserve type
   - int + int → int32 (both same type, preserve)
   - byte + byte → byte
   - float + float → float32
2. Mixed-type scalars: use array-array promotion
   - int + long → int64
   - int + double → float64
   - byte + short → int16
3. NEP50 weak scalar: scalar + array → array dtype wins
   - int scalar + uint8 array → uint8
   - int scalar + float32 array → float32
4. Cross-kind promotion uses standard rules
   - float scalar + int32 array → float64

IMPLEMENTATION:
- Simplified to 4 overloads (NDArray, object+NDArray, NDArray+object, object+object)
- Detect scalar NDArrays via Shape.IsScalar (works for both implicit conversion and explicit np.array() calls)
- Input arrays converted to output dtype before kernel/iterator dispatch

NOTE: Unlike NumPy, where Python int literals widen to int64, C# int literals create int32 scalar NDArrays indistinguishable from explicit np.array(1, dtype=int32). We preserve same-type scalars rather than widening, which is consistent with C#'s typed literal semantics.
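The four rules above can be sketched as a small decision function. This is illustrative only: SameKind and FindCommonType are hypothetical placeholders for NumSharp's internal kind-check and array-array promotion helpers, and the real code operates on NPTypeCode rather than System.Type:

```csharp
using System;

static class PromotionSketch
{
    public static Type PromoteWhere(Type x, bool xIsScalar, Type y, bool yIsScalar,
                                    Func<Type, Type, bool> SameKind,
                                    Func<Type, Type, Type> FindCommonType)
    {
        if (x == y)
            return x;                                  // rule 1: same type, preserve

        if (xIsScalar ^ yIsScalar)                     // exactly one weak scalar
        {
            var (scalar, array) = xIsScalar ? (x, y) : (y, x);
            if (SameKind(scalar, array))
                return array;                          // rule 3: array dtype wins
            // rule 4: cross-kind falls through to standard promotion
        }

        return FindCommonType(x, y);                   // rules 2 and 4
    }
}
```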
Extended np.asanyarray to handle all common C# collection types.

Collections supported via IEnumerable<T> pattern matching:
- List<T>, IList<T>, ICollection<T>, IEnumerable<T>
- IReadOnlyList<T>, IReadOnlyCollection<T>
- ReadOnlyCollection<T>
- LinkedList<T>
- HashSet<T>, SortedSet<T>
- Queue<T>, Stack<T>
- ArraySegment<T> (implements IEnumerable<T>)
- ImmutableArray<T>, ImmutableList<T>, ImmutableHashSet<T>
- Any LINQ query result (IEnumerable<T>)

Special handling for types not implementing IEnumerable<T>:
- Memory<T> - uses direct cast and ToArray()
- ReadOnlyMemory<T> - uses direct cast and ToArray()

Implementation approach:
- Clean pattern matching on IEnumerable<T> for all 12 NumSharp types
- No method reflection (direct LINQ .ToArray() calls)
- Memory/ReadOnlyMemory handled via type switch with direct casts

Supported element types (NumSharp's 12 types): bool, byte, short, ushort, int, uint, long, ulong, char, float, double, decimal

Note: sbyte, IntPtr, UIntPtr are NOT supported (not in NPTypeCode)
Added fallback support for collections that don't implement generic IEnumerable<T> but still implement the non-generic interfaces.

Non-generic IEnumerable fallback:
- ArrayList, Hashtable.Keys/Values, BitArray, etc.
- Any legacy collection implementing only IEnumerable
- Element type detected from first non-null item

Non-generic IEnumerator fallback:
- Direct enumerator objects (e.g., from yield return methods)
- Element type detected from first non-null item

Implementation:
- Enumerate items into List<object>
- Detect element type from first item
- Convert to typed array via type switch (no reflection)
- Returns null for unsupported element types (falls through to error)

This completes the collection support hierarchy:
1. IEnumerable<T> - direct pattern matching (most efficient)
2. Memory<T>/ReadOnlyMemory<T> - special handling (no IEnumerable<T>)
3. IEnumerable (non-generic) - fallback with type detection
4. IEnumerator (non-generic) - fallback with type detection
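Levels 3-4 of the hierarchy above can be sketched roughly as follows. The helper name is illustrative and only two of the 12 dtypes are shown; the real type switch covers them all:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class NonGenericFallbackSketch
{
    public static Array? ConvertNonGeneric(IEnumerable source)
    {
        var items = new List<object?>();
        foreach (var item in source)
            items.Add(item);

        if (items.Count == 0)
            return Array.Empty<double>();          // NumPy float64 default for empty

        Type? elementType = items.Find(i => i != null)?.GetType();
        return elementType switch
        {
            var t when t == typeof(int)    => items.ConvertAll(Convert.ToInt32).ToArray(),
            var t when t == typeof(double) => items.ConvertAll(Convert.ToDouble).ToArray(),
            _ => null                              // unsupported: caller raises the error
        };
    }
}
```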
- ConvertMemory: single type switch with ternary for ReadOnly vs mutable
- ConvertNonGenericEnumerable: delegate to ConvertEnumerator via GetEnumerator()
…andling

NumPy parity fixes based on battletest comparison:

1. Tuple/ValueTuple support (NEW):
   - Both Tuple<> and ValueTuple<> now iterate their elements
   - Uses ITuple interface (available in .NET Core 2.0+)
   - NumPy: np.asanyarray((1,2,3)) -> dtype=int64, shape=(3,)
   - NumSharp now matches this behavior

2. Empty non-generic collections (FIX):
   - Empty ArrayList/IEnumerable now returns empty double[]
   - Matches NumPy's default of float64 for empty collections
   - Previously threw NotSupportedException

Tests added:
- ValueTuple_IsIterable, Tuple_IsIterable
- ValueTuple_MixedTypes_UsesFirstElementType
- EmptyTuple_ReturnsEmptyDoubleArray
- EmptyArrayList_ReturnsEmptyDoubleArray
- Misaligned tests documenting intentional NumPy differences
…ions

Adds proper type promotion when collections contain mixed numeric types:
- int + double -> double (matches NumPy float64 promotion)
- int + bool -> int
- float + any int -> double
- decimal wins if present

Implementation:
- Added FindCommonNumericType() to detect the widest compatible type
- Changed ConvertObjectListToNDArray to use Convert.To* methods instead of direct casts, enabling cross-type conversion
- Updated ConvertTuple and ConvertEnumerator to use type promotion

Tests added:
- ValueTuple_MixedTypes_PromotesToCommonType: (1, 2.5, 3) -> double
- ValueTuple_IntAndBool_PromotesToInt: (1, true, 3) -> int
Use pattern matching `is T v ? v : Convert.ToT(item)` instead of always calling Convert.ToT(). This gives direct unbox speed for homogeneous collections (the common case) while still handling mixed types correctly.

Benchmark results (100K iterations, size 1000):
- Convert.ToInt32 always: 4088 ns/op
- is int ? v : Convert: 1038 ns/op (3.9x faster)

This optimization affects:
- ArrayList and other non-generic IEnumerable
- Tuple/ValueTuple via ITuple interface
- Any path through ConvertObjectListToNDArray

No behavior change - mixed type collections still work via the Convert fallback.
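The fast path, shown here for the int case: a type test plus direct unbox when the box already holds an int, falling back to Convert for mixed-type elements (the helper name is illustrative):

```csharp
using System;

static class UnboxFastSketch
{
    public static int ToInt32Fast(object item)
        => item is int v ? v : Convert.ToInt32(item);
}
```

ToInt32Fast(42) unboxes directly; ToInt32Fast(true) takes the Convert path and yields 1, matching the bool-promotes-to-int behavior above.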
…mization

For IEnumerable<T>, use optimized extraction:
1. List<T>: CollectionsMarshal.AsSpan() + CopyTo (direct memory access)
2. ICollection<T>: CopyTo() (avoids enumerator overhead)
3. Other: fallback to LINQ ToArray()

Benchmark results (size 10000, List<int>):
- Old (ToArray): 14129 ns/op
- New (ToArrayFast): 11665 ns/op
- Speedup: 1.21x (21% faster)

The CollectionsMarshal.AsSpan approach gives direct access to List<T>'s internal array, avoiding the allocation and copy overhead of ToArray(). For ICollection<T>, CopyTo() is used, which avoids enumerator overhead.
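The tiered extraction can be sketched as below; ToArrayFast mirrors the name used in the commit, but the exact NumSharp signature may differ:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

static class ExtractionSketch
{
    public static T[] ToArrayFast<T>(IEnumerable<T> source)
    {
        switch (source)
        {
            case List<T> list:
                var fromList = new T[list.Count];
                CollectionsMarshal.AsSpan(list).CopyTo(fromList); // direct view of the backing array
                return fromList;
            case ICollection<T> coll:
                var fromColl = new T[coll.Count];
                coll.CopyTo(fromColl, 0);                         // bulk copy, no enumerator
                return fromColl;
            default:
                return source.ToArray();                          // LINQ fallback
        }
    }
}
```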
Replace `new T[n]` with `GC.AllocateUninitializedArray<T>(n)` in all array allocations within np.asanyarray. Since we immediately overwrite all elements, the default zeroing is wasted work.

Affected paths:
- ToArrayFast<T>: List<T> and ICollection<T> extraction
- ConvertObjectListToNDArray: all 12 dtype allocations

Benchmark (GC.AllocateUninitializedArray vs new T[]):
- Size 1000: 38 ns vs 156 ns (4x faster allocation)

This optimization compounds with the previous CollectionsMarshal and pattern-match optimizations for a significant cumulative improvement.
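A minimal sketch of the pattern (not the actual NumSharp code): GC.AllocateUninitializedArray skips zero-initialization, which is safe only because every element is written before any read.

```csharp
using System;

static class UninitAllocSketch
{
    public static int[] FillSequence(int n)
    {
        int[] buffer = GC.AllocateUninitializedArray<int>(n);
        for (int i = 0; i < n; i++)
            buffer[i] = i;      // must fully overwrite the uninitialized memory
        return buffer;
    }
}
```

If any element could be read before being written, `new int[n]` is the correct choice; stale heap contents would otherwise leak into results.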
Optimizations applied:

1. FindCommonNumericType:
   - Early exit when decimal found (highest priority)
   - Early exit when float/double found (promotes to double)
   - Use CollectionsMarshal.AsSpan for bounds-check-free iteration
   - Stackalloc for type code deduplication
   - Reuse existing _FindCommonType_Scalar for consistent promotion

2. ConvertObjectListToNDArray:
   - Use CollectionsMarshal.AsSpan(items) for ~10-15% speedup
   - Eliminates bounds checking in tight conversion loops

3. ConvertEnumerator:
   - Pre-size List<object> when the ICollection count is known
   - Eliminates resize allocations for known-size collections

4. ConvertTuple:
   - Pre-size List<object> with tuple.Length

Net: -27 lines while adding performance improvements
…to MSTest v3

Post-rebase cleanup after master migrated the test suite from TUnit to MSTest v3 (commits ac02033, e0db3c3). The 4 test files introduced on this branch still used TUnit's [Test] attribute and `using TUnit.Core;`, which broke the build.

Changes per file:
- Remove `using TUnit.Core;`
- Add the `[TestClass]` attribute to the test class
- Replace all `[Test]` attributes with `[TestMethod]`

Files migrated:
- test/NumSharp.UnitTest/Logic/np.where.Test.cs
- test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs
- test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs
- test/NumSharp.UnitTest/Creation/np.asanyarray.Tests.cs

Verified: all 112 np.where tests and 62 np.asanyarray tests pass on net8.0 and net10.0.
np.asanyarray(new object[]{1, 2.5, 3}) threw NotSupportedException because
`case Array array` matched object[] first and `new NDArray(object[])` rejects
object as an element type. object[] has no fixed dtype, so routing through the
non-generic IEnumerable path (which applies NumPy-like type promotion) is the
correct behavior.
Added an explicit `case object[] objArr` branch that delegates to
ConvertNonGenericEnumerable, which already handles:
- Homogeneous object[]: detected via FindCommonNumericType, single dtype
- Mixed object[]: promoted to common type (e.g. int + double -> double)
- Empty object[]: returns empty double[] (matches NumPy float64 default)
- Bool+int mix: promotes bool to int via Convert.ToInt32 (True=1, False=0)
Regression tests added in np.asanyarray.Tests.cs covering all four cases.
All 66 np.asanyarray tests pass on net8.0 and net10.0.
Code review caught dead code paths and over-narrative comments. Net change is -293/+19 across three files.

ILKernelGenerator.Where.cs (-249 lines):
- Delete `GetNPTypeCode<T>` (use shared InfoOf<T>.NPTypeCode instead).
- Delete `GetMaskCreationMethod256/128` and the entire 200-line `Static Mask Creation Methods (fallback)` region (CreateMaskV256_*Byte and CreateMaskV128_*Byte). They were never called -- the inline IL emitter at EmitInlineMaskCreationV256/V128 handles the mask creation directly via the cached MethodInfo lookups. The static helpers existed as an early prototype fallback path that became unreachable.
- Delete the `_v256ZeroULong` field with the meaningless `IsStatic ? null! : null!` tautology (only `_v256GetZeroULong` is used).

np.where.cs (+2 lines):
- Add `default: throw NotSupportedException(...)` to the `WhereKernelDispatch` switch. The kernel path is currently only reached for the 12 supported NPTypeCodes, but the missing default would silently fall through and return an uninitialized result if a new NPTypeCode were ever added without updating this switch. The iterator-path switch (line 142) already has this guard.

np.asanyarray.cs (-43/+18 net):
- Cap `stackalloc NPTypeCode[span.Length]` at 12 (the max possible unique NPTypeCodes given the seenMask deduplication). The previous unbounded stackalloc could blow the stack for very large user lists.
- Remove the dead `hasDecimal` variable (set but never read; the early-exit for decimal returns immediately on first hit).
- Trim narrative/microbenchmark comments per CLAUDE.md guidance: removed "Optimized: ...3-7x faster", "optimization #4", "~4x faster than always using Convert", "Pre-sized list (optimization: ...)", and a handful of WHAT-the-code-does comments that restated obvious switch arms.
- Tighten Tuple/Enumerator helpers (collapse trivial if/else into ternary).

Verified: 178 np.where + np.asanyarray tests still pass on net8.0 + net10.0.
…-circuits
Second-round code review caught one real bug and several minor efficiency issues.
1. Fix: FindCommonNumericType promoted pure-float object[] to double
np.asanyarray(new object[]{1.5f, 2.5f}) returned Double instead of Single.
Root cause: the early-exit `if (hasDouble || hasFloat) return typeof(double)`
fired before the `uniqueCount == 1` check that preserves the original dtype.
Removing the hasFloat arm lets the general path handle it:
- Pure float32 -> uniqueCount == 1 -> returns firstType (Single) -- matches NumPy
- int + float32 -> _FindCommonType_Scalar -> returns Double -- matches NumPy NEP50
- Pure float64 -> unchanged (still Double)
- decimal-wins-everything early exit preserved.
Two regression tests added:
- ObjectArray_AllFloat_PreservesSingle
- ObjectArray_MixedIntAndFloat32_PromotesToDouble
2. Perf: skip type promotion in np.where when x.dtype == y.dtype
Previously _FindCommonType(x, y) always ran, even when both operands shared a
dtype. Short-circuit to x.GetTypeCode in that case, saving one dict lookup +
two astype traversals per call. The NEP50 lookup still runs when dtypes
differ, preserving scalar+array promotion semantics.
3. Perf: skip broadcast_arrays when all three shapes already match
broadcast_arrays allocates three fresh NDArrays plus helper Shape[]. For the
common case of np.where(mask, arr, other_arr) where all three arrays share a
shape, this is wasted. Skip it when condition.Shape == x.Shape == y.Shape
(Shape == compares by dimensions).
4. Perf: cache Vector256/Vector128 generic MethodInfo
EmitWhereV256BodyWithOffset and EmitWhereV128BodyWithOffset did
Array.Find(typeof(Vector*).GetMethods(), ...) three times per call, each
scanning ~100 methods. Per kernel generation (4-way unrolled + 1 remainder
call = 5 calls), that was 15 reflection scans per T, or ~180 on first use
across all 12 dtypes. Cached as six static readonly fields; only
MakeGenericMethod(typeof(T)) runs per call.
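The caching in point 4 can be sketched as below: resolve the generic method definition once into a static readonly field, so only MakeGenericMethod runs per kernel generation. The class and field names are illustrative, not NumSharp's actual ones:

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

static class VectorReflectionCacheSketch
{
    // Scans typeof(Vector256)'s ~100 methods exactly once, at type initialization.
    static readonly MethodInfo V256LoadDef = Array.Find(
        typeof(Vector256).GetMethods(),
        m => m.Name == "Load" && m.IsGenericMethodDefinition
             && m.GetParameters().Length == 1)!;

    // Per-call cost is just closing the definition over the element type.
    public static MethodInfo V256Load(Type elementType)
        => V256LoadDef.MakeGenericMethod(elementType);
}
```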
5. Polish: doc + error message
- where(NDArray) xmldoc was copy-pasted from the 3-arg overload ("Return
elements chosen from x or y"); rewritten to describe nonzero semantics.
- object[] NotSupportedException now names the actual problem ("element type
is not a supported NumSharp dtype") instead of just reporting the length.
Verified: 180 np.where + np.asanyarray tests pass on net8.0 + net10.0.
np.where's IL kernel had ~35 MethodInfo fields scattered across the file using
`Array.Find(...)!` null-forgiveness, which throws NullReferenceException at first
use if a framework method ever gets renamed/removed. The existing CachedMethods
nested class in ILKernelGenerator.cs follows a fail-fast `?? throw new
MissingMethodException(type, name)` pattern, keyed per MethodInfo, and is the
project convention for every other kernel partial.
Changes:
- Make `CachedMethods` a `partial` nested class so Where-specific reflection can
live alongside the kernel file it serves. (ILKernelGenerator.cs: 1 line.)
- Delete the 35 `_v128*/_v256*/_avx2*/_sse41*` private fields from
ILKernelGenerator.Where.cs and move them into a new "Where Kernel Methods"
region inside a partial `CachedMethods` declaration at the bottom of that
file. Renamed to PascalCase (e.g. _v256LoadByte -> V256LoadByte) to match the
existing CachedMethods naming convention.
- Introduce three small helpers inside CachedMethods:
- FindGenericMethod(Type, string name, int? paramCount) - wraps the
`Array.Find(GetMethods(), m => m.IsGenericMethodDefinition && ...)` pattern
with a MissingMethodException fail-fast throw. Handles the overload count
disambiguation for Load/Store.
- FindMethodExact(Type, string name, Type[] argTypes) - wraps GetMethod with a
fail-fast throw. Used for Avx2/Sse41 specific overloads.
- GetZeroGetter(Type vectorOfT) - wraps Property("Zero").GetMethod with a
fail-fast throw. Used for the 8 Vector*<T>.Zero getters.
- Update all 41 call sites in EmitInlineMaskCreationV256/V128 and
EmitWhereV256/V128BodyWithOffset to use CachedMethods.Xxx.
Behaviour unchanged; 180 np.where + np.asanyarray tests still pass on net8.0 +
net10.0. The single real benefit is earlier and clearer failure if any of the
~35 framework API names change in a future .NET release.
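The fail-fast helpers can be sketched as below; the signatures follow the description above, though the real CachedMethods handles more overload disambiguation:

```csharp
using System;
using System.Reflection;

static class FailFastReflectionSketch
{
    public static MethodInfo FindGenericMethod(Type type, string name, int? paramCount = null)
        => Array.Find(type.GetMethods(),
               m => m.IsGenericMethodDefinition && m.Name == name
                    && (paramCount is null || m.GetParameters().Length == paramCount))
           ?? throw new MissingMethodException(type.FullName, name);

    public static MethodInfo FindMethodExact(Type type, string name, Type[] argTypes)
        => type.GetMethod(name, argTypes)
           ?? throw new MissingMethodException(type.FullName, name);
}
```

The `?? throw new MissingMethodException(...)` replaces the `Array.Find(...)!` null-forgiveness, so a renamed framework method fails at type initialization with the missing name, rather than with a NullReferenceException at first kernel use.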
…tibility
CI failure on macos-latest (ARM64/Apple Silicon) reported 31 np.where tests
throwing PlatformNotSupportedException at runtime:
PlatformNotSupportedException: Operation is not supported on this platform.
at System.Runtime.Intrinsics.X86.Sse41.ConvertToVector128Int64(Vector128`1 value)
at IL_Where_Int64(...)
Root cause: the SIMD-emit path was gated only on `VectorBits >= 128`. On ARM64,
`Vector128.IsHardwareAccelerated` is true (maps to Neon), so VectorBits is 128,
and the kernel emits calls to Sse41/Avx2 byte-lane expansion intrinsics which
are x86-only.
Breakdown of the byte-mask expansion path by element size:
- 1-byte (byte): portable Vector*.Load/GreaterThan — safe on any SIMD platform
- 2-byte: Sse41.ConvertToVector128Int16 / Avx2.ConvertToVector256Int16
- 4-byte: Sse41.ConvertToVector128Int32 / Avx2.ConvertToVector256Int32
- 8-byte: Sse41.ConvertToVector128Int64 / Avx2.ConvertToVector256Int64
Fix: in GenerateWhereKernelIL, compute `useV256`/`useV128` with an additional
Sse41.IsSupported / Avx2.IsSupported guard — but only when elementSize > 1,
since the 1-byte path is portable. If neither x86 intrinsic set is available
for the required lane size, skip SIMD emission entirely; the scalar IL loop
that follows handles correctness.
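The gating logic in the fix can be sketched as follows; the variable names follow the description, and the real GenerateWhereKernelIL has more surrounding context:

```csharp
using System.Runtime.Intrinsics.X86;

static class SimdGateSketch
{
    public static (bool useV256, bool useV128) ChooseSimdPath(int elementSize, int vectorBits)
    {
        // vpmovzxb* byte-lane expansion is x86-only; the 1-byte path is portable SIMD.
        bool expandOk256 = elementSize == 1 || Avx2.IsSupported;
        bool expandOk128 = elementSize == 1 || Sse41.IsSupported;

        bool useV256 = vectorBits >= 256 && expandOk256;
        bool useV128 = !useV256 && vectorBits >= 128 && expandOk128;
        return (useV256, useV128);   // (false, false) => scalar IL loop only
    }
}
```

On ARM64/Neon, vectorBits is 128 but Sse41.IsSupported is false, so every element size except 1 returns (false, false) and the kernel emits only the scalar loop, matching the result described below.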
Also passes the useV256 decision to EmitWhereSIMDLoop explicitly instead of
recomputing it from VectorBits inside the loop, which was both duplicative and
ignored the IsSupported guard.
Result: on ARM64, byte-typed arrays still use Neon-backed SIMD; int/long/float/
double/short fall back to the scalar IL kernel. On x86 nothing changes.
Verified: 180 np.where + np.asanyarray tests pass on Windows x64 (net8.0 +
net10.0). ARM path awaits CI verification.
Summary
np.where(condition, x, y): uses DynamicMethod to generate type-specific kernels at runtime

Implementation
- WhereKernel<T>
- GetWhereKernel<T>()
- WhereExecute<T>()

Eligibility for SIMD Path
Falls back to iterator path for:
Test Plan
Closes #604