ENH: Add SIMD sin/cos implementation with numpy-simd-routines by seiko2plus · Pull Request #29699 · numpy/numpy

seiko2plus · 2025-09-06T15:16:21Z

numpy-simd-routines is added as a subrepo in the meson subprojects
directory. The current FP configuration is static: ~1 ULP for double precision
and ~4 ULP for single precision, with floating-point error handling,
special-case handling, and extended precision for large arguments.
Subnormals are enabled by default too.
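To make the ULP figures concrete, here is a minimal sketch of how a ULP distance between two float32 values can be computed with the standard library only. The helper names (`f32_bits_ordered`, `ulp_distance_f32`, `max_ulp_error_f32`) are hypothetical and are not part of NumPy or numpy-simd-routines; the test suite in this PR uses `assert_array_max_ulp` instead.

```python
import math
import struct


def round_to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_bits_ordered(x: float) -> int:
    """Map a float32 value to an integer so that adjacent floats differ by 1."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))
    # Negative floats are stored as sign + magnitude; mirror them so the
    # mapping is monotonic and +0.0 / -0.0 both map to 0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def ulp_distance_f32(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b."""
    return abs(f32_bits_ordered(a) - f32_bits_ordered(b))


def max_ulp_error_f32(candidate, reference, inputs) -> int:
    """Largest ULP distance between candidate(x) and reference(x), both
    rounded to float32, over the given inputs."""
    return max(
        ulp_distance_f32(round_to_f32(candidate(x)), round_to_f32(reference(x)))
        for x in inputs
    )


if __name__ == "__main__":
    # A candidate float32 sine would be compared against a higher-precision
    # reference; here math.sin is used for both just to exercise the helper.
    print(max_ulp_error_f32(math.sin, math.sin, [0.1, 1.0, 2.5]))
```

Under this measure, a "~4 ULP" budget means the result may sit up to four representable float32 values away from the correctly rounded reference.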

numpy-simd-routines supports all SIMD extensions supported by
Google Highway, including non-FMA extensions, and is fully independent
of libm, which guarantees uniform results across all compilers and
platforms.

Note: previously, SIMD optimization for sin/cos was enabled
only for single precision, not for double precision.

Benchmark

X86 Platform

The following benchmarks were run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).

Environment

> uname -a
Linux seiko-pc 6.12.60 #1-NixOS SMP PREEMPT_DYNAMIC Mon Dec  1 10:43:41 UTC 2025 x86_64 GNU/Linux
> gcc --version
gcc (GCC) 14.3.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> python --version
Python 3.12.12
> lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 7 7700X 8-Core Processor
    CPU family:              25
    Model:                   97
    Thread(s) per core:      2
    Core(s) per socket:      8
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      72%
    CPU max MHz:             5573.0000
    CPU min MHz:             545.0000
    BogoMIPS:                8982.62
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl 
                             xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_leg
                             acy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp
                              ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
                              xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_cl
                             ean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpo
                             pcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze

x86-64-v4

export NPY_DISABLE_CPU_FEATURES=""
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"

Change	Before [`b70fc77`] <brings_npsr~2>	After [`c5fc842`] <brings_npsr>	Ratio	Benchmark (Parameter)
-	689±6μs	640±30μs	0.93	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	683±0.9μs	602±2μs	0.88	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
-	682±1μs	603±1μs	0.88	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
-	695±0.7μs	604±1μs	0.87	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-	354±0.7μs	231±0.4μs	0.65	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
-	363±0.2μs	234±4μs	0.64	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
-	1.04±0.02ms	632±20μs	0.61	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	1.02±0ms	612±10μs	0.6	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
-	1.32±0.01ms	795±2μs	0.6	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
-	1.72±0.03ms	806±3μs	0.47	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
-	1.44±0.02ms	603±0.3μs	0.42	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-	5.05±0.1ms	2.07±0.01ms	0.41	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-	4.98±0.01ms	1.88±0.03ms	0.38	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-	5.95±0.02ms	2.26±0.07ms	0.38	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-	2.18±0.05ms	792±1μs	0.36	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
-	1.78±0.03ms	608±4μs	0.34	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	5.65±0.02ms	1.87±0.2ms	0.33	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-	5.65±0.01ms	1.87±0.01ms	0.33	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-	5.77±0.01ms	1.90±0.01ms	0.33	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-	6.00±0.04ms	1.98±0.02ms	0.33	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	1.36±0.01ms	443±2μs	0.32	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
-	5.75±0ms	1.68±0.04ms	0.29	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	4.97±0ms	1.46±0.04ms	0.29	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-	2.42±0.2ms	609±2μs	0.25	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	5.62±0.01ms	1.25±0.02ms	0.22	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-	5.72±0ms	1.25±0.01ms	0.22	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	4.97±0ms	1.08±0ms	0.22	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-	7.08±0.01ms	1.58±0.02ms	0.22	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	7.07±0ms	1.20±0.01ms	0.17	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-	1.49±0ms	233±10μs	0.16	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
-	5.61±0ms	828±0.4μs	0.15	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-	5.73±0ms	859±8μs	0.15	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
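For readers less familiar with asv output: the Ratio column is the After time divided by the Before time, so values below 1 are improvements. The following is a small sketch of that arithmetic; `parse_time`, `ratio`, and `speedup` are hypothetical helpers written for illustration and are not part of asv or spin.

```python
import re

# Unit suffixes as they appear in asv tables, mapped to seconds.
# Both the Greek mu (U+03BC) and the micro sign (U+00B5) are accepted.
_UNITS = {
    "ns": 1e-9,
    "\u03bcs": 1e-6,
    "\u00b5s": 1e-6,
    "us": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
}

_CELL = re.compile(r"([0-9.]+)(?:±[0-9.]+)?(ns|[\u03bc\u00b5u]s|ms|s)")


def parse_time(cell: str) -> float:
    """Convert a cell such as '5.75±0ms' to seconds, ignoring the ± spread."""
    match = _CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def ratio(before: str, after: str) -> float:
    """After / Before, as reported in the Ratio column."""
    return parse_time(after) / parse_time(before)


def speedup(before: str, after: str) -> float:
    """Before / After, i.e. how many times faster the new code is."""
    return parse_time(before) / parse_time(after)


if __name__ == "__main__":
    # Contiguous float64 sin row from the table above.
    print(round(ratio("5.75±0ms", "1.68±0.04ms"), 2))    # 0.29
    print(round(speedup("5.75±0ms", "1.68±0.04ms"), 1))  # 3.4
```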

x86-64-v3

export NPY_DISABLE_CPU_FEATURES="X86_V4"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"

Change	Before [`b70fc77`] <brings_npsr~2>	After [`c5fc842`] <brings_npsr>	Ratio	Benchmark (Parameter)
+	844±1μs	939±50μs	1.11	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
-	4.98±0ms	4.57±0.1ms	0.92	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-	4.98±0ms	4.20±0.01ms	0.84	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-	6.03±0.06ms	5.00±0.06ms	0.83	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-	6.06±0.06ms	4.78±0.04ms	0.79	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	1.37±0.01ms	913±40μs	0.67	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
-	1.37±0.01ms	901±10μs	0.66	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	1.45±0ms	941±2μs	0.65	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
-	830±0.3μs	511±2μs	0.62	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
-	7.07±0.01ms	4.35±0.1ms	0.61	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	844±2μs	496±5μs	0.59	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
-	1.46±0.01ms	864±9μs	0.59	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	3.21±0.03ms	1.83±0ms	0.57	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
-	7.08±0.01ms	3.98±0.06ms	0.56	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-	3.63±0.09ms	1.84±0.01ms	0.51	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
-	5.65±0.01ms	2.82±0.1ms	0.5	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-	5.64±0ms	2.74±0.3ms	0.49	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-	5.75±0ms	2.76±0.1ms	0.48	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	3.00±0.09ms	1.44±0ms	0.48	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
-	3.78±0.1ms	1.82±0.02ms	0.48	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
-	5.77±0.01ms	2.68±0.02ms	0.46	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-	5.73±0ms	2.11±0.03ms	0.37	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	5.62±0ms	2.03±0.01ms	0.36	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-	5.62±0.02ms	1.67±0ms	0.3	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-	5.72±0ms	1.73±0ms	0.3	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-	3.23±0.02ms	866±10μs	0.27	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-	3.48±0.1ms	856±3μs	0.25	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	3.68±0.06ms	851±2μs	0.23	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	3.03±0.08ms	497±4μs	0.16	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')

x86-64-v2

export NPY_DISABLE_CPU_FEATURES="X86_V4 X86_V3"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"

Change	Before [`b70fc77`] <brings_npsr~2>	After [`c5fc842`] <brings_npsr>	Ratio	Benchmark (Parameter)
+	4.98±0ms	7.04±0.1ms	1.41	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
+	2.12±0ms	2.88±0.05ms	1.36	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
+	4.97±0.01ms	6.68±0.02ms	1.34	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
+	2.12±0ms	2.84±0.01ms	1.34	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
+	2.13±0ms	2.82±0ms	1.32	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
+	4.98±0.01ms	6.19±0.05ms	1.24	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
+	4.97±0ms	5.77±0.05ms	1.16	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
+	5.96±0.03ms	6.90±0.2ms	1.16	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
+	2.13±0ms	2.45±0ms	1.15	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
+	5.95±0.06ms	6.75±0.03ms	1.13	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	7.07±0.01ms	6.31±0.02ms	0.89	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	7.08±0ms	5.86±0.03ms	0.83	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-	5.64±0.01ms	4.47±0.3ms	0.79	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-	5.76±0.01ms	4.15±0.1ms	0.72	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-	5.64±0.02ms	3.94±0.06ms	0.7	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-	5.76±0.01ms	4.01±0.1ms	0.7	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-	2.01±0.01ms	1.36±0.02ms	0.67	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	2.02±0.02ms	1.35±0.03ms	0.67	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	2.02±0ms	1.30±0ms	0.64	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-	5.62±0.01ms	3.54±0.08ms	0.63	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-	5.73±0.01ms	3.58±0.06ms	0.62	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-	5.62±0.01ms	3.11±0.07ms	0.55	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-	5.72±0.01ms	3.12±0.05ms	0.54	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-	2.01±0ms	926±3μs	0.46	bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
-	3.29±0ms	1.45±0.1ms	0.44	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
-	3.31±0ms	1.41±0.04ms	0.42	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
-	3.31±0ms	1.36±0.01ms	0.41	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
-	3.28±0ms	1.30±0ms	0.4	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-	3.29±0ms	1.32±0.03ms	0.4	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
-	3.30±0ms	1.32±0.03ms	0.4	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
-	3.30±0ms	947±4μs	0.29	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
-	3.28±0ms	925±2μs	0.28	bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')

Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so

stripped	size (bytes)
before	9,895,936
after	10,014,720
diff	+118,784
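As a quick sanity check of the figures above, the size difference can be put in relative terms. This is only a sketch of the arithmetic; `size_delta` is a hypothetical helper.

```python
def size_delta(before: int, after: int) -> tuple[int, float, float]:
    """Return (difference in bytes, difference in KiB, percent growth)."""
    diff = after - before
    return diff, diff / 1024, 100.0 * diff / before


if __name__ == "__main__":
    diff, kib, percent = size_delta(9_895_936, 10_014_720)
    # 118784 bytes is exactly 116 KiB, roughly a 1.2% increase.
    print(diff, kib, round(percent, 2))
```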

seiko2plus · 2025-12-09T05:54:43Z

I have updated the benchmarks to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment, I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases the binary by 118,784 B (stripped) on x86/gcc.

cc: @charris, @rgommers, @mattip, @seberg, @r-devulap, @Mousius

seiko2plus · 2025-12-09T06:08:05Z

Ah, regarding the reverted #23399, as discussed in the thread/C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/: the default precision for f64 is set to 1 ULP, including for large arguments, with the option of adding a build-time setting to change this behavior later:

numpy/numpy/_core/src/common/simd/simd.hpp

Lines 79 to 100 in 67d0274

    
    using PreciseHigh = decltype(npsr::Precise{});
    using PreciseLow = decltype(npsr::Precise{npsr::kLowAccuracy});
    struct PresiceDummy {};
    template <typename T>
    struct PreciseByType {};
    template <>
    struct PreciseByType<float> { using Type = PreciseLow; };
    template <>
    struct PreciseByType<double> {
    #if NPY_HWY_F64
        using Type = PreciseHigh;
    #else
        // If float64 SIMD isn’t available, use a dummy type.
        // The scalar path will run, but `Type` must still be defined.
        // The dummy is never passed; it only satisfies interfaces.
        // This also avoids spurious FP exceptions during RAII.
        using Type = PresiceDummy;
    #endif
    };
    } // namespace detail
    template <typename T>
    using Precise = typename detail::PreciseByType<T>::Type;

cc: @mhvk

rgommers · 2025-12-09T07:31:49Z

At the moment, I only have access to bare-metal x86 hardware; Perhaps I should try cloud services—e.g., AWS—for testing on other architectures.

No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.

The ufunc inner implementation is optimized for size as much as possible and only increases +118,784B (stripped) on x86/gcc.

That's impressively small, glad to see that.

seberg · 2025-12-09T08:43:46Z

@seiko2plus thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)

mhvk

@seiko2plus - looks very impressive! A few small comments inline, with that about the iterator perhaps the most important (if it is possible; can be for follow-up). I also made a small comment over at your new npsr routines.

mhvk · 2025-12-09T16:25:02Z

+    // Note: the intrinsics of NumPy SIMD routines are inlined by default.
+    if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
+                     sin != sizeof(T) || sout != sizeof(T))) {
+        // this for non-contiguous or overlapping case


I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since if so we do not have to do this inside the loop (with similar duplicated code presumably elsewhere).

This is already the case I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?

EDIT: Sorry, to be clear: I think that is the case, but I am not 100% sure.

I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and see if anything breaks? Logically, if we are passing the iterator a flag that we require contiguous input and output, then we should never get anything else...

Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more often, I think that makes sense. The buffer machinery will have a lot of overhead unfortunately, but overall it is likely still better.

For overlap, I believe the rule should be (or am I missing something?!):

If memory overlap exists it must be exactly the same memory (i.e. args[0] == args[1] and steps identical).

Except for .accumulate, which, you are right, doesn't apply here.
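The overlap rule stated above can be sketched as a small predicate: input and output are acceptable if they are either disjoint or exactly the same memory walked with the same step. This is an illustration in Python with hypothetical names (`byte_extent`, `overlap_is_safe`); it is not the `is_mem_overlap` helper used in the diff, and it is deliberately conservative for interleaved strided buffers.

```python
def byte_extent(start: int, step: int, count: int, itemsize: int) -> tuple[int, int]:
    """Half-open byte range [lo, hi) touched by `count` items of `itemsize`
    bytes, starting at address `start` and advancing by `step` bytes."""
    if count == 0:
        return start, start
    last = start + step * (count - 1)
    return min(start, last), max(start, last) + itemsize


def overlap_is_safe(in_ptr: int, in_step: int, out_ptr: int, out_step: int,
                    count: int, itemsize: int) -> bool:
    """True if the two buffers are disjoint, or alias exactly (in-place)."""
    # Exact aliasing: same start address and same step, e.g. np.sin(x, out=x).
    if in_ptr == out_ptr and in_step == out_step:
        return True
    in_lo, in_hi = byte_extent(in_ptr, in_step, count, itemsize)
    out_lo, out_hi = byte_extent(out_ptr, out_step, count, itemsize)
    # Conservative: any intersection of the extents counts as overlap,
    # even if the individual elements happen to interleave without touching.
    return in_hi <= out_lo or out_hi <= in_lo


if __name__ == "__main__":
    print(overlap_is_safe(0, 8, 0, 8, 10, 8))   # in-place: True
    print(overlap_is_safe(0, 8, 80, 8, 10, 8))  # adjacent, disjoint: True
    print(overlap_is_safe(0, 8, 8, 8, 10, 8))   # shifted by one item: False
```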

Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering to a contiguous and aligned chunk to the iterator. I would think the overhead cannot be that much worse than it is here, since one would already be inside the iterator anyway typically for non-contiguous data. And it would certainly be nice to keep the underlying loops simple...

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

@seberg - you mentioned there is no flag, but is one needed? If sin/cos are defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment only treated as something that will be used, if present, when the data are contiguous?

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.

@mhvk no, not yet; the contiguous loop is an optimization only. Only with that flag or a "non-contig to contig wrapper" could it be more. Dunno what is better, centralizing or not; I mostly like it because it removes unnecessary repetition in both code and binary, whether it is slower or faster.
(But yeah, we shouldn't worry about that here.)

So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.

mhvk · 2025-12-09T16:37:21Z

+/// npsr is tag free by design so we only include it within main namespace (np::simd)
+namespace sr = npsr::HWY_NAMESPACE;
+
+/// Default precision configrations for NumPy SIMD Routines


Might be good here to tell what low and high imply (or at least where to look it up)

mhvk · 2025-12-09T16:38:28Z

-        assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
+        assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
+        assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)



It may be good to add an explicit test that cos(0)==1 for all float types, as it is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
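A test along the lines suggested here might look like the sketch below. The function name, the choice of dtypes, and the array length are assumptions made for illustration; this is not the test that was added to the PR.

```python
import numpy as np


def check_cos_of_zero(dtypes=(np.float16, np.float32, np.float64), size=64):
    """Check that cos(+0.0) and cos(-0.0) are exactly 1 for each dtype."""
    for dtype in dtypes:
        for zero in (0.0, -0.0):
            # Use an array long enough that a vectorized loop, if present,
            # is likely to be exercised rather than only a scalar tail.
            x = np.full(size, zero, dtype=dtype)
            result = np.cos(x)
            assert result.dtype == dtype
            assert np.all(result == 1), (dtype, zero)
            # Strided input and in-place output take different code paths.
            assert np.all(np.cos(x[::2]) == 1), (dtype, zero)
            assert np.all(np.cos(x, out=x) == 1), (dtype, zero)
    return True


if __name__ == "__main__":
    print(check_cos_of_zero())
```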

Mousius · 2026-02-04T13:49:10Z

Generally seems better overall, testing on some metal AWS instances:

c8g.metal

| Change   | Before [4900421c] <main>   | After [2f990392]    |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|---------------------|---------|-----------------------------------------------------------------------------------|
| +        | 3.89±0.02ms                | 5.61±0.02ms         |    1.44 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')           |
| +        | 3.91±0ms                   | 5.56±0.02ms         |    1.42 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')           |
| +        | 3.86±0ms                   | 5.29±0.03ms         |    1.37 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')           |
| +        | 3.88±0.01ms                | 5.24±0.01ms         |    1.35 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')           |
| +        | 3.85±0.01ms                | 5.00±0.01ms         |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')           |
| +        | 3.90±0ms                   | 4.96±0.01ms         |    1.27 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')           |
| +        | 385±10μs                   | 430±8μs             |    1.12 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'e')        |
| +        | 391±5μs                    | 431±10μs            |    1.1  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'e')        |
| +        | 389±7μs                    | 429±4μs             |    1.1  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'e') |
| +        | 392±7μs                    | 427±10μs            |    1.09 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'e')        |
| +        | 384±10μs                   | 420±10μs            |    1.09 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'e') |
| +        | 2.13±0.03ms                | 2.32±0.04ms         |    1.09 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 2, 'd')          |
| +        | 383±6μs                    | 412±8μs             |    1.08 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'e') |
| +        | 375±6μs                    | 400±7μs             |    1.07 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'e')        |
| +        | 1.19±0.02ms                | 1.26±0.03ms         |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 2, 'd')           |
| +        | 2.07±0ms                   | 2.18±0.02ms         |    1.05 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 4, 2, 'f')           |
| +        | 421±5μs                    | 443±5μs             |    1.05 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 1, 'e')      |
| -        | 5.95±0.01ms                | 5.58±0.02ms         |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')                  |
| -        | 50.1±0.4μs                 | 47.2±0.8μs          |    0.94 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'absolute'>, 1, 1, 'e')      |
| -        | 2.92±0.01ms                | 2.72±0.02ms         |    0.93 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'f')           |
| -        | 2.38±0ms                   | 2.17±0ms            |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')                  |
| -        | 2.37±0.01ms                | 2.15±0ms            |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')                  |
| -        | 2.37±0ms                   | 2.14±0ms            |    0.9  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')                  |
| -        | 2.34±0ms                   | 2.12±0ms            |    0.9  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')                  |
| -        | 5.87±0.01ms                | 5.24±0.01ms         |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')                  |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'e')              |
| -        | 1.61±0ms                   | 1.44±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'e')              |
| -        | 1.61±0ms                   | 1.44±0ms            |    0.89 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'e')              |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'e')       |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'e')       |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'e')       |
| -        | 1.61±0ms                   | 1.44±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'e')       |
| -        | 2.40±0.01ms                | 2.13±0.06ms         |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 4, 1, 'd')           |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'e')       |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'e')       |
| -        | 1.61±0ms                   | 1.43±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'e')       |
| -        | 1.61±0ms                   | 1.44±0ms            |    0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'e')       |
| -        | 5.86±0.04ms                | 4.96±0.01ms         |    0.85 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')                  |
| -        | 7.08±0.03ms                | 5.61±0.02ms         |    0.79 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')                  |
| -        | 2.77±0ms                   | 2.15±0ms            |    0.78 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')                  |
| -        | 2.78±0ms                   | 2.13±0ms            |    0.77 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')                  |
| -        | 7.04±0.02ms                | 5.28±0.01ms         |    0.75 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')                  |
| -        | 7.01±0.02ms                | 4.99±0ms            |    0.71 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')                  |
| -        | 3.04±0ms                   | 2.15±0ms            |    0.71 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')           |
| -        | 3.00±0ms                   | 2.12±0ms            |    0.71 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')           |
| -        | 3.09±0ms                   | 2.17±0ms            |    0.7  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')           |
| -        | 3.05±0ms                   | 2.14±0ms            |    0.7  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')           |
| -        | 5.84±0.01ms                | 3.96±0ms            |    0.68 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')                  |
| -        | 3.49±0ms                   | 2.13±0.01ms         |    0.61 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')           |
| -        | 3.54±0ms                   | 2.14±0ms            |    0.6  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')           |
| -        | 1.95±0ms                   | 1.16±0ms            |    0.59 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')                  |
| -        | 6.99±0.02ms                | 3.99±0ms            |    0.57 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')                  |
| -        | 2.11±0ms                   | 1.17±0ms            |    0.55 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')                  |
| -        | 17.6±2ms                   | 9.11±1ms            |    0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')                 |
| -        | 2.67±0ms                   | 1.17±0ms            |    0.44 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')           |
| -        | 2.63±0ms                   | 1.16±0ms            |    0.44 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')           |

c7g.metal

| Change   | Before                     | After               |   Ratio | Benchmark (Parameter)                                                   |
|----------|----------------------------|---------------------|---------|-------------------------------------------------------------------------|
| +        | 4.27±0ms                   | 5.98±0.02ms         |    1.4  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| +        | 4.59±0.01ms                | 6.05±0.02ms         |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| +        | 4.26±0.01ms                | 5.27±0.01ms         |    1.24 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| +        | 4.30±0ms                   | 5.02±0.02ms         |    1.17 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| +        | 4.58±0ms                   | 5.36±0.01ms         |    1.17 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| +        | 4.60±0ms                   | 5.08±0ms            |    1.1  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| -        | 4.30±0ms                   | 4.08±0ms            |    0.95 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| -        | 4.60±0.01ms                | 4.17±0ms            |    0.91 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| -        | 7.28±0.01ms                | 5.95±0.01ms         |    0.82 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')        |
| -        | 7.21±0.02ms                | 5.27±0.01ms         |    0.73 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')        |
| -        | 7.15±0.04ms                | 4.98±0ms            |    0.7  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')        |
| -        | 8.80±0.03ms                | 6.03±0.01ms         |    0.69 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')        |
| -        | 8.62±0.03ms                | 5.36±0.01ms         |    0.62 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')        |
| -        | 2.97±0ms                   | 1.83±0ms            |    0.62 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')        |
| -        | 8.62±0.02ms                | 5.09±0.01ms         |    0.59 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')        |
| -        | 3.12±0.01ms                | 1.82±0ms            |    0.58 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')        |
| -        | 7.15±0.03ms                | 4.08±0ms            |    0.57 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')        |
| -        | 3.20±0ms                   | 1.83±0ms            |    0.57 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')        |
| -        | 3.20±0.01ms                | 1.84±0ms            |    0.57 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')        |
| -        | 3.76±0ms                   | 1.85±0ms            |    0.49 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')        |
| -        | 8.59±0.09ms                | 4.18±0ms            |    0.49 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')        |
| -        | 3.74±0.01ms                | 1.84±0ms            |    0.49 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')        |
| -        | 3.80±0ms                   | 1.82±0ms            |    0.48 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| -        | 3.84±0ms                   | 1.83±0ms            |    0.48 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| -        | 3.90±0ms                   | 1.84±0ms            |    0.47 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| -        | 3.96±0ms                   | 1.85±0.01ms         |    0.47 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| -        | 4.39±0ms                   | 1.84±0.01ms         |    0.42 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| -        | 2.55±0ms                   | 1.04±0ms            |    0.41 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')        |
| -        | 4.50±0.01ms                | 1.86±0.01ms         |    0.41 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| -        | 2.69±0.01ms                | 1.05±0.01ms         |    0.39 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')        |
| -        | 3.26±0ms                   | 1.04±0.01ms         |    0.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| -        | 3.33±0ms                   | 1.04±0.01ms         |    0.31 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
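
For readers unfamiliar with these rows, a rough sketch of what a single `time_unary` case measures may help. It assumes the three trailing parameters are input stride, output stride, and dtype character; the array length, the fill value, and the helper name `time_strided_unary` are hypothetical and are not taken from NumPy's benchmark suite.

```python
import timeit

import numpy as np


def time_strided_unary(ufunc, stride_in, stride_out, dtype, n=10_000, number=50):
    """Return the best-of-3 time in seconds for one ufunc call on strided views."""
    # Oversized buffers, so that taking every stride-th element gives n items.
    src = np.ones(n * stride_in, dtype=dtype)[::stride_in]
    dst = np.empty(n * stride_out, dtype=dtype)[::stride_out]
    timings = timeit.repeat(lambda: ufunc(src, out=dst), number=number, repeat=3)
    return min(timings) / number


# Contiguous versus strided access, for float32 ('f') and float64 ('d').
for params in [(1, 1, "f"), (4, 2, "f"), (1, 1, "d"), (4, 2, "d")]:
    print(params, time_strided_unary(np.sin, *params))
```

Absolute numbers from a loop like this are not comparable to the asv results above; it only shows why the contiguous `1, 1` cases and the strided `4, 2` cases can behave differently.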

Happy to merge as-is to help @seiko2plus keep moving 😸

mhvk · 2026-02-04T16:32:21Z

Looks like the only slowdowns are UnaryFPSpecial for double -- is that because the data in that case include "weird" numbers (like denormals)? If so, this indeed seems a great improvement.

seberg · 2026-04-29T09:11:35Z

I don't really remember the state of this PR (it is tagged for 2.5). Should we push for it?

One serious worry I have is still this part:

~4ulp for single-precision with handling floating-point errors,
which to me seems very optimistic that nobody cares (and we had discussed it also in some meeting to a similar result). Yes many (maybe most) float32 users may not care about precision, but that doesn't mean that no-one will be bitten by it unfortunately.
My opinion is that we really need a "fast-math" kind of switch for this type of thing, and otherwise stay roughly within e.g. glibc precision.

numpy-simd-routines added as subrepo in meson subprojects
directory and the current FP configuration is static, ~1ulp used for double-precision
~4ulp for single-precision with handling floating-point errors,
special-cases extended precision for large arguments,
subnormals are enabled by default too.

numpy-simd-routines supports all SIMD extensions that are supported
by Google Highway including non-FMA extensions and is fully independent
from libm to guarantee unified results across all compilers and
platforms.

Full benchmarks will be provided within the pull-request,
the following benchmark was tested against clang-19 and
x86 CPU (Ryzen7 7700X) with AVX512 enabled.

Note: that there was no SIMD optimization enabled for sin/cos
for double-precision, only single-precision.

| Before        | After       | Ratio  | Benchmark (Parameter)                    |
|---------------|-------------|--------|------------------------------------------|
| 713±6μs       | 633±6μs     | 0.89   | UnaryFP(<ufunc 'cos'>, 1, 2, 'f')        |
| 717±9μs       | 637±6μs     | 0.89   | UnaryFP(<ufunc 'cos'>, 4, 1, 'f')        |
| 705±3μs       | 607±10μs    | 0.86   | UnaryFP(<ufunc 'sin'>, 4, 1, 'f')        |
| 714±10μs      | 595±0.5μs   | 0.83   | UnaryFP(<ufunc 'sin'>, 1, 2, 'f')        |
| 370±0.3μs     | 277±4μs     | 0.75   | UnaryFP(<ufunc 'cos'>, 1, 1, 'f')        |
| 373±2μs       | 236±0.6μs   | 0.63   | UnaryFP(<ufunc 'sin'>, 1, 1, 'f')        |
| 1.06±0.01ms   | 648±3μs     | 0.61   | UnaryFP(<ufunc 'cos'>, 4, 2, 'f')        |
| 1.06±0.01ms   | 617±30μs    | 0.58   | UnaryFP(<ufunc 'sin'>, 4, 2, 'f')        |
| 5.06±0.06ms   | 2.61±0.3ms  | 0.52   | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
| 1.48±0ms      | 715±5μs     | 0.48   | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
| 1.50±0.01ms   | 639±6μs     | 0.43   | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
| 5.15±0.1ms    | 1.96±0.01ms | 0.38   | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
| 5.72±0.02ms   | 2.09±0.1ms  | 0.37   | UnaryFP(<ufunc 'cos'>, 4, 2, 'd')        |
| 5.76±0.01ms   | 2.03±0.08ms | 0.35   | UnaryFP(<ufunc 'sin'>, 4, 2, 'd')        |
| 5.07±0.08ms   | 1.76±0.2ms  | 0.35   | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
| 6.04±0.04ms   | 2.05±0.09ms | 0.34   | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
| 5.79±0.03ms   | 1.90±0.2ms  | 0.33   | UnaryFP(<ufunc 'sin'>, 4, 1, 'd')        |
| 2.29±0.1ms    | 762±40μs    | 0.33   | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
| 5.72±0.1ms    | 1.75±0.07ms | 0.31   | UnaryFP(<ufunc 'cos'>, 4, 1, 'd')        |
| 6.04±0.03ms   | 1.82±0.2ms  | 0.3    | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
| 2.49±0.1ms    | 748±30μs    | 0.3    | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
| 2.23±0.1ms    | 634±6μs     | 0.28   | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
| 1.31±0.03ms   | 367±5μs     | 0.28   | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
| 2.55±0.09ms   | 654±30μs    | 0.26   | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
| 4.97±0.03ms   | 1.14±0ms    | 0.23   | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
| 5.67±0.01ms   | 1.22±0.03ms | 0.22   | UnaryFP(<ufunc 'cos'>, 1, 2, 'd')        |
| 5.76±0.03ms   | 1.28±0.06ms | 0.22   | UnaryFP(<ufunc 'sin'>, 1, 2, 'd')        |
| 1.26±0.01ms   | 272±2μs     | 0.22   | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
| 7.03±0.02ms   | 1.31±0.01ms | 0.19   | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
| 5.67±0.01ms   | 810±9μs     | 0.14   | UnaryFP(<ufunc 'cos'>, 1, 1, 'd')        |
| 5.71±0.01ms   | 817±40μs    | 0.14   | UnaryFP(<ufunc 'sin'>, 1, 1, 'd')        |
| 7.05±0.03ms   | 915±4μs     | 0.13   | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
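
The Ratio column is After divided by Before, so values below 1 are speedups. Because the two cells can use different units (μs versus ms), here is a small sketch of the arithmetic for one row of the table above. The helper `parse_timing` is hypothetical and only handles cells of the form `value±error<unit>`.

```python
import re

# Seconds per unit; "us" is accepted as an ASCII spelling of "μs".
_UNITS = {"ns": 1e-9, "μs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_timing(cell):
    """Convert a cell such as '810±9μs' to seconds, ignoring the ± error term."""
    match = re.fullmatch(r"\s*([0-9.]+)±([0-9.]+)(ns|μs|us|ms|s)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognised timing cell: {cell!r}")
    value, _error, unit = match.groups()
    return float(value) * _UNITS[unit]


# Row: UnaryFP(<ufunc 'cos'>, 1, 1, 'd')
before = parse_timing("5.67±0.01ms")
after = parse_timing("810±9μs")
ratio = after / before
print(f"ratio = {ratio:.2f}")        # 0.14, matching the table
print(f"speedup = {1 / ratio:.1f}x")
```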


Mousius · 2026-06-22T09:48:28Z

@seberg I think the current FP32 routines are also 4ULP, so this doesn't change anything and we can move forwards with it. I'm in agreement with you that we should give users the ability to opt-in/out of lower precision though.

seberg · 2026-06-22T10:00:35Z

            np.sin(x_f32, out=x_f32),
            np.float32(np.sin(x_f64)),
-            maxulp=2,
+            maxulp=maxulp_f32,


@Mousius if there is no big change then we don't have to worry about it, but this looks a bit like there may be one? (But maybe not, fma not being available sounds possibly niche these days, but not sure.)

My assumption with this is that the expectation was always 4ULP, but some of the routines would perform better than that.

maxulp_f32 is ~3 ulp for SIMD extensions without native FMA support, which I think is acceptable. If we wanted to tighten that, we could add an extra correction step — but I'd prefer to do that only after we have a proper test suite in place (we need a large dataset to validate edge cases reliably).

As a quick workaround, if the relaxed tolerance is really a concern, we can temporarily force the high-precision profile for those extensions until the correction step is implemented.

Ah, also I suppose we agreed that the low-prec profile can even go further to ~4ulp.

Ah, also I suppose we agreed that the low-prec profile can even go further to ~4ulp.

This was the reason why I commented: I am not sure that this was ever quite agreed upon. And if it was, I suspect that this was before the recent fallout and discussion around lower fp16 precision.
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/QLR57SMUPQVMGKXWQYR5CSE6JBPK43IG/#4GWQFDFYG7HCBHQEX5K4VP7DDGN25ZNZ

Having noticeably lower precision than typical math library/current code seems like something that we need to be very careful about to me.

Yeesh, the conversation around making it all toggle-able was a long time ago.

Just to clarify - do you want to block this PR @seberg? I don't see it as a change in behaviour.

I commented on the PR mainly based on the PR description. We had serious issues with float64 being less precise, and whatever I discuss/think about it, I am not convinced we can just say that with float32 you can cut corners without worry. The same argument was made with float16 and we had to revert it.

So blocking this PR? No, because you are both promising that there is in practice no precision change. But blocking anything that makes precision worse and argues it is OK because 4 ULP are OK for float32: yes.

After the whole float16 thing and the float64 precision change issue and with people noticing different precision across architectures, I have serious concerns about saying that being a bit less precise than normal C-code/glibc code is OK.
(That isn't to say that we have to match it exactly, but I have always been nervous about going noticeably less precise and I feel like we have been bitten quite often when being optimistic about it.)

Or maybe in other words: I am not happy to keep being optimistic about this, if we really want a much faster but less precise version, I think we should go through the trouble of implementing the choice.
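
No such switch exists in NumPy as far as this thread shows. As a user-level illustration of what "implementing the choice" could mean, here is a hypothetical helper: `sin_f32` and its `precise` flag are invented for this sketch and are not part of NumPy or of this PR.

```python
import numpy as np


def sin_f32(x, precise=False):
    """Hypothetical float32 sine with an opt-in, slower, more precise path."""
    x = np.asarray(x, dtype=np.float32)
    if precise:
        # Evaluate in float64 and round once to float32 at the end.
        return np.sin(x.astype(np.float64)).astype(np.float32)
    # Default path: whatever float32 kernel NumPy dispatches to.
    return np.sin(x)


x = np.linspace(-5.0, 5.0, 1001)
fast = sin_f32(x)
slow = sin_f32(x, precise=True)
print("elements that differ:", np.count_nonzero(fast != slow))
```

The precise path costs a widening and a narrowing pass plus the float64 kernel, which is the kind of trade-off an explicit switch would make visible to users.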

seiko2plus · 2026-06-22T12:03:59Z

My apologies for the delayed response.

@rgommers,

No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.

My concerns were mainly for SVE, which is not supported by M1. I was expecting a performance gain similar to AVX2, but we need to dig more, I suppose. However, the current gain is acceptable. It seems AWS is the only option I have.

@seberg,

thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64bit version and SVML before, so we should at least increase awareness.)

I sent an email a long time ago:
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/YIU5TZOENCRQ7GUZRKXYMUK2EIYWJINE/
And I have clarified the default precision of ~4ulp for fp32 and ~1ulp for fp64.

One thing worth noting here: we already ship an enabled SIMD kernel for float32, and it is largely equivalent to the low-precision profile in this PR. The two differences are that the existing kernel falls back to libm for large inputs and lacks support for non-FMA SIMD extensions.
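
To illustrate why large inputs get special treatment at all, here is a small sketch comparing a naive float32 range reduction with NumPy's own float32 result. The naive reduction is deliberately simplistic; it is not the method used by the existing kernel or by numpy-simd-routines.

```python
import numpy as np

x = np.float32(1.0e6)              # exactly representable in float32
two_pi = np.float32(2.0 * np.pi)   # 2*pi rounded to float32

# Naive reduction done entirely in float32: the rounding error in two_pi
# is multiplied by k, and the product k * two_pi is rounded again.
k = np.floor(x / two_pi)
r = x - k * two_pi
naive = float(np.sin(r))

reference = float(np.sin(np.float64(x)))  # float64 evaluation of the same input
library = float(np.sin(x))                # NumPy's float32 result

print(f"naive:     {naive:+.6f}")
print(f"library:   {library:+.6f}")
print(f"reference: {reference:+.6f}")
```

The naive value is off in the second decimal place, while the library result agrees with the float64 reference to float32 precision. Getting that agreement requires either a fallback or a more careful, extended-precision reduction.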

I don't really remember the state of this PR (it is tagged for 2.5). Should we push for it?

The build was totally broken because a diverged rebase kept the meson subproject functionality from working as intended; this is resolved by numpy/meson#27 and #31633.

New CI errors are caused by the LUT optimization patch numpy/numpy-simd-routines#7; (my bad) I will prepare a fix for it.

looks very impressive! A few small comments inline, with that about the iterator perhaps the most important (if it is possible; can be for follow-up). I also made a small comment over at your new npsr routines.

@mhvk,
Thank you, I have covered your suggestions in a separate commit:
c981beb

@Mousius,

Generally seems better overall, testing on some metal AWS instances:
| - | 5.84±0.01ms | 3.96±0ms | 0.68 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd') |

Not good to me or at least not as I was expecting. Just wondering, I will dig into it. We should have a similar gain to X86_V3, but definitely not within this PR.

@mhvk,

Looks like the only slowdowns are UnaryFPSpecial for double -- is that because the data in that case include "weird" numbers (like denormals)? If so, this indeed seems a great improvement.

inf/NaNs too. The performance gain is not good overall (compared to x86), even for the happy paths, but it's acceptable for now.

github-actions Bot added the 01 - Enhancement label Sep 6, 2025

seiko2plus force-pushed the brings_npsr branch 4 times, most recently from 09414c8 to af5d98a Compare September 7, 2025 22:33

rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Sep 10, 2025

seiko2plus added this to the 2.4.0 release milestone Sep 13, 2025

seiko2plus mentioned this pull request Oct 7, 2025

Generate trigonometric constants and lookup tables using Sollya numpy/numpy-simd-routines#4

Merged

seiko2plus force-pushed the brings_npsr branch 2 times, most recently from 2f4c0a6 to 9b9a438 Compare October 8, 2025 02:30

charris modified the milestones: 2.4.0 release, 2.5.0 Release Nov 20, 2025

seiko2plus force-pushed the brings_npsr branch from 9b9a438 to 67d0274 Compare December 8, 2025 15:46

seiko2plus marked this pull request as ready for review December 8, 2025 15:46

mhvk mentioned this pull request Dec 9, 2025

Maybe small improvement for accuracy for cos for regular angles numpy/numpy-simd-routines#6

Open

mhvk reviewed Dec 9, 2025


mhvk mentioned this pull request Dec 10, 2025

ENH: Allow ufuncs to request the iterator to provide contiguous arrays #30413

Open

seiko2plus force-pushed the brings_npsr branch from 67d0274 to e16efa5 Compare December 19, 2025 06:58

seiko2plus force-pushed the brings_npsr branch from 2f99039 to 81047d6 Compare February 21, 2026 03:31

seberg modified the milestones: 2.5.0 Release, 2.6.0 Release Apr 29, 2026

seiko2plus mentioned this pull request Jun 13, 2026

Init a new branch to update to meson v1.11 numpy/meson#27

Merged

21 tasks

seiko2plus added 3 commits June 22, 2026 11:50

Relax sin/cos ULP test for float32 on non-FMA

fe19146

Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.

SIMD, TST: Apply Marten’s suggestions

c981beb

- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts, for example `cos(0) == 1`.
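
As an illustration of the kind of "facts" such tests can pin down, here is a minimal sketch. The function name `check_trig_facts` and the particular set of checks are assumptions made for this example; they are not the tests added in c981beb.

```python
import numpy as np


def check_trig_facts(dtype):
    """Check a few exact sin/cos results for the given floating-point dtype."""
    zero = dtype(0.0)
    assert np.cos(zero) == dtype(1.0)
    assert np.sin(zero) == dtype(0.0)
    # sin/cos of infinity are invalid operations and should produce NaN.
    with np.errstate(invalid="ignore"):
        assert np.isnan(np.sin(dtype(np.inf)))
        assert np.isnan(np.cos(dtype(-np.inf)))
    # NaN inputs propagate.
    assert np.isnan(np.sin(dtype(np.nan)))
    assert np.isnan(np.cos(dtype(np.nan)))
    return True


for dtype in (np.float32, np.float64):
    check_trig_facts(dtype)
```

Checks like these are independent of the ULP tolerance, so they stay valid whichever precision profile is selected.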

SIMD: Enable SVE/RVV dynamic dispatch for sin/cos kernels

a8fdb07

seiko2plus force-pushed the brings_npsr branch from 81047d6 to a8fdb07 Compare June 22, 2026 08:54

seberg reviewed Jun 22, 2026
