ENH: Add SIMD sin/cos implementation with numpy-simd-routines#29699
ENH: Add SIMD sin/cos implementation with numpy-simd-routines#29699seiko2plus wants to merge 4 commits into
Conversation
09414c8 to
af5d98a
Compare
2f4c0a6 to
9b9a438
Compare
9b9a438 to
67d0274
Compare
|
I have updated the benchmark to use GCC tests instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment, I only have access to bare-metal x86 hardware; Perhaps I should try cloud services—e.g., AWS—for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases cc: @charris, @rgommers, @mattip, @seberg, @r-devulap, @Mousius |
|
Ah, regarding the reverted #23399, as discussed in the thread/C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/. The default precision for f64 is set up to 1ULP, including for large arguments, with the ability to add a build-time option to change this behavior later: numpy/numpy/_core/src/common/simd/simd.hpp Lines 79 to 100 in 67d0274 cc: @mhvk |
No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.
That's impressively small, glad to see that. |
|
@seiko2plus thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing precision changes for the float32 version? (I am slightly worried might get into the same dance as with the 64bit version and SVML before, so should at least increase awareness.) |
mhvk
left a comment
There was a problem hiding this comment.
@seiko2plus - looks very impressive! A few small comments inline, with that about the iterator perhaps the most important (if it is possible; can be for follow-up). I also made a small comment over at your new npsr routines.
| // Note: the intrinsics of NumPy SIMD routines are inlined by default. | ||
| if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) || | ||
| sin != sizeof(T) || sout != sizeof(T))) { | ||
| // this for non-contiguous or overlapping case |
There was a problem hiding this comment.
I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since if so we do not have to do this inside the loop (with similar duplicated code presumably elsewhere).
There was a problem hiding this comment.
This is already the case I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?
EDIt: Sorry to be clear, I think that is the case, I am not 100% sure.
There was a problem hiding this comment.
I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and see if anything breaks? Logically, if we are passing the iterator a flag that we require contiguous input and output, then we should never get anything else...
There was a problem hiding this comment.
Ah, there is no flag to enforce a contiguous loop, but we could probably add one, if we need this more I think that makes sense. The buffer machinery will have a lot of overhead unfortunately, but overall it is still better likely.
For overlap, I believe the rule should be (or am I missing something?!):
- If memory overlap exists it must be exactly the same memory (i.e. args[0] == args[1] and steps identical).
- Except for
.accumulatewhich you are right, doesn't apply here.
There was a problem hiding this comment.
Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering to a contiguous and aligned chunk to the iterator. I would think the overhead cannot be that much worse than it is here, since one would already be inside the iterator anyway typically for non-contiguous data. And it would certainly be nice to keep the underlying loops simple...
There was a problem hiding this comment.
I would think the overhead cannot be that much worse than it is here
Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
There was a problem hiding this comment.
@seberg - you mentioned there is no flag, but is one needed? If sin/cos are defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment only treated as something that will be used if present if data are contiguous.
There was a problem hiding this comment.
I would think the overhead cannot be that much worse than it is here
Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.
There was a problem hiding this comment.
@mhvk no not yet, the contiguous loop is an optimization only, only with that flag or a "non-contig to contig wrapper" could it be more. Dunno what is better, centralizing or not, I mostly like it because it removes unnecessary repitition both in code and binary, whether it is slower or faster.
(but yeah, we shouldn't worry about that here.)
There was a problem hiding this comment.
So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.
| /// npsr is tag free by design so we only include it within main namespace (np::simd) | ||
| namespace sr = npsr::HWY_NAMESPACE; | ||
|
|
||
| /// Default precision configrations for NumPy SIMD Routines |
There was a problem hiding this comment.
Might be good here to tell what low and high imply (or at least where to look it up)
| assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2) | ||
| assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32) | ||
| assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32) | ||
|
|
There was a problem hiding this comment.
It maybe good to add an explicit test that cos(0)==1 for all float, as it is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
67d0274 to
e16efa5
Compare
|
Generally seems better overall, testing on some metal AWS instances: c8g.metalc7g.metalHappy to merge as-is to help @seiko2plus keep moving 😸 |
|
Looks like the only slowdowns are |
2f99039 to
81047d6
Compare
|
I don't really remember the state of this PR (it is tagged for 2.5). Should we push for it? One serious worry I have is still this part:
|
numpy-simd-routines added as subrepo in meson subprojects directory and the current FP configuration is static, ~1ulp used for double-precision ~4ulp for single-precision with handling floating-point errors, special-cases extended precision for large arguments, subnormals are enabled by default too. numpy-simd-routines supports all SIMD extensions that are supported by Google Highway including non-FMA extensions and is fully independent from libm to guarantee unified results across all compilers and platforms. Full benchmarks will be provided within the pull-request, the following benchmark was tested against clang-19 and x86 CPU (Ryzen7 7700X) with AVX512 enabled. Note: that there was no SIMD optimization enabled for sin/cos for double-precision, only single-precision. | Before | After | Ratio | Benchmark (Parameter) | |---------------|-------------|--------|------------------------------------------| | 713±6μs | 633±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f') | | 717±9μs | 637±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f') | | 705±3μs | 607±10μs | 0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f') | | 714±10μs | 595±0.5μs | 0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f') | | 370±0.3μs | 277±4μs | 0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f') | | 373±2μs | 236±0.6μs | 0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f') | | 1.06±0.01ms | 648±3μs | 0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f') | | 1.06±0.01ms | 617±30μs | 0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f') | | 5.06±0.06ms | 2.61±0.3ms | 0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') | | 1.48±0ms | 715±5μs | 0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') | | 1.50±0.01ms | 639±6μs | 0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') | | 5.15±0.1ms | 1.96±0.01ms | 0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') | | 5.72±0.02ms | 2.09±0.1ms | 0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd') | | 5.76±0.01ms | 2.03±0.08ms | 0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd') | | 5.07±0.08ms | 1.76±0.2ms | 0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') | | 6.04±0.04ms | 2.05±0.09ms | 0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') | | 5.79±0.03ms | 1.90±0.2ms | 0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd') | | 2.29±0.1ms | 762±40μs | 0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') | | 5.72±0.1ms | 1.75±0.07ms | 0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd') | | 6.04±0.03ms | 1.82±0.2ms | 0.3 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') | | 2.49±0.1ms | 748±30μs | 0.3 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') | | 2.23±0.1ms | 634±6μs | 0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') | | 1.31±0.03ms | 367±5μs | 0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') | | 2.55±0.09ms | 654±30μs | 0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') | | 4.97±0.03ms | 1.14±0ms | 0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') | | 5.67±0.01ms | 1.22±0.03ms | 0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd') | | 5.76±0.03ms | 1.28±0.06ms | 0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd') | | 1.26±0.01ms | 272±2μs | 0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') | | 7.03±0.02ms | 1.31±0.01ms | 0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') | | 5.67±0.01ms | 810±9μs | 0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd') | | 5.71±0.01ms | 817±40μs | 0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd') | | 7.05±0.03ms | 915±4μs | 0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.
- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type. - Add test cases for sine/cosine facts—for example, `cos(0) == 1`.
|
@seberg I think the current FP32 routines are also 4ULP, so this doesn't change anything and we can move forwards with it. I'm in agreement with you that we should give users the ability to opt-in/out of lower precision though. |
| np.sin(x_f32, out=x_f32), | ||
| np.float32(np.sin(x_f64)), | ||
| maxulp=2, | ||
| maxulp=maxulp_f32, |
There was a problem hiding this comment.
@Mousius if there is no big change then we don't have to worry about it, but this looks a bit like there may be one? (But maybe not, fma not being available sounds possibly niche these days, but not sure.)
There was a problem hiding this comment.
My assumption with this is that the expectation was always 4ULP, but some of the routines would perform better than that.
There was a problem hiding this comment.
maxulp_f32 is ~3 ulp for SIMD extensions without native FMA support, which I think is acceptable. If we wanted to tighten that, we could add an extra correction step — but I'd prefer to do that only after we have a proper test suite in place (we need a large dataset to validate edge cases reliably).
As a quick workaround, if the relaxed tolerance is really a concern, we can temporarily force the high-precision profile for those extensions until the correction step is implemented.
There was a problem hiding this comment.
Ah, also I suppose we agreed that the low-prec profile can even go further to ~4ulp.
There was a problem hiding this comment.
Ah, also I suppose we agreed that the low-prec profile can even go further to ~4ulp.
This was the reason why I commented: I am not sure that this was ever quite agreed upon. And if it was, I suspect that this was before the recent fallout and discussion around lower fp16 precision.
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/QLR57SMUPQVMGKXWQYR5CSE6JBPK43IG/#4GWQFDFYG7HCBHQEX5K4VP7DDGN25ZNZ
Having noticeably lower precision than typical math library/current code seems like something that we need to be very careful about to me.
There was a problem hiding this comment.
Yeash, the conversation around making it all toggle-able was a long time ago.
Just to clarify - do you want to block this PR @seberg? I don't see it as a change in behaviour.
There was a problem hiding this comment.
I commented on the PR mainly based on the PR description. We had serious issues with float64 being less precise, and whatever I discuss/think about it, I am not convinced we can just say that float32 you can cut corners without worry. The same argument was made with float16 and we had to revert it.
So blocking this PR? No, because you are both promising that there is in practice no precision change. But blocking anything that makes precision worse and argues it is OK because 4 ULP are OK for float32: yes.
After the whole float16 thing and the float64 precision change issue and with people noticing different precision across architectures, I have serious concerns about saying that being a bit less precise than normal C-code/glibc code is OK.
(That isn't to say that we have to match it exactly, but I have always been nervous about going noticeably less precise and I feel like we have been bitten quite often when being optimistic about it.)
Or maybe in other words: I am not happy to keep being optimistic about this, if we really want a much faster but less precise version, I think we should go through the trouble of implementing the choice.
|
My apologies for the delayed response.
My concerns were mainly for
I sent an email a long time ago: One thing worth noting here: we already ship an enabled SIMD kernel for
The build was totally broken due to the New CI errors are caused by the LUT optimization patch numpy/numpy-simd-routines#7; (my bad) I will prepare a fix for it.
@mhvk,
Not good to me or at least not as I was expecting. Just wondering, I will dig into it. We should have a similar gain to
|
numpy-simd-routines added as subrepo in meson subprojects
directory and the current FP configuration is static, ~1ulp used for double-precision
~4ulp for single-precision with handling floating-point errors,
special-cases extended precision for large arguments,
subnormals are enabled by default too.
numpy-simd-routines supports all SIMD extensions that are supported
by Google Highway including non-FMA extensions and is fully independent
from libm to guarantee unified results across all compilers and
platforms.
Note: that there was no SIMD optimization enabled for sin/cos
for double-precision before, only single-precision.
Benchmark
X86 Platform
The following benchmark was tested against GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).
Environment
x86-64-v4
x86-64-v3
x86-64-v2
Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so