Blogged: https://brandewinder.com/2024/09/01/should-i-use-simd-vectors/
Where I try SIMD-accelerated vectors, it works much better than anticipated, and I'd like to understand why... #dotnet #fsharp
@brandewinder SIMD is always great! 2 small comments:
- You should really not assume Vector<float>.Count is 4, otherwise your code will fail on most Intel machines out there (I assume you might be running on a Mac M1 or later?)
- You can optimize it further by not relying on Slice (which adds several unnecessary checks inside the loop), see https://gist.github.com/xoofx/2fc4e25ed32732bcfe0559e8c07076bb and the attached generated assembly for both versions
Things can be optimized slightly more by batching more values per loop iteration as well
@brandewinder You might also prefer Vector256 (it should work even on Vector128-only machines for this kind of code): the codegen should be slightly better, since on Vector128 machines it will batch 8 floats per iteration instead of only 4
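To make the two bullet points above concrete, here is a minimal sketch, not the code from the gist. It assumes .NET 6 or later (for Vector.Sum), and the class and method names are hypothetical. It reads the vector width from Vector<double>.Count instead of hard-coding 4, and it uses MemoryMarshal.Cast to view the input as whole vectors rather than slicing inside the loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
static class SimdSumSketch
{
    public static double Sum(ReadOnlySpan<double> values)
    {
        // Number of doubles per Vector<double>; depends on the CPU, so never hard-code it.
        int width = Vector<double>.Count;

        // Reinterpret the input as a span of whole vectors.
        // Elements that do not fill a complete vector are left out of this view.
        ReadOnlySpan<Vector<double>> blocks =
            MemoryMarshal.Cast<double, Vector<double>>(values);

        Vector<double> acc = Vector<double>.Zero;
        foreach (Vector<double> block in blocks)
        {
            acc += block;
        }

        // Horizontal sum of the accumulator lanes.
        double total = Vector.Sum(acc);

        // Leftover elements that did not fit in a whole vector.
        for (int i = blocks.Length * width; i < values.Length; i++)
        {
            total += values[i];
        }

        return total;
    }
}
```

Calling SimdSumSketch.Sum(new double[] { 1, 2, 3, 4, 5 }) should give the same result whether the machine packs 2 or 4 doubles per vector, because nothing in the code depends on a fixed width.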
@xoofx thank you! I learned something with MemoryMarshal. It is interesting: besides shaving off some time, it makes the code cleaner.
@xoofx as for Vector<float>.Count being 4, this might be because float in F# actually refers to double in C#... Could Vector<T> have different Count on different machines?
@brandewinder oh, right, completely forgot about that detail in F#
Yes, so that means it is indeed computing with doubles, so you have 4 doubles, which is a Vector256. And yes, you can get a different Count depending on the CPU. On Apple M1, you would get Vector<double>.Count == 2
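A quick way to check this on any given machine is to print the counts. The sketch below uses a hypothetical helper name. Keep in mind that F# float is System.Double (C# double), while F# float32 is System.Single (C# float).

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
static class VectorWidthSketch
{
    public static string Describe() =>
        $"Vector<double>.Count = {Vector<double>.Count}, " +
        $"Vector<float>.Count = {Vector<float>.Count}, " +
        $"accelerated = {Vector.IsHardwareAccelerated}";
}
```

Console.WriteLine(VectorWidthSketch.Describe()) prints values that depend on the CPU. Whatever the vector size is, Vector<float>.Count is twice Vector<double>.Count, because a float takes half the bytes of a double.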
@xoofx interesting, really appreciate the pointers! Since you obviously know the area, perhaps I can ask a question: what would be the smart way to make the code handle arrays whose size is not a clean multiple of the Vector size? My intuition says, expand everything to clean multiples ahead of time, but perhaps there are clever tricks :)
@brandewinder reallocating is not recommended, so usually you have a simple loop that processes the remaining elements one by one, nothing fancy
Sometimes you can use tricks like still using a Vector for the remaining elements if you can mask things out (you load the last Vector.Count elements of the array, and you mask with 0 the ones you have already calculated). You still need a remainder loop if the whole input is shorter than Vector.Count.
@brandewinder Also, don't forget to always keep a code path that works without SIMD at all, guarded by Vector/128/256/512.IsHardwareAccelerated, so the remainder loop sits outside the SIMD path and is always active
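One possible reading of the masking trick, sketched for a plain sum and assuming .NET 6 or later. The names are hypothetical and this is only an illustration of the idea: the final vector is loaded from the end of the array so that it overlaps lanes already summed, and those lanes are replaced with 0 before accumulating.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
static class MaskedTailSketch
{
    public static double Sum(ReadOnlySpan<double> values)
    {
        int width = Vector<double>.Count;

        // Input shorter than one vector: no overlap is possible, use a plain loop.
        if (values.Length < width)
        {
            double small = 0.0;
            foreach (double v in values) small += v;
            return small;
        }

        // Sum all whole vectors.
        ReadOnlySpan<Vector<double>> blocks =
            MemoryMarshal.Cast<double, Vector<double>>(values);
        Vector<double> acc = Vector<double>.Zero;
        foreach (Vector<double> block in blocks) acc += block;

        int done = blocks.Length * width;
        if (done < values.Length)
        {
            // Load the last `width` elements; the leading lanes overlap what was already summed.
            int start = values.Length - width;
            int alreadySummed = done - start;
            Vector<double> tail = new Vector<double>(values.Slice(start));

            // Build a mask that keeps only lanes whose index is >= alreadySummed.
            Span<long> lanes = stackalloc long[width];
            for (int k = 0; k < width; k++) lanes[k] = k;
            Vector<long> keep = Vector.GreaterThanOrEqual(
                new Vector<long>(lanes), new Vector<long>(alreadySummed));

            // Lanes that were already counted become 0.
            acc += Vector.ConditionalSelect(keep, tail, Vector<double>.Zero);
        }

        return Vector.Sum(acc);
    }
}
```

For an element-wise operation that writes its results to an output array, the overlap alone can be enough, since recomputing a few lanes is harmless. A reduction like this one needs the mask so that no element is counted twice.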
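The overall shape described above could look like the following sketch. It assumes .NET 7 or later (for Vector256.LoadUnsafe, Vector256.Sum and the Vector256 operators), and the dot product and all names are hypothetical choices for illustration. The SIMD loop is guarded by IsHardwareAccelerated, while the scalar loop sits outside the guard, so it handles the leftover elements and also the whole input when SIMD is unavailable.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
static class DotSketch
{
    public static double Dot(ReadOnlySpan<double> a, ReadOnlySpan<double> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Spans must have the same length.");

        int i = 0;
        double total = 0.0;

        // SIMD path: only taken when the hardware accelerates Vector256
        // and there is at least one full vector of data.
        if (Vector256.IsHardwareAccelerated && a.Length >= Vector256<double>.Count)
        {
            ref double ra = ref MemoryMarshal.GetReference(a);
            ref double rb = ref MemoryMarshal.GetReference(b);
            Vector256<double> acc = Vector256<double>.Zero;
            int lastBlock = a.Length - Vector256<double>.Count;

            for (; i <= lastBlock; i += Vector256<double>.Count)
            {
                // Unchecked loads; the loop bound keeps every load inside the spans.
                acc += Vector256.LoadUnsafe(ref ra, (nuint)i)
                     * Vector256.LoadUnsafe(ref rb, (nuint)i);
            }

            total = Vector256.Sum(acc);
        }

        // Scalar path: always active. It finishes the leftover elements,
        // or processes everything when the SIMD path was skipped.
        for (; i < a.Length; i++)
        {
            total += a[i] * b[i];
        }

        return total;
    }
}
```

Because the index i carries over from the SIMD loop into the scalar loop, there is no separate bookkeeping for the remainder: the scalar loop simply starts wherever the vectorized part stopped, which is 0 when that part did not run.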