There's no C++ there, you are using plain C. Try using C++ data structures and algorithms (std::vector or std::array, notably the latter) and you might see some improvement as the compiler will have even more information about the input format to perform optimizations.
I guess I used C++ in the title back then, because I actually coded this in a C++ file. You are right, the title could also just mention C.
I'm not sure how using the data structures or algorithms from the standard library would help make the code perform better for this specific problem.
Notably, you would never have to use the SIMD intrinsics manually; using a std::vector or std::array would give the compiler more information (e.g. it knows the length beforehand), so it would vectorize the code more automatically. It's amazing what compilers can do today! Shameless plug, I wrote about that. Other than that, it's more that you would write the same code with less boilerplate.
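As a minimal sketch of that point (the function names below are made up for illustration, not taken from the post): with std::array the element count is part of the type, so the compiler sees a constant trip count and can unroll or fully vectorize the loop, while a pointer-plus-runtime-length version forces it to emit a vectorized body plus a scalar tail for the unknown remainder.

```cpp
#include <array>
#include <cstddef>

// Fixed size: the count 16 is part of the type, so the compiler
// knows the exact trip count at compile time.
int sum_fixed(const std::array<int, 16>& values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}

// Runtime size: the compiler must handle any n, so it emits a
// vectorized main loop plus a scalar tail for leftover elements.
int sum_dynamic(const int* values, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += values[i];
    return total;
}
```

Comparing the generated assembly of the two (e.g. with -O3 on a recent GCC or Clang) usually shows the fixed-size version compiled to a short, branch-free sequence of vector instructions.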
That's an interesting article as well. Thanks for the link.
Note, I'm absolutely aware of the fact that compilers do an amazing job; I rely on that every day. However, I do think there are some differences between the algorithm you optimized and the one I optimized here.
Most notably, in your algorithm each block can be processed in parallel and you are in no way constrained by branch prediction. Neither is true for the find_non_zero_indices function in this post, which makes it much harder for the compiler to optimize automatically. Also, the compiler does not know which case to optimize for; the performance greatly depends on the input data when the algorithm is not turned into something branchless.
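To make the branch-prediction point concrete, here is a hedged sketch of the kind of function being discussed; the actual code from the post is not reproduced in this thread, so the signature and bodies below are assumptions for illustration only.

```cpp
#include <cstddef>

// Branchy sketch: the `if` inside the loop is unpredictable when
// zeros and non-zeros are randomly distributed, so throughput
// depends heavily on the input data.
std::size_t find_non_zero_indices_baseline(const int* in, std::size_t n,
                                           std::size_t* out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] != 0) {
            out[count++] = i;
        }
    }
    return count;
}

// Branchless sketch: always store the index, then advance the write
// cursor by 0 or 1. The branch becomes a data dependency, so the
// running time no longer depends on the distribution of zeros.
// Note: `out` must have room for n elements, not just the matches.
std::size_t find_non_zero_indices_branchless(const int* in, std::size_t n,
                                             std::size_t* out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[count] = i;            // unconditional store
        count += (in[i] != 0);     // 0 or 1, no branch
    }
    return count;
}
```

Both variants return the same indices; the difference is only in how predictable the control flow is for the hardware.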
If you have some time for it, please try rewriting find_non_zero_indices__baseline using std::vector and std::array and check whether it actually becomes faster. I think the signature of the function should stay the same, but if you think it helps performance, feel free to change the signature to whatever works better (without changing the core problem, of course).
Yes, that is a very interesting idea that I will try and surely learn more :)