nice improvement on my fixed-point RD code just by adding #pragma omp parallel for schedule(static)
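for reference, roughly where that pragma goes. this is only a sketch, not the actual code: the Q16.16 format, fix_mul and scale_grid are all made-up names/assumptions for illustration. needs -fopenmp (gcc/clang), otherwise the pragma is just ignored and it runs single threaded

```c
#include <stdint.h>
#include <stddef.h>

/* ASSUMPTION: Q16.16 fixed point; the real code may use another format. */
typedef int32_t fix16;
#define FIX_ONE (1 << 16)

/* Hypothetical helper: multiply two Q16.16 values via a 64-bit intermediate.
   Relies on arithmetic right shift for negative values (true on gcc/clang). */
static inline fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Hypothetical per-cell update: out = in * k.
   Rows are independent, so the outer loop is split evenly across threads;
   schedule(static) gives each thread one contiguous chunk of rows. */
void scale_grid(const fix16 *in, fix16 *out, int w, int h, fix16 k) {
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            size_t i = (size_t)y * w + x;
            out[i] = fix_mul(in[i], k);
        }
    }
}
```

static scheduling fits here because every row costs the same, so there is no load imbalance to fix with dynamic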
still no SIMD on ARM though
judging by this i should be able to squeeze quite a bit out of the GPU on my raspi: https://github.com/mn416/QPULib#example-3-heat-flow-simulation
the most expensive phase is the 2D convolution so i can probably find a lot of fast reference code if needed
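the naive baseline i would compare any fast reference code against looks something like this. again just a sketch under assumptions: Q16.16 values, a 3x3 kernel, wraparound edges, and conv3x3_wrap is a made-up name, none of that is confirmed to match the real code

```c
#include <stdint.h>
#include <stddef.h>

/* ASSUMPTION: Q16.16 fixed point for both grid values and kernel weights. */
typedef int32_t fix16;

/* Naive 3x3 convolution with wraparound (toroidal) boundaries.
   The kernel is applied without flipping, which is identical to a true
   convolution for symmetric kernels such as a Laplacian.
   in and out must not alias. */
void conv3x3_wrap(const fix16 *in, fix16 *out, int w, int h, const fix16 k[9]) {
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int64_t acc = 0; /* 64-bit accumulator avoids overflow in the products */
            for (int ky = -1; ky <= 1; ky++) {
                int yy = (y + ky + h) % h; /* wrap vertically */
                for (int kx = -1; kx <= 1; kx++) {
                    int xx = (x + kx + w) % w; /* wrap horizontally */
                    acc += (int64_t)in[(size_t)yy * w + xx]
                         * k[(ky + 1) * 3 + (kx + 1)];
                }
            }
            /* Scale back to Q16.16 once, after summing all nine products. */
            out[(size_t)y * w + x] = (fix16)(acc >> 16);
        }
    }
}
```

the modulo in the inner loop is the obvious thing to get rid of first (pad the grid or handle the border rows separately), and the inner 3x3 loops are what a SIMD or GPU version would replace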