You should change _mm_store_ps to the unaligned store _mm_storeu_ps, just as you already use unaligned loads for these fields - unless your structure has an attribute/pragma guaranteeing 16-byte alignment.
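A minimal sketch of both options (Vec4/Vec4Aligned are hypothetical stand-ins for your struct):

```cpp
#include <xmmintrin.h>

struct Vec4 { float v[4]; };  // no alignment guarantee

// Without an alignment attribute, the unaligned store is the safe choice.
void Store(Vec4* out, __m128 x)
{
    _mm_storeu_ps(out->v, x);  // matches your _mm_loadu_ps usage
}

// Alternatively, guarantee 16-byte alignment and keep _mm_store_ps:
struct alignas(16) Vec4Aligned { float v[4]; };

void StoreAligned(Vec4Aligned* out, __m128 x)
{
    _mm_store_ps(out->v, x);   // safe: the type is 16-byte aligned
}
```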
For non-SSE4 code, you can do the conditional move cheaper with _mm_andnot_ps - it negates the mask for you, so you only need and+andnot+or, just 3 operations. Alternatively, a ^ ((a ^ b) & mask) is also only 3 operations and needs no mask negation at all.
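Both variants as a sketch (Select is a hypothetical name; it assumes each mask lane is all-ones or all-zeros, as produced by the compare intrinsics, and returns b where the mask is set, a elsewhere):

```cpp
#include <xmmintrin.h>

// Variant 1: and + andnot + or.
static inline __m128 Select(__m128 a, __m128 b, __m128 mask)
{
    // _mm_andnot_ps(mask, a) computes (~mask) & a, so no extra negation step.
    return _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
}

// Variant 2: xor trick, also three ops: a ^ ((a ^ b) & mask).
static inline __m128 SelectXor(__m128 a, __m128 b, __m128 mask)
{
    // Where mask is all-ones: a ^ (a ^ b) = b; where it is zero: a.
    return _mm_xor_ps(a, _mm_and_ps(_mm_xor_ps(a, b), mask));
}
```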
But if you're using SSE4 code unconditionally anyway (because of blendps), then you can use _mm_test_all_zeros instead of _mm_movemask_ps plus a compare in IsZero.
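A sketch of what that could look like, assuming your IsZero takes a comparison result whose lanes are all-ones or all-zeros (ptest works on integer registers, hence the cast):

```cpp
#include <smmintrin.h>  // SSE4.1

static inline bool IsZero(__m128 cmp)
{
    __m128i bits = _mm_castps_si128(cmp);
    // _mm_test_all_zeros(mask, a) returns 1 iff (a & mask) == 0,
    // so testing the value against itself checks all 128 bits at once.
    return _mm_test_all_zeros(bits, bits) != 0;
}
```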
Also, I strongly suggest never using the -Ofast optimization level. It breaks strict floating-point semantics, because it enables -ffast-math.
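For example, -ffast-math implies -ffinite-math-only, so the compiler may assume NaN never occurs and fold checks like this away (a minimal illustration; the exact behavior depends on compiler and version):

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    volatile double zero = 0.0;     // volatile: keep the division at run time
    double x = zero / zero;         // produces NaN
    // Prints 1 under -O2; under -Ofast the check may be folded to 0.
    std::printf("isnan: %d\n", std::isnan(x));
    return 0;
}
```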