Fluid Solver Optimization (CPU, OpenMP, CUDA)
Cut CPU runtime from 30.85 s to 3.69 s (8.36×); OpenMP speedup peaked at ~10.77× with ~16-24 threads; CUDA kernels reached ~90.16 GiB/s.
Optimized a 3D Stable Fluids solver end-to-end: (1) CPU locality and ILP (loop reordering, tiling, replacing division with multiplication), (2) shared-memory parallelism with OpenMP (collapse, reductions, static scheduling; red-black Gauss-Seidel), (3) GPU acceleration with CUDA (persistent device memory, solver steps ported to kernels, Nsight-guided tuning). Profiled bottlenecks and scalability with perf, gprof, and Nsight.
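
A minimal sketch of how several of the CPU and OpenMP techniques named above can combine in one linear-solve step. It is an illustration, not the project's actual code. The names idx and lin_solve_rb, the (n+2)^3 grid layout with a fixed boundary layer, and the parameters a and c are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat index into an (n+2)^3 grid, x (i) fastest-varying.
inline std::size_t idx(int i, int j, int k, int n) {
    const std::size_t s = static_cast<std::size_t>(n) + 2;
    return static_cast<std::size_t>(i)
         + s * (static_cast<std::size_t>(j) + s * static_cast<std::size_t>(k));
}

// Hypothetical red-black Gauss-Seidel solve of
//   x = (x0 + a * sum of the 6 face neighbours) / c
// on the interior cells 1..n. Boundary cells are left untouched (assumption).
void lin_solve_rb(std::vector<float>& x, const std::vector<float>& x0,
                  float a, float c, int n, int iters) {
    const std::size_t s = static_cast<std::size_t>(n) + 2;
    assert(x.size() == s * s * s && x0.size() == s * s * s);

    // Division to multiplication: compute the reciprocal once, outside the loops.
    const float inv_c = 1.0f / c;

    for (int it = 0; it < iters; ++it) {
        // Cells with (i + j + k) % 2 == color only read cells of the other
        // color, so every update within one color pass is independent.
        for (int color = 0; color < 2; ++color) {
            // Loop order k, j, i keeps the innermost accesses contiguous.
            #pragma omp parallel for collapse(2) schedule(static)
            for (int k = 1; k <= n; ++k) {
                for (int j = 1; j <= n; ++j) {
                    // First interior i on this row that has the current color.
                    const int start = 1 + ((1 + j + k + color) % 2);
                    for (int i = start; i <= n; i += 2) {
                        const float nb =
                            x[idx(i - 1, j, k, n)] + x[idx(i + 1, j, k, n)] +
                            x[idx(i, j - 1, k, n)] + x[idx(i, j + 1, k, n)] +
                            x[idx(i, j, k - 1, n)] + x[idx(i, j, k + 1, n)];
                        x[idx(i, j, k, n)] = (x0[idx(i, j, k, n)] + a * nb) * inv_c;
                    }
                }
            }
        }
    }
}
```

The two-color split is what makes Gauss-Seidel safe to parallelize: a plain sweep carries a dependency from each cell to the next. Built without OpenMP enabled, the pragma is ignored and the same code runs serially.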