Performance of Direct Dense Linear System Solvers on the GPU
Nicolas Galoppo von Borries, COMP290 class project, Fall 2004
Problem statement
Although there is a definite demand for GPU-based algorithms for scientific applications [4], they are still in their infancy when it comes down to solving general linear algebra routines (dense linear systems, eigendecomposition, SVDs, ...). The question is if GPUs can be useful for scientific applications (i.e. real simulation and not just "visual simulation"), therefore we want to address these fundamental problems. In the past, researchers have looked into solving sparse matrix problems [1] and other linear systems as part of more applied physical systems [2]. The hardest component of all of these algorithms was basic matrix-vector multiplication. It has also been reported that matrix-matrix multiplication is particularily inefficient on the GPU [5], mainly due to bad localization of such operations, leading to texture cache issues. In this particular project, we would like to tackle the fundamental problem of direct linear system solving through Gauss elimination on the GPU. In simple Gauss-Jordan elimination, the fundamental operations are basic row operations, unlike in matrix-matrix multiplication. We hope that this observation will help to find more GPU-friendly algorithms. We target a specific application, requiring the solution of a large number of small dense linear systems.
Large 3D compressible fluid simulations are commonly performed by so-called CFD solvers on the CPU. The solver can be fairly complicated, but we would like to assess the viablility of performing a specific part of the computation on the GPU, offloading the CPU for other tasks in the meanwhile. The task at hand is to simulate the the kinetic reactions at specific locations in the large 3D simulation grid. For each voxel in the grid, the GPU has to solve a small (12x12) linear system provided by the main solver and return the solution vector (for each voxel). The idea is to pack a large number of those systems together such that they can be computed on the GPU efficiently in the least number of passes as possible.
For the simplest plume thermochemistry sets, there exist about 12 species participating in 16 reactions. On the CPU, the work involved is in constructing the elements of a 12 x 12 array of influence coefficients. Some of these reactions can involve one or three species one the reactant or product side, but the form remains the same. As each specie affects the effective composition of the other through the reaction equations, this solution is based solved implicitly as a linear set of equations, in this case a 12x12 matrix:

On the GPU, for each time step at each voxel, it involves the solution of this series of linear algebraic equations. This is a extremely data parallel problem, typically involving several thousand cells on the compuational grid for several thousand time iterations.
Goals
The goal of the project is not real-time simulation, but acceleration. At first, we would like to assess whether the fluid solver can be offloaded up by using GPUs in the near future. Even if we can implement an algorithm that can compete with CPU implementations such as the ATLAS benchmark, that would be a great win. We want to know what kind of research challenges it would bring, what would be the ideal data representation, and which features of the current hardware are the most useful in this kind of problem.
Secondly, the goal of the project is to investigate which new hardware features, in terms of functionality, memory, cache and pipeline architecture are going to be necessary to further enhance the parallel nature of the GPU processor such that it can surpass the efficiency of the CPU in terms of these basic operations.
The advantage of tackling the fundamental problem of matrix solving, has three aspects:
- Generality. We make abstraction of the system at hand. Ideally, we would like to build some kind of library of linear algebra routines on the GPU, although this is beyond the scope of this project. Once the library exists, the algorithms can be plugged into any physical simulator with virtually no modification.
- Realism. More realistic simulations are now possible, because no simplifications of the physical model must be made to map it to the GPU, as has been done for physical simulations on the GPU in the past. This is especially worthwhile in scientific applications
GPGPU has moved beyond the days of just getting the basic 'toy' simulation working. The GPU is just a parallel processor; the next (difficult) steps are largely in making it easier to program non-graphics algorithms. The big question will be whether GPUs can be useful for anything in the "real simulation" world, and not just visual simulation.
Milestones
These are the important milestones I would like to reach:
- October 15:
- Implement Gauss elimination of large (NxN) matrices, no pivoting (done!)
- Remarks:
- The basic algorithm works, so far we haven't observed precision issues due to the single precision math on the GPU
- The first implementation used N pbuffers to store each row. The idea was to localize row operations in small sqrt(N)xsqrt(N) pbuffers, such that texture access would be localized (thus more efficient). Ofcourse, the main problem with this approach are the context switches. The context switches (once for every row operation) are killing the performance.
- Currently, I am working on an alternative representation, having only one pbuffer (and an additional spare 'working' surface) to store the matrix, eliminating the expensive context switches. The rows shouldn't be stored as 1XN parts of the pbuffer, but as sqrt(N)xsqrt(N) patches of the large pbuffer.
- Get some preliminary timings compared to ATLAS libraries, determine correctness (done!)
- Remarks:
- Matlab 6.5 was uses the ATLAS libraries (with Intel SSE) and has been reported as one of the fastest linear algebra solvers around nowadays. It was recently demonstrated [6] that in certain circumstances the widely held view that you can always dramatically improve on the cpu time required for lengthy computations by using compiled C or Fortran code instead of advanced QPEs such as Matlab is wrong. I used Matlab 6.5 as my reference benchmark.
- I timed a first implementation of the straight-forward, non-optimized Gauss-Jordan Elimination algorithm on the NVidia 6800 Ultra. The table below shows that my implementation is about 1.6 times slower than Matlab on a 2000x2000 matrix. All timings were done disregarding matrix construction and data uploading, for Matlab aswell as for the GPU.
1000x1000 2000x2000 2500x2500 3000x3000 Matlab 0.571 s 3.35 s 6.14 s 10.73 s NVidia 6800 Ultra 0.612 s 5.25 s 10.06 s* 17.35 s* Performance Ratio 1.07 1.57 1.64 * 1.6 * Computation times of Gauss Elimination, Matlab vs. NVidia 6800 Ultra, in function of matrix size.
* Due to yet unknown reasons, my implementation breaks down at the 2000x2000 border, yielding unrealistically large computation times. I have determined the timings with * by removing one line of code, being aglCopyTexSubImage()call. I am currently investigating the bug (Driver related? Graphics memory filled so the driver does the texture copy over the system bus?). - I haven't been able to determine the bottleneck on the GPU yet, but interestingly enough, the performance ratio results in the table above seem to indicate that we are losing against the CPU because of cache limitations on the NVidia. The relative advantage of the CPU increases with matrix size, indicating that we are hitting the cache size upper limit earlier on the GPU than on the CPU. More investigation will be necessary.
- Next, I will optimize the algorithm using some of the following techniques:
- Store the rows in the textures in a more texture-cache friendly layout. Use rectangular regions for the rows instead of lines, motivated by the rumour that GPU texture caches might be organized in a 2D fashion.
- Make sure that my working pbuffers are in on different surfaces of the same context, eliminating the expensive (?) context switch for each row operation.
- Make use of the fact that, as the algorithm goes down the rows, only the lower right part of the matrix still needs updating.
- November 15: The full update report has been posted here.
- Optimize algorithm by precomputation of row division factors by seperate pass into separate pbuffer. This should save alot of divisions and some texture lookups (done!)
- Avoid context switches by use of double-buffered pbuffer. Ping-pong between surfaces (front/back) of the same buffer instead of doing ping-pong between separate pbuffers. (done!)
-
Compare different row layouts, matrix packing and algorithm strategy in the number of passes
to assess texture cache efficiency vs. rendering pass overhead. Choose the best algorithm so far in terms of efficiency (while still correct).
- Matrix-row to texture quad mapping (done).
- Block processing (done).
- Improve texture access pattern by quad strip / triangle strip rendering.
- 4 component wide efficiency. Will it lead to 4-way boost ?!? (done)
- FP16 efficiency.
- Extend the framework such that it can accept the necessary chemical kinetics data sets and pack the large number of small 12x12 (make it nxn) matrices into one large matrix. Adapt the algorithm such that it can deal with different elimination factors for the different matrices, all in the same rendering pass.
- December 15: [Updated Nov 15, 2005]
Given the discussion in the November update, here's an updated list of my deliverables for December, 15, 2004.
- Investigate vertex processing limitation issue. As reported, I suspect we are hitting the peak vertex processing rate at some point when rendering a large number of quad strips. Nevertheless, from the preliminary calculations and given the numbers NVidia provides for the NV40, it seems we are only at one fifth of peak vertex processing performance. Possibly there is still another issue (possibly bandwidth related).
- Map to framebuffer operations. Even if we are able to approximate the ideal case where all texture fetches hit the cache, we would only barely match the CPU. Another approach might be to make use of the register combiner features of the GPU to perform part of (only subtraction) or all of the row operation (subtraction and multiplication). The hope is that we will get a performance boost by letting the GPU do what it's supposed to be good at, ie. framebuffer operations. The drawback of this approach is the limitation to FP16 values.
- Report. Write a 4-8 page report, in conference paper style. Include previous work, new results and contributions. Highlight limitations in GPU architecture. Suggest new hardware feature requirements, based on experiences and comparisons of the different algorithmic techniques.
References
-
Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeff Bolz, Ian Farmer, Eitan Grinspun and Peter Schroder, ACM Transactions on Graphics 2003 -
Linear algebra operators for GPU implementation of numerical algorithms
Jens Kruger and Ruediger Westermann, ACM Transactions on Graphics 2003 - GPGPU.org forum post: Evaluation Query
-
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan, ACM Graphics Hardware 2004 -
Timing Comparisons of Mathematica, MATLAB, R, S-Plus, C & Fortran
Ian McLeod and Hao Yu, Technical report, available online, March 2, 2002