Performance of Direct Dense Linear System Solvers on the GPU
November 15 2004 Update
Milestones status for November, 15
-
Provide breakdown of bottlenecks.
As reported in the previous update, I suspected that my algorithm was texture-cache access bound. I wrote two different variations to the algorithm, referred to as notex and alwayscache. These do not actually produce correct results, but I use them to perform timings such that I can pinpoint the bottlenecks of the algorithm, aswell as empirical bounds on the efficiency achievable by the algorithm on a particular GPU. The first (notex) has all texture accesses removed, whereas in the second one (alwayscache), all texture accesses are to the same location, effectively guaranteeing texture cache hit. A head-to-head comparison with my previous timings is provided in the next graph and table. The numbers acknowledge that I am particularily texture cache bound. Indeed, looking at the sequence of texture accesses in the algorithm, it is obvious that the access pattern induces severe texture trashing. The effect is more apparent as the size of the matrix goes up, because texels in subsequent rows are further apart. The GPU (barely) matches up with the CPU at a matrix size of 1000x1000 elements, but lags behind in larger sizes. This is the main reason why I tried several approaches to improve the access pattern of the algorithm, as described in a next paragraph.
Note that the alwayscache bars give a lower bound, achievable for our algorithm and the given architecture, if all texture accesses would hit the cache. The notex bars indicate the texture bandwidth overhead. The numbers indicate that, theoretically, we should be able to compete with the CPU.
Timing comparison graphs, for different variants of the algorithm at different matrix sizes. The alwayscache bars give a lower bound on the time needed for our algorithm, if all texture accesses would hit the cache.
-
Optimize algorithm by precomputation of row division factors by seperate pass into separate pbuffer. This should save alot of divisions and some texture lookups.
This approach is currently being used in the main algorithm. It corresponds to step (1) of the algorithm described below.
-
Compare different row layouts, matrix packing and algorithm strategies in the number of passes
to assess texture cache efficiency vs. rendering pass overhead. Choose the best algorithm so far in terms of efficiency (while still correct).
-
Matrix-row to texture quad mapping.
This approach turned out to be much slower than the simpler, original row-to-row mapping. The idea was to improve the texture access pattern, given that we suspect the GPU texture cache lines are organized in a 2D layout. Nevertheless, the typical row size (1000-4000) yields quads of size 30x30 to 60x60, which is probably larger than the GPU texture cache anyway. The next approach is probably a better one anyway. As an aside, there is a considerable overhead caused by the increase of context switching (multiple pbuffers necessary) and from the use of multiple-render-to-target (MRT). This is probably the main reason for its bad performance.
-
Block processing.
Because the first algorithm processes the matrix row-by-row (because of the way the GPU rasterizes the full-screen quad), texture accesses in the eliminating row k are sequential and start over at the beginning of row k for each other row in the matrix. It is obvious that these accesses are not localized. Therefore, it would be better to process the elements in the matrix at iteration k a few columns at a time, for all rows, such that the texture fetches in the eliminating row k are localized in time and position, guaranteeing texture cache hits. We improve locality of the texture access pattern for fetches of the eliminating row (in the second part of the algorithm) by processing a limited number of columns instead of all columns at once, by rendering quad strips as illustrated in the following figure.

If we were to process only one column at a time, this would guarantee texture cache hit in the eliminating row k, but it would also lead to texture trashing when fetching the row factor aik. In row-by-row processing, this fetch is highly localized, but obviously not anymore in column processing.
Initially, I implemented the above approach by drawing the quad strips in immediate mode. That incurred a considerable transfer bandwidth overhead for large matrices with small block sizes, because of the high number of vertices to be transferred to the GPU. Therefore, the algorithm became 1000% slower, because one big quad consists of only 4 vertices, which can be sent to the GPU almost for free.
Then, I moved the quad strip vertices to vertex buffer objects, stored on the GPU. The figure below compares the performance boost. On the other hand, I expected to see a performance increase compared to row-order processing, which I didn't. The timings indicate that the performance converges to the row-order processing as the block size goes to the matrix size. This implies that block-order processing only adds overhead. This means that our attempt at increasing texture cache hits was unsuccessful with block-order processing.
As an aside, I suspect that the overhead induced by block-order processing is because we are hitting the maximum vertex processing rate of the NV40. NVidia claims 600M vertices/s. For a 2000x2000 matrix, blocksize 4x4, there are 500x1000 vertices. There are 2000 passes in the algorithm (one for each row). Therefore, the algorithm should take
(0.5M verts/pass * 2000 passes) / 600M (verts/s) = 1.67 s. For this particular matrix size, my timings on a 6800 GT show an overhead of 8.63 s (see noop column), ~5 times as slow as the number claimed for an 6800 Ultra.
Quadstrip rendering in immediate mode versus stored in vertex buffer objects. The 'noop' bars indicate the vertex processing and fill overhead. VBOs boost performance significantly, but seem to have a fixed overhead (as can be seen at larger block sizes).
-
Matrix-row to texture quad mapping.
-
Avoid context switches by use of double-buffered pbuffer.
Ping-pong between surfaces (front/back) of the same buffer instead of doing ping-pong between separate pbuffers. As I expected, the timings below acknowledge that I am not context-switch bound, as the use of double-buffered pbuffers only marginally increases efficiency.
-
Branching in the fragment program.
In order to simplify the second pass of original algorithm, I make use of the branching abilities of the fragment processor in the NV40. The NV40 provides true branching, so this approach will only work efficiently on the latest hardware. On previous hardware, all branches are evaluated before the final returned value is assigned, based on the branch outcome.
I have adapted the
rowopfragment program to include a branch on the row being processed. In the new version, the fragment is killed if the fragment belongs to the row k that performs the elimination of the kth column (ie. in the kth iteration of the algorithm; code snippet below). This approach simplifies the second pass of the algorithm. We can now render a full screen quad, instead of two separate quads (above and below eliminating row).The table and graph below show that I was able to achieve a 20% speedup by applying the described optimization.
- 4 component wide efficiency.
It is widely acknowledged that the main advantage of the GPU's SIMD pixel processing architecture is the ability to perform operations on 4 components, basically for free. I have compared the efficiency of my algorithm by timing the performance for two versions of the algorithm. The first version uses 1 separate component, whereas the second version works on 4 components simultaneously. As is shown in the resulting graph, the previous stated assumption does not hold for any algorithm. For one, I don't believe that going from a 1-component algorithm to a 4-component algorithm is for free, when the algorithm is texture bandwidth bound. I believe this is definitely the case in my algorithm, as the graphs indicate that the 4-component algorithm is 4-5 times slower than the simpler 1-component one. This acknowledges the observations in [4].

-
Extend the framework such that it can accept the necessary chemical kinetics data sets and pack the large number of small 12x12 (make it nxn) matrices into one large matrix. Adapt the algorithm such that it can deal with different elimination factors for the different matrices, all in the same rendering pass.
The current implementation is not competitive with the CPU, on large matrices. Given the additional complexity of the algorithm by packing the small 12x12 systems into one large matrix, I expect this only to get worse. Therefore, I have opted to put this deliverable further down the road, possibly not even within this project's term. It would be better to improve the algorithm for large systems first, in order to pinpoint the major bottlenecks in the algorithm and/or GPU architecture that prevent competitivity with the CPU.
The algorithm
Here's a high-level description of the algorithm to compute the Gauss-Jordan reduction without pivoting on the GPU at this point, in pseudocode. The algorithm solves an NxN system by Gauss Elimination of a Nx(N+1) matrix, where the last column is the right-hand side of the system. I allocate a double-buffered Nx(N+1) pbuffer, no context switching is required.
for k = 1 to N (1) normalize k'th row (2) eliminate k'th column element a_ik of upper (k-1) rows (3) eliminate k'th column element a_ik of lower (N-k-1) rows switch 'target' and 'source' buffer end for
(1) normalize k'th row: // divide the elements in the k'th row by its k'th element // -> makes the k'th element equal to 1' glDrawBuffer(target buffer) bind source buffer as texture bind 'divide' fragment program (a) render k'th row // copy back to source buffer glDrawBuffer(source buffer) bind target buffer as texture bind 'copy' fragment program (b) render k'th row (2) eliminate k'th column element a_ik of upper (k-1) rows // row'(i) = row(i) - row(i)[k]*row(k)[k] glDrawBuffer(target buffer) bind source buffer as texture bind 'rowop' fragment program (c) render quad from (0,0) to (k-2, N) (3) eliminate k'th column element a_ik of lower (N-k-1) rows render quad from (k, 0) to (N-1, N)
These are the 3 fragment programs used:
void divide( in float2 coord : WPOS, uniform samplerRECT input, uniform float2 columncoord, out float result : COLOR0 ) { float diag = f1texRECT(input, float2(columncoord.x, coord.y)); float a = f1texRECT(input, coord); result = a / diag; }
void copy( in float2 coord : WPOS, uniform samplerRECT input, out float result : COLOR0 ) { result = f1texRECT(input, coord); }
void rowop( in float2 coord : WPOS, uniform samplerRECT input, uniform float2 columncoord, out float result : COLOR0 ) { float aik = f1texRECT(input, float2(columncoord.x, coord.y)); float fkj = f1texRECT(input, float2(coord.x, columncoord.y)); result = aik * fkj; // separate from first computation to account for latency float a = f1texRECT(input, coord); result = a - result; }
As described in the previous section, I have adapted the rowop fragment program to include a branch on the row being processed. This line was added to the beginning of the rowop fragment program, to yield an adapted (c') fragment program:
Because of this change, part (2) and (3) can now be performed simultaneously in pass (2'):if (columncoord.x == coord.y) discard;
(2') eliminate k'th column element a_ik of upper (k-1) and lower (N-k-1) rows // row'(i) = row(i) - row(i)[k]*row(k)[k] glDrawBuffer(target buffer) bind source buffer as texture bind 'rowop' fragment program (c') render quad from (0,0) to (N-1, N)
Summary of the results
This table summarizes the different experiments I have done so far. The accompanying graphs illustrate some of my conclusions from the previous sections.
As a final remark, the notex bars in the graphs give a estimate for the fill overhead on the GPU, an upper bound on the performance of Gauss-Jordan elimination on the GPU. This seems to be at approximately 200% the performance of the CPU, but implementing Gauss-Jordan without texture fetches is ofcourse impossible. It remains to be seen how close we can get to that upper bound. I propose a framebuffer type of algorithm below, but it might be necessary to look at linear system solvers that have a lower theoretical complexity than Gauss-Jordan elimination.
| 1000x1000 | 1500x1500 | 2000x2000 | 2500x2500 | 3000x3000 | 3500x3500 | 4000x4000 | |
| Matlab | 0.571 | 1.622 | 3.350 | 6.140 | 10.730 | 15.510 | 22.800 |
| Nvidia 6800 Ultra | 0.825 | 2.738 | 6.440 | 12.530 | 21.680 | 34.390 | 51.300 |
| Nvidia 6800 GT | 0.967 | 3.089 | 7.270 | 14.150 | 24.450 | 38.780 | 57.840 |
| Nvidia 6800 GT (doublebuffered) | 0.928 | 3.156 | 7.536 | 14.148 | 24.443 | 38.767 | 57.830 |
| Nvidia 6800 GT (branched, doublebuffered) | 0.748 | 2.475 | 5.864 | 11.393 | 19.630 | 31.126 | 47.184 |
| Nvidia 6800 GT (notex) | 0.958 | 1.397 | 1.946 | 2.903 | 4.981 | 7.874 | 11.717 |
| Nvidia 6800 GT (alwayscache) | 0.965 | 1.441 | 2.932 | 5.684 | 9.780 | 15.480 | 23.080 |
| Branch improvement | 0.194 | 0.216 | 0.222 | 0.195 | 0.197 | 0.197 | 0.184 |
| Performance ratio (Matlab) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Performance ratio (Ultra) | 1.44 | 1.69 | 1.92 | 2.04 | 2.02 | 2.22 | 2.25 |
| Performance ratio (GT) | 1.69 | 1.90 | 2.17 | 2.30 | 2.28 | 2.50 | 2.54 |
| Performance ratio (doublebuffered) | 1.63 | 1.95 | 2.25 | 2.30 | 2.28 | 2.50 | 2.54 |
| Performance ratio (branched, doublebuffered) | 1.31 | 1.53 | 1.75 | 1.86 | 1.83 | 2.01 | 2.07 |
| Performance ratio (notex GT) | 1.68 | 0.86 | 0.58 | 0.47 | 0.46 | 0.51 | 0.51 |
| Performance ratio (alwayscache GT) | 1.69 | 0.89 | 0.88 | 0.93 | 0.91 | 1.00 | 1.01 |
Overview of timings, for different architectures and algorithm variants. Times for Gauss-Jordan elimination, in seconds, for the indicated matrix sizes.

Timing comparison graphs, for different architectures and algorithms variants. Times for Gauss-Jordan elimination, in seconds, for the indicated matrix sizes. Note that the 'alwayscache' bars give a lower bound, achievable for our algorithm and the given architecture, if all texture accesses would hit the cache. Theoretically, we should be able to compete with the CPU.

Relative performance comparison graphs, CPU (Matlab) versus the different architectures and algorithms variants.
Up Next: updated deliverables for December, 15
Given the discussion above, here's an updated list of my deliverables for December, 15, 2004.
- Investigate vertex processing limitation issue. As reported above, I suspect we are hitting the peak vertex processing rate at some point when rendering a large number of quad strips. Nevertheless, from the preliminary calculations and given the numbers NVidia provides for the NV40, it seems we are only at one fifth of peak vertex processing performance. Possibly there is still another issue (possibly bandwidth related).
- Map to framebuffer operations. Even if we are able to approximate the ideal case where all texture fetches hit the cache, we would only barely match the CPU. Another approach might be to make use of the register combiner features of the GPU to perform part of (only subtraction) or all of the row operation (subtraction and multiplication). The hope is that we will get a performance boost by letting the GPU do what it's supposed to be good at, ie. framebuffer operations. The drawback of this approach is the limitation to FP16 values.
- Report. Write a 4-8 page report, in conference paper style. Include previous work, new results and contributions. Highlight limitations in GPU architecture. Suggest new hardware feature requirements, based on experiences and comparisons of the different algorithmic techniques.
References
-
Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeff Bolz, Ian Farmer, Eitan Grinspun and Peter Schroder, ACM Transactions on Graphics 2003 -
Linear algebra operators for GPU implementation of numerical algorithms
Jens Kruger and Ruediger Westermann, ACM Transactions on Graphics 2003 - GPGPU.org forum post: Evaluation Query
-
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan, ACM Graphics Hardware 2004 -
Timing Comparisons of Mathematica, MATLAB, R, S-Plus, C & Fortran
Ian McLeod and Hao Yu, Technical report, available online, March 2, 2002