Performance of Direct Dense Linear System Solvers on the GPU

November 15 2004 Update

Milestones status for November, 15

The algorithm

Here's a high-level description of the algorithm to compute the Gauss-Jordan reduction without pivoting on the GPU at this point, in pseudocode. The algorithm solves an NxN system by Gauss Elimination of a Nx(N+1) matrix, where the last column is the right-hand side of the system. I allocate a double-buffered Nx(N+1) pbuffer, no context switching is required.

for k = 1 to N (1) normalize k'th row (2) eliminate k'th column element a_ik of upper (k-1) rows (3) eliminate k'th column element a_ik of lower (N-k-1) rows switch 'target' and 'source' buffer end for
(1) normalize k'th row: // divide the elements in the k'th row by its k'th element // -> makes the k'th element equal to 1' glDrawBuffer(target buffer) bind source buffer as texture bind 'divide' fragment program (a) render k'th row // copy back to source buffer glDrawBuffer(source buffer) bind target buffer as texture bind 'copy' fragment program (b) render k'th row (2) eliminate k'th column element a_ik of upper (k-1) rows // row'(i) = row(i) - row(i)[k]*row(k)[k] glDrawBuffer(target buffer) bind source buffer as texture bind 'rowop' fragment program (c) render quad from (0,0) to (k-2, N) (3) eliminate k'th column element a_ik of lower (N-k-1) rows render quad from (k, 0) to (N-1, N)

These are the 3 fragment programs used:

void divide( in float2 coord : WPOS, uniform samplerRECT input, uniform float2 columncoord, out float result : COLOR0 ) { float diag = f1texRECT(input, float2(columncoord.x, coord.y)); float a = f1texRECT(input, coord); result = a / diag; }
void copy( in float2 coord : WPOS, uniform samplerRECT input, out float result : COLOR0 ) { result = f1texRECT(input, coord); }
void rowop( in float2 coord : WPOS, uniform samplerRECT input, uniform float2 columncoord, out float result : COLOR0 ) { float aik = f1texRECT(input, float2(columncoord.x, coord.y)); float fkj = f1texRECT(input, float2(coord.x, columncoord.y)); result = aik * fkj; // separate from first computation to account for latency float a = f1texRECT(input, coord); result = a - result; }

As described in the previous section, I have adapted the rowop fragment program to include a branch on the row being processed. This line was added to the beginning of the rowop fragment program, to yield an adapted (c') fragment program:

if (columncoord.x == coord.y) discard;
Because of this change, part (2) and (3) can now be performed simultaneously in pass (2'):
(2') eliminate k'th column element a_ik of upper (k-1) and lower (N-k-1) rows // row'(i) = row(i) - row(i)[k]*row(k)[k] glDrawBuffer(target buffer) bind source buffer as texture bind 'rowop' fragment program (c') render quad from (0,0) to (N-1, N)

Summary of the results

This table summarizes the different experiments I have done so far. The accompanying graphs illustrate some of my conclusions from the previous sections.

As a final remark, the notex bars in the graphs give a estimate for the fill overhead on the GPU, an upper bound on the performance of Gauss-Jordan elimination on the GPU. This seems to be at approximately 200% the performance of the CPU, but implementing Gauss-Jordan without texture fetches is ofcourse impossible. It remains to be seen how close we can get to that upper bound. I propose a framebuffer type of algorithm below, but it might be necessary to look at linear system solvers that have a lower theoretical complexity than Gauss-Jordan elimination.

1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 3500x3500 4000x4000
Matlab 0.571 1.622 3.350 6.140 10.730 15.510 22.800
Nvidia 6800 Ultra 0.825 2.738 6.440 12.530 21.680 34.390 51.300
Nvidia 6800 GT 0.967 3.089 7.270 14.150 24.450 38.780 57.840
Nvidia 6800 GT (doublebuffered) 0.928 3.156 7.536 14.148 24.443 38.767 57.830
Nvidia 6800 GT (branched, doublebuffered) 0.748 2.475 5.864 11.393 19.630 31.126 47.184
Nvidia 6800 GT (notex) 0.958 1.397 1.946 2.903 4.981 7.874 11.717
Nvidia 6800 GT (alwayscache) 0.965 1.441 2.932 5.684 9.780 15.480 23.080
Branch improvement 0.194 0.216 0.222 0.195 0.197 0.197 0.184
Performance ratio (Matlab) 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Performance ratio (Ultra) 1.44 1.69 1.92 2.04 2.02 2.22 2.25
Performance ratio (GT) 1.69 1.90 2.17 2.30 2.28 2.50 2.54
Performance ratio (doublebuffered) 1.63 1.95 2.25 2.30 2.28 2.50 2.54
Performance ratio (branched, doublebuffered) 1.31 1.53 1.75 1.86 1.83 2.01 2.07
Performance ratio (notex GT) 1.68 0.86 0.58 0.47 0.46 0.51 0.51
Performance ratio (alwayscache GT) 1.69 0.89 0.88 0.93 0.91 1.00 1.01

Overview of timings, for different architectures and algorithm variants. Times for Gauss-Jordan elimination, in seconds, for the indicated matrix sizes.

Timing comparison graphs, for different architectures and algorithms variants. Times for Gauss-Jordan elimination, in seconds, for the indicated matrix sizes. Note that the 'alwayscache' bars give a lower bound, achievable for our algorithm and the given architecture, if all texture accesses would hit the cache. Theoretically, we should be able to compete with the CPU.

Relative performance comparison graphs, CPU (Matlab) versus the different architectures and algorithms variants.

Up Next: updated deliverables for December, 15

Given the discussion above, here's an updated list of my deliverables for December, 15, 2004.

References

  1. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
    Jeff Bolz, Ian Farmer, Eitan Grinspun and Peter Schroder, ACM Transactions on Graphics 2003
  2. Linear algebra operators for GPU implementation of numerical algorithms
    Jens Kruger and Ruediger Westermann, ACM Transactions on Graphics 2003
  3. GPGPU.org forum post: Evaluation Query
  4. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
    Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan, ACM Graphics Hardware 2004
  5. Timing Comparisons of Mathematica, MATLAB, R, S-Plus, C & Fortran
    Ian McLeod and Hao Yu, Technical report, available online, March 2, 2002