Title page for ETD etd-12242010-124006


Type of Document Dissertation
Author Belgin, Mehmet
Author's Email Address mehmetb@vt.edu
URN etd-12242010-124006
Title Structure-based Optimizations for Sparse Matrix-Vector Multiply
Degree PhD
Department Computer Science
Advisory Committee
Advisor Name Title
Back, Godmar V. Committee Co-Chair
Ribbens, Calvin J. Committee Co-Chair
Cameron, Kirk W. Committee Member
Gugercin, Serkan Committee Member
Sandu, Adrian Committee Member
Keywords
  • Code Generators
  • Vectorization
  • Sparse
  • SpMV
  • SMVM
  • Matrix Vector Multiply
  • PBR
  • OSF
  • thread pool
  • parallel SpMV
Date of Defense 2010-12-14
Availability unrestricted
Abstract
This dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-Vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speed because they are memory bound due to their low flops:byte ratio, access memory irregularly, and exhibit poor instruction-level parallelism (ILP) due to inefficient pipelining. We focus in particular on improving the flops:byte ratio, which is the main limiter of performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture-level optimizations to further improve performance.
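
For orientation, the CSR baseline that both techniques are compared against, and the flops:byte argument above, can be sketched as follows. This is illustrative Python, not code from the dissertation. The function names, the 8-byte value and 4-byte index sizes, and the traffic model (a square matrix, with each array streamed from memory exactly once) are assumptions made only for this example.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A*x for a matrix A stored in CSR form."""
    nrows = len(row_ptr) - 1
    y = [0.0] * nrows
    for i in range(nrows):
        acc = 0.0
        # Each nonzero needs one value load, one index load,
        # and one irregular (indexed) load from x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_flops_per_byte(nnz, nrows, value_bytes=8, index_bytes=4):
    """Rough flops:byte estimate for CSR SpMV on a square matrix.

    Assumption: values, col_idx, row_ptr, x and y are each moved
    through memory exactly once (no cache reuse is modeled).
    """
    flops = 2 * nnz  # one multiply and one add per nonzero
    bytes_moved = (
        nnz * (value_bytes + index_bytes)   # values + col_idx
        + (nrows + 1) * index_bytes         # row_ptr
        + 2 * nrows * value_bytes           # read x, write y
    )
    return flops / bytes_moved


# 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Under these assumptions the ratio can never exceed 2 flops per 12 bytes, however many nonzeros the matrix has, which is why shrinking the index data is the lever both techniques pull.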

The Operation Stacking Framework (OSF) stacks the problems of a large ensemble computation, all of which run the same sparse kernel on an identical matrix structure, so that they share a single copy of the indexing information, which significantly reduces memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate the performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization.
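
The idea of sharing one copy of the index data across an ensemble can be sketched as below. This is a hypothetical illustration in Python, not the OSF implementation. The interleaved layout (one list of `m` values per nonzero position) and the name `stacked_csr_spmv` are assumptions chosen for clarity.

```python
def stacked_csr_spmv(stacked_values, col_idx, row_ptr, stacked_x, m):
    """Multiply m matrices that share one sparsity structure by m vectors.

    stacked_values[k][p] is the k-th nonzero of problem p.
    stacked_x[j][p] is entry j of the input vector of problem p.
    Returns y with y[i][p] = (A_p * x_p)[i].
    """
    nrows = len(row_ptr) - 1
    y = [[0.0] * m for _ in range(nrows)]
    for i in range(nrows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # index read once, reused by all m problems
            vals = stacked_values[k]
            xj = stacked_x[j]
            for p in range(m):
                y[i][p] += vals[p] * xj[p]
    return y


# Two problems on the structure of [[*, 0, *], [0, *, 0], [*, 0, *]].
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
stacked_values = [[4.0, 8.0], [1.0, 2.0], [3.0, 6.0], [2.0, 4.0], [5.0, 10.0]]
stacked_x = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = stacked_csr_spmv(stacked_values, col_idx, row_ptr, stacked_x, 2)
```

In this layout the cost of reading `col_idx` and `row_ptr` is amortized over `m` problems, and the innermost loop runs over contiguous data.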

Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned code that uses SSE vectorization and software prefetching. PBR uses a multiple linear regression performance model to identify a block size that achieves optimal or near-optimal performance. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We weigh PBR’s overhead against its benefits and show that PBR is beneficial for many applications that repeatedly call the SMVM kernel for the same matrix structure.
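
The per-pattern code generation idea can be sketched as below. This is a hypothetical Python stand-in, not the PBR code generator: the real generator is described as emitting tuned code with SSE vectorization and prefetching, whereas this sketch only builds an unrolled Python function with `exec`. All names and the storage layout (`blocks_by_pattern`, `values_by_pattern`) are assumptions for illustration.

```python
def generate_block_kernel(pattern):
    """Build an unrolled kernel for one block nonzero pattern.

    pattern is a tuple of (row_offset, col_offset) pairs inside a block.
    The offsets are baked into the generated code, so no per-element
    index is read at run time. The kernel returns the number of values
    it consumed.
    """
    lines = ["def kernel(vals, v0, x, x0, y, y0):"]
    for n, (r, c) in enumerate(pattern):
        lines.append(f"    y[y0 + {r}] += vals[v0 + {n}] * x[x0 + {c}]")
    lines.append(f"    return {len(pattern)}")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["kernel"]


def pbr_spmv(blocks_by_pattern, values_by_pattern, x, nrows):
    """y = A*x, with only one (row, col) origin stored per block."""
    y = [0.0] * nrows
    for pattern, origins in blocks_by_pattern.items():
        kernel = generate_block_kernel(pattern)
        vals = values_by_pattern[pattern]
        v0 = 0
        for row0, col0 in origins:
            v0 += kernel(vals, v0, x, col0, y, row0)
    return y


# 4x4 example built from 2x2 blocks:
# [[1, 0, 0, 5], [0, 2, 6, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
diag = ((0, 0), (1, 1))
anti = ((0, 1), (1, 0))
blocks_by_pattern = {diag: [(0, 0), (2, 2)], anti: [(0, 2)]}
values_by_pattern = {diag: [1.0, 2.0, 3.0, 4.0], anti: [5.0, 6.0]}
y = pbr_spmv(blocks_by_pattern, values_by_pattern, [1.0, 2.0, 3.0, 4.0], 4)
```

Here six nonzeros are addressed through three block origins instead of six column indices. A practical version would generate each kernel once at conversion time rather than inside the multiply.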

Files
  Filename                  Size     Approximate Download Time (Hours:Minutes:Seconds)
                                     28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  Belgin_Mehmet_D_2010.pdf  4.92 Mb  00:22:45    00:11:42   00:10:14      00:05:07       00:00:26
