GPGPUGeneral-Purpose computation onGraphics Processing UnitsMonday, April 13, 20091

OUTLINE Thedevelopment of GPUs: From GPU to GPGPU GPUhardware: NVIDIA 200-Series case study GPUsoftware: A look into the driver GPGPUsoftware: Programming GPUs todayNVIDIA CUDA case study Ongoingand expected developments: The future of GPGPU (?)Monday, April 13, 20092

GPUS FOR GRAPHICSThe leading cause for innovation in GPU architectureMonday, April 13, 20093

GROWTH GPUmarket grows despite recession (Tom's Hardware) About6% rate, backing 50G video game market (ars technica) Super-exponential NVIDIAGeForce GTX 280 - 1400 million Terra-FLOP Nottransistor count growthcapacityjust for Games: Multimedia, Physics, HPC, . (?)Monday, April 13, 20094

A GRAPHICS CARDNVIDIA GTX280, 1GB GDDR3Monday, April 13, 20095

A COMPLEX GFX SCENEin-game screenshot fromGRID (Codemasters)Monday, April 13, 20096

3D RENDERING Modeling Animation Rendering Ray-tracing: Algorithmforphotorealistic 2Drepresentation of 3D scenes Rasterization: Algorithmforefficient real-time 3DrenderingMonday, April 13, 20097

GRAPHICSPIPELINEMapping 3D-world to screenthrough a GFX-APIMonday, April 13, 20098

SHADER PROGRAMMING Graphics-API-specific Fulllanguage implementation (Cg/GLSL/HLSL)control flow support Operate Shaderon read-only textures (or output of previous stage)model 4.0 : Vertex / Geometry / Pixel Directx11 Shipintroduces two more shaders (programmable stages)intermediate language “binaries”Monday, April 13, 20099

GFX APICHALLENGES Task(function) / Data parallelism GFX Pipeline / many-corearchitecture Scene-dependent Bottlenecks Commonworkload Unify shaders fixed function unitsoperations SIMD-like execution Massivememory accesses with/out patterns Memory access/hierarchy innovationMonday, April 13, 200910

GPU CHARACTERISTICS SIMD: SingleInstruction, Multiple Data Example: RGBA, XYZW Many-core SIMD Wide SIMD Lots of ALUscontrol flow predicated execution Hardware NeedsMonday, April 13, 2009quadruplesmultithreading high processor utilizationcompile-time, static information to be realized11

SAMPLE SHADER EXECUTIONExecute a 32-wide SIMD instruction / threadswitch to another on every 1 to 4 cyclesMonday, April 13, 200912

GPU MEMORY SYSTEM Focuson high bandwidth rather than low latency need Flatto transfer massive textures between host and devicebut exposed memory hierarchy Limited Bigor programmer-manageable cachingSIMD-wide register files MemoryMonday, April 13, 2009access coalescing necessary for performance13

GPU EXECUTION SYSTEM Scheduleand assign threadsto maintain pipeline flow Resizeand manage buffersand reorder memory accessto avoid costly collisions Enablecontrol flow withleast possible overheadMonday, April 13, 2009 Loadbalancing underpipeline limitations/enhancements: ex. occludedfragmentsshading ex. bottlenecks, reseedingpossibilities14

GPUS FOR GPGPUHow GPU architecture transforms to enableGeneral Purpose programmingMonday, April 13, 200915

CASE STUDY: NVIDIA GTX280The NVIDIA 200-Series representative,built with GPGPU in mindMonday, April 13, 200916

GT200 CHARACTERISTICS Monolithic die 240-Stream-Processor Array 80/80 Texture Address /Filtering 1GB GDDR3 Frame Buffer 1.4B Transistor Count TSMC 65nm ManufacturingProcess 350 Price 32 ROPs 602MHz Core Clock 1296MHz Shader ClockMonday, April 13, 20091107MHz Memory Clock512-bit Memory Bus Width17

BUILDING BLOCKS 10x TPC: Texture/ProcessorCluster 3x SM: StreamingMultiprocessor (aka ThreadProcessor Arrays) 8x SP: StreamingProcessors (aka ShaderCores or ThreadProcessors)Monday, April 13, 200918

OVERVIEW OFPROCESSING UNITS Blockscheduler issues blocks ofthreads in round robin fashionto SMs real-timeaccounting for loadbalancing and resourceexploitation SMs SPsfetch instructionsexecuteMonday, April 13, 200919

SOME TERMINOLOGY Core: SP Multiprocessor: SM Thread: thinkof every instruction on a SIMD unit as a differentthread (ex. 8 quads of XYZW pixels 32 threads) Warp: aset of 32 threads Scoreboarding: Asimpe technique to issue data-independentinstructions dynamically, out of orderMonday, April 13, 200920

STREAMING MULTIPROCESSORKey componentsMonday, April 13, 200921

SM AND THREADS Highly 1024threaded, single-issue processor with 8-wide SIMDthreads concurrently: 32 threads in 32 warps in flight Scheduledblocks of no more than 512 threads Oneto two entries per warp in-flight in the instruction buffer Issuewarps to SPs using scoreboard (no full renaming) Issuelogic prioritizes in close proximity to ICacheMonday, April 13, 200922

SM AND MEMORY 16K register file partitioned across SPs Each SP has 2K entry to be used by 128 threads, organized in 16 or24 banks 4-128 entries/thread, statically allocated at compile time16K shared memory 4096 entries organized in 16 banks with 32-bank width Support atomic instructions across threads of a block (ex. CAS)Monday, April 13, 200923

SIMT SIMT: Single NoInstruction Multiple Threadspeculation; wait till address resolution and continue Widthis not visible architecturally, unlike SIMD which demandspacking data into vectors N-waydivergent “gracefully” executes serially Threadsare independent; no register sharing, only shared memorysharing (though warp voting is allowed)Monday, April 13, 200924

SM EXECUTION UNITS Executesat shader-clock speed ; slower core-clock for controllogic and storage arrays Fused-MAD, singleFPU for double precision, Special Function Unit(SFU) CPI 4 Dualon ALUissue illusion:Monday, April 13, 200925

MEMORY PIPELINE 4Balignment for coalescing LD/SDissued in SMs,executed in special units Memoryaccesses issued inwarp, executed in half-warp 128-portregister-file/sharedmemory to sustain servicerate at low core clock speedMonday, April 13, 200926

CPU VS GPU ROUND 1Many transistors for cache Most transistors for PUsControl flow optimizations:Out of order cores, branchpredictors Little control flow, sharedamong chunks of PUs Fixed function units, Lately IEEE-compliant,limited double FP support Medium clock/memory speeds IEEE compliant,double FP precision High clock/memory speedsMonday, April 13, 200927

CPU VS GPU 2-3Cache levels, at least I/Ocoherent CPU DRAM LowSMPsDimms, ECClatency Canhandle stack-based,pointer-chasing patternsMonday, April 13, 2009ROUND 2 Bigregister files, fast on chipmemory, programmer/compiler managed multi-GPUonly for gfx (SLI) PCB-mountedhighperformance RAM, no ECC Highthroughput, memorybandwidth28

GPUS FOR GPGPUThe software stack enabling GPGPU:A case study with NVIDIA’s CUDAMonday, April 13, 200929

ALAS: GPGPU TIME! Intention: usethe programmable part of the GPU for generalpurpose computing Method: writeprograms in a high-level data-parallel language to becompiled by the driver JIT compiler and run on the GPU Result: appliedproperly and on proper applications, it can delivermany orders of magnitude of speedup Culprits: youMonday, April 13, 2009have to remember all those architecture notes so far30

THE GPU DRIVER Keycomponent: a Just-In-Time (JIT) compiler for Shading languageto GPU ISA Focuson extreme register pressure, SIMD optimizations, loopunrolling, scheduling Aidthe hardware by informing of buffer needs statically Interfacewith the OS, deal with all other stuff the OS wants Probablythe hardest and most complicated driver in a PCMonday, April 13, 200931

GPGPU LANGUAGES? In the beginning it was all OpenGL-hacks Noreal languages - mostly extensions over C-like languages ascompiler directives (pragmas) AMD CTM: very low-level extensions allowing one to build aGPGPU framework Sh, Brook for GPUs: closer to proper languages but unsuccessfulMonday, April 13, 200932

GPGPU “LANGUAGES” CUDA (Compute Unified Device Architecture) by NVIDIA Version 2.2 out last Wednesday (1.0 more than two years ago) Applications such as PhysX, Media transcoders, CAD, Research. Not just language extensions: an SDK, including runtimecomponents, the driver, etcOpenCL (Open Computing Language) by the Khronos Group Spec 1.0, no implementation yetMonday, April 13, 200933

CUDA needto express functionsto be executed on device needdata-managementroutines (explicit memoryhierarchy management) expressparallelism bydistributing jobs throughthread idsMonday, April 13, 200934


ABOUT THETHREAD HIERARCHY Everyblock must be completely independent from computationsin any other block No hard limit - scalability/portability ( Hint:scheduled on different SMs) Thenumber of threads per block is limited by device capacity(Hint: executed on same SM) Threadsof the same block can perform atomic ops and besynchronized over shared memory ( syncthreads() ) Thegrid structure can be 2D - the block structure can be 3DMonday, April 13, 200936

ABOUT THE MEMORYHIERARCHY Registers are the fastest ( 4 cycles/access) ; but are limited (Hint:Register pressure) Shared memory can be as fast as registers if accessed properly (Hint:memory banks). Unpredictable order and result of synchronousaccesses, unless atomic. Constant and texture caches are not under software control but are asfast as shared memory (on cache hit) Global memory access is slow ( 400-600 cycles) - Memory accesscoalescing enables higher bandwidthMonday, April 13, 200937

MEMORY ACCESSCOALESCING Remember: atthe very bottom, GPUs are wide-SIMDarchitectures Coalescingmemory accesses will increase bandwidth from slowglobal memory: grab as much from the common parts as possible Coalescingshared memory access will allow for register-like, highthroughput performance from it: place so bank conflicts, henceserialization, is avoidedMonday, April 13, 200938

GLOBAL MEMORYACCESSCOALESCINGLeft: random float memory access withina 64B segment, resulting in one memorytransaction.Center: misaligned float memory access,resulting in one transaction.Right: misaligned float memory access,resulting in two transactions.Monday, April 13, 200939

SHARED MEMORY ACCESS COALESCINGLeft: No bank conflictsRight: Strides of 2 and 8 words 2 and 8 -way bank conflictsMonday, April 13, 200940

COMMUNICATION ANDSYNCHRONIZATION Nomemory barriers: Implement one by adding a new kernelfunction Noglobal synchronization constructs: If you need it, your work isprobably not data-parallel enough Noguarantees in how blocks are scheduled (Hint: Threadscheduler) Noguarantees in warps-orderMonday, April 13, 200941


CUDA TOOLSET Driver: CUDA SDK: nvccis built as Open64-optimizing compiler extensionscompiler, handy wrapper routines, sample projects Developin emulation-mode: run CUDA code on your CPU Profiler: veryimportant tool for performance analysis Debugger: inemulation mode (Linux only: debug on graphics cardexecution with GDB)Monday, April 13, 200943

GREAT EXPECTATIONS Youcan write code probably right away, but sub-optimal Descentspeedup even for suboptimal algorithm on data parallelapplication (order of magnitute) Twoorders of magnitude speedup on optimized algorithm,workload split across blocks/threads and especially memoryaccesses Startat home.html#Monday, April 13, 200944

EXAMPLE: MATRIX MULTIPLY A[hAx wA] * B[hB x wB] C[hA x wC] Canbe naturally split in blocks Copysub-matrices to sharedmemory Bringin memory once, formore than one blocksMonday, April 13, 200945

Host mul// Thread block size#define BLOCK SIZE 16// Launch the device computationMuld dimGrid, dimBlock (Ad, Bd, wA, wB, Cd);// Forward declaration of the device// multiplication functionglobal void Muld(float*, float*, int, int, float*);// Read C from the devicecudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);// Host multiplication function// Compute C A * B//hA is the height of A//wA is the width of A ,//wB is the width of Bvoid Mul(const float* A, const float* B, int hA, int wA, int wB,float* C){int size;// Free device memorycudaFree(Ad);cudaFree(Bd);cudaFree(Cd);}// Load A and B to the devicefloat* Ad;size hA * wA * sizeof(float);cudaMalloc((void**)&Ad, size);cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);float* Bd;size wA * wB * sizeof(float);cudaMalloc((void**)&Bd, size);cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);// Allocate C on the devicefloat* Cd;size hA * wB * sizeof(float);cudaMalloc((void**)&Cd, size);// Compute the execution configuration assuming// the matrix dimensions are multiples of BLOCK SIZEdim3 dimBlock(BLOCK SIZE, BLOCK SIZE);dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);Monday, April 13, 200946

Device mul// Compute C A * B//wA is the width of A//wB is the width of Bglobal void Muld(float* A, float* B, int wA,int wB, float* C){// Block indexint bx blockIdx.x;int by blockIdx.y;for (int a aBegin, b bBegin;a aEnd;a aStep, b bStep) {// Shared memory for the sub-matrix of Ashared float As[BLOCK SIZE][BLOCK SIZE];// Shared memory for the sub-matrix of Bshared float Bs[BLOCK SIZE][BLOCK SIZE];// Thread indexint tx threadIdx.x;int ty threadIdx.y;// Load the matrices from global memory to shared memory;// each thread loads one element of each matrixAs[ty][tx] A[a wA * ty tx];Bs[ty][tx] B[b wB * ty tx];// Index of the first sub-matrix of A processed by the blockint aBegin wA * BLOCK SIZE * by;// Synchronize to make sure the matrices are loadedsyncthreads();// Index of the last sub-matrix of A processed by the blockint aEnd aBegin wA - 1;// Multiply the two matrices together;// each thread computes one element// of the block sub-matrixfor (int k 0; k BLOCK SIZE; k)Csub As[ty][k] * Bs[k][tx];// Step size used to iterate through the sub-matrices of Aint aStep BLOCK SIZE;// Index of the first sub-matrix of B processed by the blockint bBegin BLOCK SIZE *