The Architecture and Application of Modern Graphics Processing Units
Executive Summary
The modern Graphics Processing Unit (GPU) represents a
monumental leap in computational power, evolving from processing millions of
calculations per second for early 3D games to tens of trillions for
contemporary titles. This performance is achieved through a fundamentally
different architecture than a Central Processing Unit (CPU). While a CPU is a
flexible, high-speed processor excelling at sequential tasks (analogous to a
jumbo jet), a GPU is a specialized, massively parallel processor designed to execute
the same simple instruction across vast datasets simultaneously (analogous to a
cargo ship). The physical architecture of a high-end GPU, such as the GA102
chip, features a hierarchical structure of over 10,000 specialized cores (CUDA,
Tensor, and Ray Tracing) organized into clusters and multiprocessors.
Manufacturing realities lead to a "binning" process, where chips with
minor defects are sold as lower-tier products with some cores deactivated. This
immense processing capability demands extremely high-bandwidth memory, leading
to innovations like GDDR6X and GDDR7, which use multi-level voltage signaling,
and stacked HBM, with data transfer rates on the order of one terabyte per second. The
GPU's computational model is built on the "embarrassingly parallel"
principle, primarily using Single Instruction, Multiple Threads (SIMT)
architecture to manage tasks like 3D vertex transformation. This parallel
processing prowess has expanded the GPU's application far beyond gaming into
fields like cryptocurrency mining (though now largely superseded by ASICs) and,
most significantly, powering the matrix multiplication operations at the heart
of artificial intelligence and neural networks.
--------------------------------------------------------------------------------
1. The Exponential Growth of Computational Demand
The performance requirements for real-time graphics
rendering have grown by several orders of magnitude over the past few decades.
This trajectory illustrates the immense computational challenge that modern
GPUs are designed to solve.
• 1996 (Super Mario 64): Required approximately 100 million calculations
per second.
• 2011 (Minecraft): Required approximately 100 billion calculations
per second.
• Modern Titles (Cyberpunk 2077): Require a graphics card capable of
around 36 trillion calculations per second.
To contextualize this scale, achieving 36 trillion
calculations per second would require the combined effort of
approximately 4,400 Earths full of people, with every
individual completing one calculation each second.
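The comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only: the population of 8.2 billion people per Earth is an assumption (the text does not state the figure it used), and the function name is hypothetical.

```python
# Figures quoted in the text, in calculations per second.
MARIO_64 = 100e6          # 1996
MINECRAFT = 100e9         # 2011
CYBERPUNK_2077 = 36e12    # modern title

# Assumption: roughly 8.2 billion people per Earth, each completing
# one calculation per second. The text does not state this number.
PEOPLE_PER_EARTH = 8.2e9


def earths_needed(calcs_per_second, people_per_earth=PEOPLE_PER_EARTH):
    """Return how many Earth populations match the given calculation rate."""
    return calcs_per_second / people_per_earth


if __name__ == "__main__":
    print(f"Earths needed: {earths_needed(CYBERPUNK_2077):,.0f}")
    print(f"Growth 1996 -> 2011: {MINECRAFT / MARIO_64:,.0f}x")
    print(f"Growth 2011 -> modern: {CYBERPUNK_2077 / MINECRAFT:,.0f}x")
```

Under that population assumption the result lands close to the roughly 4,400 Earths quoted above.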
2. Core Architectural Distinction: GPU vs. CPU
GPUs and CPUs are designed for different purposes, a
difference best understood through the analogy of a cargo ship (GPU) versus
a jumbo jet (CPU). The amount of data processed is the cargo
capacity, and the rate of processing is the speed.
• GPU (Cargo Ship): Designed for high
throughput. It processes a massive number of calculations at a relatively
slow rate. It is highly specialized for simple, repetitive tasks (like basic
arithmetic) and is less flexible. GPUs cannot run operating systems or directly
interface with networks or input devices.
• CPU (Jumbo Jet): Designed for low
latency. It performs a smaller number of calculations at a much faster rate. It
is a flexible, general-purpose processor capable of running complex, varied
programs and managing an entire computer system, including the operating system
and network connections.
| Feature | Graphics Processing Unit (GPU) | Central Processing Unit (CPU) |
|---|---|---|
| Primary Design | Massively parallel processing | Fast, sequential processing |
| Analogy | Massive cargo ship | Jumbo jet airplane |
| Core Count | Extremely high (e.g., >10,000 cores) | Low (e.g., 24 cores) |
| Task Flexibility | Low; specialized for simple instructions | High; runs a wide variety of programs |
| System Role | Specialized co-processor | General-purpose system brain |
| Ideal Workload | Large datasets requiring parallel execution (e.g., graphics, AI) | Varied tasks requiring rapid evaluation |
3. Physical Architecture of the GA102 GPU
The NVIDIA GA102 chip, used in the GeForce RTX 3080 and 3090 series
cards, is a complex integrated circuit built from 28.3 billion
transistors. Its architecture is organized hierarchically to manage
thousands of processing cores.
Hierarchical Core Structure
The chip's computational resources are divided into
progressively smaller units:
1. Graphics Processing Clusters (GPCs): The chip contains 7 GPCs.
2. Streaming Multiprocessors (SMs): Each GPC contains 12 SMs, for 84 SMs
in total.
3. Specialized Cores: Each SM is split into 4 "warp" units and contains
multiple core types:
◦ CUDA (or Shading) Cores: 32 per "warp" unit, or 128 per SM.
◦ Tensor Cores: 1 per "warp" unit, or 4 per SM.
◦ Ray Tracing (RT) Cores: 1 per SM.
A fully functional GA102 chip therefore contains a total of 10,752
CUDA cores, 336 Tensor cores, and 84 Ray Tracing cores.
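The totals follow from multiplying down the hierarchy. The sketch below is a rough check, assuming four "warp" units per SM, which is the number implied by the stated totals; the constant and function names are hypothetical.

```python
GPCS_PER_CHIP = 7
SMS_PER_GPC = 12
# Assumption: 4 "warp" units per SM, the value implied by the totals in the text.
WARP_UNITS_PER_SM = 4
CUDA_CORES_PER_WARP_UNIT = 32
TENSOR_CORES_PER_WARP_UNIT = 1
RT_CORES_PER_SM = 1


def ga102_totals():
    """Multiply down the hierarchy to get chip-wide core counts."""
    sms = GPCS_PER_CHIP * SMS_PER_GPC
    warp_units = sms * WARP_UNITS_PER_SM
    return {
        "sms": sms,
        "cuda_cores": warp_units * CUDA_CORES_PER_WARP_UNIT,
        "tensor_cores": warp_units * TENSOR_CORES_PER_WARP_UNIT,
        "rt_cores": sms * RT_CORES_PER_SM,
    }


if __name__ == "__main__":
    for name, count in ga102_totals().items():
        print(f"{name}: {count:,}")
```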
Core Functions and Design
Each core type is a specialized calculator for a specific
function:
• CUDA Cores: These are simple binary
calculators primarily used in video games. A single CUDA core contains
approximately 410,000 transistors and is optimized for the Fused
Multiply and Add (FMA) operation (A * B + C), completing one such
operation per clock cycle. Half of the cores handle 32-bit floating-point
numbers, while the other half handle either 32-bit integers or floating-point
numbers.
• Tensor Cores: These are matrix
multiplication and addition calculators. Their primary use is in geometric
transformations for graphics and, more critically, for processing neural
networks in AI applications.
• Ray Tracing Cores: These are the largest
but least numerous cores, designed specifically to execute computationally
intensive ray tracing algorithms for realistic lighting effects.
More complex operations like division or trigonometric
functions are handled by a smaller number of Special Function Units (four
per SM).
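The FMA operation, and the peak rate it implies, can be sketched as follows. Two things here are assumptions rather than statements from the text: each FMA is counted as two calculations (one multiply, one add), and the clock speed is taken as about 1.7 GHz. The function names are hypothetical.

```python
def fma(a, b, c):
    """Multiply a by b, then add c (the A * B + C operation).

    Note: this plain Python version rounds twice (after the multiply and
    after the add), whereas a hardware FMA rounds only once.
    """
    return a * b + c


def peak_calcs_per_second(cuda_cores, clock_hz):
    """Peak rate if every core completes one FMA (counted as 2 calculations) per cycle."""
    return cuda_cores * 2 * clock_hz


# Assumption: a clock of about 1.7 GHz, which the text does not state.
ASSUMED_CLOCK_HZ = 1.7e9

if __name__ == "__main__":
    print(fma(2.0, 3.0, 4.0))
    rate = peak_calcs_per_second(10_496, ASSUMED_CLOCK_HZ)
    print(f"{rate / 1e12:.1f} trillion calculations per second")
```

With the 3090's 10,496 CUDA cores, these assumptions give a little under 36 trillion calculations per second, in line with the figure quoted for modern titles.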
Manufacturing and Product Binning
The GA102 chip design is used across multiple graphics card
models (3080, 3080 Ti, 3090, 3090 Ti), which have different prices and
performance levels. This is a result of a manufacturing practice called "binning."
• During fabrication, microscopic defects (from dust or
patterning errors) can render small parts of the chip non-functional.
• Due to the GPU's highly repetitive design, a defect
in one SM does not affect the rest of the chip.
• Instead of discarding the entire chip, manufacturers
test each one, find the defective regions, and permanently deactivate the
nearby circuitry.
• Chips are then categorized or "binned"
based on the number of fully functional cores.
| Graphics Card Model | Functional CUDA Cores | Status |
|---|---|---|
| 3090 Ti | 10,752 | Flawless GA102 chip |
| 3090 | 10,496 | Minor defects |
| 3080 Ti | 10,240 | Moderate defects |
| 3080 | 8,704 | Equivalent of 16 deactivated SMs |
In addition to core counts, these cards also differ in
maximum clock speed and the quantity/generation of their graphics memory.
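The core counts in the table map directly onto a number of deactivated SMs. The sketch below is a minimal check, assuming 84 SMs on the full chip (7 GPCs of 12 SMs each) so that each SM holds 128 CUDA cores; the names are hypothetical.

```python
FULL_CHIP_CUDA_CORES = 10_752
SMS_PER_CHIP = 84  # 7 GPCs x 12 SMs
CUDA_CORES_PER_SM = FULL_CHIP_CUDA_CORES // SMS_PER_CHIP  # 128

# Functional CUDA cores per card, taken from the table above.
FUNCTIONAL_CUDA_CORES = {
    "3090 Ti": 10_752,
    "3090": 10_496,
    "3080 Ti": 10_240,
    "3080": 8_704,
}


def deactivated_sms(functional_cores):
    """Return the number of SMs' worth of CUDA cores that are switched off."""
    missing = FULL_CHIP_CUDA_CORES - functional_cores
    if missing < 0 or missing % CUDA_CORES_PER_SM:
        raise ValueError("core count does not correspond to whole SMs")
    return missing // CUDA_CORES_PER_SM


if __name__ == "__main__":
    for card, cores in FUNCTIONAL_CUDA_CORES.items():
        print(f"{card}: {deactivated_sms(cores)} SMs deactivated")
```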
4. The Role of High-Bandwidth Graphics Memory
GPUs are "data-hungry machines" that require a
continuous, high-speed flow of data to feed their thousands of cores. This
necessitates specialized, high-bandwidth memory systems.
Data Throughput Requirements
When a game loads, 3D models and environmental data are
moved from the system's solid-state drive into the graphics memory chips on the
card. The GPU's small onboard cache (e.g., 6 MB L2 cache on the GA102) can only
hold a tiny fraction of a scene, so data is constantly transferred between the
GPU and the graphics memory. The 3090 graphics card features 24 GB of
GDDR6X SDRAM.
GDDR vs. System DRAM
Graphics memory is engineered for maximum bandwidth, in
contrast to the DRAM that supports a CPU.
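• Graphics Memory (GDDR6X): Utilizes a very
wide bus. The 24 chips on the 3090 combine for a 384-bit bus width. At GDDR6X's
top rate of 24 Gbit/s per pin, that gives a total bandwidth of approximately
1.15 terabytes per second; the 3090 as shipped runs at 19.5 Gbit/s per pin,
or about 936 gigabytes per second.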
• System Memory (DRAM): Uses a
narrower 64-bit bus width, resulting in a maximum bandwidth closer
to 64 gigabytes per second.
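Peak bandwidth is the bus width multiplied by the per-pin data rate, divided by 8 to convert bits to bytes. The sketch below is illustrative: the per-pin rates are not stated in the text and are assumptions, chosen because they reproduce the quoted bandwidths (24 Gbit/s for the 1.15 TB/s figure, 8 Gbit/s for the 64 GB/s figure). The function name is hypothetical.

```python
def bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in gigabytes per second for a memory bus."""
    return bus_width_bits * gbit_per_s_per_pin / 8


if __name__ == "__main__":
    # Assumed per-pin data rates (not given in the text).
    print(bandwidth_gb_per_s(384, 24.0))   # GDDR6X at its top rate
    print(bandwidth_gb_per_s(384, 19.5))   # RTX 3090 as shipped
    print(bandwidth_gb_per_s(64, 8.0))     # one 64-bit system DRAM channel
```

The bus width accounts for a factor of six between the two memory types; the rest of the gap comes from the faster signaling on each pin.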
Advanced Data Encoding (PAM4 vs. PAM3)
To push data transfer rates higher, modern graphics memory
technologies transmit data using multiple voltage levels, going beyond simple
binary 1s and 0s.
• PAM4 (Pulse Amplitude Modulation 4): Used
in GDDR6X memory, this scheme employs four different voltage levels to
send two bits of data simultaneously.
• PAM3 (Pulse Amplitude Modulation 3): Used
in the next-generation GDDR7, this scheme uses three voltage levels to
encode binary bits into ternary digits. This approach was chosen by industry
engineers to reduce encoder complexity, improve the signal-to-noise ratio, and
enhance power efficiency.
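The two schemes can be sketched as simple mappings from bits to voltage levels. This is an illustration of the idea, not the encoding used by the memory standards: the level assignments are hypothetical, and the PAM3 version assumes 3 bits are packed into 2 three-level symbols (9 possible symbol pairs cover the 8 possible bit patterns).

```python
def pam4_encode(bits):
    """Map each pair of bits to one of four levels (0 to 3)."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]


def pam4_decode(levels):
    """Recover two bits from each four-level symbol."""
    bits = []
    for level in levels:
        bits += [level >> 1, level & 1]
    return bits


def pam3_encode(bits):
    """Pack each group of 3 bits into 2 ternary symbols (levels 0 to 2)."""
    if len(bits) % 3:
        raise ValueError("this PAM3 sketch needs a multiple of 3 bits")
    symbols = []
    for i in range(0, len(bits), 3):
        value = 4 * bits[i] + 2 * bits[i + 1] + bits[i + 2]  # 0 to 7
        symbols += [value // 3, value % 3]
    return symbols


def pam3_decode(symbols):
    """Recover 3 bits from each pair of ternary symbols."""
    bits = []
    for i in range(0, len(symbols), 2):
        value = 3 * symbols[i] + symbols[i + 1]
        bits += [(value >> 2) & 1, (value >> 1) & 1, value & 1]
    return bits


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0]
    print(pam4_encode(data))  # 3 symbols for 6 bits
    print(pam3_encode(data))  # 4 symbols for 6 bits
```

In this sketch PAM4 carries 2 bits per symbol and PAM3 carries 1.5, which shows the trade-off: fewer voltage levels are easier to tell apart, at the cost of fewer bits per symbol.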
Specialized AI Memory (HBM)
For AI accelerators, Micron also produces High Bandwidth
Memory (HBM), a stacked DRAM technology that prioritizes even greater bandwidth.
• Structure: HBM consists of stacks of DRAM
chips connected vertically with Through-Silicon Vias (TSVs),
forming a "cube of AI memory." Several cubes are placed around the AI processor.
• Performance (HBM3E): A single HBM3E cube
can hold 24-36 GB of memory, and a typical AI accelerator can be surrounded by up
to 192 GB of this high-speed memory. According to Micron, its HBM3E uses 30%
less power than competing products. These AI systems can cost between $25,000
and $40,000.
5. The Computational Model: Parallel Processing at Scale
GPUs excel at solving problems classified as "embarrassingly
parallel," where a large problem can be broken down into many
smaller, independent tasks that require little to no coordination.
Practical Application: 3D Graphics Rendering (SIMD)
The core principle is Single Instruction, Multiple
Data (SIMD). A single operation is applied to millions of data points in
parallel. This is exemplified by vertex transformation in 3D graphics:
1. The Task: A 3D scene is composed of many
objects (e.g., 5,629 objects), each built from thousands of vertices (e.g., 8.3
million total vertices). Each object's vertices exist in a local "model
space." To render the scene, all vertices must be transformed into a
common "world space."
2. The Instruction: The transformation for
a single vertex is a simple addition: the position of the object's model-space
origin in world space is added to the vertex's x, y, and z coordinates.
3. The Parallelism: This single instruction
is applied simultaneously to all vertices of an object, and this process is
repeated for every object in the scene. With three additions per vertex, a
single frame may involve about 25 million such addition calculations (8.3 million
vertices × 3 coordinates), all of which are independent and
can be executed in parallel across the GPU's thousands of cores.
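The transformation can be sketched in a few lines. This is a minimal illustration, assuming the transformation is a pure translation (adding the object's world-space position to each vertex coordinate); the function names are hypothetical. It runs sequentially on a CPU, but each vertex is handled independently of every other, which is the property a GPU exploits.

```python
def to_world_space(model_vertices, object_position):
    """Translate vertices from model space to world space.

    Every vertex gets the same instruction (add the object's position),
    and no vertex depends on the result for any other vertex.
    """
    ox, oy, oz = object_position
    return [(x + ox, y + oy, z + oz) for (x, y, z) in model_vertices]


def additions_per_frame(total_vertices, coordinates_per_vertex=3):
    """Count the additions needed: one per coordinate of each vertex."""
    return total_vertices * coordinates_per_vertex


if __name__ == "__main__":
    triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
    print(to_world_space(triangle, (10.0, 20.0, 30.0)))
    print(f"{additions_per_frame(8_300_000):,} additions")
```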
Execution Hierarchy and the Evolution to SIMT
The workload is managed through a structured hierarchy
scheduled by the GPU's Gigathread Engine:
• Thread: A single instruction executed on
a single CUDA core.
• Warp: A group of 32 threads.
• Thread Block: A group of warps handled by
a single SM.
• Grid: A group of thread blocks computed
across the entire GPU.
Modern GPUs have evolved from a strict SIMD model to a more
flexible Single Instruction, Multiple Threads (SIMT) architecture.
• SIMD (pre-2016): All 32 threads in a warp
executed in perfect lockstep.
• SIMT (current): The same set of
instructions is sent to all threads in a warp, but individual threads can
progress at different rates because each thread has its own program counter. This
provides greater flexibility and efficiency, particularly when a conditional
branch sends threads of the same warp down different paths ("warp divergence").
Threads within an SM can also share data via a shared 128 KB L1 cache.
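The hierarchy and the cost of divergence can be sketched with a simplified model. Two things here are assumptions: the thread block size of 128 threads (4 warps) is a hypothetical choice, and the divergence model assumes a warp whose threads disagree on a branch runs both paths one after the other. The function names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def launch_shape(n_threads, threads_per_block=128):
    """Return (warps, thread_blocks) needed to cover n_threads.

    threads_per_block is a hypothetical choice (128 threads = 4 warps).
    """
    if threads_per_block % WARP_SIZE:
        raise ValueError("a thread block should hold whole warps")
    warps = (n_threads + WARP_SIZE - 1) // WARP_SIZE
    blocks = (n_threads + threads_per_block - 1) // threads_per_block
    return warps, blocks


def branch_passes(takes_if_branch):
    """Count the passes one warp needs to get through an if/else.

    Simplified model: if every thread agrees, the warp runs one path.
    If the threads disagree, the warp runs both paths in turn while the
    threads not on the current path sit idle.
    """
    diverged = any(takes_if_branch) and not all(takes_if_branch)
    return 2 if diverged else 1


if __name__ == "__main__":
    print(launch_shape(8_300_000))             # one thread per vertex
    print(branch_passes([True] * 32))          # uniform warp
    print(branch_passes([True, False] * 16))   # diverged warp
```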
6. Applications Beyond Graphics Rendering
The massive parallel processing capabilities of GPUs have
been leveraged for tasks far beyond video games.
Cryptocurrency Mining (Bitcoin)
Historically, GPUs were well suited to mining Bitcoin. The
process involves repeatedly running the SHA-256 hashing algorithm with
different input values (each called a "nonce") to find an output, which is
effectively a random number, that begins with a required number of zeros.
• Parallel Application: A GPU could run
thousands of iterations of the SHA-256 algorithm in parallel, each with a
different nonce. A card like the 3090 could generate approximately 95
million hashes (or "lottery tickets") per second.
• Obsolescence: This application is now
dominated by ASICs (Application-Specific Integrated Circuits),
custom-built chips that are vastly more efficient. A modern ASIC miner can
perform 250 trillion hashes per second, which at the 95-million-hash rate
quoted above is equivalent to roughly 2.6 million graphics cards.
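The nonce search can be sketched with Python's standard `hashlib` module. This is a simplified illustration, not Bitcoin's actual procedure: the block contents, the 8-byte nonce layout, and the difficulty are hypothetical, and a single SHA-256 pass is used. Each nonce is tested independently of the others, which is why the work spreads so easily across thousands of GPU threads.

```python
import hashlib


def mine(block_data, zero_bits, max_nonce=1_000_000):
    """Try nonces in turn until the SHA-256 hash starts with zero_bits zero bits.

    Returns (nonce, hex_digest), or None if no nonce below max_nonce works.
    """
    # A hash with zero_bits leading zero bits is a number below this target.
    target = 1 << (256 - zero_bits)
    for nonce in range(max_nonce):
        payload = block_data + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(payload).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None


if __name__ == "__main__":
    # Hypothetical block contents and a low difficulty (12 zero bits).
    result = mine(b"example block", zero_bits=12)
    if result is not None:
        nonce, digest_hex = result
        print(f"nonce={nonce} hash={digest_hex}")
```

At 12 zero bits a match takes about 4,096 attempts on average. Each additional zero bit doubles the expected work, which is what makes hash rate the deciding factor.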
Artificial Intelligence and Neural Networks
AI is a primary modern application for GPUs, directly
utilizing the hardware's architecture.
• Core Operation: Neural networks and
generative AI rely on trillions to quadrillions of matrix multiplication and
addition operations.
• Tensor Core Function: The GPU's Tensor
Cores are specifically designed for this task. They take three matrices,
multiply the first two, and add the third, performing all calculations
concurrently because all input data is available simultaneously. This makes
GPUs the ideal hardware for accelerating AI workloads.
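The Tensor Core operation described above can be written as D = A × B + C. The sketch below is a plain Python illustration with a hypothetical function name; it computes the output elements one at a time, whereas the hardware computes them concurrently. No output element depends on any other, so they can all be calculated at once.

```python
def matmul_add(a, b, c):
    """Return A x B + C for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("columns of A must match rows of B")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("C must have the shape of A x B")
    # Each output element is an independent sum of products plus one addition.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(matmul_add(a, b, c))
```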