Wednesday, September 10, 2025

How NVIDIA Graphics Cards Work


The Architecture and Application of Modern Graphics Processing Units

Executive Summary

The modern Graphics Processing Unit (GPU) represents a monumental leap in computational power: early 3D games needed on the order of a hundred million calculations per second, while contemporary titles need tens of trillions. This performance comes from an architecture fundamentally different from that of a Central Processing Unit (CPU). A CPU is a flexible, low-latency processor that excels at sequential tasks (analogous to a jumbo jet), whereas a GPU is a specialized, massively parallel processor designed to execute the same simple instruction across vast datasets simultaneously (analogous to a cargo ship).

The physical architecture of a high-end GPU, such as the GA102 chip, is a hierarchy of more than 10,000 specialized cores (CUDA, Tensor, and Ray Tracing) organized into clusters and multiprocessors. Manufacturing realities lead to a "binning" process, in which chips with minor defects are sold as lower-tier products with some cores deactivated. Feeding this many cores demands extremely high-bandwidth memory, which has led to innovations such as GDDR6X, which uses multi-level voltage signaling, and stacked HBM, with data transfer rates around one terabyte per second and beyond.

The GPU's computational model targets "embarrassingly parallel" problems, using a Single Instruction, Multiple Threads (SIMT) architecture for tasks such as 3D vertex transformation. This parallel processing power has carried the GPU well beyond gaming into cryptocurrency mining (now largely superseded by ASICs) and, most significantly, the matrix multiplication operations at the heart of artificial intelligence and neural networks.

--------------------------------------------------------------------------------

1. The Exponential Growth of Computational Demand

The performance requirements for real-time graphics rendering have grown by several orders of magnitude over the past few decades. This trajectory illustrates the immense computational challenge that modern GPUs are designed to solve.

• 1996 (Super Mario 64): Required approximately 100 million calculations per second.

• 2011 (Minecraft): Required approximately 100 billion calculations per second.

• Modern Titles (Cyberpunk 2077): Require a graphics card capable of around 36 trillion calculations per second.

To contextualize this scale, achieving 36 trillion calculations per second would require the combined effort of approximately 4,400 Earths full of people, with every individual completing one calculation each second.
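The comparison above is plain division, and it can be checked in a few lines. This is a minimal sketch, assuming a world population of about 8.2 billion; that figure is not stated in this post and is chosen only because it is close to current estimates.

```python
# Rough scale check for the figures quoted above.
# ASSUMPTION: world population of ~8.2 billion (not stated in the text).
WORLD_POPULATION = 8.2e9

calcs_per_second = {
    "Super Mario 64 (1996)": 100e6,  # 100 million
    "Minecraft (2011)": 100e9,       # 100 billion
    "Cyberpunk 2077": 36e12,         # 36 trillion
}


def earths_needed(rate, population=WORLD_POPULATION):
    """Earths required if every person completes one calculation per second."""
    return rate / population


baseline = calcs_per_second["Super Mario 64 (1996)"]
for title, rate in calcs_per_second.items():
    print(f"{title}: {rate / baseline:,.0f}x the 1996 figure, "
          f"{earths_needed(rate):,.2f} Earths")
```

Under that assumption the last line lands close to the 4,400 Earths quoted above; a different population estimate shifts the result proportionally.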

2. Core Architectural Distinction: GPU vs. CPU

GPUs and CPUs are designed for different purposes, a difference best understood through the analogy of a cargo ship (GPU) versus a jumbo jet (CPU). The amount of data processed is the cargo capacity, and the rate of processing is the speed.

• GPU (Cargo Ship): Designed for high throughput. It processes a massive number of calculations at a relatively slow rate. It is highly specialized for simple, repetitive tasks (like basic arithmetic) and is less flexible. GPUs cannot run operating systems or directly interface with networks or input devices.

• CPU (Jumbo Jet): Designed for low latency. It performs a smaller number of calculations at a much faster rate. It is a flexible, general-purpose processor capable of running complex, varied programs and managing an entire computer system, including the operating system and network connections.

| Feature | Graphics Processing Unit (GPU) | Central Processing Unit (CPU) |
| --- | --- | --- |
| Primary Design | Massively parallel processing | Fast, sequential processing |
| Analogy | Massive cargo ship | Jumbo jet airplane |
| Core Count | Extremely high (e.g., >10,000 cores) | Low (e.g., 24 cores) |
| Task Flexibility | Low; specialized for simple instructions | High; runs a wide variety of programs |
| System Role | Specialized co-processor | General-purpose system brain |
| Ideal Workload | Large datasets requiring parallel execution (e.g., graphics, AI) | Varied tasks requiring rapid evaluation |

3. Physical Architecture of the GA102 GPU

The NVIDIA GA102 chip, used in the 3080 and 3090 series cards, is a complex integrated circuit built from 28.3 billion transistors. Its architecture is organized hierarchically to manage thousands of processing cores.

Hierarchical Core Structure

The chip's computational resources are divided into progressively smaller units:

1. Graphics Processing Clusters (GPCs): The chip contains 7 GPCs.

2. Streaming Multiprocessors (SMs): Each GPC contains 12 SMs.

3. Specialized Cores: Each SM is divided into four "warp" units and contains multiple core types, including:

    ◦ CUDA (or Shading) Cores: 32 per "warp" unit within an SM.

    ◦ Tensor Cores: 1 per "warp" unit within an SM.

    ◦ Ray Tracing (RT) Cores: 1 per SM.

A fully functional GA102 chip contains a total of 10,752 CUDA cores, 336 Tensor cores, and 84 Ray Tracing cores.
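The totals follow from multiplying down the hierarchy. The sketch below is only a consistency check; the one input treated as an assumption is four "warp" units per SM, which is the value the stated totals imply.

```python
# Core totals implied by the GA102 hierarchy described above.
GPCS_PER_CHIP = 7
SMS_PER_GPC = 12
# ASSUMPTION: four "warp" units per SM, the value that makes the per-unit
# counts multiply out to the stated chip totals.
WARP_UNITS_PER_SM = 4
CUDA_CORES_PER_WARP_UNIT = 32
TENSOR_CORES_PER_WARP_UNIT = 1
RT_CORES_PER_SM = 1

sms = GPCS_PER_CHIP * SMS_PER_GPC
cuda_cores = sms * WARP_UNITS_PER_SM * CUDA_CORES_PER_WARP_UNIT
tensor_cores = sms * WARP_UNITS_PER_SM * TENSOR_CORES_PER_WARP_UNIT
rt_cores = sms * RT_CORES_PER_SM

print(f"SMs: {sms}")
print(f"CUDA cores: {cuda_cores:,}")
print(f"Tensor cores: {tensor_cores}")
print(f"RT cores: {rt_cores}")
```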

Core Functions and Design

Each core type is a specialized calculator for a specific function:

• CUDA Cores: These are simple binary calculators primarily used in video games. A single CUDA core contains approximately 410,000 transistors and is optimized for the fused multiply-add (FMA) operation (A * B + C), completing one such operation per clock cycle. Half of the cores handle only 32-bit floating-point numbers, while the other half handle either 32-bit integers or 32-bit floating-point numbers.

• Tensor Cores: These are matrix multiplication and addition calculators. Their primary use is in geometric transformations for graphics and, more critically, for processing neural networks in AI applications.

• Ray Tracing Cores: These are the largest but least numerous cores, designed specifically to execute computationally intensive ray tracing algorithms for realistic lighting effects.

More complex operations like division or trigonometric functions are handled by a smaller number of Special Function Units (four per SM).
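The "tens of trillions" scale can be roughly reproduced from the one-FMA-per-clock rate. This is a back-of-the-envelope sketch, not a benchmark: it uses the 3090's 10,496 CUDA cores (listed later in this post) and assumes a boost clock of about 1.7 GHz, which is not stated here.

```python
# Peak arithmetic rate if every CUDA core completes one FMA per clock.
CUDA_CORES_3090 = 10_496  # 3090 core count, listed later in this post
OPS_PER_FMA = 2           # one multiply plus one add
# ASSUMPTION: ~1.7 GHz boost clock for the 3090 (not stated in this post).
CLOCK_HZ = 1.7e9


def fma(a, b, c):
    """What one core computes per cycle: a * b + c.

    Hardware FMA rounds once; this plain expression is only illustrative.
    """
    return a * b + c


def peak_ops_per_second(cores, clock_hz, ops_per_cycle=OPS_PER_FMA):
    """Upper bound: every core busy on every cycle."""
    return cores * ops_per_cycle * clock_hz


peak = peak_ops_per_second(CUDA_CORES_3090, CLOCK_HZ)
print(f"{peak / 1e12:.1f} trillion calculations per second")
```

The result is a theoretical ceiling; real workloads rarely keep every core busy on every cycle.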

Manufacturing and Product Binning

The GA102 chip design is used across multiple graphics card models (3080, 3080 Ti, 3090, 3090 Ti), which have different prices and performance levels. This is a result of a manufacturing practice called "binning."

• During fabrication, microscopic defects (from dust or patterning errors) can render small parts of the chip non-functional.

• Due to the GPU's highly repetitive design, a defect in one SM does not affect the rest of the chip.

• Instead of discarding the entire chip, manufacturers test each one, find the defective regions, and permanently deactivate the nearby circuitry.

• Chips are then categorized or "binned" based on the number of fully functional cores.

| Graphics Card Model | Functional CUDA Cores | Status |
| --- | --- | --- |
| 3090 Ti | 10,752 | Flawless GA102 chip |
| 3090 | 10,496 | Minor defects |
| 3080 Ti | 10,240 | Moderate defects |
| 3080 | 8,704 | Equivalent of 16 deactivated SMs |

In addition to core counts, these cards also differ in maximum clock speed and the quantity/generation of their graphics memory.
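The number of deactivated SMs behind each model can be derived from the core counts in the table. A small sketch, assuming 128 CUDA cores per SM (10,752 cores divided by 84 SMs on the full chip):

```python
# Deactivated SMs implied by each card's functional CUDA core count.
FULL_CHIP_CUDA_CORES = 10_752
# ASSUMPTION: 128 CUDA cores per SM, i.e. 10,752 cores / 84 SMs.
CUDA_CORES_PER_SM = 128

cards = {
    "3090 Ti": 10_752,
    "3090": 10_496,
    "3080 Ti": 10_240,
    "3080": 8_704,
}


def disabled_sms(functional_cores):
    """Whole SMs switched off to reach the given functional core count."""
    missing = FULL_CHIP_CUDA_CORES - functional_cores
    if missing % CUDA_CORES_PER_SM:
        raise ValueError("core count is not a whole number of SMs")
    return missing // CUDA_CORES_PER_SM


for model, cores in cards.items():
    print(f"{model}: {disabled_sms(cores)} SMs deactivated")
```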

4. The Role of High-Bandwidth Graphics Memory

GPUs are "data-hungry machines" that require a continuous, high-speed flow of data to feed their thousands of cores. This necessitates specialized, high-bandwidth memory systems.

Data Throughput Requirements

When a game loads, 3D models and environmental data are moved from the system's solid-state drive into the graphics memory chips on the card. The GPU's small onboard cache (e.g., 6 MB L2 cache on the GA102) can only hold a tiny fraction of a scene, so data is constantly transferred between the GPU and the graphics memory. The 3090 graphics card features 24 GB of GDDR6X SDRAM.

GDDR vs. System DRAM

Graphics memory is engineered for maximum bandwidth, in contrast to the DRAM that supports a CPU.

• Graphics Memory (GDDR6X): Utilizes a very wide bus. The 24 chips on the 3090 combine for a 384-bit bus width, achieving a total bandwidth of approximately 936 gigabytes per second, just under one terabyte per second.

• System Memory (DRAM): Uses a narrower 64-bit bus width, resulting in a maximum bandwidth closer to 64 gigabytes per second.
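Peak bandwidth is the bus width multiplied by the data rate of each pin. The sketch below illustrates that relationship; both per-pin rates are assumptions that do not appear in this post (19.5 Gbit/s is NVIDIA's published GDDR6X rate for the 3090, and 8 Gbit/s is simply the rate a 64-bit bus needs to reach 64 GB/s).

```python
def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s: pins x per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8


# ASSUMPTION: 19.5 Gbit/s per pin (published GDDR6X rate for the 3090).
gddr6x = peak_bandwidth_gb_per_s(384, 19.5)
# ASSUMPTION: 8 Gbit/s per pin, the rate implied by 64 GB/s on a 64-bit bus.
dram = peak_bandwidth_gb_per_s(64, 8.0)

print(f"GDDR6X, 384-bit bus: {gddr6x:.0f} GB/s")
print(f"System DRAM, 64-bit bus: {dram:.0f} GB/s")
print(f"Ratio: {gddr6x / dram:.1f}x")
```

With these inputs the 384-bit bus works out to roughly 0.94 TB/s; faster per-pin rates raise the figure proportionally. Most of the gap to system memory comes from the six-times-wider bus rather than from the per-pin rate.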

Advanced Data Encoding (PAM4 vs. PAM3)

To push data transfer rates higher, modern graphics memory technologies transmit data using multiple voltage levels, going beyond simple binary 1s and 0s.

• PAM4 (Pulse Amplitude Modulation 4): Used in GDDR6X memory, this scheme employs four different voltage levels, so each transmitted symbol carries two bits of data.

• PAM3 (Pulse Amplitude Modulation 3): Used in the next-generation GDDR7, this scheme uses three voltage levels, encoding groups of binary bits as ternary digits. With fewer levels than PAM4, the voltage steps are further apart. Industry engineers chose this approach to reduce encoder complexity, improve the signal-to-noise ratio, and enhance power efficiency.
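To make the two schemes concrete, here is a toy encoder for each. The level assignments and the 3-bit to 2-symbol PAM3 table are hypothetical and chosen only for illustration; the real GDDR6X and GDDR7 encodings are more involved.

```python
from itertools import product

# Hypothetical level assignment (Gray-coded), for illustration only.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

# Two ternary symbols give 3 * 3 = 9 combinations; 8 are enough for 3 bits.
# Hypothetical table: the first 8 pairs in counting order.
_TERNARY_PAIRS = list(product(range(3), repeat=2))[:8]
PAM3_TABLE = dict(zip(product((0, 1), repeat=3), _TERNARY_PAIRS))


def pam4_encode(bits):
    """Pack each pair of bits into one four-level symbol."""
    if len(bits) % 2:
        raise ValueError("this sketch needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam3_encode(bits):
    """Pack each group of three bits into two three-level symbols."""
    if len(bits) % 3:
        raise ValueError("this sketch needs a multiple of three bits")
    symbols = []
    for i in range(0, len(bits), 3):
        symbols.extend(PAM3_TABLE[tuple(bits[i:i + 3])])
    return symbols


data = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # 12 example bits
pam4 = pam4_encode(data)
pam3 = pam3_encode(data)
print(f"PAM4: {len(pam4)} symbols, {len(data) / len(pam4)} bits per symbol")
print(f"PAM3: {len(pam3)} symbols, {len(data) / len(pam3)} bits per symbol")
```

In this simplified mapping PAM4 carries 2 bits per symbol and PAM3 carries 1.5, which shows the trade: PAM3 gives up some density per symbol in exchange for wider spacing between voltage levels.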

Specialized AI Memory (HBM)

For AI accelerators, memory makers such as Micron produce High Bandwidth Memory (HBM), which prioritizes even greater bandwidth.

• Structure: HBM consists of stacks of DRAM chips connected vertically with Through-Silicon Vias (TSVs). Each stack forms a "cube of AI memory," and several cubes surround the AI processor.

• Performance (HBM3E): A single HBM3E cube can hold 24-36 GB of memory, and a typical AI accelerator can be surrounded by up to 192 GB of this high-speed memory. Micron states that its HBM3E uses 30% less power than competing products. These AI systems can cost between $25,000 and $40,000.

5. The Computational Model: Parallel Processing at Scale

GPUs excel at solving problems classified as "embarrassingly parallel," where a large problem can be broken down into many smaller, independent tasks that require little to no coordination.

Practical Application: 3D Graphics Rendering (SIMD)

The core principle is Single Instruction, Multiple Data (SIMD). A single operation is applied to millions of data points in parallel. This is exemplified by vertex transformation in 3D graphics:

1. The Task: A 3D scene is composed of many objects (e.g., 5,629 objects), each built from thousands of vertices (e.g., 8.3 million total vertices). Each object's vertices exist in a local "model space." To render the scene, all vertices must be transformed into a common "world space."

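2. The Instruction: The transformation for a single vertex is a simple addition. The position of the object's origin in world space is added to the vertex's X, Y, and Z coordinates in model space, for three additions per vertex.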

3. The Parallelism: This single instruction is applied simultaneously to all vertices of an object, and this process is repeated for every object in the scene. A single frame may involve 25 million such addition calculations, all of which are independent and can be executed in parallel across the GPU's thousands of cores.
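The steps above can be sketched in a few lines. The example runs sequentially on a CPU and uses a made-up three-vertex object; on a GPU each vertex would be handled by its own thread, but the arithmetic per vertex is the same.

```python
def to_world_space(model_vertices, object_origin):
    """Translate model-space vertices by the object's world-space origin.

    Every output coordinate is one independent addition, so no vertex needs
    the result of any other vertex.
    """
    ox, oy, oz = object_origin
    return [(x + ox, y + oy, z + oz) for (x, y, z) in model_vertices]


# Hypothetical object: three vertices, placed at (10, 0, -5) in world space.
model = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 2.0)]
world = to_world_space(model, (10.0, 0.0, -5.0))
print(world)

# Additions per frame for the scene size quoted above (x, y, z per vertex).
TOTAL_VERTICES = 8_300_000
additions = TOTAL_VERTICES * 3
print(f"{additions:,} additions per frame")
```

Three additions for each of 8.3 million vertices gives 24.9 million, which matches the roughly 25 million additions per frame mentioned above.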

Execution Hierarchy and the Evolution to SIMT

The workload is managed through a structured hierarchy scheduled by the GPU's Gigathread Engine:

• Thread: A single instruction executed on a single CUDA core.

• Warp: A group of 32 threads.

• Thread Block: A group of warps handled by a single SM.

• Grid: A group of thread blocks computed across the entire GPU.
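One way to picture the hierarchy is to take a flat thread number and work out where it sits. The sketch below assumes 4 warps (128 threads) per thread block; that block size is an illustrative choice, not a value given in this post.

```python
WARP_SIZE = 32
# ASSUMPTION: 4 warps per thread block, chosen for illustration.
WARPS_PER_BLOCK = 4
THREADS_PER_BLOCK = WARP_SIZE * WARPS_PER_BLOCK


def locate(global_thread_id):
    """Return (block, warp within block, lane within warp) for a flat index."""
    block, in_block = divmod(global_thread_id, THREADS_PER_BLOCK)
    warp, lane = divmod(in_block, WARP_SIZE)
    return block, warp, lane


def blocks_needed(total_threads):
    """Ceiling division: enough blocks in the grid to cover every thread."""
    return -(-total_threads // THREADS_PER_BLOCK)


print(locate(1000))
print(f"{blocks_needed(8_300_000):,} blocks for 8.3 million threads")
```

With one thread per vertex, the 8.3 million vertices from the rendering example would form a grid of tens of thousands of blocks, far more than the number of SMs, so the scheduler hands out blocks as SMs become free.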

Modern GPUs have evolved from a strict SIMD model to a more flexible Single Instruction, Multiple Threads (SIMT) architecture.

• SIMD (pre-2016): All 32 threads in a warp executed in perfect lockstep.

• SIMT (current): The same set of instructions is sent to all threads in a warp, but individual threads can progress at different rates. Each thread is given its own program counter. This provides greater flexibility and efficiency, particularly when conditional branches send threads in the same warp down different paths ("warp divergence"). Threads within an SM can also share data via a shared 128 KB L1 cache.
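The cost of a divergent branch can be shown with a toy model of one warp. This is a simplified sketch, not a description of how any particular GPU schedules threads: it assumes that when threads disagree on a branch, each side is issued in turn with the other threads masked off.

```python
WARP_SIZE = 32


def run_warp(values):
    """Apply 'double if even, otherwise add one' across one warp.

    Returns the per-thread results and the number of branch passes issued.
    A pass is issued for a side of the branch only if some thread takes it.
    """
    if len(values) != WARP_SIZE:
        raise ValueError("a warp holds exactly 32 threads")
    takes_even_path = [v % 2 == 0 for v in values]
    results = [None] * WARP_SIZE
    passes = 0
    if any(takes_even_path):  # pass for the "even" side
        passes += 1
        for lane, v in enumerate(values):
            if takes_even_path[lane]:
                results[lane] = v * 2
    if not all(takes_even_path):  # pass for the "odd" side
        passes += 1
        for lane, v in enumerate(values):
            if not takes_even_path[lane]:
                results[lane] = v + 1
    return results, passes


uniform, uniform_passes = run_warp([2 * i for i in range(WARP_SIZE)])
mixed, mixed_passes = run_warp(list(range(WARP_SIZE)))
print(f"all threads agree: {uniform_passes} pass")
print(f"threads disagree: {mixed_passes} passes")
```

When every thread takes the same side, the warp pays for one path; when they disagree, it pays for both. Keeping the threads of a warp on the same path is therefore a common optimization goal.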

6. Applications Beyond Graphics Rendering

The massive parallel processing capabilities of GPUs have been leveraged for tasks far beyond video games.

Cryptocurrency Mining (Bitcoin)

Historically, GPUs were ideal for mining Bitcoin. The process involves repeatedly running the SHA-256 hashing algorithm, each time with a different input value (a "nonce"), until the output hash begins with a required number of zeros.

• Parallel Application: A GPU could run thousands of iterations of the SHA-256 algorithm in parallel, each with a different nonce. A card like the 3090 could generate approximately 95 million hashes (or "lottery tickets") per second.

• Obsolescence: This application is now dominated by ASICs (Application-Specific Integrated Circuits), custom-built machines that are vastly more efficient. A modern ASIC miner can perform 250 trillion hashes per second. At the 95 million hashes per second quoted above, that is equivalent to roughly 2.6 million graphics cards.
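The nonce search is simple to sketch with Python's standard `hashlib`. This is a deliberately simplified illustration: real Bitcoin mining hashes a structured block header with SHA-256 applied twice and a far stricter target, whereas the block data, nonce width, and 12-bit difficulty here are made up for the example.

```python
import hashlib


def mine(block_data: bytes, zero_bits: int, max_nonce: int = 1_000_000):
    """Try nonces in order until the SHA-256 digest starts with zero_bits zeros.

    Returns (nonce, hex digest), or None if no nonce below max_nonce works.
    """
    # Any digest numerically below this target has enough leading zero bits.
    target = 1 << (256 - zero_bits)
    for nonce in range(max_nonce):
        payload = block_data + nonce.to_bytes(8, "big")
        digest = hashlib.sha256(payload).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None


# Hypothetical input; 12 zero bits means three leading zero hex digits.
result = mine(b"example block", zero_bits=12)
if result is not None:
    nonce, digest_hex = result
    print(f"nonce {nonce} gives {digest_hex[:16]}...")
```

Each nonce is tested independently of every other, which is why the loop parallelizes so well: a GPU can give every thread its own range of nonces to try.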

Artificial Intelligence and Neural Networks

AI is a primary modern application for GPUs, directly utilizing the hardware's architecture.

• Core Operation: Neural networks and generative AI rely on trillions to quadrillions of matrix multiplication and addition operations.

• Tensor Core Function: The GPU's Tensor Cores are specifically designed for this task. They take three matrices, multiply the first two, and add the third, performing all calculations concurrently because all input data is available simultaneously. This makes GPUs the ideal hardware for accelerating AI workloads.
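The operation described above, D = A x B + C, can be written out directly. This plain-Python sketch computes the entries one after another on small made-up matrices; a Tensor Core performs the same arithmetic on fixed-size tiles with the entries computed concurrently.

```python
def matmul_add(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("A's column count must match B's row count")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("C must have the shape of A x B")
    # Each entry of D depends only on one row of A, one column of B,
    # and one entry of C, so all entries are independent of each other.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


# Small example matrices, chosen only for illustration.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
D = matmul_add(A, B, C)
print(D)
```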

 

