Memory latency is one of the factors impacting application performance. Join me on a quest to understand memory latency, how to measure it, and when it can be improved.
Posts like this are usually written in C/C++, but I’m going to stick to Go and occasionally bring in Rust for comparison, since these are the two languages I mostly use and I want to show you the tools available to you.
We’re going to mix top-down and bottom-up approaches in this series of blog posts. The goal of the series is to learn how to identify memory latency and act upon those signals.
CPU Primer
CPUs can store only a handful of values in their registers - for everything else they need to communicate with memory. There are different types of memory involved, each with different latencies:
Memory Type | Latency | Size |
---|---|---|
L1 Cache | 1-3 cycles | 32-64KB |
L2 Cache | 10-25 cycles | 256KB-1MB |
L3 Cache | 40-75 cycles | 8-32MB |
Main Memory (RAM) | 200-400 cycles | GBs |
SSD | 50,000+ cycles | TBs |
CPUs work in cycles. A CPU cycle, also known as a clock cycle or machine cycle, is the basic unit of time in a CPU: one tick of its internal clock. At 3 GHz a cycle lasts about 0.33 ns, so a 300-cycle trip to main memory costs roughly 100 ns.
The Instruction Pipeline
Without pipelining, these stages of instruction processing happen sequentially, in this order:
- Fetching - retrieving the instruction from memory
- Decoding - interpreting the instruction and determining required actions
- Executing - performing the operation specified by the instruction
- Memory Access (if needed) - fetching data from memory
- Write Back (if needed) - writing the result back to a register or memory
If each stage takes one clock cycle, a single instruction would take 5 cycles to complete. In a simple sequential model, processing 1000 instructions would take 5000 cycles.
How Pipelining Changes Everything
CPU pipelining transforms instruction processing from sequential to simultaneous. Instead of waiting for each instruction to complete entirely before starting the next one, the CPU divides instruction processing into discrete stages and works on several instructions at once, one per stage.
Here’s how it works: while one instruction is being executed (stage 3), the next instruction can be decoded (stage 2), and the one after that can be fetched (stage 1). This creates a pipeline where multiple instructions are “in flight” at different stages.
After the initial 5-cycle startup delay, the CPU completes one instruction per cycle instead of one every 5 cycles - a 5x improvement in throughput. Processing 1000 instructions now takes approximately 1004 cycles instead of 5000.
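As a quick sanity check on those numbers, here’s a toy model in Go - not a real pipeline simulator, just the two formulas implied by the description above (the stage count and function names are mine):

```go
package main

import "fmt"

const stages = 5 // fetch, decode, execute, memory access, write back

// cyclesSequential: every instruction passes through all five stages
// before the next one starts.
func cyclesSequential(instructions int) int {
	return instructions * stages
}

// cyclesPipelined: after the pipeline fills (5 cycles for the first
// instruction), one instruction completes every cycle.
func cyclesPipelined(instructions int) int {
	if instructions == 0 {
		return 0
	}
	return stages + (instructions - 1)
}

func main() {
	fmt.Println(cyclesSequential(1000)) // 5000
	fmt.Println(cyclesPipelined(1000))  // 1004
}
```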
Pipeline Hazards and Memory Interaction
The pipeline’s efficiency depends heavily on memory access patterns. Here are some of the hazards that can disrupt the smooth flow:
Data Hazards occur when one instruction needs data that isn’t ready yet.
Control Hazards happen with branching (if statements, loops). The CPU doesn’t know which instruction to fetch next until the branch condition is evaluated. Modern CPUs use branch prediction to guess the outcome and speculatively execute instructions, but mispredictions cost a lot of cycles (pipeline flushes).
CPUs use several techniques to prevent these hazards:
- Keeping frequently used data close to the CPU in caches (L1, L2, L3)
- Guessing what data might be needed next and fetching it early (prefetching)
- Rearranging instructions to avoid stalls (out-of-order execution)
Example of a pipeline stall
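The original snippet isn’t reproduced here, but a minimal Go sketch of a data-hazard stall could look like this (function names are mine, purely illustrative). A single accumulator forms one long dependency chain; splitting it into independent accumulators gives the out-of-order core work it can overlap:

```go
package main

import "fmt"

// dependentSum is one long dependency chain: every addition needs the
// previous value of sum, so later iterations cannot start early.
func dependentSum(xs []int) int {
	sum := 0
	for _, x := range xs {
		sum += x
	}
	return sum
}

// splitSum keeps four independent accumulators, so while one addition is
// still in flight the CPU can already work on the other three.
func splitSum(xs []int) int {
	var s0, s1, s2, s3 int
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	for ; i < len(xs); i++ { // leftover elements
		s0 += xs[i]
	}
	return s0 + s1 + s2 + s3
}

func main() {
	xs := make([]int, 1_000_000)
	for i := range xs {
		xs[i] = 1
	}
	fmt.Println(dependentSum(xs), splitSum(xs)) // same result, different dependency shape
}
```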
Memory Latency
Let’s explore ways to identify memory latencies in Go.
We have two implementations of a binary search tree. The first is recursive and memory-unaware; the second is memory-aware, keeping its data in a contiguous block and accessing it in predictable patterns.
The full implementation can be found here: https://github.com/Elvis339/compiler.rs/blob/main/code/memory-and-performance/cmd/memaccess/memaccess.go
Pointer-Based BST
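The real code lives in the repository linked above; the sketch below only captures the shape of the idea (type and function names are mine, not the repo’s):

```go
package bst

// Node is the classic pointer-based BST node. Every node is its own heap
// allocation, so the tree ends up scattered across memory.
type Node struct {
	Value       int
	Left, Right *Node
}

// Insert walks the tree recursively, chasing a pointer at every level.
func Insert(n *Node, v int) *Node {
	if n == nil {
		return &Node{Value: v}
	}
	if v < n.Value {
		n.Left = Insert(n.Left, v)
	} else if v > n.Value {
		n.Right = Insert(n.Right, v)
	}
	return n
}

// Search follows the same pointer chain recursively; every hop can be a
// cache miss, because the next node's address is only known after the
// current node has been loaded.
func Search(n *Node, v int) bool {
	if n == nil {
		return false
	}
	if v == n.Value {
		return true
	}
	if v < n.Value {
		return Search(n.Left, v)
	}
	return Search(n.Right, v)
}
```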
Contiguous BST
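Again a hedged sketch rather than the repo’s code. One way to end up with exactly two allocations and a much larger, preallocated footprint (as the table in the next section shows) is an implicit array tree: the children of slot i live at 2i+1 and 2i+2, so moving around the tree is index arithmetic on two slices instead of pointer chasing:

```go
package bst

// ContiguousBST keeps the whole tree in two preallocated slices, so the
// structure occupies one contiguous block of memory (2 allocations total).
type ContiguousBST struct {
	values []int
	used   []bool
}

// NewContiguousBST preallocates room for a tree of a given depth;
// capacity should be 2^depth - 1 for a full implicit tree.
func NewContiguousBST(capacity int) *ContiguousBST {
	return &ContiguousBST{
		values: make([]int, capacity),
		used:   make([]bool, capacity),
	}
}

// Insert descends by index arithmetic: left child at 2i+1, right at 2i+2.
func (t *ContiguousBST) Insert(v int) bool {
	i := 0
	for i < len(t.values) {
		if !t.used[i] {
			t.values[i] = v
			t.used[i] = true
			return true
		}
		switch {
		case v < t.values[i]:
			i = 2*i + 1
		case v > t.values[i]:
			i = 2*i + 2
		default:
			return true // duplicate, nothing to do
		}
	}
	return false // the path is deeper than the preallocated capacity
}

// Search uses the same arithmetic; the next index never depends on a load
// from an unpredictable address.
func (t *ContiguousBST) Search(v int) bool {
	i := 0
	for i < len(t.values) && t.used[i] {
		switch {
		case v == t.values[i]:
			return true
		case v < t.values[i]:
			i = 2*i + 1
		default:
			i = 2*i + 2
		}
	}
	return false
}
```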
Analysis
This benchmark compares two binary search tree implementations: a traditional pointer-based approach versus a contiguous array-based approach (a sketch of such a benchmark follows the table below). The contiguous implementation cuts the time per operation by about 53% despite using significantly more memory.
Implementation | Avg Time/Op | Memory Used | Allocations | Performance Gain |
---|---|---|---|---|
PointerBST | 279.54 ms | ~600 KB | ~25,000 | Baseline |
ContiguousBST | 128.91 ms | ~9 MB | 2 | 53% faster |
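The exact benchmark isn’t shown here; assuming the sketch types from above (Node, Insert, Search), a benchmark of roughly this shape, placed in a _test.go file, is how such numbers are usually produced:

```go
package bst

import (
	"math/rand"
	"testing"
)

// BenchmarkPointerBST builds a pointer-based tree from shuffled keys and
// then looks every key up again; the contiguous variant is benchmarked
// the same way.
func BenchmarkPointerBST(b *testing.B) {
	keys := rand.Perm(25_000) // shuffled keys keep the tree reasonably balanced
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var root *Node
		for _, k := range keys {
			root = Insert(root, k)
		}
		for _, k := range keys {
			Search(root, k)
		}
	}
}
```

Running it with go test -bench . -benchmem reports the time per operation, bytes allocated, and allocation counts that fill in columns like the ones above.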
Part of the ~2.17x improvement comes from eliminating recursive calls - setting up and tearing down a function call isn’t free. On top of that, as previously described, the contiguous version benefits from much better spatial locality. Let’s dive deeper into the specific performance factors:
Modern CPUs fetch memory in cache lines (typically 64 bytes). When you request one byte, the CPU automatically fetches the entire cache line:
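As a rough illustration, assuming 64-byte lines and 8-byte integers, sequential access pays for one memory fetch per eight elements, while an 8-element stride pays for one fetch per element (the function names are mine):

```go
package cacheline

// sumSequential touches every element in memory order: loading xs[0]
// brings in the 64-byte line that also holds xs[1]..xs[7], so seven of
// every eight reads are served straight from L1.
func sumSequential(xs []int64) int64 {
	var sum int64
	for i := 0; i < len(xs); i++ {
		sum += xs[i]
	}
	return sum
}

// sumEveryEighth reads one int64 per 64-byte line (a stride of 8), so
// almost every iteration waits for a fresh cache-line fill.
func sumEveryEighth(xs []int64) int64 {
	var sum int64
	for i := 0; i < len(xs); i += 8 {
		sum += xs[i]
	}
	return sum
}
```

The strided version does an eighth of the additions, yet it typically runs nowhere near eight times faster, because it still touches just as many cache lines.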
CPUs have automatic hardware prefetchers that detect sequential access patterns:
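A sketch of the kind of pattern the prefetcher can and can’t follow (again, illustrative function names; in a real benchmark you’d build the permutation outside the timed section):

```go
package prefetch

import "math/rand"

// sumInOrder walks memory in address order; the hardware prefetcher
// recognizes the stride and starts pulling upcoming cache lines in
// before the loop asks for them.
func sumInOrder(xs []int64) int64 {
	var sum int64
	for _, x := range xs {
		sum += x
	}
	return sum
}

// sumShuffled reads the same elements through a random permutation of
// indices; the next address is unpredictable, so the prefetcher cannot
// help and most reads pay the full trip to memory.
func sumShuffled(xs []int64) int64 {
	var sum int64
	for _, i := range rand.Perm(len(xs)) {
		sum += xs[i]
	}
	return sum
}
```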
Tools
Unfortunately, Go’s pprof tool doesn’t provide clear insights into memory access patterns for this type of performance analysis. For deeper memory latency investigation, we need to examine the actual CPU instructions generated.
While perf on Linux provides excellent memory profiling capabilities, we’ll use assembly analysis (for example via go tool objdump or go build -gcflags=-S) to understand the performance differences:
Assembly Reveals the Truth
The ARM assembly output shows exactly why the array BST is faster:
Pointer BST - random memory access: each step loads a child pointer from the current node, and the address of the next node isn’t known until that load completes, so the CPU waits on a chain of dependent loads.
Array BST - predictable arithmetic: the next index is computed with a shift and an add on values already in registers, so the address of the next element is known immediately and the load can be issued right away.
The key insight the assembly proves is that memory layout affects the fundamental CPU instructions generated. Better spatial locality translates directly into more efficient machine code.
Conclusion
Modern CPUs are incredibly fast, but their performance depends on keeping the instruction pipeline fed with data. Modern hardware and compilers have techniques to avoid pipeline stalls, but as software engineers we must understand the abstractions below us and help with the optimization.
- CPUs access data through multiple cache levels (L1, L2, L3) before reaching main memory. Understanding this hierarchy helps explain why some code runs faster than others.
- Accessing nearby memory locations allows hardware prefetchers to predict patterns and keep data flowing to the CPU. Sequential array access leverages cache lines efficiently, while scattered pointer chasing defeats these optimizations entirely. This is why, in specific cases, arrays outperform maps.
Being memory-aware means:
- Choosing data structures that promote spatial locality (arrays vs linked lists)
- Understanding how your data layout affects cache behavior
- Recognizing that algorithmic complexity isn’t everything - you must factor memory access patterns into the mix
Tools for analysis: While profilers like Go’s pprof show where time is spent, they don’t reveal memory access patterns. Assembly analysis helps understand the actual CPU instructions generated. On macOS, Instruments provides memory profiling capabilities, though it’s more valuable for higher-level analysis than micro-optimizations.
Our BST comparison proved that identical algorithms with different memory layouts produce fundamentally different performance characteristics.
What’s next?
In the following blog post series, we’ll analyze Go’s experimental Green Tea garbage collector, which applies these same spatial locality principles to automatic memory management.