For the sake of the article length, I won’t walk through running every command here.

Instead, I’ve documented everything to make it seamless for you to reproduce these results yourself. I encourage you to run the profiling tools and discover the patterns firsthand.

All code, scripts, and detailed instructions are available here.

Recap

  • In the first blog post, we explored the effects of thoughtful memory layout. The array-based BST was 53% faster than the pointer-based version despite using more memory, because sequential access leverages CPU cache lines and predictable access patterns let hardware prefetchers work effectively.
  • In the previous blog post, we learned what garbage collection is and discovered that Go’s current GC suffers from the same problem: it spends 85% of its time chasing scattered pointers through memory, with over 35% of CPU cycles stalled on memory accesses, confirming our findings from the first post about spatial locality.
  • Now, we’re going to explore how to identify when GC is bottlenecking your algorithms using real profiling tools like pprof and Instruments, and demonstrate how Green Tea’s span-based scanning approach could solve these memory access problems.

Reading GC Traces

We have a graph implementation that creates 2M random nodes and performs a BFS that visits every node. Running make run EXEC=graph sets GODEBUG=gctrace=1, which tells the Go runtime to print a trace line for every garbage collection cycle.
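
To make the workload concrete, here is a minimal sketch of the kind of pointer-based graph this exercise uses; the Node type, the edge count, and the createGraph helper are illustrative assumptions rather than the repository’s exact code:

package main

import "math/rand"

// Node is a pointer-based graph node: every edge is a pointer the garbage
// collector has to chase when it marks live objects.
type Node struct {
    ID    int
    Edges []*Node
}

// createGraph allocates count nodes and wires up a few random edges per node.
func createGraph(count int) []*Node {
    nodes := make([]*Node, count)
    for i := range nodes {
        nodes[i] = &Node{ID: i}
    }
    for _, n := range nodes {
        for j := 0; j < 4; j++ { // 4 edges per node is an arbitrary choice
            n.Edges = append(n.Edges, nodes[rand.Intn(count)])
        }
    }
    return nodes
}

// bfs walks the graph from start and returns how many nodes it visited.
func bfs(start *Node) int {
    visited := map[int]bool{start.ID: true}
    queue := []*Node{start}
    count := 0
    for head := 0; head < len(queue); head++ {
        n := queue[head]
        count++
        for _, next := range n.Edges {
            if !visited[next.ID] {
                visited[next.ID] = true
                queue = append(queue, next)
            }
        }
    }
    return count
}

func main() {
    nodes := createGraph(2_000_000)
    _ = bfs(nodes[0])
}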

Current Go GC trace output:

gc 1 @0.001s 20%:                    # gc 1 - First garbage collection cycle, occurred 0.001 seconds after program start, 20% of total program time spent in GC so far
   Clock: 0.023+9.3+0.17 ms          # 0.023ms - STW (stop-the-world) sweep termination, 9.3ms - concurrent marking phase, 0.17ms - STW mark termination
   CPU: 0.19+0.22/17/0.26+1.4 ms     # 0.19ms - STW sweep CPU time, 0.22ms - mutator assist, 17ms - background workers, 0.26ms - idle GC time, 1.4ms - STW mark CPU time
   Heap: 15->20->19 MB, 15 MB goal   # 15MB - heap size at GC start, 20MB - heap size at GC end, 19MB - live heap after GC, 15MB - heap goal for this cycle
    
gc 2 @0.024s 14%:
    Clock: 0.008+9.3+0.016 ms
    CPU: 0.069+2.1/18/1.6+0.13 ms
    Heap: 35->41->41 MB, 39 MB goal
    
gc 3 @0.055s 10%:
    Clock: 0.012+8.2+0.027 ms
    CPU: 0.10+0.049/16/34+0.22 ms
    Heap: 74->83->82 MB, 82 MB goal
    
gc 4 @0.179s 6%:
    Clock: 0.014+26+0.014 ms
    CPU: 0.11+0.037/53/130+0.11 ms
    Heap: 140->144->127 MB, 165 MB goal
    
gc 5 @0.415s 7%:
    Clock: 0.014+107+0.011 ms
    CPU: 0.11+1.0/213/497+0.094 ms
    Heap: 218->228->184 MB, 256 MB goal

Green Tea trace output

Running make run EXEC=graphx also sets GODEBUG=gctrace=1 but runs the binary built with the Green Tea garbage collector.

Trace output:

gc 1 @0.001s 6%:                     # gc 1 - First garbage collection cycle, occurred 0.001 seconds after program start, 6% of total program time spent in GC so far
   Clock: 0.015+0.29+0.074 ms        # 0.015ms - STW sweep termination, 0.29ms - concurrent marking phase, 0.074ms - STW mark termination
   CPU: 0.12+0/0.29/0+0.59 ms        # 0.12ms - STW sweep CPU time, 0ms - mutator assist, 0.29ms - background workers, 0ms - idle GC time, 0.59ms - STW mark CPU time
   Heap: 15->15->15 MB, 15 MB goal   # 15MB - heap size at GC start, 15MB - heap size at GC end, 15MB - live heap after GC, 15MB - heap goal for this cycle

gc 2 @0.015s 5%:
    Clock: 0.009+3.7+0.043 ms
    CPU: 0.075+0.048/7.2/9.3+0.35 ms
    Heap: 30->33->32 MB, 31 MB goal
    
gc 3 @0.035s 7%:
    Clock: 0.015+7.9+0.15 ms
    CPU: 0.12+0.66/14/21+1.2 ms
    Heap: 58->65->64 MB, 65 MB goal
    
gc 4 @0.078s 5%:
    Clock: 0.019+7.5+0.031 ms
    CPU: 0.15+0.045/14/35+0.25 ms
    Heap: 112->113->110 MB, 128 MB goal

gc 5 @0.352s 3%:
    Clock: 0.020+40+0.029 ms
    CPU: 0.16+1.3/80/181+0.23 ms
    Heap: 194->197->156 MB, 220 MB goal

The same implementation produces different performance characteristics simply by changing the garbage collection algorithm.

  • Current Go’s GC starts high, with 20% of total program time spent in GC, before settling down to the 4-7% range
  • Green Tea starts lower and stays consistently low, in the 2-7% range

Peaks

Metric                Current GC Peak    Green Tea Peak    Difference
Worst Marking Time    107ms (GC 5)       40ms (GC 5)       63% lower
Highest GC %          20% (GC 1)         7% (GC 3)         65% lower
Peak CPU Overhead     497ms (GC 5)       181ms (GC 5)      64% reduction
Max Heap Growth       228MB (15→228)     197MB (15→197)    14% more efficient

What If You Can’t Change the GC Algorithm?

We can still achieve significant improvements by designing our data structures and allocation patterns to be GC-friendly.

We have a compact graph implementation that uses sync.Pool to reuse temporary allocations (the BFS queue and the visited map). This improves GC performance because it eliminates repeated allocations in the hot path: instead of creating new slices and maps on every BFS call, we reuse pre-allocated objects from the pool, which dramatically reduces allocation pressure and gives the garbage collector fewer objects to track and clean up.
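
As a rough sketch of that pattern, here is what pooling the BFS scratch space with sync.Pool can look like; the bfsScratch type and the compactGraph adjacency-list layout are assumptions for illustration, not the repository’s exact code:

package main

import "sync"

// compactGraph is a simplified stand-in: adjacency lists keyed by node index,
// stored as slices of plain integers instead of pointers.
type compactGraph struct {
    edges [][]int32
}

// bfsScratch bundles the temporary allocations a single BFS call needs.
type bfsScratch struct {
    queue   []int32
    visited map[int32]bool
}

var scratchPool = sync.Pool{
    New: func() any {
        return &bfsScratch{
            queue:   make([]int32, 0, 1024),
            visited: make(map[int32]bool, 1024),
        }
    },
}

// bfs visits every node reachable from start, reusing pooled scratch space
// instead of allocating a fresh queue and visited map on every call.
func (g *compactGraph) bfs(start int32) int {
    s := scratchPool.Get().(*bfsScratch)
    defer func() {
        s.queue = s.queue[:0] // keep the backing array, drop the contents
        clear(s.visited)
        scratchPool.Put(s)
    }()

    s.queue = append(s.queue, start)
    s.visited[start] = true
    count := 0
    for head := 0; head < len(s.queue); head++ {
        n := s.queue[head]
        count++
        for _, next := range g.edges[n] {
            if !s.visited[next] {
                s.visited[next] = true
                s.queue = append(s.queue, next)
            }
        }
    }
    return count
}

func main() {
    g := &compactGraph{edges: [][]int32{{1, 2}, {2}, {0}}}
    _ = g.bfs(0)
}

Note that sync.Pool gives no guarantee an object survives a garbage collection, so this is an optimization for hot loops rather than a cache; in a tight BFS benchmark it removes almost all of the per-call allocations.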

Compact GC Trace

gc 1 @0.000s 23%: 0.007+8.8+0.014 ms clock, 0.061+0.076/17/16+0.11 ms cpu, 62->68->68 MB, 62 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @0.030s 8%: 0.025+3.6+0.010 ms clock, 0.20+0.11/6.5/13+0.084 ms cpu, 124->130->130 MB, 137 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @0.216s 2%: 0.032+10+0.009 ms clock, 0.26+0/20/0.10+0.077 ms cpu, 246->253->194 MB, 261 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @0.601s 1%: 0.034+9.5+0.025 ms clock, 0.27+0/14/5.1+0.20 ms cpu, 375->375->245 MB, 390 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 5 @1.820s 0%: 0.064+0.36+0.006 ms clock, 0.51+0/0.25/0.33+0.053 ms cpu, 371->371->64 MB, 490 MB goal, 0 MB stacks, 0 MB globals, 8 P (forced)

Pointer Chasing Implementation

gc 1 @0.000s 18%: 0.027+2.8+0.011 ms clock, 0.22+0.12/5.1/4.7+0.094 ms cpu, 16->19->18 MB, 17 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @0.012s 9%: 0.033+3.8+0.011 ms clock, 0.26+0.047/7.1/13+0.092 ms cpu, 33->36->36 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @0.031s 8%: 0.036+6.0+0.026 ms clock, 0.29+0.25/11/27+0.21 ms cpu, 63->70->69 MB, 73 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @0.092s 6%: 0.049+15+0.036 ms clock, 0.39+0.081/31/76+0.28 ms cpu, 119->121->115 MB, 140 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 5 @0.330s 7%: 0.031+89+0.024 ms clock, 0.25+1.1/177/428+0.19 ms cpu, 196->205->166 MB, 230 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 6 @0.802s 7%: 0.063+155+0.025 ms clock, 0.50+0/311/798+0.20 ms cpu, 283->318->241 MB, 332 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 7 @1.426s 6%: 0.037+156+0.019 ms clock, 0.29+0/312/780+0.15 ms cpu, 436->436->256 MB, 483 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 8 @2.780s 3%: 0.041+0.21+0.004 ms clock, 0.33+0/0.19/0.040+0.037 ms cpu, 353->353->0 MB, 513 MB goal, 0 MB stacks, 0 MB globals, 8 P (forced)

Looking at the GC traces, the compact version runs three fewer GC cycles (5 vs. 8), but otherwise the traces look similar.

Benchmarks should be your first resort for performance indicators. On my machine, these benchmarks show a difference of about 477,828,000 ns/op in favor of the compact implementation, which translates to roughly 478ms faster per operation.
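
For reference, the benchmarks behind these numbers look roughly like the sketch below; the constructor names and signatures are assumptions standing in for the repository’s actual code:

package main

import "testing"

// BenchmarkCompactBFS times BFS over the slice-backed graph. The graph is
// built once, outside the timed loop.
func BenchmarkCompactBFS(b *testing.B) {
    g := createCompactGraph(2_000_000) // hypothetical constructor
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        g.bfs(0)
    }
}

// BenchmarkPointerBFS times BFS over the pointer-based graph.
func BenchmarkPointerBFS(b *testing.B) {
    nodes := createGraph(2_000_000) // hypothetical constructor
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        bfs(nodes[0])
    }
}

Running go test -bench=BFS -benchmem also reports allocations per operation, which makes the difference in allocation pressure visible alongside the timing.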

Pprof Analysis

If we run go tool pprof traces/graph_compact_cpu.pprof in one terminal and go tool pprof traces/graph_ptr-chasing_cpu.pprof in another, we can compare the two CPU profiles side by side.
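
If you want to capture such profiles yourself, the standard runtime/pprof package can wrap the workload; this is a minimal sketch under the assumption that the repository’s make targets do something similar, and runWorkload is a placeholder for building the graph and running BFS:

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Write the CPU profile next to the other traces.
    f, err := os.Create("traces/graph_compact_cpu.pprof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    runWorkload()
}

// runWorkload is a placeholder for the actual graph construction and traversal.
func runWorkload() {}

Inside the interactive pprof prompt, top10 lists the hottest functions, list bfs shows per-line costs, and web renders a call graph.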

Left Terminal (Compact Implementation):

(pprof) top10
Showing nodes accounting for 1340ms, 90.07% of 1510ms total
Showing top 10 nodes out of 67
flat  flat%   sum%        cum   cum%
370ms 25.83% 25.83%      480ms 31.79%  runtime.mapaccess1_fast64
290ms 19.21% 45.03%      290ms 19.21%  runtime.(*bmap).overflow (inline)
240ms 15.89% 60.93%     1140ms 75.50%  main.(*compactGraph).bfs
150ms  9.93% 70.86%      150ms  9.93%  runtime.madvise
80ms  5.30% 76.16%      120ms  7.95%  runtime.scanobject
50ms  3.31% 79.47%      370ms 25.83%  runtime.mapassign_fast64
50ms  3.31% 82.78%       50ms  3.31%  runtime.memclrNoHeapPointers
40ms  2.65% 85.43%       40ms  2.65%  runtime.heapBitsSetType
40ms  2.65% 88.08%       40ms  2.65%  runtime.pthread_cond_signal
30ms  1.99% 90.07%       30ms  1.99%  main.createCompactGraph

Right Terminal (Pointer-Chasing Implementation):

(pprof) top10
Showing nodes accounting for 3720ms, 79.49% of 4680ms total
Dropped 57 nodes (cum <= 23.40ms)
Showing top 10 nodes out of 83
flat  flat%   sum%        cum   cum%
940ms 20.09% 20.09%      940ms 20.09%  runtime.madvise
570ms 12.18% 32.26%      760ms 14.96%  runtime.mapaccess1_fast64
530ms 11.32% 43.59%     1350ms 32.85%  main.bfs
530ms 11.32% 54.91%     1980ms 40.60%  runtime.scanobject
390ms  8.33% 63.25%      390ms  8.33%  runtime.greyobject
250ms  5.34% 68.59%      250ms  5.34%  runtime.(*mspan).heapBitsSmallForAddr
160ms  3.42% 72.01%      160ms  3.42%  runtime.(*lfstack).pop
130ms  2.78% 74.79%      130ms  2.78%  runtime.(*bmap).overflow (inline)
110ms  2.35% 77.14%      600ms  8.55%  runtime.getempty
110ms  2.35% 79.49%      110ms  2.35%  runtime.memclrNoHeapPointers

Takeaway

The pointer-chasing implementation shows performance degradation compared to the compact implementation:

  • 20.09% of total execution spent on the madvise syscall, indicating the GC is working hard to manage fragmented memory
  • 40.60% of total execution spent on runtime.scanobject (object scanning during GC marking)
  • 32.85% of total execution spent on algorithm (main.bfs)

In contrast, the compact implementation shows:

  • 9.93% spent on runtime.madvise (50% reduction)
  • 5.30% spent on runtime.scanobject (87% reduction)
  • 75.50% spent in the actual algorithm (main.(*compactGraph).bfs)

Next

We started somewhere in the middle, then went up the stack (pun intended). Now we’re going to go a bit deeper and explore virtual memory, address translation, and memory fragmentation. Understanding these lower-level concepts should complete the picture.