The GC Summary Report Wasn't Wrong - It Just Wasn't Complete

Share on:

Background

My old article compared Virtual Thread based implementations of Spring Core Reactor and JDK 21, building on the comparative analysis before it. Both articles measured the usual things - total processing time, memory footprint, GC pauses, CPU time - and both relied on the summary report generated by GCEasy.

The conclusion from that data was straightforward: JDK based implementation requires more GC activity and a heavier memory footprint, but this does not have any significant impact on application performance.

That conclusion still holds. But a summary report only tells you what happened in aggregate - total pause time, peak memory, total CPU seconds. It doesn't tell you what was happening collection by collection underneath those totals. So this article goes one level deeper. I went back to the raw GC logs behind that comparison - the same 500,000 object run, JDK based implementation vs Spring Core Reactor based implementation, both on Virtual Threads - and walked through them region by region instead of relying on the summary charts alone.

I'll cover:

  1. The experiment setup (same as before, for context)
  2. What the summary numbers say
  3. What the raw log says once you walk it collection by collection
  4. Why the difference matters once you move beyond a one-shot benchmark
  5. The structured prompt I used to do this analysis, and where it needed correction

1. The experiment

Same workload, same constraints as my previous two articles:

  1. Read 500,000 objects from a list
  2. For each object, perform IO-intensive operations by invoking downstream APIs
  3. Each object's processing is independent of the others
  4. Both implementations run on Virtual Threads (JDK 21, Project Loom)

Two implementations under comparison:

  • JDK based: Executors.newThreadPerTaskExecutor with a virtual thread factory, tasks submitted via CompletableFuture.runAsync
  • Spring Core Reactor based: Flux.flatMap with Schedulers.fromExecutor, using the same virtual thread executor

Both ran on the same machine. Same JDK 21.0.1. Same G1 collector, default settings. The only variable is the concurrency model - how each framework schedules and manages the work in flight. The actual business logic per object is identical.

This time, instead of feeding the GC logs into GCEasy and reading the resulting charts, I kept the raw .log files and went through them directly.

2. What the summary numbers say

Here's the aggregate view - the kind of comparison both my previous articles were built on:

MetricJDK + Virtual ThreadsSpring Reactor + Virtual Threads
Total runtime~32.5s~61.0s
Total Young GC pauses2749
Total GC pause time1,599ms126ms
Average pause58.8ms2.55ms
Max single pause139.6ms10.9ms
Peak heap capacity reached6,444M (touching 7,620M during the mark cycle)508M (settles down to 188M)

Read at this level, the story is simple: JDK finishes the job almost twice as fast. It pays for that with 12 times more GC pause time and a heap that grows considerably larger. Going by total numbers alone, that trade-off looks reasonable - finish faster, spend a bit more on garbage collection, no real harm done.

This is exactly the conclusion my earlier articles arrived at. And at the level of these numbers, it's correct.

3. What the raw log says

The summary numbers tell you what happened in total. They don't tell you how the heap behaved while it was happening. For that, you need to walk the Old Gen region counts collection by collection - not just before/after totals, but the actual region growth pattern logged at every single GC event.

Here's what that looks like for Spring Reactor:

1GC(0)  Old: 0->1    regions  (0M -> 4M)
2GC(1)  Old: 1->6    regions  (4M -> 24M)
3GC(2)  Old: 6->8    regions  (24M -> 32M)
4GC(3)  Old: 8->8    regions  (32M -> 32M)
5GC(4)  Old: 8->8    regions  (32M -> 32M)
6...
7GC(20) Old: 9->10   regions  (36M -> 40M)
8...
9GC(49) Old: 9->9    regions  (36M -> 36M)

Old Gen grows for the first 2 collections, then settles down and stays at roughly 32–40 regions for the remaining 47 collections in the entire run. Once it reaches that point, it barely moves again. Almost everything Spring Reactor allocates dies inside Young Gen and never makes it to Old Gen at all - which is exactly the behaviour G1 is designed around.

Now here's the same view for JDK:

1GC(0)  Old: 0->1      regions  (0M -> 4M)
2GC(5)  Old: 22->28    regions  (88M -> 112M)
3GC(10) Old: 95->130   regions  (380M -> 520M)
4GC(15) Old: 298->347  regions  (1192M -> 1388M)
5GC(20) Old: 615->670  regions  (2460M -> 2680M)
6GC(27) Old: 985->1053 regions  (3940M -> 4212M)

There's no settling point anywhere in this run. Old Gen climbs from 4M to 4,212M, and it's still climbing at the very last collection logged. There's no GC(X) where the before and after numbers stop moving. It just keeps growing for the entire 32.5 seconds.

That's the part the summary chart doesn't show you. JDK isn't finishing faster because it's collecting garbage more efficiently. It's finishing faster because it keeps retaining objects in Old Gen and deferring the cost of cleaning them up. The pause time numbers in the summary table are real, but they're not the full cost of what's happening on the heap - they're the cost paid so far, up to the point where the benchmark happened to stop.

4. Why this matters beyond a single shot benchmark

500,000 objects, processed once, start to finish, in 32.5 seconds - that's a finite batch job. It has a clean exit. The moment the last object is processed, the JVM shuts down and whatever was sitting in Old Gen never has to be dealt with.

A long-running service doesn't get that exit. If this same allocation pattern kept running - request after request, batch after batch, hour after hour - Old Gen wouldn't get to stop climbing just because the demo ended. It would keep growing until G1 has no choice but to run a full collection, or worse, until the heap runs out of room to grow into.

Look again at the JDK growth numbers. By GC(27), Old Gen is sitting at 4,212M out of a heap that's already expanded to 6,444M, with a concurrent mark cycle starting up right after. That mark cycle is G1's way of saying it's starting to feel the pressure. In a batch job that ends a few seconds later, that pressure never gets tested. In a service that keeps running, it would.

This is the distinction that matters for anyone deciding whether to carry this comparison into a production decision. "GC overhead doesn't significantly affect performance" is true for this benchmark, as written, for this duration, for this object count & for this workload. It is not automatically true for a service running this same code continuously. Those are two different claims, and a summary chart alone won't tell you which one you're actually looking at.

I should be clear about something here: my original conclusion across both articles wasn't wrong. For the workload as tested - a finite run, processing a fixed list once - JDK's higher GC cost genuinely doesn't hurt. What changes with this deeper look is the scope of that conclusion. It holds for a batch run. It needs separate validation before you apply it to a long-running service carrying the same allocation pattern hour after hour.

5. The structured prompt behind this analysis

Walking 27 and 49 GC events by hand, region by region, across two logs, is tedious but not difficult. What made it faster - and more thorough than I'd have bothered to be on my own - was running both logs through a structured prompt designed specifically to look for this kind of pattern, rather than reading the summary numbers back to me.

The prompt asked for five specific things:

  1. Walk the Old Gen region growth across the full timeline for each log. Identify whether it plateaus, and if so, at which collection number
  2. For any log that doesn't plateau: based on the growth rate shown, estimate what happens if the run continued for 2x or 5x the observed duration
  3. Determine whether the implementation with better summary numbers is actually collecting garbage more efficiently, or whether it's deferring cleanup past wherever the benchmark happens to end
  4. Point out anything about the benchmark's design - a finite run, a fixed object count - that could make a summary-level conclusion misleading if carried over to a different deployment context
  5. State plainly whether the raw data supports, partially supports, or contradicts the original summary-level conclusion

The first pass at this didn't go particularly well - Final version of prompt got shaped up based on multiple refinement as it was constantly leaning towards obvious and known findings

Once I asked specifically about Plateaus , the useful part came out - Finally!

A few things it couldn't tell me, and that I had to work out separately:

  • Whether a single run is sufficient, or whether I should be averaging across repeated runs is still unclear.
  • Whether 500,000 objects completing in 32–61 seconds resembles anything close to a realistic production workload - a burst, a steady stream, or something else entirely. That context has to come from me, not from the log.

Conclusion

A GCEasy summary tells you what happened in total - pause time, peak memory, GC count. It's a perfectly reasonable place to start, and for a single-shot benchmark, it can be enough to draw conclusion from.

What it can't tell you is whether the heap behind those numbers ever reached a steady state, or whether it was still climbing when the benchmark happened to stop. That distinction only shows up if you walk the raw log.

For this specific comparison: JDK finishes faster, and that finding stands. But it gets there by allowing Old Gen grow without a plateau for the entire run, while Spring Reactor reaches a steady state within the first 3 collections and stays there for the remaining 47. Whether that trade-off is acceptable depends entirely on your use case and workload - whether you're running this once or running it forever.

P.S. - The full prompt used for this analysis is available in my se-ai-templates repository on GitHub.

comments powered by Disqus