I Measured Neural Network Training Every 5 Steps for 10,000 Iterations


I thought I knew how neural networks learned. Train them, watch the loss go down, save checkpoints every epoch. Standard workflow. Then I measured training dynamics at 5-step intervals instead of at epoch level, and everything I thought I knew fell apart.

The question that began this journey: does a neural network's capacity expand during training, or is it fixed from initialization? Until 2019, we all assumed the answer was obvious: parameters are fixed, so capacity must be fixed too. But Ansuini et al. discovered something that shouldn't be possible: the effective representational dimensionality expands during training. Yang et al. confirmed it in 2024.

This changes everything. If the learning space expands as the network learns, how can we mechanistically understand what it's actually doing?

High-Frequency Training Checkpoints

When we train a DNN for 10,000 steps, we typically save checkpoints every 100 or 200 steps. Measuring at 5-step intervals generates far more data and is harder to manage, but these high-frequency checkpoints reveal very valuable information about how a DNN learns.

High-frequency checkpoints provide information about:

  • Whether early training mistakes can be recovered from (they often can't)
  • Why some architectures work and others fail
  • When interpretability analysis should occur (spoiler: way sooner than we thought)
  • How to design better training approaches

During an applied research project I measured DNN training at high resolution: every 5 steps instead of every 100 or 500. I used a basic MLP architecture with the same dataset I've been using for the last 10 years.

Figure 1: Experimental setup. We detect discrete transitions using z-score analysis with rolling statistics.
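Since the caption only names the method, here is a minimal sketch of what z-score transition detection with rolling statistics can look like, assuming one effective-dimensionality reading per 5-step checkpoint. The function name, window size, and threshold are illustrative, not the exact values from the experiment.

```python
import numpy as np

def detect_transitions(dim_series, window=20, z_threshold=3.0):
    """Flag checkpoints whose step-to-step dimensionality change is an outlier
    relative to the rolling mean/std of recent changes."""
    deltas = np.diff(np.asarray(dim_series, dtype=float))
    transitions = []
    for t in range(window, len(deltas)):
        recent = deltas[t - window:t]              # rolling window of past changes
        mu, sigma = recent.mean(), recent.std()
        if sigma == 0:
            continue                               # flat region, nothing to flag
        if abs((deltas[t] - mu) / sigma) > z_threshold:
            transitions.append(t + 1)              # index into the checkpoint series
    return transitions

# Checkpoints are 5 steps apart, so convert indices to training steps:
# transition_steps = [5 * i for i in detect_transitions(dim_series)]
```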

The results were surprising. Deep neural networks, even simple architectures, expand their effective parameter space during training. I had assumed this space was predetermined by the architecture itself. Instead, DNNs undergo discrete transitions: small jumps that increase the effective dimensionality of their learning space.

Figure 2: Effective dimensionality of activation patterns during training, measured using stable rank. Three distinct phases emerge: an initial collapse (steps 0-300) where dimensionality drops from 2500 to 500, an expansion phase (steps 300-5000) where dimensionality climbs to 1000, and stabilization (steps 5000-8000) where dimensionality plateaus. This suggests steps 0-2000 constitute a qualitatively distinct developmental window. Image by author.

Figure 2 shows the evolution of activation effective dimensionality during training. The transitions concentrate in the first 25% of training and are hidden at coarser checkpoint intervals (100-1000 steps); high-frequency checkpointing (every 5 steps) is needed to detect most of them. The curve also shows an interesting shape. The initial collapse represents loss landscape restructuring, where random initialization gives way to a task-aligned structure. Then comes an expansion phase with gradual dimensionality growth. Finally, a stabilization phase reflects the architecture's capacity limits.
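For concreteness, here is a minimal sketch of the stable-rank measurement that Figures 2 and 3 use as the effective-dimensionality proxy, assuming activations from one layer are collected into a (batch × hidden) matrix on a fixed probe batch; variable names are illustrative.

```python
import torch

def stable_rank(matrix: torch.Tensor) -> float:
    """Stable rank = ||A||_F^2 / ||A||_2^2, a smooth surrogate for matrix rank."""
    A = matrix.detach().float()
    fro_sq = torch.linalg.matrix_norm(A, ord="fro") ** 2   # sum of squared singular values
    spec_sq = torch.linalg.matrix_norm(A, ord=2) ** 2       # largest squared singular value
    return (fro_sq / spec_sq).item()

# Run a fixed probe batch through the model every 5 steps, grab one layer's
# activations, and record stable_rank(activations) to trace a curve like Figure 2.
```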

Figure 3: Representational dimensionality (measured using stable rank) shows a strong negative correlation with loss (ρ = −0.951) and a moderate negative correlation with gradient magnitude (ρ = −0.701). As loss decreases from 2.0 to near zero, dimensionality expands from 9.0 to 9.6. Counterintuitively, improved performance correlates with expanded rather than compressed representations. Image by author.

This changes how we should think about DNN training, interpretability, and architecture design.

Exploration vs Expansion

Consider the following two scenarios:

Scenario A: Fixed Capacity (Exploration)
Your network starts with a fixed representational capacity. Training explores different regions of this predetermined space. It's like navigating a map that exists from the start. Early training just means "haven't found the good region yet."

Scenario B: Expanding Capacity (Expansion)
Your network starts with minimal capacity. Training builds representational structures. It's like constructing roads while traveling: each road enables new destinations. Early training establishes what becomes learnable later.

Which is it?

The question matters because if capacity expands, then early training isn't recoverable. You can't just "train longer" to fix early mistakes. So interpretability has a timeline: features form in sequence, and understanding this sequence is key. Furthermore, architecture design appears to be about expansion rate, not just final capacity. Finally, critical periods exist: if we miss the window, we miss the potential.

When We Must Measure High-Frequency Checkpoints


Figure 4: High-frequency sampling vs. low-frequency sampling in the experiment described in Figure 1. We detect discrete transitions using z-score analysis with rolling statistics. High-frequency sampling captures rapid transitions that coarse-grained measurement misses. This comparison tests whether temporal resolution affects the observable dynamics.

As seen in Figures 2 and 3, high-frequency sampling reveals interesting information. We can identify three distinct phases:

Phase 1: Collapse (steps 0-300). The network restructures from random initialization. Dimensionality drops sharply as the loss landscape is reshaped around the task. This isn't learning yet; it's preparation for learning.
Phase 2: Expansion (steps 300-5,000). Dimensionality climbs steadily. This is capacity expansion: the network is building representational structures, simple features that enable complex features that enable higher-order features.
Phase 3: Stabilization (steps 5,000-8,000). Growth plateaus and architectural constraints bind. The network refines what it has rather than building new capacity.

These plots reveal expansion, not exploration. The network at step 5,000 can represent functions that were impossible at step 300, because the structures supporting them didn't yet exist.

Capacity Expands, Parameters Don't

Figure 5: Comparison of activation space to weight space. Weight space dimensionality stays nearly constant (9.72-9.79) with only one detected "jump" across 8,000 steps. Image by author.

The comparison between activation and weight spaces shows that the two follow very different dynamics under high-frequency sampling. The activation space shows roughly 85 discrete jumps (including Gaussian noise); the weight space shows only one. Same network, same training run. It confirms that the network at step 8,000 computes functions that were inaccessible at step 500, despite an identical parameter count. This is the clearest evidence for expansion.
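To make the comparison concrete, a sketch of the weight-space counterpart is below, assuming checkpoints were saved as state_dicts every 5 steps and reusing the stable_rank and detect_transitions sketches from earlier; the paths and the parameter key are hypothetical.

```python
import torch

def weight_dim_series(checkpoint_paths, key="hidden.weight"):
    """Stable rank of one weight matrix across all saved checkpoints."""
    series = []
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        series.append(stable_rank(state[key]))    # from the earlier sketch
    return series

# weight_jumps = detect_transitions(weight_dim_series(paths))      # ~1 jump expected
# activation_jumps = detect_transitions(activation_dim_series)     # ~85 jumps expected
```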

Transitions Are Fast and Early

We have seen how high-frequency sampling reveals many more transitions; low-frequency checkpointing would miss nearly all of them. These transitions concentrate early: two thirds of all transitions occur in the first 2,000 steps, just 25% of total training time. This means that if we want to know what features form and when, we need to look during steps 0-2,000, not at convergence. By step 5,000, the story is over.

Expansion Couples to Optimization

If we look again at Figure 3, we see that as loss decreases, dimensionality expands. The network doesn't simplify as it learns; it becomes more complex. Dimensionality correlates strongly with loss (ρ = -0.951) and moderately with gradient magnitude (ρ = -0.701). This might seem counterintuitive: improved performance correlates with expanded rather than compressed representations. We might expect networks to find simpler, more compressed representations as they learn. Instead, they expand into higher-dimensional spaces.

Why?

A possible explanation is that complex tasks require complex representations. The network doesn't find a simpler explanation; it builds the representational structure needed to separate classes, recognize patterns, and generalize.
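The correlations reported in Figure 3 are easy to reproduce once the series are logged. The sketch below assumes aligned per-checkpoint series of loss, gradient norm, and stable rank, and uses Spearman's ρ; whether the original analysis used Spearman or Pearson is not stated here, so treat that as an assumption.

```python
from scipy.stats import spearmanr

def dimensionality_correlations(loss_series, grad_norm_series, dim_series):
    """Spearman correlation of stable rank with loss and with gradient magnitude."""
    rho_loss, _ = spearmanr(loss_series, dim_series)       # expected strongly negative
    rho_grad, _ = spearmanr(grad_norm_series, dim_series)  # expected moderately negative
    return rho_loss, rho_grad
```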

Practical Deployment

What we have seen is a different way to understand and debug DNN training across any domain.

In real deployment scenarios, we can track representational dimensionality in real time, detect when expansion phases occur, and run interpretability analyses at each transition point. This tells us precisely when our network is building new representational structures, and when it's finished. The measurement approach is architecture-agnostic: it works whether you're training CNNs for vision, transformers for language, RL agents for control, or multimodal models for cross-domain tasks.
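A minimal, framework-level sketch of that kind of real-time tracking is below, written in plain PyTorch rather than with the ndtracker package; the class, layer choice, and probe-batch idea are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DimensionalityMonitor:
    """Record the stable rank of one layer's activations on a fixed probe batch."""
    def __init__(self, layer: nn.Module):
        self.history = []                      # list of (step, stable_rank) pairs
        self._latest = None
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Fires on every forward pass; only measure() appends to history.
        self._latest = stable_rank(output)     # stable_rank from the earlier sketch

    @torch.no_grad()
    def measure(self, model, probe_batch, step):
        model(probe_batch)                     # forward pass triggers the hook
        self.history.append((step, self._latest))

# In the training loop:
# if step % 5 == 0:
#     monitor.measure(model, probe_batch, step)
```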

Example 1: Intervention experiments that map causal dependencies. Disrupt training during specific windows and measure which downstream capabilities are lost. If corrupting data during steps 2,000-5,000 permanently damages texture recognition but the same corruption at step 6,000 has no effect, you've found when texture features crystallize and what they depend on. This works identically for object recognition in vision models, syntactic structure in language models, or state discrimination in RL agents.
Example 2: For production deployment, continuous dimensionality monitoring catches representational problems during training, while you can still fix them. If layers stop expanding, you have architectural bottlenecks. If expansion becomes erratic, you have instability. If early layers saturate while late layers fail to expand, you have information flow problems. Standard loss curves won't show these issues until it's too late; dimensionality tracking surfaces them immediately (a rough version of these checks is sketched after these examples).
Example 3: The architecture design implications are equally practical. Measure expansion dynamics during the first 5-10% of training across candidate architectures. Select for clean phase transitions and structured bottom-up development. These networks aren't just more performant; they're fundamentally more interpretable, because features form in a clear sequence rather than a tangled simultaneity.
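As referenced in Example 2, here is a rough sketch of what such health checks can look like, assuming a recent window of per-layer stable-rank readings; the thresholds are placeholders to tune per model, not validated values.

```python
import numpy as np

def check_layer_health(recent_dims, min_growth=1e-3, max_volatility=0.1):
    """Turn a window of dimensionality readings into simple warnings."""
    dims = np.asarray(recent_dims, dtype=float)
    growth = dims[-1] - dims[0]                               # net expansion over the window
    volatility = np.std(np.diff(dims)) / (np.mean(dims) + 1e-8)
    warnings = []
    if growth < min_growth:
        warnings.append("expansion stalled: possible architectural bottleneck")
    if volatility > max_volatility:
        warnings.append("erratic expansion: possible training instability")
    return warnings
```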

What’s Next

So we've established that networks expand their representational space during training, that we can measure these transitions at high resolution, and that this opens new approaches to interpretability and intervention. The natural question: can you apply this to your own work?

I’m releasing the entire measurement infrastructure as open source. I included validated implementations for MLPs, CNNs, ResNets, Transformers, and Vision Transformers, with hooks for custom architectures.

Everything runs with three lines added to your training loop.

The GitHub repository provides templates for the experiments discussed above: feature formation mapping, intervention protocols, cross-architecture transfer prediction, and production monitoring setups. The measurement methodology is validated. What matters now is what you discover when you apply it to your domain.

Try it:

pip install ndtracker

Quickstart, instructions, and examples in the repository: Neural Dimensionality Tracker (NDT)

The code is production-ready. The protocols are documented. The questions are open. I would love to see what you discover when you measure your training dynamics at high resolution, whatever the context and the architecture.

You can share your results, open issues with your findings, or simply ⭐️ the repo if this changes how you think about training. Remember: the interpretability timeline exists across all neural architectures.

Javier Marín | LinkedIn | Twitter


References & Further Reading
