Context
In modern data centers, network slowdowns can appear out of nowhere. A sudden burst of traffic from distributed systems, microservices, or AI training jobs can overwhelm switch buffers in seconds. The challenge isn't just knowing when something has gone wrong. It's being able to see trouble coming before it happens.
Telemetry systems are widely used to monitor network health, but most operate in a reactive mode. They flag congestion only after performance has degraded. Once a link is saturated or a queue is full, you're already past the point of early diagnosis, and tracing the root cause becomes significantly harder.
In-band Network Telemetry, or INT, aims to close that gap by tagging live packets with metadata as they travel through the network. It gives you a real-time view of how traffic flows, where queues are building up, where latency is creeping in, and how each switch is handling forwarding. It's a powerful tool when used carefully. But it comes with a cost: enabling INT on every packet can introduce serious overhead and push a flood of telemetry data to the control plane, much of which you may not even need.
What if we could be more selective? Instead of tracking everything, we forecast where trouble is likely to form and enable INT only for those regions, and only for a short while. This way, we get detailed visibility when it matters most without paying the full cost of always-on monitoring.
The Problem with Always-On Telemetry
INT gives you a rich, detailed view of what's happening inside the network. You can track queue lengths, hop-by-hop latency, and timestamps directly from the packet path. But there's a cost: this telemetry data adds weight to every packet, and if you apply it to all traffic, it can eat up significant bandwidth and processing capacity.
To get around that, many systems take shortcuts:
Sampling: Tag only a fraction (e.g., 1%) of packets with telemetry data.
Event-triggered telemetry: Activate INT only when something bad is already happening, like a queue crossing a threshold.
These techniques help control overhead, but they miss the critical early moments of a traffic surge, the part you most want to understand if you're trying to stop slowdowns.
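To make that tradeoff concrete, here is a minimal sketch of both shortcuts as they might appear in a tagging decision. The should_tag helper, the 1% rate, and the queue threshold are illustrative values, not taken from any particular system.

import random

QUEUE_THRESHOLD = 0.8  # illustrative: react once a queue is 80% full

def should_tag(queue_utilization: float) -> bool:
    # Shortcut 1: sampling. Tags roughly 1% of packets, so a surge
    # lasting only a few hundred packets may leave almost no trace.
    sampled = random.random() < 0.01
    # Shortcut 2: event-triggered. Tags packets only once a queue is
    # already deep, which is after the surge has begun.
    triggered = queue_utilization > QUEUE_THRESHOLD
    return sampled or triggered

Both branches are cheap, but neither captures the build-up phase before the queue fills, which is exactly what a predictive trigger targets.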
Introducing a Predictive Approach
Instead of reacting to symptoms, we designed a system that can forecast congestion before it happens and activate detailed telemetry proactively. The idea is simple: if we can anticipate when and where traffic is going to spike, we can selectively enable INT just for that hotspot and just for the right window of time.
This keeps overhead low but gives you deep visibility when it actually matters.
System Design
We came up with a simple approach that makes network monitoring more intelligent: it predicts when and where monitoring is actually needed. The idea is neither to sample every packet nor to wait for congestion to occur. Instead, we want a system that can catch signs of trouble early and selectively enable high-fidelity monitoring only when it's needed.
So how did we get this done? We built the following four components, each with a distinct task.
Data Collector
We start by collecting network data to track how much traffic is moving through different network ports at any given moment. We use sFlow for data collection because it gathers the essential metrics without affecting network performance. These metrics are captured at regular intervals, giving a near-real-time view of the network at any time.
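As a sketch, the polling step can look roughly like this against sFlow-RT's REST API. Port 8008 is the sFlow-RT default, but treat the exact endpoint and metric name as assumptions of this sketch rather than a verified recipe.

import requests

SFLOW_RT = "http://127.0.0.1:8008"  # default sFlow-RT REST address

def current_traffic() -> float:
    # Ask sFlow-RT for the latest ingress byte-rate metric across agents.
    r = requests.get(f"{SFLOW_RT}/metric/ALL/ifinoctets/json", timeout=5)
    r.raise_for_status()
    # Each entry holds the most recent value for one agent/interface.
    return sum(m.get("metricValue", 0) for m in r.json())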
Forecasting Engine
The forecasting engine is the most important component of our system. It's built using a Long Short-Term Memory (LSTM) model. We went with an LSTM because it learns how patterns evolve over time, which makes it a good fit for network traffic. We're not looking for perfect forecasts here. The important thing is to spot the unusual traffic spikes that typically show up before congestion starts.
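For reference, the model itself can stay small. Here is a minimal Keras sketch; the layer size and the ten-sample window are assumptions for illustration, not tuned values from our runs.

import tensorflow as tf

WINDOW = 10  # assumed: ten 30-second samples, i.e. five minutes of history

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, 1)),  # one traffic rate per step
    tf.keras.layers.LSTM(32),           # captures the temporal pattern
    tf.keras.layers.Dense(1),           # predicted next-interval rate
])
model.compile(optimizer="adam", loss="mse")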
Telemetry Controller
The controller listens to those forecasts and makes decisions. When a predicted spike crosses the alert threshold, the system responds: it sends a command to the switches to switch into a detailed monitoring mode, but only for the flows or ports that matter. It also knows when to back off, turning the extra telemetry off once conditions return to normal.
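In sketch form, that decision logic is a small state machine. The two thresholds and the switch_api calls below are placeholders standing in for our P4Runtime plumbing; using a clear threshold below the trigger threshold is one way to avoid rapid on/off flapping.

class TelemetryController:
    def __init__(self, switch_api, trigger_at: float, clear_at: float):
        self.switch_api = switch_api  # wraps the P4Runtime channel
        self.trigger_at = trigger_at  # forecast level that enables INT
        self.clear_at = clear_at      # lower level that disables it again
        self.int_active = False

    def on_forecast(self, port: int, forecast: float) -> None:
        if forecast > self.trigger_at and not self.int_active:
            self.switch_api.enable_int(port)   # placeholder call
            self.int_active = True
        elif forecast < self.clear_at and self.int_active:
            self.switch_api.disable_int(port)  # placeholder call
            self.int_active = False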
Programmable Data Plane
The final piece is the switch itself. In our setup, we use P4-programmable BMv2 switches that let us adjust packet behavior on the fly. Most of the time, the switch simply forwards traffic without making any changes. But when the controller activates INT, the switch begins embedding telemetry metadata into packets that match specific rules. These rules are pushed by the controller and let us target just the traffic we care about.
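A rule push of this kind can be expressed with the p4runtime-shell Python bindings. The table and action names (int_watchlist, enable_int), the addresses, and the pipeline file paths below are hypothetical stand-ins, not the actual names from our P4 program.

import p4runtime_sh.shell as sh

# Connect to one BMv2 switch (device ID and file paths are examples).
sh.setup(device_id=1, grpc_addr="127.0.0.1:9559", election_id=(0, 1),
         config=sh.FwdPipeConfig("build/p4info.txt", "build/bmv2.json"))

# Match the flow the forecast flagged and enable INT tagging for it.
entry = sh.TableEntry("MyIngress.int_watchlist")(action="MyIngress.enable_int")
entry.match["hdr.ipv4.dstAddr"] = "10.0.2.2"
entry.insert()

sh.teardown()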
This avoids the tradeoff between constant monitoring and blind sampling. Instead, we get detailed visibility exactly when it's needed, without flooding the system with unnecessary data the rest of the time.
Experimental Setup
We built a full simulation of this system using:
- Mininet for emulating a leaf-spine network (a stripped-down topology sketch follows this list)
- BMv2 (P4 software switch) for programmable data plane behavior
- sFlow-RT for real-time traffic stats
- TensorFlow + Keras for the LSTM forecasting model
- Python + gRPC + P4Runtime for the controller logic
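For orientation, here is a minimal version of the leaf-spine emulation. Plain Mininet switches are shown for brevity; the prototype substitutes BMv2 switches via the P4 Mininet bindings, and the 2x2 fabric size is illustrative.

from mininet.net import Mininet
from mininet.topo import Topo

class LeafSpine(Topo):
    def build(self):
        spines = [self.addSwitch(f"s{i}") for i in (1, 2)]
        leaves = [self.addSwitch(f"l{i}") for i in (1, 2)]
        for leaf in leaves:          # full mesh: every leaf to every spine
            for spine in spines:
                self.addLink(leaf, spine)
        for i, leaf in enumerate(leaves, 1):  # two hosts per leaf
            for j in (1, 2):
                self.addLink(self.addHost(f"h{i}{j}"), leaf)

if __name__ == "__main__":
    net = Mininet(topo=LeafSpine())
    net.start()
    net.pingAll()  # sanity check before driving iperf traffic
    net.stop()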
The LSTM was trained on synthetic traffic traces generated in Mininet using iperf. Once trained, the model runs in a loop, making predictions every 30 seconds and storing forecasts for the controller to act on.
Here’s a simplified version of the prediction loop:
import time
from collections import deque

WINDOW_SIZE = 10  # samples needed before we start forecasting
sliding_window = deque(maxlen=WINDOW_SIZE)

while True:
    latest_sample = data_collector.current_traffic()
    sliding_window.append(latest_sample)
    if len(sliding_window) == WINDOW_SIZE:
        forecast = forecast_engine.predict_upcoming_traffic(list(sliding_window))
        if forecast > alert_threshold:
            telem_controller.trigger_INT()
    time.sleep(30)
The switches respond immediately, changing telemetry modes for the specific flows.
Why LSTM?
We went with an LSTM model because network traffic tends to have structure. It's not entirely random. There are patterns tied to time of day, background load, or batch processing jobs, and LSTMs are particularly good at picking up on those temporal relationships. Unlike simpler models that treat each data point independently, an LSTM can remember what came before and use that memory to make better short-term predictions. For our use case, that means spotting early signs of an upcoming surge just from how the past few minutes behaved. We didn't need it to forecast exact numbers, just to flag when something abnormal might be coming. The LSTM gave us just enough accuracy to trigger proactive telemetry without overfitting to noise.
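Concretely, "remembering what came before" means training on sliding windows of the traffic series. Here is a sketch of that preparation, reusing the model sketched earlier; the window length, epoch count, and the trace variable are assumptions for illustration.

import numpy as np

def make_windows(series, window=10):
    # Turn a 1-D traffic trace into (past window -> next value) pairs.
    X = [series[i:i + window] for i in range(len(series) - window)]
    y = [series[i + window] for i in range(len(series) - window)]
    return np.array(X)[..., np.newaxis], np.array(y)

# 'trace' stands for the per-30-second rates collected from the iperf runs.
X_train, y_train = make_windows(trace)
model.fit(X_train, y_train, epochs=20, validation_split=0.2)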
Evaluation
We didn't run large-scale performance benchmarks, but based on our prototype and its behavior under test conditions, we can outline the practical benefits of this design approach.
Lead Time Advantage
One of the main advantages of a predictive system like this is its ability to catch trouble early. Reactive telemetry solutions typically wait until a queue threshold is crossed or performance degrades, which means you're already behind the curve. In contrast, our design anticipates congestion based on traffic trends and activates detailed monitoring in advance, giving operators a clearer picture of what led to the problem, not just the symptoms once they appear.
Monitoring Efficiency
A key goal in this project was to keep overhead low without compromising visibility. Instead of applying full INT across all traffic or relying on coarse-grained sampling, our system selectively enables high-fidelity telemetry for short bursts, and only where forecasts indicate potential problems. While we haven't quantified the exact cost savings, the design naturally limits overhead by keeping INT focused and short-lived, something that static sampling or reactive triggering can't match.
Conceptual Comparison of Telemetry Strategies
While we didn't record overhead metrics, the intent of the design was to find a middle ground: deeper visibility than sampling or reactive systems, but at a fraction of the cost of always-on telemetry. Here's how the approaches compare at a high level:
- Always-on INT: full per-packet visibility, but constant bandwidth and processing overhead.
- Sampling: low overhead, but coarse; short surges can fall between samples.
- Event-triggered INT: low overhead until a threshold trips, but detail arrives only after symptoms appear.
- Predictive INT (our approach): low overhead overall, with detailed visibility switched on just before and during forecast hotspots.
Conclusion
We wanted to figure out a better way to monitor network traffic. By combining machine learning and programmable switches, we built a system that predicts congestion before it happens and activates detailed telemetry at just the right place and time.
Predicting instead of reacting seems like a minor change, but it opens up a new level of observability. As telemetry becomes increasingly important in AI-scale data centers and low-latency services, this kind of intelligent monitoring will become a baseline expectation, not just a nice-to-have.