From Connections to Meaning: Why Heterogeneous Graph Transformers (HGT) Change Demand Forecasting


Forecasting errors are usually not caused by bad time-series models.

They’re caused by ignoring structure.

SKUs don’t behave independently. They interact through shared plants, product groups, warehouses, and storage locations. A demand shock to one SKU often propagates to others — yet most forecasting systems model each SKU in isolation.

In the previous article, we showed that explicitly modeling these connections matters. Using a real FMCG supply-chain graph, a simple Graph Neural Network (GraphSAGE) reduced SKU-level forecast error by over 27% compared with a strong naïve baseline, purely by allowing information to flow across related SKUs.

But GraphSAGE makes a simplifying assumption: all relationships are equal.

A shared plant is treated the same as a shared product group. Substitutes and complements are averaged into a single signal. This limits the model’s ability to anticipate real demand shifts.

This article explores what happens when the model is allowed not only to see the supply-chain network, but to understand the meaning of every relationship inside it.

We show how Heterogeneous Graph Transformers (HGT) introduce relationship-aware learning into demand forecasting, and why that seemingly small change produces more anticipatory forecasts, tighter error distributions, and materially better outcomes — even on intermittent, daily per-SKU demand — turning connected forecasts into meaning-aware, operationally grounded predictions.

A brief recap: What GraphSAGE told us

In the previous article, we trained a spatio-temporal GraphSAGE model on a real FMCG supply-chain graph with:

  • 40 SKUs
  • 9 plants
  • 21 product groups
  • 36 subgroups
  • 13 storage locations

Each SKU was connected to others through shared plants, groups, and locations — creating a dense web of operational dependencies. The temporal characteristics showed lumpy production and intermittent demand, a typical scenario in FMCG.

GraphSAGE allowed each SKU to aggregate information from its neighbors. That produced a significant jump in forecast quality.

Model | WAPE (SKU-daily)
Naïve baseline | 0.86
GraphSAGE | ~0.62

At the hardest possible level — daily, per-SKU, intermittent demand — a WAPE of ~0.62 is already almost production-grade in FMCG.
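For reference, WAPE (weighted absolute percentage error) is simply total absolute error divided by total actual volume. A minimal sketch, using a hypothetical intermittent daily series and a naïve one-day-lag forecast:

```python
import numpy as np

def wape(actual, forecast):
    """Weighted Absolute Percentage Error: total absolute error
    divided by total actual volume."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Toy intermittent daily demand for one SKU: zeros are common in FMCG.
actual = np.array([0, 0, 12, 0, 30, 5, 0])
naive = np.array([0, 0, 0, 12, 0, 30, 5])  # yesterday's value as forecast
print(round(wape(actual, naive), 2))  # → 1.79
```

Note how intermittent demand punishes a naïve lag forecast: every spike is missed twice, once when it arrives and once when it is echoed a day late.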

But the error plots showed something important:

  • The model followed trends well
  • It handled zeros well
  • But it smoothed away extreme spikes
  • And it reacted instead of anticipating

That’s because GraphSAGE assumes that all relationships are equal. Giving all relations equal weight means the model cannot learn that:

  • A demand spike in a complementary SKU in the same plant should increase my forecast
  • But a spike in a substitute SKU in the same product group should reduce it

Let’s see how the Heterogeneous Graph Transformer (HGT) addresses this challenge.

What HGT adds: Relationship-aware learning

Heterogeneous Graph Transformers are built for graphs where:

  • There are multiple types of nodes (SKUs, plants, warehouses, groups) and/or
  • There are multiple types of edges (shared plants, product groups, etc.)

In this case, while all nodes in the graph are SKUs, the relationships between them are heterogeneous. Here, HGT is not used to model multiple entity types, but to learn relation-aware message passing.

The model learns separate transformation and attention mechanisms for each type of SKU–SKU relationship, allowing demand signals to propagate differently depending on how two SKUs are connected.

It learns:

“How should information flow across each style of relationship?”

Formally, instead of one aggregation function, HGT learns:

\[
h_i = \sum_{r \,\in\, \{\text{plant},\ \text{group},\ \text{subgroup},\ \text{storage}\}} \;\; \sum_{j \,\in\, N_r(i)} \alpha_{r,i,j}\, W_r\, h_j
\]

where

  • $r$ represents the type of operational relationship between SKUs (shared plant, product group, etc.)
  • $W_r$ allows the model to treat each relationship differently
  • $\alpha_{r,i,j}$ lets the model focus on the most influential neighbors
  • The set $N_r(i)$ contains all SKUs that are directly connected to SKU $i$ through a shared relationship $r$.

This lets the model learn, for instance:

  • Plant edges propagate capacity and production signals
  • Product-group edges propagate substitution and demand transfer
  • Warehouse edges propagate inventory buffering

The graph becomes economically meaningful, not just topologically connected.
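The relation-aware sum above can be sketched in plain numpy. Everything here — the toy graph, dimensions, and random parameters — is an illustrative assumption; a production model would learn $W_r$ and the attention parameters end-to-end in a graph library such as PyTorch Geometric:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size
relations = ["plant", "group", "subgroup", "storage"]

# Toy SKU embeddings: node 0 is the target SKU, nodes 1-4 are neighbors.
h = rng.normal(size=(5, d))
# Hypothetical neighborhood N_r(0) for each relation type.
neighbors = {"plant": [1, 2], "group": [3], "subgroup": [4], "storage": [2, 4]}

# One transformation matrix W_r and one attention vector per relation --
# the key difference from GraphSAGE's single shared aggregator.
W = {r: rng.normal(size=(d, d)) / np.sqrt(d) for r in relations}
a = {r: rng.normal(size=d) for r in relations}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h_new = np.zeros(d)
for r in relations:
    msgs = h[neighbors[r]] @ W[r].T   # relation-specific transform: W_r h_j
    scores = msgs @ a[r]              # unnormalised attention logits
    alpha = softmax(scores)           # alpha_{r,0,j} over neighbors in N_r(0)
    h_new += alpha @ msgs             # attention-weighted sum, summed over r

print(h_new.shape)  # (8,)
```

Because each relation gets its own $W_r$ and attention, a plant edge and a subgroup edge can push the target embedding in opposite directions — exactly the substitute-vs-complement behavior GraphSAGE cannot express.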

Implementation (high-level)

Just as with the GraphSAGE model, we use:

  • The same SupplyGraph dataset, temporal features, normalization, and 14-day sliding window.
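As a minimal sketch of that 14-day sliding window (the series values here are placeholders for one SKU’s daily sales):

```python
import numpy as np

def sliding_windows(series, window=14):
    """Turn a 1-D daily series into (num_samples, window) model inputs
    and next-day targets."""
    series = np.asarray(series, dtype=float)
    X = np.lib.stride_tricks.sliding_window_view(series, window)[:-1]
    y = series[window:]  # target: the day after each window
    return X, y

sales = np.arange(20, dtype=float)  # stand-in daily series
X, y = sliding_windows(sales)
print(X.shape, y.shape)  # (6, 14) (6,)
```

Each row of `X` holds 14 consecutive days, and `y` holds the day that follows, which is what the output head is trained to predict.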

The difference is in the spatial encoder. The following is an overview of the architecture.

  1. Heterogeneous Graph Encoder
    • Nodes: SKUs
    • Edges: shared plant, shared group, shared sub-group and shared storage
    • HGT layers learn relation-specific message passing
  2. Temporal Encoder
    • A time-series encoder processes the last 14 days of embeddings
    • This captures how the graph evolves over time
  3. Output Head
    • A regressor predicts next-day sales per SKU

Everything else — training, loss, evaluation — stays identical to GraphSAGE. So any difference in performance comes purely from better structural understanding.
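The three stages can be sanity-checked as a shape-level sketch, with random tensors standing in for the learned spatial and temporal encoders (all sizes here are assumptions matching the dataset above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_skus, d, window = 40, 16, 14

# 1. Spatial encoder (stand-in): one embedding per SKU per day. In the
#    real model these come from relation-aware HGT message passing.
daily_embeddings = rng.normal(size=(window, n_skus, d))

# 2. Temporal encoder (stand-in): collapse the 14 daily embeddings into
#    one vector per SKU. The real model uses a learned sequence encoder.
temporal = daily_embeddings.mean(axis=0)        # (n_skus, d)

# 3. Output head: a linear regressor to next-day sales per SKU.
w_out, b_out = rng.normal(size=d), 0.0
next_day_forecast = temporal @ w_out + b_out    # (n_skus,)
print(next_day_forecast.shape)  # (40,)
```

The point of the sketch is only the data flow: (days × SKUs × features) in, one next-day number per SKU out.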

The housing market analogy — now with meaning

In the previous article, we used a simple housing-market analogy to explain why graph-based forecasting works.

Let’s upgrade it.

GraphSAGE: structure without meaning

GraphSAGE is like predicting the value of your home by combining:

  • The historical price of your house
  • The average price movement of nearby houses

This already improves over treating your home in isolation. But GraphSAGE makes a critical simplifying assumption: every neighbor is the same kind of neighbor.

In practice, this means GraphSAGE treats all nearby entities as similar signals. A luxury villa, a school, a shopping center, a highway, or a factory are all just “neighbors” whose price signals get averaged together.

The model learns that houses are connected — but not how they’re connected.

HGT: structure with meaning

Now imagine a more realistic housing model.

Every data point is still a house — there are no different node types.
But houses are connected through different kinds of relationships:

  • Some share the same school district
  • Some share the same builder or construction quality
  • Some are near parks
  • Others are near highways or industrial zones

Each of these relationships affects prices differently.

  • Schools and parks tend to increase value
  • Highways and factories often reduce it
  • Luxury houses matter more than neglected ones

A Heterogeneous Graph Transformer (HGT) learns these distinctions explicitly. Instead of averaging all neighbor signals, HGT learns:

  • which type of relationship a neighbor represents, and
  • how strongly that relationship should influence the prediction.

That distinction is what turns a connected demand forecast into a meaning-aware, operationally grounded prediction.

Comparison of Results

Here is a comparison of WAPE for HGT, GraphSAGE, and the naïve baseline:

Model | WAPE
Naïve baseline | 0.86
GraphSAGE | 0.62
HGT | 0.58

At a daily per-SKU WAPE below 0.60, the Heterogeneous Graph Transformer (HGT) delivers a clear, production-grade step-change over both traditional forecasting and GraphSAGE. The results represent a ~32% reduction in misallocated demand vs. traditional forecasting and a further 6–7% improvement over GraphSAGE.

The following scatter chart shows actual vs. predicted sales for both GraphSAGE (purple dots) and HGT (cyan dots). While both models perform well, the purple GraphSAGE dots show greater dispersion compared with the tight clustering of the cyan HGT ones, consistent with the ~6% improvement in WAPE.

Actual vs predicted (GraphSAGE vs HGT)

At the scale of this dataset (≈ 1.1 million units), that improvement translates into ~45,000 fewer units misallocated over the evaluation period.
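That figure follows from a back-of-envelope calculation — the WAPE improvement times the total evaluated volume:

```python
# Back-of-envelope: WAPE improvement x total volume ~= units no longer
# misallocated. Volume is the approximate figure quoted above.
total_units = 1_100_000
wape_sage, wape_hgt = 0.62, 0.58
saved = (wape_sage - wape_hgt) * total_units
print(int(saved))  # → 44000, i.e. roughly the ~45,000 quoted above
```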

Operationally, reducing misallocation by this magnitude results in:

  • Fewer emergency production changes
  • Lower expediting and premium freight costs
  • More stable plant and warehouse operations
  • Higher service levels on high-volume SKUs
  • Less inventory trapped in the wrong locations

Importantly, these improvements come without adding business rules, planner overrides, or manual tuning.

And the bias comparison is as follows:

Model | Mean Forecast (Units) | Bias (Units) | Bias %
Naïve | ~701 | 0 | 0%
GraphSAGE | ~733 | +31 | ~4.5%
HGT | ~710 | +8.4 | ~1.2%

HGT introduces a very small positive bias — roughly 1–2%.

That is well within production-safe limits and aligns with how FMCG planners operate in practice, where a slight upward bias is often preferred to avoid stock-outs.
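Mean forecast bias here is just the average signed error, expressed both in units and as a percentage of mean actual demand. A minimal sketch, with hypothetical numbers chosen to mirror the scale of the table above:

```python
import numpy as np

def forecast_bias(actual, forecast):
    """Mean signed error (units) and bias as a % of mean actual demand.
    Positive = over-forecasting, negative = under-forecasting."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = np.mean(forecast - actual)
    return err, 100 * err / np.mean(actual)

# Hypothetical daily demand (~701 units on average) and a forecast
# that consistently sits slightly above it.
actual   = [700, 720, 680, 705]
forecast = [710, 725, 690, 715]
units, pct = forecast_bias(actual, forecast)
print(units, round(pct, 2))  # a small positive bias, ~1.25%
```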

Prediction error

The real difference between GraphSAGE and HGT becomes clear when we compare the forecasts for the top-4 SKUs by volume. Here is the GraphSAGE chart:

Forecast v Actual – Top 4 SKUs (GraphSAGE)

And the same for HGT:

Forecast v Actual – Top 4 SKUs (HGT)

The distinction is clear from the area highlighted in the first chart and across all the other SKUs:

  • HGT is not reactive like GraphSAGE. It delivers a stronger forecast, anticipating and tracking the peaks and troughs of actual demand rather than smoothing out the fluctuations.
  • This is a result of learning the structural relations between neighboring SKUs differentially, which lets the model predict changes in demand confidently before they have fully materialized.

And finally, the performance across SKUs with non-zero volumes clearly shows that all high-volume SKUs have a WAPE < 0.60, which is desirable for a production forecast and an improvement over GraphSAGE.

Performance across SKUs

Explainability

HGT makes it practical to add explainability to the forecasts — essential for planners to trust the causal drivers behind them. When the model predicts a dip, and we can show it’s because “Neighbor X in the same subgroup is trending down,” planners can validate the signal against real-world logistics, turning an AI prediction into actionable business insight.

Let’s look at the influence of different spatial and temporal features during the forecast for the first 7 days and last 7 days of the period, for the highest-volume SKU. Here is the comparison of the temporal features:

Evolution of temporal features

And the spatial features:

Evolution of spatial features

The charts show that different features and SKU edges play a role during different time periods:

  • For the first 7 days, Sales Lag (7d) has the greatest influence (23%), which shifts to Rolling Mean (21%) for the last 7 days.
  • Similarly, during the initial 7 days there is heavy reliance on SOS005L04P, likely a primary storage node or precursor SKU that dictates immediate availability. By the end of the test period, the influence redistributes: SOS005L04P shares the stage with SOS002L09P (~40% share each), both from the same subgroup as our target SKU. This implies the model is now aggregating signals from a broader subgroup of related products to form a more holistic view.

This kind of analysis is crucial for understanding and forecasting the impact of marketing campaigns, promotions, or external factors such as interest rates on specific SKUs. These should be included in the spatial structure as additional nodes in the graph, with the relevant SKUs linked to them.

Not All Supply Chains Are Created Equal

The use case here is a relatively simple one, with only SKUs as nodes. That’s because in FMCG, plants and warehouses act largely as buffers — they smooth volatility but rarely hard-stop the system. That is why HGT could learn much of their effect purely from edge types such as shared plant or shared storage, without modeling them as explicit nodes. Supply chains can be far more complex. Automotive supply chains, for instance, are very different. A paint shop, engine line, or regional distribution center is a hard capacity bottleneck: when it’s constrained, output of specific trims or colours collapses regardless of market demand. In that setting, HGT still benefits from typed relationships, but it also requires explicit Plant and Warehouse nodes with their own time-series signals (capacity, output, backlogs, delays) to model how supply-side physics interact with customer demand. In other words, FMCG needs structure-aware graphs; automotive needs causality-aware graphs.

Other factors that are common across industries include promotions, marketing spend, seasonality, and external factors such as economic conditions (e.g., fuel prices) or competitor launches in a segment. These also affect SKUs in different ways. For example, a fuel price increase or a new regulation may dampen sales of ICE vehicles and increase sales of electric ones. Such factors should be included in the graph as nodes, with their relations to the SKUs included in the spatial model, and their temporal features need to incorporate the historical data from when the events occurred. This enables HGT to learn the effects of those factors on demand in the weeks and months following the event.

Key Takeaways

  • Supply-chain demand is not just connected — it’s heterogeneous. Treating all SKU relationships as equal doesn’t harness the full predictive potential.
  • GraphSAGE proves that networks matter: simply allowing SKUs to exchange information across shared plants, groups, and locations delivers a significant accuracy jump over classical forecasting.
  • Heterogeneous Graph Transformers go one step further by learning how SKUs are connected. A shared plant, a shared subgroup, and a shared warehouse don’t propagate demand in the same way — and HGT learns that distinction directly from data.
  • That structural awareness translates into real outcomes: lower WAPE, tighter forecast dispersion, better peak anticipation, and materially fewer misallocated units — without business rules, manual tuning, or planner overrides.
  • Explainability becomes operational, not cosmetic. Relation-aware attention allows planners to trace forecasts back to economically meaningful drivers, turning predictions into trusted decisions.
  • The broader lesson: as supply chains grow more interdependent, forecasting models must evolve from modeling connections to modeling meaning. In FMCG this means structure-aware graphs; in more constrained industries like automotive, it means causality-aware graphs with explicit bottlenecks.

In brief: when the model understands the meaning of connections, forecasting stops being reactive — and starts becoming anticipatory.

What’s next? From Concepts to Code

Across this article and the previous one, we moved step-by-step through the evolution of demand forecasting — from isolated time-series models, to GraphSAGE, and finally to Heterogeneous Graph Transformers — showing how each shift progressively improves forecast quality by better reflecting how real supply chains operate.

The next logical step is to move from concepts to code.

In the next article, we’ll translate these ideas into an end-to-end, implementable workflow. Using focused code examples, we’ll walk through how to:

  • Construct the supply-chain graph and define relationship types
  • Engineer temporal features for intermittent, SKU-level demand
  • Design and train GraphSAGE and HGT models
  • Evaluate performance using production-grade metrics
  • Visualize forecasts, errors, and relation-aware attention
  • Add explainability so planners can understand why a forecast changed

The goal is not just to show that these techniques work, but how to build a production-ready, interpretable graph-based forecasting system that practitioners can adapt to their own supply chains.

If this article explained why meaning-aware forecasting works, the next one will show exactly how to make it work in code.

Reference

SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks. Azmine Toushik Wasi, MD Shafikul Islam, Adipto Raihan Akib.

Images used in this article were generated using Google Gemini. Charts and underlying code were created by me.
