In my last article [1], I threw out a number of ideas centered on constructing structured graphs, mainly focused on descriptive or unsupervised exploration of information through graph structures. However, once we use graph features to enhance our models, the temporal nature of the data must be taken into account. If we wish to avoid undesired effects, we must be careful not to leak future information into our training process. This means our graph (and the features derived from it) should be constructed in a time-aware, incremental way.
Data leakage is such a paradoxical problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that point, it had affected 294 research papers across 17 scientific fields. They classify the forms of data leakage, ranging from textbook errors to open research problems.
The problem is that in prototyping, results often seem very promising when they really are not. More often than not, people don't realize this until models are deployed in production, wasting the time and resources of a whole team. Then performance falls short of expectations and nobody understands why. This issue can become the Achilles' heel that undermines business AI initiatives.
…
ML-based leakage
Data leakage occurs when the training data contains information about the target that will not be available during inference. This causes overly optimistic evaluation metrics during development, creating misleading expectations. However, once the model is deployed in a real-time system with the correct data flow, its predictions become unreliable, because it learned from information that is no longer accessible.
Ethically, we must strive to report results that truly reflect the capabilities of our models, rather than sensational or misleading findings. When a model moves from prototyping to production, it should generalize properly; if it does not, its practical value is undermined, and it can exhibit significant problems during inference or deployment.
This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data (with far fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced: the model might overfit to leaked data related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can result in missed fraud detections, with serious practical consequences.
Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:
Textbook Errors:
- Imputing missing values using the complete dataset instead of only the training set, causing information about the test data to leak into training (see the sketch after this list).
- Duplicated or very similar instances appearing in both the training and test sets, such as images of the same object taken from slightly different angles.
- Lack of clear separation between training and test datasets, or no test set at all, resulting in models accessing test information before evaluation.
- Using proxies of outcome variables that indirectly reveal the target variable.
- Random data splitting in scenarios where multiple related records belong to a single entity, such as multiple claim status events from the same customer.
- Synthetic data augmentation performed over the entire dataset, instead of only on the training set.
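As a minimal sketch of the first item above (scikit-learn, with toy data that is purely illustrative): fitting the imputer on the full dataset leaks test-set statistics into training, while fitting it on the training split alone does not.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing values (illustrative only).
X = np.array([[1.0], [2.0], [np.nan], [100.0], [np.nan], [3.0]])
y = np.array([0, 0, 1, 1, 0, 1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Leaky: the imputation statistic (here, the mean) is computed on the FULL
# dataset, so the training features depend on test rows.
leaky_mean = np.nanmean(X)
X_train_leaky = np.where(np.isnan(X_train), leaky_mean, X_train)

# Leak-free: the imputer is fit on the training split only,
# and merely applied to the test split.
imputer = SimpleImputer(strategy="mean")
X_train_clean = imputer.fit_transform(X_train)
X_test_clean = imputer.transform(X_test)
```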
Open problems for research:
- Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is difficult because timestamps can be noisy or incomplete.
- Updating database records without lineage or an audit trail (for instance, changing a fraud status without storing history) can cause models to train on future or altered data unintentionally.
- Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.
These cases are part of a broader taxonomy reported in machine learning research, which highlights data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden when working with many features if each one is not individually checked.
Now, let's consider what happens when we include nodes and edges in the equation…
…
Graph-based leakage
In the case of graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph's structure. For instance:
- Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
- Overwriting or updating the graph structure without preserving past events means the model loses valuable context needed for accurate temporal analysis; it may also access information at the wrong time, or lose traceability of possible leakage or problems in the data that originated the graph.
- Computing graph aggregations like degree, triangles, or PageRank on the complete graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because the features include information from future edges that would not be available at prediction time.
Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This leads to edges or training features that incorporate data from time steps that should be unknown.
…
How can this be fixed?
We can construct a single graph that captures the complete history by assigning timestamps or time intervals to edges. To analyze the graph up to a particular point in time (t), we "look back in time" by filtering the graph to include only the events that occurred at or before that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Moreover, it offers the flexibility to define different time windows for safe and accurate temporal analysis.
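As a minimal sketch (assuming a NetworkX graph whose edges carry a timestamp attribute, as described in the next section), "looking back in time" is just a filter on the edge timestamps:

```python
import networkx as nx
import pandas as pd

def snapshot_before(G: nx.DiGraph, cutoff: pd.Timestamp) -> nx.DiGraph:
    """Return the subgraph containing only edges (and their endpoints)
    whose 'timestamp' attribute is at or before the cutoff."""
    H = nx.DiGraph()
    for u, v, attrs in G.edges(data=True):
        if attrs.get("timestamp") is not None and attrs["timestamp"] <= cutoff:
            H.add_node(u, **G.nodes[u])
            H.add_node(v, **G.nodes[v])
            H.add_edge(u, v, **attrs)
    return H

# Any feature (degree, triangles, PageRank, ...) computed on the snapshot
# only "sees" the past relative to the cutoff, avoiding temporal leakage.
# G_t = snapshot_before(G, pd.Timestamp("2024-06-30"))
```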
In this article, we construct a temporal graph of insurance claims where nodes represent individual claims, and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop) to ensure the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).
The concept is straightforward: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (Figure 1). As explained in the previous section, the way we model the data is crucial, not only to capture what we are truly looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).
In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios it is also useful to consider a time interval spanning the entity's first and last appearances (for instance, generated from another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
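A small sketch of the two options, with hypothetical attribute names (first_seen, last_seen): option A records only the first appearance, while option B maintains an activity interval that is updated whenever the entity reappears.

```python
import networkx as nx
import pandas as pd

G = nx.DiGraph()

# Option A (used in this article): store only the first time an entity is seen.
G.add_node("insurer_phone_number:555-0199",
           node_type="insurer_phone_number",
           first_seen=pd.Timestamp("2024-01-10"))

# Option B: store an activity interval [first_seen, last_seen],
# updated each time the entity appears again.
node, seen_at = "repair_shop:SHOP_856", pd.Timestamp("2024-03-02")
if node not in G:
    G.add_node(node, node_type="repair_shop", first_seen=seen_at, last_seen=seen_at)
else:
    G.nodes[node]["last_seen"] = max(G.nodes[node]["last_seen"], seen_at)
```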
Code
The code is available in this repository: Link to the repository
To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx). It is strongly recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.
Code Pipeline
The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

(Figure: data for training and inference. Image by Author.)
Step 1: Simulated Fraud Dataset
We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:
- Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
- Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
- Target: fraud (a binary variable indicating whether the claim is fraudulent or not)
These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can construct powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.
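A hedged sketch of how such a dataset could be simulated (the column names follow the list above; the generation logic and values are illustrative, not the repository's exact script):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

claims = pd.DataFrame({
    "claim_id": np.arange(20_000_000, 20_000_000 + n),
    "claim_date": pd.Timestamp("2024-01-01")
                  + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "type_of_claim": rng.choice(["collision", "theft", "glass"], n),
    "insurer_id": rng.integers(1, 400, n),
    # A few entity columns (the remaining ones are simulated analogously).
    "insurer_phone_number": [f"555-{rng.integers(0, 10_000):04d}" for _ in range(n)],
    "insurer_license_plate": [f"PLT-{rng.integers(0, 5_000):04d}" for _ in range(n)],
    "repair_shop": [f"SHOP_{rng.integers(0, 900):03d}" for _ in range(n)],
    "bank_account": [f"ACC-{rng.integers(0, 8_000):05d}" for _ in range(n)],
    # Imbalanced binary target (~5% fraud).
    "fraud": rng.binomial(1, 0.05, n),
})
claims = claims.sort_values("claim_date").reset_index(drop=True)
```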


Step 2: Graph Modeling
We use the NetworkX library to construct our graph model. For small-scale examples, NetworkX is sufficient and effective; for more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.
So, we have:
- one node per claim, with the node key equal to the claim_id and the attributes node_type and claim_date
- one node per entity value (phone, plate, bank account, shop, etc.), with node key "{column_name}:{value}" and the attributes node_type (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label (just the raw value without the prefix)
The graph includes these two types of edges:
- claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
- entity_value → claim_id: a direct link to the shared entity (with edge_type='entity-claim')
These edges are annotated with:
- edge_type: to distinguish the relation (claim→claim vs entity→claim)
- entity_type: the column from which the value comes (like bank_account)
- shared_value: the actual shared value (like a phone number or license plate)
- timestamp: when the edge was added (based on the current claim's date)
To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky primarily because it is related to repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud "bursts." Moreover, this claim shares a license plate and address with several other claims, creating dense connections to other suspicious cases.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.
Step 3: Graph Preparation & Training
Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.
We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE does representation learning on the claim graph: it aggregates information from a node's neighbours (shared phones, shops, plates, etc.) and mixes it with the node's own attributes to produce a node embedding. Those embeddings are then fed to a small classifier head to predict fraud.
3.1. Temporal slicing
From the single full graph we created in Step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
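A sketch of such a slice, assuming the node and edge attributes defined in Step 2 (the cutoff dates are illustrative):

```python
import networkx as nx
import pandas as pd

def temporal_slice(G, cutoff):
    """Keep claim nodes with claim_date <= cutoff, all entity nodes,
    and only edges whose timestamp <= cutoff."""
    H = nx.DiGraph()
    for n, attrs in G.nodes(data=True):
        if attrs.get("node_type") != "claim" or attrs["claim_date"] <= cutoff:
            H.add_node(n, **attrs)
    for u, v, attrs in G.edges(data=True):
        if attrs["timestamp"] <= cutoff and u in H and v in H:
            H.add_edge(u, v, **attrs)
    return H

# Example cutoffs: earlier history for training, later ones for validation/test.
G_train = temporal_slice(G, pd.Timestamp("2024-08-31"))
G_val = temporal_slice(G, pd.Timestamp("2024-10-31"))
G_test = temporal_slice(G, pd.Timestamp("2024-12-31"))
```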
3.2 Node indexing
Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We'll use these indices to align features, labels, and edges in tensors.
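As a sketch, the mapping itself is only a couple of lines (here for the training slice):

```python
# Contiguous integer index 0..N-1 for every node in the sliced graph.
nodes = list(G_train.nodes())
node_to_idx = {node: i for i, node in enumerate(nodes)}
```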
3.3 Construct node features
Create one feature row per node:
- Type one-hot (claim, phone, email, …).
- Degree stats: normalized in-degree, out-degree, and undirected degree, all computed within the sliced graph.
- Prior fraud from older neighbors (claims only): the fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim's time.
We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so that loss/metrics are computed only on claims.
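A sketch of the resulting feature matrix, labels, and mask (the feature layout and attribute names follow the earlier sketches and are assumptions, not the repository's exact code):

```python
import numpy as np

NODE_TYPES = ["claim", "insurer_phone_number", "insurer_license_plate",
              "repair_shop", "bank_account"]

def build_features(H, nodes, node_to_idx):
    N = len(nodes)
    x = np.zeros((N, len(NODE_TYPES) + 4), dtype=np.float32)
    y = np.zeros(N, dtype=np.int64)
    claim_mask = np.zeros(N, dtype=bool)
    max_deg = max(dict(H.degree()).values()) or 1  # avoid division by zero

    for node in nodes:
        i, attrs = node_to_idx[node], H.nodes[node]
        x[i, NODE_TYPES.index(attrs["node_type"])] = 1.0  # type one-hot
        x[i, -4] = H.in_degree(node) / max_deg            # degree stats, computed
        x[i, -3] = H.out_degree(node) / max_deg           # within the sliced graph
        x[i, -2] = H.degree(node) / max_deg
        if attrs["node_type"] == "claim":
            claim_mask[i] = True
            y[i] = int(attrs["fraud"])
            # Fraction of older claim -> claim predecessors labeled fraud.
            older = [p for p in H.predecessors(node)
                     if H.nodes[p].get("node_type") == "claim"]
            if older:
                x[i, -1] = np.mean([H.nodes[p]["fraud"] for p in older])
    return x, y, claim_mask

x, y, claim_mask = build_features(G_train, nodes, node_to_idx)
```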
3.4 Construct PyG Data
Translate edges (u→v) into a 2×E integer tensor edge_index using the node indices, and add self-loops so each node also retains its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
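A minimal sketch of the conversion, assuming x, y, claim_mask, and node_to_idx from the previous steps:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import add_self_loops

# Directed edges u -> v as a 2 x E tensor of integer node indices.
edge_index = torch.tensor(
    [[node_to_idx[u] for u, v in G_train.edges()],
     [node_to_idx[v] for u, v in G_train.edges()]],
    dtype=torch.long,
)
# Self-loops let each node keep its own features at every message-passing layer.
edge_index, _ = add_self_loops(edge_index, num_nodes=len(node_to_idx))

data = Data(
    x=torch.from_numpy(x),
    edge_index=edge_index,
    y=torch.from_numpy(y),
    claim_mask=torch.from_numpy(claim_mask),
)
```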
3.5 GraphSAGE
We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer: two GraphSAGE convolution layers (mean aggregation), ReLU, dropout, then a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and select the decision threshold that maximizes F1; we keep the best model by validation F1 (early stopping).
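A hedged sketch of such a model and training loop (layer sizes, learning rate, and epoch count are illustrative; the validation loop, F1-based threshold search, and early stopping are omitted for brevity):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class FraudSAGE(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=64, num_classes=2, dropout=0.3):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim, aggr="mean")
        self.conv2 = SAGEConv(hidden_dim, hidden_dim, aggr="mean")
        self.dropout = dropout
        self.head = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=self.dropout, training=self.training)
        h = F.relu(self.conv2(h, edge_index))
        return self.head(h), h  # logits and node embeddings

model = FraudSAGE(in_dim=data.num_node_features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Class weights counteract the fraud / non-fraud imbalance; the loss is
# restricted to claim nodes via claim_mask.
counts = torch.bincount(data.y[data.claim_mask], minlength=2).float()
class_weights = counts.sum() / (2 * counts.clamp(min=1))

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    logits, _ = model(data.x, data.edge_index)
    loss = F.cross_entropy(logits[data.claim_mask], data.y[data.claim_mask],
                           weight=class_weights)
    loss.backward()
    optimizer.step()
```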

3.6 Inference results
Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix. Produce a lift table/plot (showing how concentrated fraud is by score decile), and export a t-SNE plot of claim embeddings to visualize structure.
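As a sketch, the lift table can be computed from the predicted fraud probabilities and the true labels of test claims (function and variable names are illustrative):

```python
import pandas as pd

def lift_table(y_true, y_score, n_bins=10):
    """Lift per score decile; decile 1 contains the highest-scored claims."""
    df = pd.DataFrame({"y": y_true, "score": y_score})
    df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                           n_bins, labels=range(1, n_bins + 1))
    overall_rate = df["y"].mean()
    out = df.groupby("decile", observed=True)["y"].agg(["mean", "sum", "count"])
    out["lift"] = out["mean"] / overall_rate
    out["cumulative_capture"] = out["sum"].cumsum() / df["y"].sum()
    return out

# table = lift_table(y_test_claims, fraud_probability)
```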

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
…
Conclusion
Using a graph that only connects older claims to newer claims (past to future), without "leaking" future fraud information, the model successfully concentrates fraud cases in the top-scoring groups, achieving about 2–3 times higher detection in the top 10–20%. This setup is reliable enough to deploy.
As a test, it is possible to try a version where the graph is two-way or undirected (connections in both directions) and compare the spurious improvement with the one-way version. If the two-way version gets significantly better results, it is likely due to temporal leakage, meaning future information is unduly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. Here, we focus on developing a model that meets production readiness.
There is still room to improve with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.
References
[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab
[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9). Link.
[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of information leakage induced by clusters. (7), 2815–2828.
[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs. arXiv preprint. https://arxiv.org/abs/2407.12269
[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/
[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reports/An-Introduction-to-GraphSAGE–Vmlldzo1MTEwNzQ1