
Unlocking the Power of Big Data: The Fascinating World of Graph Learning


Harnessing Deep Learning to Transform Untapped Data into a Strategic Asset for Long-Term Competitiveness.

Photo by Nathan Anderson on Unsplash

Large corporations generate and collect vast amounts of data, and 90% of this data has been created recently. Yet 73% of it remains unused [1]. Nevertheless, as you may know, data is a goldmine for companies working with Big Data.

Deep learning is continuously evolving, and today the challenge is to adapt these new solutions to specific goals in order to stand out and enhance long-term competitiveness.

My previous manager had the intuition that these two trends could come together to facilitate access and requests, and above all to stop wasting time and money.

Why is this data left unused?

Accessing it takes too long: rights verification and, above all, content checks are required before granting access to users.

Reasons for data being left unused (generated by Bing Image Creator)

Is there a solution to automatically document new data?

If you're not familiar with large enterprises, no problem; I wasn't either. An interesting concept in such environments is the use of Big Data, particularly HDFS (Hadoop Distributed File System), a cluster designed to consolidate all of the company's data. Within this vast pool of information you can find structured data, and within that structured data, Hive columns are referenced. Some of these columns are used to create additional tables and serve as sources for various datasets. Companies keep track of the relationships between tables through lineage.

These columns also have various characteristics (domain, type, name, date, owner…). The goal of the project was to document the data known as physical data with business data.

Distinguishing between physical and business data:

To put it simply, physical data is a column name in a table, and business data is the usage of that column.

For example: a table named Friends contains the columns (character, salary, address). Our physical data are character, salary, and address. Our business data are, for instance,

  • For “Character” -> Name of the Character
  • For “Salary” -> Amount of the salary
  • For “Address” -> Location of the person

This business data would help in accessing the data, because you would directly have the information you need. You would know that this is the dataset you need for your project and that the data you're looking for is in this table. You would just have to ask for it and find what you need, and get going early without losing time or money.

“During my final internship, I, together with my team of interns, implemented a Big Data / Graph Learning solution to document these data.

The idea was to create a graph to structure our data and, at the end, predict business data based on features. In other words, from the data stored in the company's environment, document each dataset to associate it with a use, and in the long term reduce the search cost and become more data-driven.

We had 830 labels to classify and not that many rows. Fortunately, the power of graph learning came into play. I'll let you read on… ”

Article objectives: This article aims to provide an understanding of Big Data concepts, Graph Learning, the algorithm used, and the results. It also covers deployment considerations and how to successfully develop a model.

To help you follow my journey, the outline of this article contains:

  • Data Acquisition: Sourcing the Essential Data for Graph Creation
  • Graph-based Modeling with GSage
  • Effective Deployment Strategies

As I mentioned earlier, data is commonly stored in Hive columns. If you didn't already know, these data are stored in large containers. We extract, transform, and load this data through techniques known as ETL.

What kind of data did I need?

  • Physical data and their characteristics (domain, name, data type).
  • Lineage (the relationships between physical data, if they have undergone common transformations).
  • A mapping of ‘some physical data related to business data’, to then let the algorithm perform on its own.

1. Characteristics/features are obtained directly when we store the data; they are mandatory as soon as we store data. For example (it depends on your case):

Example of main features (made by the author)

For the features, based on empirical experience, we decided to use a feature hasher on three columns.

Feature Hasher: a technique used in machine learning to convert high-dimensional categorical data, such as text or categorical variables, into a lower-dimensional numerical representation, reducing memory and computational requirements while preserving meaningful information.

You could also choose the One-Hot Encoding technique if you have similar patterns. If you want to ship your model, my advice would be to use the Feature Hasher, as sketched below.
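To make this concrete, here is a minimal sketch of feature hashing with scikit-learn's FeatureHasher. The column names and the number of output features below are hypothetical choices for illustration, not the exact configuration we used:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical catalog of physical data: three categorical columns per entry.
catalog = pd.DataFrame({
    "domain": ["finance", "hr", "finance"],
    "data_type": ["string", "int", "date"],
    "name": ["character", "salary", "address"],
})

# Hash each row's categorical values into a fixed-size numerical vector.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform(catalog.astype(str).values.tolist()).toarray()
print(X.shape)  # (3, 16): one compact feature vector per physical data entry

Unlike One-Hot Encoding, the dimensionality stays fixed even when new categories appear, which is what makes it convenient once the model is shipped.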

2. Lineage is a little more complex, but not impossible to understand. Lineage is like a history of physical data, where we have a rough idea of what transformations have been applied and where else the data is stored.

Picture big data in your mind, with all of these data. In some projects, we use data from a table and apply a transformation through a job (Spark).

Atlas Lineage visualized, from Atlas Website, LINK

We gather the information on all the physical data we have in order to create connections in our graph, or at least one type of connection.

3. The mapping is the foundation that adds value to our project. It's where we associate our business data with our physical data. This provides the algorithm with verified information so that it can eventually classify the new incoming data. This mapping had to be done by someone who understands the processes of the company and has the skills to recognize difficult patterns without asking.

ML advice, from my own experience:

Quoting Andrew Ng: in classical machine learning, there is something called the algorithm lifecycle. We often think about the algorithm, making it complicated, instead of just using good old Linear Regression (I've tried; it doesn't work). In this lifecycle, there are all the stages of preprocessing, modeling and monitoring… but most importantly, there is data focusing.

This is a mistake we often make: we take the data for granted and start doing data analysis. We draw conclusions from the dataset without ever questioning its relevance. Don't forget data focusing, my friends; it can boost your performance and even lead to a change of project 🙂

Returning to our article: after obtaining the data, we can finally create our graph.

Plot (networkx) of the distribution of our dataset as a graph (made by the author)

This plot considers a batch of 2,000 rows, i.e. 2,000 columns across datasets and tables. You can see the business data in the center and the physical data off-center.

In mathematics, we denote a graph as G(N, V, f), where N represents the nodes, V the vertices (edges), and f the features. Let's assume all three are non-empty sets.

For the nodes, we have the business data IDs (from the mapping table) as well as the physical data, which we track via lineage.

Speaking of lineage, it partly serves as the edges, together with the links we already have through the mapping and the IDs. We had to extract it through an ETL process using the Apache Atlas APIs.
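As an illustration of how the pieces fit together, here is a minimal sketch that assembles mapping and lineage links into a PyTorch Geometric graph. The DataFrames, IDs, and placeholder features are hypothetical; in our case the lineage edges came out of the Apache Atlas ETL:

import torch
import pandas as pd
from torch_geometric.data import Data

# Hypothetical edge sources: the business/physical mapping and the lineage.
mapping = pd.DataFrame({"src": [0, 1, 2], "dst": [100, 100, 101]})  # physical -> business
lineage = pd.DataFrame({"src": [0, 2], "dst": [1, 0]})              # physical -> physical

# Stack both edge sets into a single [2, num_edges] index tensor.
edges = pd.concat([mapping, lineage], ignore_index=True)
edge_index = torch.tensor(edges[["src", "dst"]].values.T, dtype=torch.long)

x = torch.randn(102, 16)            # placeholder node features (e.g. the hashed features)
y = torch.randint(0, 830, (102,))   # placeholder labels; 830 business-data classes in my case

graph = Data(x=x, edge_index=edge_index, y=y)
print(graph)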

You can see how a big data problem, after laying the foundations, can become easy to understand but more difficult to implement, especially for a young intern…

“Ninja cartoon on a pc” (generated by Dall.E 3)

Basics of Graph Learning

This section will be dedicated to explaining GSage and why it was chosen, both mathematically and empirically.

Before this internship, I was not used to working with graphs. That's why I bought the book [2], which I've included in the references, as it greatly assisted me in understanding the principles.

The principle is simple: when we talk about graph learning, we inevitably talk about embeddings. In this context, nodes and their proximity are mathematically translated into coefficients that reduce the dimensionality of the original dataset, making it more efficient for calculations. During this reduction, one of the key principles of the decoder is to preserve the proximities between nodes that were initially close.
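In the encoder-decoder framework used in [2], this idea can be written as follows; this is the standard textbook formulation, not something specific to our project:

\mathrm{ENC}(u) = z_u \in \mathbb{R}^{d}, \quad d \ll |N|
\mathrm{DEC}(z_u, z_v) \approx \mathrm{similarity}_{G}(u, v)

The encoder maps each node u to a low-dimensional vector z_u, and the decoder is trained so that the similarity between two embeddings reflects the proximity of the corresponding nodes in the graph.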

Another source of inspiration was Maxime Labonne [3], for his explanations of GraphSAGE and Graph Convolutional Networks. He demonstrated great pedagogy and provided clear and understandable examples, making these concepts accessible to those who wish to dive into them.

If this term doesn't ring a bell, rest assured: just a few months ago, I was in your shoes. Architectures like attention networks and Graph Convolutional Networks gave me quite a few nightmares and, more importantly, kept me awake at night.

But to save you from taking up your entire day and, especially, your commute time, I'm going to simplify the algorithm for you.

Once you have the embeddings in place, that's when the magic can happen. But how does it all work, you ask?

Schema based on the Scooby-Doo universe to explain GSage (made by the author).

“You are known by the company you keep” is the sentence you need to remember.

One of the fundamental assumptions underlying GraphSAGE is that nodes residing in the same neighborhood should exhibit similar embeddings. To achieve this, GraphSAGE employs aggregation functions that take a neighborhood as input and combine each neighbor's embedding with specific weights. That's why the mystery company's embeddings would be in Scooby's neighborhood.

In essence, it gathers information from the neighborhood, with the weights being either learned or fixed depending on the loss function.

The true strength of GraphSAGE becomes evident when the aggregator weights are learned. At this point, the architecture can generate embeddings for unseen nodes using their features and neighborhood, making it a strong tool for various applications in graph-based machine learning.
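For reference, the mean-aggregator update from the original GraphSAGE paper (Hamilton et al., 2017) can be written as:

h_{\mathcal{N}(v)}^{k} = \mathrm{mean}\left(\{\, h_u^{k-1} : u \in \mathcal{N}(v) \,\}\right)
h_v^{k} = \sigma\left( W^{k} \cdot \mathrm{CONCAT}\left( h_v^{k-1},\; h_{\mathcal{N}(v)}^{k} \right) \right)

Here h_v^k is the embedding of node v at layer k, \mathcal{N}(v) is its sampled neighborhood, and W^k is the weight matrix that gets learned; this is essentially what each SAGEConv layer computes in the code further below.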

Difference in training time between architectures, Maxime Labonne's article, Link

As you can see in this graph, training time decreases when we take the same dataset with the GraphSAGE architecture. GAT (Graph Attention Network) and GCN (Graph Convolutional Network) are also really interesting graph architectures. I really encourage you to look into them!

At the first compute, I was shocked to see that it took only 25 seconds to train 1,000 batches on thousands of rows.

I know that at this point you're fascinated by Graph Learning and want to learn more; my advice would be to read his work (great examples, great advice).

As a Medium reader myself, I'm always curious to see code when I read a new article, so for you, we can implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer.

Let’s create a network with two SAGEConv layers:

  • The first one uses ReLU as the activation function and a dropout layer;
  • The second directly outputs the node embeddings.

In our multi-class classification task, we've chosen to employ the cross-entropy loss as our primary loss function. This choice is driven by its suitability for classification problems with multiple classes. Moreover, we've incorporated L2 regularization with a strength of 0.0005.

This regularization technique helps prevent overfitting and promotes model generalization by penalizing large parameter values. It's a well-rounded approach to ensure model stability and predictive accuracy.

import torch
from torch.nn import Linear, Dropout
from torch_geometric.nn import SAGEConv, GATv2Conv, GCNConv
import torch.nn.functional as F


class GraphSAGE(torch.nn.Module):
    """GraphSAGE"""
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.sage1 = SAGEConv(dim_in, dim_h)
        self.sage2 = SAGEConv(dim_h, dim_out)  # 830 for my case
        self.optimizer = torch.optim.Adam(self.parameters(),
                                          lr=0.01,
                                          weight_decay=5e-4)

    def forward(self, x, edge_index):
        h = self.sage1(x, edge_index).relu()
        h = F.dropout(h, p=0.5, training=self.training)
        h = self.sage2(h, edge_index)
        return F.log_softmax(h, dim=1)

    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = self.optimizer

        self.train()
        for epoch in range(epochs + 1):
            total_loss = 0
            acc = 0
            val_loss = 0
            val_acc = 0

            # Train on batches
            for batch in train_loader:
                optimizer.zero_grad()
                out = self(batch.x, batch.edge_index)
                loss = criterion(out[batch.train_mask], batch.y[batch.train_mask])
                total_loss += loss
                acc += accuracy(out[batch.train_mask].argmax(dim=1),
                                batch.y[batch.train_mask])
                loss.backward()
                optimizer.step()

                # Validation
                val_loss += criterion(out[batch.val_mask], batch.y[batch.val_mask])
                val_acc += accuracy(out[batch.val_mask].argmax(dim=1),
                                    batch.y[batch.val_mask])

            # Print metrics every 10 epochs
            if epoch % 10 == 0:
                print(f'Epoch {epoch:>3} | Train Loss: {total_loss/len(train_loader):.3f} '
                      f'| Train Acc: {acc/len(train_loader)*100:>6.2f}% | Val Loss: '
                      f'{val_loss/len(train_loader):.2f} | Val Acc: '
                      f'{val_acc/len(train_loader)*100:.2f}%')


def accuracy(pred_y, y):
    """Calculate accuracy."""
    return ((pred_y == y).sum() / len(y)).item()


@torch.no_grad()
def test(model, data):
    """Evaluate the model on the test set and return the accuracy score."""
    model.eval()
    out = model(data.x, data.edge_index)
    acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
    return acc
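For completeness, here is a hedged sketch of how this class could be wired up with a neighborhood sampler. The loader settings and hidden dimension are assumptions for illustration, and `data` is expected to carry x, edge_index, y and the train/val/test masks (the `train_loader` used inside `fit` lives at module level in this simplified version):

from torch_geometric.loader import NeighborLoader

# Sample a fixed number of neighbors per SAGEConv layer for each mini-batch.
train_loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],
    batch_size=64,
    input_nodes=data.train_mask,
)

model = GraphSAGE(dim_in=data.num_features, dim_h=64, dim_out=830)
model.fit(data, epochs=200)
print(f"Test accuracy: {test(model, data) * 100:.2f}%")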

In the development and deployment of our project, we harnessed the power of three key technologies, each serving a distinct and integral purpose:

Three logos from Google

Airflow: To efficiently manage and schedule our project's complex data workflows, we used the Airflow orchestrator. Airflow is a widely adopted tool for orchestrating tasks, automating processes, and ensuring that our data pipelines ran smoothly and on schedule (a minimal DAG sketch is shown further below).

Mirantis: Our project’s infrastructure was built and hosted on the Mirantis cloud platform. Mirantis is renowned for providing robust, scalable, and reliable cloud solutions, offering a solid foundation for our deployment.

Jenkins: To streamline our development and deployment processes, we relied on Jenkins, a trusted name in the world of continuous integration and continuous delivery (CI/CD). Jenkins automated the building, testing, and deployment of our project, ensuring efficiency and reliability throughout our development cycle.

Moreover, we stored our machine learning code in the company's Artifactory. But what exactly is an Artifactory?

Artifactory: An Artifactory is a centralized repository manager for storing, managing, and distributing various artifacts, such as code, libraries, and dependencies. It serves as a secure and organized storage space, ensuring that all team members have easy access to the necessary assets. This enables seamless collaboration and simplifies the deployment of applications and projects, making it a valuable asset for efficient development and deployment workflows.

By housing our machine learning code within the Artifactory, we ensured that our models and data were available to support our deployment via Jenkins.
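To give a flavor of the orchestration side, here is a minimal Airflow DAG sketch; the task names, schedule, and the callables they wrap are purely illustrative and not our actual pipeline:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical pipeline steps; in reality each task would call our packaged code.
def extract_lineage(): ...
def build_graph(): ...
def train_and_predict(): ...

with DAG(
    dag_id="graph_learning_documentation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_lineage", python_callable=extract_lineage)
    build = PythonOperator(task_id="build_graph", python_callable=build_graph)
    train = PythonOperator(task_id="train_and_predict", python_callable=train_and_predict)

    extract >> build >> train  # enforce the execution order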

ET VOILÀ! The solution was deployed.

I talked a lot about the infrastructure but not so much about the Machine Learning and the results we had.

The confidence of the predictions:

For each physical data item, we take the top 2 predictions into consideration, given the model's performance.

How’s that possible?

probabilities = torch.softmax(raw_output, dim=1)
# torch.topk to get the top 2 probabilities and their indices for each prediction
topk_values, topk_indices = torch.topk(probabilities, k=2, dim=1)

First, I used a softmax to make the outputs comparable, and then I used a function named torch.topk. It returns the k largest elements of the given input tensor along a given dimension.

So, back to the first prediction: here was our distribution after training. Let me tell you, girls and boys, that's great!

Plot (matplotlib) of the probabilities of the model outputs, first prediction (made by the author)

Accuracies, Losses on Train / Test / Validation.

I won't teach you what accuracies and losses are in ML; I assume you're all pros (ask ChatGPT if you're not sure, no shame). On the training, despite the different scales, you can see convergence in the curves, which is great and shows stable learning.

Plot (matplotlib) of accuracies and losses (made by the author)

t-SNE :

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique used for visualizing and exploring high-dimensional data by preserving the pairwise similarities between data points in a lower-dimensional space.

In other words, imagine a random distribution before training :

Data distribution before training (made by the author)

Remember, we're doing multi-class classification, so here's the distribution after training. The aggregation of features seems to have done a satisfactory job. Clusters have formed and physical data appear to have joined groups, demonstrating that the training went well.

Data distribution after training (made by the author)
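Plots like the two above can be produced with a short sketch such as the following, reusing the `data` and `model` objects (and the torch import) from the code earlier; running t-SNE on the raw features versus the trained model's outputs is my own illustrative choice, not necessarily how we produced the figures:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_tsne(embeddings, labels, title):
    """Project embeddings to 2D with t-SNE and color the points by class."""
    coords = TSNE(n_components=2, init="pca").fit_transform(embeddings.cpu().numpy())
    plt.scatter(coords[:, 0], coords[:, 1], c=labels.cpu().numpy(), s=5, cmap="tab20")
    plt.title(title)
    plt.show()

plot_tsne(data.x, data.y, "Before training")        # raw features: no visible structure
model.eval()
plot_tsne(model(data.x, data.edge_index), data.y, "After training")  # clusters emerge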

Our goal was to predict business data based on physical data (and we did it). I'm pleased to tell you that the algorithm is now in production and is onboarding new users for the future.

While I cannot provide the complete solution due to proprietary reasons, I believe you have all the essential details or are well-equipped to implement it on your own.

My last piece of advice, I swear: have a great team, not only people who work well but people who make you laugh every day.

If you have any questions, please don't hesitate to reach out to me. Feel free to connect with me, and we can have a detailed discussion about it.

In case I don't see ya: good afternoon, good evening, and good night!

Have you grasped everything?

As Chandler Bing would have said :

“It's always better to lie than to have the complicated discussion.”

Don't forget to like and share!

[1] Inc (2018), Web Article from Inc

[2] Graph Machine Learning: Take graph data to the next level by applying machine learning techniques and algorithms (2021), Claudio Stamile

[3] GSage, Scaling up the Graph Neural Network, (2021), Maxime Labonne
