The Small-World Experiment was conducted by Stanley Milgram in the 1960s. He devised an experiment in which a letter was given to a volunteer in the US, with the instruction to forward the letter to the personal contact most likely to know another person – the target – in the same country. In turn, the person receiving the letter was asked to forward the letter again, until the target person was reached. Although many of the letters never reached the target person, among those that did (survivorship bias at play here!), the average number of hops was around 6. The "six degrees of separation" has become a cultural reference to the tight interconnectivity of society.
I'm still amazed by the thought that individuals with ~10² contacts manage to connect with a random person in a network of ~10⁸ people, through just a few hops.
How is that possible? Heuristics.
Let's assume I'm asked to send a letter to a target person in Finland¹.
Unfortunately, I don't have any contacts in Finland. However, I do know someone who lived in Sweden for a few years. Perhaps she knows people in Finland. If not, she probably still has contacts in Sweden, which is a neighboring country. She is my best bet to get closer to the target person. The point is, although I don't know the topology of the social network beyond my own personal contacts, I can use rules of thumb to forward the letter in the right direction.
If we adopt the perspective of the network's nodes (the humans involved in the experiment), their available actions are to forward the message (the letter) along one of their outgoing edges (personal contacts). This problem of transmitting the message in the right direction offers an opportunity to have some fun with machine learning!
Nodes don't perceive the whole network topology. We can set up an environment that rewards them for routing the message along the shortest known path, while encouraging them to explore sub-optimal candidate paths. That sounds like a good use case for reinforcement learning, don't you think?
For those interested in running the code, you can reach the repo here.
The Problem
We'll be given a directed graph where edges between nodes are sparse, meaning the average number of outgoing edges from a node is significantly smaller than the number of nodes. Moreover, the edges will have an associated cost. This extra feature generalizes the case of the Small-World Experiment, where each hop of the letter counted for a cost of 1.
The problem we'll consider is to design a reinforcement learning algorithm that finds a path from an arbitrary start node to an arbitrary target node in a sparse directed graph, with a cost as low as possible, if such a path exists. Deterministic solutions exist for this problem. For instance, Dijkstra's algorithm finds the shortest path from a start node to all the other nodes in a directed graph. This will be useful to evaluate the results of our reinforcement learning algorithm, which is not guaranteed to find the optimal solution.
Q-Learning
Q-Learning is a reinforcement learning technique where an agent maintains a table of state-action pairs, associated with the expected discounted cumulative reward – the quality Q, hence the name Q-Learning. Through iterative experiments, the table is updated until a stopping criterion is met. After training, the agent can select the action (the column of the Q-matrix) corresponding to the maximal quality, for a given state (the row of the Q-matrix).
The update rule, given a trial action \(a_j\) leading to the transition from state \(s_i\) to state \(s_k\), a reward \(r\), and a best estimate of the quality of the next state \(s_k\), is:
\[ Q(i, j) \leftarrow (1 - \alpha) \, Q(i, j) + \alpha \left( r + \gamma \max_{l} Q(k, l) \right) \]
Equation 1: The update rule for Q-Learning.
In equation 1:
- α is the learning rate, controlling how fast new results will erase old quality estimates.
- γ is the discount factor, controlling how much future rewards compare with immediate rewards.
- Q is the quality matrix. The row index i is the index of the origin state, and the column index j is the index of the chosen action.
In brief, equation 1 states that the quality of a (state, action) pair should be partly updated with a new quality value, made of the sum of the immediate reward and the discounted estimate of the next state's maximum quality over possible actions.
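As a quick numeric illustration, with hypothetical values α = 0.5, γ = 0.9, r = -0.3, Q(i, j) = -1.0, and max_l Q(k, l) = -0.8, the update gives:

\[ Q(i, j) \leftarrow 0.5 \times (-1.0) + 0.5 \left( -0.3 + 0.9 \times (-0.8) \right) = -1.01 \]

The new estimate moves partway from the old value of -1.0 toward the freshly observed value of -1.02.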
For our problem statement, a possible formulation for the state would be the pair (current node, target node), and the set of actions would be the set of nodes. The state set would then contain N² values, and the action set would contain N values, where N is the number of nodes. However, since the graph is sparse, a given origin node has only a small subset of nodes as outgoing edges. This formulation would result in a Q-matrix where the large majority of the N³ entries are never visited, consuming memory needlessly.
Distributed agents
A better use of resources is to distribute the agents. Each node can be considered an agent. The agent's state is the target node, hence its Q-matrix has N rows and Nout columns (the number of outgoing edges for this specific node). With N agents, the total number of matrix entries is N² × Nout, which is less than N³.
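To put hypothetical numbers on it: with N = 1,000 nodes and an average of Nout = 5 outgoing edges per node, the distributed agents store about

\[ N^2 \, N_{out} = 1000^2 \times 5 = 5 \times 10^6 \]

entries, versus N³ = 10⁹ entries for the single-agent formulation.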
To summarize:
- We'll be training N agents, one for each node in the graph.
- Each agent will learn a Q-matrix of dimensions [N x Nout]. The number of outgoing edges (Nout) can vary from one node to another. For a sparsely connected graph, Nout << N.
- The row indices of the Q-matrix correspond to the state of the agent, i.e., the target node.
- The column indices of the Q-matrix correspond to the outgoing edge chosen by an agent to route a message toward the target node. Q[i, j] represents a node's estimate of the quality of forwarding the message to its j-th outgoing edge, given that the target node is i.
- When a node receives a message, the message will include the target node. Since the sender of the previous message is not needed to determine the routing of the next message, it will not be included in the latter.
Code
The core class, the node, will be named QNode.
import random

import numpy as np


class QNode:
    def __init__(self, number_of_nodes=0, connectivity_average=0, connectivity_std_dev=0, Q_arr=None, neighbor_nodes=None,
                 state_dict=None):
        if state_dict is not None:
            self.Q = state_dict['Q']
            self.number_of_nodes = state_dict['number_of_nodes']
            self.neighbor_nodes = state_dict['neighbor_nodes']
        else:  # state_dict is None
            if Q_arr is None:
                self.number_of_nodes = number_of_nodes
                number_of_neighbors = connectivity_average + connectivity_std_dev * np.random.randn()
                number_of_neighbors = round(number_of_neighbors)
                number_of_neighbors = max(number_of_neighbors, 2)  # At least two out-connections
                number_of_neighbors = min(number_of_neighbors, self.number_of_nodes)  # Not more than N connections
                self.neighbor_nodes = random.sample(range(self.number_of_nodes), number_of_neighbors)  # [1, 4, 5, ...]
                self.Q = np.zeros((self.number_of_nodes, number_of_neighbors))  # Optimistic initialization: all rewards will be negative
                # q = self.Q[state, action]: state = target node; action = chosen neighbor node (converted to column index) to route the message to
            else:  # state_dict is None and Q_arr is not None
                self.Q = Q_arr
                self.number_of_nodes = self.Q.shape[0]
                self.neighbor_nodes = neighbor_nodes
The class QNode has three fields: the number of nodes in the graph, the list of outgoing edges, and the Q-matrix. The Q-matrix is initialized with zeros. The rewards received from the environment will be the negative of the edge costs. Hence, the quality values will all be negative. For that reason, initializing with zeros is an optimistic initialization.
When a message reaches a QNode object, it selects one of its outgoing edges through the epsilon-greedy algorithm. If ε is small, the epsilon-greedy algorithm selects, most of the time, the outgoing edge with the highest Q-value. Occasionally, it will select an outgoing edge randomly:
def epsilon_greedy(self, target_node, epsilon):
    rdm_nbr = random.random()
    if rdm_nbr < epsilon:  # Random choice
        random_choice = random.choice(self.neighbor_nodes)
        return random_choice
    else:  # Greedy choice
        neighbor_columns = np.where(self.Q[target_node, :] == self.Q[target_node, :].max())[0]  # [1, 4, 5]
        neighbor_column = random.choice(neighbor_columns)
        neighbor_node = self.neighbor_node(neighbor_column)
        return neighbor_node
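The epsilon_greedy() method relies on a neighbor_node() helper, and update_Q() further below relies on a matching neighbor_column() helper. Neither is shown here; a minimal sketch of what they presumably do – converting between a neighbor's node index and its column index in the Q-matrix – could be:

def neighbor_column(self, neighbor_node):
    # Assumed helper: column index in the Q-matrix of a given neighbor node
    return self.neighbor_nodes.index(neighbor_node)

def neighbor_node(self, neighbor_column):
    # Assumed helper: neighbor node corresponding to a given column of the Q-matrix
    return self.neighbor_nodes[neighbor_column]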
The other class is the graph, called QGraph.
class QGraph:
    def __init__(self, number_of_nodes=10, connectivity_average=3, connectivity_std_dev=0, cost_range=[0.0, 1.0],
                 maximum_hops=100, maximum_hops_penalty=1.0):
        self.number_of_nodes = number_of_nodes
        self.connectivity_average = connectivity_average
        self.connectivity_std_dev = connectivity_std_dev
        self.cost_range = cost_range
        self.maximum_hops = maximum_hops
        self.maximum_hops_penalty = maximum_hops_penalty
        self.QNodes = []
        for node in range(self.number_of_nodes):
            self.QNodes.append(QNode(self.number_of_nodes, self.connectivity_average, self.connectivity_std_dev))
        self.cost_arr = cost_range[0] + (cost_range[1] - cost_range[0]) * np.random.random((self.number_of_nodes, self.number_of_nodes))
Its primary fields are the list of nodes and the array of edge costs. Notice that the actual edges are part of the QNode class, as a list of outgoing nodes.
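For illustration, a small graph could be created like this (the exact parameters used for the 12-node experiment below are not given in the post, so these values are hypothetical):

# Hypothetical construction of a small Q-graph; the experiment's actual parameters may differ
qgraph = QGraph(number_of_nodes=12, connectivity_average=3, connectivity_std_dev=1,
                cost_range=[0.0, 1.0], maximum_hops=100)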
When we want to generate a path from a start node to a target node, we call the QGraph.trajectory() method, which calls the QNode.epsilon_greedy() method:
def trajectory(self, start_node, target_node, epsilon):
    visited_nodes = [start_node]
    costs = []
    if start_node == target_node:
        return visited_nodes, costs
    current_node = start_node
    while len(visited_nodes) < self.maximum_hops + 1:
        next_node = self.QNodes[current_node].epsilon_greedy(target_node, epsilon)
        cost = float(self.cost_arr[current_node, next_node])
        visited_nodes.append(next_node)
        costs.append(cost)
        current_node = next_node
        if current_node == target_node:
            return visited_nodes, costs
    # We reached the maximum number of hops
    return visited_nodes, costs
The trajectory() method returns the list of visited nodes along the path and the list of costs associated with the edges that were used.
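For example, assuming a trained qgraph object, a greedy rollout (epsilon=0, no exploration) from Node-4 to Node-11 would look like this:

# Greedy (epsilon=0) rollout on a trained graph; node indices are the ones used in the experiments below
visited_nodes, costs = qgraph.trajectory(start_node=4, target_node=11, epsilon=0.0)
print(visited_nodes)  # e.g., [4, 3, 5, 11]
print(sum(costs))     # Total undiscounted cost of the path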
The last missing piece is the update rule of equation 1, implemented by the QGraph.update_Q() method:
def update_Q(self, start_node, neighbor_node, alpha, gamma, target_node):
    cost = self.cost_arr[start_node, neighbor_node]
    reward = -cost
    # Q_orig(target, dest) <- (1 - alpha) Q_orig(target, dest) + alpha * ( r + gamma * max_neigh' Q_dest(target, neigh') )
    Q_orig_target_dest = self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)]
    max_neigh_Q_dest_target_neigh = np.max(self.QNodes[neighbor_node].Q[target_node, :])
    updated_Q = (1 - alpha) * Q_orig_target_dest + alpha * (reward + gamma * max_neigh_Q_dest_target_neigh)
    self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)] = updated_Q
To train our agents, we iteratively loop over the pairs of (start_node, target_node), with an inner loop over the neighbor nodes of start_node, and we call update_Q().
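The training method itself is not reproduced here; a minimal sketch following that description, with hypothetical hyperparameter values, could look like this:

# Minimal training-loop sketch based on the description above; the repo's actual loop may differ
def train(qgraph, number_of_epochs=100, alpha=0.1, gamma=0.9):
    for epoch in range(number_of_epochs):
        for start_node in range(qgraph.number_of_nodes):
            for target_node in range(qgraph.number_of_nodes):
                if target_node == start_node:
                    continue
                for neighbor_node in qgraph.QNodes[start_node].neighbor_nodes:  # Inner loop over outgoing edges
                    qgraph.update_Q(start_node, neighbor_node, alpha, gamma, target_node)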
Experiments and Results
Let's start with a simple graph of 12 nodes, with directed weighted edges.
Figure 1: A directed graph of 12 nodes, with weighted edges.
We can observe from Figure 1 that the only incoming edge of Node-1 comes from Node-7, and the only incoming edge of Node-7 comes from Node-1. Therefore, no other node besides these two can reach Node-1 and Node-7. When another node is tasked with sending a message to Node-1 or Node-7, the message will bounce around the graph until the maximum number of hops is reached. We expect to see large negative Q-values in these cases.
When we train our graph, we get statistics about the cost and the number of hops as a function of the epoch, as in Figure 2:
Figure 2: Statistics of the cost and the number of hops as a function of the training epoch.
After training, this is the Q-matrix for Node-4:
Figure 3: The Q-matrix of Node-4 after training.
The trajectory from Node-4 to, say, Node-11 can be obtained by calling the trajectory() method, setting epsilon=0 for the greedy deterministic solution: [4, 3, 5, 11], for a total undiscounted cost of 0.9 + 0.9 + 0.3 = 2.1. The Dijkstra algorithm returns the same path.
In some rare cases, the optimal path was not found. For instance, to go from Node-6 to Node-9, a particular instance of the trained Q-graph returned [6, 11, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.5, while the Dijkstra algorithm returned [6, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.4. As we stated before, this is expected from an iterative algorithm that is not guaranteed to find the optimal solution.
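To obtain the Dijkstra reference paths, one option (not necessarily what the repo does) is to export the graph to networkx and use its Dijkstra routines:

# Hypothetical comparison against Dijkstra using networkx; the repo may compute its reference paths differently
import networkx as nx

nx_graph = nx.DiGraph()
for node, qnode in enumerate(qgraph.QNodes):
    for neighbor in qnode.neighbor_nodes:
        nx_graph.add_edge(node, neighbor, weight=float(qgraph.cost_arr[node, neighbor]))

optimal_path = nx.dijkstra_path(nx_graph, source=6, target=9, weight='weight')        # e.g., [6, 0, 4, 10, 2, 9]
optimal_cost = nx.dijkstra_path_length(nx_graph, source=6, target=9, weight='weight')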
Conclusion
We formulated the Small-World Experiment as a problem of finding the path with minimum cost between any pair of nodes in a sparse directed graph with weighted edges. We implemented the nodes as Q-Learning agents, which learn through the update rule to take the least costly action, given a target node.
With a simple graph, we showed that the training settled to a solution close to the optimal one.
Thanks for your time, and feel free to experiment with the code. If you have ideas for fun applications of Q-Learning, please let me know!
¹ OK, I'm going beyond the original Small-World Experiment, which should rather be called the Small-Country Experiment.
References
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
