The Small-World Experiment was conducted by Stanley Milgram in the 1960s. He devised an experiment in which a letter was given to a volunteer in the US, with the instruction to forward the letter to the personal contact most likely to know another person – the target – in the same country. In turn, the person receiving the letter was asked to forward the letter again, until the target person was reached. Although many of the letters never reached the target person, among those that did (survivorship bias at play here!), the average number of hops was around 6. The "six degrees of separation" has become a cultural reference to the tight interconnectivity of society.
I'm still amazed by the thought that individuals with ~10² contacts manage to connect with a random person in a network of ~10⁸ people, through just a few hops.
How is that possible? Heuristics.
Let's assume I'm asked to send a letter to a target person in Finland¹.
Unfortunately, I don't have any contacts in Finland. However, I do know someone who lived in Sweden for a few years. Perhaps she knows people in Finland. If not, she probably still has contacts in Sweden, which is a neighboring country. She is my best bet to get closer to the target person. The point is, although I don't know the topology of the social network beyond my own personal contacts, I can use rules of thumb to forward the letter in the right direction.
If we adopt the perspective of the network's nodes (the humans involved in the experiment), their available actions are to forward the message (the letter) along one of their outgoing edges (personal contacts). This problem of transmitting the message in the right direction offers an opportunity to have some fun with machine learning!
Nodes don't perceive the whole network topology. We can set up an environment that rewards them for routing the message along the shortest known path, while encouraging them to explore sub-optimal candidate paths. That sounds like a good use case for reinforcement learning, don't you think?
For those interested in running the code, you can reach the repo here.
The Problem
We'll be given a directed graph where edges between nodes are sparse, meaning the average number of outgoing edges from a node is significantly smaller than the number of nodes. Moreover, the edges will have an associated cost. This extra feature generalizes the case of the Small-World Experiment, where each hop of the letter counted for a cost of 1.
The problem we'll consider is to design a reinforcement learning algorithm that finds a path from an arbitrary start node to an arbitrary target node in a sparse directed graph, with a cost as low as possible, if such a path exists. Deterministic solutions exist for this problem. For instance, Dijkstra's algorithm finds the shortest path from a start node to all the other nodes in a directed graph. This will be useful to evaluate the results of our reinforcement learning algorithm, which is not guaranteed to find the optimal solution.
Q-Learning
Q-Learning is a reinforcement learning technique where an agent maintains a table of state-action pairs, associated with the expected discounted cumulative reward – the quality Q, hence the name Q-Learning. Through iterative experiments, the table is updated until a stopping criterion is met. After training, the agent can select the action (the column of the Q-matrix) corresponding to the maximal quality, for a given state (the row of the Q-matrix).
The update rule, given a trial action \(a_j\) leading to the transition from state \(s_i\) to state \(s_k\), a reward \(r\), and a best estimate of the quality of the next state \(s_k\), is:
\[ Q(i, j) \leftarrow (1 - \alpha) \, Q(i, j) + \alpha \left( r + \gamma \max_{l} Q(k, l) \right) \]
Equation 1: The update rule for Q-Learning.
In equation 1:
- α is the learning rate, controlling how fast new results will erase old quality estimates.
- γ is the discount factor, controlling how much future rewards compare with immediate rewards.
- Q is the quality matrix. The row index i is the index of the origin state, and the column index j is the index of the chosen action.
In brief, equation 1 states that the quality of a (state, action) pair should be partly updated with a new quality value, made of the sum of the immediate reward and the discounted estimate of the next state's maximum quality over possible actions.
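As a quick numeric illustration, with hypothetical values α = 0.5, γ = 0.9, r = -0.3, Q(i, j) = -1.0, and max_l Q(k, l) = -0.8, the update gives:

\[ Q(i, j) \leftarrow 0.5 \times (-1.0) + 0.5 \left( -0.3 + 0.9 \times (-0.8) \right) = -1.01 \]

The new estimate moves partway from the old value of -1.0 toward the freshly observed value of -1.02.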
For our problem statement, a possible formulation for the state would be the pair (current node, target node), and the set of actions would be the set of nodes. The state set would then contain N² values, and the action set would contain N values, where N is the number of nodes. However, since the graph is sparse, a given origin node has only a small subset of nodes as outgoing edges. This formulation would result in a Q-matrix where the large majority of the N³ entries are never visited, consuming memory needlessly.
Distributed agents
A better use of resources is to distribute the agents. Each node can be considered an agent. The agent's state is the target node, hence its Q-matrix has N rows and Nout columns (the number of outgoing edges for this specific node). With N agents, the total number of matrix entries is N² × Nout, which is less than N³.
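To put hypothetical numbers on it: with N = 1,000 nodes and an average of Nout = 5 outgoing edges per node, the distributed agents store about

\[ N^2 \, N_{out} = 1000^2 \times 5 = 5 \times 10^6 \]

entries, versus N³ = 10⁹ entries for the single-agent formulation.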
To summarize:
- We'll be training N agents, one for each node in the graph.
- Each agent will learn a Q-matrix of dimensions [N x Nout]. The number of outgoing edges (Nout) can vary from one node to another. For a sparsely connected graph, Nout << N.
- The row indices of the Q-matrix correspond to the state of the agent, i.e., the target node.
- The column indices of the Q-matrix correspond to the outgoing edge chosen by an agent to route a message toward the target node. Q[i, j] represents a node's estimate of the quality of forwarding the message to its j-th outgoing edge, given that the target node is i.
- When a node receives a message, the message will include the target node. Since the sender of the previous message is not needed to determine the routing of the next message, it will not be included in the latter.
Code
The core class, the node, will be named QNode.
import random

import numpy as np


class QNode:
    def __init__(self, number_of_nodes=0, connectivity_average=0, connectivity_std_dev=0, Q_arr=None, neighbor_nodes=None,
                 state_dict=None):
        if state_dict is not None:
            self.Q = state_dict['Q']
            self.number_of_nodes = state_dict['number_of_nodes']
            self.neighbor_nodes = state_dict['neighbor_nodes']
        else:  # state_dict is None
            if Q_arr is None:
                self.number_of_nodes = number_of_nodes
                number_of_neighbors = connectivity_average + connectivity_std_dev * np.random.randn()
                number_of_neighbors = round(number_of_neighbors)
                number_of_neighbors = max(number_of_neighbors, 2)  # At least two out-connections
                number_of_neighbors = min(number_of_neighbors, self.number_of_nodes)  # Not more than N connections
                self.neighbor_nodes = random.sample(range(self.number_of_nodes), number_of_neighbors)  # [1, 4, 5, ...]
                self.Q = np.zeros((self.number_of_nodes, number_of_neighbors))  # Optimistic initialization: all rewards will be negative
                # q = self.Q[state, action]: state = target node; action = chosen neighbor node (converted to column index) to route the message to
            else:  # state_dict is None and Q_arr is not None
                self.Q = Q_arr
                self.number_of_nodes = self.Q.shape[0]
                self.neighbor_nodes = neighbor_nodes
The class QNode has three fields: the number of nodes in the graph, the list of outgoing edges, and the Q-matrix. The Q-matrix is initialized with zeros. The rewards received from the environment will be the negative of the edge costs. Hence, the quality values will all be negative. For that reason, initializing with zeros is an optimistic initialization.
When a message reaches a QNode object, it selects one of its outgoing edges through the epsilon-greedy algorithm. If ε is small, the epsilon-greedy algorithm selects, most of the time, the outgoing edge with the highest Q-value. Occasionally, it will select an outgoing edge randomly:
def epsilon_greedy(self, target_node, epsilon):
    rdm_nbr = random.random()
    if rdm_nbr < epsilon:  # Random choice
        random_choice = random.choice(self.neighbor_nodes)
        return random_choice
    else:  # Greedy choice
        neighbor_columns = np.where(self.Q[target_node, :] == self.Q[target_node, :].max())[0]  # [1, 4, 5]
        neighbor_column = random.choice(neighbor_columns)
        neighbor_node = self.neighbor_node(neighbor_column)
        return neighbor_node
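The epsilon_greedy() method relies on a neighbor_node() helper, and update_Q() further below relies on a matching neighbor_column() helper. Neither is shown here; a minimal sketch of what they presumably do – converting between a neighbor's node index and its column index in the Q-matrix – could be:

def neighbor_column(self, neighbor_node):
    # Assumed helper: column index in the Q-matrix of a given neighbor node
    return self.neighbor_nodes.index(neighbor_node)

def neighbor_node(self, neighbor_column):
    # Assumed helper: neighbor node corresponding to a given column of the Q-matrix
    return self.neighbor_nodes[neighbor_column]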
The other class is the graph, called QGraph.
class QGraph:
    def __init__(self, number_of_nodes=10, connectivity_average=3, connectivity_std_dev=0, cost_range=[0.0, 1.0],
                 maximum_hops=100, maximum_hops_penalty=1.0):
        self.number_of_nodes = number_of_nodes
        self.connectivity_average = connectivity_average
        self.connectivity_std_dev = connectivity_std_dev
        self.cost_range = cost_range
        self.maximum_hops = maximum_hops
        self.maximum_hops_penalty = maximum_hops_penalty
        self.QNodes = []
        for node in range(self.number_of_nodes):
            self.QNodes.append(QNode(self.number_of_nodes, self.connectivity_average, self.connectivity_std_dev))
        self.cost_arr = cost_range[0] + (cost_range[1] - cost_range[0]) * np.random.random((self.number_of_nodes, self.number_of_nodes))
Its primary fields are the list of nodes and the array of edge costs. Notice that the actual edges are part of the QNode class, as a list of outgoing nodes.
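For illustration, a small graph could be created like this (the exact parameters used for the 12-node experiment below are not given in the post, so these values are hypothetical):

# Hypothetical construction of a small Q-graph; the experiment's actual parameters may differ
qgraph = QGraph(number_of_nodes=12, connectivity_average=3, connectivity_std_dev=1,
                cost_range=[0.0, 1.0], maximum_hops=100)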
When we want to generate a path from a start node to a target node, we call the QGraph.trajectory() method, which calls the QNode.epsilon_greedy() method:
def trajectory(self, start_node, target_node, epsilon):
    visited_nodes = [start_node]
    costs = []
    if start_node == target_node:
        return visited_nodes, costs
    current_node = start_node
    while len(visited_nodes) < self.maximum_hops + 1:
        next_node = self.QNodes[current_node].epsilon_greedy(target_node, epsilon)
        cost = float(self.cost_arr[current_node, next_node])
        visited_nodes.append(next_node)
        costs.append(cost)
        current_node = next_node
        if current_node == target_node:
            return visited_nodes, costs
    # We reached the maximum number of hops
    return visited_nodes, costs
The trajectory() method returns the list of visited nodes along the path and the list of costs associated with the edges that were used.
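For example, assuming a trained qgraph object, a greedy rollout (epsilon=0, no exploration) from Node-4 to Node-11 would look like this:

# Greedy (epsilon=0) rollout on a trained graph; node indices are the ones used in the experiments below
visited_nodes, costs = qgraph.trajectory(start_node=4, target_node=11, epsilon=0.0)
print(visited_nodes)  # e.g., [4, 3, 5, 11]
print(sum(costs))     # Total undiscounted cost of the path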
The last missing piece is the update rule of equation 1, implemented by the QGraph.update_Q() method:
def update_Q(self, start_node, neighbor_node, alpha, gamma, target_node):
    cost = self.cost_arr[start_node, neighbor_node]
    reward = -cost
    # Q_orig(target, dest) <- (1 - alpha) Q_orig(target, dest) + alpha * ( r + gamma * max_neigh' Q_dest(target, neigh') )
    Q_orig_target_dest = self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)]
    max_neigh_Q_dest_target_neigh = np.max(self.QNodes[neighbor_node].Q[target_node, :])
    updated_Q = (1 - alpha) * Q_orig_target_dest + alpha * (reward + gamma * max_neigh_Q_dest_target_neigh)
    self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)] = updated_Q
To train our agents, we iteratively loop over the pairs of (start_node, target_node), with an inner loop over the neighbor nodes of start_node, and we call update_Q().
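The training method itself is not reproduced here; a minimal sketch following that description, with hypothetical hyperparameter values, could look like this:

# Minimal training-loop sketch based on the description above; the repo's actual loop may differ
def train(qgraph, number_of_epochs=100, alpha=0.1, gamma=0.9):
    for epoch in range(number_of_epochs):
        for start_node in range(qgraph.number_of_nodes):
            for target_node in range(qgraph.number_of_nodes):
                if target_node == start_node:
                    continue
                for neighbor_node in qgraph.QNodes[start_node].neighbor_nodes:  # Inner loop over outgoing edges
                    qgraph.update_Q(start_node, neighbor_node, alpha, gamma, target_node)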
Experiments and Results
Let's start with a simple graph of 12 nodes, with directed weighted edges.
Figure 1: A directed graph of 12 nodes, with weighted edges.
We can observe from Figure 1 that the only incoming edge of Node-1 comes from Node-7, and the only incoming edge of Node-7 comes from Node-1. Therefore, no other node besides these two can reach Node-1 and Node-7. When another node is tasked with sending a message to Node-1 or Node-7, the message will bounce around the graph until the maximum number of hops is reached. We expect to see large negative Q-values in these cases.
When we train our graph, we get statistics about the cost and the number of hops as a function of the epoch, as in Figure 2:
Figure 2: Statistics of the cost and the number of hops as a function of the training epoch.
After training, this is the Q-matrix for Node-4:
Figure 3: The Q-matrix of Node-4 after training.
The trajectory from Node-4 to, say, Node-11 can be obtained by calling the trajectory() method, setting epsilon=0 for the greedy deterministic solution: [4, 3, 5, 11], for a total undiscounted cost of 0.9 + 0.9 + 0.3 = 2.1. The Dijkstra algorithm returns the same path.
In some rare cases, the optimal path was not found. For instance, to go from Node-6 to Node-9, a particular instance of the trained Q-graph returned [6, 11, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.5, while the Dijkstra algorithm returned [6, 0, 4, 10, 2, 9] for a total undiscounted cost of 3.4. As we stated before, this is expected from an iterative algorithm that is not guaranteed to find the optimal solution.
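To obtain the Dijkstra reference paths, one option (not necessarily what the repo does) is to export the graph to networkx and use its Dijkstra routines:

# Hypothetical comparison against Dijkstra using networkx; the repo may compute its reference paths differently
import networkx as nx

nx_graph = nx.DiGraph()
for node, qnode in enumerate(qgraph.QNodes):
    for neighbor in qnode.neighbor_nodes:
        nx_graph.add_edge(node, neighbor, weight=float(qgraph.cost_arr[node, neighbor]))

optimal_path = nx.dijkstra_path(nx_graph, source=6, target=9, weight='weight')        # e.g., [6, 0, 4, 10, 2, 9]
optimal_cost = nx.dijkstra_path_length(nx_graph, source=6, target=9, weight='weight')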
Conclusion
We formulated the Small-World Experiment as a problem of finding the path with minimum cost between any pair of nodes in a sparse directed graph with weighted edges. We implemented the nodes as Q-Learning agents, which learn through the update rule to take the least costly action, given a target node.
With a simple graph, we showed that the training settled to a solution close to the optimal one.
Thanks for your time, and feel free to experiment with the code. If you have ideas for fun applications of Q-Learning, please let me know!
¹ OK, I'm going beyond the original Small-World Experiment, which should rather be called the Small-Country Experiment.
References
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
