Introduction
Logistics is an industry that often operates with surprising inefficiency: manual processes, piles of paperwork, legal complexities. Many corporations still run on paper or Excel and don’t even collect data on their shipments.
But what if an organization is large enough to save millions — or even hundreds of millions — of dollars through optimization (to say nothing of the environmental impact)? Or what if an organization is small, but poised for rapid growth?
Optimization is usually non-existent or rudimentary — designed for operational convenience rather than maximum savings. The industry is clearly lagging behind, yet there is a TON of money on the table. Shipment networks span the globe, from Alaska to Sydney. I won’t bore you with market size statistics here. Insiders already know the scale, and outsiders can make an educated (or not so educated) guess.
And that’s where I came in. As a Data Science and Machine Learning specialist, I found myself in a big, fast-growing logistics company. Crucially, the team there wasn’t just going through the motions; they genuinely wanted to optimize. This led to the creation of a line-haul optimization project that I led for two years — and that’s the story I’m here to tell.
This project will always hold a warm spot in my heart, though it never fully made it to production. I believe it holds massive potential — specifically in the combination of logistics and RL’s unique ability to generalize decision-making.
While traditional optimization projects often focus on maximizing the objective function or execution speed, the most interesting metric here is how many unseen cases we can solve with the same model (zero-shot or few-shot).
In other words, we’re aiming for a generalizable zero-shot policy.
Ideally, we train an agent, drop it into new conditions (ones it has never seen), and it just works — without any retraining, or with only minimal fine-tuning. We don’t need perfection; we just need it to perform ‘well enough’ not to breach the SLA.
Then we can say: ‘Cool, the agent generalized to this case, too.’
I’m confident that this approach can yield models capable of ever-increasing generalization over time. I believe this is the future of the industry.
And as one of my favorite stand-up comedians once said:
Eventually, somebody will do it anyway. Let it be us.
Business Context
The company had scaled rapidly, growing into a network of over 100 line-haul terminals. At this magnitude, manual scheduling reached its operational limit. Once established, a schedule — along with its underlying business contracts and arrangements — would often remain static for months without a single change.
We observed a consistent inefficiency: trucks were frequently dispatched with suboptimal loads — either underutilized (driving up unit costs) or bottlenecked by last-minute overflows.
The financial impact of this inefficiency was significant. In a network of this size, even a 1% increase in vehicle utilization translates to millions of dollars in annual savings. Therefore, maximizing vehicle utilization became the primary lever for cost reduction.
Big Picture Problem
We had access to historical shipment data. While the storage format was far from convenient, the volume was sufficient for modeling. Thanks to the efforts of my data engineering and data science colleagues, this raw data was transformed into a clean, usable state (I’ll cover the specific data engineering challenges in a separate article).
My initial goal was to generate a ‘good’ schedule. A Schedule is defined here as a tabular dataset where every row represents a physical movement (shipment):
- Timestamp: Hourly precision.
- Origin & Destination: The specific edge in the graph.
- Vehicle Type: The discrete asset class (e.g., 20-ton semi, 5-ton van, etc.).
- Load Manifest: The actual set of aggregated ‘pallets’ packed inside.
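The four fields above map naturally onto a tabular record. A minimal sketch of one schedule row, with hypothetical field names (the project's actual schema is not shown in this article):

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative sketch of one schedule row; field names are hypothetical,
# not the project's real schema.
@dataclass
class ShipmentRow:
    timestamp: datetime       # hourly precision
    origin: str               # terminal ID (start of the graph edge)
    destination: str          # terminal ID (end of the graph edge)
    vehicle_type: str         # discrete asset class, e.g. "semi_20t"
    load_manifest: list       # IDs of the aggregated 'pallets' on board

row = ShipmentRow(datetime(2024, 3, 1, 14, 0), "HUB_A", "HUB_B",
                  "semi_20t", ["P1", "P2"])
```

A full schedule is then just a list of such rows, one per physical movement.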
Therefore, constructing a schedule requires four distinct decisions:
- Select what packages to send. What can go wrong: if low-priority packages are sent first, valuable or urgent cargo might get stranded at the warehouse. We don’t want that, because the penalty is higher for the more valuable packages.
- Select the next warehouse (where to ship). Essentially, this is a routing problem: choosing the optimal ‘next edge’ on the graph for every single package.
- Select vehicle types and their quantity. This is a balancing act. What can go wrong: sending multiple small vehicles instead of one large one creates fleet inefficiency, while dispatching large trucks that drive mostly empty means paying for air. Conversely, under-provisioning the fleet results in delays, costing us in both SLA penalties and reputation.
- Finally, inaction is also an action. For any given time step, the optimal move may be to send no trucks at all. To create an optimized schedule, the system must perfectly balance active shipments with ‘doing nothing’.
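The four decisions above can be sketched as a single per-step data structure. This is a hedged illustration with hypothetical names, not the project's actual action encoding:

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-step decision bundle.
@dataclass
class DispatchDecision:
    package_ids: list   # 1) what packages to send
    next_hop: str       # 2) which neighboring warehouse to ship to
    vehicles: dict      # 3) vehicle type -> count, e.g. {"semi_20t": 1}

def build_step(decisions):
    # 4) Inaction is also an action: an empty list encodes
    #    "send no trucks this hour".
    return list(decisions)

idle_step = build_step([])
busy_step = build_step([DispatchDecision(["P1", "P2"], "HUB_B", {"semi_20t": 1})])
```

Chaining these step decisions over an episode is what ultimately yields the schedule table.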
However, reality introduces additional complexities and constraints into the problem space:
- Pace of Change: Business rules are numerous, complex, and evolve rapidly. The real world can be far more complex and messier than a basic simulation. And changes in the real world lead to expensive and time-consuming code updates.
- Stochastic Demand: Demand is non-deterministic, unknown in advance, and dynamic (e.g., multiple visits to a customer within a window).
- Multi-Objective Optimization: We aren’t just minimizing cost; we’re balancing cost against SLA penalties (lateness) and fleet expenses.
So now, we understand that we not only need to create a good schedule, but also a system that respects dynamic demand, truck capacity, and various custom business rules, which can also change often. This crystallized into the following wish-list.
Wish-List
- Low-Cost Reusability. We need the ability to reuse the mechanism for brand-new tasks and contexts cheaply. Since real-world problems shift quickly, the solution must be versatile — adaptable to new settings without requiring us to retrain the model from scratch each time.
- Fast Inference. While slow training is acceptable if it yields stronger generalization, the inference (decision-making) must be fast.
- ‘Good Enough’ Effectiveness. The system doesn’t have to be perfect, but it must strictly adhere to the baseline SLA levels.
- Global Optimization. We want to optimize the system as a whole, rather than optimizing its individual components in isolation.
System Specifications
- Topology: Custom graph containing 2 to 100 nodes
- Decision frequency: 1-hour intervals, 480 steps/episode (representing 20 days)
- Agents: Decentralized hubs acting as independent decision-makers
- Constraints: Hard physical limits on vehicle volume (m³) and weight (kg). Hard limit on the number of vehicles dispatched from a terminal per hour.
- Objective: Minimize global cost while adhering to dynamic SLA windows.
- Primary metrics: Shipments cost, percentage of late packages (SLA violations), count of dispatched vehicles by type
- Secondary “Long-term” Metrics: Average transit time and vehicle capacity utilization.
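The specification above can be captured as a plain config object. A minimal sketch (key names are mine, not the project's):

```python
# Hypothetical config mirroring the system specifications listed above.
ENV_SPEC = {
    "min_nodes": 2, "max_nodes": 100,      # custom graph topology
    "step_hours": 1,                       # one decision per hour
    "episode_steps": 480,                  # 480 hourly steps = 20 days
    "hard_constraints": ["vehicle_volume_m3", "vehicle_weight_kg",
                         "dispatches_per_terminal_per_hour"],
    "primary_metrics": ["shipment_cost", "late_package_pct",
                        "dispatched_vehicles_by_type"],
}

# Sanity check: 480 one-hour steps really is 20 days.
assert ENV_SPEC["episode_steps"] * ENV_SPEC["step_hours"] == 20 * 24
```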
Why Not Standard Solvers?
Spoiler: they can’t cut it, and they are not good enough.
Naturally, we started by exploring standard solvers and off-the-shelf tools like Google OR-Tools. However, the consensus was discouraging: these tools would either solve our actual problem poorly, or they would perfectly solve a different, imaginary version of the problem. Ultimately, I concluded that this approach was a dead end.
Linear Optimization
This is the simplest and cheapest approach, but it has a fatal flaw: a linear formulation fails to account for temporal dynamics (every step depends on the previous one).
Essentially, LP assumes the whole optimization problem fits into a single, static snapshot. It ignores the fact that every step depends on the previous one. This is fundamentally wrong and divorced from reality, where every movement in the network creates ripple effects elsewhere.
Moreover, the sheer volume of business rules makes it practically impossible to cram them all into a “flat” solver. In short, while Linear Programming is a great tool, it is too rigid for a problem of this magnitude.
Genetic Algorithms
Genetic Algorithms (GA) were closer in philosophy to what we wanted. While they do work, they come with significant drawbacks of their own.
First, slow inference. To get a result, you essentially must run the optimization from scratch each time (evolving the population). You can’t simply “train” a model and freeze the weights, because there are no weights to freeze. Consequently, the system’s response time is measured in seconds or even minutes — not the milliseconds typical of a neural network or a heuristic. In a production environment dealing with hundreds of hubs in real time, this becomes a major bottleneck.
Second, lack of determinism. If you run the scheduler twice on the same dataset, a GA can yield two completely different schedules. Business customers often don’t like that very much, which can lead to trust issues.
Why not Pure RL?
Theoretically, one could try to solve the entire problem end-to-end using pure Reinforcement Learning. But that is definitely the hard way.
A pure RL solution would take one of two forms: either a single “God Mode” Agent that sees everything and allocates every package to every truck on every route at every step, or a team of Sequential Agents acting one after another.
God-Mode Agent
In the first case, the action space becomes unmanageable. You aren’t just choosing a route — you have to choose the vehicle types and counts for every direction. With packages, it gets even worse: you don’t just need to select a subset of cargo — you have to assign specific packages to specific trucks. Plus, you keep the option to leave a package at the warehouse.
Even with a small fleet, the number of ways to assign specific packages to specific trucks is astronomical. Asking a neural network to explore this whole space from scratch is inefficient. It would spend eons just trying to figure out which package fits into which bin.
Sequential Agents
A sequence of agents passing packages down the line would create a non-stationarity nightmare.
While Agent 1 is learning, its behavior is essentially random. Agent 2 tries to adapt to Agent 1, but since Agent 1 keeps changing its strategy, Agent 2 can never stabilize. Instead of solving logistics, each agent is forced to infinitely adapt to its neighbor’s instability. It becomes a case of the blind leading the blind, unlikely to converge in any reasonable time.
Moreover, pure RL struggles to learn hard constraints (like maximum weight limits) without incurring massive penalties. It tends to “hallucinate” solutions — outputs that look efficient but are physically impossible.
On the other hand, we have Linear Programming (LP): a fast, simple solver that handles hard constraints natively. The temptation to carve out a sub-problem and offload it to LP was too great to resist.
And that is why I chose a hybrid approach.
Implemented Solution
MARL + LP Hybrid Architecture
Let’s build an RL agent that observes the state of the logistics network and orchestrates the flow of packages — deciding exactly what volume of cargo moves between warehouses at any given moment. Ideally, this agent makes decisions holistically, factoring in the global state of the system rather than simply optimizing individual warehouses in isolation.
An Agent, then, represents a specific warehouse responsible for shipping packages to its neighbors. We then connect these agents into a multi-agent network. Since every action taken by an agent corresponds to a shipment to one or more destinations, the aggregate sequence of those actions constitutes the final schedule.
Technically, we implemented a Multi-Agent Reinforcement Learning (MARL) framework. The RL environment trains the algorithms to generate viable transportation schedules for real-world shipments. Crucially, this project includes both the environment creation and the agent training pipelines, ensuring that the solution can adapt (via continual learning) to increasingly complex scenarios with minimal human intervention.
What agents see
Below are the key observations (model inputs) fed into the agent (I’ll cover more of the implementation details in Part 2).
- Local Inventory: The volume of packages at each warehouse.
- In-Transit Volume: The volume of packages currently traveling on the edges between warehouses.
- Cargo Value: The total financial value of the inventory (crucial for risk management) at each warehouse.
- SLA Heatmap: The nearest deadlines for the current stock (identifying urgent cargo).
- Inbound Forecast: The volume of packages expected to arrive within the next 24 hours.
- Heuristic Hints: Used exclusively during the imitation learning stage to bootstrap training.
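As a rough illustration, the observations listed above could be flattened into a single feature vector per agent. This is a sketch under assumed names and shapes, not the project's actual featurizer:

```python
# Hypothetical sketch: assemble one agent's (one warehouse's) observation
# vector from the signals listed above. Names and shapes are illustrative.
def build_observation(node, outgoing_edges):
    obs = []
    obs.append(node["inventory_volume"])                    # Local Inventory
    obs += [e["in_transit_volume"] for e in outgoing_edges] # In-Transit Volume
    obs.append(node["cargo_value"])                         # Cargo Value
    obs += node["sla_heatmap"]                              # nearest deadlines
    obs.append(node["inbound_24h_forecast"])                # Inbound Forecast
    return obs

node = {"inventory_volume": 120.0, "cargo_value": 5400.0,
        "sla_heatmap": [0.1, 0.4, 0.5], "inbound_24h_forecast": 30.0}
edges = [{"in_transit_volume": 15.0}, {"in_transit_volume": 0.0}]
vec = build_observation(node, edges)  # length: 1 + 2 + 1 + 3 + 1 = 8
```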
Version 1. Agents Slicing a PriorityQueue
In this version, packages are lined up in a priority queue, sorted in descending order by a simple score (proximity to deadline). The RL agent “slices” a portion of this queue by choosing a fraction of the top packages and deciding which warehouse to send them to.
We use heuristics to pre-filter the options — discarding packages we definitely don’t want to send yet, or ruling out nonsensical destinations (e.g., shipping a package in the opposite direction of its destination).
Once the RL selects the what and the where, the Linear Programming solver steps in to pick the quantity and type of vehicles. The LP enforces hard constraints on weight, volume, and fleet availability to ensure the simulation doesn’t violate the laws of physics.
In Version 1, a single action consists of sending packages to one neighbor only. The quantity is determined by the “fraction” (0.0 to 1.0) chosen by the agent. “Doing nothing” is simply choosing a fraction of 0.
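The V1 action can be sketched in a few lines. The deadline-proximity scoring here is a hypothetical stand-in for the actual formula:

```python
# Hedged sketch of V1: rank packages by deadline proximity, then let the
# agent's chosen fraction (0.0-1.0) slice off the top of the queue.
def slice_queue(packages, fraction, now):
    # Smallest time-to-deadline first = most urgent first (illustrative score).
    ranked = sorted(packages, key=lambda p: p["deadline"] - now)
    k = int(round(fraction * len(ranked)))
    return ranked[:k]  # fraction == 0.0 encodes "do nothing"

pkgs = [{"id": "A", "deadline": 30},
        {"id": "B", "deadline": 5},
        {"id": "C", "deadline": 12}]
urgent = slice_queue(pkgs, fraction=0.5, now=0)
# Half of a 3-package queue rounds to 2: the two most urgent packages, B and C.
```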

But then, it hit me!
Version 2. Agents Sending Trucks
TL;DR: Instead of selecting packages, we built an agent that chooses how many trucks to dispatch to each destination. The Linear Programming (LP) solver then decides exactly which packages to pack into those trucks.
What if the agent controlled the fleet capacity directly? This lets the LP solver handle the low-level “bin packing” work, while the RL agent focuses purely on high-level flow management. This is exactly what we wanted!
Here is the new division of labor:
RL Agent — Fleet Manager. Decides the number of vehicles and their destinations.
- Intuition: It looks at the map, checks the calendar, and shouts: “Send 5 trucks to the North Hub!” It handles the flow management.
- Skill: Strategy, foresight, and balancing.
LP Solver — Dock Worker. Selects the specific vehicle types (optimizing the fleet mix) and picks the exact packages to pack.
- Intuition: It takes the “5 trucks” order and the pile of boxes, then packs them perfectly to maximize value density.
- Skill: Tetris, algebra, and physical validity.
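To make the "Dock Worker" role concrete, here is a toy version of the fleet-mix decision: cover the cargo volume at minimum cost under hard capacity constraints. A tiny exhaustive search stands in for the real LP solver, and the vehicle specs are invented for illustration:

```python
from itertools import product

# Hypothetical vehicle catalog (capacities and costs are made up).
VEHICLES = {"van_5t": {"cap_m3": 20, "cost": 100},
            "semi_20t": {"cap_m3": 80, "cost": 300}}

def cheapest_fleet(total_m3, max_per_type=5):
    """Brute-force stand-in for the LP: cheapest vehicle mix covering total_m3."""
    best = None
    names = list(VEHICLES)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        cap = sum(c * VEHICLES[n]["cap_m3"] for n, c in zip(names, counts))
        cost = sum(c * VEHICLES[n]["cost"] for n, c in zip(names, counts))
        if cap >= total_m3 and (best is None or cost < best[0]):
            best = (cost, dict(zip(names, counts)))
    return best

cost, fleet = cheapest_fleet(100)
# 100 m3 -> one semi (80 m3) + one van (20 m3) at cost 400,
# cheaper than two semis (600) or five vans (500).
```

In production this sub-problem is exactly what an LP/MILP solver handles natively, including the weight and per-terminal dispatch limits.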
Previously, the agent controlled a “fraction of the queue,” which determined the package count, which determined the truck count, which finally determined the reward. Now, the agent controls the truck count directly. The link between Action and Reward became much shorter and more predictable, making training faster and more stable. In technical terms, we significantly reduced the stochastic noise in the reward signal. The LP now optimizes only the packaging and fleet mix after the strategic capacity decision has already been made.
But the engineering benefits didn’t stop there. Since the LP now selects the packages, we no longer need to maintain a sorted PriorityQueue. This simplified the architecture in three critical ways. First, concurrency: we eliminated the multiprocessing headaches associated with sharing complex PriorityQueue objects between processes. Second, vectorization: we no longer need to iterate through a queue item by item (a slow Python loop) and can rewrite everything using matrix operations. This unlocked massive potential for speed optimization, and the code became significantly shorter and cleaner. And finally, multi-destination actions: the agent can now dispatch trucks to different warehouses in a single step (unlike V1, which was limited to one destination per step). It became immediately clear that this was the winning architecture.

Scale-Invariant Observation Space and Generalization
TL;DR: I use histogram state representations normalized to 0–1 instead of absolute values to make the agents transferable to new cases.
A core pillar of this project’s philosophy is universality — the ability to reuse the solution across different tasks and new conditions without retraining. However, standard RL requires a rigidly fixed action and observation space.
To reconcile this, we normalized the observation space to make it scale-invariant. Instead of tracking raw counts (e.g., “how many packages were sent”), we track ratios (e.g., “what percentage of the total backlog was sent”). This allows the agent to operate at a higher level of abstraction where absolute numbers are irrelevant.
The result is a model capable of generalizing across different scenarios, enabling zero-shot transfer across nodes with vastly different capacities.
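A minimal sketch of this trick, assuming a histogram over urgency buckets (the bucket edges here are invented): raw counts are converted to fractions of the local total, so a tiny depot and a huge hub with the same backlog distribution produce identical observations.

```python
# Scale-invariant observation sketch: normalize a backlog histogram to 0-1.
# Bucket edges (hours to deadline) are hypothetical.
def normalize_backlog(hours_to_deadline, buckets=(6, 24, 72)):
    total = len(hours_to_deadline)
    if total == 0:
        return [0.0] * (len(buckets) + 1)
    hist = [0] * (len(buckets) + 1)
    for h in hours_to_deadline:
        idx = sum(h >= b for b in buckets)  # which urgency bucket h falls into
        hist[idx] += 1
    return [c / total for c in hist]        # ratios instead of raw counts

small = normalize_backlog([2, 5, 30])         # 3-package depot
large = normalize_backlog([2, 5, 30] * 1000)  # 3000-package hub, same mix
# Both yield the same 0-1 observation, so one policy serves both nodes.
```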
A Glimpse of the Performance
Agents Learned “LTL Consolidation” Behavior
TL;DR: Increased shipment cost led to more idle actions and fewer vehicles.
One of the most impressive emergent behaviors was the agents’ ability to perform LTL (Less-Than-Truckload) Consolidation. At the beginning of training, the agents were trigger-happy, dispatching many partially filled trucks at every step. Over time, their behavior shifted.
The shipment cost is calculated as the product of the vehicle cost and the shipment cost multiplier. When the shipment cost multiplier increases, a shipment costs more relative to the value of the packages. That gives us a simple way to manually adjust the shipment-cost part of the reward.
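In code, the cost term described above is just a product, and its effect on the reward is easy to see. The reward shape and the numbers are illustrative, not the project's actual reward function:

```python
# Shipment cost as described: vehicle cost times a manual multiplier.
def shipment_cost(vehicle_base_cost, multiplier):
    return vehicle_base_cost * multiplier

# Illustrative step reward: delivered value minus dispatch costs.
def step_reward(delivered_value, vehicle_costs, multiplier):
    return delivered_value - sum(shipment_cost(c, multiplier)
                                 for c in vehicle_costs)

# Raising the multiplier makes the same dispatch less attractive,
# nudging the agent toward idling and fuller trucks.
cheap = step_reward(1000, [300, 300], multiplier=1.0)   # 1000 - 600 = 400
pricey = step_reward(1000, [300, 300], multiplier=2.0)  # 1000 - 1200 = -200
```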

As we increased the shipment cost multiplier (making logistics costlier relative to the package value), the agents learned to be patient. They began selecting more “idle” actions, effectively accumulating inventory to send fewer, fuller trucks.

Because it is expensive to send a truck half-empty (or half-full, depending on your worldview), agents started waiting to fill the trucks closer to 100% capacity. In other words, the agents learned to optimize vehicle utilization indirectly, purely as a byproduct of the cost/reward function.
On the other hand, sending fewer trucks led to a higher number of overdue packages. I believe this kind of trade-off — cost vs. speed — should be decided by each business independently, based on its specific strategy and SLAs. In our case, we had a hard cap on the percentage of allowed delays, so we could optimize while staying below that cap.
More results and experiments will be shown in the upcoming Part 3.
Constraints and Advantages
As I mentioned earlier, high-quality data is crucial for this engine. If you don’t have data, you have no simulation, no schedules, and no package flow forecasts — the very foundation of the entire system.
You also need the willingness to adapt your business processes. In practice, this is often met with resistance. And, of course, you need the raw compute power (substantial RAM + CPU) to run the simulations.
But if you can overcome these hurdles, you might find that your logistics network has transformed into something far more powerful — a network that:
- Can withstand overloads, peak seasons, and sudden events. This is because you have a fast, reliable way to generate a new schedule immediately by simply applying your pre-trained agents to the new data.
- Is more efficient than the competition. MARL has the potential to achieve not only local optimization, but global optimization of the entire network over a continuous time horizon.
- Can rapidly expand or contract as needed. This flexibility is achieved precisely through the model’s generalization capabilities.
All the best to everyone, and may your shipments always be fast and reliable!
LinkedIn | E-mail
