Ray: Distributed Computing For All, Part 2

This article is the second instalment in my two-part series on the Ray library, a Python framework created by AnyScale for distributed and parallel computing. Part 1 covered how to parallelise CPU-intensive Python jobs on your local PC by distributing the workload across all available cores, leading to marked improvements in runtime. I'll leave a link to Part 1 at the top of this article.

This part deals with a similar theme, except we take distributing Python workloads to the next level by using Ray to parallelise them across multi-server clusters in the cloud.

If you've come to this without having read Part 1, the TL;DR of Ray is that it's an open-source distributed computing framework designed to make it easy to scale Python programs from a laptop to a cluster with minimal code changes. That alone should hopefully be enough to pique your interest. In my own test, on my desktop PC, I took a simple, relatively short Python program that finds prime numbers and reduced its runtime by a factor of 10 by adding just four lines of code.

Where can you run Ray clusters?

Ray clusters can be set up on the following:

  • AWS and GCP, although unofficial integrations exist for other providers, too, such as Azure.
  • AnyScale, a fully managed platform developed by the creators of Ray.
  • Kubernetes, via the officially supported KubeRay project.

Prerequisites

To follow along with my process, you'll need a few things set up beforehand. I'll be using AWS for my demo, as I already have an existing account there; however, I expect the setup for other cloud providers and platforms to be very similar. You should have:

  • Credentials set up to run cloud CLI commands for your chosen provider (see the quick check after this list).
  • A default VPC and at least one public subnet associated with it that has a publicly reachable IP address.
  • An SSH key pair file (.pem) that you can download to your local system so that Ray (and you) can connect to the nodes in your cluster.
  • Enough quota to satisfy the requested number of nodes and vCPUs in whichever cluster you set up.
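As a quick way to confirm the credentials are in place, you can ask AWS who you are. This is a minimal sketch, assuming boto3 is installed (pip install boto3), which is not otherwise required for Ray:

import boto3

# Prints the AWS account ID and ARN that your current credentials resolve to.
# If this raises an error, `ray up` will fail for the same reason.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])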

If you want to do some local testing of your Ray code before deploying it to a cluster, you'll also need to install the Ray library. We can do this using pip.

$ pip install ray

I'll be running everything from a WSL2 Ubuntu shell on my Windows desktop.
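If you want that local test to be more than an import check, a minimal smoke test using the same Ray primitives we'll use on the cluster might look like this:

import ray

ray.init()  # no address given, so Ray starts a local instance on this machine

@ray.remote
def square(x: int) -> int:
    return x * x

# Four tasks fan out across the local cores; the results come back in order.
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

ray.shutdown()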

To confirm that Ray has been installed correctly, you should also be able to use its command-line interface. In a terminal window, type the following command.

$ ray --help

Usage: ray [OPTIONS] COMMAND [ARGS]...

Options:
  --logging-level TEXT   The logging level threshold, choices=['debug',
                         'info', 'warning', 'error', 'critical'],
                         default='info'
  --logging-format TEXT  The logging format.
                         default="%(asctime)s %(levelname)s
                         %(filename)s:%(lineno)s -- %(message)s"
  --version              Show the version and exit.
  --help                 Show this message and exit.

Commands:
  attach               Create or attach to a SSH session to a Ray cluster.
  check-open-ports     Check open ports in the local Ray cluster.
  cluster-dump         Get log data from one or more nodes.
...
...
...

If you don't see this, something has gone wrong, and you should double-check the output of your install command.

Assuming everything is OK, we're good to go.

One last important point, though. Creating resources, such as compute clusters, on a cloud provider like AWS will incur costs, so it's essential to bear this in mind. The good news is that Ray has a built-in command that will tear down any infrastructure it creates, but to be safe, you should double-check that no unused and potentially costly services get left "switched on" by mistake.

Our example Python code

The first step is to modify our existing Ray code from Part 1 to run on a cluster. Here is the original code for your reference. Recall that we are trying to count the number of prime numbers within a specific numeric range.

import math
import time

# -----------------------------------------
# Change No. 1
# -----------------------------------------
import ray
ray.init()

def is_prime(n: int) -> bool:
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    r = int(math.isqrt(n)) + 1
    for i in range(3, r, 2):
        if n % i == 0:
            return False
    return True

# -----------------------------------------
# Change No. 2
# -----------------------------------------
@ray.remote(num_cpus=1)  # pure-Python loop → 1 CPU per task
def count_primes(a: int, b: int) -> int:
    c = 0
    for n in range(a, b):
        if is_prime(n):
            c += 1
    return c

if __name__ == "__main__":
    A, B = 10_000_000, 20_000_000
    total_cpus = int(ray.cluster_resources().get("CPU", 1))

    # Start "chunky"; we can sweep this later
    chunks = max(4, total_cpus * 2)
    step = (B - A) // chunks

    print(f"nodes={len(ray.nodes())}, CPUs~{total_cpus}, chunks={chunks}")
    t0 = time.time()
    refs = []
    for i in range(chunks):
        s = A + i * step
        e = s + step if i < chunks - 1 else B
        # -----------------------------------------
        # Change No. 3
        # -----------------------------------------
        refs.append(count_primes.remote(s, e))

    # -----------------------------------------
    # Change No. 4
    # -----------------------------------------
    total = sum(ray.get(refs))

    print(f"total={total}, time={time.time() - t0:.2f}s")

What modifications are needed to run it on a cluster? The answer is that only one minor change is required.

Change 

ray.init() 

to

ray.init(address="auto")

That's one of the beauties of Ray. The same code runs almost unmodified on your local PC and anywhere else you care to run it, including large, multi-server cloud clusters.
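As a quick sanity check that the script really has attached to the cluster, rather than quietly started a private single-machine Ray instance, you can print the cluster's resources straight after the modified init call. A small sketch using standard Ray calls:

import ray

# "auto" tells Ray to connect to the cluster already running on this node.
ray.init(address="auto")

# On the cluster, this should report one entry per node and far more CPUs
# than a single machine would have.
print(len(ray.nodes()), "nodes")
print(ray.cluster_resources().get("CPU"), "CPUs")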

Setting up our cluster

On the cloud, a Ray cluster consists of a head node and one or more worker nodes. In AWS, these nodes are simply EC2 instances. Ray clusters can be fixed-size, or they can autoscale up and down based on the resources requested by applications running on the cluster. The head node is started first, and the worker nodes are configured with the head node's address to form the cluster. If autoscaling is enabled, worker nodes automatically scale up or down based on the application's load, and idle workers are scaled back down after a user-specified period (5 minutes by default).
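Most of the time the autoscaler reacts to queued tasks on its own, but you can also ask it for capacity up front. Here's a hedged sketch using Ray's autoscaler SDK; the CPU count is illustrative, not taken from my run:

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale the cluster until 64 CPUs are available,
# rather than waiting for pending tasks to trigger the scale-up.
request_resources(num_cpus=64)

# Later, clear the request so idle workers can be terminated after the
# idle timeout.
request_resources(num_cpus=0)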

Ray uses YAML files to set up clusters. A YAML file is just a plain-text file with a simple, human-readable syntax that is widely used for system configuration.

Here is the YAML file I'll be using to set up my cluster. I found that the closest EC2 instance to my desktop PC, in terms of CPU core count and performance, was a c7g.8xlarge. For simplicity, I'm making the head node the same instance type as all the workers, but you can mix and match different EC2 types if desired.

cluster_name: ray_test

provider:
  type: aws
  region: eu-west-1
  availability_zone: eu-west-1a

auth:
  # For Amazon Linux AMIs the SSH user is 'ec2-user'.
  # If you switch to an Ubuntu AMI, change this to 'ubuntu'.
  ssh_user: ec2-user
  ssh_private_key: ~/.ssh/ray-autoscaler_eu-west-1.pem

max_workers: 10
idle_timeout_minutes: 10

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      InstanceType: c7g.8xlarge
      ImageId: ami-06687e45b21b1fca9
      KeyName: ray-autoscaler_eu-west-1

  worker_node:
    min_workers: 5
    max_workers: 5
    node_config:
      InstanceType: c7g.8xlarge
      ImageId: ami-06687e45b21b1fca9
      KeyName: ray-autoscaler_eu-west-1
      InstanceMarketOptions:
        MarketType: spot

# =========================
# Setup commands (run on head + workers)
# =========================
setup_commands:
  - |
    set -euo pipefail

    have_cmd() { command -v "$1" >/dev/null 2>&1; }
    have_pip_py() {
      python3 -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec("pip") else 1)'
    }

    # 1) Ensure Python 3 is present
    if ! have_cmd python3; then
      if have_cmd dnf; then
        sudo dnf install -y python3
      elif have_cmd yum; then
        sudo yum install -y python3
      elif have_cmd apt-get; then
        sudo apt-get update -y
        sudo apt-get install -y python3
      else
        echo "No supported package manager found to install python3." >&2
        exit 1
      fi
    fi

    # 2) Ensure pip exists
    if ! have_pip_py; then
      python3 -m ensurepip --upgrade >/dev/null 2>&1 || true
    fi
    if ! have_pip_py; then
      if have_cmd dnf; then
        sudo dnf install -y python3-pip || true
      elif have_cmd yum; then
        sudo yum install -y python3-pip || true
      elif have_cmd apt-get; then
        sudo apt-get update -y || true
        sudo apt-get install -y python3-pip || true
      fi
    fi
    if ! have_pip_py; then
      curl -fsS https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
      python3 /tmp/get-pip.py
    fi

    # 3) Upgrade packaging tools and install Ray
    python3 -m pip install -U pip setuptools wheel
    python3 -m pip install -U "ray[default]"

Here's a brief explanation of each critical YAML section.

cluster_name: Assigns a name to the cluster, allowing Ray to track and manage it independently from others.

provider: Specifies which cloud to use (AWS here), along with the region and availability zone for launching instances.

auth: Defines how Ray connects to instances over SSH, namely the user name and the private key used for authentication.

max_workers: Sets the maximum number of worker nodes Ray can scale up to when more compute is required.

idle_timeout_minutes: Tells Ray how long to wait before automatically terminating idle worker nodes.

available_node_types: Describes the different node types (head and workers), their instance sizes, AMI images, and scaling limits.

head_node_type: Identifies which of the node types acts as the cluster's controller (the head node).

setup_commands: Lists shell commands that run once on each node when it is first created, typically to install software or set up the environment.
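Before running ray up, it can be worth a quick check that the YAML parses and points at the instance types you expect. A minimal sketch, assuming PyYAML is available (pip install pyyaml if it isn't):

import yaml

with open("ray_test.yaml") as f:
    cfg = yaml.safe_load(f)

# Print the key facts the cluster launcher will act on.
print("cluster:", cfg["cluster_name"])
print("region: ", cfg["provider"]["region"])
for name, node_type in cfg["available_node_types"].items():
    print(f"{name}: {node_type['node_config']['InstanceType']}")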

To start the cluster creation, use this ray command from the terminal.

$ ray up -y ray_test.yaml

Ray will do its thing, creating all the necessary infrastructure, and after a few minutes, you should see something like this in your terminal window.

...
...
...
Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.0.9.248:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://10.0.9.248:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    10.0.9.248:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 108.130.38.255 closed.
  New status: up-to-date

Useful commands:
  To terminate the cluster:
    ray down /mnt/c/Users/thoma/ray_test.yaml

  To retrieve the IP address of the cluster head:
    ray get-head-ip /mnt/c/Users/thoma/ray_test.yaml

  To port-forward the cluster's Ray Dashboard to the local machine:
    ray dashboard /mnt/c/Users/thoma/ray_test.yaml

  To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
    ray job submit --address http://localhost: --working-dir . -- python my_script.py

  To connect to a terminal on the cluster head for debugging:
    ray attach /mnt/c/Users/thoma/ray_test.yaml

  To monitor autoscaling:
    ray exec /mnt/c/Users/thoma/ray_test.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
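The "Next steps" output above also mentions the Ray Jobs CLI. The same submission can be done from Python via the Ray Jobs SDK. Here's a sketch, assuming the dashboard has been port-forwarded to localhost:8265 with ray dashboard, and using the placeholder script name my_script.py from the output above:

from ray.job_submission import JobSubmissionClient

# Talks to the dashboard port that `ray dashboard ray_test.yaml` forwards locally.
client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python my_script.py",   # placeholder script name
    runtime_env={"working_dir": "."},   # ship the current directory to the cluster
)

print("submitted job:", job_id)
print("status:", client.get_job_status(job_id))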

Running a Ray job on a cluster

At this stage, the cluster has been built, and we're ready to submit our Ray job to it. To give the cluster something more substantial to work with, I increased the range for the prime search in my code from 10,000,000–20,000,000 to 10,000,000–60,000,000. On my local desktop, Ray ran this in 18 seconds.
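Concretely, that is a one-line edit to the constants near the bottom of the script:

A, B = 10_000_000, 60_000_000   # previously 10_000_000, 20_000_000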

I waited a short while for all the cluster nodes to initialise fully, then ran the code on the cluster with this command.

$  ray exec ray_test.yaml 'python3 ~/ray_test.py'

Here is my output.

(base) tom@tpr-desktop:/mnt/c/Users/thoma$ ray exec ray_test2.yaml 'python3 ~/primes_ray.py'
2025-11-01 13:44:22,983 INFO util.py:389 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 52.213.155.130
Warning: Permanently added '52.213.155.130' (ED25519) to the list of known hosts.
2025-11-01 13:44:26,469 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.5.86:6379...
2025-11-01 13:44:26,477 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at http://10.0.5.86:8265
nodes=6, CPUs~192, chunks=384
(autoscaler +2s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2s) No available node types can fulfill resource requests {'CPU': 1.0}*160. Add suitable node types to this cluster to resolve this issue.
total=2897536, time=5.71s
Shared connection to 52.213.155.130 closed.

As you can see, the time taken to run on the cluster was just over 5 seconds. So, five worker nodes ran the same job in less than a third of the time it took on my local PC. Not too shabby.

When you're finished with your cluster, please run the following Ray command to tear it down.

$ ray down -y ray_test.yaml

As I mentioned before, you should always double-check your account to ensure this command has worked as expected.

Summary

This article, the second in a two-part series, demonstrates how to run CPU-intensive Python code on cloud-based clusters using the Ray library. By spreading the workload across all available vCPUs in the cluster, Ray delivers a significant reduction in runtime.

I described and showed how to create a cluster using a YAML file and how to use the Ray command-line interface to submit code for execution on the cluster.

Using AWS as an example platform, I took the Ray Python code that had been running on my local PC and ran it, almost unchanged, on a 6-node EC2 cluster. This showed a significant performance improvement (around 3x) over the non-cluster runtime.

Finally, I showed how to use the ray command-line tool to tear down the AWS cluster infrastructure Ray had created.

If you haven't already read the first article in this series, click the link below to check it out.

Please note that, apart from being an occasional user of their services, I have no affiliation with AnyScale, AWS, or any other organisation mentioned in this article.
