Introduction
The campaign you ran for Black Friday was a massive success, and customers start pouring into your website. Your Mixpanel setup, which might normally see around 1,000 customer events an hour, suddenly records tens of millions of customer events in a single hour. Your data pipeline is now tasked with parsing vast amounts of JSON data and storing it in your database. You notice that your standard JSON parsing library cannot scale with the sudden data growth, and your near real-time analytics reports fall behind. That is when you realize the importance of an efficient JSON parsing library. Along with handling large payloads, JSON parsing libraries should be able to serialize and deserialize deeply nested JSON payloads.
In this article, we explore Python parsing libraries for large payloads. We specifically look at the capabilities of ujson, orjson, and ijson. We then benchmark the standard JSON library (stdlib json), ujson, and orjson for serialization and deserialization performance. Since we use the terms serialization and deserialization throughout the article, here’s a refresher on the concepts: serialization converts your Python objects to a JSON string, whereas deserialization rebuilds Python data structures from a JSON string.
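A quick illustration of both operations using the standard library (the sample dict is only for demonstration):
import json
user = {"id": 1, "name": "Ada"} #Python dict
payload = json.dumps(user) #Serialization: Python object -> JSON string
restored = json.loads(payload) #Deserialization: JSON string -> Python object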
As we progress through the article, you will find a call flow diagram to help you choose the parser to use based on your workflow and specific parsing needs. Along with this, we also explore NDJSON and libraries to parse NDJSON payloads. Let’s start.
Stdlib JSON
Stdlib JSON supports serialization for all basic Python data types, including dicts, lists, and tuples. When json.loads() is called, it loads the complete JSON into memory at once. That is fine for smaller payloads, but for larger payloads it can cause serious performance issues such as out-of-memory errors and choked downstream workflows.
import json
with open("large_payload.json", "r") as f:
    json_data = json.loads(f.read()) #reads and parses the entire file into memory at once
ijson
For payloads that run into hundreds of MBs, it’s advisable to use ijson. ijson, short for ‘iterative JSON’, reads files one token at a time without the memory overhead. In the code below, we parse the file with ijson instead of json.
#The ijson library reads records one token at a time
import ijson
with open("json_data.json", "r") as f:
    for record in ijson.items(f, "items.item"): #fetch one dict at a time from the "items" array
        process(record) #process() stands in for your own record handler
As you can see, ijson fetches one element at a time from the JSON and loads it into a Python dict, which is then fed to the calling function, in this case process(record). The overall working of ijson is shown in the illustration below.
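If you want to see the token-by-token behaviour more explicitly, ijson also exposes a lower-level parse() interface that yields one (prefix, event, value) tuple per parse event. A minimal sketch, assuming the same json_data.json file:
import ijson
with open("json_data.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        print(prefix, event, value) #e.g. ('items.item.id', 'number', 1)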
ujson

ujson has been a widely used library in many applications involving large JSON payloads, as it was designed to be a faster alternative to stdlib JSON in Python. Parsing is fast because the underlying code of ujson is written in C, with Python bindings exposing it to the Python interface. The areas that needed improvement in the standard JSON library were optimized in ujson for speed and performance. However, ujson is no longer used in newer projects, as the maintainers themselves have noted on PyPI that the library has been placed in maintenance-only mode. Below is a high-level illustration of ujson’s processes.
import ujson
taxonomy_data = '{"id":1, "genus":"Thylacinus", "species":"cynocephalus", "extinct": true}'
data_dict = ujson.loads(taxonomy_data) #Deserialize
with open("taxonomy_data.json", "w") as fh: #Serialize
    ujson.dump(data_dict, fh)
with open("taxonomy_data.json", "r") as fh: #Deserialize
    data = ujson.load(fh)
print(data)
Next, we move to the orjson library.
orjson
Since orjson is written in Rust, it is optimized not just for speed; it also has memory-safety mechanisms that prevent the buffer overflows developers can face with C-based JSON libraries like ujson. Furthermore, orjson supports serialization of several additional datatypes beyond the standard Python datatypes, including dataclass and datetime objects. Another key difference between orjson and the other libraries is that orjson’s dumps() function returns a bytes object, whereas the others return a string. Returning the data as a bytes object is one of the main reasons for orjson’s high throughput.
import orjson
book_payload = '{"id":1,"name":"The Great Gatsby","author":"F. Scott Fitzgerald","Publishing House":"Charles Scribner\'s Sons"}'
data_dict = orjson.loads(book_payload) #Deserialize
print(data_dict)
with open("book_data.json", "wb") as f: #Serialize
    f.write(orjson.dumps(data_dict)) #Returns a bytes object
with open("book_data.json", "rb") as f: #Deserialize
    book_data = orjson.loads(f.read())
print(book_data)
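Since orjson.dumps() returns bytes, you will need to decode the result when a string is required, for example when logging or embedding the JSON in another string; a one-line sketch:
json_str = orjson.dumps(data_dict).decode("utf-8") #bytes -> str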
Now that we’ve explored some JSON parsing libraries, let’s test their serialization capabilities.
Testing Serialization Capabilities of JSON, ujson and orjson
We create a sample dataclass object with an integer, string and a datetime variable.
from dataclasses import dataclass
from datetime import datetime
@dataclass
class User:
    id: int
    name: str
    created: datetime
u = User(id=1, name="Thomas", created=datetime.now())
We then pass it to each of the libraries to see what happens. We start with stdlib JSON.
import json
try:
    print("json:", json.dumps(u))
except TypeError as e:
    print("json error:", e)
As expected, we get the following error. (The standard JSON library doesn’t support serialization of dataclass objects or datetime objects.)

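This is expected behaviour rather than a bug. If you do need the standard library to handle such objects, one common workaround is to pass a default hook to json.dumps(); below is a minimal sketch, assuming the User instance u created above:
import json
from dataclasses import asdict, is_dataclass
from datetime import datetime
def to_serializable(obj):
    #Fallback used by json.dumps for types it cannot handle natively
    if is_dataclass(obj):
        return asdict(obj) #dataclass -> dict
    if isinstance(obj, datetime):
        return obj.isoformat() #datetime -> ISO 8601 string
    raise TypeError(f"{type(obj)} is not JSON serializable")
print(json.dumps(u, default=to_serializable))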
Next, we test the same with the ujson library.
import ujson
try:
    print("ujson:", ujson.dumps(u))
except TypeError as e:
    print("ujson error:", e)

As we see above, ujson is also unable to serialize the dataclass object and the datetime datatype. Lastly, we use the orjson library for serialization.
import orjson
try:
    print("orjson:", orjson.dumps(u))
except TypeError as e:
    print("orjson error:", e)
We see that orjson is able to serialize both the dataclass and the datetime datatypes.

Working with NDJSON (A Special Mention)
We’ve seen the libraries for JSON parsing, but what about NDJSON? NDJSON (Newline Delimited JSON), as you might know, is a format in which each line is a JSON object. In other words, the delimiter is not a comma but a newline character. For instance, this is what NDJSON looks like:
{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}
NDJSON is often used for logs and streaming data, and hence NDJSON payloads are excellent candidates for being parsed with the ijson library. For small to moderate NDJSON payloads, stdlib JSON is sufficient. Apart from ijson and stdlib JSON, there is also a dedicated ndjson library. Below are code snippets showing each approach.
NDJSON using stdlib JSON & ijson
As NDJSON isn’t delimited by commas, it doesn’t qualify for a bulk load: stdlib json expects a single valid JSON document (such as a list of dicts), but is instead given several JSON documents in the payload file. Therefore, the file has to be parsed iteratively, line by line, with each record sent to the caller function for further processing.
import json
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
#Writing NDJSON file
with open("json_lib.ndjson", "w", encoding="utf-8") as fh:
    for line in ndjson_payload.splitlines(): #Split the string into individual JSON objects
        fh.write(line.strip() + "\n") #Write each JSON object on its own line
#Reading NDJSON file using json.loads
with open("json_lib.ndjson", "r", encoding="utf-8") as fh:
    for line in fh:
        if line.strip(): #Skip empty lines
            item = json.loads(line) #Deserialize
            print(item) #or send it to the caller function
With ijson, the parsing is done as shown below. With standard JSON, we have only one root element, which is either a dictionary if it is a single JSON object or an array if it is a list of dicts. But with NDJSON, each line is its own root element. Passing the empty prefix "" to ijson.items(), together with multiple_values=True, tells the parser that the file contains multiple JSON root elements and that it should yield them one line (one JSON object) at a time.
import ijson
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
#Writing the payload to a file to be processed by ijson
with open("ijson_lib.ndjson", "w", encoding="utf-8") as fh:
    fh.write(ndjson_payload)
with open("ijson_lib.ndjson", "r", encoding="utf-8") as fh:
    for item in ijson.items(fh, "", multiple_values=True):
        print(item)
Lastly, we have the dedicated ndjson library. It essentially converts the NDJSON format to standard JSON (a list of dicts).
import ndjson
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
#Writing the payload to a file to be processed by ndjson
with open("ndjson_lib.ndjson", "w", encoding="utf-8") as fh:
    fh.write(ndjson_payload)
with open("ndjson_lib.ndjson", "r", encoding="utf-8") as fh:
    ndjson_data = ndjson.load(fh) #returns a list of dicts
As you may have seen, NDJSON files can often be parsed using stdlib json and ijson. For very large payloads, ijson is the best choice as it is memory-efficient. But if you wish to generate NDJSON payloads from Python objects, the ndjson library is the most convenient choice, because ndjson.dumps() converts Python objects to the NDJSON format without you having to iterate over the data structures yourself, as shown in the sketch below.
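A minimal sketch of that path, reusing the sample records from above (the output file name is just an example):
import ndjson
records = [
    {"id": "A13434", "name": "Ella"},
    {"id": "A13455", "name": "Charmont"},
    {"id": "B32434", "name": "Areida"},
]
with open("generated.ndjson", "w", encoding="utf-8") as fh:
    fh.write(ndjson.dumps(records)) #writes one JSON object per line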
Now that we’ve explored NDJSON, let’s pivot back to benchmarking the libraries stdlib json, ujson, and orjson.
Why ijson Is Not Considered for Benchmarking
ijson, being a streaming parser, is very different from the bulk parsers we have looked at. If we benchmarked ijson alongside these bulk parsers, we would be comparing apples to oranges: we would get the misleading impression that ijson is the slowest, when in fact it serves a different purpose altogether. ijson is optimized for memory efficiency and therefore has lower throughput than bulk parsers.
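That said, if you want to quantify ijson’s memory advantage rather than its speed, a rough sketch using tracemalloc could look like the following (it assumes the large_payload.json file generated in the next section; the exact numbers will vary with your machine and payload):
import json, ijson, tracemalloc
def peak_memory_mb(fn):
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory() #(current, peak) in bytes
    tracemalloc.stop()
    return peak / (1024 * 1024)
def bulk_load():
    with open("large_payload.json", "r") as f:
        json.load(f) #entire payload in memory at once
def stream_load():
    with open("large_payload.json", "rb") as f:
        for _ in ijson.items(f, "item"): #one record at a time from the root array
            pass
print(f"json.load peak: {peak_memory_mb(bulk_load):.1f} MB")
print(f"ijson peak: {peak_memory_mb(stream_load):.1f} MB")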
Generating a Synthetic JSON Payload for Benchmarking Purposes
We generate a large synthetic JSON payload with 1 million records, using the mimesis library. This data will be used to benchmark the libraries. The code below can be used to create the payload in case you wish to replicate this benchmark. The generated file will be between 100 MB and 150 MB in size, which I consider large enough for these tests.
from mimesis import Person, Address
import json
person_name = Person("en")
complete_address = Address("en")
#streaming to a file
with open("large_payload.json", "w") as fh:
    fh.write("[") #Start of the JSON array
    for i in range(1_000_000):
        payload = {
            "id": person_name.identifier(),
            "name": person_name.full_name(),
            "email": person_name.email(),
            "address": {
                "street": complete_address.street_name(),
                "city": complete_address.city(),
                "postal_code": complete_address.postal_code()
            }
        }
        json.dump(payload, fh)
        if i < 999_999: #No trailing comma after the last entry
            fh.write(",")
    fh.write("]") #End of the JSON array
Below is a sample of what the generated data looks like. As you can see, the address fields are nested to make sure the JSON is not just large in size but also represents real-world hierarchical JSON.
[
    {
        "id": "8177",
        "name": "Willia Hays",
        "email": "[email protected]",
        "address": {
            "street": "Emerald Cove",
            "city": "Crown Point",
            "postal_code": "58293"
        }
    },
    {
        "id": "5931",
        "name": "Quinn Greer",
        "email": "[email protected]",
        "address": {
            "street": "Ohlone",
            "city": "Bridgeport",
            "postal_code": "92982"
        }
    }
]
Let’s start with benchmarking.
Benchmarking Pre-requisites
We use the read() function to store the JSON file as a string. We then use the loads() function of each of the libraries (json, ujson, and orjson) to deserialize the JSON string into a Python object. First, we create the payload_str object from the raw JSON text.
with open("large_payload.json", "r") as fh:
    payload_str = fh.read() #raw JSON text
We then create a benchmarking function with two arguments. The first argument is the function being tested, in this case the loads() function. The second argument is the payload_str constructed from the file above.
import time
def benchmark_load(func, payload_str):
    start = time.perf_counter()
    for _ in range(3): #Run three times and report the total time
        func(payload_str)
    end = time.perf_counter()
    return end - start
We use the above function to test both deserialization and serialization speeds.
Benchmarking Deserialization Speed
We import the three libraries being tested. We then run benchmark_load() against the loads() function of each of these libraries.
import json, ujson, orjson, time
results = {
    "json.loads": benchmark_load(json.loads, payload_str),
    "ujson.loads": benchmark_load(ujson.loads, payload_str),
    "orjson.loads": benchmark_load(orjson.loads, payload_str),
}
for lib, t in results.items():
    print(f"{lib}: {t:.4f} seconds")
As we can see, orjson takes the least amount of time for deserialization.

Benchmarking Serialization Speed
Next, we test the serialization speed of these libraries. Since serialization starts from Python objects, we first deserialize the payload string into a Python object and then time the dumps() functions.
import json
import ujson
import orjson
import time
payload_obj = json.loads(payload_str) #Python object to serialize in each test
results = {
    "json.dumps": benchmark_load(json.dumps, payload_obj),
    "ujson.dumps": benchmark_load(ujson.dumps, payload_obj),
    "orjson.dumps": benchmark_load(orjson.dumps, payload_obj),
}
for lib, t in results.items():
    print(f"{lib}: {t:.4f} seconds")
Comparing the run times, we see that orjson takes the least amount of time to serialize Python objects to JSON.

Selecting the Best JSON Library for Your Workflow

Clipboard & Workflow Hacks for JSON
Let’s suppose that you’d like to view your JSON in a text editor such as Notepad++ or share a snippet (from a large payload) on Slack with a teammate. You’ll quickly run into clipboard or text editor/IDE crashes. In such situations, you could use Pyperclip or Tkinter. Pyperclip works well for payloads up to around 50 MB, whereas Tkinter works well for medium-sized payloads. For very large payloads, you would write the JSON to a file to view the data.
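As a small illustration of the clipboard route, below is a minimal sketch that assumes pyperclip is installed and copies only a slice of the payload rather than all of it:
import json
import pyperclip
with open("large_payload.json", "r") as fh:
    records = json.load(fh)
snippet = json.dumps(records[:5], indent=2) #share only a small, readable slice
pyperclip.copy(snippet) #paste into Slack or a text editor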
Conclusion
JSON can seem effortless, but the larger the payload and the deeper the nesting, the more quickly these payloads can turn into a performance bottleneck. This article aimed to highlight how each Python parsing library addresses this challenge. When choosing a JSON parsing library, speed and throughput are not always the main criteria. It is the workflow that determines whether throughput, memory efficiency, or long-term scalability matters most for parsing payloads. In short, JSON parsing is not a one-size-fits-all affair.
