Why do we still wrestle with documents in 2025?
In any data-driven organisation, you'll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats into something their Python pipelines will accept. Even the newest generative-AI stacks can choke when the underlying text is wrapped inside graphics or sprinkled across irregular table grids.
Docling was born to solve exactly that pain. Released as an open-source project by IBM Research Zurich and now hosted under the Linux Foundation AI & Data Foundation, the library abstracts parsing, layout understanding, OCR, table reconstruction, multimodal export, and even audio transcription behind one reasonably straightforward API and CLI command.
Although Docling supports HTML, MS Office files, image formats and others, we'll mostly be looking at using it to process PDF files.
As a data scientist or ML engineer, why should I care about Docling?
Often, the real bottleneck isn't building the model; it's feeding it. We spend a large share of our time on data wrangling, and nothing kills productivity faster than being handed a critical dataset locked inside a 100-page PDF. That is precisely the problem Docling solves, acting as a bridge from the world of unstructured documents to the structured sanity of Markdown, JSON, or a Pandas DataFrame.
But its power extends beyond data extraction, directly into the world of contemporary, AI-assisted development. Imagine pointing Docling at an HTML page of API specifications; it translates that complex web layout into clean, structured Markdown, the ideal context to feed directly into AI coding assistants like Cursor, ChatGPT, or Claude.
Where Docling came from
The project originated inside IBM's Deep Search team, which was developing retrieval-augmented generation (RAG) pipelines for long patent PDFs. They open-sourced the core under an MIT license in late 2024 and have been shipping weekly releases ever since. A vibrant community quickly formed around its unified DoclingDocument model, a Pydantic object that keeps text, images, tables, formulas, and layout metadata together so downstream tools like LangChain, LlamaIndex, or Haystack don't have to guess a page's reading order.
Today, Docling integrates visual-language models (VLMs), such as SmolDocling, for figure captioning. It also supports Tesseract, EasyOCR, and RapidOCR for text extraction and ships recipes for chunking, serialisation, and vector-store ingestion. In other words: you point it at a folder, and you get Markdown, HTML, CSV, PNGs, JSON, or just a ready-to-embed Python object, with no extra scaffolding code required.
What we'll do
To showcase Docling, we'll first install it and then use it in three different examples that display its versatility and usefulness as a document parser and processor. Please note that Docling is quite computationally intensive, so it helps if you have access to a GPU on your system.
However, before we start coding, we need to set up a development environment.
Setting up a development environment
I've started using the uv package manager for this, but feel free to use whichever tools you're most comfortable with. Note also that I'll be working under WSL2 Ubuntu for Windows and running my code in a Jupyter Notebook.
Note that, even using uv, the installation below took a few minutes to complete on my system, as it's a fairly hefty set of library installs.
$ uv init docling
Initialized project `docling` at `/home/tom/docling`
$ cd docling
$ uv venv
Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(docling) $ uv pip install docling pandas jupyter
Now type in the command,
(docling) $ jupyter notebook
And you should see a notebook open in your browser. If that doesn't happen automatically, you'll likely see a screenful of information after running the jupyter notebook command. Near the bottom, you'll see a URL to copy and paste into your browser to launch the Jupyter Notebook.
Your URL will be different from mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d
Example 1: Convert any PDF or DOCX to Markdown or JSON
The simplest use case is also the one you'll use most of the time: turning a document's text into Markdown.
For most of our examples, our input PDF will be one I've used several times before for various tests. It's a copy of Tesla's 10-Q SEC filing from September 2023. It's roughly fifty pages long and consists mainly of financial information related to Tesla. The complete document is publicly available on the Securities & Exchange Commission (SEC) website and can be viewed/downloaded using this link.
Here is an image of the first page of that document for your reference.
Let's review the Docling code we need to convert it into Markdown. It sets up the file path for the input PDF, runs DocumentConverter on it, and then exports the parsed result to Markdown so that the content can be more easily read, edited, or analysed.
from docling.document_converter import DocumentConverter
import time
from pathlib import Path
inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"
data_folder = Path(inpath)
doc_path = data_folder / infile
converter = DocumentConverter()
result = converter.convert(doc_path)  # returns a ConversionResult
# Export the parsed document to Markdown and display it
markdown_text = result.document.export_to_markdown()
print(markdown_text)
This is the output we get from running the above code (just the first page).
## UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549 FORM 10-Q
(Mark One)
- x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the quarterly period ended September 30, 2023
OR
- o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to _________
Commission File Number: 001-34756
## Tesla, Inc.
(Exact name of registrant as specified in its charter)
Delaware
(State or other jurisdiction of incorporation or organization)
1 Tesla Road Austin, Texas
(Address of principal executive offices)
## (512) 516-8177
(Registrant's telephone number, including area code)
## Securities registered pursuant to Section 12(b) of the Act:
| Title of each class | Trading Symbol(s) | Name of each exchange on which registered |
|-----------------------|---------------------|---------------------------------------------|
| Common stock | TSLA | The Nasdaq Global Select Market |
Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o
Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:
Large accelerated filer
x
Accelerated filer
Non-accelerated filer
o
Smaller reporting company
Emerging growth company
o
If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o
Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x
As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.
With the rise of AI code editors and the use of LLMs in general, this technique has become significantly more valuable and relevant. The efficacy of LLMs and code editors can be greatly enhanced by providing them with appropriate context. Often this will entail supplying them with the textual representation of a particular tool or framework's documentation, API and coding examples.
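When the exported Markdown is long, it can help to hand an assistant only the section it needs. The sketch below is one simple way to do that with the standard library alone, assuming headings are exported as `## ` lines, as they are in the output above. The function name `split_markdown_sections` is made up for this illustration and is not part of Docling, which ships its own chunking recipes.

```python
def split_markdown_sections(markdown_text):
    """Split Markdown into (heading, body) pairs at every '## ' heading."""
    sections = []
    heading, body_lines = "", []
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # Close the section collected so far and start a new one
            sections.append((heading, "\n".join(body_lines).strip()))
            heading, body_lines = line[3:].strip(), []
        else:
            body_lines.append(line)
    sections.append((heading, "\n".join(body_lines).strip()))
    # Drop empty entries, e.g. the preamble when the text starts with a heading
    return [(h, b) for h, b in sections if h or b]


# Usage with a small stand-in string; in the article's code you would
# pass markdown_text instead.
sample = "## Tesla, Inc.\nDelaware\n\n## Securities registered\nCommon stock\n"
for heading, body in split_markdown_sections(sample):
    print(f"{heading!r}: {len(body)} characters")
```

Text that appears before the first heading is kept under an empty heading, so nothing is silently dropped.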
Converting the output of PDFs to JSON format is also straightforward. Just add these two lines of code. The JSON output can be very large, so adjust the slice in the print statement to control how much of it is displayed.
json_blob = result.document.model_dump_json(indent=2)
print(json_blob[:10000], "…")  # show only the first 10,000 characters
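Before printing a large JSON string, it can be useful to check how big it is and what sits at the top level. Here is a minimal sketch using only the standard library. The helper name `summarise_json` and the stand-in blob are assumptions for illustration; the real keys depend on what Docling emits for your document.

```python
import json


def summarise_json(json_blob, preview_chars=200):
    """Return the size, top-level keys and a short preview of a JSON string."""
    data = json.loads(json_blob)
    return {
        "size_chars": len(json_blob),
        "top_level_keys": sorted(data) if isinstance(data, dict) else [],
        "preview": json_blob[:preview_chars],
    }


# Stand-in blob for illustration only; in the article's code you would
# pass json_blob instead.
demo_blob = json.dumps({"name": "demo", "texts": [], "tables": []}, indent=2)
summary = summarise_json(demo_blob, preview_chars=40)
print(summary["size_chars"], summary["top_level_keys"])
```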
Example 2: Extract complex tables from a PDF
Many PDFs store tables as isolated text chunks or, worse, as flattened images. Docling's table-structure model reassembles rows, columns, and spanning cells, giving you either a Pandas DataFrame or a ready-to-save CSV. Our test input PDF has many tables. Look, for instance, at page 11 of the PDF, where we can see the table below.

Let's see if we can extract that data. It's slightly more complex code than in our first example, but it's doing more work. The PDF is converted again using Docling's DocumentConverter, producing a structured document representation. Then, for each table detected, the code reads the table's page number from the document's provenance metadata. If the table comes from page 11, it converts the table into a Pandas DataFrame, prints it in Markdown format, and then breaks the loop (so only the first matching table is shown).
import pandas as pd
from docling.document_converter import DocumentConverter
from time import time
from pathlib import Path
inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"
data_folder = Path(inpath)
input_doc_path = data_folder / infile
doc_converter = DocumentConverter()
start_time = time()
conv_res = doc_converter.convert(input_doc_path)
# Export table from page 11
for table_ix, table in enumerate(conv_res.document.tables):
    page_number = table.prov[0].page_no if table.prov else "Unknown"
    if page_number == 11:
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix} (Page {page_number})")
        print(table_df.to_markdown())
        break
end_time = time() - start_time
print(f"Document converted and tables exported in {end_time:.2f} seconds.")
And the output is not too shabby.
## Table 10 (Page 11)
| | | Three Months Ended September 30,.2023 | Three Months Ended September 30,.2022 | Nine Months Ended September 30,.2023 | Nine Months Ended September 30,.2022 |
|---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| 0 | Automotive sales | $ 18,582 | $ 17,785 | $ 57,879 | $ 46,969 |
| 1 | Automotive regulatory credits | 554 | 286 | 1,357 | 1,309 |
| 2 | Energy generation and storage sales | 1,416 | 966 | 4,188 | 2,186 |
| 3 | Services and other | 2,166 | 1,645 | 6,153 | 4,390 |
| 4 | Total revenues from sales and services | 22,718 | 20,682 | 69,577 | 54,854 |
| 5 | Automotive leasing | 489 | 621 | 1,620 | 1,877 |
| 6 | Energy generation and storage leasing | 143 | 151 | 409 | 413 |
| 7 | Total revenues | $ 23,350 | $ 21,454 | $ 71,606 | $ 57,144 |
Document converted and tables exported in 33.43 seconds.
To retrieve all the tables from a PDF, you would need to omit the if page_number =… check (and the break that follows it) from my code.
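As a sketch of what exporting every table might look like, the helper below writes each one to its own CSV file. It relies only on the attributes already used in the code above (`document.tables`, `table.prov[0].page_no` and `export_to_dataframe()`) plus Pandas' `to_csv`. The function name and the file-naming scheme are my own choices for illustration.

```python
from pathlib import Path


def export_tables_to_csv(document, out_dir):
    """Write every table in a converted document to its own CSV file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for table_ix, table in enumerate(document.tables):
        # Same provenance lookup as in the page-11 example
        page_number = table.prov[0].page_no if table.prov else "unknown"
        csv_file = out_path / f"table_{table_ix:02d}_page_{page_number}.csv"
        table.export_to_dataframe().to_csv(csv_file, index=False)
        written.append(csv_file)
    return written
```

With the conversion result from the example above, a call would look like `export_tables_to_csv(conv_res.document, "tables_out")`, where `tables_out` is a placeholder folder name.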
One thing I have noticed with Docling is that it's not fast. As shown above, it took almost 34 seconds to extract that single table from a 50-page PDF.
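The cells in the extracted table arrive as text (for example `$ 18,582`), so any numeric analysis needs a cleaning step first. Below is a minimal sketch of one, assuming US-style formatting with a dollar sign, comma thousands separators, and the accounting convention of parentheses for negative values. The helper name `parse_amount` is hypothetical.

```python
from typing import Optional


def parse_amount(cell: str) -> Optional[float]:
    """Convert a cell such as '$ 18,582' or '(1,234)' to a float, else None."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    # Financial statements often show negative numbers in parentheses
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()
    if not cleaned:
        return None
    try:
        value = float(cleaned)
    except ValueError:
        return None  # not a number, e.g. a row label
    return -value if negative else value


print(parse_amount("$ 18,582"), parse_amount("Automotive sales"))
```

Applied to a DataFrame column, this could be used as `table_df[column].map(parse_amount)`, leaving label cells as missing values.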
Example 3: Perform OCR on an image.
For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let's see how Docling copes with reading that image and converting what it finds into Markdown. Here is my scanned image.

And our code. We use Tesseract as our OCR engine (others are available).
from pathlib import Path
import time
import pandas as pd
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.tesseract_ocr_cli_model import TesseractCliOcrOptions
def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"
    input_doc_path = Path(inpath) / infile
    # Configure OCR for image input
    image_options = ImageFormatOption(
        ocr_options=TesseractCliOcrOptions(force_full_page_ocr=True),
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )
    converter = DocumentConverter(
        format_options={"image": image_options}
    )
    start_time = time.time()
    conv_res = converter.convert(input_doc_path).document
    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))
    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())
    elapsed = time.time() - start_time
    print(f"\nProcessing accomplished in {elapsed:.2f} seconds")
if __name__ == "__main__":
    main()
Here is our output.
--- Table 1 (Page 1) ---
| | Three Months Ended September J0,. | Three Months Ended September J0,.2022 | Nine Months Ended September J0,.2023 | Nine Months Ended September J0,.2022 |
|:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| Cost ol revenves | 181 | 150 | 554 | 424 |
| Research an0 developrent | 189 | 124 | 491 | 389 |
| | 95 | | 2B3 | 328 |
| Total | 465 | 362 | 1,328 | 1,141 |
--- Full Document (Markdown) ---
## Note 8 Equity Incentive Plans
## Other Pertormance-Based Grants
("RSUs") und stock optlons unrecognized stock-based compensatian
## Summary Stock-Based Compensation Information
| | Three Months Ended September J0, | Three Months Ended September J0, | Nine Months Ended September J0, | Nine Months Ended September J0, |
|--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
| | | 2022 | 2023 | 2022 |
| Cost ol revenves | 181 | 150 | 554 | 424 |
| Research an0 developrent | 189 | 124 | 491 | 389 |
| | 95 | | 2B3 | 328 |
| Total | 465 | 362 | 1,328 | 1,141 |
## Note 9 Commitments and Contingencies
## Operating Lease Arrangements In Buffalo, New York and Shanghai, China
## Legal Proceedings
Between september 1 which 2021 pald has
Processing accomplished in 7.64 seconds
If you compare this output to the original image, the results are disappointing. A lot of the text in the image was simply missed or garbled. This is where a product like AWS Textract comes into its own, as it excels at extracting text from a wide range of sources.
However, Docling does provide various options for OCR, so if you receive poor results from one system, you can always switch to another.
I tried the same task using EasyOCR, but the results weren't significantly different from those obtained with Tesseract. If you'd like to try it out, here is the code.
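Docling's documentation configures the OCR engine through a pipeline options object that is passed to the format option. The sketch below shows switching engines that way. It assumes a recent Docling release in which `PdfPipelineOptions`, `TesseractCliOcrOptions`, `EasyOcrOptions` and `RapidOcrOptions` live in `docling.datamodel.pipeline_options`, so check the import paths against your installed version. The short engine names and the function `build_image_converter` are labels invented for this example.

```python
# Map a short engine name to the Docling options class that configures it.
# The short names are labels chosen for this sketch, not Docling identifiers.
OCR_ENGINES = {
    "tesseract": "TesseractCliOcrOptions",
    "easyocr": "EasyOcrOptions",
    "rapidocr": "RapidOcrOptions",
}


def build_image_converter(engine="tesseract"):
    """Return a DocumentConverter that runs full-page OCR on image inputs."""
    if engine not in OCR_ENGINES:
        raise ValueError(
            f"Unknown OCR engine {engine!r}; choose from {sorted(OCR_ENGINES)}"
        )
    # Imported lazily so the check above works even without Docling installed
    from docling.datamodel import pipeline_options as po
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, ImageFormatOption

    options = po.PdfPipelineOptions()
    options.do_ocr = True
    options.do_table_structure = True
    options.table_structure_options.do_cell_matching = True
    # Look up the chosen engine's options class and force full-page OCR
    options.ocr_options = getattr(po, OCR_ENGINES[engine])(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=options)
        }
    )
```

Trying a different engine on the same scan then becomes a one-word change, for example `build_image_converter("rapidocr").convert(input_doc_path)`.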
from pathlib import Path
import time
import pandas as pd
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.easyocr_model import EasyOcrOptions # Import EasyOCR options
def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"
    input_doc_path = Path(inpath) / infile
    # Configure image pipeline with EasyOCR
    image_options = ImageFormatOption(
        ocr_options=EasyOcrOptions(force_full_page_ocr=True),  # use EasyOCR
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )
    converter = DocumentConverter(
        format_options={"image": image_options}
    )
    start_time = time.time()
    conv_res = converter.convert(input_doc_path).document
    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))
    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())
    elapsed = time.time() - start_time
    print(f"\nProcessing accomplished in {elapsed:.2f} seconds")
if __name__ == "__main__":
    main()
Summary
The generative-AI boom re-ignited an old truth: garbage in, garbage out. LLMs hallucinate less only when they ingest semantically and spatially coherent input. Docling provides that coherence (most of the time) across the many source formats your stakeholders may present, and does so locally and reproducibly.
Docling has its uses beyond the AI world, though. Consider the vast number of documents stored in locations such as bank vaults, solicitors' offices, and insurance firms worldwide. If these are to be digitised, Docling may provide some of the solutions for that.
Its biggest weakness is probably the Optical Character Recognition of text inside images. I tried both Tesseract and EasyOCR, and the results from both were disappointing. You'll probably need to use a commercial product like AWS Textract if you want to reliably reproduce text from those types of sources.
It can also be slow. I have a reasonably high-spec desktop PC with a GPU, and it took some time on most tasks I set it. Nevertheless, if your input documents are primarily PDFs, Docling could be a valuable addition to your text processing toolbox.
I have only scratched the surface of what Docling is capable of, and I encourage you to visit their homepage, which can be accessed using the following link, to learn more.