Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller groups that are efficient for all-pairs comparison at quadratic, n² complexity), matching and merging of duplicate nodes and edges. Previously, entity resolution systems relied on statistical tricks like string distance, static rules or complex ETL to schema align, block, match and merge records. Semantic entity resolution uses representation learning to achieve a semantic understanding of records within the domain of a business, automating the entire process as part of a knowledge graph factory.
TLDR
The same technology that transformed textbooks, customer support and programming is coming for entity resolution. Skeptical? Try the interactive demos below… they show potential 🙂
Don’t Just Say It: Prove It
I don’t want to tell you, I want to show you with interactive demos in each post. Try them, edit the data, see what they can do. Play with it. I hope these simple examples prove the potential of a semantic approach to entity resolution.
- This post has two demos. In the first demo we extract companies from news plus Wikipedia for enrichment. In the second demo we deduplicate those companies in a single prompt using BAML and Gemini 2.5 Pro.
- In a second post I’ll demonstrate semantic blocking, a term I define as meaning “using deep embeddings and semantic clustering to construct smaller groups of records for pairwise comparison.”
- In a third post I’ll show how semantic blocking and matching combine to improve text-to-Cypher over an actual knowledge graph in KuzuDB.
Agent-Based Knowledge Graph Explosion!
Why does semantic entity resolution matter at all? It’s about agents!
Autonomous agents are hungry for knowledge, and modern models like Gemini 2.5 Pro make extracting knowledge graphs from text easy. LLMs are so good at extracting structured information from text that there will be more knowledge graphs built from unstructured data in the next eighteen months than have ever existed before. The source of most web traffic is already hungry LLMs consuming text to produce structured information. Autonomous agents are increasingly powered by text-to-query interfaces over graph databases via tools like Text2Cypher.
The semantic web turned out to be highly individualistic: every organization of any size is about to have its own knowledge graph of its problem domain as a core asset to power the agents that automate its business.
Subplot: Powerful Agents Need Entity Resolved KGs
Companies building agents are about to run straight into entity resolution for knowledge graphs as a complex, often cost-prohibitive problem stopping them from harnessing their organizational knowledge. Extracting knowledge graphs from text with LLMs produces lots of duplicate nodes and edges. Garbage in: garbage out. When concepts are split across multiple entities, wrong answers emerge. This limits raw, extracted graphs’ ability to power agents. Entity resolved knowledge graphs are required for agents to do their jobs.
Entity Resolution for Knowledge Graphs
There are several steps to entity resolution for knowledge graphs to go from raw data to retrievable knowledge. Let’s define them to understand how semantic entity resolution improves the process.
Node Deduplication
- A low-cost blocking function groups similar nodes into smaller blocks (groups) for pairwise comparison, since comparison scales at n² complexity.
- A matching function makes a match decision for each pair of nodes within each block, often with a confidence score and an explanation.
- New SAME_AS edges are created between each matched pair of nodes.
- This forms clusters of connected nodes called connected components. One component corresponds to one resolved record.
- Nodes in components are merged: fields may become lists, which are then deduplicated. Merging nodes can also be performed with LLMs.
The diagram below illustrates this process:
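The node deduplication steps can be sketched in code. This is a minimal, illustrative sketch rather than a production pipeline: a union-find groups matched pairs into connected components, and a merge function turns conflicting field values into deduplicated lists.

```python
from collections import defaultdict

def connected_components(node_ids, same_as_pairs):
    """Union-find: group node ids connected by SAME_AS match pairs."""
    parent = {n: n for n in node_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)  # union the two components

    components = defaultdict(list)
    for n in node_ids:
        components[find(n)].append(n)
    return list(components.values())

def merge_nodes(records):
    """Merge one component's records: conflicting values become deduplicated lists."""
    merged = defaultdict(list)
    for record in records:
        for key, value in record.items():
            if value is None:
                continue
            if value not in merged[key]:
                merged[key].append(value)
    # collapse single-value lists back to scalars
    return {k: (v[0] if len(v) == 1 else v) for k, v in merged.items()}
```

One resolved record per component: `merge_nodes` keeps every distinct non-null value, so a field like `name` becomes a list only when sources disagree.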
Edge Deduplication
Merged nodes combine the edges of their source nodes, which includes duplicate edges of the same type to combine. Blocking for edges is simpler, but merging can be complex depending on edge properties.
- Edges are GROUPED BY their source node id, destination node id and edge type to create edge blocks.
- An edge matching function makes a match decision for each pair of edges within an edge block.
- Edges are then merged using rules for how to combine properties like weights.
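The edge steps above can be sketched the same way, with the caveat that this is illustrative and the field names (`src`, `dst`, `type`, `weight`) are assumptions:

```python
from collections import defaultdict

def block_edges(edges):
    """Group edges by (source id, destination id, edge type): one block per key."""
    blocks = defaultdict(list)
    for edge in edges:
        blocks[(edge["src"], edge["dst"], edge["type"])].append(edge)
    return blocks

def merge_edge_block(block, weight_rule=sum):
    """Collapse one block of duplicate edges; combine weights with a rule like sum or max."""
    merged = dict(block[0])  # keep src, dst, type from the first edge
    merged["weight"] = weight_rule(edge.get("weight", 1.0) for edge in block)
    return merged
```

Passing `weight_rule=max` instead of `sum` shows why merging rules matter: the right combination depends on what the edge property means in your domain.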
The resulting knowledge graph now accurately represents knowledge in the problem domain. Text2Cypher over this knowledge base becomes a powerful way to drive autonomous agents… but not before entity resolution occurs.
Where Existing Tools Come up Short
Entity resolution for knowledge graphs is a difficult problem, so existing ER tools for knowledge graphs are complex. Most entity linking libraries from academia aren’t effective in real-world scenarios. Commercial entity resolution products are stuck in a SQL-centric world, often limited to people and company records, and can be prohibitively expensive, especially for large knowledge graphs. Both sets of tools block and match but don’t merge nodes and edges for you, which requires lots of manual effort through complex ETL. There is an acute need for the simpler, automated workflow that semantic entity resolution represents.
Semantic Entity Resolution for Graphs
Modern semantic entity resolution schema aligns, blocks, matches and merges records using pre-trained language models: deep embeddings, semantic clustering and generative AI. It can group, match and merge records in an automated process, using the same transformers that are replacing so many legacy systems, because they comprehend the actual meaning of data within the context of a business or problem domain.
Semantic ER isn’t new: Ditto applied pre-trained transformers to entity matching (Li et al, 2020), beating previous benchmarks by as much as 29%. We used Ditto and BERT to do entity resolution for billions of nodes at Deep Discovery in 2021. Both Google and Amazon have semantic ER offerings… what’s new is its simplicity, making it more accessible to developers. Semantic blocking still uses sentence transformers, now with today’s powerful embeddings. Matching has transitioned from custom transformer models to large language models. Merging with language models emerged just this year. It continues to evolve.
Semantic Blocking: Clustering Embedded Records
Semantic blocking uses the same sentence transformer models powering today’s Retrieval Augmented Generation (RAG) systems to convert records into dense vector representations for semantic retrieval using vector similarity measures like cosine similarity. Semantic blocking applies semantic clustering to the fixed-length vector representations provided by sentence encoder models (i.e. SBERT) to group records likely to match based on their semantic similarity in the terms of the data’s problem domain.

Semantic clustering is an effective approach to blocking that leads to smaller blocks with more positive matches because, unlike traditional syntactic blocking methods that employ string similarity measures to form blocking keys to group records, semantic clustering leverages the rich contextual understanding of modern language models to capture deeper relationships between the fields of records, even when their strings differ dramatically.
You can see semantic clusters emerge in the vector similarity matrix of semantic representations below: they’re the blocks along the diagonal… and they can be beautiful 🙂
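In code, the idea reduces to clustering normalized embedding vectors. A minimal greedy sketch with NumPy, using toy 2-d vectors where real sentence-transformer embeddings would go (the function name and threshold are illustrative):

```python
import numpy as np

def semantic_blocks(embeddings, threshold=0.8):
    """Greedy cosine-similarity clustering: each record joins the first block
    whose seed vector it matches above `threshold`, else it starts a new block."""
    # normalize rows so a dot product equals cosine similarity
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    blocks, seeds = [], []
    for i, vector in enumerate(X):
        for block, seed in zip(blocks, seeds):
            if float(vector @ seed) >= threshold:
                block.append(i)
                break
        else:
            blocks.append([i])  # no block was similar enough: start a new one
            seeds.append(vector)
    return blocks
```

Real pipelines would swap the greedy loop for HDBSCAN or K-means over the same vectors; the point is that blocks come from meaning, not string prefixes.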

While off-the-shelf, pre-trained embeddings can work well, semantic blocking can be greatly enhanced by fine-tuning sentence transformers for entity resolution. I’ve been working on exactly that using contrastive learning for people and company names in a project called Eridu (huggingface). It’s a work in progress, but my prototype address matching model works surprisingly well using synthetic data from GPT-4o. You can fine-tune embeddings to both cluster and match.
I’ll demonstrate the specifics of semantic blocking in my second post. Stay tuned!
Align, Match and Merge Records with LLMs
Prompting Large Language Models to both match and merge records is a new and powerful technique. The latest generation of Large Language Models is surprisingly powerful for matching JSON records, which shouldn’t be surprising given how well they can perform information extraction. My initial experiment used BAML to match and merge company records in a single step and worked surprisingly well. Given the rapid pace of improvement in LLMs, it isn’t hard to see that this is the future of entity resolution.
Can an LLM be trusted to perform entity resolution? This must be judged on merit, not preconception. It’s strange to think that LLMs can be trusted to construct knowledge graphs whole-cloth, but can’t be trusted to deduplicate their entities! Chain-of-Thought can be employed to produce an explanation for each match. I discuss workloads below, but as the number of knowledge graphs expands to cover every business and its agents, there will be a strong demand for simple ER solutions extending the KG construction pipeline using the same tools that make it up: BAML, DSPy and LLMs.
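As a sketch of the idea (not BAML’s actual generated prompt), a single match-and-merge request can be assembled from a block of records plus the schema’s field descriptions; the instruction wording here is illustrative:

```python
import json

def build_match_merge_prompt(records, field_descriptions):
    """Assemble one prompt asking an LLM to match and merge a block of records,
    passing the schema's field descriptions along as merge guidance."""
    lines = [
        "You are an entity resolution engine.",
        "Match the JSON records below that refer to the same company,",
        "then merge each matched group into a single record.",
        "Explain your reasoning for each match (chain of thought).",
        "Field guidance:",
    ]
    for name, description in field_descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("Records:")
    lines.append(json.dumps(records, indent=2))
    return "\n".join(lines)
```

Tools like BAML generate an equivalent prompt from the schema itself, which is why field-level `@description` annotations can steer merging without hand-editing the prompt.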
Low-Code Proof-of-Concept
There are two interactive Prompt Fiddle demos below. The entities extracted from the first demo are used as records to be entity resolved in the second.
Extracting Companies from News and Wikipedia
The first demo shows how to perform information extraction from news and Wikipedia using BAML and Gemini 2.5 Pro. BAML models are based on Jinja2 templates and define what semi-structured data is extracted from a given prompt. They can be exported as Pydantic models via the baml-cli generate
command. The following demo extracts companies from the Wikipedia article on Nvidia.
Click for live demo: Interactive demo of information extraction of companies using BAML + Gemini – Prompt Fiddle
I’ve been doing the above for the past three months for my investment club and… I’ve hardly found a single mistake. Any time I’ve thought a company was erroneous, it was actually a good idea to include it: Meta when Llama models were mentioned. By comparison, state-of-the-art, traditional information extraction tools… don’t work very well. Gemini is far ahead of other models when it comes to information extraction… provided you use the right tool.
BAML and DSPy feel like disruptive technologies. They provide enough accuracy that LLMs become practical for many tasks. They’re to LLMs what Ruby on Rails was to web development: they make using LLMs joyous. So much fun! An introduction to BAML is here and you can also check out Ben Lorica’s show about BAML.
A truncated version of the company model appears below. It has 10 fields, most of which won’t be extracted from any one article… so I threw in Wikipedia, which gets most of them. The question marks after properties like exchange string?
mean optional, which is important because BAML won’t extract an entity missing a required field. @description
gives guidance to the LLM in interpreting the field for both extraction and merging.
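Since BAML models can be exported as Pydantic models, the truncated schema roughly corresponds to a Python model like this sketch (stdlib dataclasses stand in for Pydantic here; only fields visible in the demo output are shown):

```python
from dataclasses import dataclass, field, fields
from typing import Optional

@dataclass
class Ticker:
    symbol: str
    exchange: Optional[str] = None  # BAML `exchange string?` -> optional field

@dataclass
class Company:
    # the @description guidance travels with the field as metadata
    name: str = field(
        metadata={"description": "Formal name of the company with corporate suffix"}
    )
    ticker: Optional[Ticker] = None
    headquarters_location: Optional[str] = None
    revenue_usd: Optional[int] = None
    founded_year: Optional[int] = None
    ceo: Optional[str] = None
    # ...remaining optional fields (description, website_url, employees, linkedin_url)
```

Only `name` is required, so a record missing everything else still extracts, which is exactly why the optional markers matter.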

Semantic ER Accelerates Enrichment
Once entity resolution is automated, it becomes trivial to flesh out any public-facing entity using the wikipedia PyPi package (or a commercial API like Diffbot or Google Knowledge Graph), so in the examples I included Wikipedia articles for some companies, along with a pair of articles about NVIDIA and AMD. Enriching public-facing entities from Wikipedia was on the TODO list when building a knowledge graph but… so often to date, it didn’t get done due to the overhead of schema alignment, entity resolution and merging records. For this post, I added it in minutes. This convinced me there will be lots of downstream impact from the rapidity of semantic ER.
Semantic Multi-Match-Merge with BAML, Gemini 2.5 Pro
The second demo below performs entity matching on the Company
entities extracted during the first demo, along with several more company Wikipedia articles. It merges all 39 records at once without a single mistake! Talk about potential!? It is not a fast prompt… but you don’t really need Gemini 2.5 Pro to do it; faster models will work, and LLMs can merge many more records than this at once in a 1M token window… and rising fast 🙂
Click for live demo: LLM MultiMatch + MultiMerge – Prompt Fiddle
Merging Guided by Field Descriptions
If you look, you’ll find that the merge of companies above automatically chooses the full company name when multiple forms are present, owing to the Company.name
field description Formal name of the company with corporate suffix
. I didn’t have to give that instruction in the prompt! It is possible to use record metadata to guide schema alignment, matching and merging without directly editing a prompt. Along with merging multiple records in an LLM, I believe this is original work… I stumbled into it 🙂
The field annotation in the BAML schema:
class Company {
name string
@description("Formal name of the company with corporate suffix")
...
}
The original two records, one extracted from news, the other from Wikipedia:
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it's led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
"employees": null,
"founded_year": 1993,
"ceo": "Jensen Huang",
"linkedin_url": "null"
}
{
"name": "Nvidia",
"ticker": null,
"description": "A company specializing in GPUs and full-stack AI computing platforms, including the GB200 and Blackwell series, and platforms like DGX Cloud.",
"website_url": "null",
"headquarters_location": "null",
"revenue_usd": null,
"employees": null,
"founded_year": null,
"ceo": "null",
"linkedin_url": "null"
}
The matched and merged record appears below. Note the longer Nvidia Corporation
was chosen without specific guidance, based on the field description. Also, the description is a summary of both the Nvidia mention in the article and the Wikipedia entry. And no, the schemas don’t have to be the same 🙂
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it's led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
"employees": null,
"founded_year": 1993,
"ceo": "Jensen Huang",
"linkedin_url": "null"
}
Below is the prompt, all pretty and branded for a slide:

Now to be clear: there’s a lot more than matching in a production entity resolution system… you need to assign unique identifiers to new records and include the merged IDs as a field to keep track of which records were merged… at a minimum. I do this in my investment club’s pipeline. My goal is to show you the potential of semantic matching and merging using large language models… if you’d like to take it further, I can help. We do this at Graphlet AI 🙂
Schema Alignment? Coming Up!
Another tough problem in entity resolution is schema alignment: different sources of data for the same kind of entity have fields that don’t exactly match. Schema alignment is a painful process that normally occurs before entity resolution is possible… with semantic matching and similar field names or descriptions, schema alignment just happens. The records being matched and merged will align using the power of representation learning… which understands that the underlying concepts are the same, so the schemas align.
Beyond Matching
An interesting aspect of doing multiple record comparisons at once is that it provides an opportunity for the language model to observe, evaluate and comment on the group of records in the prompt. In my own entity resolution pipeline, I combine and summarize multiple descriptions of companies in Company objects, extracted from different news articles, each of which summarizes the company as it appears in that particular article. This provides a comprehensive description of a company in terms of its relationships not otherwise available.
I believe there are many opportunities like this, given that even last year’s LLMs can do linear and non-linear regression… check out From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples (Vacareanu et al, 2024).

There is no end to the observations an LLM might make about groups of records: tasks related to entity resolution, but not limited to it.
Cost and Scalability
The early, high cost of large language model APIs and the historically high price of GPU inference have created skepticism about whether semantic entity resolution can scale.
Scaling Blocking via Semantic Clustering
Matching in entity resolution for knowledge graphs is just link prediction of SAME_AS edges, a common graph machine learning task. There is little question that semantic clustering for link prediction can cost-efficiently scale, because the technique was proven at Google by Grale (Halcrow et al, 2020, NeurIPS presentation). That paper’s authors include graph learning luminary Bryan Perozzi, recent winner of KDD’s Test of Time Award for his invention of graph embeddings.

Semantic clustering in Grale is an important part of the machine learning behind many features across Google’s web properties, including recommendations at YouTube. Note that Google also uses language models to match nodes during link prediction in Grale 🙂 Google also uses semantic clustering in its Entity Reconciliation API for its Enterprise Knowledge Graph service.
Clustering in Grale uses Locality Sensitive Hashing (LSH). Another efficient approach to clustering via information retrieval is to use L2 / Approximate K-Nearest Neighbors clustering in a vector database such as Facebook FAISS (blog post) or Milvus. In FAISS, records are clustered during indexing and can be retrieved as groups of similar records via A-KNN.
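The A-KNN retrieval pattern can be sketched with a brute-force NumPy nearest-neighbor search; a vector database like FAISS computes the same neighborhoods approximately at scale (the helper name and toy vectors here are illustrative):

```python
import numpy as np

def knn_blocks(embeddings, k=3):
    """Brute-force k-nearest-neighbor blocking: group each record with its k most
    cosine-similar neighbors (each record is its own nearest neighbor)."""
    # normalize rows so the matrix product gives cosine similarities
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    # top-k most similar indices per row, in descending similarity order
    return [[int(j) for j in np.argsort(-sims[i])[:k]] for i in range(len(X))]
```

The brute-force `X @ X.T` is O(n²); swapping it for an approximate index is what makes the same idea scale to billions of records.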
I’ll talk more about scaling semantic blocking in my second post!
Scaling Matching via Large Language Models
Large Language Models are resource intensive and employ GPUs for efficiency in both training and inference. There are three reasons to be optimistic about their efficiency for entity resolution.
1. LLMs are continually, rapidly becoming less expensive… doesn’t fit your budget today? Wait a month.

…and more capable. Not accurate enough today? Wait a week for the new best model. Given time, your satisfaction is inevitable.

The economics of matching via an LLM were first explored in Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution (Nananukul et al, 2023). The authors include Mayank Kejriwal, who wrote the bible of KGs. They achieved surprisingly accurate results, given how bad GPT-3.5 now looks.
2. Semantic blocking can be more effective, meaning smaller blocks with more positive matches. I’ll demonstrate this process in my next post.
3. Multiple records, even multiple blocks, can be matched concurrently in a single prompt, given that modern LLMs have 1 million token context windows. 39 records match and merge at once in the demo above, but ultimately, thousands will at once.

Skepticism: A Tale of Two Workloads
Some workloads are appropriate for semantic entity resolution today, while others are not yet. Let’s explore what works today and what doesn’t.
Semantic entity resolution is best suited to knowledge graphs that have been extracted from unstructured text using a large language model, which you already trust to build the graph. You also trust embeddings to power semantic retrieval. Why wouldn’t you trust embeddings to block records into matching groups, followed by an LLM to match and merge them?
Modern LLMs and tools like BAML are so powerful for information extraction from text that the next two years will see a proliferation of knowledge graphs covering everything from traditional domains like science, e-commerce, marketing, finance, manufacturing and biomedicine to… anything and everything: sports, fashion, cosmetics, hip-hop, crafts, entertainment, non-fiction (every book gets a KG), even fiction (I predict a KG… which I’ll now construct). These sorts of workloads will skip traditional entity resolution tools entirely and perform semantic entity resolution as another step in their KG construction pipelines.
Idempotence for Entity Resolution
Semantic entity resolution isn’t ready for finance and medicine, both of which have strict idempotence (reproducibility) as a legal requirement. This has led to scare tactics that pretend this applies to all workloads.
LLM output varies for several reasons. GPUs execute multiple threads concurrently that finish in varying orders. There are hardware and software settings to reduce or remove variation to improve consistency at a performance hit, but it isn’t clear these remove all variation even on the same hardware. Strict idempotence is only possible when hosting large language models on the same hardware between runs, using a variety of hardware and software settings, and at a performance penalty… it requires a proof-of-concept. That’s likely to change via specific hardware designed for financial institutions as LLMs take over the rest of the world. Regulations are also likely to change over time to accommodate statistical precision rather than precise determinism.
For explanations of matching and merging records, idempotent workloads must also address the fact that Reasoning Models Don’t Always Say What They Think (Chen et al, 2025). See more recently Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (Zhao et al, 2025). This is possible with sufficient validation using emerging tools like prompt tuning for accurate, fully reproducible behavior.
Data Provenance
If you use semantic methods to block, match and merge for existing entity resolution workloads, you must still track the reason for a match and maintain data provenance: a complete lineage of records. This is hard work! This means that most businesses will choose a tool that leverages language models, rather than doing their own entity resolution. Keep in mind that most knowledge graphs two years from now will be built by large language models in other domains.
Abzu Capital
I’m not a vendor selling you a product… I strongly believe in open source, open data tools. I’m in an investment club that built an entity resolved knowledge graph of AI, robotics and data-center related industries using this technology. We wanted to invest in smaller technology companies with high growth potential that cut deals and form strategic relationships with bigger players with large capital expenditures… but reading Form 10-K reports, tracking the news and adding up the deals for even a handful of investments became a full-time job. So we built agents powered by a knowledge graph of companies, technologies and products to automate the process! That is the place from which this post comes.
Conclusion
In this post, we explored semantic entity resolution. We demonstrated proof-of-concept information extraction and entity matching using Large Language Models (LLMs). I encourage you to play with the provided demos and come to your own conclusions about semantic entity matching. I think the simple result above, combined with the other two posts, will show early adopters this is the way the market will turn, one workload at a time.
Up Next…
This is the first post in a series of three. In the second post, I’ll demonstrate semantic blocking via semantic clustering of sentence-encoded records. In my final post, I’ll provide an end-to-end example of semantic entity resolution to improve text-to-Cypher on an actual knowledge graph for a real-world use case. Stick around, I think you’ll be pleased 🙂
At Graphlet AI we build autonomous agents powered by entity resolved knowledge graphs for companies large and small. We build large knowledge graphs from structured and unstructured data: millions, billions or trillions of nodes and edges. I lead the Spark GraphFrames project, widely used in entity resolution for connected components. I have a 20-year background in and teach network science, graph machine learning and NLP. I built and product managed LinkedIn InMaps and Career Explorer. I was a visualization engineer at Ning (Marc Andreessen’s social network), evangelist at Hortonworks and Principal Data Scientist at Walmart. I coined the term “agile data science” in 2009 (from 0 hits on Google) and wrote the first agile data science methodology in Agile Data Science (O’Reilly Media, 2013). I improved it in Agile Data Science 2.0 (O’Reilly Media, 2017), which has a 4-star rating on Amazon 8 years later (code still works). I wrote the first fully data-driven market report for O’Reilly Media in 2015. I’m an Apache Committer on DataFu, I wrote the Apache Druid onboarding docs, and I maintain graph sampler Little Ball of Fur and graph embedding collection Karate Club.
This post originally appeared on the Graphlet AI Blog.