The Architecture Behind Web Search in AI Chatbots

When you ask ChatGPT or Claude to "search the web," it isn't just answering from its training data. It's calling a separate search system.

Most people know that part.

What's less clear is how much traditional search engines matter and how much has been built on top of them.

None of this is fully public, so I'm doing some deduction here. But we can use hints from looking at larger systems to build a useful mental model.

We'll go through query optimization, how search engines are used for discovery, chunking content, "on-the-fly" retrieval, and how you could potentially reverse-engineer a system like this to build a "GEO (Generative Engine Optimization) scoring system."

If you're familiar with RAG, some of this will be repetition, but it may still be useful to see how larger systems split the pipeline into a discovery phase and a retrieval phase.

If you're short on time, you can read the TL;DR.

TL;DR

Web search in these AI chatbots is likely a two-part process. The first part leans on traditional search engines to find and rank candidate documents. In the second part, the system fetches the content from those URLs and pulls out the most relevant passages using passage-level retrieval.

The big change (from traditional SEO) is query rewriting and passage-level chunking, which let lower-ranked pages outrank higher ones if their specific paragraphs match the query better.

The technical process

The companies behind Claude and ChatGPT aren't fully transparent about how their web search systems work inside the chat UI, but we can infer a lot by piecing things together.

We know they lean on search engines to find candidates; at this scale, it would be absurd not to. We also know that what the LLM actually sees are pieces of text (chunks or passages) when grounding its answer.

This strongly hints at some form of embedding-based retrieval over those chunks rather than over full pages.

This process has several parts, so we'll go through it step by step.

Query rewriting & fan-out

First, we'll look at how the system cleans up human queries and expands them. We'll cover the rewrite step, the fan-out step, and why this matters for both engineering and SEO.

We start at query rewriting, the first step of the pipeline we're going through.

I think this part might be the most transparent, and the one most people seem to agree on online.

The query optimization step is about taking a human query and turning it into something more precise. For instance, "please search for those red shoes we talked about earlier" becomes "brown-red Nike sneakers."

The fan-out step, on the other hand, is about generating additional rewrites. So if a user asks about hiking routes near me, the system might try things like "beginner hikes near Stockholm," "day hikes near Stockholm public transport," or "family-friendly trails near Stockholm."

This is different from just using synonyms, which traditional search engines are already optimized for.

If this is the first time you're hearing about it and you're unconvinced, take a look at Google's own docs on AI query fan-out or do a bit of digging around query rewriting.

To what extent this actually happens, we can't know. They might not fan it out that much and instead work with a single query, then send additional ones down the pipeline if the results are lackluster.

What we can say is that it's probably not a big model doing this part. If you look at the research, Ye et al. explicitly use an LLM to generate strong rewrites, then distill that into a smaller rewriter to avoid latency and cost overhead.

As for what this part of the pipeline means for engineering, it just means you want to clean up messy human queries and turn them into something with a higher hit rate.

For the business and SEO people out there, it means those human queries you've been optimizing for are getting transformed into more robotic, document-shaped ones.

SEO, as I understand it, used to care a lot about matching the exact long-tail phrase in titles and headings. If someone searched for "best running shoes for bad knees," you'd stick with that exact string.

What you need to care about now is entities, attributes, and relationships.

So, if a user asks for "something for dry skin," the rewrites might include things like "moisturizer," "occlusive," "humectant," "ceramides," "fragrance-free," "avoid alcohols" and not just "how do I find a product for dry skin."

But to be clear, so there's no confusion: we can't see the internal rewrites themselves, so these are only examples.
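To make the rewrite-plus-fan-out idea concrete, here's a minimal sketch under my own assumptions: the prompt wording, the model name, and the fixed number of variants are all illustrative, not anything these companies have published.

```python
# Hypothetical sketch: rewrite a conversational query and fan it out
# into several search-ready variants using a small, cheap model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_and_fan_out(user_query: str, n_variants: int = 4) -> list[str]:
    prompt = (
        "Rewrite the user query into precise, self-contained web search queries.\n"
        f"Return exactly {n_variants} variants, one per line, no numbering.\n\n"
        f"User query: {user_query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a small distilled rewriter
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()]

# Example: "hiking routes near me" might come back as
# ["beginner hikes near Stockholm", "day hikes near Stockholm public transport", ...]
```

In a production pipeline you'd expect this to be a distilled, latency-optimized rewriter rather than a general-purpose API call, as the Ye et al. work suggests.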

If you're curious about this part, you can dig deeper. I bet there are plenty of papers out there on how to do this well.

Let's move on to what these optimized queries are actually used for.

Using search engines (for document-level discovery)

It's pretty common knowledge by now that, to get up-to-date answers, most AI chatbots rely on traditional search engines. That's not the whole story, but it does cut the web down to something smaller to work with.

Next up: document discovery, the next step of the pipeline we're going through.

I'm assuming the full web is too large, too noisy, and too fast-changing for an LLM pipeline to pull raw content from directly. By using established search engines, you get a way to narrow the universe.

If you look at larger RAG pipelines that work with millions of documents, they do something similar: use a filter of some sort to decide which documents are important and worth further processing.

For this part, we do have proof.

Both OpenAI and Anthropic have said they use third-party search engines like Bing and Brave, alongside their own crawlers.

Perplexity may have built out this part on their own by now, but in the beginning, they would have done the same.

We also have to consider that traditional search engines like Google and Bing have already solved the hardest problems. They're an established technology that handles things like language detection, authority scoring, domain trust, spam filtering, recency, geo-biasing, personalization, and so on.

Throwing all of that away to embed the entire web yourself seems unlikely. So I'm guessing they lean on those systems instead of rebuilding them.
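To illustrate what the discovery step might look like, here's a rough sketch that asks Brave's public Search API for the top candidates for one rewritten query. The real systems almost certainly add their own ranking knobs on top, and the response handling here is simplified to what the next stage needs.

```python
# Sketch of the discovery step: ask an established search index for the
# top candidate URLs for one rewritten query. Fields follow Brave's
# public Search API; the post-processing is purely illustrative.
import os
import requests

def discover_candidates(query: str, count: int = 20) -> list[dict]:
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": os.environ["BRAVE_API_KEY"],
        },
        params={"q": query, "count": count},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("web", {}).get("results", [])
    # Keep only what the next stage needs: rank, URL, title, snippet.
    return [
        {"rank": i + 1, "url": r["url"], "title": r["title"], "snippet": r.get("description", "")}
        for i, r in enumerate(results)
    ]
```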

However, we don't know how many results they actually fetch per query, whether it's just the top 20 or 30. One unofficial article compared citations from ChatGPT and Bing, looked at the ranking order, and found that some came from as far down as 22nd place. If true, this suggests you need to aim for top-20-ish visibility.

Moreover, we don't know what other metrics they use to decide what surfaces from there. This article argues that AI engines heavily favor certain kinds of sources over official sites or socials, so there's more going on.

Still, the search engine's job (whether it's fully third-party or a mix) is discovery. It ranks URLs based on authority and keywords. It may include a snippet of information, but that alone won't be enough to answer the query.

If the model relied only on the snippet, plus the title and URL, it would likely hallucinate the details. That's not enough context.

So this pushes us toward a two-stage architecture, where a retrieval step is baked in — which we’ll get to soon.

What does this mean in terms of SEO?

It means you still have to rank high in traditional search engines to be included in that initial batch of documents that gets processed. So, yes, classic SEO still matters.

But it might also mean you need to think about what other metrics they could be using to rank those results.

This stage is all about narrowing the universe to a handful of pages worth digging into, using established search tech plus internal knobs. Everything else (the "it returns passages of information" part) comes after this step, using standard retrieval techniques.

Crawl, chunk & retrieve

Now let’s move on to what happens when the system has identified a handful of interesting URLs.

Once a small set of URLs passes the first filter, the pipeline is fairly straightforward: crawl the page, break it into pieces, embed those pieces, retrieve the ones that match the query, and then re-rank them. This is what's called retrieval.

Next up: chunking and retrieval, the next step of the pipeline we're going through.

I call it "on-the-fly" retrieval here because the system only embeds chunks once a URL becomes a candidate, then caches those embeddings for reuse. This part might be new even if you're already familiar with retrieval.

To crawl the page, they use their own crawlers. For OpenAI, that's OAI-SearchBot, which fetches the raw HTML so it can be processed. These crawlers don't execute JavaScript. They rely on server-rendered HTML, so the same SEO rules apply: content must be accessible.

Once the HTML is fetched, the content has to be turned into something searchable.

If you're new to this, it might feel like the AI "scans the document," but that's not what happens. Scanning entire pages per query would be too slow and too expensive.

Instead, pages are split into passages, usually guided by HTML structure: headings, paragraphs, lists, section breaks, that sort of thing. These are called chunks in the context of retrieval.

Each chunk becomes a small, self-contained unit. Token-wise, you can see from Perplexity's UI citations that chunks are on the order of 150 tokens, not 1,000. That's about 110–120 words.
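A toy version of structure-guided chunking could look like the sketch below. The ~110-word cap mirrors the chunk sizes observed above; real systems almost certainly use smarter boundary and overlap rules.

```python
# Sketch: split server-rendered HTML into heading/paragraph-guided chunks
# of roughly 150 tokens (approximated here as ~110 words).
from bs4 import BeautifulSoup

def chunk_html(html: str, max_words: int = 110) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    blocks = [
        el.get_text(" ", strip=True)
        for el in soup.find_all(["h1", "h2", "h3", "p", "li"])
        if el.get_text(strip=True)
    ]
    chunks, current = [], []
    for block in blocks:
        current.append(block)
        if sum(len(b.split()) for b in current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```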

After chunking, those units are embedded using both sparse and dense vectors. This allows the system to run hybrid search and match a query both semantically and by keyword.

If you're new to semantic search: briefly, it means the system searches by meaning instead of exact words.

Once a popular page has been chunked and embedded, those embeddings are probably cached. Nobody is re-embedding the same StackOverflow answer thousands of times a day.

This is probably why the system feels so fast: the 95–98% of the web that actually gets cited is likely already embedded and cached aggressively.

We don't know to what extent, though, or how much they pre-embed to make sure the system runs fast for popular queries.
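We can still sketch the obvious version of that cache: key the chunk embeddings on the URL plus a hash of the fetched content, so a page is only re-embedded when it actually changes. The embed and chunker functions here are placeholders for whatever model and splitter are actually used.

```python
# Sketch of an embedding cache keyed on URL + content hash.
import hashlib

_cache: dict[tuple[str, str], list[list[float]]] = {}

def embed_page(url: str, html: str, embed, chunker) -> list[list[float]]:
    key = (url, hashlib.sha256(html.encode()).hexdigest())
    if key not in _cache:
        _cache[key] = [embed(chunk) for chunk in chunker(html)]
    return _cache[key]
```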

Now the system has to figure out which chunks matter. It uses the embeddings for each chunk of text to compute a score for both semantic and keyword matching.

It picks the chunks with the highest scores. This could be anything from 10 to 50 top-scoring chunks.
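As a toy illustration of that hybrid scoring, here's a sketch that blends cosine similarity on dense embeddings with a crude term-overlap score standing in for the sparse/keyword side. The encoder, the overlap scorer, and the blend weight are all my stand-ins, not anything these systems have confirmed.

```python
# Sketch: score chunks with a blend of dense (semantic) and sparse (keyword)
# signals, then keep the top-k. SentenceTransformer is one possible encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_top_k(query: str, chunks: list[str], k: int = 20, alpha: float = 0.7):
    q_vec = encoder.encode(query, normalize_embeddings=True)
    c_vecs = encoder.encode(chunks, normalize_embeddings=True)
    dense = c_vecs @ q_vec  # cosine similarity (vectors are normalized)

    q_terms = set(query.lower().split())
    sparse = np.array([
        len(q_terms & set(c.lower().split())) / max(len(q_terms), 1)
        for c in chunks
    ])

    scores = alpha * dense + (1 - alpha) * sparse
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]
```

A production system would use proper sparse vectors (BM25-style) and tuned score fusion, but the shape of the computation is the same.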

From here, most mature systems will use a re-ranker (cross-encoder) to process those top chunks again, doing another round of scoring. This is the "fix the retrieval mess" stage, because unfortunately retrieval isn't always completely reliable, for a number of reasons.

Although they say nothing about using a cross-encoder, Perplexity is one of the few that documents their retrieval process openly.

Their Search API docs say they "divide documents into fine-grained units" and score those units individually so that they can return the "most relevant snippets already ranked."
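If a cross-encoder re-ranker is in the mix, it would look roughly like this sketch. The specific model is just a common open-source choice, not something any of these companies have confirmed.

```python
# Sketch: re-rank the retrieved chunks with a cross-encoder, which reads
# the query and each chunk together instead of comparing precomputed vectors.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 10) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```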

What does this all mean for SEO? If the system is doing retrieval like this, your page isn't treated as one big blob.

It's broken into pieces (often at paragraph or heading level), and those pieces are what get scored. The full page matters during discovery, but once retrieval begins, it's the chunks that matter.

That means each chunk needs to be able to answer the user's query on its own.

It also means that if your key information isn't contained within a single chunk, the system can lose context. Retrieval isn't magic. The model never sees your full page.

So now we've covered the retrieval stage, where the system crawls pages, chops them into units, embeds those units, and then uses hybrid retrieval and re-ranking to pull out only the passages that can answer the user's query.

Doing another pass & handing chunks over to the LLM

Now let's move on to what happens after the retrieval part, including the "continuing to search" behavior and handing the chunks to the main LLM.

Next up: checking the content and handing it over to the LLM.

Once the system has identified a set of high-ranking chunks, it has to decide whether they're good enough or whether it needs to keep searching. This decision is almost certainly made by a small controller model, not the main LLM.

I'm guessing here, but if the material looks thin or off-topic, it might run another round of retrieval. If it looks solid, it can hand those chunks over to the LLM.
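If such a controller exists, my guess is that the check can be as simple as thresholds over the re-ranked scores, something like this sketch (the thresholds are arbitrary):

```python
# Guesswork sketch: decide whether the retrieved material looks good enough
# to hand over, or whether another retrieval round is warranted.
def enough_evidence(scored_chunks: list[tuple[str, float]],
                    min_score: float = 0.5, min_chunks: int = 3) -> bool:
    strong = [chunk for chunk, score in scored_chunks if score >= min_score]
    return len(strong) >= min_chunks

# if not enough_evidence(results): run another rewrite + retrieval round
```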

At some point, that handoff happens. The selected passages, together with some metadata, are passed to the main LLM.

The model reads all of the provided chunks and picks whichever ones best support the answer it wants to generate.

It doesn't mechanically follow the retriever's order, so there's no guarantee the LLM will use the "top" chunk. It might prefer a lower-ranked passage simply because it's clearer, more self-contained, or closer to the phrasing needed for the answer.

So just like us, it decides what to take in and what to ignore. Even if your chunk scores the highest, there's no assurance it'll be the first one cited.
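The handoff itself is probably little more than prompt assembly: number the passages, attach their URLs, and instruct the model to cite what it uses. A minimal sketch, with the instruction wording being entirely mine:

```python
# Sketch: pack the selected chunks, plus metadata, into a grounded prompt.
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    sources = "\n\n".join(
        f"[{i + 1}] {c['url']}\n{c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources don't cover it, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```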

What to think about

This isn't really a black box. It's a system people have built to hand the LLM the right information to answer a user's query.

It finds candidates, splits documents into units, searches and ranks those units, and then hands them over to an LLM to summarize. So once we understand how the system works, we can also figure out what we need to think about when creating content for it.

Traditional SEO still matters a lot, because this system leans on the old one. Things like having a correct sitemap, easily rendered content, proper headings, domain authority, and accurate last-modified tags are all important for your content to be surfaced correctly.

As I pointed out, they might be mixing search engines with their own technology to decide which URLs get picked, which is worth keeping in mind.

But I think paragraph-level relevance is the new leverage point.

Maybe this means answer-in-one-chunk design will rule. (Just don't do it in a way that feels weird; a TL;DR is one option.) And remember to use the right vocabulary: entities, attributes, relationships, like we talked about in the query optimization section.

How to build a "GEO Scoring System" (for fun)

To figure out how well your content will do, we'll have to simulate the hostile environment it will live in. So let's try to reverse engineer this pipeline.

The idea is to create a pipeline that can do query rewriting, discovery, retrieval, re-ranking and an LLM judge, and then see where you end up compared to your competitors for various topics.

Sketching the pipeline to check where you score compared to competitors.

You start with a few topics like "hybrid retrieval for enterprise RAG" or "LLM evaluation with LLM-as-judge," and then build a system that generates natural queries around them.

You then pass those queries through an LLM rewrite step, because these systems often reformulate the user query before retrieval. Those rewritten queries are what you actually push through the pipeline.

The first check is visibility. For each query, look at the top 20–30 results across Brave, Google and Bing. Note whether your page appears and where it sits relative to competitors.

At the same time, collect domain-level authority metrics (Moz DA, Ahrefs DR, etc.) so you can fold them in later, since these systems probably still lean heavily on those signals.

If your page appears in those first results, you move on to the retrieval part.

Fetch your page and the competing pages, clean the HTML, split them into chunks, embed those chunks, and build a small hybrid retrieval setup that combines semantic and keyword matching. Add a re-ranking step.

Somewhere here you also inject the authority signal, because higher-authority domains realistically get scored higher (even though we don't know exactly by how much).

Once you have the top chunks, you add the final layer: an LLM-as-a-judge. Being in the top five doesn't guarantee citation, so you simulate the last step by handing the LLM a few of the top-scored chunks (with some metadata) and see which one it cites first.

When you run this on your pages and competitors, you see where you win or lose: the search layer, the retrieval layer or the LLM layer.

Remember, this is still a rough sketch, but it gives you something to start with if you want to build a similar system.
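To tie the sketch together, the output per topic could be a handful of numbers per page: where it shows up in the search layer, where its best chunk lands after retrieval and re-ranking, and whether the judge cites it first. The field names and weights below are arbitrary; the point is to see which layer you lose at.

```python
# Sketch: aggregate the three layers into one per-page GEO score.
def geo_score(search_rank: int | None, best_chunk_rank: int | None,
              cited_first: bool, top_n: int = 30) -> float:
    visibility = 0.0 if search_rank is None else 1 - (search_rank - 1) / top_n
    retrieval = 0.0 if best_chunk_rank is None else 1 / best_chunk_rank
    citation = 1.0 if cited_first else 0.0
    # Arbitrary weights; tune them to whatever you care about most.
    return 0.3 * visibility + 0.4 * retrieval + 0.3 * citation
```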


This article focused on the mechanics rather than the strategy side of SEO/GEO, which I get won't be for everyone.

The goal was to map the flow from a user query to the final answer and show that the AI search tool isn't some opaque force.

Even if parts of the system aren't public, we can still infer a reasonable sketch of what's happening. What's clear so far is that AI web search doesn't replace traditional search engines. It just layers retrieval on top of them.

Before ending this, it's worth mentioning that the deep research feature is different from the built-in search tools, which are fairly limited and cheap. Deep research likely leans on more agentic search, which may be "scanning" the pages to a greater extent.

This might explain why my own content shows up in deep research even though it's not optimized for the basic search layer, which is why it almost never shows up in basic AI search.

There's still more to figure out before saying what actually matters in practice. Here I've mostly gone through the technical pipeline, but if this is new to you, I hope I explained it well.


Hopefully it was easy to read. If you enjoyed it, feel free to share it or connect with me on LinkedIn, Medium or through my site.

❤ 
