generate customer journeys that appear smooth and engaging, but evaluating whether these journeys are structurally sound remains difficult for current methods.
This article introduces Continuity, Deepening, and Progression (CDP) — three deterministic, content-structure-based metrics for evaluating multi-step journeys using a predefined taxonomy rather than stylistic judgment.
Traditionally, optimizing customer-engagement systems has involved fine-tuning delivery mechanics such as timing, channel, and frequency to achieve engagement and business results.
In practice, this meant you trained the model to understand rules and preferences, such as , and
To manage this, you built a cool-off matrix to balance timing, channel constraints, and business rules to govern customer communication.
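A cool-off matrix of this kind can be as simple as a lookup of minimum gaps between touchpoints. The sketch below is a minimal illustration; the channel names, hour values, and the 48-hour default are hypothetical, not values from any real system.

```python
# Illustrative cool-off matrix (hypothetical values): minimum hours that must
# pass between a message on the `prev` channel and the next one on `nxt`.
COOL_OFF_HOURS = {
    ("email", "email"): 72,
    ("email", "sms"): 24,
    ("sms", "sms"): 96,
    ("push", "email"): 12,
}

def can_send(prev, nxt, hours_since_last):
    """Check whether enough time has passed for the next touchpoint."""
    return hours_since_last >= COOL_OFF_HOURS.get((prev, nxt), 48)

print(can_send("email", "sms", 30))    # enough time has passed
print(can_send("email", "email", 30))  # still inside the cool-off window
```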
So far, so good. The mechanics of delivery are optimized.
At this point, the core challenge arises when the LLM generates the journey itself. The problem is not just about channel or timing, but whether the sequence of messages forms a coherent, effective narrative that meets business objectives.
And suddenly you realize:
There is no standard metric to determine if an AI-generated journey is coherent, meaningful, or advances business goals.
What We Expect From a Successful Customer Journey
From a business perspective, the sequence of contents per journey step cannot be random: it must be a guided experience that feels coherent, moves the customer forward through meaningful stages, and deepens the connection over time.
While this intuition is common, it’s also supported by customer-engagement research. Brodie et al. (2011) describe engagement as “a dynamic, iterative process” that varies in intensity and complexity as value is co-created over time.
In practice, we evaluate journey quality along three complementary dimensions:
Continuity — whether each message matches the context established by prior interactions.
Deepening — whether content becomes more specific, relevant, or personalized rather than remaining generic.
Progression — whether the journey advances through stages (e.g., from exploration to action) without unnecessary backtracking.
Why Existing LLM Evaluation Metrics Fall Short
If we look at standard evaluation methods for LLMs, such as accuracy metrics, similarity metrics, human-evaluation criteria, and even LLM-as-a-judge, it becomes clear that none provide a way to evaluate customer journeys generated as multi-step sequences.
Let's examine what these standard metrics can and cannot provide for customer journeys.
Accuracy Metrics (Perplexity, Cross-Entropy Loss)
These metrics measure a model's confidence in predicting the next token given the training data. They don't capture whether a generated sequence forms a coherent or meaningful journey.
Similarity Metrics (BLEU, ROUGE, METEOR, BERTScore, MoverScore)
These metrics compare the generated result to a reference text. However, customer journeys rarely have a single correct reference, as they adapt to context, personalization, and prior interactions. Structurally valid journeys may differ significantly while remaining effective.
That said, semantic similarity has its benefits, and we'll use cosine similarity, but more on that later.
Human Evaluation (Fluency, Relevance, Coherence)
Human judgment often outperforms automated metrics in assessing language quality, but it is poorly suited to continuous journey evaluation. It is expensive, suffers from cultural bias and ambiguity, and does not function as a permanent part of the workflow but rather as a one-time effort to bootstrap a fine-tuning stage.
LLM-as-a-Judge (AI feedback scoring)
Using LLMs to evaluate outputs from other LLM systems is an increasingly popular practice.
This approach tends to focus more on style, clarity, and tone rather than structural evaluation.
LLM-as-a-Judge can be applied in multi-stage use cases, but results are often less precise due to the increased risk of context overload. Moreover, fine-grained evaluation scores from this method are often unreliable. Like human evaluators, LLM-as-a-Judge also carries biases and ambiguities.
A Structural Approach to Evaluating Customer Journeys
Ultimately, the primary missing element in evaluating recommended content sequences within the customer journey is content structure.
The most natural way to represent content structure is as a taxonomy tree, a hierarchical model consisting of stages, content themes, and levels of detail.
Once customer journeys are mapped onto this tree, CDP metrics can be defined as structural movements:
- Continuity: smooth movement across branches
- Deepening: moving into more specific nodes
- Progression: moving forward through customer journey stages
The solution is to represent a journey as a path through a hierarchical taxonomy derived from the content space. Once this representation is established, CDP metrics can be computed deterministically from the path. The diagram below summarizes the entire pipeline.
Constructing the Taxonomy Tree
To judge customer journeys structurally, we first require a structured representation of content. We construct this representation as a multi-level taxonomy derived directly from customer-journey text using semantic embeddings.
The taxonomy is anchored by a small set of high-level stages (e.g., motivation, purchase, delivery, ownership, and loyalty). Both anchors and journey messages are embedded into the same semantic vector space, allowing content to be organized by semantic proximity.
Within each anchor, messages are grouped into progressively more specific themes, forming deeper levels of the taxonomy. Each level refines the previous one, capturing increasing topical specificity without relying on manual labeling.
The result is a hierarchical structure that groups semantically related journey messages and provides a stable foundation for evaluating how journeys flow, deepen, and progress over time.
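The first level of this construction can be sketched with plain cosine similarity: attach each message to its nearest anchor, then repeat the grouping within each branch for deeper levels. The 2-d vectors and names below are toy stand-ins for real sentence embeddings, not the actual pipeline.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def group_under_anchors(anchor_vecs, message_vecs):
    """First taxonomy level: attach each message to its nearest anchor.
    Deeper levels would repeat the same grouping within each branch."""
    tree = {name: [] for name in anchor_vecs}
    for msg_id, vec in message_vecs.items():
        best = max(anchor_vecs, key=lambda a: cosine(vec, anchor_vecs[a]))
        tree[best].append(msg_id)
    return tree

# Toy 2-d embeddings standing in for real sentence embeddings.
anchors = {"motivation": [1.0, 0.0], "purchase": [0.0, 1.0]}
messages = {"virtual_tour": [0.9, 0.2], "loan_quote": [0.1, 0.8]}
print(group_under_anchors(anchors, messages))
# {'motivation': ['virtual_tour'], 'purchase': ['loan_quote']}
```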
Mapping Customer Journeys onto the Taxonomy
Once the taxonomy is established, individual customer journeys are mapped onto it as ordered sequences of messages. Each step is embedded in the same semantic space and matched to the closest taxonomy node using cosine similarity.
This mapping converts a temporal sequence of messages into a path through the taxonomy, enabling structural evaluation of journey evolution rather than treating the journey as a flat list of texts.
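The matching step can be sketched as a nearest-node lookup over embeddings. Node names and the 3-d vectors below are hypothetical placeholders for real taxonomy nodes and sentence embeddings.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def map_journey(step_vecs, node_vecs):
    """Assign each journey step to its nearest taxonomy node."""
    path = []
    for vec in step_vecs:
        best = max(node_vecs, key=lambda n: cosine(vec, node_vecs[n]))
        path.append(best)
    return path

nodes = {
    "motivation/test_drive": [0.9, 0.1, 0.0],
    "purchase/financing": [0.1, 0.9, 0.1],
}
steps = [[0.8, 0.2, 0.1], [0.2, 0.9, 0.0]]
print(map_journey(steps, nodes))
# ['motivation/test_drive', 'purchase/financing']
```

The output is the journey's path through the taxonomy, which is all the CDP metrics need.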
Defining the CDP Metrics
The CDP framework consists of three complementary metrics: Continuity, Deepening, and Progression. Each captures a distinct aspect of journey quality. We describe these metrics conceptually before defining them formally based on the taxonomy-mapped journey.

Setup and Computation
Before analyzing real journeys, we clarify two aspects of the setup:
(1) how journey content is structurally represented, and
(2) how CDP metrics are derived from that structure.
Customer-journey content is organized into a hierarchical taxonomy consisting of anchors (L1 journey stages), thematic heads (L2 topics), and deeper nodes that represent increasing specificity:
Anchor (L1)
└── Head (L2)
└── Child (L3)
└── Grandchild (L4+)
Once a journey is mapped onto this hierarchy, Continuity, Deepening, and Progression are computed deterministically from the journey’s path through the tree.
Let a customer journey be an ordered sequence of steps:
X = (x₁, x₂, …, xₙ)
Each step xᵢ is assigned:
- aᵢ — anchor (L1 journey stage)
- tᵢ — thematic head (L2 topic), where tᵢ = 0 means “unknown”
- ℓᵢ — taxonomy depth level (L1 = 0, L2 = 1, L3 = 2, …)
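The per-step assignments above can be captured in a small record type. This is a sketch mirroring the notation; the field and class names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """Per-step assignment after mapping a journey onto the taxonomy."""
    anchor: int  # a_i — L1 journey stage index
    head: int    # t_i — L2 topic id, 0 means "unknown"
    level: int   # l_i — taxonomy depth (L1 = 0, L2 = 1, L3 = 2, ...)

# A toy mapped journey: stay in stage 0 while deepening, then move to stage 1.
journey = [Step(0, 3, 1), Step(0, 3, 2), Step(1, 7, 1)]
print(journey[1])
```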
Continuity (C)
Continuity evaluates whether consecutive messages remain contextually and thematically coherent.
For each transition (xᵢ → xᵢ₊₁), a step-level continuity score cᵢ ∈ [0, 1] is assigned based on taxonomy alignment, with higher weights given to transitions that stay within the same topic or closely related branches.
Transitions are ranked from strongest to weakest (e.g., same topic, related topic, forward stage move, backward move) and assigned decreasing weights.
The overall continuity score is the mean of the step-level scores across the n − 1 transitions:
C = (1 / (n − 1)) · Σᵢ cᵢ
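A minimal sketch of this computation, using hypothetical rank weights (only their ordering matters, the exact values are illustrative):

```python
def step_continuity(s1, s2):
    """Rank-based weight for one transition; values are illustrative."""
    a1, t1 = s1
    a2, t2 = s2
    if a1 == a2 and t1 == t2 and t1 != 0:
        return 1.0   # same topic
    if a1 == a2:
        return 0.75  # related topic within the same stage
    if a2 > a1:
        return 0.5   # forward stage move
    return 0.2       # backward move

def continuity(steps):
    """Mean step-level continuity over all transitions."""
    scores = [step_continuity(a, b) for a, b in zip(steps, steps[1:])]
    return sum(scores) / len(scores)

# (anchor, head) per step: stay on topic, move forward, fall back.
journey = [(0, 3), (0, 3), (1, 7), (0, 2)]
print(round(continuity(journey), 3))  # (1.0 + 0.5 + 0.2) / 3 ≈ 0.567
```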
Deepening (D)
Deepening measures whether a journey accumulates value by moving from general content toward more specific or detailed interactions. It is computed using two complementary components.
Journey-based deepening captures how depth changes along the observed path.
Taxonomy-aware deepening measures how deeply a journey explores the actual taxonomy tree, based on the heads it visits. It evaluates how many of the possible deeper content items (children, sub-children, etc.) under each visited head are later seen during the journey.
The final deepening score is a weighted combination of the two components:
D = α · D_journey + (1 − α) · D_taxonomy,  with α ∈ [0, 1]
Deepening lies in [0, 1].
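One plausible reading of the two components is sketched below: the journey-based part as the share of transitions that go deeper, the taxonomy-aware part as the fraction of reachable deeper items actually visited. The definitions and the mixing weight `alpha` are illustrative assumptions, not the article's exact formulas.

```python
def journey_deepening(levels):
    """Share of transitions that move to a deeper taxonomy level."""
    moves = list(zip(levels, levels[1:]))
    return sum(1 for a, b in moves if b > a) / len(moves)

def taxonomy_deepening(visited_heads, visited_nodes, children):
    """Fraction of deeper items under the visited heads that the journey reaches."""
    reachable = set()
    for head in visited_heads:
        reachable |= children.get(head, set())
    return len(reachable & visited_nodes) / len(reachable) if reachable else 0.0

def deepening(levels, visited_heads, visited_nodes, children, alpha=0.5):
    """Weighted combination of the two components; alpha is hypothetical."""
    return alpha * journey_deepening(levels) + (1 - alpha) * taxonomy_deepening(
        visited_heads, visited_nodes, children)

levels = [1, 2, 2, 3]                               # depth l_i along the journey
children = {"financing": {"loan", "credit_check"}}  # deeper items per head
score = deepening(levels, ["financing"], {"loan"}, children)
print(round(score, 3))  # 0.5 * 2/3 + 0.5 * 1/2 ≈ 0.583
```

Both components lie in [0, 1], so their convex combination does too.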
Progression (P)
Progression measures directional movement through journey stages. For each transition, we compute the stage difference:
Δᵢ = aᵢ₊₁ − aᵢ
Only moving steps (Δᵢ ≠ 0) are considered. Let wᵢ denote the relative importance of the current stage.
If Δᵢ > 0 (forward movement): the transition contributes positively, weighted by wᵢ.
If Δᵢ < 0 (backward movement): the transition is penalized, weighted by wᵢ.
The raw progression score accumulates the weighted stage moves:
P_raw = Σ wᵢ · Δᵢ,  summed over all i where Δᵢ ≠ 0
To bound the score to [−1, +1], we apply a tanh normalization:
P = tanh(P_raw)
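Put together, the Progression computation is a few lines; the stage weights below are illustrative values.

```python
from math import tanh

def progression(stages, stage_weights):
    """Tanh-bounded progression over stage moves; stage_weights[a] is w_i,
    the importance of the current stage (values here are illustrative)."""
    raw = 0.0
    for a, b in zip(stages, stages[1:]):
        delta = b - a
        if delta == 0:
            continue  # only moving steps are considered
        raw += stage_weights[a] * delta  # forward adds, backward subtracts
    return tanh(raw)

# motivation -> purchase -> purchase -> delivery -> purchase (a backtrack)
stages = [0, 1, 1, 2, 1]
weights = [1.0, 0.8, 0.6]
print(round(progression(stages, weights), 3))  # tanh(1.0 + 0.8 - 0.6) ≈ 0.834
```

A variant could weight backward moves more heavily than forward moves gain, which the two-case definition above allows.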
Applying CDP Metrics to an Automotive Customer Journey
To demonstrate how structured evaluation works on realistic journeys, we generated a synthetic automotive customer-journey dataset covering the main stages of the customer lifecycle.

Input Data: Anchors and Journey Content
The CDP framework uses two main inputs: anchors, which define journey stages, and customer-journey content, which provides the messages to evaluate.
Anchors represent meaningful phases in the lifecycle, such as motivation, purchase, delivery, ownership, and loyalty. Each anchor is augmented with a small set of representative keywords to ground it semantically. Anchors serve both as reference points for taxonomy construction and as the expected directional flow used later in the Progression metric.
Anchor keywords:
- motivation: exploration, research, discovery, interest, test drive, needs assessment, experience
- purchase: financing, comparison, quotes, loan, negotiation, credit pre-approval, deposit
- delivery: paperwork, signing, deposit, logistics, handover, activation
- ownership: maintenance, warranty, repair, dealer support, service inspections
- loyalty: feedback, satisfaction survey, referral, upgrade, retention, advocacy
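For the pipeline, the anchor keywords above can be expressed as a simple mapping; the segmentation of multi-word keywords is one plausible reading of the list.

```python
# Anchor stages (L1) with representative grounding keywords.
ANCHORS = {
    "motivation": ["exploration", "research", "discovery", "interest",
                   "test drive", "needs assessment", "experience"],
    "purchase": ["financing", "comparison", "quotes", "loan",
                 "negotiation", "credit pre-approval", "deposit"],
    "delivery": ["paperwork", "signing", "deposit", "logistics",
                 "handover", "activation"],
    "ownership": ["maintenance", "warranty", "repair",
                  "dealer support", "service inspections"],
    "loyalty": ["feedback", "satisfaction survey", "referral",
                "upgrade", "retention", "advocacy"],
}
STAGE_ORDER = list(ANCHORS)  # expected directional flow for Progression
print(STAGE_ORDER)
```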
Customer-journey content consists of short, action-oriented CRM-style messages (emails, calls, chats, in-person interactions) with varying levels of specificity, spanning multiple stages. Although this dataset is synthetically generated, anchor information is not used during taxonomy construction or CDP scoring.
CJ messages:
Explore models that match your lifestyle and private goals.
Take a virtual tour to discover key features and trims.
Compare body styles to evaluate space, comfort, and utility.
Book a test drive to experience handling and visibility.
Use the needs assessment to rank must-have features.
Filter models by range, mpg, or towing to narrow decisions.
Taxonomy Construction Results
Here, we applied the taxonomy construction process to the automotive customer-engagement dataset. The figure below shows the resulting customer-journey taxonomy, built from message content and anchor semantics.
Each top-level branch corresponds to a journey anchor (L1), which represents major journey stages such as motivation, purchase, delivery, ownership, and loyalty.
Deeper levels (L2, L3+) group messages by thematic similarity and increasing specificity.

What the Taxonomy Reveals
Even on this compact dataset, the taxonomy highlights several functional patterns:
- Early-stage messages cluster around exploration and comparison, gradually narrowing toward concrete actions such as booking a test drive.
- Purchase-related content separates naturally into financial planning, document handling, and finalization.
- Ownership content shows a clear progression from maintenance scheduling to diagnostics, cost estimation, and warranty evaluation.
- Loyalty content shifts from transactional actions toward feedback, upgrades, and advocacy.
While these patterns align with how practitioners typically reason about journeys, they arise directly from the data rather than from predefined rules.
Why This Matters for Evaluation
This taxonomy now provides a shared structural reference:
- Any customer journey can be mapped as a path through the tree.
- Movement across branches, depth levels, and anchors becomes measurable.
- Continuity, Deepening, and Progression are no longer abstract concepts; they now correspond to concrete structural changes.
In the next section, we use this taxonomy to map real journey examples and compute CDP scores step by step.
Mapping Customer Journeys onto the Taxonomy
Once the taxonomy is constructed, evaluating a customer journey becomes a structural problem.
Each journey is represented as an ordered sequence of customer-facing messages.
As an alternative of judging these messages in isolation, we project them onto the taxonomy and analyze the resulting path.
Formally, a journey is mapped to a sequence of taxonomy nodes (v₁, …, vₙ), where each vᵢ is the taxonomy node closest to xᵢ by embedding similarity.
A Step-by-Step Walkthrough: From Journey Text to CDP Scores
To make the CDP framework concrete, let's walk through a single customer journey example and show how it is evaluated step by step.
Step 1 — The Customer Journey Input
We start with an ordered sequence of customer-facing messages generated by an LLM.
Each message represents a touchpoint in a realistic automotive customer journey:
journey = ['Take a virtual tour to discover key features and trims.',
           'We found a time slot for a test drive that fits your schedule.',
           'Upload your income verification and ID to finalize the pre-approval decision.',
           'Estimate costs for upcoming maintenance items.',
           'Track retention offers as your lease end nears.',
           'Add plates and registration info before handover.']
Step 2 — Mapping the Journey into the Taxonomy
For structural evaluation, each journey step is mapped into the customer-journey taxonomy. Using text embeddings, each message is matched to its closest taxonomy node. This produces a journey map (jmap), a structured representation of how the journey traverses the taxonomy.
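A jmap could look like the sketch below for the first three steps. The node labels and depth levels are hypothetical examples of what a real embedding match might return, shown only to illustrate the structure.

```python
# Illustrative jmap: each step paired with a hypothetical taxonomy path
# (anchor / head / deeper node) and its depth level.
jmap = [
    ("Take a virtual tour to discover key features and trims.",
     ("motivation", "virtual_exploration"), 1),
    ("We found a time slot for a test drive that fits your schedule.",
     ("motivation", "test_drive"), 1),
    ("Upload your income verification and ID to finalize the pre-approval decision.",
     ("purchase", "financing", "pre_approval"), 2),
]

for text, path, level in jmap:
    print("/".join(path), "(depth", str(level) + ")")
```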
Step 3 — Applying CDP Metrics to This Journey
Once the journey is mapped, we compute Continuity, Deepening, and Progression deterministically from step-to-step transitions.
Taken together, the CDP scores summarize how coherent, deep, and forward-moving this journey is. Importantly, these insights are derived solely from structure, not from stylistic judgments about the text.
Conclusion: From Scores to Successful Journeys
Continuity, Deepening, and Progression are determined by structure and can be applied wherever LLMs generate multi-step content:
- to compare alternative journeys generated by different prompts or models,
- to provide automated feedback for improving journey generation over time.
In this way, CDP scores offer structural feedback for LLMs. They complement, rather than replace, stylistic or fluency-based evaluation by providing signals that reflect business logic and customer experience.
Although this text focuses on automotive commerce, the concept is broadly applicable. Any system that generates ordered, goal-oriented content requires strong structural foundations.
Large language models are already able to generating fluent, persuasive text.
The greater challenge is ensuring that text sequences form coherent narratives that align with business logic and user experience.
CDP provides a way to make structure explicit, measurable, and actionable.
References
Brodie, R. J., Hollebeek, L. D., Jurić, B., & Ilić, A. (2011). Customer engagement: Conceptual domain, fundamental propositions, and implications for research. Journal of Service Research, 14(3), 252–271.
