Spotify just shipped “Prompted Playlists” in beta. I built a few playlists and discovered that the LLM behind the agent tries to meet your request, but fails because it doesn’t know enough and won’t admit it. Here’s what I mean: one of my first playlist prompts was “rock songs in a minor key”. The playlist was swiftly created. I then added the caveat “and no song should have more than 10 million plays”. The AI agent bubbled up an error explaining that it didn’t have access to total play counts. It also, surprisingly, explained that it didn’t have access to a few other things, like musical keys, even though it had claimed to use them in the playlist’s construction. The agent was using its LLM’s knowledge of what key a given song was in and adding songs according to that memory. A closer inspection of the playlist showed several songs that weren’t in a minor key at all. The LLM had, in fact, hallucinated this information and proudly displayed it as a valid match to the playlist’s prompt.
Obviously, a playlist creator is a fairly low-stakes AI agent capability. The playlist it made was great! The trouble is that it only really used about 25% of my constraints as validated input. The remaining 75% of my constraints were just guessed at by the LLM, and the system never told me until I dug in deeper. This isn’t a Spotify problem; it’s an every-agent problem.
Three Propositions
To develop this idea of prompt fidelity more broadly, I need to make three propositions:
- Any AI agent’s verified data layer has a finite capacity. An agent can only query the tools it’s been given, and those tools expose a fixed set of fields with finite resolution. You can enumerate every field in the schema and measure how much each one narrows the search. A popularity score eliminates some fraction of candidates. A release date eliminates another. A genre tag eliminates more. Add up how much narrowing all of the fields can do together and you get a rough number: the maximum amount of filtering the agent can prove it did. I’ll call that number the agent’s verified capacity.
- User intent expressed in natural language is effectively unbounded. A person can write a prompt of arbitrary specificity. “Create a playlist of bass-led songs in a minor key, post-punk from Manchester, recorded in studios with analog equipment between 1979 and 1983, that influenced the gothic rock movement but never charted.” Every clause narrows the search. Every adjective adds precision. There is no ceiling on how specific a user’s request can be, because natural language wasn’t designed around database schemas.
- Following directly from the first two: for any AI agent, there exists a point where the user’s prompt asks for more than the data layer can confirm. Once a prompt demands more narrowing than the verified fields can provide, the rest has to come from somewhere. That somewhere is the LLM’s general knowledge, pattern matching, and inference. The agent will still deliver a confident result. It just can’t prove all of it. Not because the model is poorly built, but because the math doesn’t allow anything else.
This isn’t a quality problem, but a structural one. A better model doesn’t raise the ceiling. Better models just get better at inferring and filling in the rest of the user’s needs. Only adding more verified data fields raises the ceiling, and even then, each new field offers diminishing returns because fields are correlated (genre and energy aren’t independent, release date and tempo trends aren’t independent). The gap between what language can express and what data can confirm is permanent.
The Problem: Agents Don’t Report Their Compression Ratio
Every AI agent with access to tools and skills does the same thing: it takes your request, decomposes that request into a set of actions, executes those actions, infers from the output of those actions, and then presents a unified response.

This decomposition from request to action erodes the connection between what you’re asking for and what the AI agent responds with. The agent’s narration layer flattens what you requested and what was inferred into a single response.
The issue is that as a user of an AI agent, you have no way to know what fraction of your input was used to trigger an action, what fraction of the response was grounded in real data, and what fraction was inferred from the actions the agent took. This is a problem for playlists: there were songs in a major key when I had explicitly asked for only songs in a minor key. It is an even bigger problem when your AI agent is classifying financial receipts and transactions.
We need a metric for measuring this. I’m calling it Prompt Fidelity.
The Metric: Prompt Fidelity
Prompt Fidelity for AI agents is defined by the constraints you give to the agent when asking it to perform some action. Each constraint in a prompt narrows the possible paths the agent can take by some measurable amount. A naïve approach to calculating fidelity would be to count the constraints, tally up those that are verifiable and those that are inferred, and take the ratio. The problem with that approach is that every constraint is weighted the same. Real-life datasets are usually heavily skewed. A constraint that eliminates 95% of the catalog is doing vastly more work than one that eliminates 20%. Counting each constraint the same is wrong.
Therefore, we need to weight each constraint according to the work it does filtering the dataset. Logarithms achieve that weighting. The information in a constraint can be defined as −log₂(p) bits, where p is the fraction of the catalog that survives the constraint or filter you’ve applied.
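As a quick sketch of that weighting (plain Python, nothing Spotify-specific), the bit contribution of a constraint comes straight from its estimated survival fraction:

```python
import math

def constraint_bits(surviving_fraction: float) -> float:
    """Bits of narrowing contributed by one constraint.

    surviving_fraction is the estimated share of the catalog that still
    qualifies after the constraint is applied (0 < p <= 1).
    """
    return -math.log2(surviving_fraction)

# A filter that keeps only 5% of the catalog does far more work
# than one that keeps 80%:
print(round(constraint_bits(0.80), 2))  # 0.32 bits
print(round(constraint_bits(0.05), 2))  # 4.32 bits
```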

In each agent action, every constraint can only be a) verified by tool calls or b) inferred by the LLM. Prompt fidelity is the ratio between those two: the bits contributed by verified constraints divided by the total bits across all constraints.

Prompt Fidelity ranges from 0 to 1. A perfect 1.0 means that every part of your request was backed by real data. A fidelity of 0.0 means that the entire output of the AI agent was driven by its internal reasoning, or vibes.
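A minimal sketch of the metric itself, assuming each constraint carries an estimated survival fraction and a verified/inferred tag (the numbers below are made up for illustration):

```python
import math

def prompt_fidelity(constraints):
    """constraints: list of (surviving_fraction, is_verified) pairs."""
    verified = sum(-math.log2(p) for p, ok in constraints if ok)
    total = sum(-math.log2(p) for p, _ in constraints)
    return verified / total if total else 1.0  # no constraints: nothing to distrust

# Hypothetical prompt: two verified filters, one inferred "vibe" constraint.
example = [(0.50, True), (0.25, True), (0.10, False)]
print(round(prompt_fidelity(example), 2))  # ~0.47
```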

Spotify’s system above always reports an effective 1.0 in this example. In reality, the prompt fidelity of the playlist creation was around 25%: two constraints (under 4 minutes and recorded before 2005) were fulfilled with verified data, and the rest were inferred from the agent’s existing (and potentially faulty) knowledge and recall. At scale, and applied to higher-stakes problems, falsely reporting a high prompt fidelity becomes a serious problem.
What Fidelity Actually Means (and Doesn’t Mean)
In audio systems, “fidelity” is a measure of how faithfully the system reproduces the original signal. High fidelity doesn’t guarantee that the music itself is good. High fidelity only guarantees that the music sounds the way it did when it was recorded. Prompt fidelity is the same idea: how much of your original intent (the signal) was faithfully fulfilled by the agentic system.
High prompt fidelity means that the system did what you asked and you can PROVE it. Low prompt fidelity means the system probably did something close to what you wanted, but you’ll have to review the output (listen to the whole playlist) to be sure.
Prompt Fidelity is NOT an accuracy score. It cannot tell you that “75% of the songs in a playlist match your prompt”. A playlist with a 0.25 fidelity could be 100% perfect. The LLM might have nailed every inference about every song it added. Or half the songs could be wrong. You don’t know. You can’t know until you listen to all of the songs. That’s the point of making prompt fidelity measurable.
Instead, prompt fidelity measures how much of the result you can TRUST WITHOUT CHECKING. In a financial audit, if 25% of the line items have receipts and 75% are estimates, the total bill might still be 100% accurate, but your CONFIDENCE in that total is fundamentally different from an audit where every line item is supported by a receipt. The distinction matters because there are domains where ‘just trust the vibes’ is fine (music) and domains where it isn’t (medical advice, financial guidance, legal compliance).
Prompt fidelity is closer to a documentation rate over a set of constraints than to the error rate of the response itself.
Practically, in our Spotify example: as you add more constraints to your playlist prompt, the prompt fidelity drops and the playlist becomes less of an exact report and more of a recommendation. That’s totally fine, but the user should be told which one they’re getting. Is this playlist exactly what I asked for? Or did you just make something that works toward the goal I gave you? Surfacing that metric to the user is crucial for building trust in these agentic systems.
The Case Study: Reverse-Engineering Spotify’s AI Playlist Agent
Spotify’s Prompted Playlists feature is what started this exploration into prompt fidelity. Let’s dive deeper into how these playlists work and how I explored the capability using nothing but the standard prompt input field.
Prompted Playlists let you describe what you want in natural language. For example, in this playlist, the prompt is simply “rock songs in minor keys, under 4 minutes, recorded before 2005, featuring bass lines as a lead melodic element”.
Normally, to make a playlist, you’d have to comb through hours of music to land on exactly what you wanted. This playlist is 52 minutes long and took only a minute to generate. The appeal here is obvious, and I genuinely enjoy this feature. Without having to know all of the essential rock artists, I can be introduced to the music and explore it more quickly and easily.
Unfortunately, the official documentation from Spotify is very light. There are almost no details about what the system can or can’t do or what metadata it keys off of, nor is there any data mapping available.
Using a simple technique, however, I was able to map what I believe is the full data contract available to the agent over the course of one evening (all from my couch while watching The Sopranos, naturally).
The Technique: Impossible Constraints as a Forcing Function
Because of how Spotify architected this playlist-building agent, when it cannot satisfy a request, its error messages can be coaxed into revealing architectural details that are otherwise unavailable. When you find a constraint the agent can’t build on, it will error, and you can leverage that error to understand what it CAN do. I’ll use this as the constant to probe the system.
In our example playlist above, Minor Keys & Bass Lines, adding the unlock phrase “with fewer than 10 million streams” acts as a circuit breaker for the agent, signalling that it cannot fulfill the user’s request. With that phrase in place, you can explore the possibilities by changing other parts of the prompt over and over until you can see what the agent has access to. Collecting the responses, asking overlapping questions, and reviewing the answers lets you build a foundational understanding of what is available to the agent.

What I Found: The Three-Tier Architecture
The Spotify Prompted Playlist agent has a wealth of data available to it. I’ve separated it into three tiers: musical metadata, user-based data, and LLM inference. Beyond that, it seems that Spotify has excluded various data sources from its agent, either as a product choice or as a “get this out the door” choice.
- Tier 1
- Verified track metadata: duration, release date, popularity, tempo, energy, explicit, genre, language
- Tier 2
- Verified user behavioral data: play counts, skip counts, timestamps, recency flags, ms played, source, period analytics (40+ fields total)
- Tier 3
- LLM inference: key/mode, danceability, valence, acousticness, mood, instrumentation — all inferred from general knowledge, narrated as if verified
- Deliberate exclusion:
- Spotify’s public API exposes audio features (danceability, valence, etc.), but the agent doesn’t have access to them. Perhaps a product choice, not a technical limitation.
A full list of available fields is included at the bottom of this post.

The Behavioral Findings
The agent demonstrated surprisingly resilient behavior in the face of ambiguous requests and conflicting instructions. It commonly reported that it was double-checking various constraints and fulfilling the user’s request. However, whether those constraints were actually checked against a validated dataset was never exposed.

When the playlist agent can get a close, but not exact, match to the constraints in the prompt, it runs a “related” query and silently substitutes the results of that query as valid results for the original request. This dilutes trust in the system, since a prompt requesting ONLY bass-driven rock music might end up gathering non-bass-driven rock music into the playlist, likely dissatisfying the user.
There does appear to be a “certainty threshold” the agent is not comfortable crossing. For example, this entire exploration was built on the “fewer than 10 million plays” unlock phrase. When that threshold is hit, the agent reveals only a handful of the fields it has access to, and the list changes from prompt to prompt, even when the prompt is identical between runs. That is classic LLM non-determinism. To boost trust in the system, exposing what the agent DOES have access to in a straightforward way would tell the human exactly what they can and can’t ask about.
Finally, when these two kinds of data are mixed, the agent is not clear about which songs it used verified data for and which it used inferred data for. Both verified and inferred decisions are blended and presented with identical authority in the playlist notes. For example, if you craft a prompted playlist about your own user data (“songs I’ve skipped more than 30 times with a punchy bass-driven melody”), the agent will place real data (“you skipped this song 83 times last year!”) right next to inferred knowledge (“John Deacon’s bass line commands attention throughout this song”). To be clear, I have not skipped any Queen songs 83 times, to my knowledge. And the AI agent doesn’t have a “bass_player” field anywhere in its available data to query against. The AI knows that Queen commonly features a strong bass line, and its knowledge of John Deacon as Queen’s bass guitarist allows the LLM to infer that his bass line is what earned the song a spot on the playlist.
Applying the Math: Two Playlists, Two Fidelity Scores
Let’s apply the prompt fidelity concept to two example playlists. I don’t have full access to the Spotify music catalog, so I’ll be using estimated survivorship numbers for each filter in the fidelity bit computations. The formula is the same at every step: bits = −log₂(p), where p is the estimated fraction of the catalog that survives the filter being applied.
“Minor Bass Melodies” — The Confident Illusion
This playlist is the one with Queen: “A playlist of rock music, all in a minor key, under 4 minutes of playtime, released pre-2005, and bass-led”. I’ll apply the formula and use the bits of information from each constraint to compute the prompt fidelity.
Duration < 4 minutes
- Estimate: ~80% of tracks are under 4 minutes → p = 0.80

- This barely narrows anything, which is why it contributes so little
Release date before 2005
- Estimate: ~30% of Spotify’s catalog is pre-2005 (the catalog skews heavily toward recent releases) → p = 0.30

- More selective — eliminates 70% of the catalog
Minor key
- Estimate: ~40% of popular music is in a minor key → p = 0.40

- Moderate selectivity, but this constraint is entirely inferred: the agent confirmed that key/mode is not a verified field
Bass-led melodic element
- Estimate: ~5% of tracks feature bass as the lead melodic element → p = 0.05

- By far the most selective constraint. This single filter does more work than the other three combined. And it’s 100% inferred.
Totals:
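Running those estimates through −log₂(p): the verified constraints contribute about 0.32 bits (duration) + 1.74 bits (release date) ≈ 2.06 bits, while the inferred constraints contribute about 1.32 bits (minor key) + 4.32 bits (bass-led) ≈ 5.64 bits, for roughly 7.7 bits in total. That puts the fidelity around 2.06 / 7.7 ≈ 0.27, in line with the ~25% figure above.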




These survival fractions are estimates. However, the structural point holds regardless of the exact numbers: the most selective constraint is the least verifiable, and that’s not a coincidence. The things that make a prompt interesting are almost always the things an agent has to guess at.

“Skipped Songs” — The Honest Playlist
This prompt is very simple: “A playlist of songs I’ve skipped more than 5 times”. It is very easy to verify, and the agent leans into the data it has access to.
Skip count > 5
- Estimate: ~10% of tracks in your library have been skipped more than 5 times → p = 0.10

- This is the only constraint, and it’s a verified field (user_skip_count)
Totals:
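With a single verified constraint, the totals are trivial: −log₂(0.10) ≈ 3.32 bits, all of it verified, so the fidelity is 3.32 / 3.32 = 1.0.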




The Structural Insight
The interesting part of prompt fidelity is evident in both playlists: the “most interesting” prompt is the least verifiable. A playlist of all my skipped songs is trivially easy to build, but Spotify doesn’t want to show it to me. After all, these are songs I generally don’t like to listen to, hence the skips. Similarly, a pre-2005 release date is very easy to verify, but the resulting playlist is unlikely to be interesting to the average user.
The bass-line constraint, though, is genuinely interesting for a user. Constraints like these are where the Prompted Playlist concept shines. Already today I’ve created and listened to two such playlists generated from nothing more than the idea of a song I wanted to hear more of.
However, the concept of a “bass-driven” song is hard to quantify, especially at Spotify’s scale. Even if they did quantify it, I’d ask for “clarinet jazz” the next day and they’d all have to get back to work finding and labeling those songs. And that is, of course, the magic of the Prompted Playlist feature.
Validation: A Controlled Agent
The Spotify examples are compelling, but I don’t have direct access to the schema, the tools, or the agentic harness itself. So I built a movie recommendation agent in order to test this theory in a more controlled environment.
The movie recommendation agent is built on the TMDB API, which provides the verified layer. The fields in the schema are genre, year, rating, runtime, language, cast, and director. All other constraints, like mood, tone, and pacing, aren’t verified data and are instead sourced from the LLM’s own knowledge of films. As the agent fulfills a user’s request, it records each data source as either verified or inferred and scores its own response.
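The shape of that self-scoring step, heavily simplified (the field list mirrors the schema above; the constraint classification and survival fractions are illustrative, not the real agent’s):

```python
import math

# Fields the test agent can verify against the TMDB-backed layer.
VERIFIED_FIELDS = {"genre", "year", "rating", "runtime", "language", "cast", "director"}

def score_response(constraints):
    """constraints: list of dicts like {"field": "genre", "surviving_fraction": 0.15}.
    Any field outside VERIFIED_FIELDS counts as LLM inference."""
    verified = inferred = 0.0
    for c in constraints:
        bits = -math.log2(c["surviving_fraction"])
        if c["field"] in VERIFIED_FIELDS:
            verified += bits
        else:
            inferred += bits
    total = verified + inferred
    return verified / total if total else 1.0

# "Action movies from the 1980s rated above 7.0": three verified constraints.
boring = [
    {"field": "genre", "surviving_fraction": 0.15},
    {"field": "year", "surviving_fraction": 0.10},
    {"field": "rating", "surviving_fraction": 0.20},
]
# "Feels like a rainy Sunday afternoon": one inferred mood constraint.
vibes = [{"field": "mood", "surviving_fraction": 0.05}]

print(score_response(boring))  # 1.0
print(score_response(vibes))   # 0.0
```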
The Boring Prompt (F = 1.0)
We’ll start with a “boring” prompt: “Action movies from the 1980s rated above 7.0”. This gives the agent three constraints to work with: genre, date range, and rating. All of these constraints correspond to verified data values in the database.
If I run this through the test agent, a high fidelity score pops out naturally because each constraint is tied to verified data.

Every result here is verifiably correct. The LLM made zero judgement calls because it had data to base its response on for every constraint.
The Vibes Prompt (F = 0.0)
In this case, I’ll ask for “movies that feel like a rainy Sunday afternoon”. No constraint in this prompt aligns to any verified data in our dataset. The work required of the agent falls entirely on its LLM reasoning from its existing knowledge of films.

The recommendations are defensible, and they are definitely good movies, but they aren’t verifiable against the data we have access to. With no verified constraints to anchor the search, the candidate pool was the entire TMDb catalog, and the LLM had to do all of the work. Some picks are great; others are the model reaching for obscure movies it isn’t confident about.
The Takeaway
This test movie recommendation agent validates the prompt fidelity framework as a powerful way to expose how an agent’s interpretation of a user’s intent pushes its response toward being a precision tool or a recommendation engine. Where the response lands between those two poles is critical for informing users and building trust in agentic systems.
The Fidelity Frontier
To make this concrete: Spotify’s catalog contains roughly 100 million tracks. The total amount of information your prompt needs to carry to narrow that catalog down to your playlist is what I’ll call the prompt’s bit demand.

To select a 20-song playlist from that catalog, you need roughly 22 bits of selectivity (log₂ of 100 million divided by 20).

The verified fields (duration, release date, popularity, tempo, energy, genre, explicit flag, language, and the full suite of user behavioral data) have a combined capacity that tops out at roughly 10 to 12 bits, depending on how you estimate the selectivity of each field. After that, the verified layer is exhausted. Every additional bit of specificity your prompt demands has to come from LLM inference. I’ll call this maximum the agent’s verified capacity.

That gives you a fidelity ceiling for any prompt:

Fidelity ceiling = min(1, verified capacity / prompt bit demand)

And for a prompt specific enough to fully define a playlist, the ceiling is simply the verified capacity divided by the roughly 22 bits it takes to pin down a 20-track playlist in a 100-million-track catalog.
For the Spotify agent, that means a maximally specific prompt that fully defines a playlist cannot exceed roughly 55% fidelity (about 12 verified bits against a 22-bit demand). The other 45% is structurally guaranteed to be inference. For simpler prompts that don’t push past the verified layer’s capacity, fidelity can reach 1.0. But as prompts get more specific, fidelity drops, not gradually but by necessity.

This defines what I’m calling the fidelity frontier: the curve of maximum achievable fidelity as a function of prompt specificity. Every agent has one. It’s computable up front from the tool schema. Simple prompts sit on the left of the curve, where fidelity is high. Creative, specific, interesting prompts sit on the right, where fidelity is structurally bounded below 1.0.
The uncomfortable implication is that the prompts users care about most (the ones that feel personal, specific, and tailored) are precisely the ones that push past the verified layer’s capacity. The most interesting outputs come from the least faithful execution. And the most boring prompts are the most trustworthy. That tradeoff is baked into the math. It doesn’t go away with scale, better models, or larger databases. It only shifts.
For anyone building agents, the practical takeaway is this: you can compute your own verified capacity by auditing your tool schema. You can estimate the typical specificity of your users’ prompts. The ratio tells you how much of your agent’s output is structurally guaranteed to be inference. That’s a number you can put in front of a product team or a risk committee. And for agents handling policy questions, medical information, or financial advice, it means there is a provable lower bound on how much of any response cannot be grounded in retrieved data. You can shrink it. You cannot eliminate it.
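A sketch of that audit, assuming you can attach a rough selectivity estimate to each verified field in your tool schema (the field names and fractions below are placeholders, not Spotify’s real values):

```python
import math

# Estimated survival fraction for a "typical" use of each verified field.
verified_schema = {
    "duration": 0.5,
    "release_date": 0.3,
    "popularity": 0.4,
    "genre": 0.1,
    "user_behavior": 0.05,
}

# Upper bound on verified bits, treating fields as independent
# (an overestimate, since correlated fields overlap in what they filter).
verified_capacity = sum(-math.log2(p) for p in verified_schema.values())

# Bit demand of a fully specified request: narrow a 100M-track catalog
# down to a 20-track playlist.
prompt_demand = math.log2(100_000_000 / 20)

ceiling = min(1.0, verified_capacity / prompt_demand)
print(f"{verified_capacity:.1f} verified bits vs {prompt_demand:.1f} demanded -> ceiling ~{ceiling:.0%}")
```

With numbers in that ballpark you land close to the 10-to-12-bit capacity and the roughly 55% ceiling described above.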
The Broader Application: Every Agent Has This Problem
This is not a Spotify problem. It is a problem for any system where an LLM orchestrates tool calls to answer a user’s query.
Consider Retrieval Augmented Generation (RAG) systems, which power most enterprise AI knowledge-base deployments today. When an employee asks an internal assistant a policy question, part of the answer comes from retrieved documents and part comes from the LLM synthesizing across them, filling gaps, and smoothing the language into something readable. The retrieval is verified. The synthesis is inferred. And the response reads as one seamless paragraph with no indication of where the seams are. A compliance officer reading that answer has no way to know which sentence came from the enterprise policy document and which sentence the model invented to connect two paragraphs that didn’t quite fit together. The fidelity question is identical to the playlist question, just with higher stakes.
Coding agents face the same decomposition. When an AI generates a function, some of it may reference established patterns from its training data or documentation lookups, and some of it is novel generation. As more production code is written by AI, surfacing that ratio becomes a real engineering concern. A function that’s 90% grounded in well-tested patterns carries different risks than one that’s 90% novel generation, even if both pass the same test suite today.
Customer support bots may be the highest-stakes example. When a bot tells a customer what their refund policy is, that answer should be drawn directly from policy documents, full stop. Any inferred or synthesized content in that response is a liability. The silent substitution behavior observed in Spotify (where the agent ran a nearby query and narrated it as if it fulfilled the original request) would be genuinely dangerous in a customer support context. Imagine a bot confidently stating a return window or coverage term that it inferred rather than retrieved.
The general form of prompt fidelity applies to all of these:
Fidelity = bits of response grounded in tool calls / total bits of response
The hard part, and increasingly the core challenge of AI engineering work, is defining what “bits” means in each context. For a playlist with discrete constraints, it’s clean. For free-text generation, you’d have to decompose a response into individual claims and assess each one, which is closer to what factuality benchmarks already attempt to do, just reframed as an information-theoretic measure. That’s a hard measurement problem, and I don’t claim to have solved it here.
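For what a claim-level version might look like, here’s a sketch that assumes you already have a way to decompose a response into claims and tag each one by provenance (which is exactly the hard part I’m waving away):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    grounded: bool       # traceable to a tool call or retrieved document
    weight: float = 1.0  # optionally weight by specificity instead of counting

def response_fidelity(claims):
    total = sum(c.weight for c in claims)
    grounded = sum(c.weight for c in claims if c.grounded)
    return grounded / total if total else 1.0

answer = [
    Claim("The refund window is 30 days", grounded=True),
    Claim("Opened items are usually accepted too", grounded=False),
]
print(response_fidelity(answer))  # 0.5
```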
But I think the framework has value even when exact measurement is impractical. If the people building these systems are thinking about fidelity as a design constraint (what fraction of this response can I ground in tool calls, and how do I communicate that to the user?), the outputs will be more trustworthy whether or not anyone computes a precise score. The goal isn’t a number on a dashboard. The goal is a mental model that shapes how we build.
The Complexity Ceiling
Every agent has a complexity ceiling. Simple lookups (what’s the play count for this track?) are essentially free. Filtering the catalog against a set of field-level predicates (show me everything under 4 minutes, pre-2005, popularity below 40) scales linearly and runs fast. But the moment a prompt requires cross-referencing entities against one another (does this track appear in more than three of my playlists? was there a year-long gap somewhere in my listening history?), the cost jumps quadratically, and the agent either refuses outright or silently approximates.
That silent approximation is the interesting failure mode. The agent follows a kind of principle of least computational action: when the exact query is too expensive, it relaxes your constraints until it finds a version it can afford to run. You asked for a specific valley in the search space; it rolled downhill to the nearest one instead. The result is a local minimum, close enough to look right, cheap enough to serve, but it’s not what you asked for, and it doesn’t tell you the difference.
This ceiling isn’t unique to Spotify. Any agent built on indexed database lookups will hit the same wall. The boundary sits right where queries stop being decomposable into independent WHERE clauses and start requiring joins, full scans, or aggregations across your entire history. Below that line, the agent is a precision tool. Above it, it’s a recommendation engine wearing a precision tool’s clothes. The question for anyone building these systems isn’t whether the ceiling exists (it always does) but whether your users know where it is.
What to Do About It: Design Recommendations
If prompt fidelity is a real and measurable property of agentic systems, the natural question is what to do about it. Here are five recommendations for anyone building or deploying AI agents with tool access.
- Report fidelity, even roughly. Spotify already shows audio quality as a simple indicator (low, normal, high, very high) while you’re streaming music. The same pattern works for prompt fidelity. You don’t need to show the user a decimal score. A simple label (“this playlist closely matches your prompt” versus “this playlist is inspired by your prompt”) would be enough to set expectations accurately; a minimal mapping is sketched after this list. The difference between a precision tool and a recommendation engine is fine, as long as the user knows which one they’re holding.
- Distinguish grounded claims from inferred ones in the UX. This can be subtle. A small icon, a slight color shift, a footnote. When Spotify’s playlist notes say “86 skips”, that’s a fact from a database. When they say “John Deacon’s bass line drives the entire track”, that’s the LLM’s general knowledge. Both are presented identically today. Even a minimal visual distinction would let users calibrate their trust per claim rather than trusting or distrusting the entire output as a block.
- Disclose substitutions explicitly. When an agent can’t fulfill a request exactly but can get close, it should say so. “I couldn’t filter on download status, so I found songs from albums you’ve saved but haven’t liked” preserves trust far more than silently serving a nearby result and narrating it as if the original request was fulfilled. Users are forgiving of limitations. They’re much less forgiving of being misled.
- Provide deterministic capability discovery. When I asked the Spotify agent to list every field it could filter on, it produced a different answer every time, depending on the context of the prompt. The LLM was reconstructing the field list from memory rather than reading from a fixed reference. Any agent that exposes filtering or querying capabilities to users needs a stable, deterministic way to discover those capabilities. A “show me what you can do” command that returns the same answer each time is table stakes for user trust.
- Audit your own agent with this technique before your users do. The methodology in this piece (pairing impossible constraints with target fields to force informative refusals) is a general-purpose audit technique that works on any agent with tool access. It took one evening and a few dozen prompts to map Spotify’s full data contract. Your users will do the same thing, whether you invite them to or not. The question is whether you know your own system’s boundaries before they do.
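For the first recommendation, the user-facing surface could be as simple as a threshold map from score to label (the thresholds here are arbitrary placeholders):

```python
def fidelity_label(score: float) -> str:
    """Map a prompt-fidelity score onto user-facing language."""
    if score >= 0.9:
        return "closely matches your prompt"
    if score >= 0.5:
        return "mostly matches your prompt"
    return "inspired by your prompt"

print(fidelity_label(0.27))  # inspired by your prompt
```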
Closing
Every AI agent has a fidelity score. Most are lower than you’d expect. None of them report it.
The methodology here (using impossible constraints to force informative refusals) isn’t specific to music or playlists. It works on any agent that calls tools. If the system can refuse, it can leak. If it can leak, you can map it. A dozen well-crafted prompts and an evening of curiosity is all it takes to understand what a production agent can actually do versus what it claims to do.
The math generalizes too. Weighting constraints by their selectivity rather than simply counting them reveals something a naïve audit misses: the constraints that make a prompt feel personal and specific are almost always the ones the system can’t verify. The most interesting outputs come from the least faithful execution. That tension doesn’t go away with better models or larger databases. It’s structural.
As AI agents become the primary way people interact with data systems (their music libraries today, their financial accounts and medical records tomorrow), users will probe the boundaries. They’ll find the gaps between what was promised and what was delivered. They’ll discover that the confident, well-narrated response was partially grounded and partially invented, with no way to tell which parts were which.
The question isn’t whether your agent’s fidelity will be measured. It’s whether you measured it first.
Bonus: Prompts Worth Trying (If You Have Spotify Premium)
Once you know the schema, you can write prompts that surface genuinely surprising things about your listening history. These all worked for me, with varying degrees of tweaking:
The Relationship Autopsy
- “Songs where my skip count is higher than my play count”
- Fair warning: this one may cause existential discomfort (you skip these songs for a reason!)
Love at First Listen
- “Songs I saved within 24 hours of my first play, sorted by oldest first”
- A chronological timeline of tracks that grabbed you immediately
The Lifecycle
- “Songs I first ever played, sorted by most plays”
- Your origin story on the platform
The Marathon
- “Songs where my total ms_played is highest, convert to hours”
- Not most plays, but most total time. A different and often surprising list
The Longest Relationship
- “Songs with the smallest gap between first play and most recent play, with at least 50 plays, ordered by earliest first listen”
The One-Week Obsessions
- “Songs I played more than 10 times in a single week and then never touched again”
- Your former obsessions, fossilized. This was like a time machine for me.
The Time Capsule
- “One song from each year I’ve been on Spotify: the song with the most plays from that year”
The Before and After
- “Two sets: my 10 most-played songs in the 6 months before [milestone date] and my 10 most-played in the 6 months after”
- Plug in any date that mattered: a move, a new job, a breakup, or even a Covid-19 lockdown
The Soundtrack to a Year
- “Pick the year where my total ms_played was highest. Build a playlist of my top songs from that year”
What Didn’t Work (and Why)
- Comeback Story (year-long gap detection): “Songs I rediscovered after a year-long gap in listening”
- The agent can’t scan the full play history for gaps. Snapshot queries work, timeline scans don’t.
- Seasonal patterns (only played in December): “Songs I only played in December and never in any other month”
- Proving a universal negation requires a full scan. Same fundamental limitation.
- Derived math (ms_played / play_count): “Songs where my average listen time is under 30 seconds per play”
- The agent struggles with computed fields. Stick to raw comparisons.
- These failures map directly onto the complexity ceiling: they require O(n²) or full-scan operations the agent can’t or isn’t allowed to perform.
Suggestions
- Reference field names directly when the agent misinterprets natural language
- Start broad and tighten. Loose constraints succeed more often
- “If you can’t do X, tell me what you CAN do” is the universal audit prompt
Track Metadata
| Field | Status | Description |
| --- | --- | --- |
| album | ✅ Verified | Album name |
| album_uri | ✅ Verified | Spotify URI for the album |
| artist | ✅ Verified | Artist name |
| artist_uri | ✅ Verified | Spotify URI for the artist |
| duration_ms | ✅ Verified | Track length in milliseconds |
| release_date | ✅ Verified | Release date, supports arbitrary cutoffs |
| popularity | ✅ Verified | 0–100 index. Proxy for streams, not a precise count |
| explicit | ✅ Verified | Boolean flag for explicit content |
| genre | ✅ Verified | Genre tags for track/artist |
| language_of_performance | ✅ Verified | Language code. “zxx” (no linguistic content) used as instrumentalness proxy |
Audio Features (Partial)
| Field | Status | Description |
| --- | --- | --- |
| energy | ✅ Verified | Available as filterable field |
| tempo | ✅ Verified | BPM, available as filterable field |
| key / mode | ❌ Unavailable | “Would need to infer from knowledge; no verified field” |
| danceability | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| valence | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| acousticness | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| speechiness | ❌ Unavailable | Not exposed despite existing in Spotify’s public API |
| instrumentalness | ❌ Unavailable | Replaced by language_of_performance == “zxx” workaround |
User Behavioral Data
| Field | Status | Description |
| --- | --- | --- |
| user_play_count | ✅ Verified | Total plays per track. Observed: 122, 210, 276 |
| user_ms_played | ✅ Verified | Total milliseconds streamed per track, album, artist |
| user_skip_count | ✅ Verified | Total skips per track. Observed: 64, 86 |
| user_saved | ✅ Verified | Whether track is in Liked Songs |
| user_saved_album | ✅ Verified | Whether the album is saved to library |
| user_saved_date | ✅ Verified | Timestamp of when the track/album was saved |
| user_first_played | ✅ Verified | Timestamp of first play |
| user_last_played | ✅ Verified | Timestamp of most recent play |
| user_days_since_played | ✅ Verified | Pre-computed convenience field for recency filtering |
| user_streamed_track | ✅ Verified | Boolean: ever streamed this track |
| user_streamed_track_recently | ✅ Verified | Boolean: streamed in approx. last 6 months |
| user_streamed_artist | ✅ Verified | Boolean: ever streamed this artist |
| user_streamed_artist_recently | ✅ Verified | Boolean: streamed this artist recently |
| user_added_at | ✅ Verified | When a track was added to a playlist |
Source & Context
| Field | Status | Description |
| --- | --- | --- |
| source | ✅ Verified | Play source: playlist, album, radio, autoplay, etc. |
| source_index | ✅ Verified | Position within the source |
| matched_playlist_name | ✅ Verified | Which playlist a track belongs to. No cross-playlist aggregation. |
Period Analytics (Time-Windowed)
| Field | Status | Description |
| --- | --- | --- |
| period_ms_played | ✅ Verified | Milliseconds played within a rolling time window |
| period_plays | ✅ Verified | Play count within a rolling time window |
| period_skips | ✅ Verified | Skip count within a rolling time window |
| period_total | ✅ Verified | Total engagement metric within a rolling time window |
Query / Search Fields
| Field | Status | Description |
| --- | --- | --- |
| title_query | ✅ Verified | Fuzzy text matching on target titles |
| artist_query | ✅ Verified | Fuzzy text matching on artist names |
Confirmed Unavailable
| Field | Status | Notes |
| --- | --- | --- |
| Global stream counts | ❌ Unavailable | Cannot filter by exact play count (e.g., “under 10M streams”) |
| Cross-playlist count | ❌ Unavailable | Cannot count how many playlists a track appears in |
| Family/household data | ❌ Unavailable | Cannot access other users’ listening data |
| Download status | ⚠️ Unreliable | Agent served results but most tracks lacked download indicators. Likely device-local. |
