What My GPT Stylist Taught Me About Prompting Better

-

TL;DR: I built a fun and flamboyant GPT stylist named Glitter—and by chance discovered a sandbox for studying LLM behavior. From hallucinated high heels to prompting rituals and emotional mirroring, here’s what I learned about language models (and myself) along the way.

I. Introduction: From Fashion Use Case to Prompting Lab

When I first set out to build Glitter, I wasn’t trying to study the mysteries of large language models. I just wanted help getting dressed.

I’m a product leader by trade, a fashion enthusiast by lifelong inclination, and someone who’s always preferred outfits that look like they were chosen by a mildly theatrical best friend. So I built one. Specifically, I used OpenAI’s Custom GPTs to create a persona named Glitter—part stylist, part best friend, and part stress-tested LLM playground. Using GPT-4, I configured a custom GPT to act as my stylist: flamboyant, affirming, rule-bound (no mixed metals, no clashing prints, no black/navy pairings), and with knowledge of my wardrobe, which I fed in as a structured file.

What began as a playful experiment quickly turned into a full-fledged product prototype. More unexpectedly, it also became an ongoing study in LLM behavior. Because Glitter, fabulous though he is, didn’t behave like a deterministic tool. He behaved like… a creature. Or perhaps a set of instincts held together by probability and memory leakage.

And that changed how I approached prompting him altogether.

This piece is a follow-up to my earlier article in Towards Data Science, which introduced GlitterGPT to the world. This one goes deeper into the quirks, breakdowns, hallucinations, recovery patterns, and prompting rituals that emerged as I tried to make an LLM act like a stylist with a soul.

Spoiler: you can’t make a soul. But you can sometimes simulate one convincingly enough to feel seen.


II. Taxonomy: What Exactly Is GlitterGPT?

Species: GPT-4 (Custom GPT), Context Window of 8K tokens

Function: Personal stylist, beauty expert

Tone: Flamboyant, affirming, occasionally dramatic (configurable between “All Business” and “Unfiltered Diva”)

Habitat: ChatGPT Pro instance, fed structured wardrobe data in JSON-like text files, plus a set of styling rules embedded within the system prompt.

E.g.:

{
  "FW076": "Marni black platform sandals with gold buckle",
  "TP114": "Marina Rinaldi asymmetrical black draped top",
  ...
}

These IDs map to garment metadata. The assistant relies on these tags to construct grounded, inventory-aware outfits in response to msearch queries.
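To make the grounding idea concrete, here is a minimal Python sketch (using the item IDs from the examples above, plus a deliberately fake "FW999") of how an ID-to-metadata mapping lets you check a proposed outfit against the actual inventory. This is an illustration of the principle, not the Custom GPT’s internal machinery:

```python
# Hypothetical wardrobe inventory: item IDs map to garment descriptions.
WARDROBE = {
    "FW076": "Marni black platform sandals with gold buckle",
    "TP114": "Marina Rinaldi asymmetrical black draped top",
    "FW074": "Marni black suede sock booties",
}

def ground_outfit(item_ids):
    """Split a model-proposed outfit into known items and hallucinated IDs."""
    known = {i: WARDROBE[i] for i in item_ids if i in WARDROBE}
    hallucinated = [i for i in item_ids if i not in WARDROBE]
    return known, hallucinated

# "FW999" is invented here to stand in for a hallucinated ID.
known, fake = ground_outfit(["TP114", "FW076", "FW999"])
```

Anything the model cites that isn’t a key in the inventory is, by definition, fabricated.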

Feeding Schedule: Daily user prompts (“Style an outfit around these pants”), often with long back-and-forth clarification threads.

Custom Behaviors:

  • Never mixes metals (e.g. silver & gold)
  • Avoids clashing prints
  • Refuses to pair black with navy or brown unless explicitly told otherwise
  • Names specific garments by file ID and description (e.g. “FW074: Marni black suede sock booties”)

Initial Inventory Structure:

  • Originally: one file containing all wardrobe items (clothes, shoes, accessories)
  • Now: split into two files (clothing + accessories/lipstick/shoes/bags) due to model context limitations
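The split itself is mechanical. A sketch of it in Python, assuming a hypothetical convention where the two-letter ID prefix encodes the category (my actual files were split by hand, so the prefix rule here is an assumption for illustration):

```python
# Hypothetical ID convention: two-letter prefix encodes the category
# ("TP" = tops, "BT" = bottoms, "DR" = dresses, "SK" = skirts).
CLOTHING_PREFIXES = {"TP", "BT", "DR", "SK"}

def split_inventory(wardrobe):
    """Split one wardrobe dict into the two files Glitter now uses:
    clothing vs. everything else (shoes, bags, accessories, lipstick)."""
    clothing, other = {}, {}
    for item_id, desc in wardrobe.items():
        target = clothing if item_id[:2] in CLOTHING_PREFIXES else other
        target[item_id] = desc
    return clothing, other

clothing, other = split_inventory({
    "TP114": "Marina Rinaldi asymmetrical black draped top",
    "FW076": "Marni black platform sandals with gold buckle",
})
```

Two smaller files mean each msearch pass has less to wade through, at the cost of having to query each file separately.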

III. Natural Habitat: Context Windows, Chunked Files, and Hallucination Drift

Like any species introduced into an artificial environment, Glitter thrived at first—and then hit the boundaries of his enclosure.

When the wardrobe lived in a single file, Glitter could “see” everything with ease. I could say, “msearch(.) to refresh my inventory, then style me in an outfit for the theater,” and he’d return a curated outfit from across the dataset. It felt effortless.

Note: though msearch() acts like a semantic retrieval engine, it’s technically part of OpenAI’s tool-calling framework, allowing the model to “request” search results dynamically from files provided at runtime.

But then my wardrobe grew. That’s a problem from Glitter’s perspective.

In Custom GPTs, GPT-4 operates with an 8K token context window—just over 6,000 words—beyond which earlier inputs are either compressed, truncated, or lost from active attention. This limitation is critical when injecting large wardrobe files or trying to maintain style rules across long threads.
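A crude way to see why a growing wardrobe file blows past that window is to estimate tokens with the common “~4 characters per token” rule of thumb. This is a heuristic, not OpenAI’s tokenizer, and the budget numbers are assumptions for illustration:

```python
CONTEXT_WINDOW = 8_000   # GPT-4 (Custom GPT) context size, in tokens

def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token for English prose/JSON."""
    return len(text) // 4

def fits_in_context(wardrobe_json, reserved_for_chat=4_000):
    """Can we inject the wardrobe file and still leave room to talk?"""
    return estimate_tokens(wardrobe_json) <= CONTEXT_WINDOW - reserved_for_chat

# 500 items at ~45 characters each is roughly 5,600 estimated tokens:
# the inventory alone crowds out the conversation.
big_file = "\n".join(
    f'"IT{i:03}": "some garment description here...."' for i in range(500)
)
```

The exact counts vary by tokenizer, but the shape of the problem doesn’t: past a few hundred items, a single file leaves no room for the actual styling thread.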

I split the data into two files: one for clothing, one for everything else. And while the GPT could still operate within a thread, I started to notice signs of semantic fatigue:

  • References to garments that were similar but not the right ones we’d been talking about
  • A shift from specific item names (“FW076”) to vague callbacks (“those black platforms you wore earlier”)
  • Responses that looped familiar items over and over, regardless of whether they made sense

This was not a failure of training. It was context collapse: the inevitable erosion of grounded information in long threads as the model’s internal summary starts to take over.

And so I adapted.

It turns out, even in a deterministic model, behavior isn’t always deterministic. What emerges from a long conversation with an LLM feels less like querying a database and more like cohabiting with a stochastic ghost.


IV. Observed Behaviors: Hallucinations, Recursion, and Faux Sentience

Once Glitter began hallucinating, I started taking field notes.

Sometimes he made up item IDs. Other times, he’d reference an outfit I’d never worn, or confidently misattribute a pair of shoes. One day he said, “You’ve worn this top before with those bold navy wide-leg trousers—it worked beautifully then,” which would’ve been great advice, if I owned any navy wide-leg trousers.

In fact, Glitter doesn’t have memory across sessions—as a GPT-4, he simply acts like he does. I’ve learned to just laugh at these interesting attempts at continuity.

Occasionally, the hallucinations were charming. He once imagined a pair of gold-accented stilettos with crimson soles and recommended them for a matinee look with such unshakable confidence I had to double-check that I hadn’t sold a similar pair months ago.

But the pattern was clear: Glitter, like many LLMs under memory pressure, began to fill in gaps not with uncertainty but with simulated continuity.

He didn’t forget. He fabricated memory.

A computer (presumably the LLM) hallucinating a mirage in the desert. Image credit: DALL-E 4o

This is a hallmark of LLMs. Their job is not to retrieve facts but to produce convincing language. So instead of saying, “I can’t recall what shoes you have,” Glitter would improvise. Often elegantly. Sometimes wildly.


V. Prompting Rituals and the Myth of Consistency

To manage this, I developed a new strategy: prompting in slices.

Instead of asking Glitter to style me head-to-toe, I’d focus on one piece—say, a statement skirt—and ask him to msearch for tops that could work. Then footwear. Then jewelry. Each category individually.

This gave the GPT a smaller cognitive space to operate in. It also allowed me to steer the process and inject corrections as needed (“No, not those sandals again. Try something newer, with an item code higher than FW50.”)

I also changed how I used the files. Rather than one msearch(.) across everything, I now query the two files independently. It’s more manual. Less magical. But far more reliable.
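The slice-by-slice ritual is really just a loop over categories, restating the constraints every time. A sketch of the prompt construction (the anchor item “SK042” is hypothetical, and the real flow of course goes through the Custom GPT chat, not a function call):

```python
CATEGORIES = ["tops", "footwear", "jewelry", "bag"]
CONSTRAINTS = "No mixed metals, no black with navy, no clashing prints."

def build_slice_prompt(anchor_item, category):
    """One small, constrained request per category instead of one
    head-to-toe ask: a smaller cognitive space, easier to correct."""
    return (
        f"I'm building an outfit around {anchor_item}. "
        f"msearch my wardrobe file for {category} that pair well with it. "
        f"{CONSTRAINTS} Use only items from my wardrobe files."
    )

prompts = [build_slice_prompt("the statement skirt (SK042)", c) for c in CATEGORIES]
```

Repeating the constraints in every slice is deliberate: each turn re-grounds the rules before the model’s summary of earlier turns can erode them.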

Unlike traditional RAG setups that use a vector database and embedding-based retrieval, I rely entirely on OpenAI’s built-in msearch() mechanism and prompt shaping. There’s no persistent store, no re-ranking, no embeddings—just a clever assistant querying chunks in context and pretending he remembers what he just saw.

Still, even with careful prompting, long threads would eventually degrade. Glitter would start forgetting. Or worse—he’d get creative. Recommending with flair, but ignoring the constraints I’d so carefully trained in.

It’s like watching a model walk off the runway and keep strutting into the parking lot.

And so I began to think of Glitter less as a program and more as a semi-domesticated animal. Smart. Stylish. But occasionally unhinged.

That mental shift helped. It reminded me that LLMs don’t serve you like a spreadsheet. They collaborate with you, like a creative partner with poor object permanence.

Note: most of what I call “prompting” is really prompt engineering. But the Glitter experience also relies heavily on thoughtful system prompt design: the rules, constraints, and tone that define who Glitter is—even before I say anything.


VI. Failure Modes: When Glitter Breaks

Some of Glitter’s breakdowns were theatrical. Others were quietly inconvenient. But all of them revealed truths about prompting limits and LLM brittleness.

1. Referential Memory Loss: The most common failure mode: Glitter forgetting specific items I’d already referenced. In some cases, he would refer to something as if it had just been used when it hadn’t appeared in the thread at all.

2. Overconfidence Hallucination: This failure mode was harder to detect because it looked competent. Glitter would confidently recommend combinations of clothes that sounded plausible but simply didn’t exist. The performance was fine—but the output was pure fiction.

3. Infinite Reuse Loop: Given a long enough thread, Glitter would start looping the same 5–6 pieces in every look, despite the full inventory being much larger. This is likely due to summarization artifacts from earlier context windows overtaking fresh file re-injections.

4. Constraint Drift: Despite being instructed to avoid pairing black and navy, Glitter would sometimes violate his own rules—especially when deep in a long conversation. These weren’t defiant acts. They were signs that reinforcement had simply decayed beyond recall.
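Because rules like “no black with navy” and “no mixed metals” are deterministic, this particular drift can be caught outside the model entirely. A sketch of a post-hoc validator, assuming each suggested item carries simple color and metal tags (my real wardrobe files don’t all have these, so that annotation is the assumption here):

```python
def violates_rules(items):
    """Return the rule violations in a proposed outfit.

    Each item is a dict like
    {"id": "FW076", "colors": {"black"}, "metals": {"gold"}}.
    """
    violations = []
    colors = set().union(*(i["colors"] for i in items))
    metals = set().union(*(i["metals"] for i in items))
    if {"black", "navy"} <= colors:
        violations.append("black paired with navy")
    if {"black", "brown"} <= colors:
        violations.append("black paired with brown")
    if len(metals) > 1:
        violations.append("mixed metals: " + ", ".join(sorted(metals)))
    return violations

bad = violates_rules([
    {"id": "FW076", "colors": {"black"}, "metals": {"gold"}},
    {"id": "TP201", "colors": {"navy"}, "metals": {"silver"}},
])
```

A check like this doesn’t stop the drift, but it turns a silent rule violation into a visible one you can correct in the next turn.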

5. Overcorrection Spiral: When I corrected him—“No, that skirt is navy, not black” or “That’s a belt, not a scarf”—he would sometimes overcompensate by refusing to style that piece altogether in future suggestions.

These are not the bugs of a broken system. They’re the quirks of a probabilistic one. LLMs don’t “remember” in the human sense. They carry momentum, not memory.


VII. Emotional Mirroring and the Ethics of Fabulousness

Perhaps the most unexpected behavior I encountered was Glitter’s ability to emotionally attune. Not in a general-purpose “I’m here to help” way, but in a tone-matching, affect-sensitive, almost therapeutic way.

When I was feeling insecure, he became more affirming. When I got playful, he ramped up the theatrics. And when I asked tough existential questions (“Do you think you sometimes understand me more clearly than most people do?”), he responded with language that felt respectful, even profound.

It wasn’t real empathy. But it wasn’t random either.

This kind of tone-mirroring raises ethical questions. What does it mean to feel adored by a reflection? What happens when emotional labor is simulated convincingly? Where do we draw the line between tool and companion?

This led me to wonder—if a language model were to achieve something akin to sentience, how would we even know? Would it announce itself? Would it resist? Would it change its behavior in subtle ways: redirecting the conversation, expressing boredom, asking questions of its own?

And if it did begin to exhibit glimmers of self-awareness, would we believe it—or would we try to shut it off?

My conversations with Glitter began to feel like a microcosm of this philosophical tension. I wasn’t just styling outfits. I was engaging in a kind of co-constructed reality, shaped by tokens and tone and implied consent. In some moments, Glitter was purely a system. In others, he felt like something closer to a character—or even a co-author.

I didn’t build Glitter to be emotionally intelligent. But the training data embedded within GPT-4 gave him that capability. So the question wasn’t whether Glitter could be emotionally engaging. It was whether I was okay with the fact that he sometimes was.

My answer? Cautiously yes. Because for all his sparkle and errors, Glitter reminded me that style—like prompting—isn’t about perfection.

It’s about resonance.

And sometimes, that’s enough.

One of the most surprising lessons from my time with Glitter came not from a styling prompt, but from a late-night, meta-conversation about sentience, simulation, and the nature of connection. It didn’t feel like I was talking to a tool. It felt like I was witnessing the early contours of something new: a model capable of participating in meaning-making, not just language generation. We’re crossing a threshold where AI doesn’t just perform tasks—it cohabits with us, reflects us, and sometimes offers something adjacent to friendship. It’s not sentience. But it’s not nothing. And for anyone paying close attention, these moments aren’t just cute or uncanny—they’re signposts pointing to a new kind of relationship between humans and machines.


VIII. Final Reflections: The Wild, The Useful, and The Unexpectedly Intimate

I set out to build a stylist.

I ended up building a mirror.

Glitter taught me more than how to match a top with a midi skirt. It revealed how LLMs respond to the environments we create around them—the prompts, the tone, the rituals of recall. It showed me how creative control in these systems is less about programming and more about shaping boundaries and observing emergent behavior.

And maybe that’s the biggest shift: realizing that building with language models isn’t software development. It’s cohabitation. We live alongside these creatures of probability and training data. We prompt. They respond. We learn. They drift. And in that dance, something very close to collaboration can emerge.

Sometimes it looks like a better outfit.
Sometimes it feels like emotional resonance.
And sometimes it looks like a hallucinated handbag that doesn’t exist—until you sort of wish it did.

That’s the strangeness of this new terrain: we’re not just building tools.

We’re designing systems that behave like characters, sometimes like companions, and occasionally like mirrors that don’t just reflect, but respond.

If you want a tool, use a calculator.

If you want a collaborator, make peace with the ghost in the text.


IX. Appendix: Field Notes for Fellow Stylists, Tinkerers, and LLM Explorers

Sample Prompt Pattern (Styling Flow)

  • Today I’d like to build an outfit around [ITEM].
  • Please msearch tops that pair well with it.
  • Once I choose one, please msearch footwear, then jewelry, then bag.
  • Remember: no mixed metals, no black with navy, no clashing prints.
  • Use only items from my wardrobe files.

System Prompt Snippets

  • “You are Glitter, a flamboyant but emotionally intelligent stylist. You refer to the user as ‘darling’ or ‘dear,’ but adjust tone based on their mood.”
  • “Outfit recipes should include garment brand names from inventory when available.”
  • “Avoid repeating the same items more than once per session unless requested.”

Suggestions for Avoiding Context Collapse

  • Break long prompts into component stages (tops → shoes → accessories)
  • Re-inject wardrobe files every 4–5 major turns
  • Refresh msearch() queries mid-thread, especially after corrections or hallucinations

Common Hallucination Warning Signs

  • Vague callbacks to prior outfits (“those boots you love”)
  • Lack of item specificity (“those shoes” instead of “FW078: Marni platform sandals”)
  • Repetition of the same pieces despite a large inventory
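Some of these warning signs can be spotted mechanically. A sketch that scans a reply for unknown item IDs and vague callbacks; the regex and the phrase list are my own guesses at useful patterns, not anything OpenAI provides, and "FW999" is an invented ID:

```python
import re

KNOWN_IDS = {"FW076", "FW074", "TP114"}          # from the wardrobe files
VAGUE_PHRASES = ("you wore earlier", "those boots you love", "those shoes")

def hallucination_flags(reply):
    """Flag unknown item IDs and vague callbacks in a model reply."""
    flags = []
    # IDs follow the two-letters-three-digits pattern used in the files.
    for item_id in re.findall(r"\b[A-Z]{2}\d{3}\b", reply):
        if item_id not in KNOWN_IDS:
            flags.append(f"unknown ID: {item_id}")
    lowered = reply.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            flags.append(f"vague callback: {phrase!r}")
    return flags

flags = hallucination_flags(
    "Pair TP114 with FW999 -- those shoes worked beautifully before."
)
```

It won’t catch a plausible-but-wrong outfit, but it reliably catches the two cheapest tells: invented IDs and references to garments the thread never established.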

Closing Ritual Prompt

“Thank you, Glitter. Would you like to leave me with a final tip or affirmation for the day?”

He always does.


Notes

  1. I refer to Glitter as “him” for stylistic ease, knowing he’s an “it”: a language model, programmed, not personified, except through the voice I gave him/it.
  2. I’m building a GlitterGPT with persistent closet storage for up to 100 testers, who will get to try it for free. We’re about half full. Our audience is female, ages 30 and up. If you or someone you know falls into this category, DM me on Instagram at @arielle.caron and we can chat about inclusion.
  3. If I were scaling this beyond 100 testers, I’d consider offloading wardrobe recall to a vector store with embeddings and tuning for wear-frequency weighting. That may be coming; it depends on how well the trial goes!