Practice Makes Passing
My bachelor's in computer science was anything but easy. I vividly remember reaching a breaking point around the end of the tenth week of my first semester. With just a couple of weeks until my first final, I sat looking at Calc 1 practice problems, spiraling into despair. I'd always been good at math. I did all of the homework and paid attention in all of the lectures. So how could it be that I didn't even know where to begin? Why wasn't anything clicking?
I often joked with friends about dropping out of the program, even well into my final semester. Week 10 of Semester 1 was the only time I very seriously considered it.
It was January 2022, right on the heels of the COVID tech hiring boom. I'd tried my hand at frontend development and had a reasonably good grasp of React. None of the introductory math courses I was taking made any sense. Plenty of acquaintances and friends of friends had gotten cushy tech jobs without degrees, so why couldn't I? What use was knowing how to prove a function was continuous out in the real world?
Looking back, I understand that that was exactly what I was supposed to feel. That was when I actually decided to pursue my degree, not when I applied a year earlier. That feeling of impending doom was what lit a fire under me and drove me to study like a person possessed for the next few months.
To this day, I've never been happier to get back a grade than when I opened the scan of my graded Calc 1 exam to see "61/100" staring me back in the face: a passing grade with a cool margin of two points above failing. But all that mattered was that it was a passing grade, especially when almost half the students had failed the class, many for the second or third time.

By all accounts, my first semester of undergrad was rough. Yes, this was by design, and yes, I learned a lot from it, both in terms of the material itself and (mostly) about resilience and perseverance. But it took moving to Germany and starting my master's for me to realize how good I actually had it back then, at least in one particular regard.
The Human Training Data Problem
One of the biggest surprises at my new university was that past exams are much less of a thing here. For all of the stress and anxiety I had during my bachelor's, one thing I knew I could always count on was the existence of plentiful and easily accessible scans of past exams and exam-relevant problem sets, especially for introductory courses.
For Discrete Math, I solved dozens of past exams going back almost a decade. I distinctly remember warming up for Linear Algebra 1 with questions from the 1990s. This was so ingrained in the culture of my program that I completely took it for granted. The only reason I managed to pass Calc 1 (by the skin of my teeth) was that I had spent hours on end solving hundreds of questions from past exams.
I was so accustomed to exams from past years being available that skimming over them had become part of my process for vetting classes I was considering taking. This meant that my rude awakening came fairly early in my first semester of grad school, while I was trying to figure out my schedule.
So shocking was the revelation that I can map my response to the five stages of grief. At first, I was in denial, absolutely convinced that there must be some secret platform where all the past exams were hiding. Anger, bargaining, and depression soon followed. Acceptance never really came, but I was willing to postpone my concerns until finals drew closer at the end of the semester.
As my first two finals (on back-to-back days, no less) approached in a hurry, I found myself faced with what I like to call the human training data problem. Granted, the human brain and machines are (very!) substantially different. But I couldn't help but liken my situation to that of a machine learning model with insufficient training data. I was completely stumped on how to bridge the gap between lecture notes and potential exam questions.
My undergrad experience had given me insight into what human underfitting looks like, both at training time (studying) and at test time (on exam day). I vividly remember more than one class where, for one reason or another, I preferred an in-depth review of lecture slides or notes over solving practice problems.
This was an approach I quickly dropped during my freshman year, and for good reason: even in theory-heavy classes, it yielded disastrous results. Knowing the proofs of all 40 theorems the professor required was far less help in passing Linear Algebra 2 than practicing applying them to solve problems would have been. That's not to say an adequate grasp of the theory isn't essential; it absolutely is. But being able to recite the lecture notes by heart won't save you if you can't answer questions like the ones on the final.

And so, armed with hundreds of slides and a vague idea of the structure of each exam, I racked my brain for ways to avoid the pitfall of going in blind without any practice problems. Denial crept back in, and I desperately searched for past exams I knew didn't exist. Eventually, I shifted my attention from finding the Holy Grail to creating it myself.
Synthetic Training Data for Humans
Researchers at IBM define synthetic data as "information that's been generated on a computer to augment or replace real data to improve AI models" [1]. It has many advantages, from mitigating privacy concerns to cutting costs, which has led to its widespread adoption for uses as varied as tooling for financial institutions [1] and 3D content generation [2].
In my case, the motivation was simple: the real-world (human) training data I needed to study just wasn't available in the wild.
Of course, using synthetic data only makes sense if that data accurately imitates the data our trained model will encounter in the real world. I knew I had to be very intentional about how I generated the mock exams I wanted to use. Just telling Claude to write a practice test or two wouldn't cut it, even if I gave it all the slides and material I had to work with. Only when setting out to write an exam does one realize how many choices there are to be made, well beyond what's in and what's out in terms of the material.
Luckily, I wasn't flying blind on that front. For one class, I had information about the exam's structure and the kinds of questions on it from students who had taken it the year prior. For the other, the professor provided a breakdown of the exam into sections and a small handful of open-ended review questions.
Both classes had Q&A sessions after their respective final lectures. I paid special attention to anything that seemed like a hint as to what might be asked, which later proved very helpful.
Easy Mode: Replicating a Template
The first exam was straightforward, since I had far more to work with. It also had a reputation for being relatively formulaic. I gave Claude the example questions and structure I had and asked it to stick to the same style.
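To give a rough idea of what that kind of request involves, here is a minimal sketch of how it could look if driven through the Anthropic Python SDK rather than the chat interface I actually used. The file names, model ID, and prompt wording are illustrative assumptions, not a record of my exact prompts.

```python
# Minimal sketch: generate a mock exam that sticks to a known template.
# Assumes ANTHROPIC_API_KEY is set; file names and the model ID are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

slides = Path("lecture_slides.md").read_text()                  # hypothetical export of the course slides
template = Path("exam_structure_and_examples.md").read_text()   # known structure plus sample questions

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever Claude model is current
    max_tokens=4000,
    system=(
        "You write university mock exams. Follow the provided structure exactly: "
        "same sections, same question types, same point distribution. Vary the "
        "numbers and scenarios so the questions are novel but stylistically identical."
    ),
    messages=[{
        "role": "user",
        "content": (
            f"Course material:\n{slides}\n\n"
            f"Exam template and example questions:\n{template}\n\n"
            "Generate one complete mock exam in LaTeX, followed by a separate solution key."
        ),
    }],
)

print(response.content[0].text)
```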
Many of the questions lent themselves nicely to slight changes that made them novel enough to be worth solving for practice without straying too far from what was typical for the actual exam. Apart from a couple of LaTeX formatting hiccups, which were fairly easily resolved, it was smooth sailing.
To insure myself against any surprises, I also had it generate some trickier questions based on the lecture slides and my notes from the Q&A session. Though nothing unexpected was asked in the end, doing some targeted review tailored to my own personal blind spots was a great confidence booster.
Although I definitely would have been able to study for the first exam without the help of LLMs, I still felt like I gained a lot by using Claude. I can absolutely imagine how helpful it would have been for some of the newer or more advanced courses I took in undergrad, where there were only a small handful of past exams available.
Hard Mode: Building from Scratch
The second exam was a much tougher nut to crack. To begin with, the breadth of the material was much wider. Secondly, the slides only very loosely reflected what was discussed in class. Most importantly, there was far less information available on what the exam would look like. What details there were turned out to be hard to find and vague.
The first two concerns were at least partially mitigated by the fact that I had made an effort to take comprehensive notes throughout the semester. As for hints on the structure and form of the exam, I scoured every possible platform and picked up anything that seemed even remotely relevant. In that vein, the Q&A session ended up being a godsend. Transcribing the professor's answers and comments left me with a significantly better (albeit still incomplete) idea of what to expect.
Admittedly, I was initially pessimistic about the prospect of Claude being able to generate mock exams of much value. Though I had used it fairly extensively for guided material review, I had my doubts about how it would fare with the uncertainty at play. Still, I gave it everything I knew about the exam and hoped for the best.
I was pleasantly surprised by the results. Although the first few attempts produced exams that didn't feel quite right, the core did seem promising. They did appear to adequately cover the material and to be difficult enough. After some back and forth, Claude began generating tests that I could have been convinced were real.

I solved the improved tests and asked Claude to correct my solutions. The very act of solving practice tests made me feel great about my grasp of the material. Claude's usual sycophancy was the cherry on top. (It did point out mistakes, but it was exceptionally soft on deducting points and overly excited about correct answers.) Ultimately, however, I wouldn't know how well Claude had done training me until test time. With the fateful day fast approaching, I hoped for the best.
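For the curious, the grading step can be sketched in the same way. The strictness instructions below are one way to push back against the leniency just described; as before, the file names and model ID are illustrative assumptions rather than what I actually typed.

```python
# Minimal sketch: have Claude grade written answers against a mock exam.
# File names and the model ID are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

mock_exam = Path("mock_exam_3.tex").read_text()   # a previously generated mock exam
my_answers = Path("my_answers_3.md").read_text()  # typed-up answers to that exam

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=2000,
    system=(
        "You are a strict examiner. Grade each answer against the exam, deduct points "
        "for every error or missing justification, and award no sympathy points. Report "
        "a per-question score, the total, and a one-line reason for each deduction."
    ),
    messages=[{
        "role": "user",
        "content": f"Exam:\n{mock_exam}\n\nStudent answers:\n{my_answers}",
    }],
)

print(response.content[0].text)
```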
Generalizing to Test Data and Preventing Dataset Pollution
When Synthetic Data Alone Doesn’t Cut It
While synthetic data definitely has its advantages, it has a critical drawback. What a model learns from synthetic data will, at best, model the world from which that data is drawn. That simulated world can diverge from reality in ways we're completely unaware of until it's too late [3].
As Dani Shanley puts it in "Synthetic data, real harm":
“… just as generative AI models can produce plausible (but false) text or images, synthetic data generators may create datasets that appear statistically valid, while introducing subtle, hard-to-catch distortions and artificial patterns, or missing crucial real-world complexities.” [3]
Shanley also draws attention to the hidden and disproportionate impact of the individuals tasked with synthesizing data on how models ultimately behave. Largely arbitrary decisions on their part could have significant, possibly harmful, downstream effects [3].
I saw this impact in action while studying for my second exam. Slowly but surely, I had unintentionally skewed Claude's outputs based on my personal interpretation of what the professor had said. My gut feeling about what the exam would look like became the arbiter of which questions were relevant and which weren't.
It also became clearer as time went on that my training dataset was veering ever further into a biased take on reality. After the sixth mock exam, it was obvious that Claude had simply settled on a fixed set of several dozen questions.
Even when prompted to introduce more variety, every output from there on out was just some cobbling together of questions I had already seen. Granted, these did include many key questions that were heavily implied to appear on the actual exam.
On test day, I was shocked at how much the exam resembled those I had solved for practice. The gimmes the professor had hinted at were indeed there, but so were an impressive number of non-trivial questions I had solved while studying. Roughly 60% of the questions were identical or very similar to ones I had practiced. Most of the rest were on topics I had at least touched on.
Nonetheless, one part of the exam ended up being a big blind spot. It was a section on topics we had discussed only briefly at the start of the semester. While studying, I had been unreasonably confident in swiftly dismissing certain kinds of questions, be it because they seemed uncharacteristic (e.g., too mathematical) or because they were about things I had deemed too insignificant to include in the notes I took in class.
Unfortunately, those turned out to be exactly the kinds of questions asked in that section. Some were about topics that appeared on only a single slide all semester. Others were deeply technical in a way I just didn't expect. Though I did my best to answer them, I hadn't trained my mental model on data that would enable it to generalize to those questions well enough.
The pill was all the more bitter to swallow because the kinds of questions I struggled with were ones Claude had included in its first attempts at mock exams. These were precisely the ones I did away with early on based on little more than hunches.
In this case, the slip-up was far from catastrophic. In my view, it wasn't even close to undoing the benefits of studying with synthetic mock exams. Even so, it serves as a cautionary tale that hearkens back to Shanley's warnings about how synthetic data can insidiously exacerbate model subjectivity and bias [3].
Overcoming Overfitting: Making the Best of Synthetic Human Training Data
For many real-world applications, a synthetic dataset that yields a model with only 60% accuracy would probably be considered next to useless. With sufficient real-world data (i.e., actual past exams), there is no doubt in my mind that 90%+ accuracy would be achievable.
To be fair, though, the (human) model under consideration has flaws that machines don't and is, in some ways, much harder to train. I can say with confidence that that 60% almost certainly surpasses the accuracy of any other method I could have attempted.
I'll absolutely stick with this method for future exams, with three key takeaways I plan to implement:
- Separate chats are the way to go. The feedback loop that led Claude to converge on specific questions undoubtedly had a lot to do with me running the entire cycle of generating tests and checking answers in one big, long context. This meant any new mock exam was directly based on all of the previous ones. Beyond that, Claude tried to be helpful by tailoring the questions to what it thought were my weak spots, leading it to become even more entrenched in what it thought ought to be asked. General context rot(1) was also probably a significant factor. (A minimal sketch of what separate, fresh contexts look like in practice follows this list.)
- Keep an open mind. As mentioned above, the major blind spot I developed was largely the result of putting too much stock in my subjective assessment of what material would or should make the cut. Instead of challenging my assumptions and devoting some time to covering minor topics that seemed like long shots, I leaned into my biases.
- Augment with real-world training data! This is, of course, easier said than done. It somewhat contradicts the very premise of this article. But what you can do as a student (or as an educator) is enrich the bank of known questions for future students. I managed to remember most of the questions on my second exam and document them for future students to use when studying.
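To make the first takeaway concrete, here is a rough sketch of what "separate chats" means in API terms: each mock exam is generated in its own fresh, single-turn conversation, so no exam is conditioned on the ones that came before it. The file names, model ID, and prompt wording are again illustrative assumptions.

```python
# Minimal sketch: generate several mock exams, each in its own fresh context,
# so earlier outputs cannot bias later ones. Names and the model ID are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

material = Path("notes_and_slides.md").read_text()              # consolidated notes and slides
exam_brief = Path("what_i_know_about_the_exam.md").read_text()  # structure hints, Q&A transcript, etc.

for i in range(1, 6):
    # A brand-new, single-turn conversation on every iteration: no shared history.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=4000,
        system="You write university mock exams matching the described structure and difficulty.",
        messages=[{
            "role": "user",
            "content": (
                f"Known exam details:\n{exam_brief}\n\n"
                f"Course material:\n{material}\n\n"
                f"Write mock exam #{i}. Cover the material broadly; do not overemphasize any single topic."
            ),
        }],
    )
    Path(f"mock_exam_{i}.tex").write_text(response.content[0].text)
```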
Afterword: My Thoughts on LLMs as a Learning Aid
The elephant in the room is that none of the exam preparation workflow I described would have been even remotely feasible when I began my bachelor's in late 2021. Perhaps that's what made the process feel almost magical to me.
I remember wishing I had a way to automatically check and correct my answers on mock exams when studying in my freshman year. If you had told me back then that an AI tool, let alone a free one, would be able to do that (however imperfectly) in 2026, I would have thought you were crazy.
Much has been written about the new problems LLMs have brought about. Many of the points that have been made are especially relevant to students. And indeed, I can't argue that claims like "AI is making people dumber" are completely unfounded. I've seen firsthand how these tools let a person outsource thinking and eliminate any mental discomfort. For an ever-growing range of complex tasks, they represent the ultimate shortcut [4].
Concerningly, I believe people who resist the temptation to take those shortcuts are increasingly being penalized, at least in the short run. A friend who was the only one not to vibe-code assignments in a certain class comes to mind. Others cruised to perfect grades on their homework despite threats that AI-generated submissions would supposedly be rejected. He put in the work and ended up being docked significant points for minor errors, with little in the way of constructive feedback or recourse.
Still, in the long run, it's a well-established fact that growth, in its myriad forms, entails some form of stress. One of those forms is learning, and the necessary stress comes in the form of active engagement with the material. Few things are more rewarding, in my opinion, than the lightbulb moment of finally understanding a difficult concept after fighting with it for hours or days. Experiencing such moments with Fourier series, reductions, metric spaces, and many other concepts was a major part of what led me to choose to pursue a master's degree in the field.
LLMs undoubtedly enable would-be learners to deprive themselves of this stress and, in turn, of actual learning. Often, though, I think too little attention is paid to the other side of the coin:
Having experienced higher education both pre- and post-ChatGPT, I feel enormously fortunate to have tools like Claude and Gemini at my fingertips. Their utility for exam preparation was just the tip of the iceberg. It felt like my productivity was boosted tenfold throughout the semester. Things clicked much faster than they ever would have otherwise. LLMs were a game changer for everything from strategy (when and how to study what) to reviewing slides and notes to developing real curiosity and interest in the material.
To summarize with a platitude: "With great power comes great responsibility." LLMs are what you make of them. With the right approach, they'll coach you to tackle the heavy lifting instead of doing it for you.
If you enjoyed this article, please consider following me on LinkedIn to keep up with future articles and projects.
Footnotes
(1) Anthropic defines context rot as the phenomenon whereby "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." [5]
References
[1] K. Martineau and R. Feris, "What is synthetic data?," IBM Research Blog, Feb. 7, 2023. https://research.ibm.com/blog/what-is-synthetic-data.
[2] Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, "MVDream: Multi-view diffusion for 3D generation," arXiv preprint arXiv:2308.16512, 2023. https://doi.org/10.48550/arXiv.2308.16512.
[3] D. Shanley, "Synthetic data, real harm," Ada Lovelace Institute Blog, Sep. 18, 2025. https://www.adalovelaceinstitute.org/blog/synthetic-data-real-harm/.
[4] S. Bogdanov, "In the long run, LLMs make us dumber," desunit.com, Aug. 12, 2025. https://desunit.com/blog/in-the-long-run-llms-make-us-dumber/.
[5] P. Rajasekaran, E. Dixon, C. Ryan, and J. Hadfield, "Effective context engineering for AI agents," Anthropic Engineering Blog, Sep. 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
