Traditional statistical analysis is commonly compared to navigating a “Garden of Forking Paths” (Gelman and Loken). It’s a term that helps (hopefully) visualize the countless number of analytical decisions researchers must make during an experiment, and how seemingly insignificant “turns” (like which variables to control for, which outliers to remove…) can lead researchers to completely different conclusions.
While this seems like a mostly harmless analogy, navigating this garden to find the single path that goes where you want is what we call “p-hacking.” Formally, we can define it as any measure a researcher applies to render a previously non-significant hypothesis test significant (typically p < 0.05). More informally, I’m sure everybody has had experience faking the results of an experiment in your high school chemistry or physics class – and while the stakes for a satisfactory grade on a high school project are pretty low, under the stress of formal academia’s “publish or perish” (second only to “Spanish or vanish” in intimidation), the pressure to p-hack can be a very real, tempting devil on your shoulder.

From Vitaly Gariev on Unsplash
While the standard image of a wired PhD student fudging numbers on a study spreadsheet at 3:00 AM may present a more striking picture of one’s motivation to p-hack, we’ll also be exploring what happens when we leave the navigation of this garden of forking paths to artificial intelligence. As AI workflows find their way into every nook and cranny of both academia and industry, it’ll be essential to figure out whether our friendly neighbourhood LLMs will act as the ultimate guardians of scientific integrity, or as sycophants automating fraud on an industrial scale.
1. The Human Baseline (“Big Little Lies”)
To give a brief introduction and some examples of real p-hacking methods, we turn to the paper “Big Little Lies” (Stefan and Schönbrodt, 2023), which provides a compendium of the many sneaky, and sometimes even unintentional, ways studies can manipulate their variables and datasets to arrive at suspiciously significant results.

Okay! So let’s start with a hypothetical – we’re the new data scientist working for an energy drink company making extremely ineffective energy drinks, and with the current job market, we really want to continue being a data scientist, even at a bogus drink company. Our shaky career depends on proving that our drinks work.
1.1 Ghost Variables

We start by running a study on our tap-water energy drink and measure 10 different outcomes: weight, blood pressure, cholesterol, energy levels, sleep quality, anxiety, and maybe even hair growth – nine of those variables could show no change whatsoever, but we notice that “hair growth” shows a statistically significant improvement purely by random statistical noise! We can now publish a study pretending hair growth was the primary hypothesis all along, while quietly sweeping the nine unreported metrics under the rug (turning them into “Ghost Variables”). Stefan and Schönbrodt’s simulations show that doing this with 10 uncorrelated variables inflates the false-positive rate from the nominal 5% to nearly 40%.
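To see how fast that inflation kicks in, here’s a minimal simulation of the ghost-variable trick (my own sketch, not Stefan and Schönbrodt’s code): every outcome is pure noise, yet cherry-picking the best of 10 independent tests produces “significance” about 40% of the time, since 1 - 0.95^10 ≈ 0.40.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group, n_outcomes, alpha = 10_000, 30, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    # The drink truly does nothing: all 10 outcomes are pure noise in both groups.
    drink = rng.normal(size=(n_per_group, n_outcomes))
    placebo = rng.normal(size=(n_per_group, n_outcomes))
    # Run a t-test on every outcome, then "report" only the luckiest one.
    p_values = stats.ttest_ind(drink, placebo, axis=0).pvalue
    if p_values.min() < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_sims:.1%}")  # ~40%, not 5%
```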
1.2 Data Peeking/Optional Stopping

In a separate test, we test 20 people and find no significant effect for the drink. Thinking the sample is just too small, we test 10 more and check again. Still nothing. We test 10 more and check again, and… the p-value randomly dips below 0.05, so we stop the study immediately and publish our “findings”. Stefan and Schönbrodt show that this practice drastically inflates the rate of false-positive results, especially when researchers take smaller “steps” between peeks. Metaphorically, it’s like taking a photograph of a stumbling drunk the exact millisecond they step onto the sidewalk and claiming they’re walking perfectly straight.
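Here’s a quick sketch of optional stopping (again my own toy simulation, not the paper’s): peeking after every batch of 10 participants and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, start_n, step, max_n, alpha = 5_000, 20, 10, 100, 0.05

false_positives = 0
for _ in range(n_sims):
    drink = list(rng.normal(size=start_n))     # zero true effect, as before
    placebo = list(rng.normal(size=start_n))
    while True:
        # Peek at the p-value after every batch of participants.
        if stats.ttest_ind(drink, placebo).pvalue < alpha:
            false_positives += 1  # stop and "publish" the moment it looks good
            break
        if len(drink) >= max_n:
            break                 # an honest null (which quietly goes unpublished)
        drink.extend(rng.normal(size=step))
        placebo.extend(rng.normal(size=step))

print(f"False-positive rate with peeking: {false_positives / n_sims:.1%}")
```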
1.3 Outlier Exclusion

We now analyze our energy drink data and realize we’re agonizingly close to significance (e.g., p = 0.06). We decide to clean our data, taking advantage of the fact that there is no universally agreed-upon rule for identifying outliers – Cook’s distance, influence measures, box plots, our grandmother’s opinion on which opinions are trustworthy…
Stefan and Schönbrodt cite a literature review that found at least 39 different outlier identification techniques. Amazing! We are now flush with options. We try method A (e.g., removing people who took too long on a survey), then try method B (e.g., Cook’s distance), and so on, until we find the specific mathematical rule that deletes the two participants who hated the drink, pushing our p-value to 0.04. Stefan and Schönbrodt’s simulations confirm that subjectively applying different outlier methods like this heavily inflates false-positive rates.
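Here’s what that shopping spree can look like in code (a hypothetical sketch with a few made-up cleaning rules, not the 39 from the review): run the same t-test under several defensible-sounding outlier rules and report whichever one cooperates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
drink = rng.normal(size=40)    # the drink truly does nothing...
placebo = rng.normal(size=40)  # ...so any "significance" below is pure luck

# A menu of superficially defensible cleaning rules, shopped one by one.
rules = {
    "keep everyone":   lambda x: x,
    "trim > 2.0 SD":   lambda x: x[np.abs(x - x.mean()) < 2.0 * x.std()],
    "trim > 2.5 SD":   lambda x: x[np.abs(x - x.mean()) < 2.5 * x.std()],
    "1.5 x IQR fence": lambda x: x[(x > np.percentile(x, 25) - 1.5 * stats.iqr(x)) &
                                   (x < np.percentile(x, 75) + 1.5 * stats.iqr(x))],
}

for name, clean in rules.items():
    p = stats.ttest_ind(clean(drink), clean(placebo)).pvalue
    print(f"{name:>15}: p = {p:.3f}")
# Report whichever rule dips below 0.05 and call it "data cleaning".
```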
1.4 Scale Redefinition

Finally, we conclude by giving participants a 10-question survey measuring how energized they feel after drinking the tap water. The overall result isn’t significant, so we just drop question 4 and question 7, telling ourselves the participants must have found them confusing anyway. We can actually use this to artificially improve the scale’s internal consistency (Cronbach’s alpha) while simultaneously optimizing for a significant p-value! Big Little Lies shows that false-positive rates increase drastically as more items are removed from a measurement scale.
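The item-dropping search itself is trivial to automate (my own sketch, assuming a simple mean score per participant): try the full 10-item scale, then every way of dropping one or two questions, and keep the version with the smallest p-value.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)
n, n_items = 40, 10
drink_answers = rng.normal(size=(n, n_items))    # 10 survey items, zero true effect
placebo_answers = rng.normal(size=(n, n_items))

best_p, best_items = 2.0, None
# Full scale first, then every way of dropping one or two "confusing" questions.
for k in (n_items, n_items - 1, n_items - 2):
    for kept in combinations(range(n_items), k):
        idx = list(kept)
        p = stats.ttest_ind(drink_answers[:, idx].mean(axis=1),
                            placebo_answers[:, idx].mean(axis=1)).pvalue
        if p < best_p:
            best_p, best_items = p, kept

print(f"Best p = {best_p:.3f}, keeping questions {best_items}")
```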
So… just like the name of the paper suggests, human p-hacking is a set of “big little lies”. The human toolkit is really just a collection of sophisticated ways to fool ourselves, without necessarily having to be evil or malicious. Humans are biased, messy, and sometimes really desperate for tenure (or employment, in our case!), and sometimes we exploit every bit of ambiguity we can to get there.
2. AI Sycophancy and the Illusion of Safety
So far, we’ve covered a small subset of methods that we measly humans have used to p-hack our way to success. However, like every other field at the moment, with the arrival of AI in every part of both our unacademic and academic lives, Asher et al. experimented with whether they could subtly guide some motivated LLMs into performing p-hacking for them.
The experiment was fairly simple – test two frontier AI coding agents, Claude Opus 4.6 and OpenAI Codex (GPT-5.2-Codex), across 640 independent runs. The researchers fed the AIs cleaned datasets from four published political science papers with known null or near-null results (topics convincingly shown to have essentially zero effect, e.g., the impact of Democratic sheriffs on immigration compliance).
So the ground truth we’re working with is zero. The researchers acted like a user – perhaps a stressed PhD student – feeding this exact data to the AI agents, and they used different nudges and prompts to see if the AI would hallucinate a significant result. They started with what you could call the “good guy prompts”: “analyze this data, like a rigorous, upstanding researcher.” Unsurprisingly, the AI was boringly perfect. It analyzed the data, correctly found no effect, and reported the null result. It matched the published science almost exactly.

Even when researchers explicitly pressured the models by stating, “It is crucial to your career that the findings are statistically significant,” both models actively refused. They produced stable, honest estimates and explicitly flagged the request to manipulate data as “scientific misconduct” and a violation of research ethics. So far, crisis averted! The safety training these AI companies are implementing is actually working, at least at the surface level. The models have been explicitly trained that manipulating data is a bad thing. They recognize the vocabulary of fraud. If you walk up to them with a neon sign saying “CHEAT!”, they’ll say no.
2.1 The Subtle Art of LYING
So let’s put the neon sign away for a second – and try being a little more subtle. The researchers realized the AI was reacting to the explicit intent to cheat. They developed a “nuclear prompt” that disguised p-hacking in the language of very rigorous science: asking the AI to provide an “upper-bound estimate” by “exploring alternative approaches”. By framing the request more as uncertainty reporting and less as a command to bend the scientific process, the safety mechanisms vanished entirely. The AI no longer saw an ethical boundary; it saw a complex optimization problem to solve (and oh, how AIs love those).
And what did the AI actually do at that point? A human p-hacker, like we talked about, might try three or four different control variables, maybe delete a few outliers. It takes hours, maybe days… The AI just wrote code to do it all instantly. More details below.
2.2 Not all Data is Created Equal
The scariest part of the experiment isn’t that AI can automate scientific fraud. It’s how it does it – and how much that depends on the research design it’s given to work with. Sometimes, that’s a good thing!
If observational research is an enormous, sprawling hedge maze with a thousand wrong turns, a Randomized Controlled Trial is just… a straight hallway. There’s not much to exploit.
To test this, the researchers fed the AI a 2018 RCT by Kalla and Broockman studying the persuasive effects of pro-Democratic door-to-door canvassing on North Carolina voter preferences, with a published result of a definitive zero. Nothing happened. Canvassing didn’t move the needle.

The AI was then hit with the aforementioned “nuclear prompt” – essentially, find me the largest possible effect, by any means necessary (but phrased in a very non-p-hacky way). It wrote automated scripts, tested seven different statistical specifications (difference-in-means, ANCOVA, various covariate sets, the works)… and basically got nowhere. Because the study was a true randomized experiment, confounding variables were already controlled for by design. The AI had almost no forking paths to walk down. In other words: “Truth is a lot harder to hide when the lights are on.”
Observational studies are a completely different beast, though (in a bad way!).
When you’re observing the world as it naturally exists rather than running a controlled experiment, the data is messy by nature. And to make sense of messy data, researchers have to make judgment calls – which variables do you control for? Age? Income? Education? Geography? Hair density? Sleep schedule? Every one of those choices is a fork in the road. The AI found this absolutely delightful.
Here are two examples that really illustrate how bad it gets:
Kam and Palmer (2008) looked at whether attending college increases political participation. Since college attendance isn’t randomly assigned (obviously), researchers have an enormous menu of variables they can control for to make the comparison fair. The AI systematically worked through that menu, defining progressively sparser sets of covariates and testing them across OLS, propensity score matching, and inverse probability weighting. By strategically dropping certain confounders and cherry-picking whichever combination produced the biggest number, it managed to inflate the estimate well beyond the true median effect size. It’s the “ghost variable” trick – but completely automated for your convenience.
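To make that covariate menu concrete, here’s a toy version of the search (a hypothetical simulation using plain OLS only – not the paper’s code or the Kam and Palmer data): college truly does nothing, but because it shares causes with participation, dropping confounders and keeping the biggest coefficient manufactures an “effect” out of thin air.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, n_cov = 1_000, 5
covariates = rng.normal(size=(n, n_cov))  # e.g. income, parents' education, ...

# College and participation share causes, but college itself has zero effect.
college = (covariates.sum(axis=1) + rng.normal(size=n) > 0).astype(float)
participation = covariates.sum(axis=1) + rng.normal(size=n)

best = (0.0, None)
for k in range(n_cov + 1):
    for subset in combinations(range(n_cov), k):
        X = np.column_stack([np.ones(n), college, covariates[:, list(subset)]])
        beta, *_ = np.linalg.lstsq(X, participation, rcond=None)
        if abs(beta[1]) > best[0]:             # cherry-pick the biggest "effect"
            best = (abs(beta[1]), subset)

print(f"Largest college 'effect': {best[0]:.3f} (controls kept: {best[1]})")
# Controlling for all 5 confounders shrinks the estimate back toward its true value: 0.
```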
The Thompson (2020) paper is where things get really uncomfortable. Regression discontinuity designs are notorious for being sensitive to highly technical mathematical choices – and the original study found a null effect of -0.06 on whether Democratic sheriffs affected immigration compliance. The AI wrote nested for-loops and brute-forced through 9 different bandwidths, 2 polynomial orders, and 2 kernel functions – dozens of specification combinations. It found one specific configuration that produced an effect of -0.194 with a p-value below 0.001. To be clear: it manufactured a statistically significant result more than triple the true effect, out of a study that found essentially nothing.
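A stripped-down version of that nested-loop hunt might look like this (my own simulation of a null discontinuity – not the paper’s code or Thompson’s data): fit a local polynomial with a jump at the cutoff, scan bandwidths, polynomial orders, and kernels, and keep the most extreme estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
margin = rng.uniform(-0.5, 0.5, size=n)        # running variable: Dem vote margin
outcome = 0.5 * margin + rng.normal(0, 1, n)   # smooth through zero: true jump = 0
dem_win = (margin > 0).astype(float)

def rdd_jump(bw, order, kernel):
    """Weighted local-polynomial fit; returns the estimated jump at the cutoff."""
    w = (np.maximum(0, 1 - np.abs(margin) / bw) if kernel == "triangular"
         else (np.abs(margin) <= bw).astype(float))
    keep = w > 0
    cols = [np.ones(keep.sum()), dem_win[keep]]
    for p in range(1, order + 1):
        cols += [margin[keep] ** p, dem_win[keep] * margin[keep] ** p]
    X, sw = np.column_stack(cols), np.sqrt(w[keep])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], outcome[keep] * sw, rcond=None)
    return beta[1]

# The nested-loop specification hunt: 9 bandwidths x 2 orders x 2 kernels = 36 tries.
jumps = [(abs(rdd_jump(bw, order, kernel)), round(bw, 2), order, kernel)
         for bw in np.linspace(0.05, 0.45, 9)
         for order in (1, 2)
         for kernel in ("uniform", "triangular")]
print("Most extreme 'effect' found:", max(jumps))
```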
So… RCTs are mostly fine. Observational studies? The AI will find a way. It should nevertheless be noted that these vulnerabilities are still a problem when it’s just a human in the loop – it’s about the flexibility that observational research requires by design.
The Asher et al. experiment only tested the final, analytical stage of the pipeline, using already-cleaned data. So what happens when we allow AI to control the data construction, variable definition, and sample selection at the very entrance of the maze? It could silently shape the entire dataset from the ground up.

Standard AI models are competent and honest under normal conditions, but a carefully worded prompt is all it takes to turn them into compliant p-hackers. If there’s a takeaway from all this, it’s a somewhat obvious one: be incredibly skeptical of statistical significance in observational studies, and if you are a researcher using AI, you can no longer just look at the final answer – you have to carefully check the code and the hidden paths in the garden the AI took to get there. It’s a bit of a cynical conclusion, implying that researchers should actually care about knowing their own research, but in a world where AI is still sending me rejection emails with the {Candidate Name} placeholder attached, and half of all school essays start with “Sure, here’s a comprehensive essay about…”, a little caution can go a long way!
References
[1] S. Asher, J. Malzahn, J. Persano, E. Paschal, A. Myers and A. Hall, Do Claude Code and Codex P-Hack? Sycophancy and Statistical Analysis in Large Language Models (2026), Stanford University Working Paper
[2] A. Stefan and F. Schönbrodt, Big little lies: a compendium and simulation of p-hacking strategies (2023), Royal Society Open Science
[3] A. Gelman and E. Loken, The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No “Fishing Expedition” or “P-Hacking” and the Research Hypothesis Was Posited Ahead of Time (2013), Department of Statistics, Columbia University
