It’s 11:14 PM on a Wednesday. You’re three weeks into a churn prediction model, hunched over a laptop, watching a Bayesian optimization sweep crawl through its 200th trial. The validation AUC ticks from 0.847 to 0.849. You screenshot it. You post it in Slack. Your manager reacts with a thumbs-up.
You feel productive. You are not.
If you’ve ever spent days squeezing fractions of a percent out of a Machine Learning (ML) metric while a quiet voice in the back of your head whispered that none of it matters, you already sense the problem. That voice is right. And silencing it with another grid search is one of the most expensive habits in the profession.
Here’s the uncomfortable math: more than 80% of Artificial Intelligence (AI) projects fail, according to RAND Corporation research published in 2024. The primary root cause isn’t bad models. It isn’t insufficient data. It’s misunderstanding (or miscommunicating) what problem needs to be solved. Not a modeling failure. A framing failure.
This article gives you a concrete protocol to catch that failure before you write a single line of training code. Five steps. Each one takes a conversation, not a GPU cluster.
“All that progress in algorithms means it’s actually time to spend more time on the data.” Andrew Ng didn’t say spend more time on the model. He said the opposite.
The Productive Procrastination Trap
Hyperparameter tuning feels like engineering. You have a search space. You have an objective function. You iterate, measure, improve. The feedback loop is tight (minutes to hours), the progress is visible (metrics go up), and the work is legible to your team (“I improved AUC by 2 points”).
Problem framing feels like stalling. You sit in a room with business stakeholders who use imprecise language. You ask questions that don’t have clean answers. There’s no metric ticking upward. No Slack screenshot to post. Your manager asks what you did today and you say, “I spent 4 hours figuring out whether we should predict churn or predict reactivation likelihood.” That answer doesn’t sound like progress.
But it is the only progress that matters.
Image by the author.
The reason is structural. Tuning operates within the problem as defined. If the problem is defined wrong, tuning optimizes a function that doesn’t map to business value. You get a beautiful model that solves the wrong thing. And no amount of Optuna sweeps can fix a target variable that shouldn’t exist.
Zillow Bet $500 Million on the Wrong Problem
In 2021, Zillow shut down its home-buying division, Zillow Offers, after losing over $500 million. The company had acquired roughly 7,000 homes across 25 metro areas, consistently overpaying because its pricing algorithm (the Zestimate) didn’t adjust to a cooling market.
The post-mortems focused on concept drift. The model, trained on hot-market data, couldn’t keep up as demand slowed. Contractor shortages during COVID delayed renovations. The feedback loop between purchase and resale was too slow to catch the error.
But the deeper failure happened before any model was trained.
Zillow framed the problem as “predict home value.” That framing assumed a stable relationship between features and price. It assumed Zillow could renovate and resell fast enough that the prediction window stayed short. It assumed the model’s error distribution was symmetric (overpaying and underpaying equally likely). None of those assumptions held.
Competitors Opendoor and Offerpad survived the same market shift. Their models detected the cooling and adjusted pricing. The difference wasn’t algorithmic sophistication. It was how each company framed what their model needed to do and how quickly they updated that frame.
Zillow didn’t lose $500 million because of a bad model. They lost it because they never questioned whether “predict home value” was the right problem to solve at their operational speed.
When the AI Learned to Detect Rulers Instead of Cancer
A research team built a neural network to classify skin lesions as benign or malignant. The model reached accuracy comparable to board-certified dermatologists. Impressive numbers. Clean validation curves.
Then someone looked at what the model actually learned.
It was detecting rulers. When dermatologists suspect a lesion might be malignant, they place a ruler next to it to measure its size. So in the training data, images containing rulers correlated with malignancy. The model found a shortcut: ruler present = probably cancer. Ruler absent = probably benign.
The accuracy was real. The learning was garbage. And no hyperparameter tuning could have caught this, because the model was performing exactly as instructed on the data exactly as provided. The failure was upstream: no one asked, “What should the model be looking at to make this decision?” before measuring how well it made the decision.
This is a pattern called shortcut learning, and it shows up everywhere. Models learn to exploit correlations in your data that won’t hold in production. The only defense is a clear specification of what the model should be paying attention to, and that specification comes from problem framing, not from tuning.
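Shortcut learning is cheap to demonstrate on synthetic data. The sketch below (all feature names and numbers invented for illustration) builds a dataset where the label tracks an artifact feature far more tightly than the genuine signal, then fits the simplest possible model, a one-feature decision stump, and reports which feature it latched onto.

```python
import random

random.seed(0)

def make_dataset(n=1000):
    """Synthetic lesions: the label agrees 95% of the time with an
    artifact ("ruler_present") and only weakly with the real signal."""
    rows = []
    for _ in range(n):
        malignant = random.random() < 0.3
        ruler = malignant if random.random() < 0.95 else not malignant
        irregularity = random.gauss(1.0 if malignant else 0.0, 2.0)  # weak signal
        rows.append({"ruler_present": float(ruler),
                     "lesion_irregularity": irregularity,
                     "label": malignant})
    return rows

def best_stump(rows, features):
    """One-feature threshold classifier: whichever feature yields the best
    training accuracy is what this 'model' relies on."""
    best_feature, best_acc = None, 0.0
    for f in features:
        for t in sorted({r[f] for r in rows}):
            hits = sum((r[f] >= t) == r["label"] for r in rows)
            acc = max(hits, len(rows) - hits) / len(rows)  # allow inverted rule
            if acc > best_acc:
                best_feature, best_acc = f, acc
    return best_feature, best_acc

rows = make_dataset()
feature, acc = best_stump(rows, ["ruler_present", "lesion_irregularity"])
print(feature, round(acc, 2))  # the stump picks the artifact, not the signal
```

The stump reliably reports `ruler_present` with roughly 95% training accuracy: high accuracy, garbage learning, exactly the dermatology pattern.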
Why Framing Errors Survive So Long
If bad problem framing is this destructive, why do smart teams keep skipping it?
Three reinforcing dynamics make it persistent.
First, feedback asymmetry. When you tune a hyperparameter, you see the result in minutes. When you reframe a problem, the payoff is invisible for weeks. Human brains discount delayed rewards. So teams gravitate toward the fast feedback loop of tuning, even when the slow work of framing has 10x the return.
Second, legibility bias. “I improved accuracy from 84.7% to 84.9%” is a clean, defensible statement in a standup meeting. “I spent yesterday convincing the product team that we’re optimizing the wrong metric” sounds like you accomplished nothing. Organizations reward visible output. Framing produces no visible output until it prevents a disaster no one knows was coming.
Third, identity. Data scientists are trained as model builders. The tools, the courses, the Kaggle leaderboards, the interview questions: all of them center on modeling. Problem framing feels like someone else’s job (product, business, strategy). Claiming it means stepping outside your technical identity, and that’s uncomfortable.

Andrew Ng named this pattern when he introduced the concept of data-centric Artificial Intelligence (AI) in 2021. He defined it as “the discipline of systematically engineering the data needed to build a successful AI system.” His argument: the ML community had spent a decade obsessing over model architecture while treating data (and by extension, problem definition) as someone else’s job. The returns from better architectures had plateaued. The returns from better problem definition had barely been tapped.
The Steel-Man for Tuning
Before going further: hyperparameter tuning is not useless. There are situations where it’s exactly the right thing to do.
If you’ve already validated that your target variable maps directly to a business decision. If your data distribution in production matches training. If you’ve confirmed that your features capture the signal the business cares about (and only that signal). If all of that is true, then tuning the model’s capacity, regularization, and learning rate is legitimate optimization.
The claim isn’t “never tune.” The claim is: most teams start tuning before they’ve earned the right to tune. They skip the framing work that determines whether tuning will matter at all. And when tuning produces marginal gains on a misframed problem, those gains are illusory.
Data analytics research shows the pattern clearly: once you’ve achieved 95% of possible performance with a basic configuration, spending days to extract another 0.5% rarely justifies the computational cost. That calculation gets worse when the 95% is measured against the wrong objective.
The 5-Step Problem Framing Protocol
This protocol runs before any modeling. It takes 2 to 5 days depending on stakeholder availability. Each step produces a written artifact that your team can reference and challenge. Skip a step, and you’re gambling that your assumptions are correct. Most aren’t.
Step 1: Name the Decision (Not the Prediction)
Who: Data science lead + the business stakeholder who will act on the model’s output.
When: First meeting. Before any data exploration.
How: Ask this question and write down the answer verbatim:
“When this model produces an output, what specific decision changes? Who makes that call, and what do they do otherwise?”
Example (good): “The retention team calls the top 200 at-risk customers each week instead of emailing all 5,000. The model ranks customers by reactivation probability so the team knows who to call first.”
Example (bad): “We want to predict churn.” (No decision named. No actor identified. No action specified.)
Red flag: If the stakeholder can’t name a specific decision, the project doesn’t have a use case yet. Pause. Don’t proceed to data exploration. A model without a decision is a report no one reads.
Step 2: Define the Error Cost Asymmetry
Who: Data science lead + business stakeholder + finance (if available).
When: Same meeting or next day.
How: Ask:
“What’s worse: a false positive or a false negative? By how much?”
Example: For a fraud detection model, a false negative (missed fraud) costs the company an average of $4,200 per incident. A false positive (blocking a legitimate transaction) costs $12 in customer support time plus a 3% chance of losing the customer ($180 expected value). The ratio is roughly 23:1. This means the model should be tuned for recall, not precision, and the decision threshold should be set much lower than 0.5.
Why this matters: Default ML metrics (accuracy, F1) assume symmetric error costs. Real business problems almost never have symmetric error costs. If you optimize F1 when your actual cost ratio is 23:1, you’ll build a model that performs well on paper and poorly in production. Zillow’s Zestimate treated overestimates and underestimates as equally bad. They weren’t. Overpaying for a home you can’t resell for months is catastrophically worse than underbidding and losing a deal.
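The arithmetic in this step generalizes: once the per-error costs are written down, the expected-cost-minimizing threshold follows directly from the cost ratio. A minimal sketch in Python, using the illustrative fraud numbers above; `optimal_threshold` is a hypothetical helper name, not a library function.

```python
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Expected-cost-minimizing decision threshold for a binary flag.

    Flagging a case with fraud probability p has expected cost
    (1 - p) * cost_fp; not flagging has expected cost p * cost_fn.
    Flag whenever p * cost_fn > (1 - p) * cost_fp, i.e. whenever
    p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# Illustrative numbers from the fraud example above:
cost_fn = 4200.0        # average loss per missed fraud
cost_fn_ratio_note = "support time plus expected customer loss"
cost_fp = 12.0 + 180.0  # support time plus expected customer loss
print(round(optimal_threshold(cost_fp, cost_fn), 3))  # 0.044, far below 0.5
```

With these numbers the model should flag anything above roughly 4% fraud probability, which is why tuning against a symmetric 0.5 cutoff quietly destroys value.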
Step 3: Audit the Target Variable
Who: Data science lead + domain expert.
When: After Steps 1-2 are documented. Before any feature engineering.
How: Answer these four questions in writing:
- Does this target variable actually measure what the business cares about? “Churn” might mean “cancelled subscription” in your data but “stopped using the product” in the stakeholder’s mind. These are different populations. Clarify which one maps to the decision in Step 1.
- When is the target observed relative to when the model must act? If you’re predicting 30-day churn but the retention team needs 14 days to intervene, your prediction window is wrong. The model must predict churn at least 14 days before it happens.
- Is the target contaminated by the intervention you’re trying to optimize? If past retention efforts already reduced churn for some customers, your training data underestimates their true churn risk. The model learns “these customers don’t churn” when the truth is “these customers don’t churn because we intervened.” This is the causal inference trap, and it’s invisible in standard train/test splits.
- Can the model learn the right signal, or will it find shortcuts? The ruler-in-dermatology problem. List the features. For each one, ask: “Would a domain expert use this feature to make this decision?” If not, it may be a proxy that won’t generalize.
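The prediction-window question above can be made concrete in code. This is one possible labeling policy, sketched with a hypothetical `churn_label` helper: a customer is a positive example only if churn falls inside the horizon and far enough out that the intervention still fits.

```python
from datetime import date

def churn_label(churn_date, as_of, horizon_days=30, lead_days=14):
    """Label for a model scored on `as_of` (hypothetical helper).

    Positive only if churn falls inside the prediction horizon AND far
    enough out that the retention team's `lead_days` intervention window
    still fits. Churn tomorrow is unactionable; labeling it positive
    trains the model on cases no one can save.
    """
    if churn_date is None:
        return 0
    days_out = (churn_date - as_of).days
    return int(lead_days <= days_out <= horizon_days)

as_of = date(2024, 6, 1)
print(churn_label(date(2024, 6, 3), as_of))   # churns in 2 days: too late to act -> 0
print(churn_label(date(2024, 6, 20), as_of))  # churns in 19 days: actionable -> 1
print(churn_label(None, as_of))               # never churns -> 0
```

Whether near-term churners should be negatives or excluded entirely is itself a framing decision to settle with the domain expert; the point is that the label encodes the intervention lead time explicitly instead of leaving it implicit.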
Step 4: Simulate the Deployment Decision
Who: Full project team (DS, engineering, product, business stakeholder).
When: After Steps 1-3 are documented. Before modeling begins.
How: Run a tabletop exercise. Present the team with 10 synthetic model outputs (a combination of correct predictions, false positives, and false negatives) and ask:
- “Given this output, what action does the business take?”
- “Is that action correct given the ground truth?”
- “How much does each error type cost?”
- “At what confidence threshold does the business stop trusting the model?”
This exercise surfaces misalignments that no metric can catch. You might discover that the business actually needs a ranking (not a binary classification). Or that the stakeholder won’t act on predictions below 90% confidence, which means half your model’s output is ignored. Or that the “action” requires information the model doesn’t provide (like why a customer is at risk).
Artifact: A one-page deployment spec listing: who uses the output, in what format, at what frequency, with what confidence threshold, and what happens when the model is wrong.
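The tabletop outputs can also be scored mechanically. A sketch that totals the business cost of a handful of synthetic (prediction, ground truth) pairs using the Step 2 cost figures; all numbers and names are illustrative.

```python
# Hypothetical tabletop scoring: each row is (model_says_fraud, actually_fraud).
outcomes = [
    (True, True), (True, False), (False, True), (False, False),
    (True, True), (False, False), (True, False), (False, True),
    (False, False), (True, True),
]

COST = {"fp": 192.0, "fn": 4200.0}  # illustrative per-error costs from Step 2

def tabletop_cost(outcomes, cost):
    """Total business cost of a set of predictions against ground truth."""
    total = 0.0
    for predicted, actual in outcomes:
        if predicted and not actual:
            total += cost["fp"]   # legitimate transaction blocked
        elif actual and not predicted:
            total += cost["fn"]   # fraud missed
    return total

print(tabletop_cost(outcomes, COST))  # 2 FPs + 2 FNs -> 8784.0
```

Run against the ten synthetic outputs in the exercise, this turns “how much does each error type cost?” from a discussion prompt into a number the stakeholder can react to.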
Step 5: Write the Anti-Goal
Who: Data science lead.
When: After Steps 1-4. The last check before modeling begins.
How: Write one paragraph answering:
“If this project succeeds on every metric we’ve defined but still fails in production, what went wrong?”
Example 1: “The churn model hits 0.91 AUC on the test set, but the retention team ignores it because the predictions arrive 48 hours after their weekly planning meeting. The model is accurate but operationally useless because we didn’t align the prediction cadence with the decision cadence.”
Example 2: “The fraud model flags 15% of transactions, overwhelming the review team. They start rubber-stamping approvals to clear the queue. Technically the model catches fraud; practically the humans in the loop have learned to ignore it.”
The anti-goal is an inversion: instead of defining success, define the most plausible failure. If you can write a vivid anti-goal, you can often prevent it. If you can’t write one, you haven’t thought hard enough about deployment.

Is This a Tuning Problem or a Framing Problem?
Not every stalled project needs reframing. Sometimes the problem is well-framed and you genuinely need better model performance. Use this diagnostic to tell the difference.

What Changes When Teams Frame First
The shift from model-centric to problem-centric work isn’t just about avoiding failure. It changes what “senior” means in data science.
Junior data scientists are valued for modeling skill: can you train, tune, and deploy? Senior data scientists should be valued for framing skill: can you translate an ambiguous business situation into a well-posed prediction problem with the right target, the right features, and the right success criteria?
The industry is slowly catching up. Andrew Ng’s push toward data-centric AI is one signal. The RAND Corporation’s 2024 report on AI anti-patterns is another: their top recommendation is that leaders should ensure technical staff understand the purpose and context of a project before starting. QCon’s 2024 analysis of ML failures names “misaligned objectives” as the most common pitfall.
The pattern is clear. The bottleneck in ML isn’t algorithms. It’s alignment between the model’s objective and the business’s actual need. And that alignment is a human conversation, not a computational one.
The bottleneck in ML is not compute or algorithms. It’s the conversation between the person who builds the model and the person who uses the output.
For organizations, this means problem framing should be a first-class activity with its own time allocation, its own deliverables, and its own review process. Not a preamble to “the real work.” The real work.
For individual data scientists, it means the fastest way to increase your impact isn’t learning a new framework or mastering distributed training. It’s learning to ask better questions before you open a notebook.
It’s 11:14 PM on a Wednesday. You’re three weeks into a project. Your validation metric is climbing. You’re about to launch another sweep.
Stop.
Open a blank document. Write one sentence: “The decision that changes based on this model’s output is ___.” If you can’t fill in the blank without calling a stakeholder, you’ve just found the highest-ROI activity for tomorrow morning. It won’t feel like progress. It won’t produce a Slack-worthy screenshot. But it’s the only work that determines whether the next three weeks matter at all.
References
- RAND Corporation, “The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed”, James Ryseff, Brandon De Bruhl, Sydne J. Newberry, 2024.
- MIT Sloan, “Why It’s Time for ‘Data-Centric Artificial Intelligence’”, Sara Brown, June 2022.
- insideAI News, “The $500mm+ Debacle at Zillow Offers: What Went Wrong with the AI Models?”, December 2021.
- Stanford Graduate School of Business, “Flip Flop: Why Zillow’s Algorithmic Home Buying Venture Imploded”.
- Diagnostics (MDPI), “Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis”, 2022.
- VentureBeat, “When AI Flags the Ruler, Not the Tumor”.
- InfoQ, “QCon SF 2024: Why ML Projects Fail to Reach Production”, November 2024.
- Number Analytics, “8 Hyperparameter Tuning Insights Backed by Data Analytics”.
