Ivory Tower Notes: The Problem

Have you ever spent months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not, and you’re only starting out in the data science or AI field, welcome to my first Ivory Tower Note, where I’ll address this topic.

Welcome to post #1…

How every Machine Learning and AI journey starts

It starts with a problem.

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I’m addressing the business problem you don’t fully understand or know how to solve at first.

An even worse scenario is when you think you fully understand it and know how to solve it. This then creates more problems that are again yours to solve. But more about this in the upcoming sections.

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly — workforce, equipment, money, or time.

Ratio: It’s usually about generating business value, which can range from improved accuracy, increased productivity, cost savings, and revenue gains to faster response, decision, planning, delivery, or turnaround times.

Veritas: It’s always about finding a solution that relies on, and is hidden somewhere in, the existing dataset.

Or, more than one dataset that someone labelled as “the one”, and that’s waiting for you to solve the problem. Because datasets follow and are created from technical or business process logs, “”

Ah, if only it were really easy.

Avoiding another train of thought, the point is that you need to:

1. Understand the problem fully,
2. If not given, find the dataset “behind” it, and
3. Create a methodology to get to the solution that will generate business value from it.

On this path, you will be tracked and measured, and time won’t be on your side to deliver the solution that will solve “the universe equation.”

That’s why you need to approach the problem methodically, drill down to smaller problems first, and focus entirely on them because they’re the root cause of the overall problem.

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in a large museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location.

At this moment, in front of you, you see something like this:

Process. Image by Author, inspired by Microsoft Learn

The next thing you may tell yourself is, “” (These are the insights you need to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorize the shortest path to reach your goal. (This is the initial data collection and discovery phase.)

However, along the way, you come across some obstacles: the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info maps don’t reflect the changes, so the path you had in mind to get to the painting is no longer accurate.

You then find yourself wandering around the third floor, quietly asking again, “”

Because you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting new data to find the correct route to your painting. (This is a new data collection and discovery phase.)

However, once you get to the second floor, you get lost again, but this time you start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, which gives you an indication of where to go to find your painting. (This is a modelling phase overlapped with the enrichment phase from the dataset you collected during your school days: your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you have been planning to see since booking your flight a few months ago.

What I just described is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

You then start planning backwards from this question. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around a museum, and then seeing the painting you’ve been reading about for ages.

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating a painting in a museum.

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]:

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on course so that you don’t get lost in the data that should be used to reach the specific problem solution, and so that you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things go south (which almost always happens) 👇🏼.

**Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]**

From the diagram above, you can see how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated so that the premises are clear and can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether by confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations.

Last but not least, good questions anticipate obstacles. If anything is certain in data science, it is uncertainty, so having backup plans when things don’t work as expected is vital to delivering results for your project.

Let’s illustrate this with a use case of an airline company that struggles to increase its fleet availability because of unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this were the first data science task you ever got, you would perhaps start the investigation by asking:

You can see how this question is an example of a wrong or “poor” one because:

It is not realistic: It lumps every possible defect, both small and large, into one impossible goal of “zero operational interruptions”.
It doesn’t hold a measure of success: There’s no concrete metric to indicate progress, and if you’re not at zero, you’re at “failure.”
It is not data-driven: The question doesn’t cover which data is recorded before delays occur, or how aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

Which aircraft (sub)system is most critical to flight disruptions?
This question narrows down your scope, focusing on just one or two specific (sub)systems affecting most delays.
What constitutes “critical downtime” from an operational perspective?
If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
Which data sources capture the root causes, and how can we fuse them?
This clarifies which data sources one would need to find the problem solution.
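To make the first two questions a bit more tangible, here is a minimal sketch of how one might rank subsystems by their critical downtime. Everything in it is an assumption for illustration: the grounding log, the subsystem names, the `CRITICAL_MINUTES` threshold, and the `rank_subsystems` helper are all hypothetical and would need to come from the airline’s own definitions and data.

```python
from collections import defaultdict

# Hypothetical grounding log: (aircraft subsystem, unscheduled downtime in minutes).
# The subsystem names and values are invented for illustration only.
grounding_events = [
    ("hydraulics", 240),
    ("avionics", 35),
    ("landing_gear", 180),
    ("hydraulics", 310),
    ("cabin", 20),
    ("avionics", 95),
]

# Assumed threshold: downtime at or above this many minutes counts as "critical".
# In practice, the airline or the regulatory body would have to define it.
CRITICAL_MINUTES = 90


def rank_subsystems(events, threshold):
    """Sum critical downtime per subsystem and sort from most to least affected."""
    totals = defaultdict(int)
    for subsystem, minutes in events:
        # Ignore events below the threshold: they are not "critical downtime".
        if minutes >= threshold:
            totals[subsystem] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_subsystems(grounding_events, CRITICAL_MINUTES)
for subsystem, minutes in ranking:
    print(f"{subsystem}: {minutes} critical minutes")
```

Notice how the ranking depends entirely on the chosen threshold, which is exactly why the definition of “critical downtime” has to be agreed on before any modelling starts.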

With these sharper questions, you’ll drill down to the actual problem:

Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict that result in so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after.

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

**Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]**

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, apart from asking “”, ask as well:

What would success look like, and how can we measure it?
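As a small illustration of turning “success” into something measurable, here is a sketch for the airline use case. It assumes, purely for illustration, that fleet availability is defined as the share of scheduled hours not lost to groundings, and that success means an absolute improvement over a baseline. The function names, the numbers, and the one-percentage-point target are hypothetical, not an industry standard.

```python
def fleet_availability(scheduled_hours, grounded_hours):
    """Share of scheduled hours in which aircraft were actually available.

    Assumed definition for this sketch: 1 - grounded / scheduled.
    """
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return 1 - grounded_hours / scheduled_hours


def meets_target(baseline, current, min_improvement):
    """True if availability improved by at least min_improvement (absolute)."""
    return (current - baseline) >= min_improvement


# Invented numbers: 10,000 scheduled hours, before and after the project.
baseline = fleet_availability(10_000, 800)
current = fleet_availability(10_000, 600)

# Assumed success criterion: at least one percentage point of improvement.
print(meets_target(baseline, current, 0.01))
```

With a metric like this agreed on up front, “are we done?” becomes a comparison against a number instead of a matter of opinion.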

From there, drill down to (possible) next-level questions that you (I) have learned from the Ivory Tower days:
- History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
- Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
- Impact questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
- Assumption questions: “What are we taking for granted that may not be true (especially when it comes to data and stakeholders’ ideas)?”
- …

Then, do this in a loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the root problem is.

This is the evergreen knowledge you can apply nowadays, too, when deciding if your problem is of a or nature.

(More about this in another note, where I’ll explain how problematic it is to try to solve a problem with models that have never seen, or have never been trained on, similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague.

When you have a problem at hand, don’t rush into assumptions or building models without understanding what you need to do.

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too.

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, determining whether it can be solved in the first place, and then coming up with the “how” part.

You get there by learning to ask good questions.

Thanks for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍️ and LinkedIn 🖇️.

References:

[1] DS4Humans, accessed: April 5th, 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist, Manning Publications.
