The Missing Curriculum: Essential Concepts for Data Scientists in the Age of AI Coding Agents


Why read this article?

This is not another article about how to structure your prompts so that your AI agent can perform magic. There is already a sea of articles going into detail about what structure to use and when, so there's no need for another.

Instead, this article is one of a series about how to keep yourself, the coder, relevant in the modern AI coding ecosystem.

We'll cover the concepts from existing software engineering practice that you should be aware of, and explain why these concepts are relevant, particularly now.

  • By reading this series, you should have a good idea of what common pitfalls to look for in auto-generated code, and know how to guide a coding assistant to produce production-grade code that's maintainable and extensible.
  • This article is most relevant for budding programmers, graduates, and professionals from other technical industries who want to level up their coding expertise.

What we will cover not only makes you better at using coding assistants but also a better coder in general.

The Core Concepts

The high-level concepts we'll cover are the following:

  • Code Smells
  • Abstraction
  • Design Patterns

In essence, there's nothing new about them. To seasoned developers, they're second nature, drilled into their brains through years of PR reviews and debugging. You eventually reach a point where you instinctively react to code that "smells" like future pain.

And now, they're perhaps more relevant than ever, since coding assistants have become an essential part of every developer's workflow, from juniors to seniors.

Since the manual labour of writing code has been offloaded, the primary responsibility of any developer has shifted from writing code to reviewing it. Everyone has effectively become a senior developer guiding a junior (the coding assistant).

So it's become essential for even junior software practitioners to be able to 'smell' code. But the ones who will thrive in today's industry are those with the foresight of a senior developer.

This is why we will be covering the above concepts, so that at the very least you can tell your coding assistant to take them into account, even if you don't know exactly what you're looking for yourself.

With introductions out of the way, let's get straight into our first topic: code smells.

Code Smells

What’s a code smell?

I find it a very aptly named term – it's the equivalent of sour-smelling milk warning you that it's a bad idea to drink it.

For decades, developers have learnt through trial and error what kind of code works long-term. "Smelly" code is brittle, prone to hidden bugs, and makes it difficult for a human or an AI agent to understand exactly what's going on.

It is therefore very useful for developers to learn about code smells and how to detect them. Two good catalogues to browse are:

  • https://luzkan.github.io/smells
  • https://refactoring.guru/refactoring/smells

Now, having used coding agents to build everything from professional ML pipelines in my 9-to-5 job to entire mobile apps in languages I'd never touched before for my side-projects, I've identified two typical "smells" that emerge when you become over-reliant on your coding assistant:

  • Divergent Change
  • Speculative Generality

Let's go through what they are, the risks involved, and an example of how to fix each one.


Divergent Change

Divergent change is when a single module or class is doing too many things at once. The purpose of the code has 'diverged' in many different directions, so rather than being focused on doing one task well, it's trying to do everything.

This results in a painful situation where the code is always breaking and needs fixing for many different, independent reasons.

When does it happen with AI?

When the developer isn't engaged with the codebase and blindly accepts the agent's output, they are doubly at risk of this.

Yes, you may have done all the right things and written a nicely structured prompt that follows the latest best practices in prompt engineering.

But in general, the agent will often do exactly as it is told and cram the new code into your existing class, especially when the existing codebase is already very complicated.

It's ultimately up to you to think about the role, responsibility, and intended usage of the code and come up with a holistic approach. Otherwise, you're very likely to end up with smelly code.

Example — ML Engineering

Below, we have a ModelPipeline class that gives off whiffs of future extensibility issues.


class ModelPipeline:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_from_s3(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

    def clean_txn_data(self, data):
        print("Cleansing specific transaction JSON format")
        return "cleaned_data"

    def train_xgboost(self, data):
        print("Running XGBoost trainer")
        return "model"

So, what should be going through your head when you look at this code?

  • Data retrieval: What happens when we start having more than one data source, like BigQuery tables, local databases, or Azure blobs? How likely is this to happen?
  • Data engineering: If the upstream data changes or the downstream modelling changes, this will also need to change.
  • Modelling: If we use different models, say LightGBM or some neural net, the training code needs to change.

You should notice that by coupling platform, data engineering, and ML engineering concerns into a single place, we've tripled the reasons for this code to be modified – i.e. code that's starting to smell like 'Divergent Change'.
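To make the smell concrete, here is a hedged sketch (all of the additional method names below are hypothetical) of what this class tends to look like after a few such "just add it here" requests have landed on it:

class ModelPipeline:
    """Hypothetical illustration: the same class after a few divergent changes."""
    def __init__(self, data_path, source="s3"):
        self.data_path = data_path
        self.source = source

    # Platform concerns keep multiplying...
    def load_from_s3(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

    def load_from_bigquery(self):  # added when a second data source appeared
        print(f"Querying BigQuery table {self.data_path}")
        return "raw_data"

    # Data engineering concerns...
    def clean_txn_data(self, data):
        return "cleaned_data"

    def clean_clickstream_data(self, data):  # added when an upstream schema changed
        return "cleaned_data"

    # Modelling concerns...
    def train_xgboost(self, data):
        return "model"

    def train_lightgbm(self, data):  # added for a new experiment
        return "model"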

Why is this a potential problem?

  1. Operational risk: Every edit, whether by a human or an AI, runs the risk of introducing a bug. By having this class wear three different hats, you've tripled the risk of it breaking, since there are three times as many reasons for the code to change.
  2. AI agent context pollution: The agent sees the cleaning and training code as part of the same problem. For instance, it's more likely to change the training and data loading logic to accommodate a change in the data engineering, even though that was unnecessary. Ultimately, this compounds the 'Divergent Change' code smell.
  3. Risk is magnified by AI: An agent can rewrite hundreds of lines of code in a second. If those lines span three different disciplines, the agent has just tripled the chance of introducing a bug that your unit tests won't catch.

How to fix it?

The risks outlined above should give you some ideas about how to refactor this code.

One possible approach is shown below:

class S3DataLoader:
    """Handles only Infrastructure concerns."""
    def __init__(self, data_path):
        self.data_path = data_path

    def load(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

class TransactionsCleaner:
    """Handles only Data Domain/Schema concerns."""
    def clean(self, data):
        print("Cleansing specific transaction JSON format")
        return "cleaned_data"

class XGBoostTrainer:
    """Handles only ML/Research concerns."""
    def train(self, data):
        print("Running XGBoost trainer")
        return "model"

class ModelPipeline:
    """The Orchestrator: It knows 'what' to do, but not 'how' to do it."""
    def __init__(self, loader, cleaner, trainer):
        self.loader = loader
        self.cleaner = cleaner
        self.trainer = trainer

    def run(self):
        data = self.loader.load()
        cleaned = self.cleaner.clean(data)
        return self.trainer.train(cleaned)
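
As a quick usage sketch (the S3 path here is just a placeholder), the pieces are wired together like this:

# Compose the pipeline from its independent parts
loader = S3DataLoader("s3://my-bucket/transactions.json")  # placeholder path
cleaner = TransactionsCleaner()
trainer = XGBoostTrainer()

pipeline = ModelPipeline(loader, cleaner, trainer)
model = pipeline.run()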

Formerly, the model pipeline's responsibility was to handle the entire DS stack.

Now, its responsibility is to orchestrate the different pipeline stages, whilst the complexities of each stage are cleanly separated into their own respective classes.

What does this achieve?

1. Minimised Operational Risk: Now, concerns are decoupled and responsibilities are crystal clear. You can refactor your data loading logic with confidence that the ML training code stays untouched. As long as the inputs and outputs (the "contracts") stay the same, the risk of impacting anything downstream is lowered.

2. Testable Code: It's significantly easier to write unit tests, since the scope of each test is smaller and well defined.

3. Lego-brick Flexibility: The architecture is now open for extension. Need to migrate from S3 to Azure? Simply drop in an AzureBlobLoader. Want to experiment with LightGBM? Swap in a different trainer, as in the sketch below.
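
As a hedged sketch of that flexibility (AzureBlobLoader and LightGBMTrainer are hypothetical classes that simply honour the same load()/train() contracts), a swap touches only the wiring, not the orchestrator:

class AzureBlobLoader:
    """Hypothetical loader that honours the same load() contract."""
    def __init__(self, container, blob_name):
        self.container = container
        self.blob_name = blob_name

    def load(self):
        print(f"Connecting to Azure Blob Storage to get {self.blob_name}")
        return "raw_data"

class LightGBMTrainer:
    """Hypothetical trainer that honours the same train() contract."""
    def train(self, data):
        print("Running LightGBM trainer")
        return "model"

# The ModelPipeline orchestrator is untouched; only the parts you hand it change.
pipeline = ModelPipeline(
    loader=AzureBlobLoader("ml-data", "transactions.json"),
    cleaner=TransactionsCleaner(),
    trainer=LightGBMTrainer(),
)
pipeline.run()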

You ultimately end up with code that's more reliable, readable, and maintainable for both you and the AI agent. If you don't intervene, this class is likely to grow bigger, broader, and flakier, and end up being an operational nightmare.

Speculative Generality


Whilst 'Divergent Change' occurs most often in an already large and complex codebase, 'Speculative Generality' tends to appear when you start a new project.

This code smell is when the developer tries to future-proof a project by guessing how things will pan out, leading to unnecessary functionality that only increases complexity.

We've all been there: setting out to build a fully configurable, future-proof framework that can handle any model or data source we might ever need (sketched below), only to find that…

  1. it's a monster of a job,
  2. the code seems flaky,
  3. you spend far too much time on it,
  4. and you still haven't built the simple LightGBM classification model you needed in the first place.
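
A hedged sketch of what that over-engineering often looks like (every name here is hypothetical): an abstract base class and a factory, all to train a single model.

from abc import ABC, abstractmethod

class BaseModelTrainer(ABC):
    """Abstract base 'for future models' that only ever gets one subclass."""
    @abstractmethod
    def train(self, data):
        ...

class TrainerFactory:
    """A registry 'for flexibility' that nothing actually needs yet."""
    _registry = {}

    @classmethod
    def register(cls, name, trainer_cls):
        cls._registry[name] = trainer_cls

    @classmethod
    def create(cls, name, **kwargs):
        return cls._registry[name](**kwargs)

class LightGBMClassifierTrainer(BaseModelTrainer):
    def train(self, data):
        print("Training the one LightGBM classifier we actually needed")
        return "model"

# All this machinery exists to do exactly one thing:
TrainerFactory.register("lightgbm", LightGBMClassifierTrainer)
model = TrainerFactory.create("lightgbm").train("cleaned_data")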

When AI agents are prone to this smell

I've found that the latest, high-performing coding agents are most prone to this smell. Couple a powerful agent with a vague prompt, and you quickly end up with too many modules and hundreds of lines of new code.

Perhaps every line is pure gold and exactly what you need. When I experienced something like this recently, the code actually seemed to make sense to me at first.

But I ended up rejecting all of it. Why?

Because the agent was making design decisions for a future I hadn't even mapped out yet. It felt like I was losing control of my own codebase, and that it would become a real pain to undo later if the need arose.

The Key Principle: Grow your codebase organically

The mantra to remember when reviewing AI output is "YAGNI" (You Aren't Gonna Need It). It's a principle in software development that says you should only implement the code you need, not the code you foresee needing.

This is a more natural, organic way of growing a codebase that gets things done, whilst staying lean, simple, and less prone to bugs.
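
Under YAGNI, a hedged sketch of the lean alternative to the framework above might be nothing more than this (assuming the standard lightgbm scikit-learn style API):

import lightgbm as lgb

def train_classifier(X_train, y_train):
    """Train the one model we actually need today; generalise only when a real need appears."""
    model = lgb.LGBMClassifier()
    model.fit(X_train, y_train)
    return model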

Revisiting our examples

We previously looked at refactoring Example 1 into Example 2 to demonstrate how the original ModelPipeline code was smelly.

It needed to be refactored because it was subject to too many changes for too many independent reasons, and in its original state the code was too brittle to maintain effectively.

Example 1

class ModelPipeline:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_from_s3(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

    def clean_txn_data(self, data):
        print("Cleansing specific transaction JSON format")
        return "cleaned_data"

    def train_xgboost(self, data):
        print("Running XGBoost trainer")
        return "model"

Example 2

class S3DataLoader:
    """Handles only Infrastructure concerns."""
    def __init__(self, data_path):
        self.data_path = data_path

    def load(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

class TransactionsCleaner:
    """Handles only Data Domain/Schema concerns."""
    def clean(self, data):
        print("Cleansing specific transaction JSON format")
        return "cleaned_data"

class XGBoostTrainer:
    """Handles only ML/Research concerns."""
    def train(self, data):
        print("Running XGBoost trainer")
        return "model"

class ModelPipeline:
    """The Orchestrator: It knows 'what' to do, but not 'how' to do it."""
    def __init__(self, loader, cleaner, trainer):
        self.loader = loader
        self.cleaner = cleaner
        self.trainer = trainer

    def run(self):
        data = self.loader.load()
        cleaned = self.cleaner.clean(data)
        return self.trainer.train(cleaned)

Previously, we implicitly assumed this was production-grade code, subject to the various maintenance changes and feature additions that are regularly made to such code. In that context, the 'Divergent Change' code smell was relevant.

But what if this was code for a new product MVP or an R&D project? Would the same 'Divergent Change' code smell apply in this context?


In such a scenario, opting for Example 2 could well be the smellier choice.

If the scope of the project is to consider one data source and one model, building three separate classes and an orchestrator may count as 'pre-solving' problems you don't yet have.

Thus, in MVP/R&D situations where detailed deployment considerations are unknown and there is a specific input data source and output model requirement, Example 1 could well be more appropriate.

The Overarching Lesson

What these two code smells reveal is that software engineering is rarely about writing "perfect" code. It's about context.

A coding agent can write Python that's perfect in both function and syntax, but it doesn't know your entire business context. It doesn't know whether the script it's writing is a throwaway experiment or the backbone of a multi-million dollar production pipeline revamp.

Efficiency tradeoffs

You could argue that we can simply feed the AI every little detail of business context, from the meetings you've attended to the tea-break chats you've had with colleagues. But in practice, that isn't scalable.

If you have to spend half an hour writing a "context memo" just to get a clean 50-line function, have you actually gained efficiency? Or have you merely transformed the manual labour of writing code into that of writing prompts?

What makes you stand out from the rest

In the age of AI, your value as a data scientist has fundamentally changed. The manual labour of writing code has largely been removed. Agents will handle the boilerplate, the formatting, and the unit testing.

So, to set yourself apart from the other data scientists who are blindly copy-pasting code, you need the structural intuition to guide a coding agent in a direction that's relevant to your unique situation. This results in better reliability, performance, and outcomes that reflect well on you.

But to achieve this, you need to build the intuition that normally comes with years of experience, by knowing the code smells we've discussed and the other two concepts (abstraction and design patterns) that we'll delve into in subsequent articles.

And ultimately, being able to do this effectively gives you more headspace to focus on problem solving and architecting a solution to the problem – i.e. the real 'fun' of data science.

Related Articles

If you liked this article, see my Software Engineering Concepts for Data Scientists series, where we expand on the concepts most relevant to data scientists.
