Who should read this article
If you are planning to enter data science, whether you’re a graduate, a professional looking for a career change, or a manager in charge of establishing best practices, this article is for you.
Data science attracts people from a wide range of backgrounds. In my professional experience, I’ve worked with colleagues who were once:
- Nuclear physicists
- Post-docs researching gravitational waves
- PhDs in computational biology
- Linguists
to name just a few.
It’s wonderful to be able to meet such a diverse set of backgrounds, and I have seen this variety of minds lead to the growth of a creative and effective data science function.
However, I have also seen one big downside to this variety: few of these backgrounds involve much formal exposure to software engineering. As a result, I have seen work done by some data scientists that is good, but is:
- Unreadable — you have no idea what they are trying to do.
- Flaky — it breaks the moment someone else tries to run it.
- Unmaintainable — code quickly becomes obsolete or breaks easily.
- Un-extensible — code is single-use and its behaviour can’t be extended,
which ultimately dampens the impact their work can have and creates all sorts of issues down the line.
So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored to be essentials for data scientists.
They’re simple concepts, but the difference between knowing them and not knowing them clearly draws the line between amateur and professional.
Today’s concept: Abstract classes
Abstract classes are an extension of class inheritance, and they can be a very useful tool for data scientists if used correctly.
As with class inheritance, I won’t bother with a formal definition. Looking back to when I first started coding, I found it hard to decipher the vague and abstract (no pun intended) definitions out there on the web.
It’s much easier to illustrate the idea with a practical example.
So, let’s go straight into an example that a data scientist is likely to encounter, to demonstrate how abstract classes are used and why they’re useful.
Example: Preparing data for ingestion into a feature generation pipeline

Let’s say we’re a consultancy that specialises in fraud detection for financial institutions.
We work with various different clients, and we have a set of features that carry a consistent signal across different client projects because they embed domain knowledge gathered from subject matter experts.
So it makes sense to build these features for every project, even if they are dropped during feature selection or are replaced with bespoke features built for that client.
The challenge
We data scientists know that working across different projects/environments/clients means that the input data for each one isn’t the same:
- Clients may provide different file types: CSV, Parquet, JSON, tar, to name a few.
- Different environments may require different sets of credentials.
- Most likely, each dataset has its own quirks, and so each one requires different data cleansing steps.
Therefore, you might think that we would need to build a new feature generation pipeline for each client.
How else would you handle the intricacies of every dataset?
No, there is a better way.
Given that:
- we know we’re going to be building the same set of useful features for each client, and
- we can build one feature generation pipeline that can be reused for each client,
the only new problem we need to solve is cleansing the input data.
Thus, our problem can be broken down into the following stages:

- The data cleansing pipeline: responsible for handling any unique cleansing and processing required for a given client, in order to format the dataset into the schema dictated by the feature generation pipeline.
- The feature generation pipeline: implements the feature engineering logic, assuming the input data follows a fixed schema, to output our useful set of features.
Given a fixed input data schema, building the feature generation pipeline is trivial.
Therefore, we have boiled our problem down to the following:
The problem we’re solving
Our problem is not just about getting code to run. That’s the easy part.
The hard part is designing code that is robust to a myriad of external, non-technical factors, such as:
- Human error
- People naturally forget small details or prior assumptions. They may build a data cleansing pipeline whilst overlooking certain requirements.
- Leavers
- Over time, your team inevitably changes. Your colleagues may have knowledge that they assumed to be obvious, and therefore never bothered to document. Once they have left, that knowledge is lost. Only through trial and error, and hours of debugging, will your team ever recover it.
- New joiners
- Meanwhile, new joiners have no knowledge of prior assumptions that were once considered obvious, so their code usually requires plenty of debugging and rewriting.
This is where abstract classes really shine.
Input data requirements
We mentioned that we can fix the schema for the feature generation pipeline’s input data, so let’s define this for our example.
Let’s say that our pipeline expects to read in parquet files containing the following columns:
- `row_id`: int, a unique ID for every transaction.
- `timestamp`: str, in ISO 8601 format; the timestamp at which the transaction was made.
- `amount`: int, the transaction amount denominated in pennies (for our US readers, the equivalent would be cents).
- `direction`: str, the direction of the transaction, one of `['OUTBOUND', 'INBOUND']`.
- `account_holder_id`: str, unique identifier for the entity that owns the account the transaction was made on.
- `account_id`: str, unique identifier for the account the transaction was made on.
Let’s also add a requirement that the dataset must be ordered by `timestamp`.
The abstract class
Now, time to define our abstract class.
An abstract class is essentially a blueprint which we can inherit from to create child classes, otherwise known as ‘concrete’ classes.
Let’s spec out the different methods we may need for our data cleansing blueprint.
```python
import os
from abc import ABC, abstractmethod

import polars as pl


class BaseRawDataPipeline(ABC):
    def __init__(
        self,
        input_data_path: str | os.PathLike,
        output_data_path: str | os.PathLike,
    ):
        self.input_data_path = input_data_path
        self.output_data_path = output_data_path

    @abstractmethod
    def transform(self, raw_data):
        """Transform the raw data.

        Args:
            raw_data: The raw data to be transformed.
        """
        ...

    @abstractmethod
    def load(self):
        """Load in the raw data."""
        ...

    def save(self, transformed_data):
        """Save the transformed data."""
        ...

    def validate(self, transformed_data):
        """Validate the transformed data."""
        ...

    def run(self):
        """Run the data cleansing pipeline."""
        ...
```
You can see that we have imported the `ABC` class from the `abc` module, which allows us to create abstract classes in Python.

Pre-defined behaviour

Let’s now add some pre-defined behaviour to our abstract class.
Remember, this behaviour will be made available to all child classes that inherit from this class, so this is where we bake in the behaviour that we want enforced for all future projects.
For our example, the behaviours that need fixing across all projects all relate to how we output the processed dataset.
1. The `run` method
First, we define the `run` method. This is the method that will be called to run the data cleansing pipeline.
```python
def run(self):
    """Run the data cleansing pipeline."""
    raw_data = self.load()
    output = self.transform(raw_data)
    self.validate(output)
    self.save(output)
```
The `run` method acts as a single point of entry for all future child classes.
This standardises how any data cleansing pipeline will be run, which enables us to build new functionality around any pipeline without worrying about the underlying implementation.
You can imagine how incorporating such pipelines into an orchestrator or scheduler would be easier if all pipelines are executed through the same `run` method, as opposed to having to handle many different names such as `run`, `execute`, `process`, `fit`, `transform`, etc.
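As a rough illustration, an orchestration script could look something like the sketch below. The `ClientARawDataPipeline` and `ClientBRawDataPipeline` classes and the file paths are hypothetical, assumed only to make the point.

```python
# Hypothetical child classes and paths, purely for illustration.
pipelines: list[BaseRawDataPipeline] = [
    ClientARawDataPipeline("client_a/raw.csv", "client_a/clean.parquet"),
    ClientBRawDataPipeline("client_b/raw.json", "client_b/clean.parquet"),
]

# The orchestrator needs to know nothing about each pipeline's internals;
# it only relies on the shared `run` entry point.
for pipeline in pipelines:
    pipeline.run()
```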
2. The `save` method
Next, we fix how we output the transformed data.
```python
def save(self, transformed_data: pl.LazyFrame):
    """Save the transformed data to parquet."""
    transformed_data.sink_parquet(
        self.output_data_path,
    )
```
We’re assuming we are going to use `polars` for data manipulation, and the output is saved as `parquet` files as per our specification for the feature generation pipeline.
3. The `validate` method
Finally, we populate the `validate` method, which will check that the dataset adheres to our expected output schema before saving it down.
```python
@property
def output_schema(self):
    return dict(
        row_id=pl.Int64,
        timestamp=pl.Datetime,
        amount=pl.Int64,
        direction=pl.Categorical,
        account_holder_id=pl.Categorical,
        account_id=pl.Categorical,
    )


def validate(self, transformed_data):
    """Validate the transformed data against the expected output schema."""
    schema = transformed_data.collect_schema()
    assert self.output_schema == schema, (
        f"Expected {self.output_schema} but got {schema}"
    )
```
We’ve created a property called `output_schema`. This ensures that all child classes have the expected schema available, whilst preventing it from being accidentally removed or overridden, as could happen if it were defined in, for example, `__init__`.
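As a quick sketch of why this matters, assume a hypothetical child class `DemoPipeline` that implements the abstract methods. Reading the property works as normal, but trying to overwrite it fails, because the property has no setter.

```python
class DemoPipeline(BaseRawDataPipeline):
    """Hypothetical child class, only to illustrate the property behaviour."""

    def load(self): ...
    def transform(self, raw_data): ...


pipeline = DemoPipeline("in.csv", "out.parquet")

print(pipeline.output_schema)  # returns the schema dict defined on the base class
pipeline.output_schema = {}    # raises AttributeError: the property has no setter
```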
Project-specific behaviour

In our example, the `load` and `transform` methods are where project-specific behaviour will live, so we leave them blank in the base class; the implementation is deferred to the future data scientist in charge of writing this logic for the project.
You will also notice that we have used the `abstractmethod` decorator on the `transform` and `load` methods. This decorator forces these methods to be defined by a child class. If a user forgets to define them, an error will be raised to remind them to do so.
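For example, a hypothetical child class that forgets to implement `transform` cannot even be instantiated:

```python
class IncompletePipeline(BaseRawDataPipeline):
    """Hypothetical child class that forgets to implement `transform`."""

    def load(self): ...


IncompletePipeline("in.csv", "out.parquet")
# Raises TypeError: abstract classes with unimplemented abstract methods
# cannot be instantiated.
```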
Let’s now move on to an example project where we define the `transform` and `load` methods.
Example project
The client in this project sends us their dataset as CSV files with the following structure:
event_id: str
unix_timestamp: int
user_uuid: int
wallet_uuid: int
payment_value: float
country: str
We learn from them that:
- Each transaction is uniquely identified by the combination of `event_id` and `unix_timestamp`.
- `wallet_uuid` is the equivalent identifier for the ‘account’.
- `user_uuid` is the equivalent identifier for the ‘account holder’.
- `payment_value` is the transaction amount, denominated in Pound Sterling (or Dollars).
- The CSV file is separated by `|` and has no header.
The concrete class
Now, we implement the `load` and `transform` methods to handle the unique complexities outlined above in a child class of `BaseRawDataPipeline`.
Remember, these methods are all that need to be written by the data scientists working on this project. All the aforementioned methods are pre-defined, so they don’t need to worry about them, reducing the amount of work your team has to do.
1. Loading the data
The `load` method is quite simple:
```python
class Project1RawDataPipeline(BaseRawDataPipeline):

    def load(self):
        """Load in the raw data.

        Note:
            As per the client's specification, the CSV file is separated
            by `|` and has no header, so we supply the column names
            explicitly, in the order given by the client.
        """
        return pl.scan_csv(
            self.input_data_path,
            separator="|",
            has_header=False,
            new_columns=[
                "event_id",
                "unix_timestamp",
                "user_uuid",
                "wallet_uuid",
                "payment_value",
                "country",
            ],
        )
```
We use polars’ `scan_csv` method to lazily stream the data in, with the appropriate arguments to handle the CSV file structure for our client. Since the file has no header, we also pass the column names explicitly, in the order given in the client’s specification.
2. Transforming the data
The `transform` method is also simple for this project, since we don’t have any complex joins or aggregations to perform, so we can fit it all into a single method.
```python
class Project1RawDataPipeline(BaseRawDataPipeline):
    ...

    def transform(self, raw_data: pl.LazyFrame):
        """Transform the raw data.

        Args:
            raw_data (pl.LazyFrame):
                The raw data to be transformed. Must contain the following columns:
                    - 'event_id'
                    - 'unix_timestamp'
                    - 'user_uuid'
                    - 'wallet_uuid'
                    - 'payment_value'

        Returns:
            pl.LazyFrame:
                The transformed data.

        Operations:
            1. row_id is constructed by concatenating event_id and unix_timestamp.
            2. account_id and account_holder_id are renamed from wallet_uuid
               and user_uuid respectively.
            3. amount is converted from payment_value. The source data is
               denominated in £/$, so we need to convert to p/cents.
        """
        # select only the columns we need
        DESIRED_COLUMNS = [
            "event_id",
            "unix_timestamp",
            "user_uuid",
            "wallet_uuid",
            "payment_value",
        ]
        df = raw_data.select(DESIRED_COLUMNS)

        df = df.select(
            # concatenate event_id and unix_timestamp
            # to get a unique identifier for each row.
            pl.concat_str(
                [
                    pl.col("event_id"),
                    pl.col("unix_timestamp"),
                ],
                separator="-",
            ).alias("row_id"),

            # convert unix timestamp to ISO format string
            pl.from_epoch("unix_timestamp", "s").dt.to_string("iso").alias("timestamp"),

            # map the client's identifiers onto our standard column names
            pl.col("wallet_uuid").alias("account_id"),
            pl.col("user_uuid").alias("account_holder_id"),

            # convert from £ to p (or from $ to cents) and store as an integer
            (pl.col("payment_value") * 100).round().cast(pl.Int64).alias("amount"),
        )

        return df
```
Thus, by overriding these two methods, we have implemented everything we need for our client project.
We know the output conforms to the requirements of the downstream feature generation pipeline, so we automatically have assurance that our outputs are compatible.
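To tie it together, here is a minimal usage sketch; the file paths are made up purely for illustration.

```python
# Hypothetical paths, for illustration only.
pipeline = Project1RawDataPipeline(
    input_data_path="data/client1/transactions.csv",
    output_data_path="data/client1/transactions_clean.parquet",
)

# Runs load -> transform -> validate -> save, as defined in the base class.
pipeline.run()
```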
Final summary: Why use abstract classes in data science pipelines?
Abstract classes offer a powerful way to bring consistency, robustness, and improved maintainability to data science projects. By using abstract classes as in our example, our data science team sees the following benefits:
1. No need to worry about compatibility
By defining a clear blueprint with abstract classes, the data scientist only needs to focus on implementing the `load` and `transform` methods specific to their client’s data.
As long as these methods conform to the expected input/output types, compatibility with the downstream feature generation pipeline is guaranteed.
This separation of concerns simplifies the development process, reduces bugs, and accelerates development for new projects.
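To make that concrete, here is a sketch under the assumption that the feature generation pipeline exposes a single entry point; the `build_features` function and the file path are hypothetical.

```python
import polars as pl


def build_features(transactions: pl.LazyFrame) -> pl.LazyFrame:
    """Hypothetical entry point of the shared feature generation pipeline."""
    ...


# Works for any client, because every cleansing pipeline writes the same schema.
features = build_features(pl.scan_parquet("data/client1/transactions_clean.parquet"))
```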
2. Easier to document
The structured format naturally encourages in-line documentation through method docstrings.
This proximity of design decisions and implementation makes it easier to communicate assumptions, transformations, and nuances for each client’s dataset.
Well-documented code is easier to read, maintain, and hand over, reducing the knowledge loss caused by team changes or turnover.
3. Improved code readability and maintainability
With abstract classes enforcing a consistent interface, the resulting codebase avoids the pitfalls of unreadable, flaky, or unmaintainable scripts.
Each child class adheres to a standardised method structure (`load`, `transform`, `validate`, `save`, `run`), making the pipelines more predictable and easier to debug.
4. Robustness to human factors
Abstract classes help reduce the risks from human error, leavers, and new joiners by embedding essential behaviours in the base class. This ensures that critical steps are never skipped, even when individual contributors are unaware of all downstream requirements.
5. Extensibility and reusability
By isolating client-specific logic in concrete classes while sharing common behaviours in the abstract base, it becomes straightforward to extend pipelines for new clients or projects. You can add new data cleansing steps or support new file formats without rewriting the entire pipeline, as the sketch below shows.
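For instance, suppose a hypothetical second client sends parquet files instead of CSVs. A sketch of its pipeline might look like this; only `load` and `transform` need writing, while `validate`, `save`, and `run` are inherited unchanged.

```python
class Project2RawDataPipeline(BaseRawDataPipeline):
    """Hypothetical pipeline for a client that provides parquet files."""

    def load(self):
        """Lazily scan the client's parquet files."""
        return pl.scan_parquet(self.input_data_path)

    def transform(self, raw_data: pl.LazyFrame):
        """This client's bespoke cleansing logic would go here."""
        ...
```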
In summary, abstract classes level up your data science codebase from ad-hoc scripts to scalable, maintainable, production-grade code. Whether you’re a data scientist, a team lead, or a manager, adopting these software engineering principles will significantly boost the impact and longevity of your work.
Related articles:
If you enjoyed this article, then have a look at some of my other related articles.
- Inheritance: A software engineering concept data scientists must know to succeed (here)
- Encapsulation: A software engineering concept data scientists must know to succeed (here)
- The Data Science Tool You Need For Efficient ML-Ops (here)
- DSLP: The data science project management framework that transformed my team (here)
- How to stand out in your data scientist interview (here)
- An Interactive Visualisation For Your Graph Neural Network Explanations (here)
- The Recent Best Python Package for Visualising Network Graphs (here)