Who should read this article
If you are planning to enter data science, whether you’re a graduate, a professional looking for a career change, or a manager in charge of establishing best practices, this article is for you.
Data science attracts people from a wide range of backgrounds. In my professional experience, I’ve worked with colleagues who were once:
- Nuclear physicists
- Post-docs researching gravitational waves
- PhDs in computational biology
- Linguists
to name just a few.
It’s wonderful to be able to meet such a diverse set of backgrounds, and I have seen this variety of minds lead to the growth of a creative and effective data science function.
However, I have also seen one big downside to this variety: few of these backgrounds involve much formal exposure to software engineering. As a result, I have seen work done by some data scientists that is good, but is:
- Unreadable — you have no idea what they are trying to do.
- Flaky — it breaks the moment someone else tries to run it.
- Unmaintainable — code quickly becomes obsolete or breaks easily.
- Un-extensible — code is single-use and its behaviour can’t be extended,
which ultimately dampens the impact their work can have and creates all sorts of issues down the line.
So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored to be essentials for data scientists.
They’re simple concepts, but the difference between knowing them and not knowing them clearly draws the line between amateur and professional.
Today’s concept: Abstract classes
Abstract classes are an extension of class inheritance, and they can be a very useful tool for data scientists if used correctly.
As with class inheritance, I won’t bother with a formal definition. Looking back to when I first started coding, I found it hard to decipher the vague and abstract (no pun intended) definitions out there on the web.
It’s much easier to illustrate the idea with a practical example.
So, let’s go straight into an example that a data scientist is likely to encounter, to demonstrate how abstract classes are used and why they’re useful.
Example: Preparing data for ingestion into a feature generation pipeline

Let’s say we’re a consultancy that specialises in fraud detection for financial institutions.
We work with various different clients, and we have a set of features that carry a consistent signal across different client projects because they embed domain knowledge gathered from subject matter experts.
So it makes sense to build these features for every project, even if they are dropped during feature selection or are replaced with bespoke features built for that client.
The challenge
We data scientists know that working across different projects/environments/clients means that the input data for each one isn’t the same:
- Clients may provide different file types: CSV, Parquet, JSON, tar, to name a few.
- Different environments may require different sets of credentials.
- Most likely, each dataset has its own quirks, and so each one requires different data cleansing steps.
Therefore, you might think that we would need to build a new feature generation pipeline for each client.
How else would you handle the intricacies of every dataset?
No, there is a better way.
Given that:
- we know we’re going to be building the same set of useful features for each client, and
- we can build one feature generation pipeline that can be reused for each client,
the only new problem we need to solve is cleansing the input data.
Thus, our problem can be broken down into the following stages:

- The data cleansing pipeline: responsible for handling any unique cleansing and processing required for a given client, in order to format the dataset into the schema dictated by the feature generation pipeline.
- The feature generation pipeline: implements the feature engineering logic, assuming the input data follows a fixed schema, to output our useful set of features.
Given a fixed input data schema, building the feature generation pipeline is trivial.
Therefore, we have boiled our problem down to the following:
The problem we’re solving
Our problem is not just about getting code to run. That’s the easy part.
The hard part is designing code that is robust to a myriad of external, non-technical factors, such as:
- Human error
- People naturally forget small details or prior assumptions. They may build a data cleansing pipeline whilst overlooking certain requirements.
- Leavers
- Over time, your team inevitably changes. Your colleagues may have knowledge that they assumed to be obvious, and therefore never bothered to document. Once they have left, that knowledge is lost. Only through trial and error, and hours of debugging, will your team ever recover it.
- New joiners
- Meanwhile, new joiners have no knowledge of prior assumptions that were once considered obvious, so their code usually requires plenty of debugging and rewriting.
This is where abstract classes really shine.
Input data requirements
We mentioned that we can fix the schema for the feature generation pipeline’s input data, so let’s define this for our example.
Let’s say that our pipeline expects to read in parquet files containing the following columns:
- `row_id`: int, a unique ID for every transaction.
- `timestamp`: str, in ISO 8601 format; the timestamp at which the transaction was made.
- `amount`: int, the transaction amount denominated in pennies (for our US readers, the equivalent would be cents).
- `direction`: str, the direction of the transaction, one of `['OUTBOUND', 'INBOUND']`.
- `account_holder_id`: str, unique identifier for the entity that owns the account the transaction was made on.
- `account_id`: str, unique identifier for the account the transaction was made on.
Let’s also add a requirement that the dataset must be ordered by `timestamp`.
The abstract class
Now, time to define our abstract class.
An abstract class is essentially a blueprint which we can inherit from to create child classes, otherwise known as ‘concrete’ classes.
Let’s spec out the different methods we may need for our data cleansing blueprint.
```python
import os
from abc import ABC, abstractmethod

import polars as pl


class BaseRawDataPipeline(ABC):
    def __init__(
        self,
        input_data_path: str | os.PathLike,
        output_data_path: str | os.PathLike,
    ):
        self.input_data_path = input_data_path
        self.output_data_path = output_data_path

    @abstractmethod
    def transform(self, raw_data):
        """Transform the raw data.

        Args:
            raw_data: The raw data to be transformed.
        """
        ...

    @abstractmethod
    def load(self):
        """Load in the raw data."""
        ...

    def save(self, transformed_data):
        """Save the transformed data."""
        ...

    def validate(self, transformed_data):
        """Validate the transformed data."""
        ...

    def run(self):
        """Run the data cleansing pipeline."""
        ...
```
You can see that we have imported the `ABC` class from the `abc` module, which allows us to create abstract classes in Python.

Pre-defined behaviour

Let’s now add some pre-defined behaviour to our abstract class.
Remember, this behaviour will be made available to all child classes that inherit from this class, so this is where we bake in the behaviour that we want enforced for all future projects.
For our example, the behaviours that need fixing across all projects all relate to how we output the processed dataset.
1. The `run` method
First, we define the `run` method. This is the method that will be called to run the data cleansing pipeline.
```python
def run(self):
    """Run the data cleansing pipeline."""
    raw_data = self.load()
    output = self.transform(raw_data)
    self.validate(output)
    self.save(output)
```
The `run` method acts as a single point of entry for all future child classes.
This standardises how any data cleansing pipeline will be run, which enables us to build new functionality around any pipeline without worrying about the underlying implementation.
You can imagine how incorporating such pipelines into an orchestrator or scheduler would be easier if all pipelines are executed through the same `run` method, as opposed to having to handle many different names such as `run`, `execute`, `process`, `fit`, `transform`, etc.
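As a rough illustration, an orchestration script could look something like the sketch below. The `ClientARawDataPipeline` and `ClientBRawDataPipeline` classes and the file paths are hypothetical, assumed only to make the point.

```python
# Hypothetical child classes and paths, purely for illustration.
pipelines: list[BaseRawDataPipeline] = [
    ClientARawDataPipeline("client_a/raw.csv", "client_a/clean.parquet"),
    ClientBRawDataPipeline("client_b/raw.json", "client_b/clean.parquet"),
]

# The orchestrator needs to know nothing about each pipeline's internals;
# it only relies on the shared `run` entry point.
for pipeline in pipelines:
    pipeline.run()
```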
2. The `save` method
Next, we fix how we output the transformed data.
```python
def save(self, transformed_data: pl.LazyFrame):
    """Save the transformed data to parquet."""
    transformed_data.sink_parquet(
        self.output_data_path,
    )
```
We’re assuming we are going to use `polars` for data manipulation, and the output is saved as `parquet` files as per our specification for the feature generation pipeline.
3. The `validate` method
Finally, we populate the `validate` method, which will check that the dataset adheres to our expected output schema before saving it down.
```python
@property
def output_schema(self):
    return dict(
        row_id=pl.Int64,
        timestamp=pl.Datetime,
        amount=pl.Int64,
        direction=pl.Categorical,
        account_holder_id=pl.Categorical,
        account_id=pl.Categorical,
    )


def validate(self, transformed_data):
    """Validate the transformed data against the expected output schema."""
    schema = transformed_data.collect_schema()
    assert self.output_schema == schema, (
        f"Expected {self.output_schema} but got {schema}"
    )
```
We’ve created a property called `output_schema`. This ensures that all child classes have the expected schema available, whilst preventing it from being accidentally removed or overridden, as could happen if it were defined in, for example, `__init__`.
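As a quick sketch of why this matters, assume a hypothetical child class `DemoPipeline` that implements the abstract methods. Reading the property works as normal, but trying to overwrite it fails, because the property has no setter.

```python
class DemoPipeline(BaseRawDataPipeline):
    """Hypothetical child class, only to illustrate the property behaviour."""

    def load(self): ...
    def transform(self, raw_data): ...


pipeline = DemoPipeline("in.csv", "out.parquet")

print(pipeline.output_schema)  # returns the schema dict defined on the base class
pipeline.output_schema = {}    # raises AttributeError: the property has no setter
```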
Project-specific behaviour

In our example, the `load` and `transform` methods are where project-specific behaviour will live, so we leave them blank in the base class; the implementation is deferred to the future data scientist in charge of writing this logic for the project.
You will also notice that we have used the `abstractmethod` decorator on the `transform` and `load` methods. This decorator forces these methods to be defined by a child class. If a user forgets to define them, an error will be raised to remind them to do so.
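For example, a hypothetical child class that forgets to implement `transform` cannot even be instantiated:

```python
class IncompletePipeline(BaseRawDataPipeline):
    """Hypothetical child class that forgets to implement `transform`."""

    def load(self): ...


IncompletePipeline("in.csv", "out.parquet")
# Raises TypeError: abstract classes with unimplemented abstract methods
# cannot be instantiated.
```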
Let’s now move on to an example project where we define the `transform` and `load` methods.
Example project
The client in this project sends us their dataset as CSV files with the following structure:
event_id: str
unix_timestamp: int
user_uuid: int
wallet_uuid: int
payment_value: float
country: str
We learn from them that:
- Each transaction is uniquely identified by the combination of `event_id` and `unix_timestamp`.
- `wallet_uuid` is the equivalent identifier for the ‘account’.
- `user_uuid` is the equivalent identifier for the ‘account holder’.
- `payment_value` is the transaction amount, denominated in Pound Sterling (or Dollars).
- The CSV file is separated by `|` and has no header.
The concrete class
Now, we implement the `load` and `transform` methods to handle the unique complexities outlined above in a child class of `BaseRawDataPipeline`.
Remember, these methods are all that need to be written by the data scientists working on this project. All the aforementioned methods are pre-defined, so they don’t need to worry about them, reducing the amount of work your team has to do.
1. Loading the data
The `load` method is quite simple:
```python
class Project1RawDataPipeline(BaseRawDataPipeline):

    def load(self):
        """Load in the raw data.

        Note:
            As per the client's specification, the CSV file is separated
            by `|` and has no header, so we supply the column names
            explicitly, in the order given by the client.
        """
        return pl.scan_csv(
            self.input_data_path,
            separator="|",
            has_header=False,
            new_columns=[
                "event_id",
                "unix_timestamp",
                "user_uuid",
                "wallet_uuid",
                "payment_value",
                "country",
            ],
        )
```
We use polars’ `scan_csv` method to lazily stream the data in, with the appropriate arguments to handle the CSV file structure for our client. Since the file has no header, we also pass the column names explicitly, in the order given in the client’s specification.
2. Transforming the data
The `transform` method is also simple for this project, since we don’t have any complex joins or aggregations to perform, so we can fit it all into a single method.
```python
class Project1RawDataPipeline(BaseRawDataPipeline):
    ...

    def transform(self, raw_data: pl.LazyFrame):
        """Transform the raw data.

        Args:
            raw_data (pl.LazyFrame):
                The raw data to be transformed. Must contain the following columns:
                    - 'event_id'
                    - 'unix_timestamp'
                    - 'user_uuid'
                    - 'wallet_uuid'
                    - 'payment_value'

        Returns:
            pl.LazyFrame:
                The transformed data.

        Operations:
            1. row_id is constructed by concatenating event_id and unix_timestamp.
            2. account_id and account_holder_id are renamed from wallet_uuid
               and user_uuid respectively.
            3. amount is converted from payment_value. The source data is
               denominated in £/$, so we need to convert to p/cents.
        """
        # select only the columns we need
        DESIRED_COLUMNS = [
            "event_id",
            "unix_timestamp",
            "user_uuid",
            "wallet_uuid",
            "payment_value",
        ]
        df = raw_data.select(DESIRED_COLUMNS)

        df = df.select(
            # concatenate event_id and unix_timestamp
            # to get a unique identifier for each row.
            pl.concat_str(
                [
                    pl.col("event_id"),
                    pl.col("unix_timestamp"),
                ],
                separator="-",
            ).alias("row_id"),

            # convert unix timestamp to ISO format string
            pl.from_epoch("unix_timestamp", "s").dt.to_string("iso").alias("timestamp"),

            # map the client's identifiers onto our standard column names
            pl.col("wallet_uuid").alias("account_id"),
            pl.col("user_uuid").alias("account_holder_id"),

            # convert from £ to p (or from $ to cents) and store as an integer
            (pl.col("payment_value") * 100).round().cast(pl.Int64).alias("amount"),
        )

        return df
```
Thus, by overriding these two methods, we have implemented everything we need for our client project.
We know the output conforms to the requirements of the downstream feature generation pipeline, so we automatically have assurance that our outputs are compatible.
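To tie it together, here is a minimal usage sketch; the file paths are made up purely for illustration.

```python
# Hypothetical paths, for illustration only.
pipeline = Project1RawDataPipeline(
    input_data_path="data/client1/transactions.csv",
    output_data_path="data/client1/transactions_clean.parquet",
)

# Runs load -> transform -> validate -> save, as defined in the base class.
pipeline.run()
```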
Final summary: Why use abstract classes in data science pipelines?
Abstract classes offer a powerful way to bring consistency, robustness, and improved maintainability to data science projects. By using abstract classes as in our example, our data science team sees the following benefits:
1. No need to worry about compatibility
By defining a clear blueprint with abstract classes, the data scientist only needs to focus on implementing the `load` and `transform` methods specific to their client’s data.
As long as these methods conform to the expected input/output types, compatibility with the downstream feature generation pipeline is guaranteed.
This separation of concerns simplifies the development process, reduces bugs, and accelerates development for new projects.
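To make that concrete, here is a sketch under the assumption that the feature generation pipeline exposes a single entry point; the `build_features` function and the file path are hypothetical.

```python
import polars as pl


def build_features(transactions: pl.LazyFrame) -> pl.LazyFrame:
    """Hypothetical entry point of the shared feature generation pipeline."""
    ...


# Works for any client, because every cleansing pipeline writes the same schema.
features = build_features(pl.scan_parquet("data/client1/transactions_clean.parquet"))
```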
2. Easier to document
The structured format naturally encourages in-line documentation through method docstrings.
This proximity of design decisions and implementation makes it easier to communicate assumptions, transformations, and nuances for each client’s dataset.
Well-documented code is easier to read, maintain, and hand over, reducing the knowledge loss caused by team changes or turnover.
3. Improved code readability and maintainability
With abstract classes enforcing a consistent interface, the resulting codebase avoids the pitfalls of unreadable, flaky, or unmaintainable scripts.
Each child class adheres to a standardised method structure (`load`, `transform`, `validate`, `save`, `run`), making the pipelines more predictable and easier to debug.
4. Robustness to human factors
Abstract classes help reduce the risks from human error, leavers, and new joiners by embedding essential behaviours in the base class. This ensures that critical steps are never skipped, even when individual contributors are unaware of all downstream requirements.
5. Extensibility and reusability
By isolating client-specific logic in concrete classes while sharing common behaviours in the abstract base, it becomes straightforward to extend pipelines for new clients or projects. You can add new data cleansing steps or support new file formats without rewriting the entire pipeline, as the sketch below shows.
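For instance, suppose a hypothetical second client sends parquet files instead of CSVs. A sketch of its pipeline might look like this; only `load` and `transform` need writing, while `validate`, `save`, and `run` are inherited unchanged.

```python
class Project2RawDataPipeline(BaseRawDataPipeline):
    """Hypothetical pipeline for a client that provides parquet files."""

    def load(self):
        """Lazily scan the client's parquet files."""
        return pl.scan_parquet(self.input_data_path)

    def transform(self, raw_data: pl.LazyFrame):
        """This client's bespoke cleansing logic would go here."""
        ...
```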
In summary, abstract classes level up your data science codebase from ad-hoc scripts to scalable, maintainable, production-grade code. Whether you’re a data scientist, a team lead, or a manager, adopting these software engineering principles will significantly boost the impact and longevity of your work.
Related articles:
If you enjoyed this article, then have a look at some of my other related articles.
- Inheritance: A software engineering concept data scientists must know to succeed (here)
- Encapsulation: A software engineering concept data scientists must know to succeed (here)
- The Data Science Tool You Need For Efficient ML-Ops (here)
- DSLP: The data science project management framework that transformed my team (here)
- How to stand out in your data scientist interview (here)
- An Interactive Visualisation For Your Graph Neural Network Explanations (here)
- The Recent Best Python Package for Visualising Network Graphs (here)