Inheritance: A Software Engineering Concept Data Scientists Must Know To Succeed


Why you should read this article

If you are planning to enter data science, whether you are a graduate, a professional looking for a career change, or a manager responsible for establishing best practices, this article is for you.

Data science attracts people from a wide variety of backgrounds. In my professional experience, I've worked with colleagues who were once:

  • Nuclear Physicists
  • Post-docs researching Gravitational Waves
  • PhDs in Computational Biology
  • Linguists

just to name a few.

It's wonderful to be able to meet such a diverse set of backgrounds, and I have seen such a variety of minds lead to the growth of a creative and effective data science function.

However, I have also seen one big downside to this diversity:

Everyone has had different levels of exposure to key Software Engineering concepts, leading to a patchwork of coding skills.

As a result, I have seen work done by some data scientists that is good, but is:

  • Unreadable — you have no idea what they are trying to do.
  • Flaky — it breaks the moment someone else tries to run it.
  • Unmaintainable — code quickly becomes obsolete or breaks easily.
  • Un-extensible — code is single-use and its behaviour can't be extended.

This ultimately dampens the impact their work can have and creates all sorts of issues down the road.

Photo by Shekai on Unsplash

So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored into necessities for data scientists.

They're simple concepts, but the difference between knowing them and not knowing them clearly draws the line between amateur and professional.

Today’s Concept: Inheritance

Inheritance is key to writing clean, reusable code that improves your efficiency and productivity. It can also be used to standardise the way a team writes code, which boosts readability and maintainability.

Looking back at how difficult it was to learn these concepts when I was first learning to code, I'm not going to start off with an abstract, high-level definition that gives no value to you at this stage. There's plenty on the web you can google if you want that.


We're going to outline the sort of practical problems a data scientist could run into, see what inheritance is, and how it can help a data scientist write better code.

And by that, we mean diving straight into an example:

Example: Ingesting data from multiple different sources

Photo by John Schnobrich on Unsplash

Probably the most tedious and time-consuming part of a data scientist's job is figuring out where to get data, how to read it, how to clean it, and how to save it.

Let's say you have labels provided in CSV files submitted from five different external sources, each with their own unique schema.

Your task is to clean each one of them and output them as a parquet file, and for this file to be compatible with downstream processes, it must conform to a schema:

  • label_id : Integer
  • label_value : Integer
  • label_timestamp : String timestamp in ISO format.

The Quick & Dirty Approach

In this case, the quick and dirty approach would be to write a separate script for each file.

# clean_source1.py

import polars as pl

if __name__ == '__main__':

    df = pl.scan_csv('source1.csv')

    # Compute an overall label value per record group.
    overall_label_value = df.group_by('some-metadata4').agg(
        overall_label_value=pl.col('some-metadata2').any()
    )

    # Drop metadata columns that are no longer needed.
    df = df.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])

    df = df.join(overall_label_value, on='some-metadata4')

    # Rename and cast columns to match the required output schema.
    df = df.select(
        pl.col('primary_key').alias('label_id'),
        pl.col('overall_label_value').cast(pl.Int64).alias('label_value'),  # True/False -> 1/0
        pl.col('some-metadata6').alias('label_timestamp'),
    )

    df.sink_parquet('output/source1.parquet')

and each script would be unique.

So what's wrong with this? It gets the job done, right?

Let's return to our criteria for good code and evaluate why this one is bad:

1. It is difficult to read

There’s no organisation or structure to the code.

All the logic for loading, cleaning, and saving is in the same place, so it's difficult to see where the line is between each step.

Bear in mind, this is a contrived, simple example. In the real world, the code you'd write would be far longer and more complicated.

When you have hard-to-read code, and five different versions of it, it leads to long-term problems:

2. It is difficult to maintain

The lack of structure makes it hard to add new features or fix bugs. If the logic needed to change, the entire script would likely have to be overhauled.

If there were a common operation that needed to be applied to all outputs, then someone would have to go and modify all five scripts individually.

Each time, they would need to decipher the purpose of lines and lines of code. Because there's no clear distinction between

  • where data is loaded,
  • where data is used,
  • which variables are depended on by downstream operations,

it becomes hard to know whether the changes you make will have any unknown impact on downstream code, or violate some upstream assumption.

Ultimately, it becomes very easy for bugs to creep in.

3. It is difficult to re-use

This code is the definition of a one-off.

It's hard to read, and you don't know what's happening where unless you invest a lot of time making sure you understand every line of code.

If someone wanted to reuse logic from it, the only options they would have are to copy-paste the script and modify it, or to rewrite their own from scratch.

The Better, Professional Approach

Now, let's look at how we can improve our situation by using inheritance.

Photo by Kelly Sikkema on Unsplash

1. Identify the commonalities

In our example, every data source is unique, but we know that every file will require:

  • A number of cleaning steps
  • A saving step, where we already know all outputs will be saved into a single parquet file.

We also know each file needs to conform to the same schema, so it's best we have some validation of the output data.

These commonalities tell us which functionality we can write once and then reuse.

2. Create a base class

Now comes the inheritance part.

We write a base class, or parent class, which implements the logic for handling the commonalities we identified above. This class will become the template from which other classes will be derived.

Classes which inherit from this class (called child classes) will have the same functionality as the parent class, but will also be able to add new functionality, or change the behaviour that is already available.
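To make that concrete before we return to our problem, here is a minimal toy sketch of inheriting and overriding in Python (the class and method names are purely illustrative):

class Greeter:
    """Parent class: defines shared behaviour once."""

    def greeting(self) -> str:
        return "hello"

    def greet_loudly(self) -> str:
        # Uses self.greeting(), so a child's override changes this too.
        return self.greeting().upper() + "!"


class CasualGreeter(Greeter):
    """Child class: inherits greet_loudly, overrides greeting."""

    def greeting(self) -> str:
        return "hey"


print(Greeter().greet_loudly())        # HELLO!
print(CasualGreeter().greet_loudly())  # HEY! (inherited method picks up the override)

With that in mind, here is our base class: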

import polars as pl


class BaseCSVLabelProcessor:

    REQUIRED_OUTPUT_SCHEMA = {
        "label_id": pl.Int64,
        "label_value": pl.Int64,
        "label_timestamp": pl.Datetime
    }

    def __init__(self, input_file_path, output_file_path):
        self.input_file_path = input_file_path
        self.output_file_path = output_file_path

    def load(self):
        """Load the data from the file."""
        return pl.scan_csv(self.input_file_path)

    def clean(self, data: pl.LazyFrame):
        """Clean the input data. Implemented by each child class."""
        ...

    def save(self, data: pl.LazyFrame):
        """Save the data to a parquet file."""
        data.sink_parquet(self.output_file_path)

    def validate_schema(self, data: pl.LazyFrame):
        """Check that the data conforms to the expected output schema."""
        schema = data.collect_schema()  # resolve the schema of the lazy frame
        for colname, expected_dtype in self.REQUIRED_OUTPUT_SCHEMA.items():
            actual_dtype = schema.get(colname)

            if actual_dtype is None:
                raise ValueError(f"Column {colname} not present in data")

            if actual_dtype != expected_dtype:
                raise ValueError(
                    f"Column {colname} has incorrect type. Expected {expected_dtype}, got {actual_dtype}"
                )

    def run(self):
        """Run data processing on the specified file."""
        data = self.load()
        data = self.clean(data)
        self.validate_schema(data)
        self.save(data)
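Notice how run() fixes the overall skeleton of the job (load, clean, validate, save) while leaving clean() for each child class to fill in; in object-oriented terms, this is often called the template method pattern. As a hypothetical sketch (file names illustrative), even a trivial child class gets the whole pipeline for free:

# A source whose CSV already matches the output schema: clean() is a no-op.
class PassthroughLabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame) -> pl.LazyFrame:
        return data  # load, validate and save are all inherited

PassthroughLabelProcessor('already_clean.csv', 'output/clean.parquet').run()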

3. Define child classes

Now we define the child classes:

class Source1LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 1
        ...

class Source2LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 2
        ...

class Source3LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 3
        ...

Since all the common logic is already implemented in the parent class, all a child class needs to be concerned with is the bespoke logic that is unique to each file.

So the code we wrote for the bad example can now be turned into the following (the import path will depend on where you put the base class):

import polars as pl

from base_csv_label_processor import BaseCSVLabelProcessor  # module name illustrative

class Source1LabelProcessor(BaseCSVLabelProcessor):
    def get_overall_label_value(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Get the overall label value for each record group."""
        return data.group_by('some-metadata4').agg(
            overall_label_value=pl.col('some-metadata2').any()
        )

    def conform_to_output_schema(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Drop unnecessary columns and conform the remaining columns to the output schema."""
        data = data.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])

        data = data.select(
            pl.col('primary_key').alias('label_id'),
            pl.col('overall_label_value').cast(pl.Int64).alias('label_value'),  # True/False -> 1/0
            pl.col('some-metadata6').alias('label_timestamp'),
        )

        return data

    def clean(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Clean label data from Source 1.

        The following steps are necessary to clean the data:

        1. Computing the overall label value for each record group.
        2. Joining it back onto the raw data.
        3. Renaming columns and casting data types to conform to the expected output schema.
        """
        overall_label_value = self.get_overall_label_value(data)
        data = data.join(overall_label_value, on='some-metadata4')
        data = self.conform_to_output_schema(data)
        return data

and in order to run our code, we can do it from a centralised location (again, import paths depend on your project layout):

# label_preparation_pipeline.py
from label_processors import (  # module name illustrative
    Source1LabelProcessor,
    Source2LabelProcessor,
    Source3LabelProcessor,
)


INPUT_FILEPATHS = {
    'source1': '/path/to/file1.csv',
    'source2': '/path/to/file2.csv',
    'source3': '/path/to/file3.csv',
}

OUTPUT_FILEPATH = '/path/to/output.parquet'


def main():
    """Label processing pipeline.

    The label processing pipeline ingests data sources 1, 2, and 3, which are
    from external vendors.

    The output is written to a parquet file, ready for ingestion by downstream
    processes.
    """
    processors = [
        Source1LabelProcessor(INPUT_FILEPATHS['source1'], OUTPUT_FILEPATH),
        Source2LabelProcessor(INPUT_FILEPATHS['source2'], OUTPUT_FILEPATH),
        Source3LabelProcessor(INPUT_FILEPATHS['source3'], OUTPUT_FILEPATH),
    ]
    for processor in processors:
        processor.run()


if __name__ == '__main__':
    main()

Why is this better?

1. Good encapsulation

Any colleague who needs to re-run this code will only have to run the main() function. You'd have provided sufficient docstrings in the respective functions to explain what they do and how to use them.

They should be able to trust your work and run it. Only when they need to fix a bug or extend its functionality will they need to go deeper.

This is called encapsulation: strategically hiding the implementation details from the user. It's another programming concept that is crucial for writing good code.

Photo by Dan Crile on Unsplash

In a nutshell, it should be sufficient for the reader to rely on the docstrings to understand what the code does and how to use it.

How often do you go into the scikit-learn source code to figure out how to use their models? scikit-learn is a perfect example of good code design through encapsulation.
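For instance, a typical scikit-learn workflow relies entirely on the documented fit/predict interface; the toy dataset below is only there to make the sketch self-contained:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data, purely to make the example runnable.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

model = LogisticRegression()
model.fit(X, y)                 # the solver internals stay hidden behind fit()
predictions = model.predict(X)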

I've already written an article dedicated to encapsulation here, so if you want to know more, check it out.

2. Better extensibility

What if the label outputs now had to change? For example, suppose downstream processes that ingest the labels now require them to be stored in a SQL table.

Well, it becomes very simple to do this: we only have to change the save method in the BaseCSVLabelProcessor class, and all the child classes will inherit this change automatically.
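As a sketch (the table name and connection string are illustrative, and polars' write_database operates on a collected DataFrame and needs a SQL driver such as SQLAlchemy installed), the change could look like this:

    def save(self, data: pl.LazyFrame):
        """Save the data to a SQL table instead of a parquet file."""
        data.collect().write_database(
            table_name='labels',               # illustrative table name
            connection='sqlite:///labels.db',  # illustrative connection string
            if_table_exists='append',
        )

Every source processor then writes to the SQL table without any child class changing.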

What if you find an incompatibility between the label outputs and some process downstream? Perhaps a new column is required?

Well, you would need to change the respective clean methods to account for this. But you can also extend the checks in the validate_schema method in the BaseCSVLabelProcessor class to account for this new requirement.

You can even take this one step further and add many more checks to always make sure the outputs are as expected; you may even want to define a separate validation module for doing this, and plug it into the validate_schema method.
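A hypothetical sketch of that idea, with the module name and validator functions being illustrative:

# validation.py (illustrative module)
import polars as pl

def check_no_null_ids(data: pl.LazyFrame) -> None:
    """Fail if any label_id is null."""
    nulls = data.select(pl.col('label_id').null_count()).collect().item()
    if nulls:
        raise ValueError(f"Found {nulls} null values in label_id")

def check_label_values_binary(data: pl.LazyFrame) -> None:
    """Fail if label_value contains anything other than 0 or 1."""
    bad = data.filter(~pl.col('label_value').is_in([0, 1])).collect()
    if len(bad):
        raise ValueError("label_value contains non-binary values")

# Then, in BaseCSVLabelProcessor, run the plugged-in checks after the schema checks:
class BaseCSVLabelProcessor:
    EXTRA_CHECKS = [check_no_null_ids, check_label_values_binary]

    def validate_schema(self, data: pl.LazyFrame):
        ...  # schema checks as before
        for check in self.EXTRA_CHECKS:
            check(data)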

You can see how extending the behaviour of our label processing code becomes very simple.

In comparison, if the code lived in separate bespoke scripts, you would be copying and pasting these checks over and over again. Even worse, maybe each file would require some bespoke implementation. This means the same problem has to be solved five times, when it could be solved properly just once.

It's rework, it's inefficient, and it's wasted resources and time.

Final Remarks

So, in this article, we've covered how the use of inheritance greatly enhances the quality of our codebase.

By appropriately applying inheritance, we're able to solve common problems across different tasks, and we've seen first-hand how this leads to:

  • Code that is easier to read — Readability
  • Code that is easier to debug and maintain — Maintainability
  • Code that is easier to extend with new functionality — Extensibility

However, some readers will still be sceptical of the need to write code like this.

Perhaps they've been writing one-off scripts for their entire career, and everything has been fine so far. Why bother writing code in a more complicated way?

Photo by Towfiqu barbhuiya on Unsplash

Well, that's a very good question.

Up until very recently, data science was a new, niche industry where proof-of-concepts and research were the main focus of work. Coding standards didn't matter much then, as long as we got something out of the door and it worked.

But times have changed.

We now have to maintain, fix, debug, and retrain not only models, but also the processes required to create those models, for as long as they're used.

This is the reality that data science must face: building models is the easy part, whilst maintaining what we have built is the hard part.

Meanwhile, software engineering has been doing this for decades, and through trial and error it has built up all the best practices we discussed today, so that the code engineers build is easy to maintain.

Therefore, data scientists will need to know these best practices going forward.

Those who know them will inevitably be compared to those who don't.
