Reducing Time to Value for Data Science Projects: Part 4


This final article in the series on reducing the time to value of your projects (see part 1, part 2 and part 3) takes a less implementation-led approach and instead focuses on best practices for developing code. Rather than detailing what and how to code explicitly, I want to talk about how you should approach the development of projects in general, which underpins everything that has been covered previously.

Introduction

Being a data scientist involves bringing together a number of different disciplines and applying them to drive value for a business. The most commonly prized skill of a data scientist is the technical ability to produce a trained model that is ready to go live. This covers a wide range of required knowledge such as exploratory data analysis, feature engineering, data transformations, feature selection, hyperparameter tuning, model training and model evaluation. Learning these steps alone is a major undertaking, especially in the continuously evolving world of Large Language Models and Generative AI. Data scientists could devote all their learning to becoming technical powerhouses, knowing the inner workings of the most advanced models.

While being technically proficient is important, there are other skills that should be developed if you want to be a truly great data scientist. Chief amongst these is being a good software developer. Being able to write robust, flexible and scalable code is just as important, if not more so, than knowing all the latest techniques and models. Lacking these software skills will allow bad practices to creep into your work, and you may end up with code that is not suitable for production. Embracing software development principles will give you a structured way of ensuring your code is high quality and will speed up the overall project development process.

This article will serve as a brief introduction to topics that multiple books have been written about. As such, I don't expect this to be a comprehensive breakdown of everything in software development; instead I want this to merely be a starting point in your journey of writing clean code that helps to drive forward value for your business.

Set Up Your DevOps Platform Properly

All data scientists are taught to use Git as part of their education to perform tasks such as cloning repositories, creating branches, pulling / pushing changes etc. These tend to be backed by platforms such as GitHub or GitLab, and data scientists are often content to use these purely as a place to store code remotely. However, they have significantly more to offer as fully fledged DevOps platforms, and using them as such will greatly improve your coding experience.

Assigning Roles To Team Members In Your Repository

Many people will want or need to access your project repository for various purposes. As a matter of security, it is good practice to limit how each of them can interact with it. The roles that people can take typically fall into categories such as:

  • Analyst: Only needs to be able to read the repository
  • Developer: Needs to be able to read and write to the repository
  • Maintainer: Needs to be able to edit repository settings

For data science teams, you should have the more senior members of staff on the project be maintainers and junior members be developers. This becomes important when deciding who can merge changes into production.

Managing Branches

When developing a project with Git, you will make extensive use of branches to add features / develop functionality. Branches can be split into different categories such as:

  • main/master: Used for official production releases
  • development: Used to bring together features and functionality
  • features: Used when doing code development work
  • bugfixes: Used for minor fixes
Proper management of the branching structure simplifies the development process. Image by author

The main and development branches are special as they are permanent and represent the work that is closest to production. As such, special care must be taken with them, namely:

  • Ensure they can’t be deleted
  • Ensure they can’t be pushed to directly
  • They can only be updated via merge requests
  • Limit who can merge changes into them

We can and should protect these branches to enforce the above. This is usually the job of project maintainers.

When deciding merge strategies for adding to development / main we need to consider:

  • Who is allowed to trigger and approve these merges (specific roles / people)?
  • How many approvals are required before a merge is accepted?
  • What checks does a branch need to pass to be accepted?

Usually we will have less strict controls for updating development than for updating main, but it is important to have a consistent strategy in place.

When dealing with feature branches it is important to consider:

  • What will the branch be called?
  • What is the structure of the commit messages?

What is important is to agree as a team on the rules for naming branches. Some examples could be to name them after a ticket, to have a standard list of prefixes to start a branch with, or to add a suffix at the end to easily identify the owner. For the commit messages, you may wish to use a 3rd party library such as Commitizen to enforce standardisation across the team. A hypothetical convention is sketched below.
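
For illustration only, a team might agree on something along these lines (the ticket IDs, prefixes, owner suffix and message format are made-up examples):

feature/DS-123-add-churn-features-jane      # new functionality, named after a ticket and suffixed with the owner
bugfix/DS-140-fix-imputation-jane           # minor fix
feat(preprocessing): add winsorisation step # Conventional Commits style message that Commitizen can enforce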

Maintain a Consistent Development Environment

Taking a step back, developing code will require you to:

  • Have access to the programming language's software development kit
  • Install 3rd party libraries to develop your solution

Even at this point care must be taken. It is all too common to run into the scenario where solutions that work locally fail when another team member tries to run them. This is caused by inconsistent development environments where:

  • Different versions of the programming language are installed
  • Different versions of 3rd party libraries are installed

Ensuring that everyone is developing within the same environment, one that replicates the production conditions, will ensure we have no compatibility issues between developers, that the solution will work in production, and will eliminate the need for ad-hoc installation of libraries. Some recommendations are:

  • Use a requirements.txt / pyproject.toml at a minimum. No pip installing libraries on the fly!
  • Look into using docker / containerisation to have fully shippable environments (a minimal sketch of both is shown below)
Consistent environments and libraries ensure reproducibility and reduce friction. Image by author
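
As a minimal sketch, pinning the environment might look something like the following (the package versions and base image are assumptions for the example):

# requirements.txt - pin exact versions so every developer installs the same thing
pandas==2.1.4
scikit-learn==1.3.2

# Dockerfile - one shippable environment that mirrors production
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ src/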

Without these standardisations in place there is no guarantee that your solution will work when deployed into production.

Readme.md

Readmes are the first thing that is seen when you open a project in your DevOps platform. They give you the opportunity to provide a high level summary of your project and inform your audience how to interact with it. Some important sections to put in a readme are:

  • Project title, description and setup to get people onboarded
  • How to run / use the project so people can use any core functionality and interpret the results
  • Contributors / point of contact for people to follow up with
A one-stop shop for getting users onboarded onto your project. Image by author

A readme does not need to be extensive documentation of everything relevant to a project, merely a quick start guide; a minimal skeleton is sketched below. More detailed background, experimental results etc. can be hosted elsewhere, such as an internal wiki like Confluence.
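
As a rough sketch, a quick start readme covering those sections might look like this (the project name, commands and contacts are placeholders):

# Churn Model
Predicts customer churn to support the retention team.

## Setup
pip install -r requirements.txt

## How to run
python -m src.train    # trains the model and writes metrics to outputs/

## Contacts
Jane Doe (data science), John Smith (engineering)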

Test, Test And Test Some More!

Anyone can write code, but not everyone can write correct and maintainable code. Ensuring that your code is bug free is critical, and every precaution should be taken to mitigate this risk. The simplest way to do this is to write tests for whatever code you develop. There are different types of tests you can write, such as:

  • Unit tests: Test individual components
  • Integration tests: Test how the individual components work together
  • Regression tests: Test that any new changes have not broken existing functionality

Writing a good unit test relies on a well-written function. Functions should try to adhere to principles such as Do One Thing (DOT) or Don't Repeat Yourself (DRY) to ensure that you can write clear tests. In general you should test to (see the sketch after this list):

  • Show the function working
  • Show the function failing
  • Trigger any exceptions raised within the function
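
A minimal sketch of what this can look like with pytest, using a made-up divide function, is:

import pytest

def divide(a: float, b: float) -> float:
    """Divide a by b, raising a ValueError when b is zero."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b

def test_divide_working():
    # Show the function working on a normal input
    assert divide(10, 2) == 5

def test_divide_failing_on_bad_input():
    # Show the function failing when given an input it cannot handle
    with pytest.raises(TypeError):
        divide("ten", 2)

def test_divide_raises_expected_exception():
    # Trigger the exception raised within the function
    with pytest.raises(ValueError):
        divide(1, 0)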

Another important aspect to consider is how much of your code is tested, aka the test coverage. While achieving 100% coverage is the idealised scenario, in practice you will have to accept less, which is okay. This is common when you are coming into an existing project where standards have not been properly maintained. The important thing is to start with a coverage baseline and then try to increase it over time as your solution matures. This will involve some technical debt work to get the tests written.

pytest --cov=src/ --cov-fail-under=20 --cov-report term --cov-report xml:coverage.xml --junitxml=report.xml tests

This example pytest invocation both runs the tests and checks that a minimum level of coverage has been attained.

Code Reviews

The single most important part of writing code is having it reviewed and approved by another developer. Having code looked over ensures:

  • The code produced answers the original question
  • The code meets the required standards
  • The code uses an appropriate implementation

Code reviewing data science projects may involve extra steps due to their experimental nature. While this is far from an exhaustive list, some general checks are:

  • Does the code run?
  • Is it tested sufficiently?
  • Are appropriate programming paradigms and data structures used?
  • Is the code readable?
  • Is the code maintainable and extensible?
def bad_function(keys, values, specific_key, new_value):
    # No docstring, no type hints, and parallel lists where a dictionary would be clearer
    for i, key in enumerate(keys):
        if key == specific_key:
            values[i] = new_value
    return keys, values
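
For contrast, a cleaner version of the same logic might look something like this sketch (the function and parameter names are illustrative):

from typing import Any

def update_entry(data: dict[str, Any], key_to_update: str, new_value: Any) -> dict[str, Any]:
    """Return the mapping with the value for key_to_update replaced by new_value."""
    if key_to_update in data:
        data[key_to_update] = new_value
    return data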

The first code snippet above highlights a variety of bad habits, such as using parallel lists instead of a dictionary and having no type hints or docstrings; the second shows one possible cleanup. From a data science perspective you will additionally want to check:

  • Are notebooks used sparingly and commented appropriately?
  • Has the analysis been communicated sufficiently (e.g. graphs labelled, dataframes described etc.)?
  • Has care been taken when producing models (no data leakage, only using features available at inference etc.)?
  • Are any artefacts produced and are they stored appropriately?
  • Are experiments carried out to a high standard, e.g. set out with a research question, tracked and documented?
  • Are there clear next steps from this work?

There will come a time when you move off the project onto other things, and someone else will take over. When writing code you should always ask yourself:

How easy would it be for somebody to understand what I have written and be comfortable with maintaining or extending its functionality?

Use CICD To Automate The Mundane

As projects grow in size, both in people and code, having checks and standards becomes more and more important. This is typically done through code reviews and can involve tasks like checking:

  • Implementation
  • Testing
  • Test Coverage
  • Code Style Standardization

We additionally want to check for security concerns such as exposed API keys / credentials or code that is vulnerable to malicious attack. Having to manually check all of these for every code review can quickly become time consuming and may also result in checks being ignored. Many of these checks can be covered by 3rd party libraries such as the following (example commands are shown after the list):

  • Black, Flake8 and isort
  • Pytest
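
Assuming a src/ and tests/ layout, these would typically be run locally along the lines of:

black src/ tests/        # auto-format code to a consistent style
isort src/ tests/        # order imports consistently
flake8 src/ tests/       # flag style violations and simple bugs
pytest tests/            # run the test suite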

While this alleviates some of the reviewer's work, there is still the issue of having to run these libraries yourself. What would be better is the ability to automate these checks and others so that you no longer have to. This allows code reviews to be more focused on the solution and implementation. This is exactly where Continuous Integration / Continuous Deployment (CICD) comes to the rescue.

Automating checks frees up developer time. Image by author

There are a number of CICD tools available (GitLab Pipelines, GitHub Actions, Jenkins, Travis etc.) that allow the automation of tasks; a minimal example pipeline is sketched after this paragraph. We could go further and automate tasks such as building environments and even training / deploying models. While CICD can encompass the entire software development process, I hope I have motivated some useful examples for its use in improving data science projects.
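
As a rough sketch only, a GitHub Actions workflow automating the lint and test checks above on every merge request could look something like this (the Python version, paths and coverage threshold are assumptions):

# .github/workflows/checks.yml
name: checks
on: [pull_request]
jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: black --check src/ tests/
      - run: flake8 src/ tests/
      - run: pytest --cov=src/ --cov-fail-under=20 tests/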

Conclusion

This article concludes a series in which I have focused on how we can reduce the time to value of data science projects by being more rigorous in our code development and experimentation strategies. This final article has covered a wide range of topics related to software development and how they can be applied within a data science context to improve your coding experience. The key areas focused on were leveraging DevOps platforms to their full potential, maintaining a consistent development environment, the importance of readmes and code reviews, and leveraging automation through CICD. All of these will ensure that you develop software that is robust enough to support your data science projects and provide value to your business as quickly as possible.
