A Week to a Day: Machine Learning Pipeline Optimization at Clover

Illustrations by Lisa Xu

Clover’s data science team is focused on building machine learning (ML) models designed to improve the detection and management of chronic diseases. One of the things that makes our platform unique is the feedback loop that allows for rapid iteration and increased model accuracy. In 2022, the pipeline that processes our data for ML would take almost a week to run. A week is lightning fast in healthcare IT, but as a technology company, this was not a standard we were happy with.

Join us as we share some of the optimizations our ML engineers made to bring our typical pipeline run duration down from over a week to less than a day!

Getting faster by being lazier

Our pipeline uses very common tools: Python code running in Airflow, backed by a PostgreSQL database. We followed a simple 3-step process to optimize it:

  1. Understand what our code was doing
  2. Determine what our code didn’t need to be doing
  3. Remove as much unnecessary work as possible

The first step was probably the most important. We made heavy use of profiling tools to understand what our code was doing. Since we use Google Cloud Platform, we leveraged Cloud Profiler to transparently measure where our jobs were spending the most time.
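A minimal sketch of starting the Python profiling agent at the top of a job’s entrypoint; the service name and version here are illustrative:

```python
import googlecloudprofiler

try:
    # Start the Cloud Profiler agent as early as possible in the
    # job's entrypoint. Service name and version are illustrative.
    googlecloudprofiler.start(
        service="ml-pipeline-job",
        service_version="1.0.0",
    )
except (ValueError, NotImplementedError) as exc:
    # Profiling is best-effort: don't fail the job if the agent
    # can't start (e.g., when running locally).
    print(f"Cloud Profiler not started: {exc}")
```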

Identifying unnecessary work required a strong understanding of our code. We asked ourselves questions like, “Is this database round-trip necessary?” or “Is this date-parsing function more flexible than we need?”

Once we knew what to cut, the last part was the easiest, although careful testing and comparisons were essential to ensure our changes didn’t alter the pipeline’s behavior.

With that framework in mind, here are some of the things we found and improved.

Time slowed to a crawl

We found something surprising while examining profiles for one of our longer jobs: roughly 20% of its 10+ hour runtime was spent parsing timestamps from strings. This seemed excessive, since the strings were all stored in standard ISO 8601 format.

After some investigation, we switched from pandas’ very flexible yet slow to_datetime() function to the standard library’s much more rigid and much faster datetime.fromisoformat() method. Our timestamp-parsing code then stopped running in slow motion.
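A minimal sketch of the swap (the timestamp values are illustrative):

```python
from datetime import datetime

import pandas as pd

timestamps = ["2022-01-02T03:04:05", "2022-06-07T08:09:10"]

# Before: pandas inspects each value to infer its format.
parsed = pd.to_datetime(timestamps)

# After: fromisoformat() accepts only ISO 8601, which is all we
# store, and skips the format-guessing work entirely.
parsed = [datetime.fromisoformat(ts) for ts in timestamps]
```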

Good enough protobuf

Clover makes heavy use of Protocol Buffers and gRPC for service-to-service communication, and our ML pipeline even receives its raw data as protobuf messages. These messages’ data can be accessed directly in Python code, but you can also convert them to standard Python dictionaries for more familiar semantics.

We found a 4 hour job that was spending 3 hours dutifully converting protobuf messages to dictionaries. We realized we could retrieve the same data from the messages without converting them, thereby relieving the job of 75% of its duties.
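A minimal sketch of the difference, assuming a hypothetical message type with a member_id field:

```python
from google.protobuf.json_format import MessageToDict

def member_ids_slow(messages):
    # Recursively converts every field of every message into a dict,
    # even fields we never read. (MessageToDict also renames fields
    # to lowerCamelCase by default, hence "memberId".)
    return [MessageToDict(msg)["memberId"] for msg in messages]

def member_ids_fast(messages):
    # Reads the single field we need directly off the message.
    return [msg.member_id for msg in messages]
```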

The case of the missing index

It was a tale as old as time: a database query performing a lookup on an ID column accounted for most of the runtime of an 11 hour job. But there was a unique index that included the ID column! What gives?

Upon closer inspection, the relevant ID wasn’t the first column in the unique index, meaning our PostgreSQL database elected not to use the index for lookups based solely on that ID. We quickly rectified that by adding a separate unique index just for the problematic ID, making the lookups much faster.
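A minimal sketch of the fix via SQLAlchemy, with illustrative table and column names (this assumes the ID is in fact unique on its own):

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")  # connection details elided

with engine.begin() as conn:
    # The existing unique index led with another column, e.g.
    # (source_id, member_id), so lookups on member_id alone fell
    # back to a sequential scan. A dedicated index fixes that.
    conn.execute(sa.text(
        "CREATE UNIQUE INDEX IF NOT EXISTS features_member_id_idx "
        "ON features (member_id)"
    ))
```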

Pivoting to SQL

The final step in our job to generate aggregated features for our ML models was calculating a massive pivot using the aptly named DataFrame.pivot() method. Not only did the computation eat a ton of memory and require us to run it in batches, it was also the only part of the job that required retrieving all of the data from the database. This led us to ask: could we calculate the pivot in SQL?

One very ugly set of CASE statements later, we were able to run the massive pivot entirely in our database, drastically reduce memory usage, eliminate the need for batching, and cut a 20 hour job down to only 5 hours.
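A minimal sketch of the shape of that query, with illustrative feature names; each (member_id, feature_name, value) row becomes a column per feature:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")  # connection details elided

# The real query has many more CASE branches; this shows the shape.
pivot_sql = sa.text("""
    SELECT
        member_id,
        MAX(CASE WHEN feature_name = 'a1c' THEN value END) AS a1c,
        MAX(CASE WHEN feature_name = 'bmi' THEN value END) AS bmi
    FROM aggregated_features
    GROUP BY member_id
""")
features = pd.read_sql(pivot_sql, engine)
```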

Similarly, a separate process that generates our analytic datasets for ML model training and inference was querying each aggregated feature from the database individually and joining them in memory as pandas DataFrames. We rewrote it to do as much of the joining in the database as possible and, after adding appropriate indexes, realized an even more impressive speedup: from 20 hours to 2 hours.
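A sketch of the before and after, again with illustrative names:

```python
import functools

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")  # connection details elided

# Before: one query per feature table, merged in memory one at a time.
frames = [
    pd.read_sql(sa.text(f"SELECT member_id, value AS {table} FROM {table}"), engine)
    for table in ("feature_a1c", "feature_bmi")
]
dataset = functools.reduce(
    lambda left, right: left.merge(right, on="member_id"), frames
)

# After: a single query that lets PostgreSQL do the joins.
dataset = pd.read_sql(sa.text("""
    SELECT a.member_id, a.value AS feature_a1c, b.value AS feature_bmi
    FROM feature_a1c a
    JOIN feature_bmi b USING (member_id)
"""), engine)
```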

Not-so-incremental improvements

We found several ways to improve the speed of our processing jobs, but ingestion was by far the longest-running step. At the start of our pipeline, we were reloading all raw data used for ML, even when the data hadn’t changed. This was obviously a huge waste of time, but performing incremental calculations would require drastic logic changes across our whole pipeline.

Fortunately, our raw data and the APIs we used to access it already contained the key ingredient for an incremental pipeline: a field representing the time each record was last modified. We set out to retrofit our entire pipeline with an incremental processing mode based on that timestamp.
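The heart of the approach is a watermark query; a minimal sketch with illustrative names (the real implementation handles far more edge cases):

```python
import sqlalchemy as sa

def load_changed_records(engine, watermark):
    """Fetch only the rows modified since the previous successful run.

    `watermark` is the largest last_modified value that run observed.
    """
    query = sa.text(
        "SELECT * FROM raw_records "
        "WHERE last_modified > :watermark "
        "ORDER BY last_modified"
    )
    with engine.connect() as conn:
        return conn.execute(query, {"watermark": watermark}).fetchall()
```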

This was a long and arduous journey, since our work on each step of the pipeline revealed subtle edge cases in the incremental processing logic. We performed exhaustive at-scale testing in our staging environment with rigorous before-and-after comparisons to convince ourselves we hadn’t broken anything. We also built tools to help our data scientists manage the new functionality, including a command to trigger full refreshes of downstream jobs; this was useful when changing a job’s logic, since the modified job and all downstream jobs would then need to reprocess all of their data.

Fortunately, the reward at the end made our incremental processing quest totally worth it. Ponderous ingestion jobs that ran for over 60 hours and grew longer every week were reduced to 5 hours, or even less if no new data was available. 15 hour downstream tasks also shrank to 5 hours or less depending on the amount of new data. With incremental processing fully implemented, we finally saw our regular runtimes drop below the 24-hour mark.

Learnings and what’s next

We took away several learnings from our optimization adventure:

  • Timestamp parsing can be really slow, but it doesn’t have to be if your timestamp formats are uniform
  • In-memory protobuf message conversion can be slow and should be avoided unless necessary
  • Database indexes are essential for good performance in read-heavy workloads, but they must be specified correctly
  • Offloading calculations to the database can be a large performance boost, especially when it saves you from retrieving all the data from the database
  • Most of all, try not to re-process data that hasn’t changed!

Our data scientists are very happy with the new sub-day runtime, but our ML engineers suspect they could be happier still. We’re adding data and functionality to the pipeline all the time, and we anticipate reaching the limits of Python and PostgreSQL at some point in the near future. We’re currently investigating ways to re-architect the pipeline to use more scalable compute and storage engines, such as Apache Spark and BigQuery.
