Big Savings On Big Data

By Anindya Saha & Han Wang

Image by DALL·E

Motivation

In previous articles, we talked about Lyft's ML platform, LyftLearn, which manages ML model training as well as batch predictions. With the amount of data Lyft has to process, it's natural that the cost of operating the platform is very high.

When we talked about how we democratized distributed compute, we described a solution with some key design principles such as ephemeral clusters.

In early 2022, we completed this migration. Now is the time to evaluate the impact of the design decisions over the last two years, on both increasing developer productivity and lowering cost.

Key Metrics

In this article, we define each run as executing a data/ML task using an ephemeral Spark/Ray cluster. The time and cost of runs are measured by their ephemeral Spark/Ray usage.

Runs are the way to use the LyftLearn big data system in both development and production. There are two main use cases in the development environment: running ad-hoc tasks and iterating in order to create a production workflow.

We will compare the metrics of runs between 2021 and 2022 in development (dev) and production (prod).

In 2022, we had a huge increase in production usage.

Total number of runs (%) in production and development

The total number of runs increased, with prod runs increasing much faster than dev runs. In later sections, we will explain why the increase isn't proportional between dev and prod.

We also boosted users’ development speed:

Comparison of average minutes required for one run in Development vs Production

The average per-iteration time (the blue bars) on big data dropped from 31 minutes to 11 minutes, a roughly 65% reduction.

Notice that the prod run time increased slightly due to new, heavier jobs. This also points to the fact that the large increase in prod runs is organic and isn't due to breaking up large existing workloads.

More usage and faster iterations on big data commonly require more compute resources and higher cost. How much more did we spend in 2022 vs. 2021?

Comparing the cost incurred in Production and Development

Surprisingly, in 2022 we were not only successful in controlling the overall cost, but we also managed to reduce the development cost outright.

The total dev cost dropped 32% even though dev usage slightly increased in 2022. How did we achieve that?

Comparing cost incurred per run in the last 2 years for the development and production environments

We were able to reduce the average dev per-run cost from $25 to $10.7, a 57% reduction.

Reducing Compute Costs

In the previous article, we mentioned that the LyftLearn platform enforces ephemeral clusters. In the LyftLearn notebook experience, users can declare cluster resources for every step of their workflow. In the image below, a user is requesting a Spark cluster with 8 machines, each with 8 CPUs and 32 GB of RAM. The cluster is ephemeral and only exists for the duration of the SparkSQL query.

Defining Spark cluster configuration
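The exact notebook syntax is internal to LyftLearn, but conceptually the request boils down to a handful of standard Spark properties. Here is a minimal, hypothetical sketch of the configuration shown in the image (the surrounding variable name is invented for illustration):

```python
# Hypothetical sketch only -- LyftLearn's notebook API is internal.
# These are standard Spark properties matching the request above:
cluster_conf = {
    "spark.executor.instances": "8",   # 8 machines
    "spark.executor.cores": "8",       # 8 CPUs per machine
    "spark.executor.memory": "32g",    # 32 GB of RAM per machine
}
# The platform creates the cluster for this one query and tears it
# down as soon as the query finishes.
```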

Using ephemeral clusters has contributed a significant portion of the total savings. Managed platforms like AWS Elastic MapReduce tend to require a data scientist to spin up a cluster and then develop on top of that cluster. This leads to under-utilization (due to idling) during project iteration. Ephemeral clusters ensure users are allocated costly resources only when necessary.

It's also important to mention that LyftLearn Spark does not rely on autoscaling. Autoscaling can lead to instability or underutilization, and it's less useful when the clusters are already ephemeral. We found similar patterns discussed in an article published by Sync Computing.

The benefits of being explicit about compute resources are:

  1. Users are aware of the resources they really need for their use cases.
  2. Resource contention in the K8s clusters is reduced.

Many LyftLearn users are surprised by the spin-up time (2–5 seconds), made possible by running Spark on Kubernetes with cached images. Ephemeral clusters also directly reduce maintenance because different steps of a workflow can be executed using different images, separating packages that conflict with one another (i.e., requiring different versions of dependencies).

Another big part of the cost savings is choosing the tool that's most effective for the job. This is most evident with Presto and Hive. In this article, we shared the best practices for choosing between them:

Presto is good for aggregation and small-output scenarios; it shouldn't take more than 10 minutes. If Presto is slow, try Hive.

Hive is slower but generally more scalable. Always try to save the output to files instead of dumping it into Pandas.

As more big data frameworks come into the data science landscape, we need to choose the best tool for each part of the job. One of the essential pieces of the LyftLearn platform is giving data practitioners the flexibility and ease to choose the best tool for each job.

For example, some data pipelines inside Lyft leverage Spark for preprocessing and Ray for the distributed machine learning portion. This is also specifically enabled by ephemeral clusters. (Watch our Data AI Summit 2022 talk.)

Another less-tracked form of savings is the hours saved through the operational efficiencies gained from the LyftLearn platform. The large reduction in dev run time and the higher ratio of prod runs to dev runs directly translate into data scientists spending more time on modeling and scientific computing. More importantly, more projects make it to production to generate real business value.

Accelerating Development Iterations

Our compute abstraction layer, built on top of the open-source project Fugue, plays the key role in accelerating development iterations. It optimizes big data workstreams in three ways:

With a backend-agnostic design, we let users develop and iterate on local data before moving to a cluster. Only well-tested code ends up running on clusters. This explains why in 2022 the increases in prod and dev runs weren't proportional: a large portion of the iterations happened locally without using clusters.

This is one of the most important sources of LyftLearn savings.
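To make the idea concrete, here is a minimal sketch of the local-first pattern using Fugue's transform (the function and data are invented for illustration):

```python
import pandas as pd
from fugue import transform

# Scale-agnostic logic, written and tested against plain Pandas
def add_score(df: pd.DataFrame) -> pd.DataFrame:
    df["score"] = df["x"] * 2
    return df

sample = pd.DataFrame({"x": [1, 2, 3]})

# No engine specified: runs locally on Pandas, no cluster involved
local_result = transform(sample, add_score, schema="*,score:long")

# Once the logic is well tested, the same call is promoted to an
# ephemeral cluster by switching only the engine, e.g.:
#   transform(big_df, add_score, schema="*,score:long", engine="spark")
```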

Developing a complex Hive (Spark) query with hundreds of lines is one of the biggest and most common challenges for Lyft ML practitioners. Due to the Common Table Expression (CTE) syntax, breaking up a SQL query into small, independently runnable subqueries isn't practical. Iterating on such queries requires re-running the entire query each time. Worse, when a complex query never finishes, the owner can't even tell which step caused the problem. Retrying is inefficient and incurs big costs too.

FugueSQL is a superset of traditional SQL with improved syntax and features: it doesn't require CTEs. Instead, its assignment syntax makes a SQL query easy to break up and combine.

Breaking up and combining complex SQL queries using FugueSQL
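The pattern in the image looks roughly like the following two notebook cells (assuming the Fugue notebook extension is loaded so the %%fsql magic is available; the table and column names are illustrative, not the original query):

```sql
%%fsql spark
-- Cell 1: run the first part and cache the result to a file
SELECT user_id, SUM(fare) AS total_fare
  FROM rides
 GROUP BY user_id
 YIELD FILE AS b
```

```sql
%%fsql spark
-- Cell 2: b is loaded back from the cached file; cell 1 never re-runs
SELECT * FROM b WHERE total_fare > 100
PRINT
```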

In the above example, let's assume the original Hive query has unknown issues. We can rewrite it in FugueSQL and break it into multiple parts to iterate on. In the first cell, YIELD FILE will cache b to a file (saved by Spark) and make the reference available to the following cells. In the second cell, we can directly use b, which will be loaded from S3. Lastly, we can also print the result to verify it. This way we can quickly debug issues. More importantly, with caching, finished cells will not need to be re-run in subsequent iterations.

When multiple parts work end to end, we just copy-paste them together and remove the YIELD. Notice we also add a PERSIST to b, because it's going to be used twice in the following steps. This explicitly tells Spark to cache the result to avoid recomputation.
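Continuing the illustrative query above, the combined version might look like this:

```sql
%%fsql spark
-- Combined for production: PERSIST caches b in memory because
-- it is consumed twice below (names are still illustrative)
b = SELECT user_id, SUM(fare) AS total_fare
      FROM rides
     GROUP BY user_id
    PERSIST

SELECT * FROM b WHERE total_fare > 100
PRINT

SELECT * FROM b WHERE total_fare <= 100
PRINT
```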

FugueSQL should generate results equivalent to the original SQL, but it has significant benefits:

  1. Divide-and-conquer becomes possible for SQL, significantly speeding up iteration on complex problems.
  2. The final FugueSQL is often faster than the original SQL (if we explicitly cache the intermediate steps to avoid recomputation).

We can also easily reconstruct the traditional Hive SQL once we have fixed all the problems during the iterations. The slowest and most expensive part is always the development iterations, which we can improve using the Fugue approach.

We don't require users to modernize their entire workloads in one shot. Instead, we encourage them to migrate incrementally with the necessary refactoring.

There are many existing workloads written with small-data tooling such as Pandas and scikit-learn. In many cases, if one step is compute intensive, users can refactor their code to separate out the core computing logic, then use a single Fugue transform call to distribute that logic, as sketched below.
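Here is a minimal sketch of that refactoring (the per-group function and data are hypothetical stand-ins for the heavy step):

```python
import pandas as pd
from fugue import transform

# The compute-intensive step, factored out of an existing Pandas
# workload; in practice this could fit a scikit-learn model per group
def fit_per_group(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame([{"group": df["group"].iloc[0], "rows": len(df)}])

data = pd.DataFrame({"group": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})

# One transform call distributes the logic per group on an ephemeral
# Spark cluster; the rest of the workload stays untouched
result = transform(
    data,
    fit_per_group,
    schema="group:str,rows:long",
    partition={"by": "group"},
    engine="spark",
)
```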

Consequently, incremental adoption is also a natural process for users to adopt good coding practices and write high-quality code that is scale agnostic and framework (Spark, Ray, Fugue, etc.) agnostic.

Conclusion

The metrics shown from 2021 to 2022 demonstrate both a productivity boost and cost savings, and they don't even include the benefits of the human-hours saved by the improved development speed. Lyft's top line also increased thanks to the ML models that were able to reach production with the support of the LyftLearn platform.

Developing big data projects can be significantly expensive in both money and time, but LyftLearn succeeded in bringing down costs by enforcing best practices, simplifying the programming model, and accelerating iterations.

As always, Lyft is hiring! If you're passionate about developing state-of-the-art systems, join our team.
