Big Savings On Big Data


By Anindya Saha & Han Wang

Image by DALL·E

In previous articles, we talked about Lyft's ML platform, LyftLearn, which manages ML model training as well as batch predictions. With the amount of data Lyft has to process, it's natural that the cost of operating the platform is very high.

When we talked about how we democratized distributed compute, we described a solution with some key design principles such as .

In early 2022, we completed this migration. Now is the time to evaluate the impact of the design decisions over the last two years, in both increasing developer productivity and lowering cost.

In this article, we define each run as executing a data/ML task using an ephemeral Spark/Ray cluster. The time and cost of runs are measured by their ephemeral Spark/Ray usage.

Runs are the way to use the LyftLearn big data system in both development and production. There are two main use cases in the development environment: running ad-hoc tasks and iterating in order to create a production workflow.

We will compare the metrics of runs between 2021 and 2022 in development () and production ().

In 2022, we had a huge increase in production usage.

Total number of runs (%) in production and development

The total number of runs increased and prod runs increased . In later sections, we will explain why the increase is not proportional between dev and prod.

We also boosted users’ development speed:

Comparison of average minutes required for one run in Development vs Production

The average per-iteration time (the blue bars) on big data was reduced from 31 minutes to 11 minutes. That shows .

Notice that the prod run time increased slightly due to new, heavier jobs. This also points to the fact that the large increase in prod runs is organic and is not due to breaking up large existing workloads.

More usage and faster iterations on big data normally require more compute resources and higher cost. How much more did we spend in 2022 vs 2021?

Comparing the cost incurred in Production and Development

Surprisingly, in 2022, not only were we successful in controlling the overall cost (), but we also managed to .

The total dev cost was reduced by 32% even though dev usage slightly increased in 2022. How did we achieve that?

Comparing cost incurred per run in the last 2 years for the development and production environments

We were able to reduce the average dev per-run cost from $25 to $10.7 (-57%). That means .

Another data point worth mentioning: .

In the previous article, we mentioned that the LyftLearn platform enforces ephemeral clusters. In the LyftLearn notebook experience, users can declare cluster resources for each step of their workflow. In the image below, a user is requesting a Spark cluster with 8 machines, each with 8 CPUs and 32 GB of RAM. The cluster is ephemeral and only exists for the duration of the SparkSQL query.

Defining Spark cluster configuration
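In code form, such a per-step resource declaration amounts to a small mapping like the sketch below. The key names are illustrative only; LyftLearn's actual configuration API may differ:

```python
# Hypothetical per-step cluster request mirroring the image above:
# 8 machines, each with 8 CPUs and 32 GB of RAM, torn down after the query.
spark_conf = {
    "num_workers": 8,
    "cpus_per_worker": 8,
    "memory_per_worker_gb": 32,
}

# These resources are held only for the lifetime of the SparkSQL query.
total_cpus = spark_conf["num_workers"] * spark_conf["cpus_per_worker"]
total_memory_gb = spark_conf["num_workers"] * spark_conf["memory_per_worker_gb"]
```

Because the declaration is attached to a single step rather than a session, nothing is reserved while the user is reading results or editing code.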

Using ephemeral clusters has contributed a significant portion of the total savings. Managed platforms like AWS Elastic MapReduce tend to require a data scientist to spin up a cluster and then develop on top of it. This leads to under-utilization (due to idling) during project iteration. Ephemeral clusters ensure users are allocated costly resources only when necessary.

LyftLearn Spark autoscaling is also worth mentioning: autoscaling can lead to instability or underutilization, and it is less useful when the clusters are already ephemeral. We also found similar patterns discussed in this article published by Sync Computing.

The benefits of being explicit about compute resources are:

  1. Users are aware of the resources they actually need for their use cases.
  2. Resource contention in the K8s clusters is reduced.

Many LyftLearn users are surprised by the spin-up time (2–5 seconds), thanks to Kubernetes Spark with cached images. Ephemeral clusters also directly reduce maintenance because different steps of a workflow can be executed using different images, separating packages that conflict with each other (i.e. requiring different versions of dependencies).

Another big part of the cost savings is choosing the tool that is most effective for the job. This is most evident with Presto and Hive. In this article, we shared the best practices for choosing between them:

Presto is good for aggregation and small-output scenarios; it shouldn't take more than 10 minutes. If Presto is slow, try Hive.

Hive is slower but generally more scalable. Always try to save the output to files instead of dumping it into Pandas.

As more big data frameworks enter the landscape of data science, we need to choose the best tool for each part of the job. One of the essential pieces of the LyftLearn platform is giving data practitioners the flexibility and ease to choose the best tool for each job.

For instance, some data pipelines within Lyft leverage Spark for preprocessing and Ray for the distributed machine learning portion. This is also specifically enabled by ephemeral clusters. (Watch our Data AI Summit 2022 talk.)

Another less-tracked form of savings is the hours saved due to operational efficiencies gained from the LyftLearn platform. The large reduction in dev run time and the higher ratio of prod to dev runs directly translate into data scientists having more time to spend on modeling and scientific computing. More importantly, more projects make it to production to generate real business value.

Our abstraction layer for compute, built on top of the open-source project Fugue, plays the key role in accelerating development iterations. It optimizes big data workstreams in three ways:

With a backend-agnostic design, we . Only well-tested code ends up running on clusters. This explains why in 2022 the increases in prod and dev runs were not proportional: a large portion of the iterations happened locally without using clusters.
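As an illustration of that backend-agnostic pattern, the core logic can be written and iterated on as plain Pandas, with a cluster only involved at the very end. The function and column names below are hypothetical, not LyftLearn's actual code:

```python
import pandas as pd

def add_tip_rate(df: pd.DataFrame) -> pd.DataFrame:
    # Core logic is plain pandas: no Spark/Ray imports, so it can be
    # tested locally on a small sample in seconds.
    return df.assign(tip_rate=df["tip"] / df["fare"])

# Local iteration: runs instantly, no cluster involved.
sample = pd.DataFrame({"fare": [10.0, 20.0], "tip": [1.0, 5.0]})
local_result = add_tip_rate(sample)

# Once the logic is verified, the same function can be handed to Fugue to
# run on an ephemeral cluster, along the lines of (not executed here):
# from fugue import transform
# transform(big_df, add_tip_rate, schema="*,tip_rate:double", engine="spark")
```

Because the expensive cluster only ever runs code that already passed local iteration, failed runs on big data become rare.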

This is one of the most important sources of LyftLearn savings.

Developing a complex Hive (Spark) query with hundreds of lines is one of the biggest and most common challenges for Lyft ML practitioners. Due to the Common Table Expression (CTE) syntax, breaking up a SQL query into small subqueries to run is not practical. Iterating on such queries requires re-running the whole query each time. In a worse situation, when a complex query never finishes, the owner can't even tell which step caused the problem. Retrying is inefficient and incurs big costs too.

FugueSQL is a superset of traditional SQL with improved syntax and features: it doesn't require CTEs. Instead, its task syntax makes a SQL query easy to break up and combine.

Breaking up and combining complex SQL queries using FugueSQL

In the above example, let's assume the original Hive query has unknown issues. We can rewrite it in FugueSQL and break it up into multiple parts to iterate on. In the first cell, YIELD FILE will cache b to a file (saved by Spark) and make the reference available to the following cells. In the second cell, we can directly use b, which will be loaded from S3. Lastly, we can also print the result to verify it. In this way we can quickly debug issues. More importantly, with caching, finished cells will not need to be re-run in subsequent iterations.

When multiple parts work end to end, we just copy-paste them together and remove the YIELD. Notice we also add a PERSIST to b, since it will be used twice in the following steps. This explicitly tells Spark to cache the result to avoid recomputation.
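Put together, the iteration pattern described above looks roughly like the sketch below. The table and column names are hypothetical, and the exact cell setup may differ from LyftLearn's notebooks:

```sql
-- Cell 1: run the expensive upstream part once; YIELD FILE caches b
b = SELECT user_id, SUM(fare) AS total_fare
      FROM rides
     GROUP BY user_id
YIELD FILE AS b

-- Cell 2: iterate freely on downstream logic; b loads from the cached file
SELECT * FROM b WHERE total_fare > 100
PRINT

-- Final combined version: YIELD removed; PERSIST added because b is
-- consumed twice, telling Spark to cache it and avoid recomputation
b = SELECT user_id, SUM(fare) AS total_fare FROM rides GROUP BY user_id PERSIST
high = SELECT * FROM b WHERE total_fare > 100
low  = SELECT * FROM b WHERE total_fare <= 100
```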

FugueSQL should generate results equivalent to the original SQL, but it has significant benefits:

  1. Divide-and-conquer becomes possible for SQL, significantly speeding up iteration time on complex problems.
  2. The final FugueSQL is often faster than the original SQL (if we explicitly cache the intermediate steps to avoid recomputation).

We can also easily reconstruct the traditional Hive SQL after we fix all the problems in the iterations. The slowest and most expensive part is always the development iterations, which we can improve using the Fugue approach.

We don't require users to modernize their entire workloads in one shot. Instead, we encourage them to migrate incrementally with the necessary refactoring.

There are many existing workloads written with small-data tooling such as Pandas and scikit-learn. In a lot of cases, if one step is compute-intensive, users can refactor their code to separate out the core computing logic, then use one Fugue transform call to distribute it.
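A minimal sketch of that refactoring, with illustrative names rather than Lyft's actual code: the compute-intensive core is split into a pure function that operates on one partition, which a distributed engine can then map over partitions.

```python
import pandas as pd

def fit_one_region(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-region computation: a simple aggregate standing in
    # for real model fitting. It only sees one partition's data.
    return pd.DataFrame({"region": [df["region"].iloc[0]],
                         "avg_fare": [df["fare"].mean()]})

data = pd.DataFrame({"region": ["sf", "sf", "la"], "fare": [10.0, 30.0, 8.0]})

# Small data: run with plain pandas groupby, no cluster needed.
local = pd.concat(fit_one_region(g) for _, g in data.groupby("region"))

# Big data: the identical function could be distributed with one Fugue
# transform call, along the lines of (not executed here):
# from fugue import transform
# transform(data, fit_one_region, schema="region:str,avg_fare:double",
#           partition={"by": "region"}, engine="spark")
```

Because the core function has no Spark or Ray dependency, the same code stays valid whether it runs on a laptop sample or a full production dataset.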

Therefore, incremental adoption is also a natural process for users to adopt good coding practices and write high-quality code that is scale-agnostic and framework-agnostic (Spark, Ray, Fugue, etc.).

The metrics shown from 2021 to 2022 demonstrate both a productivity boost and cost savings, and they don't even include the benefits from the human-hours saved by the improved development speed. Lyft's top line also increased from the ML models that were able to reach production with the support of the LyftLearn platform.

Developing big data projects can be significantly expensive in both time and money, but LyftLearn succeeded in bringing down costs by enforcing best practices, simplifying the programming model, and accelerating iterations.

As always, Lyft is hiring! If you're passionate about developing state-of-the-art systems, join our team.

