MLOps

Machine Learning at Scale: Managing More Than One Model in Production

yourself how real machine learning products actually run in major tech companies or departments? If so, this article is for you 🙂 Before discussing scalability, feel free to read my first article on...

Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

Introduction a continuous variable for 4 different products. The machine learning pipeline was built in Databricks and has two major components: feature preparation in SQL with serverless compute, and inference on an ensemble of several hundred models using...

Scaling Feature Engineering Pipelines with Feast and Ray

project involving the construction of propensity models to predict customers’ prospective purchases, I encountered feature engineering issues that I had seen several times before. These challenges can be broadly classified into two categories: 1)...

Breaking the Host Memory Bottleneck: How Peer Direct Transformed Gaudi’s Cloud Performance

introduced Gaudi accelerators to Amazon’s EC2 DL1 instances, we faced a challenge that threatened the entire deployment. The performance numbers were not only disappointing; they were disastrous. Models that required training effectively were...

AWS vs. Azure: A Deep Dive into Model Training – Part 2

In Part 1 of this series, we saw how Azure and AWS take fundamentally different approaches to machine learning project management and data storage. Azure ML uses a workspace-centric structure with user-level role-based access control (RBAC),...

Machine Learning in Production? What This Really Means

, whether you’re a manager, a data scientist, an engineer, or a product owner, you’ve almost certainly been in at least one meeting where the discussion revolved around “putting a model in production.” But...

Azure ML vs. AWS SageMaker: A Deep Dive into Model Training — Part 1

(AWS) are the world’s two largest cloud computing platforms, providing database, network, and compute resources at global scale. Together, they hold about 50% of the worldwide enterprise cloud infrastructure services market—AWS at 30%...

Why Your ML Model Works in Training But Fails in Production

, I worked on real-time fraud detection systems and recommendation models for product companies that looked excellent during development. Offline metrics were strong. AUC curves were stable across validation windows. Feature importance plots told...
