Nearly 87% of ML projects never reach production. The failures aren't about models — they're about engineering.
A data scientist at a Series C fintech told me she spent four months building a fraud detection model that hit 96% precision in her Jupyter notebook. The team celebrated. Then an ML engineer took over, and it took another five months to get that same model running in production — handling 12,000 requests per second, retraining weekly on fresh transaction data, and failing gracefully when upstream services went down.
That 96% precision? It dropped to 89% within two weeks of deployment because the production data distribution didn't match the training data. The ML engineer caught the drift, set up automated monitoring, and built a retraining pipeline that kept accuracy above 93%.
The data scientist built the brain. The ML engineer built the body it lives in. Both jobs are hard. But only one of them is responsible for keeping the thing alive at 3 AM when the feature store crashes.
ML engineering is one of those rare fields where the demand numbers are genuinely hard to believe. Let me walk through them.
The Bureau of Labor Statistics projects computer and information research scientist roles — the closest BLS category to ML engineering — to grow 20% from 2024 to 2034, which is "much faster than average" in their language. But that 20% understates what's happening specifically in ML, because BLS categories were defined before the field existed as a distinct discipline.
More telling: AI/ML job postings increased 89% in the first half of 2025 alone, and 150% year-over-year. The US market faces a talent deficit where demand outstrips supply by a 3.2:1 ratio. There are currently over 2,800 open ML engineer positions on LinkedIn at any given time, at companies like Amazon, Netflix, Spotify, TikTok, and Ford.
The World Economic Forum's Future of Jobs Report projects demand for AI and machine learning specialists to rise by 40% — or 1 million jobs — over the next five years. And Gartner estimates that 70% of enterprises will operationalize AI architectures using MLOps by 2026, up from about 20% in 2022.
Here's what the major platforms report for ML engineer compensation in 2026:
| Source | Base Salary | Total Comp | Notes |
|---|---|---|---|
| Glassdoor | $160,347 | ~$190K | Broad sample, includes non-tech |
| Built In | $162,080 | $212,022 | Includes bonuses and equity |
| Levels.fyi | $190,000 | $261,683 | Skews Big Tech, verified data |
| Indeed | $186,447 | N/A | Self-reported |
The spread between these numbers tells a story. If you're an ML engineer at a mid-size company or outside a major tech hub, $155K-$175K base is realistic. At a FAANG or well-funded startup in SF/NYC, you're looking at $185K-$220K base with total comp pushing $300K+ at senior levels. Senior ML roles at Meta reach $325,000+ in total compensation.
The experience breakdown matters too. Entry-level positions start at $120K-$147K, mid-level sits at $145K-$190K, and senior roles command $185K-$230K in base alone. Add equity at a pre-IPO company and the total comp number can get silly.
Most "What is an ML Engineer" articles give you a bullet-point list of responsibilities that could describe half a dozen different roles. Here's what the job actually looks like day-to-day, based on real job descriptions and conversations with people doing the work.
Here's the thing most people don't realize: ML engineering is mostly not about machine learning. It's about building reliable software systems that happen to include a machine learning component. The model itself might be 5% of the code in a production ML system. The other 95% is data pipelines, serving infrastructure, monitoring, logging, error handling, and configuration.
Google published a famous paper about this. They called it "technical debt in machine learning systems," and their key insight was that the ML code is a tiny fraction of a real-world ML system. Everything around it — data collection, feature extraction, configuration, serving infrastructure, monitoring — dwarfs the model itself.
A fintech MLOps engineer's typical day looks something like this: morning starts by checking model monitoring dashboards for anomalies. Did overnight batch predictions look normal? Are latency percentiles still within SLA? Mid-morning, debugging a failing CI/CD pipeline for a new model version — turns out a dependency update broke the container build. Afternoon, working with a data scientist to package their model for deployment — they built it in a notebook, now it needs to handle concurrent requests, input validation, and graceful degradation. Late afternoon, setting up A/B testing infrastructure for a model update.
Notice what's missing? Nobody sat down and trained a model. That's a real week for many ML engineers. The training happens, but it's a small percentage of the actual work.
The tooling has matured significantly, but it's also gotten more complex. Here's what a production ML stack actually looks like in 2026, with honest assessments of each layer.
MLflow remains the most widely adopted open-source platform here. It handles experiment tracking, model versioning, and deployment across environments. MLflow 3.x added native support for LLM tracking and GenAI observability, which matters now that most teams are running both traditional ML and LLM workloads.
Weights & Biases (W&B) is the main alternative for teams that want a more polished UI and better collaboration features. W&B Weave added agent trace visualization in 2025, which is useful if you're debugging multi-step LLM agent workflows alongside traditional model experiments.
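To make the "experiment tracking" layer concrete, here's a minimal file-based sketch of what trackers like MLflow and W&B record per run: a unique run id, hyperparameters, metrics, and a timestamp. All function and field names here are illustrative, not the MLflow API — the real tools add UIs, artifact storage, and model registries on top of this core idea.

```python
import json
import time
import uuid
from pathlib import Path

def log_run(params: dict, metrics: dict, root: str = "runs") -> str:
    """Record one training run's params and metrics as a JSON file."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return run_id

def best_run(root: str = "runs", metric: str = "val_accuracy") -> dict:
    """Return the run record with the highest value for `metric`."""
    records = [json.loads(p.read_text()) for p in Path(root).glob("*/run.json")]
    return max(records, key=lambda r: r["metrics"].get(metric, float("-inf")))

run_a = log_run({"lr": 1e-3, "layers": 4}, {"val_accuracy": 0.91})
run_b = log_run({"lr": 1e-4, "layers": 8}, {"val_accuracy": 0.94})
print(best_run()["params"])  # the lr=1e-4 run wins
```

Once you've rolled this by hand a few times, the value of a real tracker becomes obvious: it's this, plus concurrency, search, and a registry that your deployment pipeline can pull from.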
```yaml
# A realistic training infrastructure setup in 2026
training:
  frameworks: PyTorch (dominant), JAX (growing for research)
  distributed: PyTorch FSDP, DeepSpeed, Ray Train
  compute: AWS SageMaker / GCP Vertex AI / Azure ML
  data: Spark for batch, Kafka for streaming
  feature_store: Feast (open-source) or Tecton (managed)
  versioning: DVC for data, MLflow for models
serving:
  real_time: Triton Inference Server, BentoML, TorchServe
  batch: Spark, Ray Batch Inference
  optimization: ONNX Runtime, TensorRT, vLLM (for LLMs)
monitoring:
  data_drift: Evidently AI, NannyML
  model_performance: Arize, WhyLabs
  infrastructure: Prometheus + Grafana
```
PyTorch has won. I know that sounds definitive, but the data backs it up. The 2025 Stack Overflow developer survey shows PyTorch well ahead, and the overwhelming majority of new ML research papers use it. TensorFlow still exists in production at companies that adopted it early, but new projects default to PyTorch.
JAX is the interesting wildcard. Google uses it internally, and it's gaining traction in research labs for its composability and JIT compilation. Some teams doing heavy numerical computing or custom hardware acceleration prefer JAX for its functional paradigm and XLA compilation. But for most production ML engineering, PyTorch plus PyTorch Lightning or the built-in FSDP is the standard answer.
One more thing on frameworks: scikit-learn isn't dead. Far from it. For tabular data problems — which still make up the majority of production ML use cases at non-tech companies — scikit-learn plus XGBoost or LightGBM remains the right tool. Not everything needs deep learning. A gradient-boosted tree that trains in 30 seconds and serves in 2 milliseconds beats a transformer that takes 6 hours to train and 200 milliseconds to serve, especially when the accuracy difference is 0.3%. I see too many engineers reaching for deep learning as a default when simpler models would outperform on their actual data.
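A quick sanity check of the "simple models win on tabular data" claim, assuming scikit-learn is installed. On a synthetic tabular dataset, a gradient-boosted model trains in seconds and serves single rows in low single-digit milliseconds — the dataset and hyperparameters here are arbitrary, chosen only to illustrate the speed/accuracy profile.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a typical tabular problem
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, random_state=0)

t0 = time.perf_counter()
model.fit(X_tr, y_tr)
train_s = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X_te[:1])  # single-row latency, as in real-time serving
latency_ms = (time.perf_counter() - t0) * 1000

acc = model.score(X_te, y_te)
print(f"accuracy={acc:.3f}  train={train_s:.1f}s  single-row latency={latency_ms:.2f}ms")
```

Run the same comparison against a small neural net on your own tabular data before committing to the heavier architecture; more often than not, the tree wins on all three numbers.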
Feature stores are probably the most underappreciated piece of ML infrastructure. They solve a deceptively hard problem: making sure the features you train on are the same features you serve predictions with.
Without a feature store, you end up with what's called training-serving skew — the most common and most damaging reason ML models fail in production. Your training pipeline computes features one way, your serving pipeline computes them slightly differently, and your model silently degrades.
```python
# The training-serving skew problem, simplified

# Training time (batch, computed in Spark):
user_avg_spend = df.groupBy("user_id").agg(avg("amount"))

# Serving time (real-time, computed in Python):
user_avg_spend = sum(recent_transactions) / len(recent_transactions)

# These look the same but they're NOT:
# - Different time windows
# - Different null handling
# - Different precision (float64 vs float32)
# Result: model accuracy drops 3-7% and nobody knows why
```
Feast is the dominant open-source feature store. Tecton is the managed option that handles the operational complexity. Both solve the same core problem: compute features once, serve them consistently everywhere.
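The principle behind both tools can be shown in a few lines of plain Python: define each feature's logic exactly once and call the same function from the training pipeline and the serving path. This is a minimal sketch of the idea, not the Feast API — real feature stores add registries, point-in-time joins, and online/offline storage around it.

```python
from statistics import mean

def user_avg_spend(amounts: list, window: int = 30) -> float:
    """Single source of truth for this feature: same window,
    same null handling, same math in training AND serving."""
    recent = [a for a in amounts[-window:] if a is not None]
    return mean(recent) if recent else 0.0

# Both pipelines import and call the SAME function, so there is
# no second implementation that can silently drift out of sync.
history = [120.0, None, 80.0, 100.0]
print(user_avg_spend(history))  # 100.0 (None dropped, not counted as zero)
```

Notice that the null-handling decision (drop `None` rather than treat it as zero) is made once, in one place. In the skewed version above, that decision gets made twice, by two different people, in two different codebases.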
The 87% figure from the opening gets thrown around a lot, and it's real: nearly 87% of machine learning projects never make it to production. But the reasons are more nuanced than "ML is hard."
The failures cluster into predictable patterns:
1. The Data Problem (40% of failures)
The model works great on the test set. Then it meets production data. Missing values where there shouldn't be any. Feature distributions that shifted since the training data was collected. Upstream schema changes that nobody communicated.
An ACM Computing Surveys study documented this systematically: data quality issues, data pipeline failures, and training-serving skew account for the largest share of production ML failures. Not model architecture. Not hyperparameter tuning. Data.
2. The Infrastructure Gap (30% of failures)
Many organizations don't have the infrastructure to deploy models. The data scientist built something brilliant in a notebook, but there's no model serving framework, no CI/CD for models, no monitoring, and no way to roll back when something goes wrong.
This is why the DevOps-to-MLOps transition is one of the fastest-growing career paths in 2026. Companies need people who can build the infrastructure that makes deployment possible.
3. The Monitoring Void (20% of failures)
Here's a genuinely alarming stat: half of ML practitioners don't monitor their production models at all. They deploy the model, celebrate, and move on. Nobody watches for data drift, model degradation, or silent failures.
Unlike traditional software, where a bug produces an error, ML failures are silent. The model doesn't crash; it just starts being wrong. Predictions degrade gradually, and unless you're measuring, you won't notice until a customer complains or a quarterly report looks off.
4. The Organizational Problem (10% of failures)
The data scientist reports to the analytics team. The ML engineer reports to platform engineering. The product manager doesn't understand what either of them does. Nobody owns the end-to-end system, so it falls through the cracks.
I've looked at ML engineer job postings from Meta, Apple, Microsoft, and dozens of mid-size companies. Here's what they're actually screening for, ranked by importance.
Here's something that surprises people: ML engineer interviews at top companies are heavily focused on coding and system design, not ML theory. At Meta, the coding bar is closer to a pure software engineer loop. You'll whiteboard algorithms, design distributed systems, and then discuss ML-specific topics.
The reasoning is straightforward. Companies can teach ML-specific skills to a strong engineer. They can't teach a mediocre engineer to write production-quality code. Strong fundamentals first, ML specialization second.
I wrote about AI Engineering recently, and the most common question I got was "how is this different from ML engineering?" Fair question. Here's the honest answer.
| Dimension | ML Engineer | AI Engineer |
|---|---|---|
| Core work | Training, deploying, monitoring models | Building apps around existing models |
| Trains models? | Yes, from scratch or fine-tuned | Rarely — calls APIs |
| Key tools | PyTorch, MLflow, Kubernetes, Spark | LangChain, vector DBs, LLM APIs |
| Data work | Heavy — pipelines, feature stores, drift | Lighter — embeddings, retrieval |
| Math required | Significant — stats, linear algebra, optimization | Minimal — mostly applied |
| Primary concern | "Is this model accurate and reliable at scale?" | "Does this AI feature work for users?" |
| Career origin | Data science or software engineering | Software engineering |
| Avg salary (2026) | ~$160K-$190K base | ~$140K-$185K base |
The fundamental difference: ML engineers build the models and the systems that run them. AI engineers build products that use those models.
An ML engineer might spend a month building a recommendation system from scratch — collecting training data, designing the model architecture, training it across a GPU cluster, deploying it with low-latency serving, and monitoring for quality degradation. An AI engineer might spend that same month building a chatbot that uses an existing LLM API, with RAG for knowledge retrieval and tool calling for actions.
Both are valuable. But they require different skills and different mindsets. ML engineering is deeper and more technical. AI engineering is broader and more product-focused. The market currently pays a slight premium for AI engineering titles, but I think that'll correct as the novelty wears off.
Whether you're transitioning from software engineering, data science, or starting fresh, here's what actually works in 2026.
You already have the hardest part — production engineering skills. You need to add ML knowledge on top.
```python
# Phase 1: ML Foundations (months 1-2)
foundations = {
    "course": "Andrew Ng's ML Specialization (still the best starting point)",
    "framework": "Start with scikit-learn, then PyTorch",
    "practice": "Kaggle competitions (focus on tabular data first)",
    "math": "Linear algebra and probability refresher (3Blue1Brown)",
}

# Phase 2: Production ML (months 3-4)
production = {
    "deploy": "Build an end-to-end ML service with FastAPI + Docker",
    "monitor": "Add data drift detection with Evidently AI",
    "pipeline": "Build a retraining pipeline with Airflow or Prefect",
    "feature_store": "Set up Feast for feature management",
}

# Phase 3: Specialize (months 5-6)
specialize = {
    "nlp": "Fine-tune a transformer model on domain-specific data",
    "systems": "Distributed training with PyTorch FSDP",
    "mlops": "Full CI/CD pipeline for model deployment",
}
```
You know the ML. You need to learn the engineering.
Be realistic about the timeline. ML engineering requires both software engineering skills and ML knowledge. You can't shortcut either one.
Start with Python and software engineering fundamentals. Build several non-ML projects first — a web scraper, a REST API, a CLI tool. Get comfortable with Git, testing, and Docker. Then layer on ML knowledge through courses and hands-on projects. Andrew Ng's Machine Learning Specialization is still the best starting point, followed by fast.ai for practical deep learning.
The fastest path in is through a related role — data engineering, backend engineering, or data analytics — then transitioning internally. This works because you're learning production skills on the job while building ML knowledge on the side. I've seen more successful ML engineer transitions from backend engineering than from any bootcamp or master's program. The engineering instincts transfer; the ML knowledge can be learned.
Here's my honest take on ML engineering in 2026.
ML engineering is the best long-term career bet in tech right now. Not AI engineering, which I think is partially a hype-driven title that'll blur back into software engineering as AI becomes standard. Not data science, which is being squeezed from both sides — ML engineers are taking the model-building work, and BI tools are automating the analysis work.
ML engineering is durable because it sits at the intersection of two things that are both getting more important: software systems and machine learning. Every year, more products include ML components. Every year, those components need to be more reliable, more scalable, and more cost-efficient. That's ML engineering.
The 87% failure rate isn't going away any time soon. It's not a technical problem — the tools exist to deploy models reliably. It's a people problem. Companies need engineers who understand both ML and production systems, and that combination is genuinely rare. The 3.2:1 demand-to-supply ratio isn't closing because training ML engineers takes time.
My controversial opinion: the GenAI boom has actually been bad for ML engineering hiring. Companies are so focused on LLM wrappers and chatbots that they're underinvesting in traditional ML infrastructure — the recommendation systems, fraud detection, pricing models, and search ranking that actually drive most of their revenue. The engineers maintaining those systems are stretched thin while the "AI team" builds the fifth internal ChatGPT clone.
This will correct. When the LLM hype normalizes (and it will — it always does), companies will remember that their recommendation engine drives 35% of revenue and they haven't upgraded the training pipeline in two years. That's when ML engineering demand spikes again, and the engineers who stayed sharp on fundamentals — not just LLM wrapper techniques — will have their pick of roles.
If you're considering this career: learn the fundamentals. Get comfortable with data at scale. Build systems that are boring and reliable. The flashy demo doesn't matter. What matters is whether your model is still making good predictions at 3 AM on a Saturday when nobody's watching. That's what ML engineering is.