From ‘it worked on my notebook’ to production-ready machine learning

Bridging the gap with MLOps

Lino GALIANA

lino.galiana@insee.fr

Insee — French National Institute of Statistics and Economic Studies

2026-06-04

Contents

Introduction

Handling the distinct lifecycles of data-driven projects

From zero to hero in production

Adding experiment tracking and observability

Conclusion

1 Introduction

The “production wall”

Most data-driven projects never deliver value (1, 2, 3)
- Most data science and AI POCs fail when going to production
- How to move beyond the experimentation stage?

Defining production

Going to production: making an application live in its users' environment

Serving: deploying the application in a relevant format for its potential users
Keeping it alive: managing the lifecycle and fostering continuous improvement

Multiple dimensions: domain knowledge, organisation, infrastructure, technical tooling…

Fostering continuity

Source: ibm.com

Applying and extending software development best practices
- DataOps: building robust data pipelines
- MLOps: deploying and maintaining ML models

2 Handling the distinct lifecycles of data-driven projects

Three lifecycles to master

A robust ML pipeline requires managing three independent lifecycles (data, code, environment):
- Jupyter notebooks do not separate them properly

Why does this matter?

Models (weights and inference) are not static artefacts:
- they depend on data, code, and the environment in which they were trained

A model result is only reproducible if all three are tracked together

Failing to manage even one lifecycle leads to:
- “Works on my machine” syndrome
- Inability to roll back a broken update
- Silent regressions caused by data or dependency drift
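The idea that a result is only reproducible when data, code, and environment are tracked together can be sketched as a single fingerprint. This is a minimal illustration, not a recommended tool: the function name run_fingerprint, the commit id, and the pinned versions are all hypothetical.

```python
import hashlib
import json


def run_fingerprint(data: bytes, code_version: str, dependencies: dict[str, str]) -> str:
    """Return a short identifier that changes when data, code, or environment changes."""
    data_hash = hashlib.sha256(data).hexdigest()
    payload = json.dumps(
        {
            "data": data_hash,
            "code": code_version,
            # Sort so that dictionary ordering does not affect the fingerprint.
            "environment": sorted(dependencies.items()),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Illustrative values only: a tiny CSV, a made-up commit id, made-up pinned versions.
training_data = b"age,income\n34,2100\n51,3400\n"
baseline = run_fingerprint(training_data, "3f2a9c1", {"scikit-learn": "1.5.0"})
same_inputs = run_fingerprint(training_data, "3f2a9c1", {"scikit-learn": "1.5.0"})
upgraded_env = run_fingerprint(training_data, "3f2a9c1", {"scikit-learn": "1.6.0"})

print(baseline == same_inputs)   # True: identical inputs give the same fingerprint
print(baseline == upgraded_env)  # False: a dependency bump gives a different one
```

Storing such an identifier next to each model makes it possible to tell which of the three lifecycles moved when a result changes.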

Choosing appropriate tools

Code

Versioning (Git), improving quality with formatters (Ruff), community standard structure (cookiecutters)…

Configuration

Virtual environments and dependency management (uv), controlling external dependencies (Docker)…

Data

Standardised format (Parquet), cloud storage (S3), pipeline-oriented workflow (dbt)…

3 From zero to hero in production

“It works on my machine”

Bridging the dev/production gap

The environment gap is one of the most common sources of failure:
- Production servers may use a different OS, libraries, CUDA versions…

Containerisation (Docker) solves this by packaging the full runtime alongside the code:
- Same image runs locally, in CI, and in production
- Eliminates the “it works on my machine” class of bugs
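To make the environment gap concrete, the sketch below compares the versions pinned on a development machine with those found elsewhere. It does not replace containerisation; it only shows the kind of mismatch that a shared image removes. The helper names and all version numbers are hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Look up installed versions; packages that are absent map to None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_gaps(expected, actual):
    """Return {package: (expected, actual)} for every mismatch."""
    return {
        name: (wanted, actual.get(name))
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }


# Hypothetical pins recorded on the development machine.
dev_pins = {"numpy": "2.1.0", "scikit-learn": "1.5.0"}
# Hypothetical versions found on a production server.
prod_versions = {"numpy": "2.1.0", "scikit-learn": "1.4.2"}

print(environment_gaps(dev_pins, prod_versions))
# {'scikit-learn': ('1.5.0', '1.4.2')}
```

In practice, installed_versions(dev_pins) could be called on the target machine to obtain the second dictionary.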

An industrialised project

Kubernetes turns individual containers into an industrialised, scalable fleet

MLOps on Kubernetes

Training: launch parallel training jobs (e.g. cross-validation)
Model serving: expose a versioned model behind a stable endpoint (API)
Canary deployments: route a fraction of traffic to a new model version before full rollout
Rollbacks: switch back to a previous version if performance degrades
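The canary and rollback logic above can be sketched independently of Kubernetes. This is an illustration under stated assumptions: the function names, the doubling schedule, and the 0.01 tolerance are hypothetical choices, and a real cluster would delegate routing to its ingress or service mesh.

```python
import hashlib


def route(request_id: str, canary_fraction: float) -> str:
    """Deterministically send a fixed share of requests to the canary version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "canary" if bucket < canary_fraction * 2**64 else "stable"


def next_fraction(current: float, canary_error_rate: float,
                  stable_error_rate: float, tolerance: float = 0.01) -> float:
    """Roll back to 0 if the canary is clearly worse, otherwise double its share."""
    if canary_error_rate > stable_error_rate + tolerance:
        return 0.0
    return min(1.0, current * 2)


requests = [f"user-{i}" for i in range(10_000)]
share = sum(route(r, 0.10) == "canary" for r in requests) / len(requests)
print(f"canary share: {share:.3f}")  # close to 0.10; exact value depends on the hashes

print(next_fraction(0.10, canary_error_rate=0.08, stable_error_rate=0.02))  # 0.0
```

Hashing the request identifier keeps a given user on the same version across calls, which makes comparisons between the two versions easier to interpret.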

Note

This is a great improvement, but it is only the first phase of a project: continuous improvement requires observability

4 Adding experiment tracking and observability

Experimentation phase

During development, practitioners need to:

Track every experiment: hyperparameters, metrics, artefacts
Compare runs objectively
Select and register the best model version

MLflow (and similar platforms!) provides a centralised tracking server, model registry, and serving API
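The three needs listed above (track, compare, register) can be sketched with a small in-memory stand-in. The Tracker class below is hypothetical and is not MLflow's API; it only mirrors the workflow that such a platform centralises. Hyperparameters and scores are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class Tracker:
    """In-memory stand-in for a tracking server (hypothetical, not MLflow's API)."""

    def __init__(self):
        self.runs = []
        self.registry = {}  # model name -> registered Run

    def log_run(self, name, params, metrics):
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        candidates = [r for r in self.runs if metric in r.metrics]
        chooser = max if higher_is_better else min
        return chooser(candidates, key=lambda r: r.metrics[metric])

    def register(self, model_name, run):
        self.registry[model_name] = run


tracker = Tracker()
# Made-up hyperparameters and scores, for illustration only.
tracker.log_run("run-a", {"max_depth": 3}, {"f1": 0.81})
tracker.log_run("run-b", {"max_depth": 6}, {"f1": 0.86})
tracker.log_run("run-c", {"max_depth": 9}, {"f1": 0.84})

best = tracker.best_run("f1")
tracker.register("classifier", best)
print(best.name, best.params)  # run-b {'max_depth': 6}
```

A real tracking server adds persistence, artefact storage, and a shared UI, which is what makes comparisons objective across a team.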

From experimentation to production observability

The same platforms extend into production monitoring:
- Log real-world inputs and outputs
- Compute performance metrics against ground truth or human feedback
- Detect data drift and concept drift
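As one possible way to detect data drift on a numeric feature, the sketch below computes a Population Stability Index (PSI) between a reference sample and production values. The binning scheme and the eps floor are assumptions of this example, and the 0.25 threshold is only a commonly quoted rule of thumb, not a standard.

```python
import math


def psi(reference, production, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so that production values outside the reference range
            # fall into the first or last bin.
            idx = min(n_bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # Floor empty bins at eps to keep the logarithm defined.
        return [max(c / len(values), eps) for c in counts]

    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


reference = [i / 100 for i in range(1000)]  # values spread over [0, 10)
same = list(reference)
shifted = [v + 3.0 for v in reference]      # the whole distribution moved up

print(psi(reference, same))            # 0.0
print(psi(reference, shifted) > 0.25)  # True
```

Concept drift cannot be seen this way, since it concerns the relation between inputs and targets; it requires comparing predictions with ground truth or feedback.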

LLM-based systems need additional tracking. Langfuse adds:

Trace-level observability (prompt → retrieval → generation)
Cost and latency tracking
Human annotation workflows
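Trace-level observability with cost and latency tracking can be pictured with the data structure below. It is a hypothetical sketch and not Langfuse's SDK: the Span and Trace classes, the token counts, and the prices per 1,000 tokens are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str            # e.g. "retrieval" or "generation"
    latency_s: float
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class Trace:
    request_id: str
    spans: list = field(default_factory=list)

    def add(self, span):
        self.spans.append(span)

    def total_latency(self):
        return sum(s.latency_s for s in self.spans)

    def cost(self, price_in, price_out):
        """Cost given hypothetical prices per 1,000 input and output tokens."""
        tokens_in = sum(s.input_tokens for s in self.spans)
        tokens_out = sum(s.output_tokens for s in self.spans)
        return (tokens_in * price_in + tokens_out * price_out) / 1000


trace = Trace("req-001")
trace.add(Span("prompt", latency_s=0.25))
trace.add(Span("retrieval", latency_s=0.5))
trace.add(Span("generation", latency_s=1.25, input_tokens=1200, output_tokens=300))

print(trace.total_latency())                    # 2.0
print(trace.cost(price_in=0.5, price_out=1.5))  # 1.05
```

Keeping one span per step shows where time and money are spent within a single request, which an aggregate metric would hide.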

Feedback loops and continuous improvement

Monitoring is not optional: a model that worked at launch will degrade as the world changes
Feedback loops close the gap between offline evaluation and real-world performance
Good observability turns every production incident into a training signal

Two paradigms, two sets of operational constraints

	Supervised ML	LLM-based systems
Training	Full retraining cycle	Fine-tuning or prompt engineering
Evaluation	Standard metrics (F1, RMSE…)	Requires LLM-as-judge or human review (see 4)
Drift	Feature / label drift	Prompt drift, outdated knowledge base
Availability	Batch or on the fly?	Continuous
Infrastructure	CPU often sufficient for inference	Ever-larger GPUs (💵💵)

Annotation and evaluation challenges

Supervised learning:

Ground truth is (relatively) well-defined
Evaluation is largely automated

LLMs:

What is the “correct” output? Often ambiguous
Human evaluation is expensive and hard to scale
Automated evaluation (LLM-as-judge) introduces its own biases
Prompt changes can silently break previously passing evaluations
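The last point, prompt changes silently breaking evaluations, suggests treating prompts like code and running a regression suite on every change. The sketch below assumes a stand-in model (fake_llm) and exact-match scoring, both hypothetical simplifications; a real setup would call the actual model and likely use a softer metric.

```python
def fake_llm(prompt_template, question):
    """Stand-in for a model call (hypothetical): behaviour depends on the prompt."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    answer = answers.get(question, "unknown")
    # Pretend that a "verbose" prompt makes the model wrap its answer in a sentence.
    if "verbose" in prompt_template:
        return f"The answer is {answer}."
    return answer


def evaluate(prompt_template, cases):
    """Return {question: passed} using exact match against the expected answer."""
    return {q: fake_llm(prompt_template, q) == expected for q, expected in cases}


def regressions(baseline, candidate):
    """Questions that passed with the baseline prompt but fail with the candidate."""
    return sorted(q for q, ok in baseline.items() if ok and not candidate.get(q, False))


cases = [("capital of France?", "Paris"), ("2 + 2?", "4")]
baseline = evaluate("Answer concisely.", cases)
candidate = evaluate("Answer in a verbose, friendly tone.", cases)

print(regressions(baseline, candidate))  # ['2 + 2?', 'capital of France?']
```

Here the answers are still correct but the exact-match check fails, which also illustrates why the choice of evaluation metric matters as much as the prompt.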

Warning

Is it really possible to leapfrog after having missed the ML era?

5 Conclusion

Key takeaways

Structure your project around three independent lifecycles: data, code, environment
Track everything: experiments, models, prompts — if it’s not tracked, it didn’t happen
Monitor in production: evaluation does not stop at deployment
Know your paradigm: supervised ML and LLM-based systems require different tooling and processes

The gap between a notebook that works and a system that delivers value is not only a technical gap: it is an operational one.