Solo project. Built the checks, CLI, GitHub Action, and example integration.
The Problem
At a previous role, a model regression went unnoticed for three days because there was no automated check between training and deployment. The model was serving worse predictions, and nobody knew until a customer complained. I built MLGuard so that never happens again.
How I Built It
Data drift via PSI
Population Stability Index (PSI) on each numeric feature. PSI compares bin proportions between the reference and current distributions — if PSI > 0.2, something changed significantly. Simple, well-understood metric from credit risk that works surprisingly well for ML feature monitoring.
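The PSI computation is small enough to write directly in numpy. A minimal sketch (function name and bin count are illustrative, not MLGuard's actual API): bin edges come from the reference sample, and proportions are clipped so an empty bin never produces log(0).

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D numeric samples.

    Bin edges are derived from the reference distribution; note that
    np.histogram drops current values falling outside those edges.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) / division by zero on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

Two samples from the same distribution score well under 0.1; a one-standard-deviation mean shift pushes PSI far past the 0.2 failure threshold.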
Performance regression
Load the model, run predictions on a holdout set, compare accuracy/F1 (classification) or RMSE (regression) against a saved baseline JSON file. If the metric drops more than 10%, the check fails. Baselines are stored as JSON in the repo — version controlled, auditable, no database.
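The baseline comparison reduces to a small function. A sketch under assumed names (the real MLGuard internals may differ): metrics where higher is better (accuracy, F1) fail on a drop, while RMSE inverts the comparison because lower is better.

```python
import json

def check_regression(metrics, baseline_path, warn_drop=0.05, fail_drop=0.10):
    """Compare current metrics against a baseline JSON file.

    Returns "PASS", "WARN" (>= 5% relative drop), or "FAIL" (>= 10%).
    """
    with open(baseline_path) as f:
        baseline = json.load(f)
    verdict = "PASS"
    for name, base in baseline.items():
        current = metrics[name]
        if name == "rmse":                      # lower is better
            drop = (current - base) / base
        else:                                   # higher is better
            drop = (base - current) / base
        if drop >= fail_drop:
            return "FAIL"
        if drop >= warn_drop:
            verdict = "WARN"
    return verdict
```

Because the baseline is a plain JSON file in the repo, updating it is an ordinary reviewed commit rather than a database write.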
Latency regression
Time N single-sample predictions (after a warm-up phase), compute p50/p95/p99, compare p95 against baseline. A 30%+ increase triggers a failure. Single-sample timing because that is the production pattern for real-time serving.
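The timing loop looks roughly like this (a sketch with assumed parameter names, using `time.perf_counter` for monotonic high-resolution timing):

```python
import time
import numpy as np

def latency_check(predict, sample, n=200, warmup=5,
                  baseline_p95_ms=None, warn=0.15, fail=0.30):
    """Time n single-sample predictions after a warm-up phase.

    Returns p50/p95/p99 in milliseconds plus a verdict against the
    baseline p95: WARN at a 15% increase, FAIL at 30%.
    """
    for _ in range(warmup):        # warm caches, JIT, lazy loading
        predict(sample)
    times_ms = []
    for _ in range(n):
        start = time.perf_counter()
        predict(sample)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
    verdict = "PASS"
    if baseline_p95_ms is not None:
        increase = (p95 - baseline_p95_ms) / baseline_p95_ms
        if increase >= fail:
            verdict = "FAIL"
        elif increase >= warn:
            verdict = "WARN"
    return {"p50": p50, "p95": p95, "p99": p99, "verdict": verdict}
```

Comparing p95 rather than the mean keeps the check sensitive to tail latency, which is usually what violates an SLO first.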
GitHub Action integration
Ships as a composite GitHub Action. Add 5 lines of YAML to your CI workflow and every PR that touches model files gets automatically checked. The action installs MLGuard, runs all 3 checks, and uploads a Markdown report as an artifact.
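A workflow step might look like the following sketch. The action path and input names here are purely illustrative, not MLGuard's published interface:

```yaml
# Hypothetical step in .github/workflows/ci.yml
- name: Run MLGuard checks
  uses: your-org/mlguard-action@v1   # placeholder action reference
  with:
    model: models/model.pkl
    baseline: baselines/metrics.json
```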
Architecture
CLI (Typer) / GitHub Action
├── Drift Check
│   └── PSI per numeric feature
│       └── PASS (< 0.1) / WARN (0.1–0.2) / FAIL (> 0.2)
├── Regression Check
│   ├── Classification: accuracy + F1 vs baseline
│   ├── Regression: RMSE vs baseline
│   └── PASS / WARN (5%+ drop) / FAIL (10%+ drop)
├── Latency Check
│   ├── Warm-up (5 calls)
│   ├── Timing (N single-sample predictions)
│   ├── p95 vs baseline
│   └── PASS / WARN (15%+) / FAIL (30%+)
├── Verdict
│   ├── Any FAIL → block deploy
│   ├── Any WARN → warning, allow deploy
│   └── All PASS → green light
└── Report → Markdown file
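The verdict step is a simple reduction over per-check statuses; a sketch of the aggregation logic described above (function and key names are illustrative):

```python
def aggregate(check_results):
    """Reduce per-check statuses to one deploy verdict.

    Any FAIL blocks the deploy; any WARN allows it with a warning;
    otherwise everything passed.
    """
    statuses = set(check_results.values())
    if "FAIL" in statuses:
        return "FAIL"
    if "WARN" in statuses:
        return "WARN"
    return "PASS"
```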
Results
- 3 focused checks — drift, regression, latency — with clear thresholds
- PSI from scratch in 10 lines of numpy (no Evidently dependency)
- 17 tests covering drift calculation, regression logic, and verdict aggregation
- GitHub Action ready for CI integration
- Baselines as JSON files — version controlled alongside the model
I deliberately kept this to exactly 3 checks. The temptation is always to add more — fairness metrics, feature importance drift, prediction distribution analysis — but each additional check is another thing that can false-positive and block deploys that should go through. Start with the 3 that matter most and add more only when you have evidence they would have caught a real incident.