Solo project. Built the checks, CLI, GitHub Action, and example integration.
The Problem
At a previous role, a model regression went unnoticed for three days because there was no automated check between training and deployment. The model was serving worse predictions, and nobody knew until a customer complained. I built MLGuard so that never happens again.
How I Built It
Data drift via PSI
Population Stability Index (PSI) on each numeric feature. PSI compares bin proportions between the reference and current distributions — if PSI > 0.2, something changed significantly. Simple, well-understood metric from credit risk that works surprisingly well for ML feature monitoring.
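The PSI computation is small enough to write directly in numpy. A minimal sketch (function name and bin count are illustrative, not MLGuard's actual API): bin edges come from the reference sample, and proportions are clipped so an empty bin never produces log(0).

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D numeric samples.

    Bin edges are derived from the reference distribution; note that
    np.histogram drops current values falling outside those edges.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) / division by zero on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

Two samples from the same distribution score well under 0.1; a one-standard-deviation mean shift pushes PSI far past the 0.2 failure threshold.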
Performance regression
Load the model, run predictions on a holdout set, compare accuracy/F1 (classification) or RMSE (regression) against a saved baseline JSON file. If the metric drops more than 10%, the check fails. Baselines are stored as JSON in the repo — version controlled, auditable, no database.
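The baseline comparison reduces to a small function. A sketch under assumed names (the real MLGuard internals may differ): metrics where higher is better (accuracy, F1) fail on a drop, while RMSE inverts the comparison because lower is better.

```python
import json

def check_regression(metrics, baseline_path, warn_drop=0.05, fail_drop=0.10):
    """Compare current metrics against a baseline JSON file.

    Returns "PASS", "WARN" (>= 5% relative drop), or "FAIL" (>= 10%).
    """
    with open(baseline_path) as f:
        baseline = json.load(f)
    verdict = "PASS"
    for name, base in baseline.items():
        current = metrics[name]
        if name == "rmse":                      # lower is better
            drop = (current - base) / base
        else:                                   # higher is better
            drop = (base - current) / base
        if drop >= fail_drop:
            return "FAIL"
        if drop >= warn_drop:
            verdict = "WARN"
    return verdict
```

Because the baseline is a plain JSON file in the repo, updating it is an ordinary reviewed commit rather than a database write.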
Latency regression
Time N single-sample predictions (after a warm-up phase), compute p50/p95/p99, compare p95 against baseline. A 30%+ increase triggers a failure. Single-sample timing because that is the production pattern for real-time serving.
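The timing loop looks roughly like this (a sketch with assumed parameter names, using `time.perf_counter` for monotonic high-resolution timing):

```python
import time
import numpy as np

def latency_check(predict, sample, n=200, warmup=5,
                  baseline_p95_ms=None, warn=0.15, fail=0.30):
    """Time n single-sample predictions after a warm-up phase.

    Returns p50/p95/p99 in milliseconds plus a verdict against the
    baseline p95: WARN at a 15% increase, FAIL at 30%.
    """
    for _ in range(warmup):        # warm caches, JIT, lazy loading
        predict(sample)
    times_ms = []
    for _ in range(n):
        start = time.perf_counter()
        predict(sample)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
    verdict = "PASS"
    if baseline_p95_ms is not None:
        increase = (p95 - baseline_p95_ms) / baseline_p95_ms
        if increase >= fail:
            verdict = "FAIL"
        elif increase >= warn:
            verdict = "WARN"
    return {"p50": p50, "p95": p95, "p99": p99, "verdict": verdict}
```

Comparing p95 rather than the mean keeps the check sensitive to tail latency, which is usually what violates an SLO first.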
GitHub Action integration
Ships as a composite GitHub Action. Add 5 lines of YAML to your CI workflow and every PR that touches model files gets automatically checked. The action installs MLGuard, runs all 3 checks, and uploads a Markdown report as an artifact.
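A workflow step might look like the following sketch. The action path and input names here are purely illustrative, not MLGuard's published interface:

```yaml
# Hypothetical step in .github/workflows/ci.yml
- name: Run MLGuard checks
  uses: your-org/mlguard-action@v1   # placeholder action reference
  with:
    model: models/model.pkl
    baseline: baselines/metrics.json
```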
Architecture
CLI (Typer) / GitHub Action
├── Drift Check
│   └── PSI per numeric feature
│       └── PASS (< 0.1) / WARN (0.1–0.2) / FAIL (> 0.2)
├── Regression Check
│   ├── Classification: accuracy + F1 vs baseline
│   ├── Regression: RMSE vs baseline
│   └── PASS / WARN (5%+ drop) / FAIL (10%+ drop)
├── Latency Check
│   ├── Warm-up (5 calls)
│   ├── Timing (N single-sample predictions)
│   ├── p95 vs baseline
│   └── PASS / WARN (15%+) / FAIL (30%+)
├── Verdict
│   ├── Any FAIL → block deploy
│   ├── Any WARN → warning, allow deploy
│   └── All PASS → green light
└── Report → Markdown file
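The verdict step is a simple reduction over per-check statuses; a sketch of the aggregation logic described above (function and key names are illustrative):

```python
def aggregate(check_results):
    """Reduce per-check statuses to one deploy verdict.

    Any FAIL blocks the deploy; any WARN allows it with a warning;
    otherwise everything passed.
    """
    statuses = set(check_results.values())
    if "FAIL" in statuses:
        return "FAIL"
    if "WARN" in statuses:
        return "WARN"
    return "PASS"
```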
Results
- 3 focused checks — drift, regression, latency — with clear thresholds
- PSI from scratch in 10 lines of numpy (no Evidently dependency)
- 17 tests covering drift calculation, regression logic, and verdict aggregation
- GitHub Action ready for CI integration
- Baselines as JSON files — version controlled alongside the model
I deliberately kept this to exactly 3 checks. The temptation is always to add more — fairness metrics, feature importance drift, prediction distribution analysis — but each additional check is another thing that can false-positive and block deploys that should go through. Start with the 3 that matter most and add more only when you have evidence they would have caught a real incident.