StreamInfer
Real-Time Streaming Inference Engine
Solo project. Designed the architecture, implemented all components, wrote load tests.
The Problem
Most inference servers either batch too aggressively (high latency) or not at all (wasted throughput). When you are serving predictions over WebSocket to hundreds of concurrent clients, you need something that adapts. And you need backpressure so one runaway client does not OOM the server.
How I Built It
Adaptive batching
The batcher accumulates requests and flushes on whichever comes first: batch_size reached OR timeout_ms elapsed. At high load you get full batches (good throughput), at low load you get fast individual responses. Same idea as Triton Inference Server's dynamic batcher, but in ~100 lines of Python.
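The flush-on-whichever-comes-first logic can be sketched with asyncio. This is a minimal illustration, not the project's actual code; the class and method names (`AdaptiveBatcher`, `submit`, `run`) are assumptions:

```python
import asyncio

class AdaptiveBatcher:
    """Accumulate requests; flush when batch_size is reached OR timeout_ms
    elapses, whichever comes first. Illustrative sketch only."""

    def __init__(self, predict_fn, batch_size=8, timeout_ms=10):
        self.predict_fn = predict_fn   # callable: list of inputs -> list of outputs
        self.batch_size = batch_size
        self.timeout = timeout_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; resolves when its batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Background worker: block for the first item, then fill the batch
        until it is full or the timeout window closes."""
        while True:
            item, fut = await self.queue.get()
            batch, futs = [item], [fut]
            deadline = asyncio.get_running_loop().time() + self.timeout
            while len(batch) < self.batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futs.append(fut)
            results = self.predict_fn(batch)   # one model call per batch
            for f, r in zip(futs, results):
                f.set_result(r)

async def demo():
    # Echo-style model: doubles each input. 10 concurrent submissions
    # get grouped into batches of up to 4.
    batcher = AdaptiveBatcher(lambda xs: [x * 2 for x in xs],
                              batch_size=4, timeout_ms=5)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Under load the first `queue.get()` returns immediately and batches fill to `batch_size`; when traffic is sparse, the timeout fires and a partial batch (often a single request) flushes with low latency.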
Per-client backpressure
Each WebSocket client gets a token bucket rate limiter. If a client sends faster than the configured rate, excess requests get a rate_limited response with a retry_after_ms hint. If a client's pending queue exceeds 80% capacity, they get a slow-consumer warning before potential disconnection.
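A token bucket is small enough to show in full. This sketch uses assumed names (`TokenBucket`, `allow`, `retry_after_ms`), not the project's actual API:

```python
import time

class TokenBucket:
    """Classic token bucket: holds at most `capacity` tokens, refilled
    continuously at `rate` tokens/second. Illustrative sketch only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def retry_after_ms(self) -> int:
        """Milliseconds until one token is available; this is the kind of
        value sent back as a retry_after_ms hint."""
        deficit = max(0.0, 1.0 - self.tokens)
        return int(deficit / self.rate * 1000)

# A burst up to `capacity` passes immediately; the next request is limited.
bucket = TokenBucket(rate=5.0, capacity=3)
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

One bucket per WebSocket connection gives exactly the per-client behavior described above: bursts up to `capacity` are absorbed, sustained traffic is held to `rate`, and a rejected request can be answered with `retry_after_ms` instead of being silently dropped.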
Zero-downtime model hot-swap
Load the new model in a background thread, then swap the pointer under a threading.Lock. In-flight requests finish with the old model, new requests use the new one. Trigger via SIGHUP, API call, or config reload. Simple but correct.
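The core of the swap fits in a few lines. A sketch under assumed names (`ModelHolder`, `get`, `swap`); the strings stand in for real model objects:

```python
import threading

class ModelHolder:
    """Holds the live model behind a lock so a request sees either the old
    or the new model, never a partially-loaded one. Illustrative sketch."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._model       # in-flight requests keep their reference

    def swap(self, loader):
        new_model = loader()         # load OUTSIDE the lock: this can take seconds
        with self._lock:
            self._model = new_model  # the pointer swap is the only locked step

holder = ModelHolder(model="v1")
old = holder.get()
# The reload runs in a background thread, e.g. from a SIGHUP handler:
t = threading.Thread(target=holder.swap, args=(lambda: "v2",))
t.start()
t.join()
print(old, holder.get())  # v1 v2
```

Because `get()` returns a reference, a request that grabbed the old model before the swap finishes cleanly against it; only requests arriving after the swap see the new one.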
Architecture
FastAPI Server
├── WebSocket endpoint (/ws)
│   └── JSON in → JSON out, per-client state
├── HTTP endpoint (/predict)
│   └── Single request/response (still batched internally)
├── Backpressure (per-client)
│   ├── Token bucket rate limiter
│   └── Slow consumer detection (queue > 80%)
├── Adaptive Batcher
│   ├── Flush on batch_size
│   └── Flush on timeout_ms (whichever first)
├── Model Holder
│   └── Atomic pointer swap under threading.Lock
├── Metrics (/metrics)
│   └── In-memory counters (JSON, no Prometheus dep)
└── Hot-swap triggers
    ├── SIGHUP signal
    └── POST /api/reload
Results
- 1,190 req/s with echo model (CPU benchmark, not GPU; see note below)
- p99 latency < 50ms under 100 concurrent connections
- 24 tests covering batcher, backpressure, pipeline, and HTTP endpoints
- Zero-downtime model swap via SIGHUP or API call
- Docker-ready: docker build -t streaminfer . && docker run -p 8000:8000 streaminfer
The benchmark numbers are with a trivial echo model on CPU; a real GPU model would have very different characteristics. For GPU inference you'd batch on the tensor dimension (proper padding, attention masks), not just collect requests into a Python list. The backpressure is per-client only; a full production deployment would also need global backpressure based on GPU utilization.
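Batching on the tensor dimension means padding variable-length requests to a common shape and tracking which positions are real. A plain-Python sketch of that idea (the function name `pad_batch` and the token lists are illustrative, and real code would build framework tensors instead of lists):

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad variable-length token sequences to the batch max and build an
    attention mask (1 = real token, 0 = padding). Illustrative sketch of
    tensor-dimension batching, not part of StreamInfer."""
    max_len = max(len(seq) for seq in token_id_lists)
    input_ids, attention_mask = [], []
    for seq in token_id_lists:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The mask is what lets the model ignore the padded positions; without it, a short request batched with a long one would attend to garbage.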