StreamInfer
Real-Time Streaming Inference Engine
Solo project. Designed the architecture, implemented all components, wrote load tests.
The Problem
Most inference servers either batch too aggressively (high latency) or not at all (wasted throughput). When you are serving predictions over WebSocket to hundreds of concurrent clients, you need something that adapts. And you need backpressure so one runaway client does not OOM the server.
How I Built It
Adaptive batching
The batcher accumulates requests and flushes on whichever comes first: batch_size reached OR timeout_ms elapsed. At high load you get full batches (good throughput), at low load you get fast individual responses. Same idea as Triton Inference Server's dynamic batcher, but in ~100 lines of Python.
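The flush-on-whichever-comes-first logic can be sketched with asyncio. This is a minimal illustration, not the project's actual code; the class and method names (`AdaptiveBatcher`, `submit`, `run`) are assumptions:

```python
import asyncio

class AdaptiveBatcher:
    """Accumulate requests; flush when batch_size is reached OR timeout_ms
    elapses, whichever comes first. Illustrative sketch only."""

    def __init__(self, predict_fn, batch_size=8, timeout_ms=10):
        self.predict_fn = predict_fn   # callable: list of inputs -> list of outputs
        self.batch_size = batch_size
        self.timeout = timeout_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; resolves when its batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Background worker: block for the first item, then fill the batch
        until it is full or the timeout window closes."""
        while True:
            item, fut = await self.queue.get()
            batch, futs = [item], [fut]
            deadline = asyncio.get_running_loop().time() + self.timeout
            while len(batch) < self.batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futs.append(fut)
            results = self.predict_fn(batch)   # one model call per batch
            for f, r in zip(futs, results):
                f.set_result(r)

async def demo():
    # Echo-style model: doubles each input. 10 concurrent submissions
    # get grouped into batches of up to 4.
    batcher = AdaptiveBatcher(lambda xs: [x * 2 for x in xs],
                              batch_size=4, timeout_ms=5)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Under load the first `queue.get()` returns immediately and batches fill to `batch_size`; when traffic is sparse, the timeout fires and a partial batch (often a single request) flushes with low latency.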
Per-client backpressure
Each WebSocket client gets a token bucket rate limiter. If a client sends faster than the configured rate, excess requests get a rate_limited response with a retry_after_ms hint. If a client's pending queue exceeds 80% capacity, they get a slow-consumer warning before potential disconnection.
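A token bucket is small enough to show in full. This sketch uses assumed names (`TokenBucket`, `allow`, `retry_after_ms`), not the project's actual API:

```python
import time

class TokenBucket:
    """Classic token bucket: holds at most `capacity` tokens, refilled
    continuously at `rate` tokens/second. Illustrative sketch only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def retry_after_ms(self) -> int:
        """Milliseconds until one token is available; this is the kind of
        value sent back as a retry_after_ms hint."""
        deficit = max(0.0, 1.0 - self.tokens)
        return int(deficit / self.rate * 1000)

# A burst up to `capacity` passes immediately; the next request is limited.
bucket = TokenBucket(rate=5.0, capacity=3)
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

One bucket per WebSocket connection gives exactly the per-client behavior described above: bursts up to `capacity` are absorbed, sustained traffic is held to `rate`, and a rejected request can be answered with `retry_after_ms` instead of being silently dropped.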
Zero-downtime model hot-swap
Load the new model in a background thread, then swap the pointer under a threading.Lock. In-flight requests finish with the old model, new requests use the new one. Trigger via SIGHUP, API call, or config reload. Simple but correct.
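The core of the swap fits in a few lines. A sketch under assumed names (`ModelHolder`, `get`, `swap`); the strings stand in for real model objects:

```python
import threading

class ModelHolder:
    """Holds the live model behind a lock so a request sees either the old
    or the new model, never a partially-loaded one. Illustrative sketch."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._model       # in-flight requests keep their reference

    def swap(self, loader):
        new_model = loader()         # load OUTSIDE the lock: this can take seconds
        with self._lock:
            self._model = new_model  # the pointer swap is the only locked step

holder = ModelHolder(model="v1")
old = holder.get()
# The reload runs in a background thread, e.g. from a SIGHUP handler:
t = threading.Thread(target=holder.swap, args=(lambda: "v2",))
t.start()
t.join()
print(old, holder.get())  # v1 v2
```

Because `get()` returns a reference, a request that grabbed the old model before the swap finishes cleanly against it; only requests arriving after the swap see the new one.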
Architecture
FastAPI Server
├── WebSocket endpoint (/ws)
│   └── JSON in → JSON out, per-client state
├── HTTP endpoint (/predict)
│   └── Single request/response (still batched internally)
├── Backpressure (per-client)
│   ├── Token bucket rate limiter
│   └── Slow consumer detection (queue > 80%)
├── Adaptive Batcher
│   ├── Flush on batch_size
│   └── Flush on timeout_ms (whichever first)
├── Model Holder
│   └── Atomic pointer swap under threading.Lock
├── Metrics (/metrics)
│   └── In-memory counters (JSON, no Prometheus dep)
└── Hot-swap triggers
    ├── SIGHUP signal
    └── POST /api/reload
Results
- 1,190 req/s with echo model (CPU benchmark, not GPU; see note below)
- p99 latency < 50ms under 100 concurrent connections
- 24 tests covering batcher, backpressure, pipeline, and HTTP endpoints
- Zero-downtime model swap via SIGHUP or API call
- Docker-ready: docker build -t streaminfer . && docker run -p 8000:8000 streaminfer
The benchmark numbers are with a trivial echo model on CPU; a real GPU model would have very different characteristics. For GPU inference you'd batch on the tensor dimension (proper padding, attention masks), not just collect requests into a Python list. The backpressure is per-client only; a full production deployment would also need global backpressure based on GPU utilization.
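Batching on the tensor dimension means padding variable-length requests to a common shape and tracking which positions are real. A plain-Python sketch of that idea (the function name `pad_batch` and the token lists are illustrative, and real code would build framework tensors instead of lists):

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad variable-length token sequences to the batch max and build an
    attention mask (1 = real token, 0 = padding). Illustrative sketch of
    tensor-dimension batching, not part of StreamInfer."""
    max_len = max(len(seq) for seq in token_id_lists)
    input_ids, attention_mask = [], []
    for seq in token_id_lists:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The mask is what lets the model ignore the padded positions; without it, a short request batched with a long one would attend to garbage.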