Training a machine learning model is only half the battle. The real engineering challenge begins when you need to serve that model in production — handling thousands of requests per second with consistent, low-latency responses. At Statotech, we've deployed ML models across several of our products, and here's what we've learned about making inference fast, reliable, and maintainable.
The Gap Between Training and Serving
Most ML tutorials end with a Jupyter notebook printing an accuracy score. Production is nothing like that. In production, your model needs to:
- Respond in under 100 milliseconds for real-time applications
- Handle concurrent requests without degrading performance
- Gracefully recover from failures without dropping requests
- Support model versioning so you can roll back bad deployments
- Scale horizontally as traffic grows
These requirements demand a fundamentally different architecture from what you use during development.
Choosing a Model Serving Strategy
There are three main approaches to serving ML models, each with distinct trade-offs:
1. Embedded Serving
The simplest approach — load the model directly into your application process. Works well for small models (under 100 MB) with low traffic. We use this for our document classification microservice, where a lightweight scikit-learn model runs inside a FastAPI endpoint. The downside: scaling the app means duplicating the model in memory across every instance.
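The embedded pattern boils down to one rule: load the artifact once at process startup, never per request. Here's a minimal sketch with a stubbed stand-in for a real scikit-learn model (the class, path, and labels are illustrative, not our actual service code):

```python
import os
import pickle
import tempfile

# Stand-in for a trained scikit-learn classifier; in a real service this
# artifact comes from the training pipeline.
class StubClassifier:
    def predict(self, docs):
        return ["invoice" if "total" in d.lower() else "other" for d in docs]

# Persist once (normally done at training time)...
_path = os.path.join(tempfile.gettempdir(), "stub_classifier.pkl")
with open(_path, "wb") as f:
    pickle.dump(StubClassifier(), f)

# ...then load ONCE at process startup. In a FastAPI app this sits at
# module level (or in a startup hook), so every request reuses one copy.
with open(_path, "rb") as f:
    MODEL = pickle.load(f)

def classify(doc: str) -> str:
    # Request handler body: no disk I/O, no model re-load per call.
    return MODEL.predict([doc])[0]
```

The trade-off mentioned above follows directly from this: `MODEL` lives in every worker process, so N app instances means N copies in memory.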
2. Model Server (TensorFlow Serving / Triton)
Dedicated model serving infrastructure. TensorFlow Serving or NVIDIA Triton run as separate services optimised specifically for inference. They handle batching, GPU memory management, and model versioning out of the box. This is our default choice for any model that needs to serve more than a few hundred requests per minute.
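With a dedicated model server, your application becomes a thin client. TensorFlow Serving's REST API takes a POST to `/v1/models/<name>:predict` with an `"instances"` list; here's a sketch of building such a request (the model name and feature values are made up, and the actual network call is commented out since it needs a running server):

```python
import json

# Hypothetical model name; TF Serving's REST predict endpoint is
# POST /v1/models/<name>:predict with an "instances" list in the body.
MODEL_NAME = "doc_verifier"
url = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

def build_predict_request(feature_rows):
    # Each row becomes one instance in the batch sent to the server.
    return json.dumps({"instances": feature_rows}).encode("utf-8")

body = build_predict_request([[0.1, 0.4, 0.9], [0.7, 0.2, 0.3]])

# To actually send it (requires a running TF Serving instance):
# import urllib.request
# req = urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})
# predictions = json.loads(urllib.request.urlopen(req).read())["predictions"]
```

Because the server owns batching and versioning, rolling out a new model is a filesystem operation on the serving side, with no client redeploy.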
3. Edge Inference
For our Health Facility Management System, clinics in rural Zimbabwe can't always rely on stable internet. We convert models to TensorFlow Lite and run inference on-device. Accuracy takes a small hit from quantisation, but the system works completely offline — which is the only option that matters in the field.
Optimising Latency
Raw model inference is often fast enough. The bottleneck is usually everything around it. Here are the techniques that made the biggest difference for us:
Input preprocessing pipelines
Move preprocessing out of the request path. If your model expects normalised 224x224 images, don't resize and normalise on every request. Pre-compute what you can, cache intermediate representations, and use vectorised operations (NumPy/OpenCV) instead of Python loops. We reduced our document verification preprocessing from 45ms to 8ms by switching from PIL to OpenCV and pre-allocating buffers.
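To make the vectorised-vs-loops point concrete, here's a sketch of normalising a batch of images into a pre-allocated float32 buffer in one NumPy pass (the normalisation constants are standard ImageNet-style values used for illustration, not necessarily what any given model expects):

```python
import numpy as np

# Illustrative ImageNet-style per-channel normalisation constants.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Pre-allocate the output buffer once; reuse it across requests to
# avoid per-request allocation churn.
BATCH, H, W, C = 8, 224, 224, 3
_buffer = np.empty((BATCH, H, W, C), dtype=np.float32)

def preprocess(images: np.ndarray, out: np.ndarray = _buffer) -> np.ndarray:
    # images: uint8, shape (N, 224, 224, 3) with N <= BATCH.
    # One vectorised pass: scale to [0, 1], subtract channel means,
    # divide by channel stds. No Python-level pixel loops.
    n = images.shape[0]
    np.divide(images, 255.0, out=out[:n])
    out[:n] -= MEAN          # broadcasts over the channel axis
    out[:n] /= STD
    return out[:n]
```

The shared default buffer is deliberate: it trades a little flexibility (fixed max batch, not thread-safe without care) for zero allocations on the hot path.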
Request batching
GPUs are throughput machines — they're inefficient when processing one input at a time. TensorFlow Serving's built-in batching collects individual requests into batches before running inference. We configure a maximum batch size of 32 with a 10ms batching window. Individual request latency increases slightly, but overall throughput jumps 4-5x.
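The same size-or-deadline batching idea can be sketched in application code. This is a simplified stand-in for what TF Serving does server-side; the parameters mirror the config described above, but the class itself is illustrative:

```python
import asyncio

MAX_BATCH = 32      # cap on batch size
WINDOW_S = 0.010    # 10 ms batching window

class MicroBatcher:
    """Collects concurrent predict() calls into batches for one infer call."""

    def __init__(self, infer_fn):
        self.infer_fn = infer_fn      # runs inference on a list of inputs
        self.queue = asyncio.Queue()

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first request, then hold the window open.
            batch = [await self.queue.get()]
            deadline = loop.time() + WINDOW_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One inference call for the whole batch, then fan results out.
            inputs = [item[0] for item in batch]
            for (_, fut), out in zip(batch, self.infer_fn(inputs)):
                fut.set_result(out)

    async def predict(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut
```

Spawn `run()` as a background task; concurrent callers simply `await predict(x)` and pay at most one batching window of extra latency.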
Model quantisation
Converting model weights from 32-bit floats to 8-bit integers (INT8 quantisation) reduces model size by 4x and inference time by 2-3x on CPU. For our document verification model, post-training quantisation dropped inference from 35ms to 12ms with negligible accuracy loss (less than 0.3% F1 drop).
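The arithmetic behind this is a simple affine mapping. Here's a minimal NumPy sketch of 8-bit quantisation on a toy weight matrix — illustrative of the idea, not the TFLite converter itself (shown with an unsigned 8-bit range; the signed INT8 case differs only in the offset):

```python
import numpy as np

def quantize_8bit(weights: np.ndarray):
    # Affine map float32 -> 8-bit: w ~= scale * (q - zero_point).
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(weights / scale + zero_point), 0, 255)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)   # toy weight matrix
q, scale, zp = quantize_8bit(w)

# 4x smaller: float32 (4 bytes/weight) -> 1 byte/weight,
# plus a tiny per-tensor scale and zero point.
assert q.nbytes == w.nbytes // 4
max_err = np.abs(dequantize(q, scale, zp) - w).max()
```

Per-weight error is bounded by half the scale, which is why accuracy loss stays small when the weight range is well-behaved; outliers stretch the scale and hurt, which is where per-channel quantisation helps.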
Automated Retraining Pipelines
Models degrade over time as the real world drifts from your training data. We've built automated retraining pipelines that:
- Monitor prediction confidence — when average confidence drops below a threshold over a rolling window, the pipeline triggers.
- Collect new training data — flagged predictions are reviewed by our team and added to the training set.
- Retrain and evaluate — the pipeline trains a new model version and runs it against a held-out test set. If performance meets the bar, it proceeds.
- Canary deployment — the new model receives 10% of traffic initially. If error rates stay stable for 24 hours, it gradually takes over.
- Rollback — if the canary shows degraded performance, traffic automatically routes back to the previous version.
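The confidence-based trigger in the first step can be sketched as a rolling-window check (the window size and threshold here are illustrative, not our production values):

```python
from collections import deque

class DriftMonitor:
    """Fires when mean prediction confidence over a rolling window drops."""

    def __init__(self, window_size: int = 1000, threshold: float = 0.85):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, confidence: float) -> bool:
        # Returns True when the retraining pipeline should be triggered.
        self.window.append(confidence)
        if len(self.window) < self.window.maxlen:
            return False   # not enough signal yet
        return sum(self.window) / len(self.window) < self.threshold
```

A rolling mean is deliberately crude: it smooths over single bad predictions but still reacts within one window of a genuine distribution shift.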
This entire pipeline runs on scheduled jobs with human review at the data labelling stage. Full automation without human oversight is tempting, but we've learned that a human in the loop at the data stage catches edge cases that metrics miss.
Infrastructure Considerations for Africa
Deploying ML infrastructure in African markets adds constraints that most ML engineering blog posts ignore:
- Bandwidth costs — Cloud egress charges add up fast when serving large model responses. We aggressively compress outputs and use regional CDN caching.
- Latency to cloud regions — The nearest major cloud regions to Zimbabwe are in South Africa or Europe. We use edge caching and lightweight proxy servers to minimise round trips.
- Offline-first design — Many of our users operate in areas with intermittent connectivity. Our mobile applications cache model weights locally and sync results when connectivity returns.
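On the bandwidth point: even plain gzip on JSON responses cuts egress substantially, because batched model outputs are highly repetitive. A quick stdlib sketch with a synthetic payload:

```python
import gzip
import json

# Hypothetical batch of model outputs -- repeated keys compress very well.
payload = json.dumps(
    [{"label": "approved", "confidence": 0.97} for _ in range(500)]
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(payload) / len(compressed)
# Served with "Content-Encoding: gzip", clients decompress transparently.
```

In practice this is a one-line middleware or load-balancer setting, but it's worth verifying it's actually on: uncompressed JSON egress is an easy bill to run up.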
Key Takeaways
If you're deploying ML models in production, especially in emerging markets:
- Start with the simplest serving approach that meets your latency and throughput requirements
- Invest in preprocessing optimisation before reaching for bigger hardware
- Build retraining pipelines early — model drift is inevitable
- Design for offline-first if your users have unreliable connectivity
- Quantise aggressively — the accuracy trade-off is almost always worth it
We'll be sharing more technical deep dives as we build out our ML infrastructure alongside our new partnership with Strateji. If you're working on similar challenges, reach out — we're always keen to exchange notes.
— Ebenezer Tarubinga, AI/ML Engineer, Statotech Systems