Training a machine learning model is only half the battle. The real engineering challenge begins when you need to serve that model in production — handling thousands of requests per second with consistent, low-latency responses. At Statotech, we've deployed ML models across several of our products, and here's what we've learned about making inference fast, reliable, and maintainable.
The Gap Between Training and Serving
Most ML tutorials end with a Jupyter notebook printing an accuracy score. Production is nothing like that. In production, your model needs to:
- Respond in under 100 milliseconds for real-time applications
- Handle concurrent requests without degrading performance
- Gracefully recover from failures without dropping requests
- Support model versioning so you can roll back bad deployments
- Scale horizontally as traffic grows
These requirements demand a fundamentally different architecture from what you use during development.
Choosing a Model Serving Strategy
There are three main approaches to serving ML models, each with distinct trade-offs:
1. Embedded Serving
The simplest approach — load the model directly into your application process. Works well for small models (under 100 MB) with low traffic. We use this for our document classification microservice, where a lightweight scikit-learn model runs inside a FastAPI endpoint. The downside: scaling the app means duplicating the model in memory across every instance.
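The embedded pattern boils down to one rule: load the artifact once at process startup, never per request. Here's a minimal sketch with a stubbed stand-in for a real scikit-learn model (the class, path, and labels are illustrative, not our actual service code):

```python
import os
import pickle
import tempfile

# Stand-in for a trained scikit-learn classifier; in a real service this
# artifact comes from the training pipeline.
class StubClassifier:
    def predict(self, docs):
        return ["invoice" if "total" in d.lower() else "other" for d in docs]

# Persist once (normally done at training time)...
_path = os.path.join(tempfile.gettempdir(), "stub_classifier.pkl")
with open(_path, "wb") as f:
    pickle.dump(StubClassifier(), f)

# ...then load ONCE at process startup. In a FastAPI app this sits at
# module level (or in a startup hook), so every request reuses one copy.
with open(_path, "rb") as f:
    MODEL = pickle.load(f)

def classify(doc: str) -> str:
    # Request handler body: no disk I/O, no model re-load per call.
    return MODEL.predict([doc])[0]
```

The trade-off mentioned above follows directly from this: `MODEL` lives in every worker process, so N app instances means N copies in memory.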
2. Model Server (TensorFlow Serving / Triton)
Dedicated model serving infrastructure. TensorFlow Serving or NVIDIA Triton run as separate services optimised specifically for inference. They handle batching, GPU memory management, and model versioning out of the box. This is our default choice for any model that needs to serve more than a few hundred requests per minute.
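With a dedicated model server, your application becomes a thin client. TensorFlow Serving's REST API takes a POST to `/v1/models/<name>:predict` with an `"instances"` list; here's a sketch of building such a request (the model name and feature values are made up, and the actual network call is commented out since it needs a running server):

```python
import json

# Hypothetical model name; TF Serving's REST predict endpoint is
# POST /v1/models/<name>:predict with an "instances" list in the body.
MODEL_NAME = "doc_verifier"
url = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

def build_predict_request(feature_rows):
    # Each row becomes one instance in the batch sent to the server.
    return json.dumps({"instances": feature_rows}).encode("utf-8")

body = build_predict_request([[0.1, 0.4, 0.9], [0.7, 0.2, 0.3]])

# To actually send it (requires a running TF Serving instance):
# import urllib.request
# req = urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})
# predictions = json.loads(urllib.request.urlopen(req).read())["predictions"]
```

Because the server owns batching and versioning, rolling out a new model is a filesystem operation on the serving side, with no client redeploy.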
3. Edge Inference
For our Health Facility Management System, clinics in rural Zimbabwe can't always rely on stable internet. We convert models to TensorFlow Lite and run inference on-device. Accuracy takes a small hit from quantisation, but the system works completely offline — which is the only option that matters in the field.
Optimising Latency
Raw model inference is often fast enough. The bottleneck is usually everything around it. Here are the techniques that made the biggest difference for us:
Input preprocessing pipelines
Move preprocessing out of the request path. If your model expects normalised 224x224 images, don't resize and normalise on every request. Pre-compute what you can, cache intermediate representations, and use vectorised operations (NumPy/OpenCV) instead of Python loops. We reduced our document verification preprocessing from 45ms to 8ms by switching from PIL to OpenCV and pre-allocating buffers.
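To make the vectorised-vs-loops point concrete, here's a sketch of normalising a batch of images into a pre-allocated float32 buffer in one NumPy pass (the normalisation constants are standard ImageNet-style values used for illustration, not necessarily what any given model expects):

```python
import numpy as np

# Illustrative ImageNet-style per-channel normalisation constants.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Pre-allocate the output buffer once; reuse it across requests to
# avoid per-request allocation churn.
BATCH, H, W, C = 8, 224, 224, 3
_buffer = np.empty((BATCH, H, W, C), dtype=np.float32)

def preprocess(images: np.ndarray, out: np.ndarray = _buffer) -> np.ndarray:
    # images: uint8, shape (N, 224, 224, 3) with N <= BATCH.
    # One vectorised pass: scale to [0, 1], subtract channel means,
    # divide by channel stds. No Python-level pixel loops.
    n = images.shape[0]
    np.divide(images, 255.0, out=out[:n])
    out[:n] -= MEAN          # broadcasts over the channel axis
    out[:n] /= STD
    return out[:n]
```

The shared default buffer is deliberate: it trades a little flexibility (fixed max batch, not thread-safe without care) for zero allocations on the hot path.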
Request batching
GPUs are throughput machines — they're inefficient when processing one input at a time. TensorFlow Serving's built-in batching collects individual requests into batches before running inference. We configure a maximum batch size of 32 with a 10ms batching window. Individual request latency increases slightly, but overall throughput jumps 4-5x.
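The same size-or-deadline batching idea can be sketched in application code. This is a simplified stand-in for what TF Serving does server-side; the parameters mirror the config described above, but the class itself is illustrative:

```python
import asyncio

MAX_BATCH = 32      # cap on batch size
WINDOW_S = 0.010    # 10 ms batching window

class MicroBatcher:
    """Collects concurrent predict() calls into batches for one infer call."""

    def __init__(self, infer_fn):
        self.infer_fn = infer_fn      # runs inference on a list of inputs
        self.queue = asyncio.Queue()

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first request, then hold the window open.
            batch = [await self.queue.get()]
            deadline = loop.time() + WINDOW_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One inference call for the whole batch, then fan results out.
            inputs = [item[0] for item in batch]
            for (_, fut), out in zip(batch, self.infer_fn(inputs)):
                fut.set_result(out)

    async def predict(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut
```

Spawn `run()` as a background task; concurrent callers simply `await predict(x)` and pay at most one batching window of extra latency.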
Model quantisation
Converting model weights from 32-bit floats to 8-bit integers (INT8 quantisation) reduces model size by 4x and inference time by 2-3x on CPU. For our document verification model, post-training quantisation dropped inference from 35ms to 12ms with negligible accuracy loss (less than 0.3% F1 drop).
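The arithmetic behind this is a simple affine mapping. Here's a minimal NumPy sketch of 8-bit quantisation on a toy weight matrix — illustrative of the idea, not the TFLite converter itself (shown with an unsigned 8-bit range; the signed INT8 case differs only in the offset):

```python
import numpy as np

def quantize_8bit(weights: np.ndarray):
    # Affine map float32 -> 8-bit: w ~= scale * (q - zero_point).
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(weights / scale + zero_point), 0, 255)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)   # toy weight matrix
q, scale, zp = quantize_8bit(w)

# 4x smaller: float32 (4 bytes/weight) -> 1 byte/weight,
# plus a tiny per-tensor scale and zero point.
assert q.nbytes == w.nbytes // 4
max_err = np.abs(dequantize(q, scale, zp) - w).max()
```

Per-weight error is bounded by half the scale, which is why accuracy loss stays small when the weight range is well-behaved; outliers stretch the scale and hurt, which is where per-channel quantisation helps.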
Automated Retraining Pipelines
Models degrade over time as the real world drifts from your training data. We've built automated retraining pipelines that:
- Monitor prediction confidence — when average confidence drops below a threshold over a rolling window, the pipeline triggers.
- Collect new training data — flagged predictions are reviewed by our team and added to the training set.
- Retrain and evaluate — the pipeline trains a new model version and runs it against a held-out test set. If performance meets the bar, it proceeds.
- Canary deployment — the new model receives 10% of traffic initially. If error rates stay stable for 24 hours, it gradually takes over.
- Rollback — if the canary shows degraded performance, traffic automatically routes back to the previous version.
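The confidence-based trigger in the first step can be sketched as a rolling-window check (the window size and threshold here are illustrative, not our production values):

```python
from collections import deque

class DriftMonitor:
    """Fires when mean prediction confidence over a rolling window drops."""

    def __init__(self, window_size: int = 1000, threshold: float = 0.85):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, confidence: float) -> bool:
        # Returns True when the retraining pipeline should be triggered.
        self.window.append(confidence)
        if len(self.window) < self.window.maxlen:
            return False   # not enough signal yet
        return sum(self.window) / len(self.window) < self.threshold
```

A rolling mean is deliberately crude: it smooths over single bad predictions but still reacts within one window of a genuine distribution shift.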
This entire pipeline runs on scheduled jobs with human review at the data labelling stage. Full automation without human oversight is tempting, but we've learned that a human in the loop at the data stage catches edge cases that metrics miss.
Infrastructure Considerations for Africa
Deploying ML infrastructure in African markets adds constraints that most ML engineering blog posts ignore:
- Bandwidth costs — Cloud egress charges add up fast when serving large model responses. We aggressively compress outputs and use regional CDN caching.
- Latency to cloud regions — The nearest major cloud regions to Zimbabwe are in South Africa or Europe. We use edge caching and lightweight proxy servers to minimise round trips.
- Offline-first design — Many of our users operate in areas with intermittent connectivity. Our mobile applications cache model weights locally and sync results when connectivity returns.
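On the bandwidth point: even plain gzip on JSON responses cuts egress substantially, because batched model outputs are highly repetitive. A quick stdlib sketch with a synthetic payload:

```python
import gzip
import json

# Hypothetical batch of model outputs -- repeated keys compress very well.
payload = json.dumps(
    [{"label": "approved", "confidence": 0.97} for _ in range(500)]
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(payload) / len(compressed)
# Served with "Content-Encoding: gzip", clients decompress transparently.
```

In practice this is a one-line middleware or load-balancer setting, but it's worth verifying it's actually on: uncompressed JSON egress is an easy bill to run up.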
Key Takeaways
If you're deploying ML models in production, especially in emerging markets:
- Start with the simplest serving approach that meets your latency and throughput requirements
- Invest in preprocessing optimisation before reaching for bigger hardware
- Build retraining pipelines early — model drift is inevitable
- Design for offline-first if your users have unreliable connectivity
- Quantise aggressively — the accuracy trade-off is almost always worth it
We'll be sharing more technical deep dives as we build out our ML infrastructure alongside our new partnership with Strateji. If you're working on similar challenges, reach out — we're always keen to exchange notes.
— Ebenezer Tarubinga, AI/ML Engineer, Statotech Systems