What is Model Compression?
Model compression is a set of techniques for reducing the size and computational requirements of AI models while preserving most of their accuracy, enabling faster inference, lower costs, and deployment on resource-constrained devices such as mobile phones and edge hardware.
What Is Model Compression?
Model compression refers to a collection of techniques that reduce the size, memory footprint, and computational requirements of trained AI models. A large neural network with billions of parameters may deliver excellent accuracy, but it requires expensive GPU hardware to run and responds slowly. Model compression aims to create a smaller, faster version of that model that retains most of its capabilities at a fraction of the cost.
This is one of the most practically important areas of AI infrastructure because it directly impacts deployment costs, response latency, and the range of hardware that can run AI models. For businesses in Southeast Asia deploying AI across diverse environments, from cloud servers in Singapore to mobile devices and edge hardware across the region, model compression makes AI accessible where it otherwise would not be feasible.
Model Compression Techniques
Several established techniques are used to compress AI models:
Quantisation
Quantisation reduces the numerical precision of model parameters. A standard model uses 32-bit floating-point numbers for its weights and calculations. Quantisation converts these to lower precision formats:
- INT8 (8-bit integer): Reduces model size by 4x with minimal accuracy loss for most models
- INT4 (4-bit integer): Reduces model size by 8x with moderate accuracy impact, increasingly popular for large language models
- FP16/BF16 (16-bit floating point): A common middle ground that halves model size with negligible accuracy loss
Quantisation is the most widely used compression technique because it delivers significant size and speed improvements with minimal effort. Tools like ONNX Runtime, TensorRT, and llama.cpp make quantisation accessible to non-specialist teams.
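As a concrete illustration, the sketch below applies post-training dynamic quantisation with ONNX Runtime's quantisation utilities. It is a minimal example, not a full recipe: the file names are placeholders for a model you have already exported to ONNX, and static quantisation with calibration data is left out for brevity.

```python
# Minimal post-training dynamic quantisation sketch using ONNX Runtime.
# "model_fp32.onnx" and "model_int8.onnx" are placeholder paths, not
# names from this article.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # original 32-bit model
    model_output="model_int8.onnx",  # quantised output, roughly 4x smaller
    weight_type=QuantType.QInt8,     # store weights as 8-bit integers
)
```

Dynamic quantisation converts weights offline and quantises activations on the fly at inference time; static quantisation with a small calibration dataset can squeeze out further speed but requires more setup.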
Pruning
Pruning removes unnecessary connections or neurons from a neural network. Research has shown that many trained models contain redundant parameters that contribute little to performance. Pruning identifies and removes these parameters:
- Unstructured pruning: Removes individual weights, creating sparse matrices
- Structured pruning: Removes entire neurons, attention heads, or layers, resulting in a genuinely smaller model that runs faster on standard hardware
Structured pruning is generally more practical because it produces models that can use standard hardware and libraries without specialised sparse computation support.
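The sketch below illustrates both flavours using PyTorch's built-in pruning utilities on standalone linear layers. The layer sizes and sparsity levels are arbitrary examples rather than recommendations.

```python
# Illustrative pruning sketch with torch.nn.utils.prune.
import torch
import torch.nn.utils.prune as prune

dense = torch.nn.Linear(512, 512)
other = torch.nn.Linear(512, 512)

# Unstructured: zero the 30% of individual weights with the smallest magnitude,
# leaving a sparse weight matrix of the same shape.
prune.l1_unstructured(dense, name="weight", amount=0.3)

# Structured: zero 25% of entire output neurons (rows of the weight matrix);
# these rows can later be dropped from the architecture for real speedups
# on standard hardware.
prune.ln_structured(other, name="weight", amount=0.25, n=2, dim=0)

# Fold the masks into the weights so the pruning becomes permanent.
prune.remove(dense, "weight")
prune.remove(other, "weight")
```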
Knowledge Distillation
Knowledge distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model. Rather than training the student on raw data, it learns to match the teacher's outputs and internal representations. The result is a compact model that captures much of the teacher's knowledge:
- The student model can be 10-100x smaller than the teacher
- The student typically retains 90-99% of the teacher's accuracy
- The student can use a completely different, more efficient architecture
This technique is behind many practical deployments, including DistilBERT (a compressed version of BERT that is 40% smaller and 60% faster while retaining 97% of its language understanding capability).
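A minimal sketch of the core training signal, assuming you already have teacher and student logits for a batch of classification examples: the student is trained against a blend of the teacher's softened predictions and the true labels. The temperature and weighting values shown are illustrative, not tuned.

```python
# Standard distillation loss: soft targets from the teacher plus hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```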
Architecture Optimisation
A complementary approach is to design inherently efficient architectures from the start rather than compress large ones:
- MobileNet and EfficientNet: Vision models designed for mobile and edge deployment
- TinyLlama and Phi: Language models designed to be small yet capable
- ONNX optimisation: Converting models to the ONNX format and applying graph optimisations that eliminate redundant operations (see the sketch below)
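As a rough example of the ONNX route, the snippet below loads an exported model with ONNX Runtime and enables its full set of graph optimisations, such as operator fusion and constant folding. The file paths are placeholders for your own exported model.

```python
# Sketch of applying ONNX Runtime graph optimisations at session creation.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally save the optimised (fused, constant-folded) graph for inspection.
opts.optimized_model_filepath = "model_optimised.onnx"

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```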
Why Model Compression Matters for Business
Model compression has direct business impact across several dimensions:
Cost Reduction
Running a large language model in production can cost $10,000 or more per month in GPU infrastructure. A properly compressed version of the same model running on cheaper hardware can reduce this to $1,000-2,000 per month while maintaining acceptable quality. For organisations running multiple AI models in production, compression can reduce total AI infrastructure costs by 50-80%.
Faster Response Times
Compressed models generate predictions faster because they perform fewer calculations. For customer-facing applications where response time directly impacts user experience and conversion rates, this improvement is commercially significant. Reducing an AI response from 2 seconds to 200 milliseconds can meaningfully improve user engagement.
Edge and Mobile Deployment
Many AI use cases in Southeast Asia require running models on devices with limited resources. Retail point-of-sale systems, agricultural sensors, mobile banking apps, and manufacturing quality inspection cameras all benefit from compressed models that can run on modest hardware without cloud connectivity.
Environmental Impact
Smaller models consume less energy for every prediction. For organisations with sustainability commitments, model compression reduces the carbon footprint of AI operations.
Implementing Model Compression
A practical approach to model compression:
- Start with quantisation: It is the easiest technique to apply and often delivers the biggest immediate improvements. Most models can be quantised to INT8 with negligible accuracy loss using standard tools.
- Benchmark before and after: Measure model size, inference latency, and accuracy on your specific evaluation dataset before and after compression (see the sketch after this list). Ensure the accuracy trade-off is acceptable for your use case.
- Consider knowledge distillation for large models: If quantisation alone is not sufficient, distill a large model into a smaller architecture designed for your specific task.
- Use optimised inference runtimes: Deploy compressed models using optimised runtimes like TensorRT (NVIDIA), ONNX Runtime (cross-platform), or Core ML (Apple devices) to maximise performance.
- Test in production conditions: A model that performs well on a benchmark may behave differently in production with real-world data diversity. Validate compression results under realistic conditions.
- Iterate on compression levels: Start with moderate compression and increase until you find the optimal trade-off between size, speed, and accuracy for your specific application.
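As a starting point for the benchmarking step above, here is a rough sketch of a before/after harness. `run_model` and `eval_dataset` are hypothetical stand-ins for your own inference function and labelled evaluation data, not names from any particular library.

```python
# Rough before/after benchmarking sketch: accuracy plus latency percentiles.
import time

def benchmark(run_model, eval_dataset, n_warmup=5):
    # Warm up so one-off initialisation cost does not skew the timings.
    for example, _ in eval_dataset[:n_warmup]:
        run_model(example)

    correct, latencies = 0, []
    for example, label in eval_dataset:
        start = time.perf_counter()
        prediction = run_model(example)
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == label)

    latencies.sort()
    return {
        "accuracy": correct / len(eval_dataset),
        "p50_latency_s": latencies[len(latencies) // 2],
        "p95_latency_s": latencies[int(len(latencies) * 0.95)],
    }

# Run the same evaluation data through both versions and compare, e.g.:
# print(benchmark(original_model_fn, eval_dataset))
# print(benchmark(compressed_model_fn, eval_dataset))
```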
Model compression is not a one-time optimisation but an ongoing practice. As your models evolve and hardware capabilities change, regularly reassessing compression strategies ensures you maintain the best balance of performance and efficiency.
Model compression directly reduces your AI operating costs and expands where you can deploy AI, both of which have immediate financial impact. For CEOs and CTOs, the message is simple: a compressed model that is 90% as accurate as the full version but costs 70% less to run and responds 5x faster is almost always the better business choice for production deployment.
For business leaders in Southeast Asia, model compression is especially strategic because it enables AI deployment across the region's diverse infrastructure landscape. A model compressed to run on edge hardware works reliably in a factory in Vietnam or a retail outlet in rural Indonesia, locations where cloud-dependent AI may face connectivity limitations. This expands the addressable market for AI-powered products and services.
The cost implications scale with your AI portfolio. If your organisation runs five AI models in production, each costing $5,000 per month in compute, compression that reduces per-model costs by 60% saves $15,000 per month or $180,000 annually. These savings can be reinvested in developing new AI capabilities. Model compression should be a standard step in every production AI deployment, not an afterthought reserved for cost optimisation exercises.
- Make model compression a standard part of your production deployment process, not an optional optimisation. Every model should be evaluated for compression before going live.
- Start with quantisation (INT8) as the baseline compression technique. It delivers significant improvements with minimal effort and is supported by all major inference runtimes.
- Always benchmark compressed models on your specific data and use case. Generic accuracy metrics may not reflect performance on your production workload.
- Consider the accuracy-cost trade-off explicitly. For many business applications, a model that is 95% as accurate but 5x cheaper and faster is the better choice.
- Use knowledge distillation when you need aggressive compression beyond what quantisation provides, particularly for deploying large language model capabilities on smaller hardware.
- Test compressed models for fairness and bias. Compression can sometimes disproportionately affect performance on underrepresented data segments.
- Keep the uncompressed model available as a reference. You may need it for comparison, further fine-tuning, or use cases where maximum accuracy is required.
- Evaluate specialised inference runtimes like TensorRT and ONNX Runtime that can further optimise compressed model performance on specific hardware.
Frequently Asked Questions
How much accuracy do you lose with model compression?
The accuracy impact depends on the compression technique and aggressiveness. Quantisation to INT8 typically loses less than 1% accuracy, which is imperceptible for most business applications. More aggressive 4-bit quantisation may lose 2-5% accuracy. Knowledge distillation can retain 90-99% of the teacher model accuracy depending on the student model size. The key is to measure accuracy on your specific use case rather than relying on general benchmarks. For many business applications, the accuracy trade-off is negligible compared to the cost and speed improvements.
Which model compression technique should I start with?
Start with post-training quantisation to INT8. It requires no retraining, can be applied to most models in minutes using tools like ONNX Runtime or TensorRT, and typically reduces model size by 4x while improving inference speed by 2-3x with minimal accuracy loss. If you need further compression, try INT4 quantisation next. If quantisation alone is insufficient, consider knowledge distillation, which requires more effort but can achieve 10-100x compression. Pruning is useful but generally delivers smaller gains than quantisation and distillation.
Can large language models be compressed?
Yes, and this is one of the most active areas in AI infrastructure. Large language models are commonly compressed using 4-bit or 8-bit quantisation, reducing a 70-billion parameter model from 140GB to 35-70GB while retaining most of its capability. Tools like GPTQ, AWQ, and llama.cpp make this accessible. Additionally, smaller purpose-built models like Phi and TinyLlama provide LLM capabilities at a fraction of the cost. For many business applications, a compressed or smaller model fine-tuned for your specific use case outperforms a larger general model at a fraction of the cost.
Need help implementing Model Compression?
Pertama Partners helps businesses across Southeast Asia adopt AI strategically. Let's discuss how model compression fits into your AI roadmap.