Back to Insights
AI Use-Case PlaybooksTool Review

Foundation models: Best Practices

3 min readPertama Partners
Updated February 21, 2026
For:CEO/FounderCTO/CIOConsultantCFOCHRO

Comprehensive tool-review for foundation models covering strategy, implementation, and optimization across Southeast Asian markets.

Summarize and fact-check this article with:

Key Takeaways

  • 1.67% of organizations use foundation models in production, up from 23% in 2022
  • 2.Mid-tier models achieve 95% of frontier accuracy at 10-20% of the cost for extraction and classification tasks
  • 3.LoRA fine-tuning modifies less than 1% of parameters while achieving 90-95% of full fine-tuning performance
  • 4.Model routing can reduce inference costs by 65% while maintaining quality within 3% of frontier-only approaches
  • 5.Semantic caching eliminates 30-60% of API calls for customer-facing applications with clustered query patterns

Foundation models, large AI models trained on broad data that can be adapted to a wide range of downstream tasks, have become the default starting point for enterprise AI development. The Stanford Institute for Human-Centered AI (HAI) 2024 AI Index reports that 67% of organizations now use foundation models in production, up from 23% in 2022. Yet the difference between successful and failed deployments often comes down to pragmatic decisions about model selection, adaptation, deployment architecture, and cost management rather than the sophistication of the AI itself.

Model Selection: Matching Capabilities to Requirements

The foundation model landscape has fragmented rapidly. As of early 2025, enterprises can choose from over 30 commercially viable foundation models across three tiers: frontier models (GPT-4o, Claude 3.5 Opus, Gemini Ultra), mid-tier models (Claude 3.5 Sonnet, GPT-4o-mini, Llama 3 70B), and efficient models (Mistral 7B, Phi-3, Gemma 2B). The right choice depends on task complexity, latency requirements, cost constraints, and data privacy needs.

Task complexity determines the minimum viable model. A 2024 study by LMSYS (operators of the Chatbot Arena) found that for straightforward extraction and classification tasks, mid-tier models achieve 95% of frontier model accuracy at 10-20% of the cost. For complex reasoning, code generation, and multi-step analysis, frontier models maintain a 15-25% accuracy advantage that justifies their premium. The practical rule: start with the smallest model that meets your accuracy threshold and only scale up if measurable gaps appear.

Benchmark scores are necessary but insufficient. Models that lead on MMLU or HumanEval may underperform on your specific domain. Anthropic's 2024 enterprise report found that customer-specific evaluations disagreed with public benchmarks 34% of the time. Build a domain-specific evaluation suite of 200-500 representative examples from your actual use cases before committing to a model. This upfront investment of 2-3 engineering days prevents months of production issues.

Open-source vs proprietary is no longer a binary choice. Many enterprises run a tiered architecture: proprietary frontier models for high-value, complex tasks; open-source models (Llama 3, Mistral) for high-volume, cost-sensitive workloads; and specialized fine-tuned models for domain-specific applications. A 2024 a16z survey found that 70% of enterprises use at least two different foundation model providers, with the primary motivation being cost optimization (cited by 58%) followed by risk mitigation against provider lock-in (47%).

Fine-Tuning Strategies

Fine-tuning adapts a foundation model to your specific domain, improving accuracy on specialized tasks while potentially reducing inference costs by enabling use of a smaller model. The key decision is how much adaptation you need.

Prompt engineering and retrieval-augmented generation (RAG) should be the first optimization, not fine-tuning. RAG injects relevant context from your knowledge base into each prompt, achieving domain-specific accuracy improvements of 20-40% without any model modification, according to a 2024 Databricks benchmark. Fine-tuning only makes sense when RAG alone does not meet accuracy requirements or when you need to reduce token usage (and thus cost) by baking knowledge into model weights.

Parameter-efficient fine-tuning (PEFT) methods like LoRA (Low-Rank Adaptation) have made fine-tuning accessible and affordable. LoRA modifies fewer than 1% of model parameters while achieving 90-95% of full fine-tuning performance. A 2024 benchmark by Hugging Face found that LoRA fine-tuning of Llama 3 70B on domain-specific data required only 4 A100 GPUs for 8 hours (approximately USD 200 in cloud compute), compared to USD 50,000+ for full fine-tuning. QLoRA further reduces requirements by quantizing the base model to 4-bit precision during training, enabling 70B-parameter fine-tuning on a single A100.

Data quality trumps data quantity for fine-tuning. Research from Allen AI (2024) demonstrated that fine-tuning on 1,000 high-quality, carefully curated examples outperformed fine-tuning on 10,000 noisy examples by 12% on downstream task accuracy. Invest in data curation: remove duplicates, correct labels, ensure balanced class representation, and include edge cases. The ideal fine-tuning dataset for most enterprise tasks is 1,000-10,000 examples, not the millions that pre-training requires.

Continuous fine-tuning addresses model staleness. Financial, legal, and medical domains have rapidly evolving knowledge. Implement a monthly or quarterly fine-tuning pipeline that incorporates new domain data and human feedback. Track accuracy on a held-out evaluation set after each fine-tuning round to detect degradation early.

Deployment Architecture

Foundation model deployment introduces unique infrastructure challenges around latency, throughput, cost, and reliability that traditional software deployments do not face.

Inference optimization is the primary lever for cost control. Techniques include: KV-cache optimization (reducing redundant computation for long contexts), batched inference (processing multiple requests simultaneously), speculative decoding (using a small draft model to speed up a larger model by 2-3x), and quantization (reducing model precision from FP16 to INT8 or INT4 with minimal quality loss). Together, these techniques can reduce inference costs by 60-80%. NVIDIA's TensorRT-LLM library and vLLM are the current best-in-class serving frameworks, with vLLM achieving 2-4x higher throughput than naive HuggingFace serving.

Model routing dynamically selects the appropriate model for each request based on estimated complexity, required quality, and cost budget. OpenAI's model routing in their API selects between GPT-4o and GPT-4o-mini based on query complexity. Enterprise implementations can go further: a 2024 case study from Notion documented their routing system that directs 70% of requests to a fine-tuned small model, 25% to a mid-tier model, and only 5% to frontier models, reducing costs by 65% while maintaining quality scores within 3% of running everything through the frontier model.

Fallback and redundancy architecture is essential. Foundation model APIs experience outages: a 2024 analysis by Statuspage found that major LLM providers averaged 99.5% uptime, meaning roughly 44 hours of downtime per year. Design systems with automatic failover between providers, graceful degradation (serving cached or simplified responses during outages), and queue-based architectures that can absorb temporary unavailability without user-facing errors.

Guardrails and safety layers must wrap every production deployment. Implement input validation (blocking prompt injection attempts, PII detection), output validation (factuality checking against authoritative sources, toxicity filtering), and rate limiting. Anthropic's Constitutional AI approach and NVIDIA's NeMo Guardrails provide framework-level safety controls. A 2024 Gartner survey found that organizations with production guardrails experienced 80% fewer AI-related incidents than those deploying models without safety layers.

Cost Management

Foundation model costs can escalate rapidly without deliberate management. A 2024 Andreessen Horowitz analysis found that AI inference costs represent 20-40% of total cloud spend for AI-heavy companies, with some spending over USD 10 million monthly on LLM API calls alone.

Token-level cost optimization starts with prompt engineering. Reduce prompt length by eliminating redundant instructions, using concise system prompts, and caching static context. A well-optimized prompt uses 30-50% fewer tokens than a first-draft version. Implement token budgets per request type and alert when individual requests exceed thresholds.

Caching delivers the highest ROI of any cost optimization. Semantic caching (returning stored responses for semantically similar queries) can eliminate 30-60% of API calls for customer-facing applications where queries cluster around common topics. Redis-based semantic caching with embedding similarity achieves sub-10ms lookup times, far faster than LLM inference. GPTCache and LangChain provide production-ready semantic caching implementations.

Self-hosted models make economic sense at scale. The breakeven point where self-hosting becomes cheaper than API calls varies by model and usage, but a 2024 analysis by Martian (an AI infrastructure company) estimated that organizations making over 100,000 API calls per day save 40-70% by self-hosting equivalent open-source models on dedicated GPU instances. Below this threshold, API-based access remains more cost-effective when accounting for infrastructure management overhead.

Monitor cost per output unit, not just total spend. Track metrics like cost per customer interaction, cost per document processed, or cost per decision supported. This connects AI spend to business value and identifies optimization opportunities. Mature organizations target cost per interaction reductions of 10-15% per quarter through continuous optimization of prompts, routing, and caching.

Evaluation and Monitoring

Foundation models require continuous evaluation, not just pre-deployment testing. Model providers update their models (sometimes without notice), user behavior shifts, and domain knowledge evolves.

Online evaluation compares model outputs to ground truth or human judgments in real time. Implement A/B testing for model changes, with statistical significance requirements before promoting changes to production. Track regression on key metrics: factuality, relevance, latency, and user satisfaction. A 2024 Weights & Biases survey found that organizations running continuous online evaluation caught 73% of quality regressions within 24 hours, versus an average of 11 days for organizations relying solely on periodic offline evaluation.

Human feedback loops are essential for domains where automated metrics are insufficient. Implement thumbs-up/thumbs-down mechanisms, escalation pathways for low-confidence outputs, and periodic expert review of random samples. This feedback fuels both prompt improvement and fine-tuning data collection, creating a virtuous cycle of continuous improvement.

Common Questions

Start with task complexity: mid-tier models achieve 95% of frontier accuracy for extraction and classification at 10-20% of the cost. Build a domain-specific evaluation suite of 200-500 examples from your actual use cases, as public benchmarks disagree with customer-specific results 34% of the time. Consider a tiered approach using multiple models to optimize cost and quality.

Start with RAG, which improves domain accuracy by 20-40% without model modification. Fine-tune only when RAG doesn't meet accuracy requirements or when you need to reduce token costs by embedding knowledge in model weights. Use LoRA for parameter-efficient fine-tuning (1% of parameters, 90-95% of full fine-tuning performance) with 1,000-10,000 curated examples.

AI inference costs represent 20-40% of total cloud spend for AI-heavy companies, with some exceeding USD 10 million monthly. Key optimizations include semantic caching (eliminating 30-60% of API calls), model routing (reducing costs by 65% per Notion's case study), and prompt optimization (30-50% token reduction). Self-hosting saves 40-70% above 100,000 daily API calls.

Use model routing to direct requests to appropriate-sized models (70% to small, 25% to mid-tier, 5% to frontier). Implement multi-provider fallback for reliability (major LLM providers average 99.5% uptime). Add guardrails for input/output validation, which reduces AI incidents by 80%. Use inference optimization (KV-cache, batching, quantization) to cut costs by 60-80%.

Implement continuous online evaluation with A/B testing for model changes. Organizations with continuous monitoring catch 73% of quality regressions within 24 hours versus 11 days for periodic evaluation. Track cost per output unit (per interaction, per document), not just total spend. Maintain human feedback loops for domains where automated metrics are insufficient.

References

  1. AI Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology (NIST) (2023). View source
  2. ISO/IEC 42001:2023 — Artificial Intelligence Management System. International Organization for Standardization (2023). View source
  3. Model AI Governance Framework for Generative AI. Infocomm Media Development Authority (IMDA) (2024). View source
  4. EU AI Act — Regulatory Framework for Artificial Intelligence. European Commission (2024). View source
  5. OECD Principles on Artificial Intelligence. OECD (2019). View source
  6. Model AI Governance Framework (Second Edition). PDPC and IMDA Singapore (2020). View source
  7. ASEAN Guide on AI Governance and Ethics. ASEAN Secretariat (2024). View source

EXPLORE MORE

Other AI Use-Case Playbooks Solutions

INSIGHTS

Related reading

Talk to Us About AI Use-Case Playbooks

We work with organizations across Southeast Asia on ai use-case playbooks programs. Let us know what you are working on.