AI Safety & Security

What is a Model Extraction Attack?

A Model Extraction Attack is a technique where an adversary systematically queries a deployed AI model to reconstruct a functional copy of it, effectively stealing the model's learned knowledge, capabilities, and intellectual property without authorised access to its parameters, architecture, or training data.


A Model Extraction Attack, also known as model stealing, is a process where an attacker creates a functional replica of a target AI model by systematically querying it and using the responses to train their own substitute model. The attacker does not need access to the original model's code, weights, or training data. They only need access to the model's input-output interface — the same interface that legitimate users interact with.

The process is conceptually simple: the attacker sends a large number of carefully chosen inputs to the target model, collects the outputs, and uses these input-output pairs as training data for their own model. With enough queries, the substitute model can approximate the original model's behaviour with high fidelity.

Why Model Extraction Matters for Business

AI models often represent significant business value. They embody the investment in data collection, curation, and labelling; the expertise in model design and training; and the competitive advantage that comes from superior AI capabilities. Model extraction threatens this value in several ways:

  • Intellectual property theft: A competitor who extracts your model gains the benefit of your AI investment without the cost. They can deploy a copy of your model, undercut your pricing, or use it as a starting point for their own improvements.
  • Competitive advantage erosion: If your AI model is a key differentiator — a better recommendation engine, a more accurate risk model, a superior diagnostic tool — extraction eliminates that advantage.
  • Enabling further attacks: A stolen model can be used to develop adversarial attacks against the original system. With a local copy, an attacker can explore vulnerabilities at their leisure without triggering monitoring on the target system.
  • Privacy violations: If the extracted model retains information from the training data — which many models do — extraction can lead to indirect exposure of sensitive training data, including personal information.

How Model Extraction Works

Query-Based Extraction

The most common approach involves sending a large volume of queries to the target model through its API or user interface:

  1. Input selection: The attacker designs a set of inputs that will elicit informative responses. Random inputs work but are inefficient. Strategically chosen inputs that cover the model's decision boundaries require fewer queries to achieve good extraction.
  2. Response collection: The attacker collects the model's outputs for each input. The more information the model provides — confidence scores, class probabilities, embeddings — the easier extraction becomes.
  3. Substitute model training: Using the input-output pairs as labelled training data, the attacker trains their own model to mimic the target's behaviour.
  4. Iterative refinement: The attacker may iterate, using the substitute model to identify areas of disagreement with the target and focusing additional queries on those areas (the full loop is sketched below).
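
A minimal sketch of this extraction loop, assuming a hypothetical query_target(x) function that stands in for the victim model's API and using scikit-learn purely for illustration; the random input selection here is deliberately naive:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def extract_substitute(query_target, input_dim, n_queries=10_000, seed=0):
    """Train a substitute model from harvested input-output pairs."""
    rng = np.random.default_rng(seed)

    # 1. Input selection: random inputs for simplicity; real attacks
    #    choose queries that probe decision boundaries more efficiently.
    X = rng.normal(size=(n_queries, input_dim))

    # 2. Response collection: the target's predictions become the labels.
    y = [query_target(x) for x in X]

    # 3. Substitute model training on the harvested pairs.
    substitute = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300)
    substitute.fit(X, y)

    # 4. Iterative refinement (not shown): compare substitute and target
    #    predictions on fresh inputs and requery where they disagree.
    return substitute
```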

Side-Channel Extraction

Some attacks exploit information leaked through side channels rather than the model's direct outputs:

  • Timing information: The time a model takes to respond can reveal information about its architecture and decision process (a simple timing sketch follows this list).
  • Cache behaviour: In shared computing environments, cache access patterns can leak model information.
  • API metadata: Response headers, error messages, and other metadata may reveal model details.
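
As a simple illustration of the timing channel, the sketch below measures response latency per returned label for a hypothetical query_model function; consistent per-label latency gaps can hint at internals such as early-exit branches, though real side-channel attacks are considerably more involved:

```python
import statistics
import time
from collections import defaultdict


def latency_by_label(query_model, inputs, repeats=5):
    """Group median query latencies by the label the model returns."""
    samples = defaultdict(list)
    for x in inputs:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            label = query_model(x)  # hypothetical stand-in for an API call
            timings.append(time.perf_counter() - start)
        samples[label].append(statistics.median(timings))
    # Persistent latency gaps between labels are a noisy signal about
    # how the model computes its answer.
    return {label: statistics.mean(t) for label, t in samples.items()}
```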

Distillation-Based Approaches

Model extraction is closely related to knowledge distillation, a legitimate technique where a smaller "student" model is trained to mimic a larger "teacher" model. The difference is intent and authorisation — distillation is a tool used by the model owner, while extraction uses the same technique without permission.
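
For reference, the soft-label objective used in standard knowledge distillation looks roughly like the PyTorch-style sketch below; in an extraction setting the teacher's outputs would come from API responses rather than a model the attacker owns:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: match the teacher's softened distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the two distributions, scaled by T^2 so the
    # gradient magnitude stays comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2
```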

The Scale of the Threat

Model extraction is not a theoretical concern. Research has demonstrated successful extraction of:

  • Image classifiers: Researchers have extracted functional copies of commercial image recognition APIs with thousands of queries, achieving accuracy close to that of the original model.
  • Language models: The behaviour of large language models can be partially replicated through systematic querying, capturing their style, knowledge, and capabilities.
  • Recommendation systems: The logic behind product and content recommendations can be approximated through careful observation of system outputs.

Defending Against Model Extraction

Query Monitoring and Rate Limiting

Monitor API usage patterns for signs of extraction attempts:

  • Unusual query volumes: Extraction requires many queries. Set thresholds for query frequency and investigate anomalies.
  • Suspicious input patterns: Extraction queries often follow patterns distinct from legitimate usage — systematic coverage of the input space, unusual data distributions, or queries that appear designed to probe decision boundaries.
  • Rate limiting: Restrict the number of queries per user, per time period, and in total, making extraction more time-consuming and expensive (a minimal limiter is sketched after this list).
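
A minimal per-client sliding-window limiter along these lines might look like the following sketch; the interface and thresholds are illustrative assumptions rather than a production design:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_queries per client within a rolling time window."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window = window_seconds
        self.history = defaultdict(deque)  # client_id -> recent timestamps

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        recent = self.history[client_id]
        # Discard timestamps that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False  # reject, and ideally flag the client for review
        recent.append(now)
        return True
```

For example, SlidingWindowLimiter(max_queries=1000, window_seconds=3600) would cap each client at 1,000 queries per hour; extraction at scale typically needs far more.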

Output Perturbation

Reduce the information available to potential extractors:

  • Limit output detail: Return only top-level classifications rather than full probability distributions. Each piece of additional information makes extraction easier.
  • Add controlled noise: Introduce small random perturbations to model outputs that do not significantly affect legitimate use but degrade the quality of extracted models (this and output limiting are sketched after this list).
  • Watermark outputs: Embed detectable patterns in model outputs that transfer to extracted models, allowing you to identify stolen copies.
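
A sketch combining the first two ideas, limiting output detail and adding small perturbations, is shown below; the function name and parameters are illustrative assumptions, and watermarking is a separate, more involved technique:

```python
import random


def harden_output(probabilities, top_k=1, decimals=2, noise_scale=0.01):
    """Return a reduced, lightly perturbed view of a probability vector."""
    # Small random perturbation degrades an extractor's training signal
    # while leaving the top prediction almost always unchanged.
    noisy = {label: p + random.uniform(-noise_scale, noise_scale)
             for label, p in probabilities.items()}
    # Expose only the top-k labels, rounded to limit output precision.
    top = sorted(noisy.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return {label: round(max(p, 0.0), decimals) for label, p in top}
```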

Architectural Protections

  • Ensemble serving: Use multiple models and combine their outputs, making it harder for an attacker to extract any single model (sketched after this list).
  • Dynamic model serving: Periodically update or rotate the serving model, invalidating previously collected extraction data.
  • Query authentication: Require authenticated access and track query patterns per user to identify and block extraction attempts.
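
As an illustration of ensemble serving, the sketch below averages the outputs of a randomly chosen subset of models for each request, so repeated identical queries need not return identical answers; the model interface is an assumption for the example:

```python
import random


def ensemble_predict(models, x, sample_size=2):
    """Average probability outputs from a random subset of models."""
    chosen = random.sample(models, k=min(sample_size, len(models)))
    combined = {}
    for model in chosen:
        # Each model is assumed to return a dict of label -> probability.
        for label, prob in model(x).items():
            combined[label] = combined.get(label, 0.0) + prob / len(chosen)
    # Varying which models answer makes the served function a moving
    # target for anyone trying to replicate it query by query.
    return combined
```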

Legal and Contractual Protections

  • Terms of service: Explicitly prohibit model extraction in your API terms of service, establishing a legal basis for action against violators.
  • Intellectual property registration: Where applicable, register AI models as intellectual property to strengthen legal protections.
  • Audit rights: Include audit rights in enterprise agreements that allow you to verify how your API outputs are being used.

Model Extraction in Southeast Asia

The growing AI ecosystem in Southeast Asia creates both opportunities and risks related to model extraction. Companies in the region that have invested in building proprietary AI models for local markets — models trained on local languages, local data, and local business contexts — face the risk of having this hard-won local advantage extracted by competitors.

At the same time, the region's developing legal frameworks for AI intellectual property create uncertainty about enforcement. Singapore has relatively strong IP protections, but the legal status of AI model parameters and trained behaviour varies across ASEAN jurisdictions. Businesses should combine technical defences with legal protections and contractual safeguards.

Why It Matters for Business

Model Extraction Attacks threaten the return on your AI investment directly. For CEOs and CTOs, the risk is that competitors or bad actors can steal your AI model's capabilities — the product of your data, expertise, and compute investment — simply by querying your API systematically. They do not need to hack your servers or access your code. They just need to use your model enough to train their own copy.

This is particularly concerning for businesses that have built proprietary AI models as competitive differentiators. If you have invested in training models on unique datasets, local market data, or domain-specific knowledge — common for AI companies in Southeast Asia building solutions for local markets — model extraction can eliminate that competitive advantage at a fraction of your development cost.

The business response requires both technical and legal measures. Technical defences like rate limiting, output perturbation, and query monitoring raise the cost and difficulty of extraction. Legal protections through terms of service, IP registration, and contractual restrictions provide enforcement options. Neither approach alone is sufficient, but together they create meaningful barriers to model theft.

Key Considerations
  • Implement query monitoring and rate limiting on all AI APIs to detect and slow extraction attempts before a useful copy of your model can be created.
  • Limit the detail in model outputs to what users actually need, avoiding exposure of confidence scores, probability distributions, or embeddings that make extraction easier.
  • Include explicit anti-extraction provisions in your API terms of service and enterprise agreements to establish legal grounds for enforcement.
  • Consider output watermarking that embeds detectable signatures in model responses, allowing you to identify if a competitor is using an extracted copy of your model.
  • Treat your AI models as valuable intellectual property and explore IP registration options available in your operating jurisdictions.
  • Monitor the market for AI products that closely mimic your model's behaviour, which could indicate successful extraction.
  • Balance protection with usability — overly restrictive defences that degrade the user experience may drive away legitimate customers while only slowing determined attackers.

Frequently Asked Questions

How many queries does it take to extract an AI model?

The number varies significantly depending on the model complexity, the type of outputs available, and the fidelity the attacker aims to achieve. Research has shown that simple classifiers can be meaningfully extracted with thousands of queries, while more complex models may require millions. Models that return detailed outputs such as probability distributions are easier to extract than those returning only top-level predictions. Rate limiting and output restriction can significantly increase the number of queries required, making extraction impractical for many attackers.

Can we detect if our model has been extracted?

Detection is challenging but possible. Output watermarking embeds signatures in your model's responses that transfer to extracted copies, allowing identification. Monitoring for competitors who launch suspiciously similar AI products shortly after accessing your API can provide circumstantial evidence. Analysing the behaviour of suspected copies against your model's specific quirks and biases can indicate extraction. However, proving extraction definitively often requires legal processes including discovery and forensic analysis.

Is model extraction illegal in Southeast Asia?

The legal landscape varies across the region. Model extraction may violate intellectual property laws, trade secret protections, computer misuse statutes, or contractual terms of service depending on the jurisdiction. Singapore has relatively strong legal frameworks that could apply to model extraction cases. However, the specific legal status of AI model parameters as protectable intellectual property is still evolving across ASEAN. Businesses should consult legal counsel in their specific jurisdictions and strengthen their position through clear contractual terms, IP registration, and documentation of their investment in model development.

Need help defending against Model Extraction Attacks?

Pertama Partners helps businesses across Southeast Asia adopt AI strategically. Let's discuss how protecting against model extraction fits into your AI roadmap.