
What is Chunking?

Chunking is the process of splitting documents into optimally sized pieces for ingestion into vector databases and retrieval-augmented generation (RAG) systems. It directly affects how accurately AI can find and use your organisation's information when answering questions or completing tasks.

What Is Chunking?

Chunking is a foundational step in building any AI system that needs to search through and reason over your organisation's documents. It refers to the process of breaking large documents, such as reports, contracts, manuals, or policies, into smaller, meaningful segments called chunks that can be stored in a vector database and retrieved by AI models when needed.

Think of chunking like organising a filing cabinet. If you stuffed entire books into a single drawer, finding a specific piece of information would be nearly impossible. If you cut books into individual sentences, you would lose all context. Effective chunking finds the right balance, creating pieces large enough to carry meaningful context but small enough for precise retrieval.

Despite sounding like a technical detail, chunking decisions have an outsized impact on the quality of your AI system's outputs. Poor chunking is one of the most common reasons enterprise RAG systems deliver disappointing results.

How Chunking Works

When your organisation builds an AI assistant or knowledge base powered by RAG, the system must first process your documents through several steps. Chunking is the critical middle step:

  • Document ingestion: Raw documents in formats like PDF, Word, HTML, or plain text are loaded into the system and converted into a standard text format.
  • Chunking: The text is divided into smaller segments using one of several strategies. Common approaches include:
    • Fixed-size chunking: Splitting text into segments of a set number of characters or tokens, such as every 500 tokens. This is simple but can break mid-sentence or mid-paragraph (a basic sketch follows this list).
    • Semantic chunking: Using AI to identify natural boundaries in the text, such as topic changes or paragraph breaks, creating chunks that each contain a coherent idea.
    • Recursive chunking: Starting with large sections and progressively splitting them into smaller pieces, respecting document structure like headings, paragraphs, and lists.
    • Overlapping chunks: Including some text from the end of one chunk at the beginning of the next, ensuring that information at chunk boundaries is not lost.
  • Embedding: Each chunk is converted into a numerical representation called a vector, which captures its meaning and allows for semantic search.
  • Storage: The vectors and their associated text chunks are stored in a vector database for retrieval.
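
As a concrete illustration of the simplest strategies above, here is a minimal, dependency-free Python sketch of fixed-size chunking with overlap. The function name and parameter values are illustrative starting points, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Illustrative only: production systems usually count tokens rather
    than characters and respect sentence or heading boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With these defaults, the last 50 characters of each chunk are repeated at the start of the next, which is exactly the overlapping-chunks idea described above.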

When a user later asks a question, the system finds the chunks most semantically similar to the question and provides them to the AI model as context for generating an answer.
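
To make the retrieval step concrete, here is a hedged, self-contained sketch of similarity search over embedded chunks. The embed function is a toy stand-in (a normalised character-count vector) so the example runs on its own; a real system would call an embedding model instead.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in embedding: a normalised bag-of-characters vector.
    # A production system would call a real embedding model here.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank stored chunks by cosine similarity to the question and
    # return the k best matches to pass to the model as context.
    q = embed(question)
    scored = sorted(((float(np.dot(q, embed(c))), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:k]]
```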

Why Chunking Matters for Business

Chunking directly determines the quality of answers your AI system produces. For business leaders investing in AI-powered knowledge management, customer support, or decision-support tools, understanding chunking's impact is essential:

  • Chunks that are too large mean the AI retrieves passages containing a mix of relevant and irrelevant information. This dilutes the context provided to the AI model and can lead to vague or inaccurate answers. If a chunk contains an entire 20-page report, the model cannot focus on the specific section that answers the question.
  • Chunks that are too small strip away the context needed to understand information properly. A sentence extracted from a financial report might be meaningless without the surrounding paragraphs explaining the methodology and assumptions.
  • Poorly bounded chunks that break in the middle of sentences, tables, or logical arguments confuse the AI model and lead to fragmented or contradictory responses.

For example, a Southeast Asian bank implementing an AI system to help relationship managers access product information found that switching from fixed-size 1,000-character chunks to semantic chunks that respected document headings and sections improved answer accuracy by over 35 percent. The change required no new technology, just a better chunking strategy.

Key Examples and Use Cases

Legal document analysis. Law firms across Singapore and Malaysia processing contracts and regulatory filings need chunking strategies that respect the structure of legal documents. A clause in a contract must be kept together with its sub-clauses and definitions to maintain legal meaning. Firms that chunk contracts by clause structure rather than by arbitrary character limits get dramatically better results from their AI review tools.
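
As a hedged sketch of what clause-aware splitting can look like, the function below assumes a simple numbered-clause layout ("1.", "2.", ...) and keeps sub-clauses such as "2.1" inside their parent clause. Real contracts vary widely and typically need more robust parsing.

```python
import re

def split_by_clause(contract_text: str) -> list[str]:
    # Split only at top-level clause headings ("1. ", "2. ", ...),
    # so numbered sub-clauses stay together with their parent clause.
    parts = re.split(r"(?m)^(?=\d+\.\s)", contract_text)
    return [p.strip() for p in parts if p.strip()]
```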

Customer support knowledge bases. Companies like Tokopedia and Lazada maintain thousands of help articles and policy documents. Effective chunking ensures that when a customer asks a question, the AI retrieves the specific relevant section of a help article rather than an entire page or a disconnected fragment. This leads to faster, more accurate responses.

Financial research. Investment firms in the region processing analyst reports, earnings transcripts, and market research benefit from chunking that preserves the relationship between financial data and its accompanying analysis. A revenue figure is meaningless without the context explaining what drove that result.

Multilingual document processing. Organisations operating across ASEAN markets often have documents in English, Bahasa Indonesia, Thai, Vietnamese, and other languages. Chunking strategies must account for differences in language structure, sentence length, and character encoding to maintain quality across all languages.
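
One way to see this effect is to count tokens for the same sentence in different languages. The sketch below uses the tiktoken library as an assumption; in practice you would use whatever tokenizer matches your embedding model, and the translations here are only illustrative.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Refunds are processed within five business days.",
    "Bahasa Indonesia": "Pengembalian dana diproses dalam lima hari kerja.",
    "Thai": "การคืนเงินจะดำเนินการภายในห้าวันทำการ",
}
for language, sentence in samples.items():
    # Thai typically produces far more tokens per sentence than English,
    # which shifts what a "1,000-token chunk" actually contains.
    print(language, len(enc.encode(sentence)), "tokens")
```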

Getting Started with Chunking

If your organisation is building or improving a RAG-based AI system, here is how to approach chunking effectively:

  1. Audit your document types. Different documents require different chunking strategies. Contracts, technical manuals, FAQs, and emails each have different structures that should inform how they are split.
  2. Start with recursive chunking. This approach respects document structure by splitting at headings, then paragraphs, then sentences as needed. Most modern RAG frameworks like LangChain and LlamaIndex offer this as a built-in option (a minimal example follows this list).
  3. Add overlap between chunks. A 10-20 percent overlap between consecutive chunks prevents information loss at boundaries and is a simple improvement with significant impact.
  4. Test with real questions. The best way to evaluate your chunking strategy is to ask your system real questions that your users would ask and examine which chunks are retrieved. If the retrieved chunks contain the right information with sufficient context, your strategy is working.
  5. Iterate based on failure modes. When your AI system gives poor answers, examine the retrieved chunks. If the right information is not being found, your chunks may be too large. If the information is found but lacks context, your chunks may be too small or poorly bounded.
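
As a starting point for steps 2 and 3, here is a minimal sketch using LangChain's built-in recursive splitter. The file name and parameter values are illustrative placeholders to tune against your own documents, not recommendations.

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target chunk size in characters
    chunk_overlap=150,  # roughly 15 percent overlap, per step 3
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs first, then sentences
)

# "policy_manual.txt" is a hypothetical document for illustration.
document_text = open("policy_manual.txt", encoding="utf-8").read()
chunks = splitter.split_text(document_text)
```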

Chunking may seem like a minor technical detail, but it is the foundation upon which your entire RAG system's quality rests. Investing time in getting chunking right is one of the highest-return activities in any enterprise AI project.

Why It Matters for Business

Business impact: High

Key Considerations
  • Chunking quality directly determines the accuracy of your AI system's answers. Poor chunking is the most common and least visible cause of disappointing RAG performance in enterprise deployments.
  • Different document types require different chunking strategies. A one-size-fits-all approach will underperform compared to tailored strategies for contracts, reports, FAQs, and other document formats in your organisation.
  • Evaluate chunking effectiveness by testing with real user questions and examining retrieved results. The right chunk size and strategy can only be determined through iterative testing with your actual data and use cases.

Frequently Asked Questions

What is the ideal chunk size for business documents?

There is no universal ideal chunk size because it depends on your document types and use cases. However, most enterprise RAG systems perform well with chunks between 500 and 1,500 tokens, which roughly translates to one to three paragraphs. The key is to test different sizes with your actual documents and questions, then measure retrieval accuracy. Starting with 800-1,000 tokens and a 10-20 percent overlap is a practical default for most business document collections.
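
As a hedged sketch of that default, the token-based splitter below uses the tiktoken library as an assumption; substitute the tokenizer that matches your embedding model.

```python
import tiktoken  # pip install tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    # Fixed-size chunking measured in tokens rather than characters,
    # with roughly 15 percent overlap between consecutive chunks.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), step)]
```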

Can we change our chunking strategy after deployment?

Yes, and you should expect to refine your chunking strategy over time as you learn from real usage patterns. Changing the strategy requires re-processing your documents and rebuilding your vector database, which takes time and computing resources but is not destructive. Many organisations run A/B tests with different chunking approaches to measure which delivers better results before fully committing to a new strategy.

Does chunking work differently for non-English languages?

Yes, language differences affect optimal chunking approaches. Languages like Thai and Chinese do not use spaces between words, which impacts character-based chunking. Bahasa Indonesia and Malay tend to have longer sentences than English, which can affect token-based approaches. Most modern chunking tools handle multiple languages, but you should test your strategy specifically with your non-English documents rather than assuming settings optimised for English will transfer directly.

Need help implementing Chunking?

Pertama Partners helps businesses across Southeast Asia adopt AI strategically. Let's discuss how chunking fits into your AI roadmap.