Chunking

Chunking breaks down large documents into smaller, manageable pieces (or “chunks”). This is especially important when a document exceeds the token limit of your embedding model. When designing RAG pipelines, it’s crucial to consider the chunking strategy so that the model can effectively process each chunk and retrieve relevant information.

There are many ways to chunk documents, and the best approach depends on the specific use case. Langformers offers the following chunking strategies:

  • Fixed-size chunking

  • Semantic chunking

  • Recursive chunking

Tokenizer, tokens and chunks

Across all chunking strategies, the chunker uses the provided tokenizer to tokenize the document. The chunk size is defined in terms of tokens (not words). The number of tokens for the same document can vary depending on the tokenizer used.
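
For example, the same sentence can yield different token counts under different tokenizers. Here is a quick check using Hugging Face tokenizers (the model names are illustrative, and the counts include any special tokens a tokenizer adds):

# Compare token counts for the same text under two tokenizers
from transformers import AutoTokenizer

text = "Chunk sizes are measured in tokens, not words."

for name in ["bert-base-uncased", "sentence-transformers/all-mpnet-base-v2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, len(tokenizer.encode(text)))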

Fixed-size chunking

This approach divides a document into chunks containing a fixed number of tokens.

Here’s a simple example of how to use the fixed-size chunker:

# Import Langformers
from langformers import tasks

# Create a chunker
chunker = tasks.create_chunker(strategy="fixed_size", tokenizer="sentence-transformers/all-mpnet-base-v2")

# Chunk a document
chunks = chunker.chunk(document="This is a test document. It contains several sentences. We will chunk it into smaller pieces.",
                        chunk_size=8)
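
Assuming chunk() returns the chunks as a list of strings, you can inspect them directly:

# Print each chunk (assumes a list of chunk strings is returned)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")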

Overlapping: If overlap is provided to chunk(), the chunker will create overlapping chunks.
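
For example (assuming overlap, like chunk_size, is measured in tokens):

# Create overlapping chunks; consecutive chunks share "overlap" tokens
chunks = chunker.chunk(document="This is a test document. It contains several sentences. We will chunk it into smaller pieces.",
                        chunk_size=8,
                        overlap=2)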

langformers.tasks.create_chunker(strategy: str, **kwargs)

Factory method for creating a chunker that splits a document into smaller pieces (chunks).

Parameters:
  • strategy (str, required) – The chunking strategy. Currently supported strategies: fixed_size, semantic, recursive.

  • **kwargs (required) – Keyword arguments specific to the chosen chunking strategy.

Returns:

An instance of the appropriate chunker class, based on the selected strategy.

  • If strategy is “fixed_size”, an instance of FixedSizeChunker is returned.

  • If strategy is “semantic”, an instance of SemanticChunker is returned.

  • If strategy is “recursive”, an instance of RecursiveChunker is returned.

kwargs for fixed_size strategy:

langformers.chunkers.FixedSizeChunker.__init__(self, tokenizer: str)

Initializes the FixedSizeChunker class.

Parameters:

tokenizer (str, required) – The tokenizer to use for encoding the document.

Semantic chunking

Semantic chunking is a more advanced approach that uses semantic similarity to create chunks[1]. Initially, the document is split into smaller pieces; the chunker then groups them based on their semantic similarity.
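
To build intuition, here is a minimal sketch of this grouping idea written directly against sentence-transformers. It is an illustration, not Langformers’ implementation: it simply merges adjacent pieces whose embeddings are similar.

# A simplified sketch of similarity-based grouping (not Langformers' internals)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

pieces = ["Cats are awesome.", "Dogs are awesome.", "Python is amazing."]
embeddings = model.encode(pieces)

groups = [[pieces[0]]]
for i in range(1, len(pieces)):
    # Merge piece i into the current group if it is similar enough to piece i-1
    similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
    if similarity >= 0.3:
        groups[-1].append(pieces[i])
    else:
        groups.append([pieces[i]])

chunks = [" ".join(group) for group in groups]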

Here’s a simple example of how to use the semantic chunker:

# Import Langformers
from langformers import tasks

# Create a chunker
chunker = tasks.create_chunker(strategy="semantic", model_name="sentence-transformers/all-mpnet-base-v2")

# Chunk a document
chunks = chunker.chunk(document="Cats are awesome. Dogs are awesome. Python is amazing.",
                        initial_chunk_size=4,
                        max_chunk_size=10,
                        similarity_threshold=0.3)
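
Here, initial_chunk_size sets the size (in tokens) of the initial pieces, max_chunk_size caps the size of a merged chunk, and similarity_threshold is the minimum similarity required to group pieces together (interpretations inferred from the parameter names).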

kwargs for semantic strategy:

langformers.chunkers.SemanticChunker.__init__(self, model_name: str)

Initializes the SemanticChunker class.

Parameters:

model_name (str) – The name of the Hugging Face model to use for embedding. The model’s tokenizer will be used for tokenization.

Recursive chunking

Recursive chunking divides a document hierarchically using specified separators.

Typically, a document is split by sections first, then paragraphs, and so on. Langformers follows this approach by initially splitting the text using \n\n (representing sections), then \n (for paragraphs), and finally down to the token level. At each stage, if a chunk exceeds the tokenizer’s maximum token limit, it is recursively split into smaller parts until each chunk fits within the allowed token size.
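
The core recursion can be sketched as follows. This is a simplified illustration, not Langformers’ implementation: the final token-level fallback and any merging of small pieces are omitted, and token counting assumes a Hugging Face tokenizer.

# A simplified sketch of recursive splitting (not Langformers' internals)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def num_tokens(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def recursive_split(text, separators, chunk_size):
    # Base case: the text already fits, or there are no separators left
    # (a real implementation would fall back to splitting at the token level)
    if num_tokens(text) <= chunk_size or not separators:
        return [text]
    chunks = []
    for piece in text.split(separators[0]):
        if piece.strip():
            chunks.extend(recursive_split(piece, separators[1:], chunk_size))
    return chunks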

You can also pass your own separators to chunk().

Here’s a simple example of how to use the recursive chunker:

# Import Langformers
from langformers import tasks

# Create a chunker
chunker = tasks.create_chunker(strategy="recursive", tokenizer="sentence-transformers/all-mpnet-base-v2")

# Chunk a document
chunks = chunker.chunk(document="Cats are awesome.\n\nDogs are awesome.\nPython is amazing.",
                        separators=["\n\n", "\n"],
                        chunk_size=5)
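
Here, the document is first split at \n\n, and any resulting piece that still exceeds chunk_size (5 tokens) is split again at \n.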

kwargs for recursive strategy:

langformers.chunkers.RecursiveChunker.__init__(self, tokenizer: str)

Initializes the RecursiveChunker class.

Parameters:

tokenizer (str, required) – The tokenizer to use for encoding the document.

Footnotes