Chunking
Chunking breaks down large documents into smaller, manageable pieces (or “chunks”). This is especially important when dealing with documents that exceed the token limits of your embedding model. When designing RAG pipelines, it’s crucial to consider the chunking strategy to ensure that the model can effectively process and retrieve relevant information.
There are many ways to chunk documents, and the best approach depends on the specific use case. Langformers offers the following chunking strategies:
Fixed-size chunking
Semantic chunking
Recursive chunking
Tokenizer, tokens and chunks
Across all chunking strategies, the chunker uses the provided tokenizer to tokenize the document. The chunk size is defined in terms of tokens (not words), and the number of tokens for the same document can vary depending on the tokenizer used.
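For instance, the same sentence encodes to a different number of tokens under different tokenizers. A quick way to compare, using the Hugging Face transformers library directly (a standalone illustration, not part of the Langformers API; bert-base-uncased is just an arbitrary second tokenizer):
# Compare token counts for the same text under two different tokenizers
from transformers import AutoTokenizer

text = "Chunk sizes are measured in tokens, not words."

for name in ["sentence-transformers/all-mpnet-base-v2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    token_ids = tokenizer.encode(text)
    print(f"{name}: {len(token_ids)} tokens")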
Fixed-size chunking
This approach divides a document into fixed-size chunks.
Here’s a simple example of how to use the fixed-size chunker:
# Import Langformers
from langformers import tasks
# Create a chunker
chunker = tasks.create_chunker(strategy="fixed_size", tokenizer="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(document="This is a test document. It contains several sentences. We will chunk it into smaller pieces.",
chunk_size=8)
Overlapping: If overlap is provided to chunk(), the chunker will create overlapping chunks.
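For example, to make consecutive chunks share two tokens, reusing the chunker created above:
# Chunk a document with a 2-token overlap between consecutive chunks
chunks = chunker.chunk(document="This is a test document. It contains several sentences. We will chunk it into smaller pieces.",
                       chunk_size=8,
                       overlap=2)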
- langformers.tasks.create_chunker(strategy: str, **kwargs)
Factory method for creating a chunker that splits a document into smaller pieces (chunks).
- Parameters:
strategy (str, required) – The chunking strategy. Currently supported strategies: fixed_size, semantic, recursive.
**kwargs (required) – Keyword arguments specific to the selected chunking strategy.
- Returns:
An instance of the appropriate chunker class, based on the selected strategy.
If strategy is “fixed_size”, an instance of FixedSizeChunker is returned.
If strategy is “semantic”, an instance of SemanticChunker is returned.
If strategy is “recursive”, an instance of RecursiveChunker is returned.
kwargs for fixed_size strategy:
- langformers.chunkers.FixedSizeChunker.__init__(self, tokenizer: str)
Initializes the FixedSizeChunker class.
- Parameters:
tokenizer (str, required) – The tokenizer to use for encoding the document.
- langformers.chunkers.FixedSizeChunker.chunk(self, document: str, chunk_size: int = None, overlap: int = 0)
Splits the document into fixed-size chunks based on the tokenizer’s encoding.
- Parameters:
document (str, required) – The document to be chunked. If the document is in a format such as PDF, it should be converted to a string first.
chunk_size (int, default=None) – The maximum size of each chunk. If not provided, the tokenizer's max length will be used.
overlap (int, default=0) – The number of tokens to overlap between consecutive chunks.
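Conceptually, fixed-size chunking with overlap amounts to sliding a window of chunk_size tokens over the document's token IDs, stepping by chunk_size minus overlap. A minimal sketch of that idea, using the tokenizer directly (an illustration, not Langformers' actual implementation):
# Sketch: slide a fixed-size window over token IDs, stepping by
# (chunk_size - overlap) so consecutive windows share `overlap` tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
token_ids = tokenizer.encode("This is a test document. It contains several sentences.",
                             add_special_tokens=False)

chunk_size, overlap = 8, 2
step = chunk_size - overlap
chunks = [tokenizer.decode(token_ids[i:i + chunk_size])
          for i in range(0, len(token_ids), step)]
print(chunks)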
Semantic chunking
Semantic chunking is a more advanced approach that uses semantic similarity to create chunks[1]. Initially, the document is split into smaller pieces, and the chunker then groups them based on their semantic similarity.
Here’s a simple example of how to use the semantic chunker:
# Import Langformers
from langformers import tasks
# Create a chunker
chunker = tasks.create_chunker(strategy="semantic", model_name="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(document="Cats are awesome. Dogs are awesome. Python is amazing.",
initial_chunk_size=4,
max_chunk_size=10,
similarity_threshold=0.3)
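With these settings, the two sentences about pets will likely be merged into a single chunk, while the sentence about Python will likely remain on its own, since its embedding is less similar to the other two.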
kwargs for semantic strategy:
- langformers.chunkers.SemanticChunker.__init__(self, model_name: str)
Initializes the SemanticChunker class.
- Parameters:
model_name (str) – The name of the Hugging Face model to use for embedding. The model’s tokenizer will be used for tokenization.
- langformers.chunkers.SemanticChunker.chunk(self, document: str, initial_chunk_size: int, max_chunk_size: int, similarity_threshold: float = 0.2)
Chunks the document into semantic segments based on cosine similarity.
- Parameters:
document (str, required) – The document to be chunked. If the document is in a format such as PDF, it should be converted to a string first.
initial_chunk_size (int, required) – The maximum size of the initial chunks that are later merged.
max_chunk_size (int, required) – The maximum size of the final chunks.
similarity_threshold (float, default=0.2) – The cosine similarity threshold for merging chunks.
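The overall procedure can be sketched as: embed each initial piece, then merge adjacent pieces whose embeddings have a cosine similarity at or above the threshold. A simplified illustration using sentence-transformers (it skips the initial fixed-size pass and the max_chunk_size check, and is not Langformers' exact implementation):
# Sketch: merge adjacent pieces whose embeddings are similar enough
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "initial chunks"; in practice these come from an initial chunking pass
pieces = ["Cats are awesome.", "Dogs are awesome.", "Python is amazing."]
embeddings = model.encode(pieces)

merged = [pieces[0]]
for prev_emb, emb, text in zip(embeddings, embeddings[1:], pieces[1:]):
    if cosine(prev_emb, emb) >= 0.3:   # similar enough: extend current chunk
        merged[-1] = merged[-1] + " " + text
    else:                              # dissimilar: start a new chunk
        merged.append(text)
print(merged)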
Recursive chunking
Recursive chunking divides a document hierarchically using specified separators.
Typically, a document is split by sections first, then paragraphs, and so on. Langformers follows this approach by initially splitting the text using \n\n (representing sections), then \n (for paragraphs), and finally down to the token level. At each stage, if a chunk exceeds the tokenizer's maximum token limit, it is recursively split into smaller parts until each chunk fits within the allowed token size.
You can also declare your own separators to split the document.
Here’s a simple example of how to use the recursive chunker:
# Import Langformers
from langformers import tasks
# Create a chunker
chunker = tasks.create_chunker(strategy="recursive", tokenizer="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(document="Cats are awesome.\n\nDogs are awesome.\nPython is amazing.",
separators=["\n\n", "\n"],
chunk_size=5)
kwargs for recursive strategy:
- langformers.chunkers.RecursiveChunker.__init__(self, tokenizer: str)
Initializes the RecursiveChunker class.
- Parameters:
tokenizer (str, required) – The tokenizer to use for encoding the document.
- langformers.chunkers.RecursiveChunker.chunk(self, document: str, separators=['\n\n', '\n'], chunk_size=None)
Chunks the document recursively based on the specified separators.
- Parameters:
document (str, required) – The document to be chunked. If the document is in a format such as PDF, it should be converted to a string first.
separators (list, default=["\n\n", "\n"]) – The list of separators to use for chunking.
chunk_size (int, default=None) – The maximum size of each chunk. If not provided, the tokenizer's max length will be used.
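The splitting logic can be sketched as a short recursion: keep any piece that already fits, otherwise split it with the next separator in the list, and fall back to token-level slicing once all separators are exhausted. A simplified illustration (not Langformers' exact implementation):
# Sketch: split recursively by separators, then by tokens as a last resort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def token_len(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def recursive_split(text, separators, chunk_size):
    if token_len(text) <= chunk_size:          # already fits: keep as-is
        return [text]
    if separators:
        sep, rest = separators[0], separators[1:]
        pieces = [p for p in text.split(sep) if p.strip()]
        if len(pieces) > 1:                    # split worked: recurse on pieces
            return [c for p in pieces for c in recursive_split(p, rest, chunk_size)]
        return recursive_split(text, rest, chunk_size)
    # No separators left: fall back to token-level slicing
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + chunk_size])
            for i in range(0, len(ids), chunk_size)]

print(recursive_split("Cats are awesome.\n\nDogs are awesome.\nPython is amazing.",
                      ["\n\n", "\n"], chunk_size=5))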
Footnotes