Tasks

tasks

class langformers.tasks

Bases: object

Factory class for creating and managing various NLP tasks involving LLMs and MLMs.

static create_chunker(strategy: str, **kwargs)

Factory method for creating a chunker that splits a document into smaller pieces (chunks).

Parameters:
  • strategy (str, required) – The chunking strategy. Currently supported startegies: fixed_size, semantic, recursive.

  • **kwargs (str, required) – Chunking strategy specific keyword arguments.

Returns:

An instance of the appropriate chunker class, based on the selected strategy.

  • If strategy is “fixed_size”, an instance of FixedSizeChunker is returned.

  • If strategy is “semantic”, an instance of SemanticChunker is returned.

  • If strategy is “recursive”, an instance of RecursiveChunker is returned.

static create_classifier(model_name: str, csv_path: str, text_column: str = 'text', label_column: str = 'label', training_config: Dict | None = None)

Factory method for training text classifiers. Only Hugging Face models are supported. Used for finetuning models such as BERT, RoBERTa, MPNet on a text classification dataset.

Parameters:
  • model_name (str, required) – Name or path of the pretrained transformer model (e.g., “bert-base-uncased”).

  • csv_path (str, required) – Path to the CSV file containing training data.

  • text_column (str, default="text") – Column name in the CSV file containing the input text.

  • label_column (str, default="label") – Column name in the CSV file containing labels.

  • training_config (Optional[Dict], required) – A dictionary containing training parameters. If not provided, default values will be assigned from langformers.classifiers.huggingface_classifier.TrainingConfig.

Returns:

An instance of HuggingFaceClassifier.

static create_embedder(provider: str, model_name: str, **kwargs)

Factory method for creating a sentence embedder.

Parameters:
  • provider (str, required) – The model provider (e.g., “huggingface”). Currently supported providers: huggingface.

  • model_name (str, required) – The model name from the provider’s hub (e.g., “sentence-transformers/all-mpnet-base-v2”).

  • **kwargs (dict, required) – Provider specific keyword arguments. Keeping this as more providers will be added.

Returns:

An instance of the appropriate embedder class, based on the selected provider.

  • If provider is “huggingface”, an instance of HuggingFaceEmbedder is returned.

static create_generator(provider: str, model_name: str, memory: bool = True, dependency: Callable[[...], Any] | None = None, device: str = None)

Factory method for creating and managing LLMs chatting (User Interface) and LLM Inference (REST api).

Parameters:
  • provider (str, required) – The model provider (e.g., “ollama”). Currently supported providers: ollama, huggingface.

  • model_name (str, required) – The model name from the provider’s hub (e.g., “llama3.1:8b”).

  • memory (bool, default= True) – Whether to save previous chat interactions to maintain context. Chatting with an LLM (via a user interface) definitely makes sense to maintain memory, which is why this option defaults to True. But, when doing LLM Inference (via REST api) there might be some use cases where maintaining contexts might not be useful. Therefore, this option exists.

  • dependency (Optional[Callable[..., Any]], default=<no auth>) – A FastAPI dependency. The callable can return any value which will be injected into the route /api/generate.

  • device (str, default=None) – The device to load the model, data on (“cuda”, “mps” or “cpu”). If not provided, device will automatically be inferred. Currently used for HuggingFace models, as input ids and attention mask need to be on the save device as the model.

Returns:

An instance of the appropriate generator class, based on the selected provider.

  • If provider is “huggingface”, an instance of HuggingFaceGenerator is returned.

  • If provider is “ollama”, an instance of OllamaGenerator is returned.

static create_labeller(provider: str, model_name: str, multi_label: bool = False)

Factory method for loading an LLM for data labelling tasks.

Parameters:
  • provider (str, required) – The name of the embedding model provider (e.g., “huggingface”). Currently supported providers: huggingface, ollama.

  • model_name (str, required) – The model name from the provider’s hub (e.g., “llama3.1:8b” if provider is “Ollama”).

  • multi_label (bool, default=False) – If True, allows multiple labels to be selected.

Returns:

An instance of the appropriate labeller class, based on the selected provider.

  • If provider is “ollama”, an instance of OllamaDataLabeler will be returned.

  • If provider is “huggingface”, an instance of HuggingFaceDataLabeler will be returned.

static create_mimicker(teacher_model: str, student_config: dict, training_config: dict)

Factory method for creating smaller models by mimicking the vector space of a large teacher model.

Parameters:
  • teacher_model (str, required) – Name of the teacher model from Hugging Face.

  • student_config (dict, required) – Configuration for the student model.

  • training_config (dict, required) – Configuration for training.

Returns:

An instance of EmbeddingMimicker.

static create_mlm(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path=None)

Factory method for training a masked language model based on RoBERTa pretraining procedure.

Parameters:
  • tokenizer (str, required) – Path to the trained tokenizer.

  • tokenized_dataset (str, required) – Path to the tokenized dataset.

  • model_config (Optional[Dict], default=None) – Dictionary containing model configurations. If None, the model must be loaded from a checkpoint.

  • training_config (Optional[Dict], default=None) – Dictionary containing training configurations. If None, default training configurations will be used.

  • checkpoint_path (Optional[str], default=None) – Path to a model checkpoint for resuming training.

Returns:

An instance of HuggingFaceMLMCreator.

static create_reranker(model_type: str, model_name: str, **kwargs)

Factory method for creating a reranker.

Parameters:
  • model_type (str, required) – The type of the reranker model. Currently supported model types: cross_encoder

  • model_name (str, required) – The model name from Hugging Face (e.g., “cross-encoder/ms-marco-MiniLM-L-6-v2”). Refer to this link for more models: https://huggingface.co/models?library=sentence-transformers&pipeline_tag=text-ranking

  • **kwargs (dict, required) – Model specific keyword arguments. Keeping this as more model_type can be added.

Returns:

An instance of the appropriate reranker class, based on the selected model type.

  • If model_type is “cross_encoder”, an instance of CrossEncoder is returned.

static create_searcher(database: str, **kwargs)

Factory method for creating a searcher and perform vector search.

Parameters:
  • database (str, required) – Type of vector database (e.g., “faiss”). Supported database: faiss, chromadb, and pinecone.

  • **kwargs (dict, required) – Database specific keyword arguments.

Returns:

An instance of the appropriate searcher class, based on the selected database.

  • If database is “faiss”, an instance of FaissSearcher is returned.

  • If database is “chromadb”, an instance of ChromaDBSearcher is returned.

  • If database is “pinecone”, an instance of PineconeSearcher is returned.

static create_tokenizer(data_path: str, tokenizer_config: Dict | None = None, tokenizer: str = None)

Factory method for training a tokenizer on a custom dataset, and use the trained tokenizer to tokenize the dataset.

Parameters:
  • data_path (str, required) – Path to a raw text data (e.g., data.txt). Each line in the dataset should contain a single sentence or document. Each line can also be multiple sentences, but note that truncation will be applied.

  • tokenizer_config (dict, default=None) – Configurations for the tokenizer. If not provided, default values will be assigned from the dataclass langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig.

  • tokenizer (str, default=None) – Path to a trained tokenizer, such as “roberta-base” on Hugging Face, or a local path. If tokenizer is provided, it ignores tokenizer_config.

Returns:

An instance of MLMDatasetCreator.

static load_classifier(model_name: str)

Factory method for loading a custom classifier from disk.

Parameters:

model_name (str, required) – Path to the classifier.

Returns:

An instance of LoadClassifier.

Classifiers

HuggingFaceClassifier

class langformers.HuggingFaceClassifier(model_name: str, csv_path: str, text_column: str = 'text', label_column: str = 'label', training_config: Dict | None = None)

Bases: object

Fine-tunes a masked language model (e.g., BERT, RoBERTa, MPNet) on a text classification task.

This class handles dataset preparation, label encoding, model training, and evaluation.

__init__(model_name: str, csv_path: str, text_column: str = 'text', label_column: str = 'label', training_config: Dict | None = None)

Initializes the fine-tuning process for a masked language model.

Parameters:
  • model_name (str, required) – Name or path of the pretrained transformer model (e.g., “bert-base-uncased”).

  • csv_path (str, required) – Path to the CSV file containing training data.

  • text_column (str, default="text") – Column name in the CSV file containing the input text.

  • label_column (str, default="label") – Column name in the CSV file containing labels.

  • training_config (Optional[Dict], required) – A dictionary containing training parameters. If not provided, default values will be assigned from langformers.classifiers.huggingface_classifier.TrainingConfig.

create_dataset(df, labels)

Performs tokenization on a dataset.

train()

Starts the fine-tuning process.

class langformers.classifiers.huggingface_classifier.TrainingConfig(num_train_epochs: int = 10, save_total_limit: int = 1, learning_rate: float = 2e-05, per_device_train_batch_size: int = 16, per_device_eval_batch_size: int = 16, save_strategy: str = 'steps', save_steps: int = 100, eval_strategy: str = 'steps', logging_steps: int = 100, report_to: list = <factory>, logging_dir: str = None, output_dir: str = None, run_name: str = 'langformers-classifier-d20250504-t052724', test_size: float = 0.2, val_size: float = 0.1, metric_for_best_model: str = 'f1_macro', early_stopping_patience: int = 5, load_best_model_at_end: bool = True, max_length: int = None)

Bases: object

Default configuration for fine-tuning a Hugging Face model (e.g., BERT, RoBERTa, MPNet) on a text classification dataset. In addition to these parameters, you can pass the parameters class transformers.TrainingArguments takes. Refer: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.

Parameters:
  • num_train_epochs (int, default=10) – Number of epochs to train the model.

  • save_total_limit (int, default=1) – Limits the number of saved checkpoints to save disk space.

  • learning_rate (float, default=2e-5) – Learning rate. Controls the step size during optimization.

  • per_device_train_batch_size (int, default=16) – Batch size during training (per device).

  • per_device_eval_batch_size (int, default=16) – Batch size during evaluation (per device).

  • save_strategy (str, default="steps") – When to save checkpoints during training.

  • save_steps (int, default=100) – Number of steps between model checkpoint saves.

  • eval_strategy (str, default="steps") – When to run evaluation during training.

  • logging_strategy (str, default="steps") – When to log training metrics (“steps”, “epoch”, etc.).

  • logging_steps (int, default=100) – Number of update steps between logging.

  • report_to (list, default=["none"]) – List of integrations to report to (e.g., “tensorboard”, “wandb”).

  • logging_dir (str, default=None) – Directory to save the training logs. If not provided, logging will be done in a timestamp-based directory inside logs/. (see run_name)

  • output_dir (str, default=None) – Directory to save model checkpoints. If not provided, logging will be done in a timestamp-based directory. (see run_name)

  • run_name (str, default=<timestamp based string>) – Descriptor for the run. Will be typically used by logging tools “wandb”, “tensorboard” etc. Langformers will automatically generate a run name based on current timestamp.

  • test_size (float, default=0.2) – Proportion of data for test split.

  • val_size (float, default=0.1) – Proportion of data for validation split.

  • metric_for_best_model (str, default="f1_macro") – Metric to use for comparing models.

  • early_stopping_patience (int, default=5) – Number of evaluations to wait before early stopping.

  • early_stopping_threshold (float, default=0.0001) – Minimum improvement threshold for early stopping.

  • load_best_model_at_end (bool, default=True) – Whether to load best model at end of training.

  • max_length (int, default=None) – Maximum sequence length for tokenization. If not provided, automatically assigned to tokenizer’s model_max_length.

LoadClassifier

class langformers.classifiers.LoadClassifier(model_name: str)

Bases: object

Loads a text classification model trained with Langformers.

This class loads a custom classification model, tokenizes input text, and predicts labels with their associated probabilities.

__init__(model_name: str)

Loads the provided custom classification model. Retrieves label mappings if available in the model configuration.

Parameters:

model_name (str, required) – Path to the classifier.

classify(texts: list[str]) list[dict[str, Any]]

Classifies input texts into predefined categories.

Parameters:

texts (list[str], required) – A list of text strings to classify.

Notes

  • Tokenizes input texts and feeds them into the model.

  • Uses softmax to obtain probability distributions.

  • Returns the most probable class label for each input text.

Embedders

HuggingFaceEmbedder

class langformers.embedders.HuggingFaceEmbedder(model_name: str)

Bases: object

Embeds sentences using Hugging Face models.

This class generates embeddings for input texts and computes similarity scores between two given texts using cosine similarity.

__init__(model_name: str)

Loads the embedding model and its tokenizer.

Parameters:

model_name (str, required) – The model name from the provider’s hub (e.g., “sentence-transformers/all-mpnet-base-v2”).

embed(texts: list[str])

Generates normalized sentence embeddings for input texts.

Parameters:

texts (list[str]) – A list of text sequences to be embedded.

Notes

  • Uses mean pooling over token embeddings to obtain sentence embeddings.

  • Applies L2 normalization to the embeddings.

similarity(texts: list[str])

Computes cosine similarity between two input texts.

Parameters:

texts (list) – A list containing exactly two text sequences.

Notes

  • The similarity score ranges from -1 (completely different) to 1 (identical).

Generators

OllamaGenerator

class langformers.generators.OllamaGenerator(model_name: str, memory: bool = True, dependency: Callable[[...], Any] | None = None, device: str = None)

Bases: object

A FastAPI- and Ollama-powered LLM interface (user interface and REST api).

This class sets up a web server that allows users to interact with an LLM. It maintains a conversation history (memory) and supports real-time streaming responses.

__init__(model_name: str, memory: bool = True, dependency: Callable[[...], Any] | None = None, device: str = None)

Initializes the Ollama chat generator with the specified model.

Parameters:
  • model_name (str) – The name of the AI model to use for text generation.

  • memory (bool, default=True) – Whether to retain conversation history. Defaults to True.

  • dependency (Optional[Callable[..., Any]], default=<no auth>) – A FastAPI dependency. The callable can return any value which will be injected into the route /api/generate.

  • device (str, default=None) – The device to load the model, data on (“cuda”, “mps” or “cpu”). If not provided, device will automatically be inferred. Particularly used for HuggingFace models, input ids and attention mask needs to be on save device as the model.

Notes

  • Initializes FastAPI and sets up static and template directories.

  • Defines application routes for chat interactions.

async generate(prompt: str, system_prompt: str = 'You are a helpful, respectful, and honest AI assistant. Your goal is to provide concise, accurate, clear information while adhering to ethical guidelines. If you are unsure about something, say so. Avoid generating harmful, biased or inappropriate content.', memory_k: int = 10, temperature: float = 0.5, top_p: float = 1, max_length: int = 5000)

Generates an LLM response to a given prompt.

Parameters:
  • prompt (str, required) – The user’s input prompt.

  • system_prompt (str, default=default_chat_prompt_system) – The system-level instruction for the LLM.

  • memory_k (int, default=10) – The number of previous messages to retain in memory.

  • temperature (float, default=0.5) – Controls randomness of responses (higher = more random).

  • top_p (float, default=1) – Nucleus sampling parameter (lower = more focused).

  • max_length (int, default=5000) – Maximum number of tokens to generate.

Notes

  • Maintains conversation memory if enabled.

  • Streams responses in real time.

run(host: str = '0.0.0.0', port: int = 8000)

Starts the FastAPI web server.

Parameters:
  • host (str, default="0.0.0.0") – The IP address to bind the server to.

  • port (int, default=8000) – The port number to listen on.

setup_routes()

Defines the FastAPI routes for handling chat interactions.

Routes:
  • GET / : Renders the chat interface.

  • POST /api/generate : Accepts user input and returns AI-generated responses.

Notes

  • Uses Jinja2 templates for rendering the chat interface.

  • The API expects a JSON payload containing a prompt key.

HuggingFaceGenerator

class langformers.generators.HuggingFaceGenerator(model_name: str, memory: bool = True, dependency: Callable[[...], Any] | None = None, device: str = None)

Bases: object

A FastAPI- and Hugging Face-powered LLM interface (user interface and REST api).

__init__(model_name: str, memory: bool = True, dependency: Callable[[...], Any] | None = None, device: str = None)

Initializes the Hugging Face chat generator with the specified model.

Parameters:
  • model_name (str) – The name/path of the Hugging Face model to use

  • memory (bool, default=True) – Whether to retain conversation history

  • dependency (Optional[Callable[..., Any]], default=<no auth>) – A FastAPI dependency. The callable can return any value which will be injected into the route /api/generate.

  • device (str, default=None) – The device to load the model, data on (“cuda”, “mps” or “cpu”). If not provided, device will automatically be inferred. Particularly used for HuggingFace models, input ids and attention mask needs to be on save device as the model.

async generate(prompt: str, system_prompt: str = 'You are a helpful, respectful, and honest AI assistant. Your goal is to provide concise, accurate, clear information while adhering to ethical guidelines. If you are unsure about something, say so. Avoid generating harmful, biased or inappropriate content.', memory_k: int = 10, temperature: float = 0.5, top_p: float = 1, max_length: int = 5000)

Generates an LLM response to a given prompt.

Parameters:
  • prompt (str, required) – The user’s input prompt.

  • system_prompt (str, default=default_chat_prompt_system) – The system-level instruction for the LLM.

  • memory_k (int, default=10) – The number of previous messages to retain in memory.

  • temperature (float, default=0.5) – Controls randomness of responses (higher = more random).

  • top_p (float, default=1) – Nucleus sampling parameter (lower = more focused).

  • max_length (int, default=5000) – Maximum number of tokens to generate.

Notes

  • Maintains conversation memory if enabled.

  • Streams responses in real time.

run(host: str = '0.0.0.0', port: int = 8000)

Starts the FastAPI web server.

Parameters:
  • host (str, default="0.0.0.0") – The IP address to bind the server to.

  • port (int, default=8000) – The port number to listen on.

setup_routes()

Defines the FastAPI routes for handling chat interactions.

Routes:
  • GET / : Renders the chat interface.

  • POST /api/generate : Accepts user input and returns AI-generated responses.

Notes

  • Uses Jinja2 templates for rendering the chat interface.

  • The API expects a JSON payload containing a prompt key.

StreamProcessor

class langformers.generators.StreamProcessor(headers)

Bases: object

Handles Server-Sent Events (SSE) responses.

__init__(headers)

Initializes the StreamProcessor class.

Parameters:

headers (dict) –

A dictionary of headers. Below is an example. Headers also contain API keys, Bearer tokens, etc.

headers = {
    "Content-Type": "application/json"
}

process(endpoint_url: str, payload: dict, key_name: str = 'chunk', stream: bool = True, encoding: str = 'utf-8')

Processes an API response that sends Server-Sent Events (SSE).

Parameters:
  • endpoint_url (str) – The API endpoint (e.g., http://0.0.0.0:8000/api/generate).

  • payload (dict) – The request payload.

  • key_name (str, default="chunk") – The custom key name used in the SSE streams.

  • stream (bool, default=True) – Whether to stream outputs.

  • encoding (str, default="utf-8") – The encoding to use for JSON encoding.

Labellers

HuggingFaceDataLabeller

class langformers.labellers.HuggingFaceDataLabeller(model_name: str, multi_label: bool = False)

Bases: object

Performs data labelling task using generative LLMs available on Hugging Face.

This class uses the selected generative LLM to assign labels to input text based on user-defined conditions. It supports both single-label and multi-label classification.

__init__(model_name: str, multi_label: bool = False)

Initializes the HuggingFaceDataLabeler with the specified model.

Parameters:
  • model_name (str, required) – The name of the Hugging Face model to use.

  • multi_label (bool, default=False) – If True, allows multiple labels to be selected.

label(text: str, conditions: Dict[str, str])

Labels a given text based on user-defined conditions.

Parameters:
  • text (str, required) – The text to be classified.

  • conditions (Dict[str, str], required) – A dictionary mapping labels to their descriptions.

pipeline(complete_prompt: List[Dict[str, str]])

Runs the generated prompt through the Hugging Face text generation pipeline.

Parameters:

complete_prompt (List[Dict[str, str]], required) – The structured prompt to be processed.

OllamaDataLabeller

class langformers.labellers.OllamaDataLabeller(model_name: str, multi_label: bool = False)

Bases: object

Performs data labelling task using generative LLMs provided by Ollama.

This class uses the selected generative LLM to assign labels to input text based on user-defined conditions. It supports both single-label and multi-label classification.

__init__(model_name: str, multi_label: bool = False)

Initializes the OllamaDataLabeler with the specified model.

Parameters:
  • model_name (str, required) – The name of the Ollama model to use.

  • multi_label (bool, default=False) – If True, allows multiple labels to be selected.

label(text: str, conditions: Dict[str, str])

Labels a given text based on user-defined conditions.

Parameters:
  • text (str) – The text to be labelled.

  • conditions (Dict[str, str], required) – A dictionary mapping labels to their descriptions.

pipeline(complete_prompt: List[Dict[str, str]])

Runs the generated prompt through the Ollama’s chat().

Parameters:

complete_prompt (List[Dict[str, str]], required) – The structured prompt to be processed.

Mimickers

EmbeddingMimicker

class langformers.mimickers.EmbeddingMimicker(teacher_model: str, student_config: dict | None = None, training_config: dict | None = None)

Bases: object

Mimics the teacher model’s vector space using a student model.

This class is used to train a student model to replicate the embedding space of a pretrained teacher model. The student model learns to match the output embeddings of the teacher model by minimizing the mean squared error (MSE) loss between the teacher’s and student’s embeddings.

__init__(teacher_model: str, student_config: dict | None = None, training_config: dict | None = None)

Loads the teacher model and initializes the student model with provided configurations. Prepares dataset and dataloader required for the training.

Parameters:
  • teacher_model (str, required) – Name of the teacher model on Hugging Face.

  • student_config (dict, required) – Configuration for the student model. Refer to langformers.mimickers.embedding_mimicker.StudentConfig for key-value arguments.

  • training_config (dict, required) – Configuration for training. Refer to langformers.mimickers.embedding_mimicker.TrainingConfig for key-value arguments.

train()

Starts the training.

class langformers.mimickers.embedding_mimicker.StudentConfig(max_position_embeddings: int, num_attention_heads: int, num_hidden_layers: int, hidden_size: int, intermediate_size: int)

Bases: object

Configuration parameters for the student model.

Parameters:
  • max_position_embeddings (int, required) – Maximum sequence length for input to the student model.

  • num_attention_heads (int, required) – Number of attention heads in the student model.

  • num_hidden_layers (int, required) – Number of transformer layers in the student model.

  • hidden_size (int, required) – Size of the hidden layer.

  • intermediate_size (int, required) – Size of the intermediate layer.

class langformers.mimickers.embedding_mimicker.TrainingConfig(num_train_epochs: int, learning_rate: float, batch_size: int, dataset_path: list | str, logging_steps: int)

Bases: object

Default configuration for training.

Parameters:
  • num_train_epochs (int, required) – The number of epochs to train the student model.

  • learning_rate (float, required) – The learning rate used in optimization.

  • batch_size (int, required) – The batch size used during training.

  • dataset_path (list or str, required) – List of sentences or path to a text corpus. The text corpus should have one sentence (or a few) per line. Anything beyond the max_length of the teacher’s tokenizer will be truncated.

  • logging_steps (int, required) – The number of steps between each log message during training.

MLMs

MLMTokenizerDatasetCreator

class langformers.mlms.MLMTokenizerDatasetCreator(data_path: str, tokenizer_config: Dict | None = None, tokenizer: str = None)

Bases: object

Trains a tokenizer on a custom dataset and tokenizes the dataset required for MLM training.

__init__(data_path: str, tokenizer_config: Dict | None = None, tokenizer: str = None)

Initializes the provided data_path and tokenizer_config.

Parameters:
  • data_path (str, required) – Path to a raw text data (e.g., data.txt). Each line in the dataset should contain a single sentence or document. Each line can also be multiple sentences, but note that truncation will be applied.

  • tokenizer_config (Optional[Dict], default=None) – Configurations for the tokenizer. If not provided, default values will be assigned from langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig.

  • tokenizer (str, default=None) – Path to a trained tokenizer, such as “roberta-base” on Hugging Face, or a local path. If tokenizer is provided, it ignores tokenizer_config.

train()

Loads a tokenizer if one is provided, otherwise trains a new one. After the tokenizer has been trained, it is persisted on the disk. The provided tokenizer or if one was trained, is loaded for tokenization. Tokenized dataset is then persisted on the disk.

class langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig(vocab_size: int = 50265, min_frequency: int = 2, max_length: int = 512, special_tokens: list[str] = <factory>, path_to_save_tokenizer: str = './tokenizer')

Bases: object

Configuration for the tokenizer.

Parameters:
  • max_length (int, default=512) – Maximum sequence length for tokenization. For each line in the dataset, any sentence longer than this length will be truncated.

  • vocab_size (int, default=50_265) – Size of the vocabulary.

  • min_frequency (int, default=2) – Minimum frequency for a token to make its way into vocabulary.

  • path_to_save_tokenizer (str, default="./tokenizer")

  • special_tokens (List[str], default=["<s>","<pad>","</s>","<unk>","<mask>"]) – Special tokens.

HuggingFaceMLMCreator

class langformers.mlms.HuggingFaceMLMCreator(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path: str | None = None)

Bases: object

Trains a masked language model (MLM) using the RoBERTa pretraining procedure.

This class sets up and trains a RoBERTa-based masked language model using a tokenized dataset. It supports loading from a checkpoint or initializing a new model with a given configuration.

__init__(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path: str | None = None)

Initializes the masked language model (MLM) training process.

Parameters:
  • tokenizer (str, required) – Path to the tokenizer.

  • tokenized_dataset (str, required) – Path to the tokenized dataset.

  • model_config (Optional[Dict], default=None) – Dictionary containing model configurations. If None, checkpoint_path must be provided a valid checkpoint path.

  • training_config (Optional[Dict], default=None) – Dictionary containing training configurations. If None, default values will be assigned from langformers.mlms.mlm_trainer.TrainingConfig.

  • checkpoint_path (Optional[str], default=None) – Path to a model checkpoint for resuming training.

train()

Starts the training process.

The final model is saved at: - {self.training_config.output_dir}/final_model (default location).

The logs are saved at: - {self.training_config.output_dir}/training_logs (default location).

class langformers.mlms.mlm_trainer.TrainingConfig(num_train_epochs: int = 10, save_total_limit: int = 10, learning_rate: float = 0.0002, per_device_train_batch_size: int = 32, gradient_accumulation_steps: int = 2, save_strategy: str = 'steps', save_steps: int = 100, logging_steps: int = 100, report_to: list = <factory>, mlm_probability: float = 0.15, warmup_ratio: float = 0.05, logging_dir: str = None, output_dir: str = None, n_gpus: int = <factory>, run_name: str = 'langformers-mlm-d20250504-t052724')

Bases: object

Default configuration for pretraining an MLM with RoBERTa pretraining procedure. In addition to these parameters, you can pass the parameters class transformers.TrainingArguments takes. Refer: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.

Parameters:
  • num_train_epochs (int, default=10) – Number of epochs to train the model.

  • save_total_limit (int, default=10) – Limits the number of saved checkpoints to save disk space.

  • learning_rate (float, default=0.0002) – Learning rate. Controls the step size during optimization.

  • per_device_train_batch_size (int, default=32) – Batch size during training (per device).

  • gradient_accumulation_steps (int, default=2) – Simulates a larger batch size by accumulating gradients from multiple batches before performing a weight update.

  • save_strategy (str, default="steps") – When to save checkpoints during training.

  • save_steps (int, default=100) – Number of steps between model checkpoint saves.

  • logging_steps (int, default=100) – Number of update steps between logging.

  • report_to (list, default=[“none”]) – List of integrations to report to (e.g., “tensorboard”, “wandb”).

  • mlm_probability (float, default=0.15) – Probability of masking tokens during masked language modeling (MLM).

  • warmup_ratio (float, default=0.05) – Fraction of total training steps used for learning rate warmup.

  • logging_dir (str, default=None) – Directory to save the training logs. If not provided, logging will be done in a timestamp-based directory inside logs/. (see run_name)

  • output_dir (str, default=None) – Directory to save model checkpoints. If not provided, logging will be done in a timestamp-based directory. (see run_name)

  • n_gpus (int, default=1) – Number of GPUs to train with. Automatically computed for CUDA. Used for computing total steps of the training.

  • run_name (str, default=<timestamp based string>) – Descriptor for the run. Will be typically used by logging tools “wandb”, “tensorboard” etc. Langformers will automatically generate a run name based on current timestamp.

Rerankers

CrossEncoder

class langformers.rerankers.CrossEncoder(model_name: str)

Bases: object

Ranks text pairs using a cross-encoder model from Hugging Face.

__init__(model_name: str)

Loads the cross-encoder model and its tokenizer.

Parameters:

model_name (str, required) – The model name from Hugging Face (e.g., “cross-encoder/ms-marco-MiniLM-L-6-v2”)

rank(query: str, documents: List[str])

Predict the ranking scores for the given pairs of texts. The function expects a query and a list of documents. Returns the documents sorted by their scores.

Parameters:
  • query (str, required) – Query. E.g., “What is the capital of France?”

  • documents (List(str), required) – List of documents to be reranked. E.g., [“Paris is the capital of France.”, “Berlin is the capital of Germany.”]

Searchers

FaissSearcher

class langformers.searchers.FaissSearcher(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)

Bases: object

A FAISS-based semantic search engine for vectorized text retrieval.

This class allows efficient nearest neighbor search using FAISS and stores text data in an SQLite database. It supports different FAISS index types (FLAT and HNSW) and provides methods for adding, querying, and managing the search index.

__init__(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)

Initializes the FAISS searcher.

Parameters:
  • embedder (str, default=None) – Name of the HuggingFace embedding model.

  • index_path (Optional[str], default=None) – Path to the FAISS index file. If None, a new index is created.

  • db_path (Optional[str], default=None) – Path to the SQLite database file. If None, a new database is created.

  • index_type (str, default="HNSW") – Type of FAISS index. Supported types are FLAT and HNSW.

  • index_parameters (Optional[dict], default=None) –

    Additional parameters for the FAISS index. If not provided, the following is used.

    index_parameters = { "m": 32, "efConstruction": 40 }
    

add(texts: List[str], metadata: List[Dict[str, Any]] | None = None)

Adds text data to the FAISS index and stores it in the SQLite database.

Parameters:
  • texts (List[str]) – List of text entries to be indexed.

  • metadata (Optional[List[dict]], default=None) – Metadata associated with each text.

Notes

  • The texts are first converted to embeddings.

  • The embeddings are then added to the FAISS index.

  • The text data is stored in the SQLite database for retrieval.

count()

Returns the number of items in the FAISS index.

query(query: str, items: int = 1, include_metadata: bool = True)

Queries the FAISS index to find similar texts.

Parameters:
  • query (str, required) – The input text query.

  • items (int, default=1) – Number of nearest neighbors to retrieve.

  • include_metadata (bool, default=True) – Whether to include the metadata in the results.

Notes

  • The query text is first converted into an embedding.

  • The similarity score is calculated as 1 - (distance / 2), where distance is the FAISS L2 distance.

ChromaDBSearcher

class langformers.searchers.ChromaDBSearcher(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')

Bases: object

A ChromaDB-based semantic search engine for vectorized text retrieval.

This class integrates with ChromaDB to store and search for text embeddings, using a Hugging Face embedding model for vectorization.

__init__(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')

Initializes the ChromaDB searcher.

Parameters:
  • embedder (str) – Name of the Hugging Face embedding model.

  • db_path (Optional[str], default=None) – Path to the ChromaDB database. If None, a default name is generated.

  • collection_name (Optional[str], default="my_collection") – Name of the ChromaDB collection.

add(texts: List[str], metadata: List[Dict[str, Any]] | None = None)

Adds text data to the ChromaDB collection.

Parameters:
  • texts (List[str], required) – List of text entries to be indexed.

  • metadata (Optional[List[dict]], default=None) – Metadata associated with each text.

Notes

  • Each text is converted into an embedding.

  • A unique UUID is generated for each text entry.

  • The embeddings, along with metadata, are added to the ChromaDB collection.

count()

Return the number of items in the collection.

query(query: str, items: int = 1, include_metadata: bool = True)

Queries the ChromaDB collection to find similar texts.

Parameters:
  • query (str, required) – The input text query.

  • items (int, default=1) – Number of nearest neighbors to retrieve.

  • include_metadata (bool, default=True) – Whether to include the metadata in the results.

Notes

  • The query text is first converted into an embedding.

  • The similarity is derived from the squared distance returned by ChromaDB.

PineconeSearcher

class langformers.searchers.PineconeSearcher(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)

Bases: object

A Pinecone-based semantic search engine for vectorized text retrieval.

This class integrates with Pinecone to store and search for text embeddings, using a Hugging Face embedding model for vectorization.

__init__(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)

Initializes the Pinecone searcher.

Parameters:
  • embedder (str) – Name of the Hugging Face embedding model.

  • index_name (Optional[str], default=None) – Name of the Pinecone index.

  • api_key (str) – Pinecone API key for authentication.

  • index_parameters (Optional[dict], default=None) –

    Additional parameters for index configuration. If not provided, the following is used.

    index_parameters = {'cloud': 'aws',
                        'region': 'us-east-1',
                        'metric': 'cosine',
                        'dimension': self.dimension  # embedding model's output dimension.
                        }
    

add(texts: List[str], metadata: List[Dict[str, Any]] | None = None)

Adds text data to the Pinecone index.

Parameters:
  • texts (List[str]) – List of text entries to be indexed.

  • metadata (Optional[List[dict]], default=None) – Metadata associated with each text.

Notes

  • Each text is converted into an embedding.

  • A unique UUID is generated for each text entry.

  • The embeddings, along with metadata, are upserted into the Pinecone index.

count()

Returns the number of items stored in the Pinecone index.

query(query: str, items: int = 1, include_metadata: bool = True)

Queries the Pinecone index to find similar texts.

Parameters:
  • query (str, required) – The input text query.

  • items (int, default=1) – Number of nearest neighbors to retrieve.

  • include_metadata (bool, default=True) – Whether to include the metadata in the results.