Semantic Search


Langformers can help you quickly set up a semantic search engine for vectorized text retrieval. All you need to do is specify an embedding model, the type of database (FAISS, ChromaDB, or Pinecone), and an index type (if required).

Here’s a sample code snippet to get you started:

# Import langformers
from langformers import tasks

# Initialize a searcher
searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="faiss", index_type="HNSW")

# For other vector databases:
#
# ChromaDB:
# searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="chromadb")
#
# Pinecone:
# searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="pinecone", api_key="your-api-key-here")

# Sentences to add to the vector database
sentences = [
    "He is learning Python programming.",
    "The coffee shop opens at 8 AM.",
    "She bought a new laptop yesterday.",
    "He loves to play basketball with friends.",
    "Artificial Intelligence is evolving rapidly.",
    "He studies CS at the University of Melbourne."
]

# Metadata for the respective sentences
metadata = [
    {"action": "learning", "category": "education"},
    {"action": "opens", "category": "business"},
    {"action": "bought", "category": "shopping"},
    {"action": "loves", "category": "sports"},
    {"action": "evolving", "category": "technology"},
    {"action": "studies", "category": "education"}
]

# Add the sentences
searcher.add(texts=sentences, metadata=metadata)

# Define a search query
query_sentence = "computer science"

# Query the vector database
results = searcher.query(query=query_sentence, items=2, include_metadata=True)
print(results)

Loading an Existing Database

Once a searcher is initialized, the specified index/database is persisted on disk (for FAISS and ChromaDB) or in the cloud (for Pinecone). To load an existing database, initialize a searcher with the following parameters (see the sketch after the list):

  • For FAISS: index_path and db_path.

  • For ChromaDB: db_path and collection_name.

  • For Pinecone: index_name.
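
A minimal sketch of reloading each database type; the paths, collection name, and index name below are placeholders for whatever was used when the searcher was first created:

# Load an existing FAISS index and its SQLite store (placeholder paths)
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
    index_path="my_index.faiss",
    db_path="my_db.sqlite"
)

# Load an existing ChromaDB collection (placeholder path and name)
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="chromadb",
    db_path="my_chromadb",
    collection_name="my_collection"
)

# Reconnect to an existing Pinecone index (placeholder name)
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="pinecone",
    index_name="my-index",
    api_key="your-api-key-here"
)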

langformers.tasks.create_searcher(database: str, **kwargs)

Factory method for creating a searcher and performing vector search.

Parameters:
  • database (str, required) – Type of vector database (e.g., “faiss”). Supported databases: faiss, chromadb, and pinecone.

  • **kwargs (dict, required) – Database-specific keyword arguments.

Returns:

An instance of the appropriate searcher class, based on the selected database.

  • If database is “faiss”, an instance of FaissSearcher is returned.

  • If database is “chromadb”, an instance of ChromaDBSearcher is returned.

  • If database is “pinecone”, an instance of PineconeSearcher is returned.

kwargs for FAISS database:

class langformers.searchers.FaissSearcher(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)

Bases: object

A FAISS-based semantic search engine for vectorized text retrieval.

This class allows efficient nearest neighbor search using FAISS and stores text data in an SQLite database. It supports different FAISS index types (FLAT and HNSW) and provides methods for adding, querying, and managing the search index.

__init__(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)

Initializes the FAISS searcher.

Parameters:
  • embedder (str, default=None) – Name of the Hugging Face embedding model.

  • index_path (Optional[str], default=None) – Path to the FAISS index file. If None, a new index is created.

  • db_path (Optional[str], default=None) – Path to the SQLite database file. If None, a new database is created.

  • index_type (str, default="HNSW") – Type of FAISS index. Supported types are FLAT and HNSW.

  • index_parameters (Optional[dict], default=None) –

    Additional parameters for the FAISS index. If not provided, the following defaults are used.

    index_parameters = { "m": 32, "efConstruction": 40 }
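
For example, a brief sketch (model name and parameter values are illustrative) that overrides the HNSW defaults, or switches to exact search with a FLAT index:

# HNSW index with custom construction parameters (illustrative values)
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
    index_type="HNSW",
    index_parameters={"m": 64, "efConstruction": 80}
)

# Exact (brute-force) nearest neighbor search with a FLAT index
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
    index_type="FLAT"
)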
    

kwargs for ChromaDB database:

class langformers.searchers.ChromaDBSearcher(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')

Bases: object

A ChromaDB-based semantic search engine for vectorized text retrieval.

This class integrates with ChromaDB to store and search for text embeddings, using a Hugging Face embedding model for vectorization.

__init__(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')

Initializes the ChromaDB searcher.

Parameters:
  • embedder (str) – Name of the Hugging Face embedding model.

  • db_path (Optional[str], default=None) – Path to the ChromaDB database. If None, a default name is generated.

  • collection_name (Optional[str], default="my_collection") – Name of the ChromaDB collection.
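
For instance, a minimal sketch that persists the collection under a custom path and name (both placeholders):

searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="chromadb",
    db_path="chroma_store",
    collection_name="articles"
)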

kwargs for Pinecone database:

class langformers.searchers.PineconeSearcher(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)

Bases: object

A Pinecone-based semantic search engine for vectorized text retrieval.

This class integrates with Pinecone to store and search for text embeddings, using a Hugging Face embedding model for vectorization.

__init__(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)

Initializes the Pinecone searcher.

Parameters:
  • embedder (str) – Name of the Hugging Face embedding model.

  • index_name (Optional[str], default=None) – Name of the Pinecone index.

  • api_key (str) – Pinecone API key for authentication.

  • index_parameters (Optional[dict], default=None) –

    Additional parameters for index configuration. If not provided, the following defaults are used.

    index_parameters = {'cloud': 'aws',
                        'region': 'us-east-1',
                        'metric': 'cosine',
                        'dimension': self.dimension  # embedding model's output dimension.
                        }
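
For example, a hypothetical sketch that overrides the defaults; the index name and region are placeholders, and dimension is set explicitly here to match the embedding model, although the library normally infers it:

searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="pinecone",
    index_name="my-index",              # placeholder
    api_key="your-api-key-here",
    index_parameters={
        "cloud": "aws",
        "region": "eu-west-1",          # illustrative region
        "metric": "cosine",
        "dimension": 384                # all-MiniLM-L12-v2 outputs 384-dim embeddings
    }
)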
    

add() takes the following parameters:

  • texts (list[str], required): List of text entries to be indexed.

  • metadata (Optional[List[Dict[str, Any]]], default=None): Metadata associated with each text.

query() takes the following parameters:

  • query (str, required): The input text query.

  • items (int, default=1): Number of nearest neighbors to retrieve.

  • include_metadata (bool, default=True): Whether to include the metadata in the results.

count() does not take any parameters. Simply run <searcher object>.count().
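
Putting these together, a short sketch continuing from the searcher created earlier:

# Number of items currently in the index
print(searcher.count())

# Retrieve the top 3 matches, without metadata
results = searcher.query(query="machine learning", items=3, include_metadata=False)
print(results)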
