Semantic Search
Langformers can help you quickly set up a semantic search engine for vectorized text retrieval. All you need to do is specify an embedding model, the type of database (FAISS, ChromaDB, or Pinecone), and an index type (if required).
Here’s a sample code snippet to get you started:
# Import langformers
from langformers import tasks
# Initialize a searcher
searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="faiss", index_type="HNSW")
# For other vector databases:
# ChromaDB:
# searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="chromadb")
# Pinecone:
# searcher = tasks.create_searcher(embedder="sentence-transformers/all-MiniLM-L12-v2", database="pinecone", api_key="your-api-key-here")
# Sentences to add to the vector database
sentences = [
"He is learning Python programming.",
"The coffee shop opens at 8 AM.",
"She bought a new laptop yesterday.",
"He loves to play basketball with friends.",
"Artificial Intelligence is evolving rapidly.",
"He studies CS at the University of Melbourne."
]
# Metadata for the respective sentences
metadata = [
{"action": "learning", "category": "education"},
{"action": "opens", "category": "business"},
{"action": "bought", "category": "shopping"},
{"action": "loves", "category": "sports"},
{"action": "evolving", "category": "technology"},
{"action": "studies", "category": "education"}
]
# Add the sentences
searcher.add(texts=sentences, metadata=metadata)
# Define a search query
query_sentence = "computer science"
# Query the vector database
results = searcher.query(query=query_sentence, items=2, include_metadata=True)
print(results)
Loading an Existing Database
Once a searcher is initialized, the specified index/database is persisted on disk (for FAISS and ChromaDB) or in the cloud (for Pinecone). To load an existing database, initialize a searcher with the following parameters:
- For FAISS: index_path and db_path.
- For ChromaDB: db_path and collection_name.
- For Pinecone: index_name.
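For example, a FAISS database created by the snippet above could be reloaded as follows. This is a minimal sketch; the file names are assumptions, so use the paths your first run actually created.
from langformers import tasks

# Reload a persisted FAISS index and its SQLite text store.
# "my_index.faiss" and "my_database.db" are hypothetical file names.
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
    index_path="my_index.faiss",
    db_path="my_database.db",
)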
- langformers.tasks.create_searcher(database: str, **kwargs)
Factory method for creating a searcher and performing vector search.
- Parameters:
database (str, required) – Type of vector database (e.g., “faiss”). Supported databases: faiss, chromadb, and pinecone.
**kwargs (dict, required) – Database-specific keyword arguments.
- Returns:
An instance of the appropriate searcher class, based on the selected database.
If database is “faiss”, an instance of FaissSearcher is returned.
If database is “chromadb”, an instance of ChromaDBSearcher is returned.
If database is “pinecone”, an instance of PineconeSearcher is returned.
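A quick sketch of the factory in action; the printed class name follows from the mapping above:
from langformers import tasks

# With database="faiss", the factory returns a FaissSearcher instance.
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
)
print(type(searcher).__name__)  # expected: FaissSearcher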
kwargs for FAISS database:
- class langformers.searchers.FaissSearcher(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)
Bases: object
A FAISS-based semantic search engine for vectorized text retrieval.
This class allows efficient nearest neighbor search using FAISS and stores text data in an SQLite database. It supports different FAISS index types (FLAT and HNSW) and provides methods for adding, querying, and managing the search index.
- __init__(embedder: str = None, index_path: str | None = None, db_path: str | None = None, index_type: str | None = 'HNSW', index_parameters: dict | None = None)
Initializes the FAISS searcher.
- Parameters:
embedder (str, default=None) – Name of the Hugging Face embedding model.
index_path (Optional[str], default=None) – Path to the FAISS index file. If None, a new index is created.
db_path (Optional[str], default=None) – Path to the SQLite database file. If None, a new database is created.
index_type (str, default="HNSW") – Type of FAISS index. Supported types are FLAT and HNSW.
index_parameters (Optional[dict], default=None) – Additional parameters for the FAISS index. If not provided, the following is used:
index_parameters = {"m": 32, "efConstruction": 40}
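For instance, a denser HNSW graph could be requested like this. The parameter values below are illustrative, not recommendations:
from langformers import tasks

# FAISS/HNSW searcher with custom graph parameters (illustrative values).
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="faiss",
    index_type="HNSW",
    index_parameters={"m": 64, "efConstruction": 80},
)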
kwargs for ChromaDB database:
- class langformers.searchers.ChromaDBSearcher(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')
Bases: object
A ChromaDB-based semantic search engine for vectorized text retrieval.
This class integrates with ChromaDB to store and search for text embeddings, using a Hugging Face embedding model for vectorization.
- __init__(embedder: str = None, db_path: str | None = None, collection_name: str | None = 'my_collection')
Initializes the ChromaDB searcher.
- Parameters:
embedder (str) – Name of the Hugging Face embedding model.
db_path (Optional[str], default=None) – Path to the ChromaDB database. If None, a default name is generated.
collection_name (Optional[str], default="my_collection") – Name of the ChromaDB collection.
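A minimal sketch of reopening an existing ChromaDB collection; the path and collection name are assumptions, so use the values from your first run:
from langformers import tasks

# Reopen a persisted ChromaDB collection ("./my_chroma_db" is a hypothetical path).
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="chromadb",
    db_path="./my_chroma_db",
    collection_name="my_collection",
)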
kwargs for Pinecone database:
- class langformers.searchers.PineconeSearcher(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)
Bases: object
A Pinecone-based semantic search engine for vectorized text retrieval.
This class integrates with Pinecone to store and search for text embeddings, using a Hugging Face embedding model for vectorization.
- __init__(embedder: str = None, index_name: str | None = None, api_key: str = None, index_parameters: dict | None = None)
Initializes the Pinecone searcher.
- Parameters:
embedder (str) – Name of the Hugging Face embedding model.
index_name (Optional[str], default=None) – Name of the Pinecone index.
api_key (str) – Pinecone API key for authentication.
index_parameters (Optional[dict], default=None) – Additional parameters for index configuration. If not provided, the following is used:
index_parameters = {
    "cloud": "aws",
    "region": "us-east-1",
    "metric": "cosine",
    "dimension": self.dimension,  # embedding model's output dimension
}
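A sketch of a Pinecone-backed searcher with explicit index settings; the index name is an assumption, and you would supply your own API key:
from langformers import tasks

# Pinecone searcher ("semantic-search-index" is a hypothetical index name).
searcher = tasks.create_searcher(
    embedder="sentence-transformers/all-MiniLM-L12-v2",
    database="pinecone",
    api_key="your-api-key-here",
    index_name="semantic-search-index",
    index_parameters={"cloud": "aws", "region": "us-east-1", "metric": "cosine"},
)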
add() takes the following parameters:
- texts (list[str], required): List of text entries to be indexed.
- metadata (Optional[List[Dict[str, Any]]], default=None): Metadata associated with each text.
query() takes the following parameters:
- query (str, required): The input text query.
- items (int, default=1): Number of nearest neighbors to retrieve.
- include_metadata (bool, default=True): Whether to include the metadata in the results.
count() does not take any parameters. Simply run <searcher object>.count().
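For example, after the add() call earlier, and assuming count() returns the number of stored items:
# Number of items currently stored (expected: 6 for the sentences added above).
print(searcher.count())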