Seamless Chat with LLMs

Chatting with generative Large Language Models (LLMs) is easy with Langformers.

Note

Chatting is supported for Ollama and Hugging Face[1] models. In the case of Ollama, please ensure that it is installed on your OS and that the desired model is pulled before starting a conversation.

Running the code

The code below must be run as a Python script or executed in a terminal using python3. It will not work inside notebooks.

Here’s a sample code snippet to get you started:

# Import langformers
from langformers import tasks

# Create a generator
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")

# Run the generator
generator.run(host="0.0.0.0", port=8000)

Open your browser at http://0.0.0.0:8000 (or the specific host and port you provided) to chat with the LLM.

langformers.tasks.create_generator(provider: str, model_name: str, memory: bool = True, dependency: Callable[..., Any] | None = None, device: str | None = None)

Factory method for creating and managing a generator for chatting with LLMs (via a user interface) and for LLM inference (via a REST API).

Parameters:
  • provider (str, required) – The model provider (e.g., “ollama”). Currently supported providers: ollama, huggingface.

  • model_name (str, required) – The model name from the provider’s hub (e.g., “llama3.1:8b”).

  • memory (bool, default=True) – Whether to save previous chat interactions to maintain context. Maintaining memory clearly makes sense when chatting with an LLM via the user interface, which is why this option defaults to True. For LLM inference via the REST API, however, there are use cases where maintaining context is not useful; this option lets you turn it off.

  • dependency (Optional[Callable[..., Any]], default=None) – A FastAPI dependency. The callable can return any value, which will be injected into the /api/generate route. If no dependency is provided, the endpoint performs no authentication.

  • device (str, default=None) – The device to load the model and data on (“cuda”, “mps”, or “cpu”). If not provided, the device is inferred automatically. Currently used for Hugging Face models, as the input ids and attention mask must be on the same device as the model.

Returns:

An instance of the appropriate generator class, based on the selected provider.

  • If provider is “huggingface”, an instance of HuggingFaceGenerator is returned.

  • If provider is “ollama”, an instance of OllamaGenerator is returned.
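The memory option above controls whether earlier turns are replayed to the model on each request. As an illustrative sketch (this is not Langformers' internal implementation), maintaining context conceptually amounts to accumulating a message history:

```python
# Conceptual sketch of chat memory (not Langformers' internals):
# when memory is on, every turn is appended to a history that is
# replayed to the model alongside the new message.

class ChatMemory:
    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.history = []  # list of {"role": ..., "content": ...} dicts

    def add(self, role: str, content: str):
        if self.enabled:  # with memory off, nothing is retained
            self.history.append({"role": role, "content": content})

    def build_messages(self, user_message: str):
        # The model sees the whole conversation only if memory is enabled.
        return self.history + [{"role": "user", "content": user_message}]

memory = ChatMemory(enabled=True)
memory.add("user", "Hi!")
memory.add("assistant", "Hello! How can I help?")
messages = memory.build_messages("Tell me a joke.")  # 3 messages with memory on
```

With memory disabled, only the current message would reach the model, which is the behavior you may want for stateless REST inference.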

Authentication

Opening the chat user interface does not require authorization. However, since the interface interacts with the LLM via the /api/generate endpoint, you may want to protect that endpoint from unauthorized access. Securing it is straightforward: pass a callable to the dependency parameter when creating the generator.

from langformers import tasks

async def auth_dependency():
    """Authorization dependency for request validation.

    - Implement your own logic here (e.g., API key check, authentication).
    - If the function returns a value, access is granted.
    - Raising an HTTPException will block access.
    - Modify this logic as needed.
    """

generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=auth_dependency)

Example: Using API Key Authentication

You can implement a simple authentication dependency like this:

# Imports
from langformers import tasks
from fastapi import Request, HTTPException

# Define a set of valid API keys
API_KEYS = {"12345", "67890"}

async def auth_dependency(request: Request):
    """
    Extracts the Bearer token and verifies it against a list of valid API keys.
    """
    auth_header = request.headers.get("Authorization")

    if not auth_header or not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid authorization header format.")

    token = auth_header.split("Bearer ")[1]
    if token not in API_KEYS:
        raise HTTPException(status_code=401, detail="Unauthorized.")

    return True  # Allow access

# Create a generator with authentication
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=auth_dependency)

# Run the generator
generator.run(host="0.0.0.0", port=8000)

Now, a valid authorization token (one of the API keys in this case) should be entered into the “Authorization Token” text box (located in the left sidebar) of the chatting interface to interact with the LLM.

Warning

Langformers uses the Authorization: Bearer <token> header for the chat interface. For LLM inference, you can implement your own header format and authentication logic.
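For instance, a client calling the protected /api/generate endpoint would attach that header. The snippet below builds such a request with Python's standard library; the JSON body shape here is an assumption for illustration only (consult the Langformers documentation for the exact request schema), and the request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical client request to the inference endpoint.
# The body {"prompt": ...} is an assumed schema, shown only to
# illustrate the Authorization: Bearer <token> header format.
request = urllib.request.Request(
    "http://0.0.0.0:8000/api/generate",
    data=json.dumps({"prompt": "Hello!"}).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer 12345",  # one of the valid API keys above
    },
    method="POST",
)
# with urllib.request.urlopen(request) as resp:  # uncomment with the server running
#     print(resp.read().decode())
```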

For industry-standard authentication in FastAPI, you can use OAuth2 with JWT (JSON Web Token), which is widely adopted for securing APIs.

Footnotes