LLM Inference
If you want to perform LLM inference through a REST API, you can send a POST request to /api/generate. This is great when you're building your own application.
Note
LLM inference is supported for Ollama and Hugging Face[1] models. In the case of Ollama, please ensure that it is installed on your OS and that the desired model is pulled before starting a conversation.
Install Ollama: https://ollama.com/download
Pull a model: for example, run ollama pull llama3.1:8b in your terminal.
Running the code
The code below must be run as a Python script or executed in a terminal using python3. It will not work inside notebooks.
# Import langformers
from langformers import tasks
# Create a generator
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")
# Run the generator
generator.run(host="0.0.0.0", port=8000)
- langformers.tasks.create_generator(provider: str, model_name: str, memory: bool = True, dependency: Callable[..., Any] | None = None, device: str = None)
Factory method for creating and managing LLM chatting (user interface) and LLM inference (REST API).
- Parameters:
provider (str, required) – The model provider (e.g., "ollama"). Currently supported providers: ollama, huggingface.
model_name (str, required) – The model name from the provider's hub (e.g., "llama3.1:8b").
memory (bool, default=True) – Whether to save previous chat interactions to maintain context. When chatting with an LLM via a user interface, maintaining memory clearly makes sense, which is why this option defaults to True. However, when doing LLM inference via the REST API, there may be use cases where maintaining context is not useful; this option exists for those cases.
dependency (Optional[Callable[..., Any]], default=<no auth>) – A FastAPI dependency. The callable can return any value, which will be injected into the route /api/generate.
device (str, default=None) – The device to load the model and data on ("cuda", "mps" or "cpu"). If not provided, the device will be inferred automatically. Currently used for Hugging Face models, as the input ids and attention mask need to be on the same device as the model.
- Returns:
An instance of the appropriate generator class, based on the selected provider.
If provider is “huggingface”, an instance of HuggingFaceGenerator is returned.
If provider is “ollama”, an instance of OllamaGenerator is returned.
run() takes the following parameters:
host (str, default="0.0.0.0"): The IP address to bind the server to.
port (int, default=8000): The port number to listen on.
An example combining these options is sketched below.
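For instance, the parameters above can be combined to serve a Hugging Face model without conversation memory. This is a minimal sketch: the model name and device choice are illustrative assumptions, not Langformers defaults.
# Import langformers
from langformers import tasks

# Create a generator backed by a Hugging Face model
# (the model name below is an illustrative assumption; use any Hub model you can load)
generator = tasks.create_generator(
    provider="huggingface",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    memory=False,    # stateless inference: previous messages are not retained
    device="cpu",    # or "cuda" / "mps"; inferred automatically if omitted
)

# Run the generator on the same REST endpoint as before
generator.run(host="0.0.0.0", port=8000)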
API Request Format
The /api/generate endpoint accepts the following:
{
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi"
}
Note: The endpoint expects at least prompt. The keys system_prompt, memory_k, temperature, top_p, and max_length are optional.
Here’s an example of making a request to the API using Python:
# Imports
import requests
import json
# Endpoint URL
url = "http://0.0.0.0:8000/api/generate"
# Define payload
payload = json.dumps({
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi"
})
# Headers
headers = {
    "Content-Type": "application/json",
}
# Send request
response = requests.post(url, headers=headers, data=payload)
# Print response
print(response.text)
This streams the tokens generated by the LLM as Server-Sent Events (SSE), e.g., data: {"chunk": "Hello"} …. You need to parse these SSE streams; Langformers can handle this natively.
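If you would rather parse the stream yourself before reaching for Langformers' helper, a minimal sketch is shown below. It assumes each event follows the data: {"chunk": "..."} format illustrated above.
# Imports
import json
import requests

url = "http://0.0.0.0:8000/api/generate"
headers = {"Content-Type": "application/json"}
payload = {"prompt": "Hi"}

# Stream the response instead of buffering it
with requests.post(url, headers=headers, json=payload, stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        # SSE events arrive as lines like: data: {"chunk": "Hello"}
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            print(event.get("chunk", ""), end="", flush=True)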
StreamProcessor
Here’s how you parse the SSE streams with StreamProcessor.
# Import StreamProcessor
from langformers.generators import StreamProcessor
# Define headers
headers = {
    "Content-Type": "application/json",
}
# Create an object
client = StreamProcessor(headers=headers)
# Define payload
payload = {
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi, how are you today",
}
# Send request
response = client.process(endpoint_url="http://0.0.0.0:8000/api/generate", payload=payload)
# Print response
for chunk in response:
    print(chunk, end="", flush=True)
The /api/generate endpoint takes the following parameters:
system_prompt (str, default=<Langformers.commons.prompts default_chat_prompt_system>): The system-level instruction for the LLM.
memory_k (int, default=10): The number of previous messages to retain in memory.
temperature (float, default=0.5): Controls randomness of responses (higher = more random).
top_p (float, default=1): Nucleus sampling parameter (lower = more focused).
max_length (int, default=5000): Maximum number of tokens to generate.
prompt (str, required): The user query.
System prompt and Memory
Note that any change in system_prompt clears the previous conversations stored in memory.
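To make this concrete, here is a small sketch reusing StreamProcessor: the second request changes system_prompt, so, per the note above, the exchange from the first request is no longer remembered. The prompts are illustrative.
# Import StreamProcessor
from langformers.generators import StreamProcessor

client = StreamProcessor(headers={"Content-Type": "application/json"})

# First request: establishes some context in memory
first = client.process(
    endpoint_url="http://0.0.0.0:8000/api/generate",
    payload={"system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
             "prompt": "My name is Sam."},
)
for chunk in first:
    print(chunk, end="", flush=True)

# Second request: a different system_prompt clears the stored conversation,
# so the model will not recall the name given earlier
second = client.process(
    endpoint_url="http://0.0.0.0:8000/api/generate",
    payload={"system_prompt": "You are a formal assistant, reply formally.",
             "prompt": "What is my name?"},
)
for chunk in second:
    print(chunk, end="", flush=True)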
- langformers.generators.StreamProcessor.__init__(self, headers)
Initializes the StreamProcessor class.
- Parameters:
headers (dict) –
A dictionary of headers. Below is an example. Headers may also contain API keys, Bearer tokens, etc.
headers = { "Content-Type": "application/json" }
- langformers.generators.StreamProcessor.process(self, endpoint_url: str, payload: dict, key_name: str = 'chunk', stream: bool = True, encoding: str = 'utf-8')
Processes an API response that sends Server-Sent Events (SSE).
- Parameters:
endpoint_url (str) – The API endpoint (e.g., http://0.0.0.0:8000/api/generate).
payload (dict) – The request payload.
key_name (str, default="chunk") – The custom key name used in the SSE streams.
stream (bool, default=True) – Whether to stream outputs.
encoding (str, default="utf-8") – The character encoding to use for JSON encoding.
Authentication
Securing the /api/generate endpoint is straightforward. You can pass a dependency function to dependency when creating the generator.
async def auth_dependency():
    """Authorization dependency for request validation.

    - Implement your own logic here (e.g., API key check, authentication).
    - If the function returns a value, access is granted.
    - Raising an HTTPException will block access.
    """
    return True  # Modify this logic as needed

generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=auth_dependency)
Example: Using API Key Authentication
You can implement a simple authentication dependency like this:
# Imports
from langformers import tasks
from fastapi import Request, HTTPException
# Define a set of valid API keys
API_KEYS = {"12345", "67890"}
async def auth_dependency(request: Request):
    """
    Extracts the Bearer token and verifies it against a list of valid API keys.
    """
    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid authorization header format.")

    token = auth_header.split("Bearer ")[1]
    if token not in API_KEYS:
        raise HTTPException(status_code=401, detail="Unauthorized.")

    return True  # Allow access
# Create a generator with authentication
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=auth_dependency)
# Run the generator
generator.run(host="0.0.0.0", port=8000)
With this setup, only requests that include a valid API key in the headers will be authorized. All you need to do is include an Authorization: Bearer <token> header with one of the API keys as the token and make a POST request.
headers = {
    'Authorization': 'Bearer 12345',
    'Content-Type': 'application/json'
}
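Putting it together, a full authorized request could look like the following sketch, which reuses the payload format documented earlier and one of the example API keys defined above.
# Imports
import requests
import json

# Endpoint URL
url = "http://0.0.0.0:8000/api/generate"

# Headers with one of the example API keys as the Bearer token
headers = {
    'Authorization': 'Bearer 12345',
    'Content-Type': 'application/json'
}

# Define a minimal payload (only prompt is required)
payload = json.dumps({"prompt": "Hi"})

# Send the authenticated request
response = requests.post(url, headers=headers, data=payload)
print(response.text)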
Warning
For industry-standard authentication in FastAPI, you can use OAuth2 with JWT (JSON Web Token), which is widely adopted for securing APIs.
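As an illustration (not part of Langformers itself), a JWT-verifying dependency can be plugged in the same way. The sketch below assumes tokens signed with a shared secret using HS256 and the PyJWT package; the secret, algorithm, and function name are placeholders.
# Imports
import jwt  # PyJWT
from fastapi import Request, HTTPException
from langformers import tasks

SECRET_KEY = "change-me"   # placeholder shared secret
ALGORITHM = "HS256"

async def jwt_dependency(request: Request):
    """Verifies a JWT passed as a Bearer token."""
    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid authorization header format.")

    token = auth_header.split("Bearer ")[1]
    try:
        claims = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token.")

    return claims  # returning a value grants access

# Plug the dependency in exactly as before
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=jwt_dependency)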
Footnotes