Train Text Classifiers

Training text classifiers with Langformers is straightforward.

First, we define the training configuration, prepare the dataset, and select the MLM we would like to fine-tune for the classification task. All of this can be done in a few lines of code, yet remains fully customizable.

Here's some sample code to get started.

# Import langformers
from langformers import tasks

# Define training configuration
training_config = {
    "max_length": 80,
    "num_train_epochs": 1,
    "report_to": ['tensorboard'],
    "logging_steps": 20,
    "save_steps": 20,
    "early_stopping_patience": 5,
    # ...
}

# Initialize the model
model = tasks.create_classifier(
    model_name="roberta-base",          # model from Hugging Face or a local path
    csv_path="/path/to/dataset.csv",    # csv dataset
    text_column="text",                 # text column name
    label_column="label",               # label/class column name
    training_config=training_config
)

# Start fine-tuning
model.train()

This fine-tunes the selected MLM (e.g., “roberta-base”), with the classification head sized automatically from the number of classes identified in the training dataset.

At the end of training, the best model is saved along with its configurations.
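The CSV expected at csv_path just needs a text column and a label column (matching text_column and label_column). A minimal sketch of building such a file with Python's standard csv module (the filename and example rows are hypothetical):

```python
import csv

# Hypothetical example rows; a real training set would be much larger
rows = [
    ("I loved this movie.", "positive"),
    ("Worst ever.", "negative"),
    ("It was okay.", "neutral"),
]

with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])  # column names match the defaults
    writer.writerows(rows)
```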

Labels/classes datatype

train() assumes that the labels/classes in your training dataset are formatted as strings (e.g., “positive”, “neutral”, “negative”) rather than numeric values (e.g., “0”, “1”, “2”). Using human-readable labels (instead of encoded numbers) makes the classifier more intuitive to use during inference, reducing potential confusion.
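If your dataset ships with numerically encoded labels, you can decode them to strings before training. A small sketch of the idea, with a hypothetical id-to-label mapping:

```python
# Hypothetical mapping from encoded labels to human-readable strings
id2label = {0: "negative", 1: "neutral", 2: "positive"}

encoded_rows = [("I loved this movie.", 2), ("Worst ever.", 0)]

# Decode each label before writing the CSV passed as csv_path
decoded_rows = [(text, id2label[label]) for text, label in encoded_rows]
```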

langformers.tasks.create_classifier(model_name: str, csv_path: str, text_column: str = 'text', label_column: str = 'label', training_config: Dict | None = None)

Factory method for training text classifiers. Only Hugging Face models are supported. Used for fine-tuning models such as BERT, RoBERTa, and MPNet on a text classification dataset.

Parameters:
  • model_name (str, required) – Name or path of the pretrained transformer model (e.g., “bert-base-uncased”).

  • csv_path (str, required) – Path to the CSV file containing training data.

  • text_column (str, default="text") – Column name in the CSV file containing the input text.

  • label_column (str, default="label") – Column name in the CSV file containing labels.

  • training_config (Optional[Dict], default=None) – A dictionary containing training parameters. If not provided, default values are assigned from langformers.classifiers.huggingface_classifier.TrainingConfig.

Returns:

An instance of HuggingFaceClassifier.

Using the Classifier

You can load the trained classifier with load_classifier().

# Import langformers
from langformers import tasks

# Load the trained classifier
classifier = tasks.load_classifier("/path/to/classifier")

# Classify texts
classifier.classify(["I don't like this movie. Worst ever.", "I loved this movie."])

langformers.tasks.load_classifier(model_name: str)

Factory method for loading a custom classifier from disk.

Parameters:
  • model_name (str, required) – Path to the classifier.

Returns:

An instance of LoadClassifier.