Train Text Classifiers
Training text classifiers with Langformers is quite straightforward.
First, we define the training configuration, prepare the dataset, and select the MLM we would like to fine-tune for the classification task. All of this can be done in a few lines of code, yet remains fully customizable.
Here’s some sample code to get started.
# Import langformers
from langformers import tasks
# Define training configuration
training_config = {
"max_length": 80,
"num_train_epochs": 1,
"report_to": ['tensorboard'],
"logging_steps": 20,
"save_steps": 20,
"early_stopping_patience": 5,
# ...
}
# Initialize the model
model = tasks.create_classifier(
model_name="roberta-base", # model from Hugging Face or a local path
csv_path="/path/to/dataset.csv", # csv dataset
text_column="text", # text column name
label_column="label", # label/class column name
training_config=training_config
)
# Start fine-tuning
model.train()
This fine-tunes the selected MLM (e.g., “roberta-base”) automatically based on the number of classes identified in the training dataset.
At the end of training, the best model is saved along with its configurations.
Labels/classes datatype
train() assumes that the labels/classes in your training dataset are formatted as strings (e.g., “positive”, “neutral”, “negative”) rather than numeric values (e.g., “0”, “1”, “2”).
Using human-readable labels (instead of encoded numbers) makes the classifier more intuitive to use during inference, reducing potential confusion.
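If your dataset currently uses numeric labels, a quick preprocessing step can convert them before training. The sketch below uses pandas; the column names and the specific label mapping are assumptions you should adapt to your data.
# A minimal sketch, assuming a CSV with a numeric "label" column
import pandas as pd

df = pd.read_csv("/path/to/dataset.csv")

# Hypothetical mapping from numeric labels to human-readable strings
label_map = {0: "negative", 1: "neutral", 2: "positive"}
df["label"] = df["label"].map(label_map)

df.to_csv("/path/to/dataset_with_string_labels.csv", index=False)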
- langformers.tasks.create_classifier(model_name: str, csv_path: str, text_column: str = 'text', label_column: str = 'label', training_config: Dict | None = None)
Factory method for training text classifiers. Only Hugging Face models are supported. Used for fine-tuning models such as BERT, RoBERTa, and MPNet on a text classification dataset.
- Parameters:
model_name (str, required) – Name or path of the pretrained transformer model (e.g., “bert-base-uncased”).
csv_path (str, required) – Path to the CSV file containing training data.
text_column (str, default="text") – Column name in the CSV file containing the input text.
label_column (str, default="label") – Column name in the CSV file containing labels.
training_config (Dict, optional) – A dictionary containing training parameters. If not provided, default values will be assigned from langformers.classifiers.huggingface_classifier.TrainingConfig.
- Returns:
An instance of HuggingFaceClassifier.
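Since training_config is optional, a call can rely entirely on the TrainingConfig defaults documented below. Here is a minimal sketch; the model name and paths are placeholders.
from langformers import tasks

# Omitting training_config falls back to the TrainingConfig defaults
model = tasks.create_classifier(
    model_name="bert-base-uncased",
    csv_path="/path/to/dataset.csv",
    text_column="text",
    label_column="label"
)
model.train()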
- class langformers.classifiers.huggingface_classifier.TrainingConfig(num_train_epochs: int = 10, save_total_limit: int = 1, learning_rate: float = 2e-05, per_device_train_batch_size: int = 16, per_device_eval_batch_size: int = 16, save_strategy: str = 'steps', save_steps: int = 100, eval_strategy: str = 'steps', logging_steps: int = 100, report_to: list = <factory>, logging_dir: str = None, output_dir: str = None, run_name: str = <timestamp-based string>, test_size: float = 0.2, val_size: float = 0.1, metric_for_best_model: str = 'f1_macro', early_stopping_patience: int = 5, load_best_model_at_end: bool = True, max_length: int = None)
Bases: object
Default configuration for fine-tuning a Hugging Face model (e.g., BERT, RoBERTa, MPNet) on a text classification dataset. In addition to these parameters, you can pass any parameter that transformers.TrainingArguments accepts (refer to https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments); see the example after the parameter list below.
- Parameters:
num_train_epochs (int, default=10) – Number of epochs to train the model.
save_total_limit (int, default=1) – Limits the number of saved checkpoints to save disk space.
learning_rate (float, default=2e-5) – Learning rate. Controls the step size during optimization.
per_device_train_batch_size (int, default=16) – Batch size during training (per device).
per_device_eval_batch_size (int, default=16) – Batch size during evaluation (per device).
save_strategy (str, default="steps") – When to save checkpoints during training.
save_steps (int, default=100) – Number of steps between model checkpoint saves.
eval_strategy (str, default="steps") – When to run evaluation during training.
logging_strategy (str, default="steps") – When to log training metrics (“steps”, “epoch”, etc.).
logging_steps (int, default=100) – Number of update steps between logging.
report_to (list, default=["none"]) – List of integrations to report to (e.g., “tensorboard”, “wandb”).
logging_dir (str, default=None) – Directory to save the training logs. If not provided, logging will be done in a timestamp-based directory inside logs/. (see run_name)
output_dir (str, default=None) – Directory to save model checkpoints. If not provided, checkpoints will be saved in a timestamp-based directory. (see run_name)
run_name (str, default=<timestamp based string>) – Descriptor for the run. Typically used by logging tools such as “wandb” and “tensorboard”. Langformers automatically generates a run name based on the current timestamp.
test_size (float, default=0.2) – Proportion of data for test split.
val_size (float, default=0.1) – Proportion of data for validation split.
metric_for_best_model (str, default="f1_macro") – Metric to use for comparing models.
early_stopping_patience (int, default=5) – Number of evaluations to wait before early stopping.
early_stopping_threshold (float, default=0.0001) – Minimum improvement threshold for early stopping.
load_best_model_at_end (bool, default=True) – Whether to load best model at end of training.
max_length (int, default=None) – Maximum sequence length for tokenization. If not provided, automatically assigned to tokenizer’s model_max_length.
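For example, a more complete training_config might combine the parameters above with extra transformers.TrainingArguments options. The specific values and extra keys below are illustrative, not recommendations.
from langformers import tasks

training_config = {
    # TrainingConfig parameters documented above
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "metric_for_best_model": "f1_macro",
    "early_stopping_patience": 5,
    "max_length": 128,
    # Extra transformers.TrainingArguments parameters (illustrative)
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
}

model = tasks.create_classifier(
    model_name="roberta-base",
    csv_path="/path/to/dataset.csv",
    training_config=training_config
)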
Using the Classifier
You can load the trained classifier with load_classifier().
# Import langformers
from langformers import tasks
# Load the trained classifier
classifier = tasks.load_classifier("/path/to/classifier")
# Classify texts
classifier.classify(["I dont like this movie. Worst ever.", "I loved this movie."])
- langformers.tasks.load_classifier(model_name: str)
Factory method for loading a custom classifier from disk.
- Parameters:
model_name (str, required) – Path to the classifier.
- Returns:
An instance of LoadClassifier.
- langformers.classifiers.load_classifier.LoadClassifier.classify(self, texts: list[str]) → list[dict[str, Any]]
Classifies input texts into predefined categories.
- Parameters:
texts (list[str], required) – A list of text strings to classify.
Notes
- Tokenizes input texts and feeds them into the model.
- Uses softmax to obtain probability distributions.
- Returns the most probable class label for each input text (a rough standalone equivalent is sketched below).
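The snippet below is a minimal sketch of that flow using transformers and torch directly. It is not Langformers’ actual implementation; the classifier path and label names are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a fine-tuned classifier saved by Langformers
model_path = "/path/to/classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

texts = ["I dont like this movie. Worst ever.", "I loved this movie."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the class dimension, then pick the most probable class
probs = torch.softmax(logits, dim=-1)
predicted_ids = probs.argmax(dim=-1)
labels = [model.config.id2label[i.item()] for i in predicted_ids]
print(labels)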