Further Pretrain MLMs
Continuing the pretraining of a Masked Language Model (MLM) follows much the same approach as pretraining one from scratch. However, unlike standard pretraining, we don’t initialize the model with random weights. Instead, there are two possible scenarios:
You have a saved checkpoint of an MLM created by Langformers and want to continue training.
You want to take an existing MLM and further pretrain it on your custom dataset. Further pretraining an existing MLM in this way has been shown to produce strong results [Lamsal2024].
Model Pretrained Using Langformers
If you trained an MLM with Langformers, you should already have a tokenizer and a tokenized dataset ready; simply provide checkpoint_path instead of model_config to create_mlm().
# Import Langformers
from langformers import tasks

# Define training configuration
training_config = {
    "num_train_epochs": 2,             # Number of training epochs
    "save_total_limit": 1,             # Maximum number of checkpoints to save
    "learning_rate": 2e-4,             # Learning rate for optimization
    "per_device_train_batch_size": 4,  # Batch size during training (per device)
    # ...
}

# Initialize the training
model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    training_config=training_config,
    checkpoint_path="path/to/checkpoint"
)

# Start the training
model.train()
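Before moving on, a quick sanity check of the result can be helpful. The snippet below is a minimal sketch, not part of Langformers: it assumes the further-pretrained checkpoint and tokenizer directories (both paths are placeholders) and loads them with Hugging Face transformers for a fill-mask prediction.

# Optional sanity check (a sketch): load the saved checkpoint with Hugging Face
# transformers and run a fill-mask prediction. Both paths are placeholders.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/checkpoint",    # placeholder: your saved checkpoint directory
    tokenizer="/path/to/tokenizer"  # placeholder: the tokenizer used for training
)
print(fill_mask("The weather today is <mask>."))  # <mask> is the RoBERTa-style mask token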
Existing Model from Hugging Face
If you want to further pretrain an existing MLM such as RoBERTa, BERTweet, or CrisisTransformers from Hugging Face, you will already have a trained tokenizer compatible with that model, but you will need to create a tokenized dataset on which to further pretrain.
Warning
Langformers supports further pretraining only for models trained with the RoBERTa pretraining procedure, such as RoBERTa [Liu2019], BERTweet [Nguyen2020], or CrisisTransformers [Lamsal2024].
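If you are unsure whether a model on Hugging Face follows the RoBERTa architecture, one quick check (a sketch, independent of Langformers) is to inspect its configuration:

# Compatibility check (a sketch, independent of Langformers): models trained with
# the RoBERTa procedure report "roberta" as their model type.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("roberta-base")
print(config.model_type)  # prints "roberta"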
Here’s how you can create a tokenized dataset:
# Import Langformers
from langformers import tasks

# Tokenize the dataset with an existing tokenizer.
# This example uses the "roberta-base" tokenizer from Hugging Face.
dataset = tasks.create_tokenizer(data_path="data.txt", tokenizer="roberta-base")
dataset.train()
This saves the tokenized dataset inside "tokenized_dataset" in the working directory.
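If you want to inspect the tokenized dataset before training, and assuming it is stored in the Hugging Face datasets on-disk format (an assumption; check the directory contents), it can be loaded back as follows:

# Optional inspection (a sketch): assumes the saved dataset uses the
# Hugging Face datasets on-disk format.
from datasets import load_from_disk

tokenized = load_from_disk("tokenized_dataset")
print(tokenized)  # shows the number of examples and the column names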
Next, we start the training.
# Import Langformers
from langformers import tasks

# Define training configuration
training_config = {
    "num_train_epochs": 2,             # Number of training epochs
    "save_total_limit": 1,             # Maximum number of checkpoints to save
    "learning_rate": 2e-4,             # Learning rate for optimization
    "per_device_train_batch_size": 4,  # Batch size during training (per device)
    # ...
}

# Initialize the training
# This example further pretrains "roberta-base".
model = tasks.create_mlm(
    tokenizer="roberta-base",
    tokenized_dataset="/path/to/tokenized_dataset",
    training_config=training_config,
    checkpoint_path="roberta-base"
)

# Start the training
model.train()
- langformers.tasks.create_mlm(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path=None)
Factory method for training a masked language model based on RoBERTa pretraining procedure.
- Parameters:
tokenizer (str, required) – Path to the trained tokenizer, or the name of a pretrained tokenizer on Hugging Face (e.g., "roberta-base").
tokenized_dataset (str, required) – Path to the tokenized dataset.
model_config (Optional[Dict], default=None) – Dictionary containing model configurations. If None, the model must be loaded from a checkpoint.
training_config (Optional[Dict], default=None) – Dictionary containing training configurations. If None, default training configurations will be used.
checkpoint_path (Optional[str], default=None) – Path to a model checkpoint (or the name of an existing Hugging Face model, e.g., "roberta-base") to resume or further pretrain from.
- Returns:
An instance of HuggingFaceMLMCreator.
Warning
At least one of model_config or checkpoint_path must be provided. If model_config is specified, a new model is initialized using the given configurations. If checkpoint_path is provided, the model from the specified path is resumed for pretraining.
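For completeness, here is a minimal sketch of the model_config route, i.e., initializing a new model instead of resuming from a checkpoint. The keys shown mirror Hugging Face's RobertaConfig and are assumptions about what Langformers accepts; consult the pretraining documentation for the exact names.

# Sketch: initializing a brand-new MLM via model_config instead of checkpoint_path.
# The keys below mirror Hugging Face's RobertaConfig and are assumptions.
from langformers import tasks

model_config = {
    "vocab_size": 50265,             # must match the tokenizer's vocabulary size
    "max_position_embeddings": 514,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "intermediate_size": 3072,
}

model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    model_config=model_config       # training_config omitted: defaults below are used
)
model.train()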
- class langformers.mlms.mlm_trainer.TrainingConfig(num_train_epochs: int = 10, save_total_limit: int = 10, learning_rate: float = 0.0002, per_device_train_batch_size: int = 32, gradient_accumulation_steps: int = 2, save_strategy: str = 'steps', save_steps: int = 100, logging_steps: int = 100, report_to: list = <factory>, mlm_probability: float = 0.15, warmup_ratio: float = 0.05, logging_dir: str = None, output_dir: str = None, n_gpus: int = <factory>, run_name: str = 'langformers-mlm-d20250504-t052724')
Bases:
object
Default configuration for pretraining an MLM with the RoBERTa pretraining procedure. In addition to these parameters, you can pass any parameter accepted by transformers.TrainingArguments (see the sketch after the parameter list below). Refer to https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.
- Parameters:
num_train_epochs (int, default=10) – Number of epochs to train the model.
save_total_limit (int, default=10) – Limits the number of saved checkpoints to save disk space.
learning_rate (float, default=0.0002) – Learning rate. Controls the step size during optimization.
per_device_train_batch_size (int, default=32) – Batch size during training (per device).
gradient_accumulation_steps (int, default=2) – Simulates a larger batch size by accumulating gradients from multiple batches before performing a weight update.
save_strategy (str, default="steps") – When to save checkpoints during training.
save_steps (int, default=100) – Number of steps between model checkpoint saves.
logging_steps (int, default=100) – Number of update steps between logging.
report_to (list, default=["none"]) – List of integrations to report to (e.g., "tensorboard", "wandb").
mlm_probability (float, default=0.15) – Probability of masking tokens during masked language modeling (MLM).
warmup_ratio (float, default=0.05) – Fraction of total training steps used for learning rate warmup.
logging_dir (str, default=None) – Directory to save the training logs. If not provided, logging will be done in a timestamp-based directory inside logs/. (see run_name)
output_dir (str, default=None) – Directory to save model checkpoints. If not provided, checkpoints will be saved in a timestamp-based directory. (see run_name)
n_gpus (int, default=1) – Number of GPUs to train with. Automatically computed for CUDA. Used for computing the total number of training steps.
run_name (str, default=<timestamp based string>) – Descriptor for the run, typically used by logging tools such as "wandb" and "tensorboard". Langformers automatically generates a run name based on the current timestamp.
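Since any transformers.TrainingArguments parameter can be passed alongside the parameters above, a training_config might mix both. The sketch below is an illustration under that assumption; verify each extra key against the TrainingArguments documentation.

# Sketch: a training_config mixing the Langformers defaults documented above
# with standard transformers.TrainingArguments options.
training_config = {
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,
    "mlm_probability": 0.15,
    # Standard TrainingArguments options, passed through as described above:
    "weight_decay": 0.01,
    "fp16": True,                   # assumes a CUDA GPU with mixed-precision support
    "report_to": ["tensorboard"],
}

With report_to set to ["tensorboard"], the run can then be inspected with tensorboard --logdir logs/ (assuming the default logging_dir).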
Training loss is the main metric
Langformers does not evaluate checkpoints from MLM pretraining on a separate evaluation split, as it is generally unnecessary. In MLM pretraining, training loss is the primary metric since the goal is to learn rich representations rather than minimize validation loss. Real performance is ultimately determined by fine-tuning on downstream tasks.
References
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.