Pretrain MLMs
Masked Language Models (MLMs) like BERT [Devlin2019], RoBERTa [Liu2019], and MPNet [Song2020] are auto-encoding models that use only the encoder component of the Transformer [Vaswani2017] architecture. These models are ideal for tasks requiring the entire input sequence to make decisions, such as text classification and named-entity recognition.
Langformers adopts the RoBERTa architecture [Liu2019] as the default for both MLM pretraining and model mimicking tasks. This design choice enables a unified and streamlined training pipeline. RoBERTa is well-suited for both pretraining and downstream fine-tuning or transfer tasks. Additionally, the architecture is highly customizable and easy to understand, with configurable parameters such as the number of layers, hidden size, and attention heads.
Tip
Pretraining an MLM is mostly beneficial when no domain-specific MLM exists. For example, BERTweet [Nguyen2020] was created for tweets, and CrisisTransformers [Lamsal2024] were developed for crisis-related tweets.
You may also want to continue pretraining an existing MLM such as RoBERTa on your domain-specific dataset. Further pretraining of this kind has been shown to produce strong results [Lamsal2024]. Check out Langformers’ Further Pretrain MLMs task.
There are three steps to training an MLM from scratch. First, train a tokenizer on your dataset. Second, use the trained tokenizer to tokenize the dataset. Finally, train the MLM on the tokenized dataset.
Let’s get started!
Tokenizer and Tokenization
Before training, you need to create a tokenizer (if you don’t already have one) and tokenize your dataset. The tokenizer converts raw text into tokens that the model can process.
# Import langformers
from langformers import tasks

# Define configuration for the tokenizer
tokenizer_config = {
    "vocab_size": 50_265,
    "min_frequency": 2,
    "max_length": 512,
    # ...
}

# Train the tokenizer and tokenize the dataset
tokenizer = tasks.create_tokenizer(data_path="data.txt", tokenizer_config=tokenizer_config)
tokenizer.train()
The trained tokenizer will be saved inside “tokenizer” and the tokenized dataset inside “tokenized_dataset” in the working directory.
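If you want to sanity-check the result before moving on, you can load the saved tokenizer with Hugging Face’s AutoTokenizer. This is a minimal sketch that assumes the tokenizer under “./tokenizer” is stored in the standard Hugging Face format; the sample sentence is purely illustrative.

# Quick sanity check (assumes the tokenizer was saved in Hugging Face format
# at the default path "./tokenizer")
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("./tokenizer")
print(hf_tokenizer.tokenize("Langformers makes MLM pretraining simple."))
print(hf_tokenizer.vocab_size)  # should match the configured `vocab_size`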
- langformers.tasks.create_tokenizer(data_path: str, tokenizer_config: Dict | None = None, tokenizer: str = None)
Factory method for training a tokenizer on a custom dataset and tokenizing the dataset with the trained tokenizer.
- Parameters:
data_path (str, required) – Path to a raw text dataset (e.g., data.txt). Each line in the dataset should contain a single sentence or document. A line can also contain multiple sentences, but note that truncation will be applied.
tokenizer_config (dict, default=None) – Configurations for the tokenizer. If not provided, default values will be assigned from the dataclass langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig.
tokenizer (str, default=None) – Path to a trained tokenizer, such as “roberta-base” on Hugging Face, or a local path. If tokenizer is provided, tokenizer_config is ignored.
- Returns:
An instance of MLMDatasetCreator.
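If you already have a suitable tokenizer (for example “roberta-base”, or one you trained earlier), you can pass it via the tokenizer argument and skip tokenizer training; tokenizer_config is then ignored. The sketch below assumes that calling train() in this case only tokenizes the dataset with the provided tokenizer.

# Sketch: reuse an existing tokenizer instead of training a new one.
# `tokenizer_config` is ignored when `tokenizer` is given.
from langformers import tasks

dataset_creator = tasks.create_tokenizer(data_path="data.txt", tokenizer="roberta-base")
dataset_creator.train()  # assumed to only tokenize the dataset in this case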
- class langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig(vocab_size: int = 50265, min_frequency: int = 2, max_length: int = 512, special_tokens: list[str] = <factory>, path_to_save_tokenizer: str = './tokenizer')
Bases:
object
Configuration for the tokenizer.
- Parameters:
max_length (int, default=512) – Maximum sequence length for tokenization. For each line in the dataset, any sentence longer than this length will be truncated.
vocab_size (int, default=50_265) – Size of the vocabulary.
min_frequency (int, default=2) – Minimum frequency a token must have to be included in the vocabulary.
path_to_save_tokenizer (str, default="./tokenizer") – Path where the trained tokenizer will be saved.
special_tokens (List[str], default=["<s>","<pad>","</s>","<unk>","<mask>"]) – Special tokens.
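For reference, a tokenizer_config that spells out every field with its documented default looks like this; adjust the values to suit your dataset.

# All TokenizerConfig fields with their default values
tokenizer_config = {
    "vocab_size": 50_265,
    "min_frequency": 2,
    "max_length": 512,
    "special_tokens": ["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    "path_to_save_tokenizer": "./tokenizer",
}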
Initialize an MLM and Train
With the tokenizer and dataset ready, initialize the MLM model and start training.
# Define model architecture
model_config = {
    "vocab_size": 50_265,            # Size of the vocabulary (must match tokenizer's `vocab_size`)
    "max_position_embeddings": 514,  # Important: maximum sequence length (tokenizer's `max_length` + 2)
    "num_attention_heads": 12,       # Number of attention heads
    "num_hidden_layers": 12,         # Number of hidden layers
    "hidden_size": 768,              # Size of the hidden layers
    "intermediate_size": 3072,       # Size of the intermediate (feedforward) layer in the Transformer
    # ...
}
# Define training configuration
training_config = {
    "per_device_train_batch_size": 4,  # Batch size during training (per device)
    "num_train_epochs": 2,             # Number of training epochs
    "save_total_limit": 1,             # Maximum number of checkpoints to save
    "learning_rate": 2e-4,             # Learning rate for optimization
    # ...
}

# Initialize the training
model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    training_config=training_config,
    model_config=model_config
)

# Start the training
model.train()
- langformers.tasks.create_mlm(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path=None)
Factory method for training a masked language model based on the RoBERTa pretraining procedure.
- Parameters:
tokenizer (str, required) – Path to the trained tokenizer.
tokenized_dataset (str, required) – Path to the tokenized dataset.
model_config (Optional[Dict], default=None) – Dictionary containing model configurations. If None, the model must be loaded from a checkpoint.
training_config (Optional[Dict], default=None) – Dictionary containing training configurations. If None, default training configurations will be used.
checkpoint_path (Optional[str], default=None) – Path to a model checkpoint for resuming training.
- Returns:
An instance of HuggingFaceMLMCreator.
Warning
At least one of model_config or checkpoint_path must be provided. If model_config is specified, a new model is initialized using the given configurations. If checkpoint_path is provided, the model from the specified path is resumed for pretraining. The latter is particularly useful for addressing issues in the current checkpoint’s behavior[1] or continuing the pretraining of an existing MLM.
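For example, resuming pretraining from a saved checkpoint looks roughly like the sketch below; the checkpoint path is a placeholder, and model_config is omitted because the architecture is read from the checkpoint.

# Sketch: resume pretraining from an existing checkpoint (paths are placeholders)
from langformers import tasks

model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    checkpoint_path="/path/to/checkpoint",  # model_config is not needed here
    training_config=training_config
)
model.train()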
- class langformers.mlms.mlm_trainer.ModelConfig(model_config: Dict = <factory>)
Bases:
object
Default configuration for the model. If None, the architecture defaults to the roberta-base architecture. In addition to these parameters, you can pass any parameter accepted by the class transformers.RobertaConfig. Refer to https://huggingface.co/docs/transformers/en/model_doc/roberta#transformers.RobertaConfig.
- Parameters:
vocab_size (int, default=50_265) – Size of the vocabulary (must match with the tokenizer).
max_position_embeddings (int, default=512) – Maximum sequence length the model can handle.
num_attention_heads (int, default=12) – Number of attention heads in the Transformer.
num_hidden_layers (int, default=12) – Number of layers in the Transformer encoder.
hidden_size (int, default=768) – Dimensionality of the hidden layers.
intermediate_size (int, default=3072) – Dimensionality of the feedforward layer in the Transformer.
attention_probs_dropout_prob (float, default=0.1) – Dropout probability for attention layers to prevent overfitting.
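As a sketch, a model_config can combine the fields documented above with any additional transformers.RobertaConfig parameter; hidden_dropout_prob below is one such standard RobertaConfig parameter, included purely as an illustration.

# Sketch: documented ModelConfig fields plus an extra RobertaConfig parameter
model_config = {
    "vocab_size": 50_265,
    "max_position_embeddings": 514,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,  # standard RobertaConfig parameter (illustrative)
}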
- class langformers.mlms.mlm_trainer.TrainingConfig(num_train_epochs: int = 10, save_total_limit: int = 10, learning_rate: float = 0.0002, per_device_train_batch_size: int = 32, gradient_accumulation_steps: int = 2, save_strategy: str = 'steps', save_steps: int = 100, logging_steps: int = 100, report_to: list = <factory>, mlm_probability: float = 0.15, warmup_ratio: float = 0.05, logging_dir: str = None, output_dir: str = None, n_gpus: int = <factory>, run_name: str = 'langformers-mlm-d20250504-t052724')
Bases:
object
Default configuration for pretraining an MLM with the RoBERTa pretraining procedure. In addition to these parameters, you can pass any parameter accepted by the class transformers.TrainingArguments. Refer to https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.
- Parameters:
num_train_epochs (int, default=10) – Number of epochs to train the model.
save_total_limit (int, default=10) – Limits the number of saved checkpoints to save disk space.
learning_rate (float, default=0.0002) – Learning rate. Controls the step size during optimization.
per_device_train_batch_size (int, default=32) – Batch size during training (per device).
gradient_accumulation_steps (int, default=2) – Simulates a larger batch size by accumulating gradients from multiple batches before performing a weight update.
save_strategy (str, default="steps") – When to save checkpoints during training.
save_steps (int, default=100) – Number of steps between model checkpoint saves.
logging_steps (int, default=100) – Number of update steps between logging.
report_to (list, default=[“none”]) – List of integrations to report to (e.g., “tensorboard”, “wandb”).
mlm_probability (float, default=0.15) – Probability of masking tokens during masked language modeling (MLM).
warmup_ratio (float, default=0.05) – Fraction of total training steps used for learning rate warmup.
logging_dir (str, default=None) – Directory to save the training logs. If not provided, logs are written to a timestamp-based directory inside logs/ (see run_name).
output_dir (str, default=None) – Directory to save model checkpoints. If not provided, checkpoints are saved in a timestamp-based directory (see run_name).
n_gpus (int, default=1) – Number of GPUs to train with. Automatically computed for CUDA. Used for computing the total number of training steps.
run_name (str, default=<timestamp-based string>) – Descriptor for the run, typically used by logging tools such as “wandb” and “tensorboard”. Langformers automatically generates a run name based on the current timestamp.
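Putting the documented fields together, a training_config might look like the sketch below; weight_decay is a standard transformers.TrainingArguments parameter, included only to illustrate that such extras can be passed. Note that the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × n_gpus, e.g., 32 × 2 × 1 = 64 on a single GPU.

# Sketch: documented TrainingConfig fields plus an extra TrainingArguments parameter
training_config = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.05,
    "mlm_probability": 0.15,
    "save_strategy": "steps",
    "save_steps": 100,
    "save_total_limit": 10,
    "logging_steps": 100,
    "report_to": ["tensorboard"],
    "weight_decay": 0.01,  # TrainingArguments parameter (illustrative)
}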
Training loss is the main metric
Langformers does not evaluate checkpoints from MLM pretraining on a separate evaluation split, as it is generally unnecessary. In MLM pretraining, training loss is the primary metric since the goal is to learn rich representations rather than minimize validation loss. Real performance is ultimately determined by fine-tuning on downstream tasks.
References
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).
Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33, 16857–16867.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.
Footnotes