Pretrain MLMs

Masked Language Models (MLMs) like BERT [Devlin2019], RoBERTa [Liu2019], and MPNet [Song2020] are auto-encoding models that use only the encoder component of the Transformer [Vaswani2017] architecture. These models are ideal for tasks requiring the entire input sequence to make decisions, such as text classification and named-entity recognition.

Langformers adopts the RoBERTa architecture [Liu2019] as the default for both MLM pretraining and model mimicking tasks. This design choice enables a unified and streamlined training pipeline. RoBERTa is well-suited for both pretraining and downstream fine-tuning or transfer tasks. Additionally, the architecture is highly customizable and easy to understand, with configurable parameters such as the number of layers, hidden size, and attention heads.

Tip

Pretraining an MLM is mostly beneficial when no domain-specific MLM exists. For example, BERTweet [Nguyen2020] was created for tweets, and CrisisTransformers [Lamsal2024] were developed for crisis-related tweets.

You may also want to continue pretraining an existing MLM such as RoBERTa on your domain-specific dataset. Further pretraining has been shown to produce strong results [Lamsal2024]. Check out Langformers’ Further Pretrain MLMs task.

There are three steps to training an MLM from scratch. First, train a tokenizer on your dataset. Second, use the trained tokenizer to tokenize the dataset. Finally, train the MLM on the tokenized dataset.

Let’s get started!

Tokenizer and Tokenization

Before training, you need to create a tokenizer (if you don’t already have one) and tokenize your dataset. The tokenizer converts raw text into tokens that the model can process.

# Import langformers
from langformers import tasks

# Define configuration for the tokenizer
tokenizer_config = {
    "vocab_size": 50_265,    # Size of the vocabulary (must match the MLM's `vocab_size`)
    "min_frequency": 2,      # Minimum frequency threshold when building the vocabulary
    "max_length": 512,       # Maximum sequence length used during tokenization
    # ...
}

# Train the tokenizer and tokenize the dataset
tokenizer = tasks.create_tokenizer(data_path="data.txt", tokenizer_config=tokenizer_config)
tokenizer.train()

The trained tokenizer will be saved inside “tokenizer” and the tokenized dataset inside “tokenized_dataset” in the working directory.

langformers.tasks.create_tokenizer(data_path: str, tokenizer_config: Dict | None = None, tokenizer: str = None)

Factory method for training a tokenizer on a custom dataset and tokenizing that dataset with the trained tokenizer.

Parameters:
  • data_path (str, required) – Path to a raw text dataset (e.g., data.txt). Each line in the dataset should contain a single sentence or document. A line can also contain multiple sentences, but note that truncation will be applied.

  • tokenizer_config (dict, default=None) – Configurations for the tokenizer. If not provided, default values will be assigned from the dataclass langformers.mlms.mlm_tokenizer_dataset_creator.TokenizerConfig.

  • tokenizer (str, default=None) – Path to a trained tokenizer, such as “roberta-base” on Hugging Face, or a local path. If tokenizer is provided, tokenizer_config is ignored (see the example below).

Returns:

An instance of MLMDatasetCreator.
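
If you already have a suitable tokenizer, you can skip tokenizer training and pass it via the tokenizer argument, in which case tokenizer_config is ignored. The following is a minimal sketch assuming the same data.txt as above; the train() call is assumed to behave as in the earlier example, i.e., it tokenizes the dataset using the provided tokenizer.

# Import langformers
from langformers import tasks

# Reuse an existing tokenizer (e.g., "roberta-base" on Hugging Face, or a local path)
# instead of training a new one; `tokenizer_config` is ignored when `tokenizer` is given.
tokenizer = tasks.create_tokenizer(data_path="data.txt", tokenizer="roberta-base")
tokenizer.train()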

Initialize and Train an MLM

With the tokenizer and dataset ready, initialize the MLM model and start training.

# Define model architecture
model_config = {
    "vocab_size": 50_265,              # Size of the vocabulary (must match tokenizer's `vocab_size`)
    "max_position_embeddings": 514,    # !imp Maximum sequence length (tokenizer's `max_length` + 2)
    "num_attention_heads": 12,         # Number of attention heads
    "num_hidden_layers": 12,           # Number of hidden layers
    "hidden_size": 768,                # Size of the hidden layers
    "intermediate_size": 3072,         # Size of the intermediate layer in the Transformer
    # ...
}

# Define training configuration
training_config = {
    "per_device_train_batch_size": 4,  # Batch size during training (per device)
    "num_train_epochs": 2,             # Number of training epochs
    "save_total_limit": 1,             # Maximum number of checkpoints to save
    "learning_rate": 2e-4,             # Learning rate for optimization
    # ...
}

# Initialize the training
model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    training_config=training_config,
    model_config=model_config
)

# Start the training
model.train()

langformers.tasks.create_mlm(tokenizer: str, tokenized_dataset: str, model_config: Dict | None = None, training_config: Dict | None = None, checkpoint_path=None)

Factory method for training a masked language model based on the RoBERTa pretraining procedure.

Parameters:
  • tokenizer (str, required) – Path to the trained tokenizer.

  • tokenized_dataset (str, required) – Path to the tokenized dataset.

  • model_config (Optional[Dict], default=None) – Dictionary containing model configurations. If None, the model must be loaded from a checkpoint.

  • training_config (Optional[Dict], default=None) – Dictionary containing training configurations. If None, default training configurations will be used.

  • checkpoint_path (Optional[str], default=None) – Path to a model checkpoint for resuming training.

Returns:

An instance of HuggingFaceMLMCreator.

Warning

At least one of model_config or checkpoint_path must be provided. If model_config is specified, a new model is initialized using the given configurations. If checkpoint_path is provided, the model from the specified path is resumed for pretraining. The latter is particularly useful for addressing issues in the current checkpoint’s behavior[1] or continuing the pretraining of an existing MLM.
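
For example, to resume pretraining from a saved checkpoint rather than initializing a new model, pass checkpoint_path and omit model_config. The sketch below uses placeholder paths and reuses the training_config defined earlier.

# Resume pretraining from an existing checkpoint (placeholder paths)
model = tasks.create_mlm(
    tokenizer="/path/to/tokenizer",
    tokenized_dataset="/path/to/tokenized_dataset",
    training_config=training_config,        # reuse the training configuration defined above
    checkpoint_path="/path/to/checkpoint"   # the model architecture is loaded from this checkpoint
)

# Continue the training
model.train()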

References

[Liu2019]

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[Devlin2019]

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).

[Song2020]

Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33, 16857-16867.

[Vaswani2017]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[Nguyen2020]

Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

[Lamsal2024]

Lamsal, R., Read, M. R., & Karunasekera, S. (2024). CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts. Knowledge-Based Systems, 296, 111916.

Footnotes