metadata
language:
  - he
tags:
  - language model
license: apache-2.0
datasets:
  - oscar
  - wikipedia
  - twitter

AlephBERT

Hebrew Language Model

A state-of-the-art language model for Hebrew, based on BERT.

How to use

from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# when not fine-tuning, switch to eval mode to disable dropout
alephbert.eval()

Training data

  • OSCAR (10 GB of text, 20M sentences)
  • Wikipedia dump (0.6 GB of text, 3M sentences)
  • Tweets (7 GB of text, 70M sentences)

Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens:

  1. num tokens < 32 (70M sentences)
  2. 32 <= num tokens < 64 (12M sentences)
  3. 64 <= num tokens < 128 (10M sentences)
  4. 128 <= num tokens < 512 (1.5M sentences)
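
The split above can be sketched in a few lines of Python. This is only an illustration of the bucketing rule, not the script used for training: the names `BUCKET_BOUNDS`, `bucket_for` and `split_by_length` are hypothetical, and the default token counter is a whitespace split standing in for the real tokenizer.

```python
from bisect import bisect_right
from collections import defaultdict

# Exclusive upper bounds of the four sections listed above.
BUCKET_BOUNDS = [32, 64, 128, 512]


def bucket_for(num_tokens):
    """Return the exclusive upper bound of the section a sentence falls in.

    Returns None for sentences with 512 tokens or more, which fit no section.
    """
    idx = bisect_right(BUCKET_BOUNDS, num_tokens)
    return BUCKET_BOUNDS[idx] if idx < len(BUCKET_BOUNDS) else None


def split_by_length(sentences, count_tokens=lambda s: len(s.split())):
    """Group sentences by section, keyed by the section's upper bound.

    `count_tokens` defaults to a whitespace split (an assumption made so the
    sketch runs without downloading the model).
    """
    buckets = defaultdict(list)
    for sentence in sentences:
        bound = bucket_for(count_tokens(sentence))
        if bound is not None:
            buckets[bound].append(sentence)
    return dict(buckets)
```

To count real word pieces instead, one could pass `count_tokens=lambda s: len(alephbert_tokenizer.tokenize(s))`. Whether the original split counted special tokens such as `[CLS]` and `[SEP]` is not stated here.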

Each section was trained for 5 epochs with an initial learning rate of 1e-4.
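
As a rough illustration of what an "initial" learning rate implies, the sketch below decays the rate linearly from 1e-4 to 0 over a section's training steps. Linear decay without warmup is an assumption here (it is the default scheduler of the Hugging Face `Trainer`); the exact scheduler and whether it restarted for each section are not stated in this card. The function name `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, total_steps, initial_lr=1e-4):
    """Learning rate at `step`, decaying linearly from `initial_lr` to 0.

    Assumes linear decay with no warmup; this is an illustrative guess at
    the schedule, not a confirmed detail of the training run.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / total_steps


# Example: learning rate at the start, midpoint and end of a 1000-step run.
rates = [linear_decay_lr(s, 1000) for s in (0, 500, 1000)]
```

Under this assumption, each of the 4 sections would start again at 1e-4 and decay over its own 5 epochs.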

Total training time was 5 days.

Eval