---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
- twitter
---
# AlephBERT
## Hebrew Language Model
AlephBERT is a state-of-the-art language model for Hebrew, based on the BERT architecture.
## How to use
```python
from transformers import BertModel, BertTokenizerFast
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')
# If you are not fine-tuning, switch to eval mode to disable dropout
alephbert.eval()
```
## Training data
- OSCAR (10G text, 20M sentences)
- Wikipedia dump (0.6G text, 3M sentences)
- Tweets (7G text, 70M sentences)
## Training procedure
Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.
To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence:
1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (70M sentences)
Each section was trained for 5 epochs with an initial learning rate of 1e-4.
Total training time was 5 days.
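
The length-based split described above can be illustrated with a minimal sketch. This is not the original preprocessing code: the function names `split_by_length` and `whitespace_token_count` are hypothetical, the whitespace counter is only a stand-in for a real tokenizer, and dropping sentences with 512 or more tokens is an assumption.

```python
from bisect import bisect_right
from typing import Callable, Dict, Iterable, List

# Exclusive upper bounds of the four sections listed above.
BOUNDARIES = [32, 64, 128, 512]


def whitespace_token_count(sentence: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated tokens."""
    return len(sentence.split())


def split_by_length(
    sentences: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Dict[int, List[str]]:
    """Assign each sentence to section 1-4 according to its token count."""
    sections: Dict[int, List[str]] = {i: [] for i in range(1, len(BOUNDARIES) + 1)}
    for sentence in sentences:
        n = count_tokens(sentence)
        # bisect_right gives 0 for n < 32, 1 for 32 <= n < 64, and so on.
        idx = bisect_right(BOUNDARIES, n)
        if idx < len(BOUNDARIES):
            sections[idx + 1].append(sentence)
        # Assumption: sentences with 512 or more tokens are dropped.
    return sections


if __name__ == "__main__":
    sample = ["a " * 5, "a " * 32, "a " * 100, "a " * 200]
    for section, items in split_by_length(sample).items():
        print(section, len(items))
```

To count tokens with the actual vocabulary, one option is to pass `lambda s: len(alephbert_tokenizer.tokenize(s))` as `count_tokens`. Whether the original thresholds included special tokens is not stated here. Grouping sentences of similar length lets each section train with a shorter maximum sequence length, which reduces padding.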
## Eval