Update README.md
README.md CHANGED
@@ -31,7 +31,7 @@ alephbert.eval()
## Training data
1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/) Hebrew section (10GB text, 20M sentences).
-2. Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/) (650 MB text,
+2. Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/) (650 MB text, 3M sentences).
3. Hebrew Tweets collected from the Twitter sample stream (7G text, 70M sentences).

## Training procedure
@@ -43,7 +43,7 @@ To optimize training time we split the data into 4 sections based on max number
1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
-4. 128 <= num tokens < 512 (
+4. 128 <= num tokens < 512 (1.5M sentences)

Each section was first trained for 5 epochs with an initial learning rate set to 1e-4. Then each section was trained for another 5 epochs with an initial learning rate set to 1e-5, for a total of 10 epochs.
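The second hunk describes splitting the corpus into four sections by token count. Below is a minimal sketch of that bucketing, assuming a HuggingFace fast tokenizer; the checkpoint name `onlplab/alephbert-base` and the decision to count the special tokens are assumptions, while the boundaries 32/64/128/512 come from the README itself.

```python
# Hedged sketch of the length-based bucketing described above.
# Assumptions: the 'onlplab/alephbert-base' checkpoint name, and that
# "num tokens" counts the tokenizer's special tokens ([CLS]/[SEP]).
# The bucket boundaries (32, 64, 128, 512) come from the README.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("onlplab/alephbert-base")
BOUNDARIES = (32, 64, 128, 512)

def bucket_index(sentence: str) -> int:
    """Index of the first bucket whose upper bound the sentence fits
    under, or -1 if it has 512 or more tokens."""
    num_tokens = len(tokenizer(sentence)["input_ids"])
    for i, upper in enumerate(BOUNDARIES):
        if num_tokens < upper:
            return i
    return -1  # too long for any section; dropped in this sketch

# Stand-in for the real OSCAR/Wikipedia/Twitter sentence stream.
corpus = ["שלום עולם", "זוהי דוגמה"]

buckets = [[] for _ in BOUNDARIES]
for sentence in corpus:
    i = bucket_index(sentence)
    if i != -1:
        buckets[i].append(sentence)
```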
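The closing paragraph gives the schedule applied to each section: 5 epochs at an initial learning rate of 1e-4, then another 5 at 1e-5. A hedged sketch of how that schedule could drive the per-section loop follows; `train_section` is a hypothetical stand-in, since the README does not show the actual masked-language-modeling loop.

```python
# Hedged sketch of the two-phase schedule described above: 5 epochs at
# 1e-4, then 5 more at 1e-5, for each length-bucketed section.
# train_section is a hypothetical placeholder, not the authors' code.
def train_section(sentences, *, epochs, learning_rate):
    ...  # hypothetical: tokenize to the section's max length and run MLM

for section in buckets:
    train_section(section, epochs=5, learning_rate=1e-4)
    train_section(section, epochs=5, learning_rate=1e-5)  # 10 epochs total
```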