This is a text-to-text fine-tuned version of
[facebook/bart-base](https://huggingface.co/facebook/bart-base)
trained on spelling correction. It leans on the excellent work by
Oliver Guhr ([github](https://github.com/oliverguhr/spelling),
[huggingface](https://huggingface.co/oliverguhr/spelling-correction-english-base)).
Training was performed on an AWS EC2 instance (g5.xlarge) on a single GPU.
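The card does not show an inference snippet; the sketch below uses the standard `transformers` text2text-generation pipeline that BART seq2seq checkpoints are served through. The model ID is a placeholder (the real repository name is not given in this excerpt), and the misspelled Dutch test sentence is illustrative.

```python
from transformers import pipeline

# Placeholder model ID -- substitute the actual Hub repository name of this model.
MODEL_ID = "your-username/bart-base-dutch-spelling-correction"


def load_corrector(model_id: str = MODEL_ID):
    # BART text-to-text checkpoints use the text2text-generation pipeline task.
    return pipeline("text2text-generation", model=model_id)


if __name__ == "__main__":
    corrector = load_corrector()
    # "speling fout" is deliberately misspelled; the model should return a corrected sentence.
    out = corrector("Dit is een zin met een speling fout.", max_length=64)
    print(out[0]["generated_text"])
```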
## Intended uses & limitations

… checker. A next version of the model will be trained on more data.

## Training and evaluation data

The model was trained on a Dutch dataset of 2,964,203 lines (nearly 3M)
of text from three public Dutch sources, downloaded from the
[Opus corpus](https://opus.nlpl.eu/):

- nl-europarlv7.1m.txt (1,000,000 lines)
- nl-opensubtitles2016.1m.txt (1,000,000 lines)
- nl-wikipedia.txt (964,203 lines)

Together these texts comprise 45,308,056 tokens.
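The quoted totals are easy to sanity-check; the per-source line counts sum to the stated dataset size, and the token count works out to roughly 15 tokens per line (figures taken from the list above):

```python
# Per-source line counts as listed in the model card.
sources = {
    "nl-europarlv7.1m.txt": 1_000_000,
    "nl-opensubtitles2016.1m.txt": 1_000_000,
    "nl-wikipedia.txt": 964_203,
}

total_lines = sum(sources.values())
total_tokens = 45_308_056
tokens_per_line = total_tokens / total_lines

print(total_lines)                 # 2964203
print(round(tokens_per_line, 1))   # 15.3
```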
## Training procedure
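Seq2seq spelling correctors of this kind are typically trained on (noisy, clean) sentence pairs produced by injecting synthetic typos into clean text. The sketch below shows one way such a corruption step could look; it is an illustration of the general technique, not this model's actual data pipeline.

```python
import random


def corrupt(text: str, rng: random.Random, p: float = 0.05) -> str:
    """Introduce random character-level typos (swap, drop, duplicate).

    Applied to clean corpus lines, this yields (corrupt(line), line)
    training pairs for a text-to-text spelling corrector.
    """
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p and chars[i].isalpha():
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap" and i + 1 < len(chars):
                # Transpose two adjacent characters.
                out.append(chars[i + 1])
                out.append(chars[i])
                i += 2
            elif op == "drop":
                # Delete the character.
                i += 1
            else:
                # Duplicate the character (also used for a swap at the last position).
                out.append(chars[i])
                out.append(chars[i])
                i += 1
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


rng = random.Random(0)
clean = "Het Europees Parlement vergadert in Straatsburg."
noisy = corrupt(clean, rng, p=0.15)
print(noisy, "->", clean)
```

Seeding the generator makes the corruption reproducible, so the same corpus always yields the same training pairs.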