Commit ea41081 by nguyenvulebinh
1 Parent(s): 0e69533

Update README.md

Files changed (1)
  1. README.md +65 -1
README.md CHANGED
@@ -1 +1,65 @@
- # RoBERTa for Vietnamese and English (envibert)
+ ---
+ language: vi
+ tags:
+ - exbert
+ license: cc-by-nc-4.0
+ ---
+
+ # RoBERTa for Vietnamese and English (envibert)
+
+ This RoBERTa model was trained on 100GB of text (50GB of Vietnamese and 50GB of English), hence the name ***envibert***. The architecture is customized for production use, so it contains only 70M parameters.
+
+ ## Usage
+
+ ```python
+ # cached_path and hf_bucket_url are only available in older transformers releases
+ from transformers import RobertaModel
+ from transformers.file_utils import cached_path, hf_bucket_url
+ from importlib.machinery import SourceFileLoader
+ import os
+
+ cache_dir = './cache'
+ model_name = 'nguyenvulebinh/envibert'
+
+ def download_tokenizer_files():
+     # Fetch the custom tokenizer script and its vocabulary files into cache_dir
+     resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
+     for item in resources:
+         if not os.path.exists(os.path.join(cache_dir, item)):
+             tmp_file = hf_bucket_url(model_name, filename=item)
+             tmp_file = cached_path(tmp_file, cache_dir=cache_dir)
+             # Rename the cached file to its original file name
+             os.rename(tmp_file, os.path.join(cache_dir, item))
+
+ download_tokenizer_files()
+ # Load the tokenizer class from the downloaded script
+ tokenizer = SourceFileLoader("envibert.tokenizer", os.path.join(cache_dir, 'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
+ model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)
+
+ # Encode text
+ text_input = 'Đại học Bách Khoa Hà Nội .'
+ text_ids = tokenizer(text_input, return_tensors='pt').input_ids
+ # tensor([[ 0, 705, 131, 8751, 2878, 347, 477, 5, 2]])
+
+ # Extract features
+ text_features = model(text_ids)
+ text_features['last_hidden_state'].shape
+ # torch.Size([1, 9, 768])
+ len(text_features['hidden_states'])
+ # 7
+ ```
+
48
+ ```text
49
+ @inproceedings{nguyen20d_interspeech,
50
+ author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
51
+ title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
52
+ year=2020,
53
+ booktitle={Proc. Interspeech 2020},
54
+ pages={4263--4267},
55
+ doi={10.21437/Interspeech.2020-1896}
56
+ }
57
+ ```
58
+ **Please CITE** our repo when it is used to help produce published results or is incorporated into other software.
59
+
60
+
61
+ # Contact
62
+
63
+ nguyenvulebinh@gmail.com
64
+
65
+ [![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)
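
The download-and-load step in the usage snippet (fetch the tokenizer files only if they are missing, then import the tokenizer class from the downloaded script) can be tried offline. The following is a minimal sketch under stated assumptions, not the author's implementation: `ensure_resources`, `load_module_from_path`, `fake_fetch`, and `DemoTokenizer` are hypothetical names introduced for illustration, and the sketch uses `importlib.util` in place of `SourceFileLoader(...).load_module()`, which is deprecated in recent Python versions.

```python
import importlib.util
import os
import tempfile


def ensure_resources(cache_dir, resources, fetch):
    """Fetch each missing resource into cache_dir and return the local paths.

    `fetch(name, dest_path)` is a hypothetical callback that writes the file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    paths = {}
    for name in resources:
        dest = os.path.join(cache_dir, name)
        if not os.path.exists(dest):  # skip files that are already cached
            fetch(name, dest)
        paths[name] = dest
    return paths


def load_module_from_path(module_name, path):
    """Import a Python source file by its path using importlib.util."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Offline demonstration: the fake fetcher writes a tiny stand-in script
# instead of contacting the Hugging Face Hub.
fetch_calls = []


def fake_fetch(name, dest):
    fetch_calls.append(name)
    with open(dest, "w", encoding="utf-8") as f:
        f.write(
            "class DemoTokenizer:\n"
            "    def __init__(self, data_dir):\n"
            "        self.data_dir = data_dir\n"
        )


demo_dir = tempfile.mkdtemp()
paths = ensure_resources(demo_dir, ["demo_tokenizer.py"], fake_fetch)
# Second call finds the file already cached, so fake_fetch is not called again.
ensure_resources(demo_dir, ["demo_tokenizer.py"], fake_fetch)

demo_module = load_module_from_path("demo_tokenizer", paths["demo_tokenizer.py"])
demo_tokenizer = demo_module.DemoTokenizer(demo_dir)
```

In a real setup, the `fetch` callback could wrap a download call such as `huggingface_hub.hf_hub_download`; that wiring is an assumption and is not shown here. Passing the fetcher as a parameter keeps the caching logic testable without network access.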