---
datasets:
- mc4
license: apache-2.0
---
# ByT5-Korean - large
ByT5-Korean is a Korean-specific extension of Google's [ByT5](https://github.com/google-research/byt5).
A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant; they function like the individual letters of an alphabet.
While ByT5's UTF-8 encoding is generic enough to cover many languages, it is unnatural for Korean because it splits the bit representation of each Jamo across byte boundaries.
ByT5-Korean extends ByT5's UTF-8 encoding with special care for Korean syllables: each Jamo is represented with an extra token.
ByT5-Korean was pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) with 70% Korean and 30% English.
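To make the mismatch concrete, the snippet below (illustrative only, not taken from the model code) shows that a syllable such as '한' occupies three UTF-8 bytes, none of which corresponds to a single Jamo, while the Jamo are recoverable only by arithmetic on the code point:

```python
# '한' is one syllable made of three Jamo: ㅎ (initial), ㅏ (medial), ㄴ (final).
syllable = '한'

# In UTF-8 the syllable is three bytes, but no byte maps to one Jamo.
print(list(syllable.encode('utf-8')))  # [237, 149, 156]

# The precomposed code point packs the Jamo arithmetically (Unicode Hangul layout):
code = ord(syllable) - 0xAC00          # offset into the Hangul Syllables block
initial = code // (21 * 28)            # 0..18  -> ㄱ..ㅎ
medial = (code % (21 * 28)) // 28      # 0..20  -> ㅏ..ㅣ
final = code % 28                      # 0 = none, 1..27 -> ㄱ..ㅎ
print(initial, medial, final)          # 18 0 4  (ㅎ, ㅏ, ㄴ)
```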
## Encoding Scheme
```text
id: token
0: <pad>
1: <unk>
2: <eos>
3~258: utf-8 encoding
259~277: beginning consonants(초성), from ㄱ to ㅎ
279~299: middle vowels(중성), from ㅏ to ㅣ
300~327: final consonants(종성), None, then from ㄱ to ㅎ
328~384: from <extra_id_0> to <extra_id_56>
```
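Under this scheme, mapping a precomposed syllable to its Jamo token ids is simple arithmetic. The helper below is a sketch based on the offsets in the table above (259, 279, 300); it is not the released `ByT5KoreanTokenizer`, whose exact behavior may differ:

```python
def jamo_token_ids(syllable: str) -> list[int]:
    """Map one precomposed Hangul syllable (U+AC00..U+D7A3) to the three
    Jamo token ids from the table above. Illustrative sketch only."""
    code = ord(syllable) - 0xAC00
    initial = code // (21 * 28)        # 19 beginning consonants
    medial = (code % (21 * 28)) // 28  # 21 middle vowels
    final = code % 28                  # 0 = no final consonant
    # Per the table: initials start at 259, medials at 279,
    # and 300 is the "None" final, so finals are 300 + index.
    return [259 + initial, 279 + medial, 300 + final]

print(jamo_token_ids('한'))  # [277, 279, 304] -> ㅎ, ㅏ, ㄴ
```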
## Example Inference
```python
import torch
from tokenizer import ByT5KoreanTokenizer # https://github.com/everdoubling/byt5-Korean
from transformers import T5ForConditionalGeneration
tokenizer_jamo = ByT5KoreanTokenizer()
model = T5ForConditionalGeneration.from_pretrained('everdoubling/byt5-Korean-large')
input_sentence = '한국어 위키백과(영어: Korean Wikipedia)는 한국어로 운영되는 위키백과의 다언어판 가운데 하나로서, 2002년 10월 11일에 <extra_id_0>. 또한 현재 한국어 위키백과에는 넘겨주기, 토론, 그림 등 페이지로 불리는 모든 문서를 포함하면 총 2,629,860개가 <extra_id_1>되어 있으며, 넘겨주기를 포함한 일반 문서 수는 1,278,560개,[1] 그중 넘겨주기, 막다른 문서를 제외한 일반 문서 수는 573,149개이다.'
input_ids_jamo = tokenizer_jamo(input_sentence).input_ids
outputs_jamo = model.generate(torch.tensor([input_ids_jamo]))
print(tokenizer_jamo.decode(outputs_jamo[0]))
# <pad><extra_id_0>설립되었다<extra_id_1>...
```
Additional information coming soon...