metadata
datasets:
- mc4
license: apache-2.0
ByT5-Korean - large
ByT5-Korean is a Korean-specific extension of Google's ByT5.
A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant; they are like the individual letters of an alphabet. While ByT5's UTF-8 encoding offers a generic encoding for multiple languages, it is unnatural for Korean because the byte boundaries split the bit representation of each Jamo in the middle.
ByT5-Korean extends ByT5's UTF-8 encoding with special care for Korean syllables: each Jamo is represented with an extra token. ByT5-Korean was pre-trained on mC4 with 70% Korean and 30% English.
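To make the point about split bit representations concrete, here is a minimal sketch that uses only standard Unicode Hangul arithmetic, not any ByT5-Korean code. The function names `decompose` and `utf8_bits` are hypothetical helpers introduced for illustration.

```python
# Standard Unicode Hangul syllable arithmetic (illustration only,
# not taken from the ByT5-Korean tokenizer).
HANGUL_BASE = 0xAC00             # code point of the first precomposed syllable
NUM_INITIAL, NUM_MEDIAL, NUM_FINAL = 19, 21, 28  # 28 = 27 finals plus "none"


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) Jamo indices; final 0 means no final."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIAL * NUM_MEDIAL * NUM_FINAL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def utf8_bits(char: str) -> list[str]:
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(b, "08b") for b in char.encode("utf-8")]


print(decompose("한"))  # (18, 0, 4): initial ㅎ, medial ㅏ, final ㄴ
print(utf8_bits("한"))  # ['11101101', '10010101', '10011100']
```

The three Jamo indices are packed into a single 16-bit code point, which UTF-8 then cuts into 4 + 6 + 6 payload bits, so no byte corresponds to exactly one Jamo.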
Encoding Scheme
id: token
0: <pad>
1: <unk>
2: <eos>
3~258: utf-8 encoding
259~277: beginning consonants(초성), from ㄱ to ㅎ
279~299: middle vowel(중성), from ㅏ to ㅣ
300~327: final consonant(종성), None, from ㄱ to ㅎ
328~384: from <extra_id_0> to <extra_id_56>
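The table can be read as a simple mapping from text to ids. The sketch below is an assumption built only from the id ranges listed above: it is not the code of `ByT5KoreanTokenizer`, the function name `encode_sketch` is hypothetical, and the real tokenizer may differ (for example in how it treats a missing final consonant or when it appends `<eos>`).

```python
# Offsets copied from the id table above; treat them as assumptions,
# since the actual tokenizer implementation may differ.
BYTE_OFFSET = 3        # ids 3~258: raw UTF-8 bytes
INITIAL_OFFSET = 259   # ids 259~277: beginning consonants
MEDIAL_OFFSET = 279    # ids 279~299: middle vowels
FINAL_OFFSET = 300     # ids 300~327: final consonants, 300 meaning "none"

HANGUL_BASE = 0xAC00
NUM_SYLLABLES = 19 * 21 * 28


def encode_sketch(text: str) -> list[int]:
    """Map Hangul syllables to three Jamo ids and everything else to byte ids."""
    ids: list[int] = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < NUM_SYLLABLES:
            # Precomposed syllable: split into initial, medial, final indices.
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            ids += [INITIAL_OFFSET + initial,
                    MEDIAL_OFFSET + medial,
                    FINAL_OFFSET + final]
        else:
            # Any other character falls back to its UTF-8 bytes.
            ids += [b + BYTE_OFFSET for b in ch.encode("utf-8")]
    return ids


print(encode_sketch("한a"))  # [277, 279, 304, 100]
```

Under these assumptions every Hangul syllable costs exactly three tokens, the same count as its three UTF-8 bytes, but each token now names one Jamo.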
Example Inference
import torch
from tokenizer import ByT5KoreanTokenizer # https://github.com/everdoubling/byt5-Korean
from transformers import T5ForConditionalGeneration
tokenizer_jamo = ByT5KoreanTokenizer()
model = T5ForConditionalGeneration.from_pretrained('everdoubling/byt5-Korean-large')
input_sentence = '한국어 위키백과(영어: Korean Wikipedia)는 한국어로 운영되는 위키백과의 다언어판 가운데 하나로서, 2002년 10월 11일에 <extra_id_0>. 또한 현재 한국어 위키백과에는 넘겨주기, 토론, 그림 등 페이지로 불리는 모든 문서를 포함하면 총 2,629,860개가 <extra_id_1>되어 있으며, 넘겨주기를 포함한 일반 문서 수는 1,278,560개,[1] 그중 넘겨주기, 막다른 문서를 제외한 일반 문서 수는 573,149개이다.'
input_ids_jamo = tokenizer_jamo(input_sentence).input_ids
outputs_jamo = model.generate(torch.tensor([input_ids_jamo]))
print(tokenizer_jamo.decode(outputs_jamo[0]))
# <pad><extra_id_0>설립되었다<extra_id_1>đě
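The decoded output interleaves sentinel tokens with the text predicted for each masked span. As a rough sketch, the spans can be pulled apart with a regular expression; `split_sentinels` is a hypothetical helper, and the sample string is only shaped like the output above rather than a guaranteed model result.

```python
import re


def split_sentinels(decoded: str) -> dict[str, str]:
    """Map each <extra_id_N> sentinel to the text generated right after it."""
    # The capturing group keeps the sentinels in the split result, so the
    # list alternates: prefix, sentinel, text, sentinel, text, ...
    parts = re.split(r"(<extra_id_\d+>)", decoded)
    return dict(zip(parts[1::2], parts[2::2]))


decoded = "<pad><extra_id_0>설립되었다<extra_id_1>"
print(split_sentinels(decoded))
# {'<extra_id_0>': '설립되었다', '<extra_id_1>': ''}
```

Each value can then be substituted for the matching sentinel in the input sentence; an empty or noisy value for the last sentinel suggests generation stopped before that span was filled.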
Additional information coming soon...