---
library_name: transformers
license: apache-2.0
datasets:
- nampdn-ai/tiny-codes
- nlpai-lab/openassistant-guanaco-ko
- philschmid/guanaco-sharegpt-style
language:
- ko
- en
inference: false
tags:
- unsloth
- phi-3
pipeline_tag: text-generation
---

# Phi-3-medium-4k-instruct-ko-poc-v0.1

## Model Details
This model was trained with the unsloth toolkit on top of Microsoft's Phi-3-medium-4k-instruct model (https://huggingface.co/unsloth/Phi-3-medium-4k-instruct), with some Korean instruction data added to improve its Korean generation performance.

My role is not that of a working developer but of an ML Technical Specialist who helps customers with quick PoCs and prototypes, and I was limited by the Azure GPU resources available to me. For PoC purposes, I therefore trained on only 40,000 samples using a single Azure Standard_NC24ads_A100_v4 VM. Because I did not extend the tokenizer, generating Korean text takes many more tokens than generating comparable English text.
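
To give a rough feel for the token overhead, the sketch below compares UTF-8 byte counts. It assumes a tokenizer that falls back to raw bytes for characters missing from its vocabulary, in which case the byte count approximately bounds the number of tokens for that text. This is an illustration under that assumption, not a measurement with the model's real tokenizer; the helper name `byte_fallback_upper_bound` and the sample sentences are hypothetical.

```python
def byte_fallback_upper_bound(text: str) -> int:
    """Worst-case token count if every character fell back to raw UTF-8 bytes."""
    return len(text.encode("utf-8"))


english = "What is deep learning?"
korean = "딥러닝이란 무엇인가요?"

for label, text in [("English", english), ("Korean", korean)]:
    chars = len(text)
    worst = byte_fallback_upper_bound(text)
    # Hangul syllables take 3 bytes each in UTF-8, ASCII characters take 1.
    print(f"{label}: {chars} characters, at most {worst} byte-level tokens")
```

With the actual tokenizer loaded, comparing `len(tokenizer(text)["input_ids"])` for similar Korean and English sentences gives the real numbers.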

### Dataset

The datasets used for training are listed below. To prevent catastrophic forgetting, I included a non-Korean corpus in the training data. Note that I did not use all of the data, only a sampled subset. The Korean textbooks were converted to Q&A format, and the Guanaco dataset was reformatted to fit the multiturn format, e.g. `<|user|>\n{Q1}<|end|>\n<|assistant|>\n{A1}<|end|>\n<|user|>\n{Q2}<|end|>\n<|assistant|>\n{A2}<|end|>`.

- Korean textbooks (https://huggingface.co/datasets/nampdn-ai/tiny-codes)
- Korean translation of Guanaco (https://huggingface.co/datasets/nlpai-lab/openassistant-guanaco-ko)
- Guanaco Sharegpt style (https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style)
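
The multiturn reformatting described above can be sketched as follows. This is a minimal illustration, not the preprocessing script that was actually used; the function name `to_phi3_multiturn` and the placeholder turns are assumptions.

```python
from typing import List, Tuple


def to_phi3_multiturn(turns: List[Tuple[str, str]]) -> str:
    """Render (question, answer) pairs in the Phi-3 multiturn layout."""
    parts = []
    for question, answer in turns:
        parts.append(f"<|user|>\n{question}<|end|>\n")
        parts.append(f"<|assistant|>\n{answer}<|end|>\n")
    # The layout shown in the text has no newline after the final <|end|>.
    return "".join(parts).rstrip("\n")


sample = [("Q1", "A1"), ("Q2", "A2")]
print(to_phi3_multiturn(sample))
```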


## How to Get Started with the Model

### Code snippets
```python
### Load model
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model_path = "daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path, # Hugging Face repo ID or local path of the model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

params = {
    "max_new_tokens": 256,
    "use_cache": True,
    "temperature": 0.05,
    "do_sample": True
}

### Inference
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# 1st example
messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,"},  
    {"from": "assistant", "value": "피보나치 수열의 다음 숫자는 13, 21, 34, 55, 89 등입니다. 각 숫자는 앞의 두 숫자의 합입니다."},
    {"from": "human", "value": "Compute 2x+3=12 in Korean"}, 
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)

# 2nd example
messages = [
    {"from": "human", "value": "What is Machine Learning in Korean?"},  
    {"from": "assistant", "value": "인공지능의 한 분야로 방대한 데이터를 분석해 향후 패턴을 예측하는 기법입니다."},
    {"from": "human", "value": "What is Deep Learning in Korean?"}, 
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)
```
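
The `mapping` argument in the snippet above tells the chat template to read ShareGPT-style keys (`from`/`value`, `human`/`gpt`). If you would rather feed the same conversations to a template that expects `role`/`content` keys, a conversion along these lines may help. It is a sketch of the idea, not how unsloth handles the mapping internally; `sharegpt_to_chat` is a hypothetical helper.

```python
from typing import Dict, List

# Same key/role mapping that the snippet above passes to get_chat_template.
MAPPING = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}


def sharegpt_to_chat(
    messages: List[Dict[str, str]], mapping: Dict[str, str] = MAPPING
) -> List[Dict[str, str]]:
    """Convert ShareGPT-style messages into role/content messages."""
    role_key, content_key = mapping["role"], mapping["content"]
    # Invert the role names: "human" -> "user", "gpt" -> "assistant".
    role_names = {mapping["user"]: "user", mapping["assistant"]: "assistant"}
    return [
        # Roles that are not in the mapping (e.g. "assistant") pass through unchanged.
        {"role": role_names.get(m[role_key], m[role_key]), "content": m[content_key]}
        for m in messages
    ]


print(sharegpt_to_chat([{"from": "human", "value": "What is Deep Learning in Korean?"}]))
```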

### Inference results
```
# 1st example
<s><|user|> Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,<|end|><|assistant|> 피보나치 수열의 다음 숫자는 13, 21, 34, 55, 89 등입니다. 각 숫자는 앞의 두 숫자의 합입니다.<|end|><|user|> Compute 2x+3=12 in Korean<|end|><|assistant|> 방정식 2x + 3 = 12에서 x를 풀려면 다음 단계를 따르십시오.

1. 방정식의 양쪽에서 3을 빼서 방정식의 한쪽에 대해 x를 분리합니다.

2x + 3 - 3 = 12 - 3

2x = 9

2. 이제 방정식의 양쪽을 2로 나누어 x의 값을 구합니다.

2배 / 2 = 9 / 2

x = 4.5

따라서 방정식 2x + 3 = 12에 대한 해는 x = 4.5입니다.<|end|>

# 2nd example
<s><|user|> What is Machine Learning in Korean?<|end|><|assistant|> 인공지능의 한 분야로 방대한 데이터를 분석해 향후 패턴을 예측하는 기법입니다.<|end|><|user|> What is Deep Learning in Korean?<|end|><|assistant|> 복잡한 데이터 세트를 분석하고 복잡한 패턴을 인식하고 학습하는 데 사용되는 딥러닝은 많은 레이어로 구성된 신경망의 하위 집합입니다. 이 기술은 이미지 인식, 자연어 처리 및 자율 운전과 같은 다양한 응용 분야에서 큰 발전을 이뤘습니다.<|end|>
```
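
The streamed output echoes the whole conversation, including the prompt turns. If only the newly generated answer is needed, the decoded text can be post-processed roughly as below. This is a small sketch that assumes the `<|assistant|>` and `<|end|>` markers appear exactly as in the transcripts above; `last_assistant_reply` and the shortened sample transcript are hypothetical.

```python
def last_assistant_reply(decoded: str) -> str:
    """Return the text of the final assistant turn in a decoded Phi-3 transcript."""
    # Everything after the last <|assistant|> marker belongs to the final reply.
    _, _, tail = decoded.rpartition("<|assistant|>")
    # Drop the end-of-turn marker and anything after it, then trim whitespace.
    reply, _, _ = tail.partition("<|end|>")
    return reply.strip()


transcript = (
    "<s><|user|> What is Machine Learning in Korean?<|end|>"
    "<|assistant|> first answer<|end|>"
    "<|user|> What is Deep Learning in Korean?<|end|>"
    "<|assistant|> second answer<|end|>"
)
print(last_assistant_reply(transcript))  # second answer
```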

### References
- Base model: [unsloth/Phi-3-medium-4k-instruct](https://huggingface.co/unsloth/Phi-3-medium-4k-instruct)

## Notes 

### License

Apache 2.0. Phi-3 itself is released under the MIT license, but I chose Apache 2.0 after considering the licenses of the datasets and libraries used for training.

### Caution
This model was created as a personal experiment and is unrelated to the organization I work for. Because no separate verification was performed, the model may not behave correctly. Please use it with caution for anything beyond personal experimentation or a PoC (Proof of Concept)!