hkiyomaru committed on
Commit 90fd564
1 Parent(s): 2be220a

Update README.md

Files changed (1)
  1. README.md +64 -73
README.md CHANGED
@@ -1,83 +1,71 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - en
5
- - ja
6
  programming_language:
7
- - C
8
- - C++
9
- - C#
10
- - Go
11
- - Java
12
- - JavaScript
13
- - Lua
14
- - PHP
15
- - Python
16
- - Ruby
17
- - Rust
18
- - Scala
19
- - TypeScript
20
  library_name: transformers
21
  pipeline_tag: text-generation
22
  inference: false
23
- datasets:
24
- - databricks/databricks-dolly-15k
25
- - llm-jp/databricks-dolly-15k-ja
26
- - llm-jp/oasst1-21k-en
27
- - llm-jp/oasst1-21k-ja
28
  ---
29
-
30
  # llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0
31
 
32
  This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
33
 
34
  | Model Variant |
35
  | :--- |
36
- |**Instruction models ver1.1**|
37
- | [llm-jp-13b-dpo-lora-hh_rlhf_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-dpo-lora-hh_rlhf_ja-v1.1)|
38
- | [llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1) |
39
- | [llm-jp-13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1) |
40
- |**Instruction models ver1.0**|
41
- | [llm-jp-13b-instruct-full-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-v1.0) |
42
- | [llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0) |
43
- | [llm-jp-13b-instruct-full-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-oasst-v1.0) |
44
- | [llm-jp-13b-instruct-lora-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-jaster-v1.0) |
45
- | [llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0) |
46
- | [llm-jp-13b-instruct-lora-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-dolly-oasst-v1.0) |
47
 
48
 
49
  | |
50
  | :--- |
51
  |**Pre-trained models**|
52
- | [llm-jp-13b-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-v1.0) |
53
- | [llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) |
54
- Checkpoints format: Hugging Face Transformers (Megatron-DeepSpeed format models are available [here](https://huggingface.co/llm-jp/llm-jp-13b-v1.0-mdsfmt))
55
 
56
 
57
  ## Required Libraries and Their Versions
58
 
59
- - torch>=2.0.0
60
- - transformers>=4.34.0
61
- - tokenizers>=0.14.0
62
- - accelerate==0.23.0
 
63
 
64
  ## Usage
65
 
66
  ```python
67
  import torch
68
  from transformers import AutoTokenizer, AutoModelForCausalLM
69
- tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1")
70
- model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1", device_map="auto", torch_dtype=torch.float16)
71
- text = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:\n".format(instruction="自然言語処理とは何か")
72
  tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
73
  with torch.no_grad():
74
  output = model.generate(
75
  tokenized_input,
76
- max_new_tokens=512,
77
  do_sample=True,
78
  top_p=0.95,
79
  temperature=0.7,
80
- repetition_penalty=1.1,
81
  )[0]
82
  print(tokenizer.decode(output))
83
  ```
@@ -86,32 +74,34 @@ print(tokenizer.decode(output))
86
  ## Model Details
87
 
88
  - **Model type:** Transformer-based Language Model
89
- - **Total seen tokens:** 300B
90
 
91
  |Model|Params|Layers|Hidden size|Heads|Context length|
92
  |:---:|:---:|:---:|:---:|:---:|:---:|
93
- |13b model|13b|40|5120|40|2048|
94
- |1.3b model|1.3b|24|2048|16|2048|
95
 
96
 
97
  ## Training
98
 
99
  - **Pre-training:**
100
- - **Hardware:** 96 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
101
- - **Software:** Megatron-DeepSpeed
102
 
103
  - **Instruction tuning:**
104
  - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
105
  - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
106
 
107
  ## Tokenizer
 
108
The tokenizer of this model is based on the [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
109
- The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
110
- Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure.
 
111
- **Model:** Hugging Face Fast Tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
112
- - **Training algorithm:** SentencePiece Unigram byte-fallback
113
  - **Training data:** A subset of the datasets for model pre-training
114
- - **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
 
115
 
116
 
117
  ## Datasets
@@ -121,33 +111,33 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
121
  The models have been pre-trained using a blend of the following datasets.
122
 
123
  | Language | Dataset | Tokens|
124
- |:---:|:---:|:---:|
125
- |Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.5B
126
- ||[mC4](https://huggingface.co/datasets/mc4)|136B
127
- |English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|5B
128
- ||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|135B
129
- |Codes|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|10B
130
-
131
- The pre-training was continuously conducted using a total of 10 folds of non-overlapping data, each consisting of approximately 27-28B tokens.
132
- We finalized the pre-training with additional (potentially) high-quality 27B tokens data obtained from the identical source datasets listed above used for the 10-fold data.
133
 
134
  ### Instruction tuning
135
 
136
  The models have been fine-tuned on the following datasets.
137
 
138
  | Language | Dataset | description |
139
- |:---|:---:|:---:|
140
- |Japanese|[jaster](https://github.com/llm-jp/llm-jp-eval)| An automatically transformed data from the existing Japanese NLP datasets |
141
- |English|[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)| - |
142
- |Japanese|[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| A translated one by DeepL in LLM-jp |
143
- |English|[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| English subset of [oasst1 dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) |
144
- |Japanese|[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A translated one by DeepL in LLM-jp |
145
- |Japanese|[ichikara_003_001](https://liat-aip.sakura.ne.jp/wp/llm%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%A9%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E4%BD%9C%E6%88%90/)| ichikara-instruction dataset (ver.003-001)
146
- |Japanese|[hh-rlhf-12k-ja](https://huggingface.co/datasets/llm-jp/hh-rlhf-12k-ja)| A translated one by DeepL in LLM-jp |
147
-
148
 
149
  ## Evaluation
150
- You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) for the evaluation.
 
 
 
 
151
 
152
  ## Risks and Limitations
153
 
@@ -165,6 +155,7 @@ llm-jp(at)nii.ac.jp
165
 
166
 
167
  ## Model Card Authors
 
168
  *The names are listed in alphabetical order.*
169
 
170
- Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi Okamoto.
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - en
5
+ - ja
6
  programming_language:
7
+ - C
8
+ - C++
9
+ - C#
10
+ - Go
11
+ - Java
12
+ - JavaScript
13
+ - Lua
14
+ - PHP
15
+ - Python
16
+ - Ruby
17
+ - Rust
18
+ - Scala
19
+ - TypeScript
20
  library_name: transformers
21
  pipeline_tag: text-generation
22
  inference: false
 
 
 
 
 
23
  ---
 
24
  # llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0
25
 
26
  This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
27
 
28
  | Model Variant |
29
  | :--- |
30
+ |**Instruction models**|
31
+ | [llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
32
+ | [llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
33
+ | [llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
 
 
 
 
 
 
 
34
 
35
 
36
  | |
37
  | :--- |
38
  |**Pre-trained models**|
39
+ | [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |
40
+
41
+ Checkpoints format: Hugging Face Transformers
42
 
43
 
44
  ## Required Libraries and Their Versions
45
 
46
+ - torch>=2.3.0
47
+ - transformers>=4.40.1
48
+ - tokenizers>=0.19.1
49
+ - accelerate>=0.29.3
50
+ - flash-attn>=2.5.8
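
To confirm that an environment satisfies these requirements, the following is a minimal sketch (not part of the original card; it assumes `importlib.metadata` and the `packaging` package, which ships with pip, are available):

```python
# Hypothetical helper: check that installed packages satisfy the minimum versions listed above.
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

requirements = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}

for name, minimum in requirements.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (requires >={minimum})")
        continue
    status = "OK" if Version(installed) >= Version(minimum) else "too old, requires >= " + minimum
    print(f"{name}: {installed} ({status})")
```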
51
 
52
  ## Usage
53
 
54
  ```python
55
  import torch
56
  from transformers import AutoTokenizer, AutoModelForCausalLM
57
+ tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0")
58
+ model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0", device_map="auto", torch_dtype=torch.float16)
59
+ text = "自然言語処理とは何か"  # "What is natural language processing?"
60
  tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
61
  with torch.no_grad():
62
  output = model.generate(
63
  tokenized_input,
64
+ max_new_tokens=100,
65
  do_sample=True,
66
  top_p=0.95,
67
  temperature=0.7,
68
+ repetition_penalty=1.05,
69
  )[0]
70
  print(tokenizer.decode(output))
71
  ```
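
The same settings can also be expressed through the high-level `pipeline` API of transformers. This is an optional sketch rather than the official usage example; the model ID, prompt, and sampling parameters are copied from the snippet above.

```python
import torch
from transformers import pipeline

# Same model and sampling settings as above, via the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0",
    device_map="auto",
    torch_dtype=torch.float16,
)
result = generator(
    "自然言語処理とは何か",  # "What is natural language processing?"
    max_new_tokens=100,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
    repetition_penalty=1.05,
)
print(result[0]["generated_text"])
```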
 
74
  ## Model Details
75
 
76
  - **Model type:** Transformer-based Language Model
77
+ - **Total seen tokens:** 256B
78
 
79
  |Model|Params|Layers|Hidden size|Heads|Context length|
80
  |:---:|:---:|:---:|:---:|:---:|:---:|
81
+ |13b model|13b|40|5120|40|4096|
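
These figures can be cross-checked against the released configuration. The snippet below is an illustrative sketch; it assumes the checkpoint exposes the standard transformers config attribute names.

```python
from transformers import AutoConfig

# Load the configuration of this checkpoint and print the architecture fields
# that correspond to the table above.
config = AutoConfig.from_pretrained(
    "llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0"
)
print("layers:", config.num_hidden_layers)                # 40 per the table
print("hidden size:", config.hidden_size)                 # 5120 per the table
print("attention heads:", config.num_attention_heads)     # 40 per the table
print("context length:", config.max_position_embeddings)  # 4096 per the table
```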
 
82
 
83
 
84
  ## Training
85
 
86
  - **Pre-training:**
87
+ - **Hardware:** 128 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
88
+ - **Software:** Megatron-LM
89
 
90
  - **Instruction tuning:**
91
  - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
92
  - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
93
 
94
  ## Tokenizer
95
+
96
The tokenizer of this model is based on the [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
97
+ The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
98
+ Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (note that pure SentencePiece training does not reproduce our vocabulary).
99
+
100
- **Model:** Hugging Face Fast Tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
101
+ - **Training algorithm:** Merging the Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating the scores with the EM algorithm.
102
  - **Training data:** A subset of the datasets for model pre-training
103
+ - **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
104
+ - The actual vocabulary size of the pre-trained model is 97,024, rounded up to a multiple of 256.
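
The vocabulary size and the byte-fallback behaviour can be inspected directly from the released tokenizer. The following is a small sketch, assuming the tokenizer files shipped with this model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0"
)
# Should report the 96,867-entry vocabulary described above (the model's embedding
# matrix is padded to 97,024).
print(len(tokenizer))

# Byte fallback: characters missing from the vocabulary are split into byte tokens,
# so arbitrary text round-trips through encode/decode without loss.
sample = "自然言語処理 (NLP) 🤗"
ids = tokenizer.encode(sample, add_special_tokens=False)
print(ids)
print(tokenizer.decode(ids))
```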
105
 
106
 
107
  ## Datasets
 
111
  The models have been pre-trained using a blend of the following datasets.
112
 
113
  | Language | Dataset | Tokens|
114
+ |:---|:---|---:|
115
+ |Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B
116
+ ||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)|130.7B
117
+ |English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B
118
+ ||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|110.3B
119
+ |Codes|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|8.7B
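
As a quick consistency check (an editorial aside using only the figures in this table), the per-dataset token counts sum to roughly the 256B total seen tokens reported under Model Details:

```python
# Token counts in billions, copied from the table above.
tokens_in_billions = {
    "Japanese Wikipedia": 1.4,
    "Japanese Common Crawl": 130.7,
    "English Wikipedia": 4.7,
    "The Pile": 110.3,
    "The Stack": 8.7,
}
print(round(sum(tokens_in_billions.values()), 1))  # 255.8 ≈ 256B total seen tokens
```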
 
 
 
120
 
121
  ### Instruction tuning
122
 
123
  The models have been fine-tuned on the following datasets.
124
 
125
  | Language | Dataset | description |
126
+ |:---|:---|:---|
127
+ |Japanese|[ichikara-instruction-004-001](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
128
+ | |[answer-carefully-001](https://liat-aip.sakura.ne.jp/wp/answercarefully-dataset/)| A manually constructed Japanese instruction dataset focusing on LLMs' safety |
129
+ | |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL |
130
+ | |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
131
+ | |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
132
+ |English |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
133
+ | |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
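
The Hugging Face-hosted entries in this table can be pulled with the `datasets` library. The sketch below is illustrative only; the `datasets` package is not listed among the required libraries above and is an extra assumption.

```python
from datasets import load_dataset

# Example: load one of the Japanese instruction-tuning datasets listed above
# and inspect its splits and features.
dolly_ja = load_dataset("llm-jp/databricks-dolly-15k-ja")
print(dolly_ja)
```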
 
134
 
135
  ## Evaluation
136
+
137
+ You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.3.0) for the evaluation.
138
+
139
+ In addition, we used the LLM-as-a-judge frameworks [Japanese Vicuna QA Benchmark](https://github.com/ku-nlp/ja-vicuna-qa-benchmark/) and [Japanese MT Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge) for evaluation.
140
+ For details, please refer to [our technical blog](https://llm-jp.nii.ac.jp/blog/2024/04/30/v2.0-release.html) (in Japanese).
141
 
142
  ## Risks and Limitations
143
 
 
155
 
156
 
157
  ## Model Card Authors
158
+
159
  *The names are listed in alphabetical order.*
160
 
161
+ Namgi Han, Tatsuya Hiraoka, Hirokazu Kiyomaru, Takashi Kodama, and Hiroshi Matsuda.