@@ -1,83 +1,71 @@
 ---
 license: apache-2.0
 language:
-- en
-- ja
 programming_language:
-- C
-- C++
-- C#
-- Go
-- Java
-- JavaScript
-- Lua
-- PHP
-- Python
-- Ruby
-- Rust
-- Scala
-- TypeScript
 library_name: transformers
 pipeline_tag: text-generation
 inference: false
-datasets:
-- databricks/databricks-dolly-15k
-- llm-jp/databricks-dolly-15k-ja
-- llm-jp/oasst1-21k-en
-- llm-jp/oasst1-21k-ja
 ---
 # llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0
 This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
 | Model Variant |
 | :--- |
-|**Instruction models ver1.1**|
-| [llm-jp-13b-dpo-lora-hh_rlhf_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-dpo-lora-hh_rlhf_ja-v1.1)|
-| [llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1) |
-| [llm-jp-13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1) |
-|**Instruction models ver1.0**|
-| [llm-jp-13b-instruct-full-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-v1.0) |
-| [llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0) |
-| [llm-jp-13b-instruct-full-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-oasst-v1.0) |
-| [llm-jp-13b-instruct-lora-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-jaster-v1.0) |
-| [llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0) |
-| [llm-jp-13b-instruct-lora-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-dolly-oasst-v1.0) |
 |  |
 | :--- |
 |**Pre-trained models**|
-| [llm-jp-13b-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-v1.0) |
-| [llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) |
-Checkpoints format: Hugging Face Transformers (Megatron-DeepSpeed format models are available [here](https://huggingface.co/llm-jp/llm-jp-13b-v1.0-mdsfmt))
 ## Required Libraries and Their Versions
-- torch>=2.0.0
-- transformers>=4.34.0
-- tokenizers>=0.14.0
-- accelerate==0.23.0
 ## Usage
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1")
-model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1", device_map="auto", torch_dtype=torch.float16)
-text = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:\n".format(instruction="自然言語処理とは何か")
 tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
 with torch.no_grad():
     output = model.generate(
         tokenized_input,
-        max_new_tokens=512,
         do_sample=True,
         top_p=0.95,
         temperature=0.7,
-        repetition_penalty=1.1,
     )[0]
 print(tokenizer.decode(output))
 ```
@@ -86,32 +74,34 @@ print(tokenizer.decode(output))
 ## Model Details
 - **Model type:** Transformer-based Language Model
-- **Total seen tokens:** 300B
 |Model|Params|Layers|Hidden size|Heads|Context length|
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|13b model|13b|40|5120|40|2048|
-|1.3b model|1.3b|24|2048|16|2048|
 ## Training
 - **Pre-training:**
-  - **Hardware:** 96 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
-  - **Software:** Megatron-DeepSpeed
 - **Instruction tuning:**
   - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
   - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 ## Tokenizer
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
-The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
-Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure.
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
-- **Training algorithm:** SentencePiece Unigram byte-fallback
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
 ## Datasets
@@ -121,33 +111,33 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 The models have been pre-trained using a blend of the following datasets.
 | Language | Dataset | Tokens|
-|:---:|:---:|:---:|
-|Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.5B
-||[mC4](https://huggingface.co/datasets/mc4)|136B
-|English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|5B
-||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|135B
-|Codes|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|10B
-The pre-training was continuously conducted using a total of 10 folds of non-overlapping data, each consisting of approximately 27-28B tokens.
-We finalized the pre-training with additional (potentially) high-quality 27B tokens data obtained from the identical source datasets listed above used for the 10-fold data.
 ### Instruction tuning
 The models have been fine-tuned on the following datasets.
 | Language | Dataset | Description |
-|:---|:---:|:---:|
-|Japanese|[jaster](https://github.com/llm-jp/llm-jp-eval)| An automatically transformed data from the existing Japanese NLP datasets |
-|English|[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)| - |
-|Japanese|[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| A translated one by DeepL in LLM-jp |
-|English|[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| English subset of [oasst1 dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) |
-|Japanese|[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A translated one by DeepL in LLM-jp |
-|Japanese|[ichikara_003_001](https://liat-aip.sakura.ne.jp/wp/llm%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%A9%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E4%BD%9C%E6%88%90/)| ichikara-instruction dataset (ver.003-001)
-|Japanese|[hh-rlhf-12k-ja](https://huggingface.co/datasets/llm-jp/hh-rlhf-12k-ja)| A translated one by DeepL in LLM-jp |
 ## Evaluation
-You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) for the evaluation.
 ## Risks and Limitations
@@ -165,6 +155,7 @@ llm-jp(at)nii.ac.jp
 ## Model Card Authors
 *The names are listed in alphabetical order.*
-Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takashi Kodama, Takumi Okamoto.

 ---
 license: apache-2.0
 language:
+  - en
+  - ja
 programming_language:
+  - C
+  - C++
+  - C#
+  - Go
+  - Java
+  - JavaScript
+  - Lua
+  - PHP
+  - Python
+  - Ruby
+  - Rust
+  - Scala
+  - TypeScript
 library_name: transformers
 pipeline_tag: text-generation
 inference: false
 ---
 # llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0
 This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
 | Model Variant |
 | :--- |
+|**Instruction models**|
+| [llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
+| [llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
+| [llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
 |  |
 | :--- |
 |**Pre-trained models**|
+| [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |
+Checkpoints format: Hugging Face Transformers
 ## Required Libraries and Their Versions
+- torch>=2.3.0
+- transformers>=4.40.1
+- tokenizers>=0.19.1
+- accelerate>=0.29.3
+- flash-attn>=2.5.8
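
One way to confirm that an environment meets these minimums is a small check like the sketch below. It is not part of the LLM-jp tooling: the `MINIMUMS` table simply mirrors the list above, and `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers. They compare only the leading numeric release components, so a local build tag such as `+cu121` or a pre-release suffix such as `rc1` is ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUMS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}


def parse_version(version: str) -> tuple[int, ...]:
    """Return the leading dotted numeric components, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric release components."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '2.3' and '2.3.0' compare equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict[str, str]:
    """Map each required package to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Running the script prints one status line per package; anything other than `ok` suggests upgrading that package before loading the model.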
 ## Usage
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0")
+model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0", device_map="auto", torch_dtype=torch.float16)
+text = "自然言語処理とは何か"
 tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
 with torch.no_grad():
     output = model.generate(
         tokenized_input,
+        max_new_tokens=100,
         do_sample=True,
         top_p=0.95,
         temperature=0.7,
+        repetition_penalty=1.05,
     )[0]
 print(tokenizer.decode(output))
 ```
 ## Model Details
 - **Model type:** Transformer-based Language Model
+- **Total seen tokens:** 256B
 |Model|Params|Layers|Hidden size|Heads|Context length|
 |:---:|:---:|:---:|:---:|:---:|:---:|
+|13b model|13b|40|5120|40|4096|
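
As a rough sanity check on the table above, the sketch below derives the per-head dimension and a back-of-the-envelope parameter count. The per-head dimension follows directly from the table. The parameter estimate is an assumption, not a figure from LLM-jp: it uses the common rule of thumb of about 12 × hidden² weights per Transformer layer, assumes separate input and output embedding matrices, ignores biases and normalization weights, and takes the padded vocabulary size of 97,024 from the Tokenizer section.

```python
# Values from the Model Details table.
LAYERS = 40
HIDDEN_SIZE = 5120
HEADS = 40

# Padded vocabulary size stated in the Tokenizer section.
VOCAB_SIZE = 97_024


def head_dim(hidden_size: int, heads: int) -> int:
    """Per-head dimension; the hidden size must divide evenly across heads."""
    if hidden_size % heads != 0:
        raise ValueError("hidden size is not divisible by the number of heads")
    return hidden_size // heads


def approx_params(layers: int, hidden_size: int, vocab_size: int) -> int:
    """Rule-of-thumb estimate (assumption): 12 * hidden^2 per layer plus two embedding matrices."""
    block_params = 12 * layers * hidden_size ** 2
    embedding_params = 2 * vocab_size * hidden_size
    return block_params + embedding_params


if __name__ == "__main__":
    params = approx_params(LAYERS, HIDDEN_SIZE, VOCAB_SIZE)
    print(f"head dimension: {head_dim(HIDDEN_SIZE, HEADS)}")        # 128
    print(f"approximate parameters: {params / 1e9:.1f}B")           # 13.6B
    # float16 stores each weight in 2 bytes.
    print(f"approximate float16 weights: {params * 2 / 1e9:.1f} GB")  # 27.2 GB
```

The estimate lands close to the nominal 13b, and the float16 figure gives a first idea of how much accelerator memory the weights alone need when loading with `torch_dtype=torch.float16`.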
 ## Training
 - **Pre-training:**
+  - **Hardware:** 128 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
+  - **Software:** Megatron-LM
 - **Instruction tuning:**
   - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
   - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 ## Tokenizer
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
+The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training alone does not reproduce our vocabulary).
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
+- **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
+- **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
+  - The actual vocabulary size in the pre-trained model is 97,024 because it is rounded up to a multiple of 256.
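
The two vocabulary sizes above are related by a simple round-up to the next multiple of 256. The sketch below reproduces that arithmetic; `round_up` is a hypothetical helper written for illustration, not a function from the LLM-jp codebase.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple` using ceiling division."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-value // multiple) * multiple


TOKENIZER_VOCAB = 96_867  # vocabulary size of the tokenizer
PADDED_VOCAB = round_up(TOKENIZER_VOCAB, 256)

print(PADDED_VOCAB)                    # 97024
print(PADDED_VOCAB - TOKENIZER_VOCAB)  # 157
```

In other words, the model's vocabulary is 157 entries larger than the tokenizer's.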
 ## Datasets
 The models have been pre-trained using a blend of the following datasets.
 | Language | Dataset | Tokens|
+|:---|:---|---:|
+|Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B|
+||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)|130.7B|
+|English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B|
+||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|110.3B|
+|Code|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|8.7B|
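
The per-dataset counts in the table add up to 255.8B tokens, which is consistent with the 256B "Total seen tokens" listed under Model Details. The sketch below reproduces that sum and the share of each language; the dictionary layout and function names are illustrative only, and the numbers are copied from the table.

```python
from collections import defaultdict

# Token counts in billions, copied from the pre-training table.
PRETRAINING_TOKENS_B = {
    ("Japanese", "Wikipedia"): 1.4,
    ("Japanese", "Common Crawl"): 130.7,
    ("English", "Wikipedia"): 4.7,
    ("English", "The Pile"): 110.3,
    ("Code", "The Stack"): 8.7,
}


def total_tokens_b() -> float:
    """Total pre-training tokens in billions, rounded to one decimal place."""
    return round(sum(PRETRAINING_TOKENS_B.values()), 1)


def tokens_by_language() -> dict[str, float]:
    """Sum the token counts (in billions) for each language."""
    totals: dict[str, float] = defaultdict(float)
    for (language, _dataset), tokens in PRETRAINING_TOKENS_B.items():
        totals[language] += tokens
    return {language: round(tokens, 1) for language, tokens in totals.items()}


def language_shares() -> dict[str, float]:
    """Fraction of the total token count contributed by each language."""
    total = total_tokens_b()
    return {language: tokens / total for language, tokens in tokens_by_language().items()}


if __name__ == "__main__":
    print(f"total: {total_tokens_b()}B")  # total: 255.8B
    for language, share in language_shares().items():
        print(f"{language}: {share:.1%}")
```

By this count, Japanese and English text each make up roughly half of the blend, with source code contributing a few percent.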
 ### Instruction tuning
 The models have been fine-tuned on the following datasets.
 | Language | Dataset | Description |
+|:---|:---|:---|
+|Japanese|[ichikara-instruction-004-001](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
+|        |[answer-carefully-001](https://liat-aip.sakura.ne.jp/wp/answercarefully-dataset/)| A manually constructed Japanese instruction dataset focusing on LLMs' safety |
+|        |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL  |
+|        |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
+|        |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
+|English |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
+|        |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
 ## Evaluation
+You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.3.0) for the evaluation.
+In addition, we used two LLM-as-a-judge frameworks, [Japanese Vicuna QA Benchmark](https://github.com/ku-nlp/ja-vicuna-qa-benchmark/) and [Japanese MT Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge), for evaluation.
+For details, please refer to [our technical blog](https://llm-jp.nii.ac.jp/blog/2024/04/30/v2.0-release.html) (in Japanese).
 ## Risks and Limitations
 ## Model Card Authors
 *The names are listed in alphabetical order.*
+Namgi Han, Tatsuya Hiraoka, Hirokazu Kiyomaru, Takashi Kodama, and Hiroshi Matsuda.