<div align="center">
  <img src="./images/通古logo.png" width="400"/>
</div>

# TongGu LLM

## Introduction

TongGu is a classical Chinese large language model developed by the Deep Learning and Visual Computing Laboratory (SCUT-DLVCLab) at South China University of Technology, with strong capabilities in understanding and processing ancient texts. TongGu is trained with multi-stage instruction fine-tuning and introduces Redundancy-Aware Tuning (RAT), a method that largely preserves the capabilities of the base model while improving performance on downstream tasks.

<div align="center">
  <img src="./images/model_training.png">
</div>
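The core idea behind RAT is to fine-tune only the layers that change their inputs the least (the "redundant" ones), leaving the rest frozen so the base model's abilities are preserved. The following is a minimal, illustrative sketch of that idea on a toy residual network; it is not the authors' implementation, and the redundancy score (input/output cosine similarity) and layer count are assumptions for demonstration only:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A tiny residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.tanh(self.linear(x))

dim, n_layers = 16, 6
torch.manual_seed(0)
blocks = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
x = torch.randn(8, dim)  # a batch of sample activations

# Score each layer's redundancy: a high cosine similarity between a
# layer's input and output means the layer changes its input little,
# so we treat it as redundant.
scores = []
h = x
with torch.no_grad():
    for blk in blocks:
        out = blk(h)
        cos = torch.nn.functional.cosine_similarity(h, out, dim=-1).mean()
        scores.append(cos.item())
        h = out

# Fine-tune only the k most redundant layers; freeze everything else
# so the base model's capabilities are largely retained.
k = 2
redundant = sorted(range(n_layers), key=lambda i: scores[i], reverse=True)[:k]
for i, blk in enumerate(blocks):
    for p in blk.parameters():
        p.requires_grad = i in redundant

trainable = [i for i, blk in enumerate(blocks)
             if all(p.requires_grad for p in blk.parameters())]
print(trainable)  # indices of the layers that would be fine-tuned
```

In a real setting the redundancy scores would be computed on hidden states of the actual base model over a calibration set, and the frozen/trainable split applied before instruction fine-tuning.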


## Evaluation

TongGu surpasses existing models across a wide range of classical Chinese understanding and processing tasks. A comparison with its base model, Baichuan2-7B-Chat, demonstrates the effectiveness of TongGu's training pipeline. Going forward, TongGu will continue to be updated and will benefit from stronger base models as they become available.


<div align="center">
  <img src="./images/evaluation_table.png">
</div>

<div align="center">
  <img src="./images/evaluation_table2.png" width="600">
</div>

# Open-source List

## Model

[**TongGu-7B-Instruct**](https://huggingface.co/SCUT-DLVCLab/TongGu-7B-Instruct): a 7B classical Chinese language model based on Baichuan2-7B-Base. It underwent unsupervised incremental pre-training on a corpus of 2.41 billion tokens of classical Chinese text and was fine-tuned on 4 million classical Chinese dialogue examples, supporting tasks such as ancient-text annotation, translation, and appreciation.


## Data

**ACCN-INS**: 4 million classical Chinese instruction examples, covering 24 tasks across three dimensions: ancient-text understanding, generation, and knowledge.

The ACCN-INS dataset may only be used for non-commercial research purposes. Scholars or organizations who want to use the ACCN-INS dataset should first fill in this [Application Form](https://github.com/SCUT-DLVCLab/TongGu-LLM/blob/main/application-form/Application-Form-for-Using-ACCN-INS.docx) and email it to us. When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to demonstrate that you (or your team) conduct research in fields related to classical Chinese.
We will send you the download link and the decompression password once your application has been received and approved.
All users must comply with all conditions of use; otherwise, the authorization will be revoked.


# News

- 2024/9/21 The TongGu paper was accepted to EMNLP 2024.
- 2024/9/26 The TongGu model and instruction data were open-sourced.


# Examples

<details><summary><b>Punctuation (句读)</b></summary>
    
![image](./images/标点.png)

</details>

<details><summary><b>Idiom Explanation (成语解释)</b></summary>
    
![image](./images/成语解释.png)

</details>

<details><summary><b>Classical-to-Vernacular Translation (文白翻译)</b></summary>
    
![image](./images/文白翻译.png)

</details>

<details><summary><b>Vernacular-to-Classical Translation (白文翻译)</b></summary>
    
![image](./images/白文翻译.png)

</details>

<details><summary><b>Poetry Composition (诗词创作)</b></summary>
    
![image](./images/词创作.png)

</details>


# Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "SCUT-DLVCLab/TongGu-7B-Instruct"

# Load the model and tokenizer; trust_remote_code is required for the
# custom Baichuan2-based architecture.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

system_message = "你是通古,由华南理工大学DLVCLab训练而来的古文大模型。你具备丰富的古文知识,为用户提供有用、准确的回答。"
user_query = "翻译成白话文:大学之道,在明明德,在亲民,在止于至善。"
prompt = f"{system_message}\n<用户> {user_query}\n<通古> "

inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(
    inputs.input_ids.to(model.device),  # move inputs to the model's device
    max_new_tokens=128,
)
# Decode, then strip the prompt so only the newly generated reply remains.
generate_text = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0][len(prompt):]

print(generate_text)
```
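For repeated queries, the prompt construction used above can be factored into a small helper. The template is the single-turn format shown in the snippet; the function name itself is illustrative, not part of the model's API:

```python
def build_prompt(user_query: str) -> str:
    """Build a single-turn TongGu prompt in the README's format:
    system message, then `<用户> ` (user) turn, then an open `<通古> `
    (TongGu) turn for the model to complete."""
    system_message = (
        "你是通古,由华南理工大学DLVCLab训练而来的古文大模型。"
        "你具备丰富的古文知识,为用户提供有用、准确的回答。"
    )
    return f"{system_message}\n<用户> {user_query}\n<通古> "

prompt = build_prompt("翻译成白话文:大学之道,在明明德,在亲民,在止于至善。")
print(prompt)
```

The trailing space after `<通古>` is kept intact, since the model completes the text immediately after that marker.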


# Citation

```
@inproceedings{cao2024tonggu,
  title={TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models},
  author={Cao, Jiahuan and Peng, Dezhi and Zhang, Peirong and Shi, Yongxin and Liu, Yang and Ding, Kai and Jin, Lianwen},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

# Statement

After extensive incremental pre-training and instruction fine-tuning, TongGu has strong capabilities for processing ancient texts, such as punctuation and translation. However, due to limits on model size and the autoregressive generation paradigm, TongGu may still produce misleading replies containing factual errors, or harmful content involving bias or discrimination. Please use it with caution and take care to identify such content. Do not spread harmful content generated by TongGu on the Internet; anyone who disseminates such content bears responsibility for any adverse consequences.