hejunqing committed on
Commit
01ce712
1 Parent(s): a7eb0d6

add sentence piece model, update README

Files changed (2)
  1. README.md +89 -1
  2. spiece.model +3 -0
README.md CHANGED
@@ -1,3 +1,91 @@
  ---
- license: apache-2.0
+ language:
+ - zh
+
+ tags:
+ - Question Answering
+ - Machine Reading
+ - Text Generation
+ - Pretrained Chinese T5-Large model
+
+ datasets:
+ - CMRC 2018 dev
+
+ metrics:
+ - RougeL
+ - BLEU-4
+ - F1
+ - EM
+ - Contain Answer Rate
+
+ license: apache-2.0
  ---
+ # T5 for Chinese Question Answering
+ Randeng-T5-784M-QA-Chinese
+
+ ## Brief Introduction
+ This T5-Large model is the first pretrained generative question-answering model for Chinese on Hugging Face. It was pretrained on the Wudao 180G corpus and then finetuned on two reading-comprehension datasets: a translated Chinese SQuAD and CMRC2018. Given a passage and a question, it generates a fluent and accurate answer.
+
+ ## Performance
+
+ Results on the CMRC 2018 dev set. The original task is span prediction (predicting answer start and end positions); we cast it as a generative QA task.
+
+ | Model                | F1   | EM   | Contain Answer Rate | RougeL | BLEU-4 |
+ |----------------------|------|------|---------------------|--------|--------|
+ | Ours                 | 77.9 | 57.1 | 76.0                | 82.7   | 61.1   |
+ | MacBERT-Large (SOTA) | 88.9 | 70.0 | -                   | -      | -      |
+
+ Our model achieves high generation quality and accuracy: 76% of generated answers contain the ground truth (Contain Answer Rate), which rivals the EM of the span-prediction SOTA MacBERT-Large (70%). Our model's EM is lower because it generates complete sentences while the ground truth is usually a sentence fragment; the high RougeL and BLEU-4 scores reveal the large overlap between generated results and the ground truth.
+ P.S. The SOTA model only predicts start and end positions; this extractive MRC task is much simpler than generative QA.
+
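The distinction drawn above between EM and Contain Answer Rate can be made concrete with a small scoring sketch (the helper names here are our own, not taken from the model's codebase):

```python
def exact_match(pred: str, gold: str) -> bool:
    # Strict string equality after trimming whitespace.
    return pred.strip() == gold.strip()

def contains_answer(pred: str, gold: str) -> bool:
    # A generated sentence counts as correct if the gold span appears anywhere in it.
    return gold.strip() in pred.strip()

def score(preds, golds):
    # Fraction of predictions that exactly match / contain the gold answer.
    n = len(preds)
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    car = sum(contains_answer(p, g) for p, g in zip(preds, golds)) / n
    return em, car

# A generated full sentence fails EM but still contains the gold span:
em, car = score(["北京是中国的首都。"], ["北京"])
```

This is why a generative model can score low on EM while still answering correctly by the Contain Answer Rate measure.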
+ ## Cases
+
+ ![avatar](cases_t5_cmrc.png)
+
+ In the picture, *pred* is the generated result and *target* is the ground truth.
+
+ ## Usage
+ ```python
+ import torch
+ from transformers import T5Tokenizer, MT5ForConditionalGeneration
+
+ pretrain_path = 'IDEA-CCNL/Randeng-T5-784M-QA-Chinese'
+ tokenizer = T5Tokenizer.from_pretrained(pretrain_path)
+ model = MT5ForConditionalGeneration.from_pretrained(pretrain_path)
+
+ max_knowledge_length = 425  # adjust these limits to fit your use case
+ max_seq_length = 512
+ max_target_length = 128
+
+ sample = {"context": "", "question": "", "idx": 1}  # fill in your passage and question
+ plain_text = 'question:' + sample['question'] + 'knowledge:' + sample['context'][:max_knowledge_length]
+
+ # The model fills in <extra_id_0> with the generated answer.
+ res_prefix = tokenizer.encode('answer' + '<extra_id_0>', add_special_tokens=False)
+ l_rp = len(res_prefix)
+
+ tokenized = tokenizer.encode(plain_text, add_special_tokens=False, truncation=True,
+                              max_length=max_seq_length - 2 - l_rp)
+ tokenized += res_prefix
+ tokenized.append(tokenizer.eos_token_id)
+
+ # Generate the answer
+ pred_ids = model.generate(input_ids=torch.tensor([tokenized]), max_new_tokens=max_target_length,
+                           do_sample=True, top_p=0.9)
+ pred = tokenizer.batch_decode(pred_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ ```
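The `question:...knowledge:...` prompt format used above can be factored into a small helper. This is a sketch under our own assumptions: the function name is hypothetical, and it truncates by characters for illustration, whereas the snippet above truncates by tokens:

```python
def build_prompt(question: str, context: str, max_knowledge_chars: int = 425) -> str:
    # Mirrors the input format the model expects:
    # 'question:<question>knowledge:<truncated context>'
    return 'question:' + question + 'knowledge:' + context[:max_knowledge_chars]

prompt = build_prompt('红楼梦的作者是谁', '《红楼梦》的作者是曹雪芹。')
```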
+
+ # Citation
+
+ You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):
+ ```text
+ @misc{Fengshenbang-LM,
+   title={Fengshenbang-LM},
+   author={IDEA-CCNL},
+   year={2021},
+   howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
+ }
+ ```
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c65feffa65ff0378759778193852083d23349cb1b40c906e9463a12f8076ff32
+ size 680811