---
language: zh
tags:
- summarization
inference: False
---

Randeng_Pegasus_523M_Summary is a Chinese summarization model whose code has been merged into [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM).

The 523M-parameter randeng_pegasus_large model was pretrained on 180 GB of Chinese data with sampled gap-sentence ratios and stochastic sampling of important sentences, the same pretraining task as described in the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf).
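
As a rough illustration of that gap-sentence generation (GSG) objective, here is a toy sketch; it is not the actual Fengshenbang-LM preprocessing code, and the ROUGE-based scoring that selects the important sentences is omitted:

```python
# Toy sketch of PEGASUS-style gap-sentence generation (GSG):
# selected "important" sentences are replaced by a mask token in the source,
# and the masked sentences, concatenated, become the generation target.
def make_gsg_example(sentences, important_idx, mask_token="<mask_1>"):
    source = [mask_token if i in important_idx else s
              for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in important_idx]
    return "".join(source), "".join(target)

doc = ["谷爱凌夺得银牌。", "决赛分三轮进行。", "网友纷纷送上祝贺。"]
src, tgt = make_gsg_example(doc, important_idx={0})
# src == "<mask_1>决赛分三轮进行。网友纷纷送上祝贺。"
# tgt == "谷爱凌夺得银牌。"
```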

Unlike the English version of PEGASUS, and because SentencePiece is unstable on Chinese text, we use jieba together with BertTokenizer as the tokenizer in the Chinese PEGASUS model.
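
To make the two-stage idea concrete, here is a minimal sketch; it is for illustration only, is not the model's actual tokenizer, and uses `bert-base-chinese` merely as a stand-in vocabulary:

```python
import jieba
from transformers import BertTokenizer

# jieba first segments the raw text into words, then a BERT-style WordPiece
# tokenizer maps each word to vocabulary ids.
text = "今天天气真好"
words = list(jieba.cut(text, HMM=False))  # e.g. ['今天', '天气', '真好']

# "bert-base-chinese" is only a stand-in; the released model ships its own vocab.txt.
bert_tok = BertTokenizer.from_pretrained("bert-base-chinese")
tokens = [t for w in words for t in bert_tok.tokenize(w)]
ids = bert_tok.convert_tokens_to_ids(tokens)
print(words, tokens, ids)
```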

The model we provide on the Hugging Face Hub is the pretrained model only; it has not been fine-tuned on downstream data yet.

We also pretrained a base-sized model, available at [Randeng_Pegasus_238M_Summary](https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_238M_Summary).

Task: Summarization

## Usage
```python
from typing import List, Optional

import jieba
from transformers import BertTokenizer, PegasusForConditionalGeneration

jieba.initialize()

# You need tokenizers_pegasus.py and data_utils.py from the Fengshenbang-LM GitHub repo,
# or you can download them from https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_523M_Summary/tree/main
# We strongly recommend cloning the Fengshenbang-LM repo:
# 1. git clone https://github.com/IDEA-CCNL/Fengshenbang-LM
# 2. cd Fengshenbang-LM/fengshen/examples/pegasus/
# There you will find tokenizers_pegasus.py and data_utils.py, which the pegasus model needs.
# from tokenizers_pegasus import PegasusTokenizer

class PegasusTokenizer(BertTokenizer):
    # Inline definition so the example is self-contained; it mirrors the
    # tokenizer shipped with the Fengshenbang-LM repo.
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, pre_tokenizer=lambda x: jieba.cut(x, HMM=False), **kwargs):
        self.pre_tokenizer = pre_tokenizer
        super().__init__(pre_tokenizer=self.pre_tokenizer, **kwargs)
        self.add_special_tokens({"additional_special_tokens": ["<mask_1>"]})

    def build_inputs_with_special_tokens(
            self,
            token_ids_0: List[int],
            token_ids_1: Optional[List[int]] = None) -> List[int]:
        # Append the EOS token to the sequence (or sequence pair).
        if token_ids_1 is None:
            return token_ids_0 + [self.eos_token_id]
        return token_ids_0 + token_ids_1 + [self.eos_token_id]

    def _special_token_mask(self, seq):
        all_special_ids = set(self.all_special_ids)  # call it once instead of inside the list comprehension
        # all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special
        return [1 if x in all_special_ids else 0 for x in seq]

    def get_special_tokens_mask(
            self,
            token_ids_0: List[int],
            token_ids_1: Optional[List[int]] = None,
            already_has_special_tokens: bool = False) -> List[int]:
        if already_has_special_tokens:
            return self._special_token_mask(token_ids_0)
        elif token_ids_1 is None:
            return self._special_token_mask(token_ids_0) + [1]  # the appended EOS is a special token
        else:
            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]


model = PegasusForConditionalGeneration.from_pretrained("IDEA-CCNL/randeng_pegasus_523M_summary")
tokenizer = PegasusTokenizer.from_pretrained("path/to/vocab.txt")

text = "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!"
inputs = tokenizer(text, max_length=1024, return_tensors="pt")

# Generate the summary
summary_ids = model.generate(inputs["input_ids"])
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
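
`model.generate` above uses the default decoding settings; it also accepts the usual Hugging Face generation arguments. The values below are illustrative only, not settings recommended by the authors:

```python
# Illustrative decoding settings; tune them for your own data.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=64,       # cap the summary length
    num_beams=4,         # beam search instead of greedy decoding
    early_stopping=True,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```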

## Citation
If you find this resource useful, please cite the following website in your paper.
```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```