---
license: lgpl-3.0
---
**BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages**
### Large-scale Parallel Dataset Construction
To enhance the language capabilities of the Chinese LLaMA model so that it supports 102 languages, we constructed a large-scale parallel corpus covering 102 languages and used it to continue training the foundation model. The dataset draws on multiple sources, including widely available public parallel corpora and in-house datasets. The public datasets used in our study include IWSLT, WMT, CCMT, and OPUS-100, which form the initial corpus of our dataset.
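
Below is a minimal sketch of how such corpora might be merged into a single deduplicated dataset. The tab-separated "source TAB target" file layout and the file names are illustrative assumptions, not the actual BigTrans data format.

```python
# Sketch: merge parallel corpora from several sources, dropping exact
# duplicate sentence pairs. The .tsv layout is an assumption for
# illustration; the real corpora ship in various formats.
from pathlib import Path

def load_pairs(path):
    """Yield (source, target) sentence pairs from a tab-separated file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and all(parts):
                yield tuple(parts)

def merge_corpora(paths):
    """Merge corpus files in order, keeping the first copy of each pair."""
    seen, merged = set(), []
    for path in paths:
        for pair in load_pairs(path):
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged

# Hypothetical files, e.g. iwslt.tsv, wmt.tsv, ccmt.tsv, opus100.tsv
corpus = merge_corpora(sorted(Path("data").glob("*.tsv")))
print(f"{len(corpus)} unique sentence pairs")
```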

To illustrate the composition of the corpus, we present a visualization of the language-pair distribution within the multilingual datasets. The imbalance between high-resource and low-resource language pairs remains a prominent concern in the current corpus.
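
As a rough illustration, this kind of skew can be inspected with a simple plot; the counts below are invented placeholders, not our corpus statistics.

```python
# Sketch: visualize sentence-pair counts per language pair. A log scale
# makes the gap between high- and low-resource pairs visible.
import matplotlib.pyplot as plt

corpus_sizes = {  # illustrative numbers only
    "en-de": 45_000_000, "en-zh": 38_000_000,
    "en-sw": 150_000, "en-yo": 40_000,
}

pairs = sorted(corpus_sizes, key=corpus_sizes.get, reverse=True)
plt.bar(pairs, [corpus_sizes[p] for p in pairs])
plt.yscale("log")
plt.ylabel("Sentence pairs")
plt.title("Language-pair distribution")
plt.tight_layout()
plt.show()
```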

### Incremental Multilingual Pre-training
In our incremental pre-training method, we expose the model to language pairs in a curriculum-like manner. Initially, the model is trained on high-resource language pairs, allowing it to establish a solid foundation in those languages. We then progressively introduce low-resource language pairs, enabling the model to gradually expand its knowledge and proficiency in these languages.

Specifically, our incremental pre-training method follows three steps. First, we set the sample-interval size and divide the language pairs into distinct intervals according to the number of instances per language pair. Second, we compute the sample mean over all language pairs in each interval. Third, we dynamically determine when to add the language-pair samples of the next interval based on the sample mean of the previous interval. These three steps are detailed in the paper linked below.
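
The following sketch shows one way such an interval schedule could work. The bucketing rule and the trigger for adding the next interval are our assumptions for illustration; the paper defines the exact procedure.

```python
# Sketch of the three-step interval curriculum described above.
def build_intervals(pair_counts, interval_size):
    """Step 1: bucket language pairs into intervals by instance count."""
    buckets = {}
    for pair, n in pair_counts.items():
        buckets.setdefault(n // interval_size, []).append(pair)
    # High-resource intervals are trained first, so sort descending.
    return [buckets[i] for i in sorted(buckets, reverse=True)]

def interval_means(pair_counts, intervals):
    """Step 2: sample mean (average instance count) per interval."""
    return [sum(pair_counts[p] for p in ivl) / len(ivl) for ivl in intervals]

def active_pairs(pair_counts, interval_size, samples_seen):
    """Step 3 (assumed trigger rule): the next interval joins once the
    number of samples trained so far exceeds the cumulative sample means
    of the intervals introduced before it."""
    intervals = build_intervals(pair_counts, interval_size)
    means = interval_means(pair_counts, intervals)
    active, threshold = [], 0.0
    for ivl, mean in zip(intervals, means):
        active.extend(ivl)
        threshold += mean
        if samples_seen < threshold:
            break
    return active

counts = {"en-de": 9_000, "en-fr": 8_500, "en-sw": 1_200, "en-yo": 300}
print(active_pairs(counts, interval_size=1_000, samples_seen=0))       # high-resource only
print(active_pairs(counts, interval_size=1_000, samples_seen=20_000))  # all intervals active
```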

### Experiments
To verify the effectiveness of our BigTrans model, we conduct preliminary multilingual translation experiments on all 102 languages, comparing BigTrans with both Google Translate and ChatGPT. Since the automatic metric BLEU is often criticized for its poor correlation with human judgments of machine translation quality, we additionally employ GPT-4, which shows a high correlation with human evaluation, as the evaluator, and we design well-defined prompts that make GPT-4 act like a human evaluator. The experiments show that BigTrans performs comparably with Google Translate and ChatGPT in many languages, and even outperforms ChatGPT in 8 language pairs.
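
For context, a GPT-4-based evaluation call might look like the sketch below, using the OpenAI Python SDK (v1 client). The prompt wording and the 1-5 scale are illustrative assumptions; the well-defined prompts we actually use are described in the paper.

```python
# Sketch: ask GPT-4 to rate a translation like a human evaluator.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a professional translation evaluator.\n"
    "Source ({src_lang}): {src}\n"
    "Translation ({tgt_lang}): {hyp}\n"
    "Rate the adequacy and fluency of the translation on a 1-5 scale "
    "and reply with the score only."
)

def gpt4_score(src_lang, tgt_lang, src, hyp):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang, src=src, hyp=hyp)}],
        temperature=0,  # deterministic scoring
    )
    return resp.choices[0].message.content.strip()

print(gpt4_score("German", "English", "Das Wetter ist schön.", "The weather is nice."))
```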

**More details can be found at https://github.com/ZNLP/BigTrans and https://arxiv.org/abs/2305.18098**