Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how - and why🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we propose RegMix, an automatic data mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark - and it needs only 2% extra training FLOPs! 📈
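To make the "data mixture as regression" idea concrete, here is a minimal sketch on synthetic numbers. It is not the RegMix code: the three domains, the number of proxy runs, the fake loss function, and the plain least-squares fit are all assumptions for illustration, and the preprint describes the actual regression models and setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains = 3      # hypothetical split, e.g. web / code / books
n_proxy_runs = 64  # hypothetical number of small proxy models

# 1) Sample random data mixtures; every row sums to 1.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)

# 2) Stand-in for "train a small model on each mixture and measure its loss".
#    Synthetic assumption: loss is linear in the mixture, plus a little noise.
true_coef = np.array([2.0, 3.0, 3.5])
losses = mixtures @ true_coef + rng.normal(0.0, 0.01, size=n_proxy_runs)

# 3) Fit a regression from mixture weights to the target metric.
#    The weights sum to 1, so no separate intercept term is needed.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)

# 4) Score many candidate mixtures with the fitted model and keep the best one.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
predicted = candidates @ coef
best_mixture = candidates[np.argmin(predicted)]

print("fitted coefficients:", coef.round(2))
print("best predicted mixture:", best_mixture.round(3))
```

Because this toy loss is linear, the search simply pushes almost all weight onto the cheapest domain; a non-linear regressor would be needed to capture interactions between domains.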

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix
posted an update 5 months ago
Introducing Sailor-14B Model and Sailor2 Project 🚢

We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!

✅Built upon the Qwen1.5-14B model, the Base version follows a procedure similar to that of our Sailor-7B model.
✅The Chat version is optimized using DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.

🏠Home: https://sailorllm.github.io
🤗Model: sail/Sailor-14B-Chat
💻Demo: sail/Sailor-14B-Chat

We're also excited to introduce the Sailor2 project, ✨ an open collaboration opportunity for the entire community! ✨

🌐 The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.

🎯The model will undergo continual pre-training from a base model proficient in both Chinese and English, using nearly 800B SEA tokens. We expect its performance on the above SEA languages to be comparable to that of the most advanced commercial models.

🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.

🌍 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code! 🔍

Let's sail together and enjoy the journey!⚓
✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻Code: https://github.com/sail-sg/sailcraft
🤗Model: sail/sailor-language-models-65e19a749f978976f1959825
📜Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of 4 stages🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

A special focus was given to the data cleaning part of South-East Asian (SEA) languages🌍
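The four stages above can be sketched as a toy pipeline that also reports how many documents survive each stage. This is not the sailcraft code: the function names, the cleaning rule, the shingle-based Jaccard near-deduplication, and the hash-based exact deduplication are simplified stand-ins chosen for illustration.

```python
import hashlib
import re

def clean(docs, min_words=3):
    """Toy cleaning rule: normalize whitespace and drop very short documents."""
    out = []
    for d in docs:
        d = re.sub(r"\s+", " ", d).strip()
        if len(d.split()) >= min_words:
            out.append(d)
    return out

def shingles(doc, n=3):
    """Set of lowercase word n-grams used to compare two documents."""
    words = doc.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def near_dedup(docs, threshold=0.7):
    """Drop a document if its Jaccard similarity to any kept document is >= threshold."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d)
        if all(len(s & t) / len(s | t) < threshold for t in kept_shingles):
            kept.append(d)
            kept_shingles.append(s)
    return kept

def exact_dedup(docs):
    """Drop byte-identical documents, keeping the first occurrence."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def run_pipeline(docs):
    """Run the four stages in order and record the document count after each."""
    stages = [("initial cleaning", clean), ("near dedup", near_dedup),
              ("exact dedup", exact_dedup), ("second cleaning", clean)]
    counts = [("raw", len(docs))]
    for name, fn in stages:
        docs = fn(docs)
        counts.append((name, len(docs)))
    return docs, counts

sample = [
    "Selamat  pagi dunia, apa kabar hari ini",    # extra whitespace
    "Selamat pagi dunia, apa kabar hari ini",     # duplicate once cleaned
    "Selamat pagi dunia, apa kabar hari ini ya",  # near duplicate
    "ok",                                         # too short
    "Xin chào thế giới, hôm nay trời đẹp",
]
final_docs, stage_counts = run_pipeline(sample)
for name, n in stage_counts:
    print(f"{name}: {n}")
```

In this toy version the near-deduplication step already removes identical documents, so the exact-deduplication stage has nothing left to drop; a real pipeline would typically use more scalable methods for both stages.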

# Use Case ✨

With this codebase, you can clean your own dataset and:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, with optimized rules for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay)
✅ Investigate what data was removed at each processing stage
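
As a rough illustration of language-specific rules and of inspecting what was removed, here is a small sketch. The `LangRule` class, the `RULES` table, the thresholds, and the reason labels are all hypothetical; the repo's real configuration format may look quite different.

```python
from dataclasses import dataclass

@dataclass
class LangRule:
    """Hypothetical per-language cleaning rule."""
    min_chars: int = 20
    banned_substrings: tuple = ()

# Hypothetical rule table; languages not listed fall back to DEFAULT_RULE.
RULES = {
    "id": LangRule(min_chars=20, banned_substrings=("klik di sini",)),
    "th": LangRule(min_chars=10),  # Thai is written without spaces, so shorter texts are allowed
}
DEFAULT_RULE = LangRule()

def filter_docs(docs):
    """docs is a list of (lang, text). Returns (kept, removed); removed entries carry a reason."""
    kept, removed = [], []
    for lang, text in docs:
        rule = RULES.get(lang, DEFAULT_RULE)
        if len(text) < rule.min_chars:
            removed.append((lang, text, "too_short"))
        elif any(b in text.lower() for b in rule.banned_substrings):
            removed.append((lang, text, "banned_substring"))
        else:
            kept.append((lang, text))
    return kept, removed

docs = [
    ("id", "Klik di sini untuk membaca berita selengkapnya"),
    ("id", "Hari ini cuaca di Jakarta sangat cerah"),
    ("th", "สวัสดีครับทุกคน"),
    ("en", "too short"),
]
kept, removed = filter_docs(docs)
print("kept:", len(kept))
for lang, text, reason in removed:
    print(f"removed [{lang}] ({reason}): {text}")
```

Keeping the removed documents together with a reason makes it easy to spot rules that are too aggressive for a particular language.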

# Acknowledgement 🙏

The main credit goes to @dreamerdeo , the first author of our Sailor paper ❤️! He put in tremendous effort on the data processing pipeline, enabling the model's great performance. We believe the mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 And this repo would not have been possible without the contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao , and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave any comments on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄
⚓️ Sailor: A New Multilingual Open LLM for South-East Asia 🌏

Last month we released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters, continually pre-trained from the Qwen1.5 models. Based on our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.

Today, we're more than excited to share the key technical details behind the Sailor models! 💪

**Key highlights**:
🔍 Data curation: Merging short examples, document-level code-switching, aggressive data cleaning and deduplication.
🤖 Tokenization Robustness: We find that BPE dropout is really effective at dealing with prompt variations.
🔍 Optimizing Data Mixture: We propose a new approach to automatically balance capabilities across different languages!
🌟 Recipe in Continual Pre-training: We discover a powerful metric that can help predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.
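
For readers new to BPE dropout, here is a minimal sketch of the general technique on a toy merge table. It is not the Sailor tokenizer or training code: the `bpe_encode` function and the merge list are made up for illustration. The idea is that each applicable merge is randomly skipped with probability `dropout`, so the same word can be split in different ways during training.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE merges, skipping each applicable merge with probability `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    tokens = list(word)
    while len(tokens) > 1:
        # Collect the merges that apply at this step, randomly dropping some of them.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Toy merge table (hypothetical), ordered from highest to lowest priority.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(bpe_encode("lower", merges))                # no dropout: full merge
print(bpe_encode("lower", merges, dropout=1.0))   # every merge dropped: characters
print(bpe_encode("lower", merges, dropout=0.3, rng=random.Random(0)))  # a random segmentation
```

With `dropout=0.0` this reduces to ordinary greedy BPE; with values in between, the model sees several segmentations of the same text, which is what helps with variations in how a prompt is tokenized.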

We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope the Sailor models bring truly multilingual LLMs one step closer to the world! 🌍✨

To learn more, please access our research paper or reach out to our team.
🔗 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🧩 Model: sail/sailor-language-models-65e19a749f978976f1959825
💻 Code: https://github.com/sail-sg/sailor-llm