Distributed Training: Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker • Blog post • Apr 8, 2021
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models • Paper 2407.01906 • Published Jul 2, 2024
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems • Paper 2407.01370 • Published Jul 1, 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale • Paper 2406.17557 • Published Jun 25, 2024
How Do Large Language Models Acquire Factual Knowledge During Pretraining? • Paper 2406.11813 • Published Jun 17, 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model • Paper 2405.04434 • Published May 7, 2024
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • Paper 2404.02258 • Published Apr 2, 2024
Improved Baselines with Visual Instruction Tuning • Paper 2310.03744 • Published Oct 5, 2023
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models • Paper 2402.13064 • Published Feb 20, 2024