m-ric (Aymeric Roucher)

posted an update 1 day ago

Post

1556

📜 𝐎𝐥𝐝-𝐬𝐜𝐡𝐨𝐨𝐥 𝐑𝐍𝐍𝐬 𝐜𝐚𝐧 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐫𝐢𝐯𝐚𝐥 𝐟𝐚𝐧𝐜𝐲 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬!

Researchers from Mila and Borealis AI just have shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs, in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! 🚀

🤔 Why does this matter?

By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

“The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)”

“Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.”

It’s the Bitter lesson by Rich Sutton striking again: don’t need fancy thinking architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)

2 replies

·

posted an update 2 days ago

Post

1080

🇨🇳⛵️ 出海: Chinese AI is expanding globally

Fact: Chinese LLMs are heavily underrated, for instance recently the excellent Deepseek-v2.5 or Qwen models.

Luckily for us, @AdinaY just wrote an excellent blog post explaining the Chinese AI ecosystem!

My key takeaways:

Since Google, OpenAI and Anthropic models are not available in China, local companies are fighting for the market. A really good market - AI has much higher penetration there than in the rest of the world, both with companies and individual users!

💰 But since Deepseek heavily cut prices in May 24, this spiraled into a price war that created a cut-throat environment with unsustainably low prices.

📋 On top of this, the local regulation is stringent: models must undergo licensing from a local censor (the Cyberspace Administration of China), that for instance requires models to refuse answering certain questions on the CCP. Although this is certainly simpler to implement than certain condition of the European AI Act.

💸 If this wasn't enough, VC investment in AI is drying out: By mid-2024, Chinese AI startups raised approximately $4.4 billion, vs $55B for US startups just in Q2 24.

📱 To get profitability companies have shifted from foundational models to model + application, for instance PopAI from [01.AI](http://01.ai/) with millions of users and high profitability.

⛏️ They also try to drill down specific industries: but these niches are also getting crowded.

➡️ Since their home market is becoming both too crowded and unhospitable, Chinese companies are now going for international market, "Sailing abroad" following the expression consacred for Zheng He's legendary journey in 1500.

There, they'll have to adapt to different infrastructures and regulations, but they have bright prospects for growth!

Read her post 👉 https://huggingface.co/blog/AdinaY/chinese-ai-global-expansion

posted an update 3 days ago

Post

1056

Emu3: Next-token prediction conquers multimodal tasks 🔥

This is the most important research in months: we’re now very close to having a single architecture to handle all modalities. The folks at Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.

𝗪𝗵𝗮𝘁'𝘀 𝘁𝗵𝗲 𝗯𝗶𝗴 𝗱𝗲𝗮𝗹?
🌟 Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it’s only 8B, but really strong:
🖼️ For image generation, it's matching the best specialized models out there, like SDXL.
👁️ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
🎬 It's the first to nail video generation without using complicated diffusion techniques.

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗶𝘁 𝘄𝗼𝗿𝗸?
🧩 Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
🔗 Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
🔮 During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.

𝗖𝗮𝘃𝗲𝗮𝘁𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀:
👉 In image generation, Emu3 beats SDXL, but it’s also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT FLUX-dev.
👉 In vision, authors also don’t show a comparison against all the current SOTA models like Qwen-VL or Pixtral.

This approach is exciting because it's simple (next token prediction) and scalable(handles all sorts of data)!

Read the paper 👉 Emu3: Next-Token Prediction is All You Need (2409.18869)

posted an update 4 days ago

Post

1191

𝗔𝗱𝗱 𝘀𝗼𝘂𝗿𝗰𝗲 𝗵𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝗶𝗻𝗴 𝘁𝗼 𝘆𝗼𝘂𝗿 𝗥𝗔𝗚 𝘀𝘆𝘀𝘁𝗲𝗺! 📄💡

RAG systems are supposed to make your LLM's answer more trustworthy, by inserting in the prompt some supporting documents from a knowledge base : we say that we're "adding some context".

👎 But if you don't know which part of the answer has been generated based on which input tokens, it's hard to tell wether it was effectively grounded in the context knowledge or not!

🤔 I've been working on the question: is it possible to add notes to the answer linking to which part of the context they're generated from?

And I've found a great solution: a great technique called Layer-wise Relevance Propagation (LRP), showcased in a paper at ICML `24 by Reduan Achtibat et al allows, allows to precisely score how important each input token was in generating your output! They've made it into a library called LXT.

📊 For each generated output token, LXT gives you attribution scores for each input token.

⚙️ So I've worked a bit more on aggregating these scores into meaningful spans between successive input and output tokens, and I finally obtained my desired result: RAG with source highlighting!

Try the demo here 👉 m-ric/rag_highlights

Caveats:
- It slows down generation (for now quite a lot, could hopefully be reduced a lot)
- For now it supports only specific models: Llama models and Mixtral

If there's enough interest in this solution, I can improve it further and spin it off into a specific library for RAG! 🚀

posted an update 10 days ago

Post

1382

Transformers v4.45.0 released: includes a lightning-fast method to build tools! ⚡️

During user research with colleagues @MoritzLaurer and @Jofthomas , we discovered that the class definition currently in used to define a Tool in
transformers.agents is a bit tedious to use, because it goes in great detail.

➡️ So I’ve made an easier way to build tools: just make a function with type hints + a docstring, and add a @tool decorator in front.

✅ Voilà, you’re good to go!

Read all about it in the new doc here: https://huggingface.co/docs/transformers/main/en/agents#create-a-new-tool

And don’t hesitate to give feedback, I’m all ears! 🤗

posted an update 11 days ago

Post

3169

🌎 𝐓𝐡𝐞 𝐟𝐢𝐫𝐬𝐭 𝐞𝐯𝐞𝐫 𝐅𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧 𝐰𝐞𝐚𝐭𝐡𝐞𝐫 𝐦𝐨𝐝𝐞𝐥: 𝐏𝐫𝐢𝐭𝐡𝐯𝐢 𝐖𝐱𝐂 𝐞𝐧𝐚𝐛𝐥𝐞𝐬 𝐥𝐢𝐟𝐞-𝐬𝐚𝐯𝐢𝐧𝐠 𝐰𝐞𝐚𝐭𝐡𝐞𝐫 𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐢𝐨𝐧𝐬

Hurricane Katrina killed hundreds of people as it made landfall on New Orleans in 2005 - many of these deaths could have been avoided if alerts had been given one day earlier. Accurate weather forecasts are really life-saving.

🔥 Now, NASA and IBM just dropped a game-changing new model: the first ever foundation model for weather! This means, it's the first time we have a generalist model not restricted to one task, but able to predict 160 weather variables!

Prithvi WxC (Prithvi, “पृथ्वी”, is the Sanskrit name for Earth) - is a 2.3 billion parameter model, with an architecture close to previous vision transformers like Hiera.

💡 But it comes with some important tweaks: under the hood, Prithvi WxC uses a clever transformer-based architecture with 25 encoder and 5 decoder blocks. It alternates between "local" and "global" attention to capture both regional and global weather patterns.

𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🔮 Nails short-term forecasts - Prithvi WxC crushed it on 6-12 hour predictions, even outperforming some traditional numerical weather models
🌀 Tracks hurricanes like a champ - For Hurricane Ida, it predicted the landfall location within 5 km (vs 20+ km errors from other AI models), which is a huge progress!
🔍 6x downscaling power - Can zoom in on weather data to 6x higher resolution with 4x lower error than basic methods
🌊 Models elusive gravity waves - Accurately simulates these crucial but hard-to-capture atmospheric oscillations

As climate change intensifies, tools like Prithvi WxC will become more and more crucial to avoid disasters!

Announcement post 👉 https://newsroom.ibm.com/2024-09-23-ibm-and-nasa-release-open-source-ai-model-on-hugging-face-for-weather-and-climate-applications

Model on the Hub 👉 https://huggingface.co/Prithvi-WxC

Thank you @clem for highlighting it!

posted an update 15 days ago

Post

1045

🧠 Stanford paper might be the key to OpenAI o1’s performance: What’s so effective about Chain of Thought? ⇒ it unlocks radically different sequential tasks!

💭 Reminder: A Chain of Thought (CoT) means that you instruct the model to “think step by step”. Often it’s literally just putting in the prompt “let’s think step by step.”

🤔 This method has been shown to be unreasonably effective to increase perf on benchmarks. However why it works so well remains unclear.

Here's the scoop: Transformers are amazing at parallel processing, but they've always struggled with tasks that require sequential reasoning.

⛔️ For instance if you ask them the result of 3^2^2^2^…, with 20 iterations, they’ll nearly always fail.

💡 Indeed, researchers prove mathematically, by assimilating transformers networks to logical circuits, that effectively they cannot solve sequential tasks that require more than a certain threshold of sequences.

But CoT enables sequential reasoning:

- 🧱 Each step in the CoT corresponds to simulating one operation in a complex circuit.
- 🔄 This allows the transformer to "reset" the depth of intermediate outputs, overcoming previous limitations.
- 🚀 Thus, with CoT, constant-depth transformers can now solve ANY problem computable by polynomial-size circuits! (That's a huge class of problems in computer science.)
- 🔑 Transformers can now handle tricky tasks like iterated squares (computing 3^2^2^2^2) composed permutations and evaluating circuits - stuff that requires serial computation.
- 📊 The improvement is especially dramatic for transformers with a limited depth. Empirical tests on four arithmetic problems showed massive accuracy gains with CoT on inherently serial tasks.

Main takeaway: Chain-of-thought isn't just a neat trick - it fundamentally expands what transformer models can do!

Read the paper 👉 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875)

posted an update 15 days ago

Post

325

Anthropic just released a chunk improvement technique that vastly improves RAG performance! 🔥

Crash reminder: Retrieval Augmented Generation (RAG) is a widely-used technique for improving your LLM chatbot's answers to user questions.

It goes like this: instead of generating an LLM answer straight away, it just adds a previous step called Retrieval, that retrieves relevant documents from your knowledge base through semantic search, and just appends the top K documents to the prompt. ➡️ As a result, the LLM answer is grounded in context.

⛔️ The difficulty with this retrieval step is that when you split your documents into chunks that will be retrieved, you lose context. So importance chunks could be missed.

💡 Anthropic's just released blog post shows that you can add some context to each chunk, with one LLM call. Then you embed the original chunk + a bit of added context, so that the embedding is much more representative of the document in its context!

🤔 Isn't that crazy expensive? Well it would have been before, but not so much anymore with their new Prompt caching feature that makes duplicating thousands of requests with the same prompt much less expensive. They give an indicative price tag of only $1.02 per million chunks processed!

✅ And this vastly improves performance on their benchmark!

Read their blog post 👉 https://www.anthropic.com/news/contextual-retrieval

posted an update 17 days ago

Post

3339

🔥 𝐐𝐰𝐞𝐧 𝐫𝐞𝐥𝐞𝐚𝐬𝐞𝐬 𝐭𝐡𝐞𝐢𝐫 𝟐.𝟓 𝐟𝐚𝐦𝐢𝐥𝐲 𝐨𝐟 𝐦𝐨𝐝𝐞𝐥𝐬: 𝐍𝐞𝐰 𝐒𝐎𝐓𝐀 𝐟𝐨𝐫 𝐚𝐥𝐥 𝐬𝐢𝐳𝐞𝐬 𝐮𝐩 𝐭𝐨 𝟕𝟐𝐁!

The Chinese LLM maker just dropped a flurry of different models, ensuring there will be a Qwen SOTA model for every application out there:
Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B
Qwen2.5-Coder: 1.5B, 7B, and 32B on the way
Qwen2.5-Math: 1.5B, 7B, and 72B.

And they didn't sleep: the performance is top of the game for each weight category!

𝐊𝐞𝐲 𝐢𝐧𝐬𝐢𝐠𝐡𝐭𝐬:

🌐 All models have 𝟭𝟮𝟴𝗸 𝘁𝗼𝗸𝗲𝗻 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗹𝗲𝗻𝗴𝘁𝗵

📚 Models pre-trained on 18T tokens, even longer than the 15T of Llama-3

💪 The flagship 𝗤𝘄𝗲𝗻𝟮.𝟱-𝟳𝟮𝗕 𝗶𝘀 ~𝗰𝗼𝗺𝗽𝗲𝘁𝗶𝘁𝗶𝘃𝗲 𝘄𝗶𝘁𝗵 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟰𝟬𝟱𝗕, 𝗮𝗻𝗱 𝗵𝗮𝘀 𝗮 𝟯-𝟱% 𝗺𝗮𝗿𝗴𝗶𝗻 𝗼𝗻 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕 𝗼𝗻 𝗺𝗼𝘀𝘁 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀.

🇫🇷 On top of this, it 𝘁𝗮𝗸𝗲𝘀 𝘁𝗵𝗲 #𝟭 𝘀𝗽𝗼𝘁 𝗼𝗻 𝗺𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝘁𝗮𝘀𝗸𝘀 so it might become my standard for French

💻 Qwen2.5-Coder is only 7B but beats competing models up to 33B (DeeSeek-Coder 33B-Instruct). Let's wait for their 32B to come out!

🧮 Qwen2.5-Math sets a new high in the ratio of MATH benchmark score to # of parameters. They trained it by "aggregating more high-quality mathematical data, particularly in Chinese, from web sources, books, and codes across multiple recall cycles."

📄 Technical report to be released "very soon"

🔓 All models have the most permissive license apache2.0, except the 72B models that have a custom license mentioning "you can use it for free EXCEPT if your product has over 100M users"

🤗 All models are available on the HF Hub! ➡️ Qwen/qwen25-66e81a666513e518adb90d9e

2 replies

·

posted an update 19 days ago

Post

1629

𝗔𝗿𝗲 𝗔𝗴𝗲𝗻𝘁𝘀 𝗰𝗮𝗽𝗮𝗯𝗹𝗲 𝗲𝗻𝗼𝘂𝗴𝗵 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗦𝗰𝗶𝗲𝗻𝗰𝗲? ⇒ 𝗠𝗲𝗮𝘀𝘂𝗿𝗲 𝘁𝗵𝗲𝗶𝗿 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝘄𝗶𝘁𝗵 𝗗𝗦𝗕𝗲𝗻𝗰𝗵 📊

A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks : but they noticed that existing agentic benchmarks were severely limited in several aspects: they were limited to text and did not include tables or images, were only specific to certain packages, only performed exact match evaluation…

➡️ So they set out to build a much more exhaustive approach, to finally make the definitive DS agent benchmark.

𝗧𝗵𝗲 𝗗𝗦𝗕𝗲𝗻𝗰𝗵 𝗱𝗮𝘁𝗮𝘀𝗲𝘁
▪️DS bench has 466 data analysis tasks and 74 data modelling tasks
▪️The tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
▪️Difference with previous DS benchmarks:
❶ This benchmark leverages various modalities on top of text: images, Excel files, tables
❷ Complex tables: sometimes several tables should be leveraged to answer one question
❸ The context is richer, with longer descriptions.
▪️ Evaluation metrics : the benchmark is scored with an LLM as a judge, using a specific prompt.

𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀 𝗳𝗿𝗼𝗺 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀
▪️ Their evaluation confirms that using LLMs in an agent setup, for instance by allowing them to run a single step of code execution, is more costly (especially with multi-turn frameworks like autogen) but also much more performant than the vanilla LLM.
▪️ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) has quite low overlap, which suggests that different models tend to try very different approches.

This new benchmark is really welcome, can't wait to try transformers agents on it! 🤗

Read their full paper 👉 DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703)

posted an update 23 days ago

Post

1084

𝐎𝐩𝐞𝐧𝐀𝐈 𝐟𝐢𝐧𝐚𝐥𝐥𝐲 𝐫𝐞𝐯𝐞𝐚𝐥𝐬 “🍓”: 𝐜𝐫𝐚𝐳𝐲 𝐜𝐡𝐚𝐢𝐧-𝐨𝐟-𝐭𝐡𝐨𝐮𝐠𝐡𝐭-𝐭𝐮𝐧𝐞𝐝 𝐦𝐨𝐝𝐞𝐥 >> 𝐆𝐏𝐓-𝟒𝐨 💥

OpenAI had hinted at a mysterious “project strawberry” for a long time: 𝘁𝗵𝗲𝘆 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱 𝘁𝗵𝗶𝘀 𝗻𝗲𝘄 𝗺𝗼𝗱𝗲𝗹 𝗰𝗮𝗹𝗹𝗲𝗱 “𝗼𝟭” 𝟭𝗵𝗼𝘂𝗿 𝗮𝗴𝗼, 𝗮𝗻𝗱 𝘁𝗵𝗲 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗺𝗶𝗻𝗱-𝗯𝗹𝗼𝘄𝗶𝗻𝗴.

🤯 Ranks among the top 500 students in the US in a qualifier for the USA Math Olympiad
🤯 Beats human-PhD-level accuracy by 8% on GPQA, hard science problems benchmark where the previous best was Claude 3.5 Sonnet with 59.4.
🤯 Scores 78.2% on vision benchmark MMMU, making it the first model competitive w/ human experts
🤯 GPT-4o on MATH scored 60% ⇒ o1 scores 95%

How did they pull this? Sadly OpenAI keeps increasing their performance in “making cryptic AF reports to not reveal any real info”, so here are excerpts:

💬 “𝗼𝟭 𝘂𝘀𝗲𝘀 𝗮 𝗰𝗵𝗮𝗶𝗻 𝗼𝗳 𝘁𝗵𝗼𝘂𝗴𝗵𝘁 𝘄𝗵𝗲𝗻 𝗮𝘁𝘁𝗲𝗺𝗽𝘁𝗶𝗻𝗴 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝗮 𝗽𝗿𝗼𝗯𝗹𝗲𝗺. 𝗧𝗵𝗿𝗼𝘂𝗴𝗵 𝗿𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴, 𝗼𝟭 𝗹𝗲𝗮𝗿𝗻𝘀 𝘁𝗼 𝗵𝗼𝗻𝗲 𝗶𝘁𝘀 𝗰𝗵𝗮𝗶𝗻 𝗼𝗳 𝘁𝗵𝗼𝘂𝗴𝗵𝘁 𝗮𝗻𝗱 𝗿𝗲𝗳𝗶𝗻𝗲 𝘁𝗵𝗲 𝘀𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝗶𝘁 𝘂𝘀𝗲𝘀. It learns to recognize and correct its mistakes.”

And of course, they decide to hide the content of this precious Chain-of-
Thought. Would it be for maximum profit? Of course not, you awful capitalist, it’s to protect users:

💬 “We also do not want to make an unaligned chain of thought directly visible to users.”

They’re right, it would certainly have hurt my feelings to see the internal of this model tearing apart math problems.

🤔 I suspect it could be not only CoT, but also some agentic behaviour where the model can just call a code executor. The kind of score improvement the show certainly looks like the ones you see with agents.

This model will be immediately released for ChatGPT and some “trusted API users”.

Let’s start cooking to release the same thing in 6 months! 🚀

1 reply

·

posted an update 23 days ago

Post

664

𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗻𝗴 𝘆𝗼𝘂𝗿 𝗛𝗧𝗠𝗟 𝘄𝗲𝗯𝗽𝗮𝗴𝗲𝘀 𝘁𝗼 𝗺𝗮𝗿𝗸𝗱𝗼𝘄𝗻 𝗶𝘀 𝗻𝗼𝘄 𝗽𝗼𝘀𝘀𝗶𝗯𝗹𝗲 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝘄𝗶𝘁𝗵 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝗟𝗟𝗠! 👏

Jina just released Reader-LM, that handles the whole pipeline of extracting markdown from HTML webpages.

A while ago, Jina had released a completely code-based deterministic program to do this extraction, based on some heuristics : e.g., “if the text is in a <p> tag, keep it, but if it’s hidden behind another, remove it”.

🤔 But they received complaints from readers: some found it too detailed, other not enough, depending on the pages.

➡️ So they decided, 𝗺𝗮𝘆𝗯𝗲 𝗵𝗲𝘂𝗿𝗶𝘀𝘁𝗶𝗰𝘀 𝘄𝗲𝗿𝗲 𝗻𝗼𝘁 𝗲𝗻𝗼𝘂𝗴𝗵: 𝗶𝗻𝘀𝘁𝗲𝗮𝗱, 𝘁𝗵𝗲𝘆 𝘁𝗿𝗶𝗲𝗱 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗮 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗰𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗲𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻. This LLM does not need to be very strong,but it should handle a very long context: it’s a challenging, “shallow-but-wide” architecture.

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
2️⃣ models: Reader-LM-0.5B and 1.5B
⚙️ Two stages of training: first, short and simple HTML to get the basics, then ramp up to longer and harder HTML up to 128k tokens
🔎 Use contrastive search for decoding: this empirically reduces “repeating output” issues
➡️ Their models beat much larger models at HTML extraction 🔥
🤗 Weights available on HF (sadly cc-by-nc license): jinaai/reader-lm-1.5b

1 reply

·

posted an update 24 days ago

Post

1341

𝗢𝗽𝗲𝗻 𝗟𝗟𝗠𝘀 𝗮𝗿𝗲 𝗼𝗻 𝗳𝗶𝗿𝗲 𝗿𝗶𝗴𝗵𝘁 𝗻𝗼𝘄! 🔥 𝗗𝗲𝗲𝗽𝗦𝗲𝗲𝗸-𝗩𝟮.𝟱 𝗮𝗻𝗱 𝗼𝘁𝗵𝗲𝗿 𝘁𝗼𝗽 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝘀

Mistral AI just released Pixtral-12B, a vision models that seems to perform extremely well! From Mistral’s own benchmark, it beats the great Qwen2-7B and Llava-OV.

🤔 But Mistral’s benchmarks evaluate in Chain-of-Thought, and even in CoT they show lower scores for other models than the scores already published in non-CoT, which is very strange… Evaluation is not a settled science!

But it’s only the last of a flurry of great models. Here are the ones currently squatting the top of the Models Hub page:

❶ 🔊 𝐋𝐥𝐚𝐦𝐚-𝟑.𝟏-𝟖𝐁 𝐎𝐦𝐧𝐢, a model built upon Llama-3.1-8B-Instruct, that simultaneously generates text and speech response with an extremely low latency of 250ms (Moshi, Kyutai’s 8B, did 140ms)

❷ 🐟🗣️ 𝐅𝐢𝐬𝐡 𝐒𝐩𝐞𝐞𝐜𝐡 𝐯𝟏.𝟒, text-to-speech model that supports 8 languages 🇬🇧🇨🇳🇩🇪🇯🇵🇫🇷🇪🇸🇰🇷🇸🇦 with extremely good quality for a light size (~1GB weights) and low latency

❸ 🐳 𝐃𝐞𝐞𝐩𝐒𝐞𝐞𝐤-𝐕𝟐.𝟓, a 236B model with 128k context length that combines the best of DeepSeek-V2-Chat and the more recent DeepSeek-Coder-V2-Instruct. Depending on benchmarks, it ranks just below Llama-3.1-405B. Released with custom ‘deepseek’ license, quite commercially permissive.

❹ 𝐒𝐨𝐥𝐚𝐫 𝐏𝐫𝐨 published by Upstage: a 22B model (so inference fits on a single GPU) that comes just under Llama-3.1-70B performance : MMLU: 79, GPQA: 36, IFEval: 84

❺ 𝐌𝐢𝐧𝐢𝐂𝐏𝐌𝟑-𝟒𝐁, a small model that claims very impressive scores, even beating much larger models like Llama-3.1-8B. Let's wait for more scores because these look too good!

Let’s keep looking, more good stuff is coming our way 🔭

posted an update 24 days ago

Post

2162

𝗔𝗿𝗰𝗲𝗲 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝘀 𝗦𝘂𝗽𝗲𝗿𝗡𝗼𝘃𝗮, 𝗯𝗲𝘁𝘁𝗲𝗿 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗲 𝗼𝗳 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕!

2️⃣ versions: 70B and 8B
🧠 Trained by distilling logits from Llama-3.1-405B
🐥 Used a clever compression method to reduce dataset weight from 2.9 Petabytes down to 50GB (may share it in a paper)
⚙️ Not all benchmarks are improved: GPQA and MUSR go down a slight bit
🤗 8B weights are available on HF (not the 70B)

Read their blog post 👉 https://blog.arcee.ai/arcee-supernova-training-pipeline-and-model-composition/
Model weights (8B) 👉 arcee-ai/Llama-3.1-SuperNova-Lite

posted an update 25 days ago

Post

628

> 𝗪𝗮𝗻𝘁 𝘁𝗼 𝗸𝗻𝗼𝘄 𝗵𝗼𝘄 𝗺𝘂𝗰𝗵 𝗮𝗻 𝗔𝗣𝗜 𝗟𝗟𝗠 𝗰𝗮𝗹𝗹 𝗰𝗼𝘀𝘁𝘀 𝘆𝗼𝘂?

I've just made this Space that gets you the API price for any LLM call, for nearly all inference providers out there!

This is based on a comment by @victor under my HF Post a few months back, and leverages BerriAI's data for LLM prices.

Check it out here 👉 m-ric/text_to_dollars

replied to their post 25 days ago

FYI, I've finally built this here: https://huggingface.co/spaces/m-ric/text_to_dollars

replied to their post 26 days ago

Read the article here: https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi

posted an update 26 days ago

Post

1176

> Article read: Simple guide to LLM inference and to TGI

I've just read article "LLM inference at scale with TGI" by @martinigoyanes . It's really good content, a must-read if you want a good low-level intro to LLM inference with TGI!

My takeaways:

How does inference work?
🧠 Prefill: the input prompt is tokenized on CPU, then transferred to GPU. Then one single forward pass generates the initial token.
🔄 Decode: the model generates ("decodes") tokens one by one, each time appending the new token to the current input of size N to then generate a new token again with this augmented input of length N+1. This loop ends either when a specific token called "End-of-sequence" is generated or when the completion reaches a pre-specified maximum length. Then the sequence is de-tokenized on CPU to yield text again.
⏱️ This step's speed determines the Time Per Output Token, which directly translates to the key metric: Throughput

🤔 How was the separation between the two steps decided ? Like, why does prefill include this strange generation of only one token at then end?
➡️ The cost of attention scales quadratically with the number of tokens, so it can really explode quickly.
To compensate for that, a really important technique called KV caching was devised: using the fact that when generating token N+1, the Key and Value (K and V) matrices generated inside the Transformers are a simple extension from the K and V from the previous step, the model caches the K and V matrices between steps : thus the separation - the prefill part is the part that prepares this KV cache, while the decoding is the one that leverages it and expands it by one at each step.

TGI-specific takeaways:
⚙️ TGI has many SOTA techniques for decoding: Paged Attention, KV Caching and Flash Attention…
🔀 TGI's router handles generations finishing early because of an EOS token: instead of static batching, it continuously batches requests to the inference engine & filters away finished requests.

1 reply

·

posted an update 30 days ago

Post

1900

🤯 𝗔 𝗻𝗲𝘄 𝟳𝟬𝗕 𝗼𝗽𝗲𝗻-𝘄𝗲𝗶𝗴𝗵𝘁𝘀 𝗟𝗟𝗠 𝗯𝗲𝗮𝘁𝘀 𝗖𝗹𝗮𝘂𝗱𝗲-𝟯.𝟱-𝗦𝗼𝗻𝗻𝗲𝘁 𝗮𝗻𝗱 𝗚𝗣𝗧-𝟰𝗼!

@mattshumer , CEO from Hyperwrite AI, had an idea he wanted to try out: why not fine-tune LLMs to always output their thoughts in specific parts, delineated by <thinking> tags?

Even better: inside of that, you could nest other sections, to reflect critically on previous output. Let’s name this part <reflection>. Planning is also put in a separate step.

He named the method “Reflection tuning” and set out to fine-tune a Llama-3.1-70B with it.

Well it turns out, it works mind-boggingly well!

🤯 Reflection-70B beats GPT-4o, Sonnet-3.5, and even the much bigger Llama-3.1-405B!

𝗧𝗟;𝗗𝗥
🥊 This new 70B open-weights model beats GPT-4o, Claude Sonnet, et al.
⏰ 405B in training, coming soon
📚 Report coming next week
⚙️ Uses GlaiveAI synthetic data
🤗 Available on HF!

I’m starting an Inference Endpoint right now for this model to give it a spin!

Check it out 👉 mattshumer/Reflection-Llama-3.1-70B

3 replies

·

posted an update about 1 month ago

Post

833

🚀 𝗪𝗵𝗲𝗿𝗲 𝘀𝗰𝗮𝗹𝗶𝗻𝗴 𝗹𝗮𝘄𝘀 𝗮𝗿𝗲 𝘁𝗮𝗸𝗶𝗻𝗴 𝘂𝘀 : 𝗯𝘆 𝟮𝟬𝟮𝟴, 𝗔𝗜 𝗖𝗹𝘂𝘀𝘁𝗲𝗿𝘀 𝘄𝗶𝗹𝗹 𝗿𝗲𝗮𝗰𝗵 𝘁𝗵𝗲 𝗽𝗼𝘄𝗲𝗿 𝗰𝗼𝗻𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻 𝗼𝗳 𝗲𝗻𝘁𝗶𝗿𝗲 𝗰𝗼𝘂𝗻𝘁𝗿𝗶𝗲𝘀

Reminder : “Scaling laws” are empirical laws saying that if you keep multiplying your compute by x10, your models will mechanically keep getting better and better.

To give you an idea, GPT-3 can barely write sentences, and GPT-4, which only used x15 its amount of compute, already sounds much smarter than some of my friends (although it's not really - or at least I haven't tested them side-by side). So you can imagine how far a x100 over GPT-4 can take us.

🏎️ As a result, tech titans are racing to build the biggest models, and for this they need gigantic training clusters.

The picture below shows the growth of training compute: it is increasing at a steady exponential rate of a x10 every 2 years. So let’s take this progress a bit further:
- 2022: starting training for GPT-4 : 10^26 FLOPs, cost of $100M
- 2024: today, companies start training on much larger clusters like the “super AI cluster” of Elon Musk’s xAI, 10^27 FLOPS, $1B
- 2026 : by then clusters will require 1GW, i.e. around the full power generated by a nuclear reactor
- 2028: we reach cluster prices in the 100 billion dollars, using 10GW, more than the most powerful power stations currently in use in the US. This last size seems crazy, but Microsoft and OpenAI already are planning one.

Will AI clusters effectively reach these crazy sizes where the consume as much as entire countries?
➡️ Three key ingredients of training might be a roadblock to scaling up :
💸 Money: but it’s very unlikely, given the potential market size for AGI, that investors lose interest.
⚡️ Energy supply at a specific location
📚 Training data: we’re already using 15 trillion tokens for Llama-3.1 when Internet has something like 60 trillion.

🤔 I’d be curious to hear your thoughts: do you think we’ll race all the way there?

3 replies

·

posted an update about 1 month ago

Post

2109

🥳 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀 𝗔𝗴𝗲𝗻𝘁𝘀 𝗻𝗼𝘄 𝘀𝘂𝗽𝗽𝗼𝗿𝘁𝘀 𝗠𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀!

Multi-agent systems have been introduced in Microsoft's framework Autogen. It simply means having several agents working together to solve your task instead of only one : this paradigm empirically yields better performance on most benchmarks. The reason for this better performance is conceptually simple: for many tasks, rather than using a do-it-all system, you would prefer to specialize units on sub-tasks. Here, having agents with separate tool sets and memories allows to achieve efficient specialization.

You can now easily build hierarchical multi-agent systems with transformers.agents (not released yet, use the dev version)

To do so, encapsulate the agent in a ManagedAgent object. This object needs arguments agent, name, and a description, which will then be embedded in the manager agent's system prompt to let it know how to call this managed agent, as we also do for tools.

Cf the example in the image! We'll keep building on this paradigm in the upcoming weeks 🚀

Read more in the doc 👉 https://github.com/huggingface/transformers/blob/main/docs/source/en/agents_advanced.md

Checkout an advanced multi-agent system that tops the GAIA leaderboard 👉 https://github.com/aymeric-roucher/GAIA/blob/main/gaia_multiagent.py

posted an update about 1 month ago

Post

802

🚨 𝗛𝘂𝗺𝗮𝗻 𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸 𝗳𝗼𝗿 𝗔𝗜 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴: 𝗡𝗼𝘁 𝘁𝗵𝗲 𝗴𝗼𝗹𝗱𝗲𝗻 𝗴𝗼𝗼𝘀𝗲 𝘄𝗲 𝘁𝗵𝗼𝘂𝗴𝗵𝘁?

I’ve just read a great paper where Cohere researchers raises significant questions about using Human feedback to evaluate AI language models.

Human feedback is often regarded as the gold standard for judging AI performance, but it turns out, it might be more like fool's gold : the study reveals that our human judgments are easily swayed by factors that have nothing to do with actual AI performance.

𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🧠 Test several models: Llama-2, Falcon-40B, Cohere Command 6 and 52B 🙅‍♂️ Refusing to answer tanks AI ratings more than getting facts wrong. We apparently prefer a wrong answer to no answer!

💪 Confidence is key (even when it shouldn't be): More assertive AI responses are seen as more factual, even when they're not. This could be pushing AI development in the wrong direction, with systems like RLHF.

🎭 The assertiveness trap: As AI responses get more confident-sounding, non-expert annotators become less likely to notice when they're wrong or inconsistent.

And a consequence of the above:
🔄 𝗥𝗟𝗛𝗙 𝗺𝗶𝗴𝗵𝘁 𝗯𝗮𝗰𝗸𝗳𝗶𝗿𝗲: Using human feedback to train AI (Reinforcement Learning from Human Feedback) could accidentally make AI more overconfident and less accurate.

This paper means we need to think carefully about how we evaluate and train AI systems to ensure we're rewarding correctness over apparences of it like confident talk.

⛔️ Chatbot Arena’s ELO leaderboard, based on crowdsourced answers from average joes like you and me, might become completely irrelevant as models will become smarter and smarter.

Read the paper 👉 Human Feedback is not Gold Standard (2309.16349)

posted an update about 1 month ago

Post

2202

🤖 𝗧𝗵𝗲 𝗔𝗜 𝗦𝗰𝗶𝗲𝗻𝘁𝗶𝘀𝘁: 𝗔𝗴𝗲𝗻𝘁𝗶𝗰, 𝗳𝘂𝗹𝗹𝘆-𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗳𝗼𝗿 𝘂𝗻𝗱𝗲𝗿 $𝟭𝟱 𝗽𝗲𝗿 𝗽𝗮𝗽𝗲𝗿

Researchers have just created an AI system that 𝗰𝗮𝗻 𝗰𝗼𝗻𝗱𝘂𝗰𝘁 𝗲𝗻𝘁𝗶𝗿𝗲 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗽𝗿𝗼𝗷𝗲𝗰𝘁𝘀 𝗳𝗿𝗼𝗺 𝘀𝘁𝗮𝗿𝘁 𝘁𝗼 𝗳𝗶𝗻𝗶𝘀𝗵, 𝗽𝗼𝘁𝗲𝗻𝘁𝗶𝗮𝗹𝗹𝘆 𝗿𝗲𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻𝗶𝘇𝗶𝗻𝗴 𝗵𝗼𝘄 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝗳𝗶𝗰 𝗱𝗶𝘀𝗰𝗼𝘃𝗲𝗿𝗶𝗲𝘀 𝗮𝗿𝗲 𝗺𝗮𝗱𝗲.

It doesn't just assist with specific tasks - it automates the entire research process, from generating ideas to writing and reviewing papers.
1 - brainstorm novel research directions, 2- write and execute code for experiments & visualize results, get references, and even 3- write up findings in a full academic paper format!

And it can do all this for under $15 per paper! 🤯

𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🧠 Generates novel research ideas across multiple topics (e.g. diffusion modeling, transformers, learning dynamics aka “grokking”)
👨‍💻 Uses open-source coding assistant Aider to implement ideas and run experiments. This is especially important since this agentic assistant can iterate if it fails somewhere.
📊 Visualizes results and plans follow-up experiments (up to 5 rounds)
✍️ Writes full academic papers, including finding references using Semantic Search API
🕵️ Runs a simulated peer review process to evaluate paper quality
💰 Total cost per paper is under $15. This system can generate "hundreds of interesting, medium-quality papers" in just a week !

𝗦𝘁𝗶𝗹𝗹 𝗻𝗼𝘁 𝗿𝗲𝗮𝗱𝘆 𝘁𝗼 𝗳𝗶𝗹𝗹 𝗜𝗖𝗟𝗥 𝘄𝗶𝘁𝗵 𝗽𝗮𝗽𝗲𝗿𝘀:
🔁 Ideas generated in one domain tend to be repetitive across different runs, and even different language model
👀 Does not use vision capabilities to fix visual issues in plots
💭 Models occasionally hallucinate entire results tables
⇒ Only few of the generated papers would actually meet the threshold for acceptance at a top AI conference

👉 Read their paper: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (2408.06292)

posted an update about 1 month ago

Post

2097

🎮 𝗔 𝗻𝗲𝘂𝗿𝗮𝗹 𝗻𝗲𝘁𝘄𝗼𝗿𝗸 𝘀𝗶𝗺𝘂𝗹𝗮𝘁𝗲𝘀 𝗗𝗢𝗢𝗠: 𝗚𝗼𝗼𝗴𝗹𝗲 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵𝗲𝗿𝘀 𝗼𝗽𝗲𝗻 𝘁𝗵𝗲 𝘄𝗮𝘆 𝗳𝗼𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝘁𝗲𝗹𝘆-𝗔𝗜-𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗲𝗱 𝗴𝗮𝗺𝗲𝘀!

Imagine if games were completely live-generated by an AI model : the NPCs and their dialogues, the storyline, and even the game environment. The player’s in-game actions would have a real, lasting impact on the game story.

In a very exciting paper, Google researchers just gave us the first credible glimpse of this future.

➡️ They created GameNGen, the first neural model that can simulate a complex 3D game in real-time. They use it to simulate the classic game DOOM running at over 20 frames per second on a single TPU, with image quality comparable to lossy JPEG compression. And it feels just like the true game!

Here's how they did it:
1. They trained an RL agent to play DOOM and recorded its gameplay sessions.
2. They then used these recordings to train a diffusion model to predict the next frame, based on past frames and player actions.
3. During inference, they use only 4 denoising steps (instead of the usual dozens) to generate each frame quickly.

𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🎮🤔 Human players can barely tell the difference between short clips (3 seconds) of the real game or the simulation
🧠 The model maintains game state (health, ammo, etc.) over long periods despite having only 3 seconds of effective context length
🔄 They use "noise augmentation" during training to prevent quality degradation in long play sessions
🚀 The game runs on one TPU at 20 FPS with 4 denoising steps, or 50 FPS with model distillation (with some quality loss)

The researchers did not open source the code, but I feel like we’ve just seen a part of the future being written!

Their paper (exploding the upvote counter) 👉 Diffusion Models Are Real-Time Game Engines (2408.14837)
In a similar vein, play @Jofthomas 's 'Everchanging Quest' 🎮 Jofthomas/Everchanging-Quest

replied to their post about 1 month ago

Yes that's why I added the big if behind!

posted an update about 1 month ago

Post

913

𝗔𝗜𝟮𝟭 𝗶𝘁𝗲𝗿𝗮𝘁𝗲𝘀 𝘄𝗶𝘁𝗵 𝗻𝗲𝘄 𝗝𝗮𝗺𝗯𝗮 𝟭.𝟱 𝗿𝗲𝗹𝗲𝗮𝘀𝗲: 𝗡𝗲𝘄 𝘀𝘁𝗮𝗻𝗱𝗮𝗿𝗱 𝗳𝗼𝗿 𝗹𝗼𝗻𝗴-𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝘂𝘀𝗲-𝗰𝗮𝘀𝗲𝘀!🏅

@ai21labs used a different architecture to beat the status-quo Transformers models: Jamba architecture combines classic Transformers layers with the new Mamba layers, for which the complexity is a linear (instead of quadratic) function of the context length.

What does this imply?

➡️ Jamba models are much more efficient for long contexts: faster (up to 2.5x faster for long context), takes less memory, and also performs better to recall everything in the prompt.

That means it’s a new go-to model for RAG or agentic applications!

And the performance is not too shabby: Jamba 1.5 models are comparable in perf to similar-sized Llama-3.1 models! The largest model even outperforms Llama-3.1 405B on Arena-Hard.

✌️ Comes in 2 sizes: Mini (12B active/52B) and Large (94B active/399B)
📏 Both deliver 256k context length, for low memory: Jamba-1.5 mini fits 140k context length on one single A100.
⚙️ New quanttization method: Experts Int8 quantizes only the weights parts of the MoE layers, which account for 85% of weights
🤖 Natively supports JSON format generation & function calling.
🔓 Permissive license *if your org makes <$50M revenue*

Available on the Hub 👉 ai21labs/jamba-15-66c44befa474a917fcf55251
Read their release blog post 👉 https://www.ai21.com/blog/announcing-jamba-model-family

2 replies

·

posted an update about 2 months ago

Post

3370

𝗚𝗼𝗼𝗴𝗹𝗲 𝗽𝗮𝗽𝗲𝗿 : 𝘀𝗰𝗮𝗹𝗶𝗻𝗴 𝘂𝗽 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗺𝗽𝘂𝘁𝗲 𝗯𝗲𝗮𝘁𝘀 𝟭𝟰𝘅 𝗹𝗮𝗿𝗴𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 🚀

Remember scaling laws? These are empirical laws that say "the bigger your model, the better it gets". More precisely, "as your compute increases exponentially, loss decreases in a linear fashion". They have wild implications, suggesting that spending 100x more training compute would make you super-LLMs. That's why companies are racing to build the biggest AI superclusters ever, and Meta bought 350k H100 GPUs, which probably cost in the order of $1B.

But think of this : we're building huge reasoning machines, but only ask them to do one pass through the model to get one token of the final answer : i.e., we expend a minimal effort on inference. That's like building a Caterpillar truck and making it run on a lawnmower's motor. 🚚🛵 Couldn't we optimize this? 🤔

💡 So instead of scaling up on training by training even bigger models on many more trillions of tokens, Google researchers explored this under-explored avenue : scaling up inference compute.

They combine two methods to use more compute : either a reviser that iterated to adapt the model distribution, or generate N different completions (for instance through Beam Search) and select only the best one using an additional verifier model.

They use a Palm-2 model (released in May 23) on the MATH dataset : Palm-2 has the advantage of getting a low performance on MATH, but not zero, so that improvements will be noticeable.

And the results show that for the same fixed amount of inference compute:
💥 a smaller model with more effort on decoding beats a x14 bigger model using naive greedy sampling.

That means that you can divide your training costs by 14 and still get the same perf for the same inference cost!

Take that, scaling laws. Mark Zuckerberg, you're welcome, hope I can get some of these H100s.

Read the paper here 👉 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2408.03314)

1 reply

·

posted an update about 2 months ago

Post

988

𝗭𝗲𝗿𝗼-𝗺𝗮𝘁𝗵 𝗶𝗻𝘁𝗿𝗼 𝘁𝗼 𝗔𝗜 𝗵𝗶𝘀𝘁𝗼𝗿𝘆: 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝟭𝟵𝟱𝟬𝘀 𝘁𝗼 𝘁𝗼𝗱𝗮𝘆'𝘀 𝗟𝗟𝗠𝘀 📖

I wanted to structure my thinking about LLMs by going through their history since the 50s. This history is captivating, with the opposition between Connexionists (Rosenblatt, LeCun) and Symbolists, the first victories of "deep" neural networks, the revolution of Attention...

So I might have gone a bit too far! 😅

📝 I've made a long post summarizing the main stages of building LLMs: neural networks, optimization, backpropagation, attention layers...

✅ And I've made sure to keep it 100% horrible-latex-math-free: the technical stuff is conveyed in graphs only, so it should be accessible to really anyone, even your grandfather (I'm sending it to mine right now).

Read it here in english 👉 https://aymeric-roucher.github.io/brief-history-of-ai/
Pour le post en français 👉 https://aymeric-roucher.github.io/breve-histoire-de-l-ia/

posted an update 2 months ago

Post

1732

𝗦𝗔𝗠 𝟮 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱: 𝗡𝗲𝘄 𝗦𝗢𝗧𝗔 𝗼𝗻 𝘀𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻, 𝗯𝘆 𝗰𝗼𝗺𝗯𝗶𝗻𝗶𝗻𝗴 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮 𝘄𝗶𝘁𝗵 𝗵𝘂𝗺𝗮𝗻 𝗳𝗲𝗲𝗱𝗯𝗮𝗰𝗸 🚀

It's a model for Object segmentation, for both image and video:
👉 input = a text prompt, or a click on a specific object
👉 output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)

💪 SAM 2 is 6x faster than the previous version, it now also works on a video, and it beats SOTA by far on both image and video segmentation tasks.

How did they pull that?

The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! ➡️ As a result, existing video segmentation datasets have a real lack of coverage: few examples, few masklets drawn.

💡 Key idea: researchers they decided to use a segmentation model to help them collect the dataset.

But then it’s a chicken and egg problem: you need the model to create the dataset and the opposite as well? 🤔

⇒ To solve this, they build a data generation system that they scale up progressively in 3 successive manual annotations phases:

𝗦𝘁𝗲𝗽 𝟭: Annotators use only SAM + manual editing tools on each frame ⇒ Create 16k masklets across 1.4k videos

𝗦𝘁𝗲𝗽 𝟮: Then train a first SAM 2, add it in the loop to temporally propagate frames, and correct by re-doing a mask manually when an error has occured ⇒ This gets a 5.1x speedup over data collection in phase 1! 🏃 Collect 60k masklets

𝗦𝘁𝗲𝗽 𝟯: Now SAM 2 is more powerful, it has the “single click” prompting option, thus annotators can use it with simple clicks to re-annotate data.

They even add a completely automatic step to generate 350k more masklets!
And in turn, the model perf gradually increases.

I find this a great example of combining synthetic data generation with human annotation 👏

posted an update 2 months ago

Post

1089

𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭 𝗺𝗼𝗱𝗲𝗹𝘀 𝗳𝗶𝗻𝗮𝗹𝗹𝘆 𝗴𝗲𝘁 𝘁𝗵𝗲𝗶𝗿 𝗖𝗵𝗮𝘁𝗯𝗼𝘁 𝗔𝗿𝗲𝗻𝗮 𝗿𝗮𝗻𝗸𝗶𝗻𝗴 🎖️

Given the impressive benchmarks published my Meta for their Llama-3.1 models, I was curious to see how these models would compare to top proprietary models on Chatbot Arena.

Now we've got the results! LMSys released the ELO derived from thousands of user votes for the new models, and here are the rankings:

💥 405B Model ranks 5th overall, in front of GPT-4-turbo! But behind GPT-4o, Claude-3.5 Sonnet and Gemini-advanced.
👏 70B Model climbs up to 9th rank ! From 1206 ➡️ 1244.
👍 8B Model improves from 1152 ➡️ 1170.

✅ This confirms that Llama-3.1 is a good contender for any task: any of its 3 model size is much cheaper to run than equivalent proprietary models!

For instance, here are the inference prices for the top models;
➤ GPT-4-Turbo inference price from OpenAI: $5/M input tokens, $15/M output tokens
➤ Llama-3.1-405B from HF API (for testing only): 3$/M for input or output tokens (Source linked in the first comment)
➤ Llama-3.1-405B from HF API (for testing only): free ✨

Get a head start on the HF API (resource by @andrewrreed ) 👉 https://huggingface.co/learn/cookbook/enterprise_hub_serverless_inference_api

1 reply

·

posted an update 2 months ago

Post

2247

𝗧𝗵𝗲 𝗵𝘂𝗴𝗲 𝗰𝗼𝘀𝘁 𝗼𝗳 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗼𝗻 𝗳𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗟𝗟𝗠𝘀 💸

Google DeepMind recently released a great paper that shows optimal hyperparameters to train across different regimes: Scaling Exponents Across Parameterizations and Optimizers, with data from 10,000 training runs.

One engineer decided to quantify the price of such a large-scale experiment.

😬 And the bill is hefty: ~13M USD

This exact number is to take with a grain of salt because many approximations were necessary to get the final result.

⛔️ But still this ballpark means that for this sole experiment, the price is way over what most startups or research labs could afford.

This means that open-sourcing research is more important than ever, to put everyone in the ecosystem on a roughly equal footing. Don't let OpenAI run first, they'll keep everything for themselves!

Read the full post that quantifies the paper's cost 👉 https://152334h.github.io/blog/scaling-exponents/

1 reply

·

posted an update 2 months ago

Post

2249

𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗗𝗮𝘁𝗮 𝗮𝗻𝗮𝗹𝘆𝘀𝘁: 𝗱𝗿𝗼𝗽 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝗲, 𝗹𝗲𝘁 𝘁𝗵𝗲 𝗟𝗟𝗠 𝗱𝗼 𝘁𝗵𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 📊⚙️

Need to make quick exploratory data analysis? ➡️ Get help from an agent.

I was impressed by Llama-3.1's capacity to derive insights from data. Given a csv file, it makes quick work of exploratory data analysis and can derive interesting insights.

On the data from the Kaggle titanic challenge, that records which passengers survived the Titanic wreckage, it was able by itself to derive interesting trends like "passengers that paid higher fares were more likely to survive" or "survival rate was much higher for women than men".

The cookbook even lets the agent built its own submission to the challenge, and it ranks under 3,000 out of 17,000 submissions: 👏 not bad at all!

Try it for yourself in this Space demo 👉 m-ric/agent-data-analyst

2 replies

·

replied to Wauplin's post 3 months ago

Thanks a lot @Wauplin !

posted an update 3 months ago

Post

3199

𝐍𝐞𝐰 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞 𝐢𝐧 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 𝐬𝐢𝐠𝐧𝐢𝐟𝐢𝐜𝐚𝐧𝐭𝐥𝐲 𝐫𝐞𝐝𝐮𝐜𝐞𝐬 𝐡𝐚𝐥𝐥𝐮𝐜𝐢𝐧𝐚𝐭𝐢𝐨𝐧𝐬 👏

DoLa decoding, which made a conference paper at ICLR '24, has just been merged in Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given a current text, they compute a logit for each token in their vocabulary that should represent the probability of this token coming next.

Then they either pick the highest logit token (greedy decoding) or sample one with a probability defined by the logits (sampling).

The authors of DoLa wanted to improve that simple method.

They knew this established fact that transformer LMs encode low-level info (like base syntax) in early layers and more high-level info like knowledge in the later layers.

💡 This gave them their key idea: During decoding, rather than picking the token with the highest logit, 𝘄𝗵𝘆 𝗻𝗼𝘁 𝗽𝗶𝗰𝗸 𝘁𝗵𝗲 𝘁𝗼𝗸𝗲𝗻 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗶𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗶𝗻 𝗹𝗼𝗴𝗶𝘁 𝗮𝗰𝗿𝗼𝘀𝘀 𝗹𝗮𝘆𝗲𝗿𝘀?

This gives impressive results:
🚀 𝟱% - 𝟮𝟬% 𝗯𝗮𝘀𝗲 𝗽𝗼𝗶𝗻𝘁𝘀 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗮𝗰𝗿𝗼𝘀𝘀 𝘁𝗵𝗲 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
🚀 For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is 𝗮𝗿𝗼𝘂𝗻𝗱 𝟰𝟬% 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 𝗰𝗼𝗺𝗽𝗮𝗿𝗲𝗱 𝘁𝗼 𝘀𝘁𝗮𝗻𝗱𝗮𝗿𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 𝗧𝗵𝗲 𝗿𝘂𝗻𝘁𝗶𝗺𝗲 𝗶𝗻𝗰𝗿𝗲𝗮𝘀𝗲 𝗶𝘀 𝗻𝗲𝗴𝗹𝗶𝗴𝗶𝗯𝗹𝗲, 𝟭 𝘁𝗼 𝟴% 𝗼𝗻𝗹𝘆.

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0

2 replies

·

replied to their post 3 months ago

You can use this newer repo in Github @sidbin : https://github.com/aymeric-roucher/GAIA, it has a requirements.txt!

posted an update 3 months ago

Post

854

One more cookbook:
Agent for self-correcting Text-to-SQL 🧑‍💻

What if the query generated by your Text-to-SQL pipeline is correct SQL but returns wrong results?
👉 We need to add a critique step

✅ That's very simple with an agent!

Check out the notebook! 👇
https://huggingface.co/learn/cookbook/agent_text_to_sql

posted an update 3 months ago

Post

1855

New cookbook!

I show to to make agentic RAG using Transformers Agents.

Compared to vanilla RAG, agentic RAG can:
✅ Reformulate the query
✅ Critique the retrived content to re-retrieve if needed

➡️ Score increase of 8.5%! 💪 (Llama-3-70B-judge)

Read it here 👉 https://huggingface.co/learn/cookbook/agent_rag

replied to their post 3 months ago

It's not using GPT-4o for evaluation, evaluation is done with exact string match!

posted an update 3 months ago

Post

2656

𝗬𝗼𝘂 𝗱𝗼𝗻'𝘁 𝗻𝗲𝗲𝗱 "𝗳𝘂𝗻𝗰𝘁𝗶𝗼𝗻 𝗰𝗮𝗹𝗹𝗶𝗻𝗴 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴" 𝘁𝗼 𝗯𝘂𝗶𝗹𝗱 𝗴𝗼𝗼𝗱 𝗮𝗴𝗲𝗻𝘁𝘀 ⛔

It's trendy to share models "fine-tuned for function calling"; but from my observations, this fine-tuning is not necessary or sufficient to build good agent systems.
To name only a few:
🐦‍⬛ Nexusflow/𝗡𝗲𝘅𝘂𝘀𝗥𝗮𝘃𝗲𝗻-𝗩𝟮-𝟭𝟯𝗕
⌘ CohereForAI/𝗰𝟰𝗮𝗶-𝗰𝗼𝗺𝗺𝗮𝗻𝗱-𝗿-𝗽𝗹𝘂𝘀
⛵️ mistralai/𝗠𝗶𝘅𝘁𝗿𝗮𝗹-𝟴𝘅𝟮𝟮𝗕-𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁-𝘃𝟬.𝟭
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
🧐 Even when used as JSON agents, these fine-tuned models don't perform very well
🏅 𝙂𝙤𝙤𝙙 𝙗𝙖𝙨𝙚 𝙢𝙤𝙙𝙚𝙡𝙨 𝙥𝙚𝙧𝙛𝙤𝙧𝙢 𝙗𝙚𝙩𝙩𝙚𝙧 𝙬𝙞𝙩𝙝𝙤𝙪𝙩 𝙖𝙣𝙮 𝙛𝙞𝙣𝙚-𝙩𝙪𝙣𝙞𝙣𝙜, 𝙟𝙪𝙨𝙩 𝙥𝙡𝙖𝙞𝙣 𝙥𝙧𝙤𝙢𝙥𝙩𝙞𝙣𝙜. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: 𝙰𝚐𝚎𝚗𝚝𝙿𝚊𝚛𝚜𝚒𝚗𝚐𝙴𝚛𝚛𝚘𝚛 and 𝙰𝚐𝚎𝚗𝚝𝙴𝚡𝚎𝚌𝚞𝚝𝚒𝚘𝚗𝙴𝚛𝚛𝚘𝚛 are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!

The hardest thing to get right in an agent is still to 𝙥𝙡𝙖𝙣 𝙜𝙤𝙤𝙙 𝙩𝙖𝙨𝙠-𝙨𝙤𝙡𝙫𝙞𝙣𝙜 𝙩𝙧𝙖𝙟𝙚𝙘𝙩𝙤𝙧𝙞𝙚𝙨 𝙤𝙫𝙚𝙧 𝙨𝙚𝙫𝙚𝙧𝙖𝙡 𝙨𝙩𝙚𝙥𝙨.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra

3 replies

·

posted an update 3 months ago

Post

770

𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 𝐀𝐠𝐞𝐧𝐭𝐬 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐭𝐡𝐞 𝐭𝐨𝐩 𝐨𝐟 𝐆𝐀𝐈𝐀 𝐥𝐞𝐚𝐝𝐞𝐫𝐛𝐨𝐚𝐫𝐝! 🥳

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we went to beat the 𝗚𝗔𝗜𝗔 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ 𝗥𝗲𝗾𝘂𝗶𝗿𝗲 𝘂𝘀𝗶𝗻𝗴 𝘁𝗼𝗼𝗹𝘀, at least a web browser
🔢 𝗥𝗶𝗴𝗼𝗿𝗼𝘂𝘀 𝗹𝗼𝗴𝗶𝗰, many questions having strong math aspects
🖼️ 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹, the agent had to handle all file types: 🔊, 🖼️, 🎬...
👣 𝗠𝘂𝗹𝘁𝗶-𝘀𝘁𝗲𝗽, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(𝘯𝘰 𝘧𝘪𝘭𝘦 𝘢𝘵𝘵𝘢𝘤𝘩𝘦𝘥 𝘰𝘧 𝘤𝘰𝘶𝘳𝘴𝘦, 𝘵𝘩𝘦 𝘢𝘨𝘦𝘯𝘵 𝘩𝘢𝘴 𝘵𝘰 𝘧𝘪𝘯𝘥 𝘢𝘭𝘭 𝘵𝘩𝘦 𝘪𝘯𝘧𝘰)

➡️ We used Transformers Agents' React Code Agent, that writes its actions in code. We created a new planning component that we'll incorporate in the framework. More info soon in a blog post!

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
🚀 Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
🥇 On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

𝙂𝙤 𝙘𝙝𝙚𝙘𝙠 𝙤𝙪𝙩 𝙩𝙝𝙚 𝙡𝙚𝙖𝙙𝙚𝙧𝙗𝙤𝙖𝙧𝙙 👉 gaia-benchmark/leaderboard

2 replies

·

replied to their post 4 months ago

Great idea! Can I build it @victor or you'd like to make it yourself?

posted an update 4 months ago

Post

3119

💰 𝗚𝗲𝘁 𝘁𝗵𝗲 𝗽𝗿𝗶𝗰𝗲 𝗼𝗳 𝗮𝗻𝘆 𝗟𝗟𝗠 𝗔𝗣𝗜 𝗿𝗲𝗾𝘂𝗲𝘀𝘁 ⇒ 𝘁𝗼𝗸𝗲𝗻𝗰𝗼𝘀𝘁

I've just found out about 𝙰𝚐𝚎𝚗𝚝𝙾𝚙𝚜-𝙰𝙸/𝚝𝚘𝚔𝚎𝚗𝚌𝚘𝚜𝚝 (https://github.com/AgentOps-AI/tokencost).
𝗧𝗵𝗶𝘀 𝗹𝗶𝗯𝗿𝗮𝗿𝘆 𝗴𝗶𝘃𝗲𝘀 𝘆𝗼𝘂 𝘁𝗵𝗲 𝗽𝗿𝗶𝗰𝗲 𝗼𝗳 𝘆𝗼𝘂𝗿 𝗰𝗮𝗹𝗹𝘀 𝘁𝗼 𝗮𝗻𝘆 𝗟𝗟𝗠 𝗔𝗣𝗜: OpenAI, Anthropic, Mistral, AWS or Databricks...

For any model, you can use as input either string prompts or messages, and get as outputs either the price or token count.

Congrats to the AgentOps-AI team: this will be very useful when trying to get a ballpark estimate of a project's price, to compare APIs, or for precise monitoring of usage!

✨ Daily reminder: 𝗿𝘂𝗻𝗻𝗶𝗻𝗴 𝗮𝗻 𝗔𝟭𝟬𝟬 𝗰𝗼𝘀𝘁𝘀 𝘆𝗼𝘂 𝗲𝘅𝗮𝗰𝘁𝗹𝘆 $𝟬.𝟬𝟬/𝗵𝗼𝘂𝗿 (or 0.00€ in current exchange rates) on a HF space with ZeroGPU!
Learn more on ZeroGPU 👉 https://www.datacenterdynamics.com/en/news/hugging-face-launches-zerogpu-project-to-democratize-ai-gives-away-10-million-worth-of-compute/

5 replies

·

posted an update 4 months ago

Post

1841

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗮𝗻 𝗮𝗴𝗲𝗻𝘁𝗶𝗰 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄 𝘂𝘀𝗲 𝗶𝘁𝘀 𝗟𝗟𝗠 𝗲𝗻𝗴𝗶𝗻𝗲 𝘁𝗼 𝘀𝗼𝗹𝘃𝗲 𝘁𝗮𝘀𝗸𝘀?

➡️ I made my first ever 𝘮𝘢𝘯𝘪𝘮 video to show just that:

𝗪𝗮𝘁𝗰𝗵 𝗯𝗲𝗹𝗼𝘄 𝗵𝗼𝘄 𝗮 𝗥𝗲𝗮𝗰𝘁 𝗔𝗴𝗲𝗻𝘁 𝘀𝗼𝗹𝘃𝗲𝘀 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝘁𝗮𝘀𝗸, by leveraging its memory to iterate on previous actions! 🎬👇

Read our blog post on Agents: https://huggingface.co/blog/agents

1 reply

·

posted an update 4 months ago

Post

834

𝙒𝙧𝙞𝙩𝙞𝙣𝙜 𝙩𝙤𝙤𝙡 𝙘𝙖𝙡𝙡𝙨 𝙞𝙣 𝙘𝙤𝙙𝙚 𝙟𝙪𝙨𝙩 𝙬𝙤𝙧𝙠𝙨 𝙗𝙚𝙩𝙩𝙚𝙧 𝙩𝙝𝙖𝙣 𝙅𝙎𝙊𝙉 💪

I was really happy to learn today by @sergeipetrov that paper 𝘌𝘹𝘦𝘤𝘶𝘵𝘢𝘣𝘭𝘦 𝘊𝘰𝘥𝘦 𝘈𝘤𝘵𝘪𝘰𝘯𝘴 𝘌𝘭𝘪𝘤𝘪𝘵 𝘉𝘦𝘵𝘵𝘦𝘳 𝘓𝘓𝘔 𝘈𝘨𝘦𝘯𝘵𝘴 was accepted at ICLR 2024!

As a reminder, an agent is a system in which you embed a LLM engine, to let it call tools.

These tools are meant like an IronMan suit, to supplement the LLM in areas that it isn't good at.
🧑‍💻 For instance your friendly LLM may be terrible at calculating powers of floating numbers ("What is X ^0.2947 ?"), so it should use a calculator.
🔎It may be terrible at knowing precise facts ("What was the date of the Golden Bull?") so it should use a web browser.

So the agent system will prompt an agent with "Now you can use these tools: calculator, search,..."

But 𝙝𝙤𝙬 𝙨𝙝𝙤𝙪𝙡𝙙 𝙩𝙝𝙚 𝙖𝙜𝙚𝙣𝙩 𝙚𝙭𝙥𝙧𝙚𝙨𝙨 𝙞𝙩𝙨 𝙖𝙘𝙩𝙞𝙤𝙣𝙨?

All well known frameworks let agents write their actions as JSON strings.

We 𝗽𝗿𝗲𝗳𝗲𝗿𝗿𝗲𝗱 𝘁𝗼 𝗴𝗼 𝘄𝗶𝘁𝗵 𝗳𝗼𝗿𝗺𝘂𝗹𝗮𝘁𝗶𝗻𝗴 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗖𝗼𝗱𝗲, 𝘄𝗵𝗶𝗰𝗵 𝗶𝘀 𝗺𝘂𝗰𝗵 𝗺𝗼𝗿𝗲 𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗮𝗻𝗱 𝗰𝗼𝗻𝗰𝗶𝘀𝗲, 𝗮𝗻𝗱 𝗮𝗹𝗹𝗼𝘄𝘀 𝘁𝗼 𝗰𝗵𝗮𝗶𝗻 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝘀𝗲𝗮𝗺𝗹𝗲𝘀𝘀𝗹𝘆: see the picture attached for an example where Code formulation really shines.

And the paper confirms our choice: researchers show that 𝗰𝗼𝗺𝗽𝗮𝗿𝗲𝗱 𝘁𝗼 𝗝𝗦𝗢𝗡 𝗼𝗿 𝗽𝗹𝗮𝗶𝗻 𝘁𝗲𝘅𝘁, 𝗖𝗼𝗱𝗲 𝗶𝘀 𝗯𝗲𝘁𝘁𝗲𝗿 𝗯𝗼𝘁𝗵 𝗶𝗻 𝗰𝗼𝗻𝗰𝗶𝘀𝗲𝗻𝗲𝘀𝘀 𝗮𝗻𝗱 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲:
➤ Up to 30% fewer steps for the same actions (much more concise)
➤ Up to 20% higher performance on benchmarks

And we find additional benefits, for instance a natural handling of variables.

Read the paper here 📖 Executable Code Actions Elicit Better LLM Agents (2402.01030)
Get your ReactCodeAgent running with our Agents framework! 👉 https://huggingface.co/learn/cookbook/agents

posted an update 4 months ago

Post

996

𝐍𝐞𝐰 𝐠𝐮𝐢𝐝𝐞 𝐢𝐧 𝐨𝐮𝐫 𝐎𝐩𝐞𝐧-𝐒𝐨𝐮𝐫𝐜𝐞 𝐀𝐈 𝐜𝐨𝐨𝐤𝐛𝐨𝐨𝐤: 𝙎𝙩𝙧𝙪𝙘𝙩𝙪𝙧𝙚𝙙 𝙜𝙚𝙣𝙚𝙧𝙖𝙩𝙞𝙤𝙣! ✨

Many use LLM use cases involve generating outputs with a specific structure.

➡️ For instance when using an LLM as a judge to evaluate another model's outputs, you need it to give you not only a score, but also the rationale for this score, and maybe a confidence level.
So you do not need only "score: 1", but more a dictionary like:

{
     "rationale": "The answer does not match the true answer at all."
     "score": 1,
     "confidence_level": 0.85
}

🤔 How to force your LLM to generate such a structured output?

🏗️ 𝗖𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 is a great technique to generate structured output: you can specify a grammar (=set of rules) that the output should follow, and 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘁𝗵𝗲𝗻 𝗳𝗼𝗿𝗰𝗲𝘀 𝘁𝗵𝗲 𝗱𝗲𝗰𝗼𝗱𝗲𝗿 𝘁𝗼 𝗼𝗻𝗹𝘆 𝗽𝗶𝗰𝗸 𝘁𝗼𝗸𝗲𝗻𝘀 𝘁𝗵𝗮𝘁 𝗿𝗲𝘀𝗽𝗲𝗰𝘁 𝘆𝗼𝘂𝗿 𝗴𝗿𝗮𝗺𝗺𝗮𝗿.

I've created a guide to show you how to use it, both via our Inference API and locally using 𝘰𝘶𝘵𝘭𝘪𝘯𝘦𝘴!

👉 Read it here: https://huggingface.co/learn/cookbook/structured_generation

Thank you @stevhliu for your great help in improving it!

posted an update 5 months ago

Post

2765

💰❌ 𝐑𝐞𝐬𝐞𝐚𝐫𝐜𝐡 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐯𝐞𝐫𝐲 𝐆𝐏𝐔 𝐏𝐨𝐨𝐫 - 𝐒𝐜𝐚𝐥𝐢𝐧𝐠 𝐥𝐚𝐰𝐬 𝐫𝐞𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧

🎆 Good news: 𝘆𝗼𝘂 𝗰𝗮𝗻 𝗱𝗼 𝗰𝘂𝘁𝘁𝗶𝗻𝗴-𝗲𝗱𝗴𝗲 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝘄𝗶𝘁𝗵 𝗮 𝗰𝗮𝗹𝗰𝘂𝗹𝗮𝘁𝗼𝗿 𝗮𝗻𝗱 𝗠𝗶𝗰𝗿𝗼𝘀𝗼𝗳𝘁 𝗣𝗮𝗶𝗻𝘁 𝟮𝟬𝟬𝟲!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings with models >1B parameters (I do not want to imagine how much that cost) to 𝗳𝗶𝗻𝗱 𝘁𝗵𝗲 𝗼𝗽𝘁𝗶𝗺𝗮𝗹 𝗿𝗮𝘁𝗶𝗼 𝗼𝗳 𝗺𝗼𝗱𝗲𝗹 𝘀𝗶𝘇𝗲 𝘃𝘀 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘁𝗼𝗸𝗲𝗻𝘀. Why is this question so important?
Well, you only ever have access to a fixed compute, counted in FLOPs (floating point operations). So if your model is bigger, you will have less compute to train on many tokens, and if you want to train on more tokens, your model will be smaller. When model trainings cost million, you absolutely need to get this right.

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets on on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to directly run their computations from DeepMind's own experiments! They took the figure from the last experiment (cf slide below), measured point positions, picked color codes, and ended up reconstructing the underlying data.

💥 They then just fit the scaling laws proposed by the Chinchilla Authors, but arrived at wildly different results! They find that as a rough rule of thumb, you should use 20 training tokens for each parameter in your model, instead of the 70 obtained in the original paper. They also point out inconsistencies in the paper, and unrealistically narrow confidence intervals.

➡️ This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

✅ But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial costs that such an error can have!

posted an update 6 months ago

Post

2461

𝐏𝐚𝐩𝐞𝐫 𝐑𝐞𝐯𝐢𝐞𝐰: 𝐑𝐡𝐨-𝟏 - 𝐃𝐨 𝐧𝐨𝐭 𝐮𝐬𝐞 𝐚𝐥𝐥 𝐭𝐨𝐤𝐞𝐧𝐬 𝐞𝐪𝐮𝐚𝐥𝐥𝐲 𝐢𝐧 𝐲𝐨𝐮𝐫 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠! ⚖️⛔️

A new paper topping Daily papers questions a hidden assumption in LLM training:

🤔 𝙎𝙝𝙤𝙪𝙡𝙙 𝙬𝙚 𝙧𝙚𝙖𝙡𝙡𝙮 𝙪𝙨𝙚 𝙖𝙡𝙡 𝙩𝙤𝙠𝙚𝙣𝙨 𝙚𝙦𝙪𝙖𝙡𝙡𝙮 𝙞𝙣 𝙤𝙪𝙧 𝙇𝙇𝙈'𝙨 𝙩𝙧𝙖𝙞𝙣𝙞𝙣𝙜 ?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of 𝘚𝘰𝘭𝘪𝘥𝘎𝘰𝘭𝘥𝘔𝘢𝘨𝘪𝘬𝘢𝘳𝘱).

So this paper introduces 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝘃𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴, which is actually really simple:
➡️ A specific metric measures the relevance of each token. Then during training, only the top k% tokens for this relevance metric count in the loss calculation.

Authors test this method by training models on the difficult MATH dataset (only competition mathematics problems).

➡️ Their technique seems like a new must-do in LLM training: Training is much faster and reaches an impressive performance!

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
◆ ⏱️ Training is x5 to x10 faster to reach equivalent performance compared to standard language modeling.
◆ 💪 Their 1B model achieves close to GPT4 Chain-of-Thought performance on MATH!
◆ 🚀 Their 7B model match performance of the state-of-the-art DeepSeek for the same size, while trained on only 3% of tokens

𝐀𝐝𝐝𝐢𝐭𝐢𝐨𝐧𝐚𝐥 𝐢𝐧𝐬𝐢𝐠𝐡𝐭𝐬 💡
◆ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens 😖
◆ Authors show that when you reduce loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! ✅

Find great reads in @akhaliq 's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7

posted an update 6 months ago

Post

2169

𝗡𝗲𝘄 𝗦𝗽𝗮𝗰𝗲: 𝘼𝙄 𝙏𝙧𝙖𝙫𝙚𝙡 𝙥𝙡𝙖𝙣𝙣𝙚𝙧 🗺️🏕️ Plan your next vacation in a few minutes!

I wanted to try out if a powerful LLM like Mixtral-8x7b had geographical reasoning capabilities.
So I built a small space that prompts the LLM to provide a JSON list of places based on a user input.

And the result was impressive! 🤯

⇒ 𝗜𝘁 𝘀𝗲𝗲𝗺𝘀 𝗹𝗶𝗸𝗲 𝗠𝗶𝘅𝘁𝗿𝗮𝗹 𝗵𝗮𝘀 𝗮 𝗴𝗿𝗮𝘀𝗽 𝗼𝗳 𝗴𝗲𝗼𝗴𝗿𝗮𝗽𝗵𝗶𝗰𝗮𝗹 𝗰𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝗹𝗶𝗸𝗲 𝗡𝗼𝗿𝘁𝗵 - 𝗦𝗼𝘂𝘁𝗵, 𝗼𝗿 𝘀𝗽𝗮𝘁𝗶𝗮𝗹 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁.🧭 Not just describing these concepts, but really applying them in practice, for instance to successfully answer "give me 4 European cities that are aligned on the map". This is a 𝗻𝗶𝗰𝗲 𝗲𝘅𝗮𝗺𝗽𝗹𝗲 𝗼𝗳 𝗮𝗻 𝗲𝗺𝗲𝗿𝗴𝗲𝗻𝘁 𝗰𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝘆, since nothing in the LLM's training data should prepare it for this specific task.

Anyway, I added API calls and a nice visualization on top of the LLM, streaming output, caching for the answers and locations... and ta-da! ✨ I got the 𝗔𝗜 𝗧𝗿𝗮𝘃𝗲𝗹 𝗣𝗹𝗮𝗻𝗻𝗲𝗿.

𝙔𝙤𝙪 𝙘𝙖𝙣 𝙙𝙚𝙨𝙘𝙧𝙞𝙗𝙚 𝙞𝙩 𝙮𝙤𝙪𝙧 𝙩𝙧𝙞𝙥, 𝙖𝙣𝙙 𝙞𝙩 𝙬𝙞𝙡𝙡 𝙘𝙤𝙢𝙚 𝙪𝙥 𝙬𝙞𝙩𝙝 𝙣𝙞𝙘𝙚 𝙖𝙣𝙙 𝙘𝙤𝙣𝙫𝙚𝙣𝙞𝙚𝙣𝙩 𝙡𝙤𝙘𝙖𝙩𝙞𝙤𝙣𝙨!

𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚 👉 m-ric/ai-travel-planner

Thank you @freddyaboulton for the 𝚐𝚛𝚊𝚍𝚒𝚘_𝚏𝚘𝚕𝚒𝚞𝚖 component, and @clem , @pngwn , @abidlabs for your ideas and support!

1 reply

·

posted an update 6 months ago

Post

2068

[𝐍𝐞𝐰 𝐏𝐚𝐩𝐞𝐫] 𝐀𝐥𝐥 𝐭𝐨𝐤𝐞𝐧𝐬 𝐬𝐡𝐨𝐮𝐥𝐝 𝐧𝐨𝐭 𝐫𝐞𝐪𝐮𝐢𝐫𝐞 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐞𝐟𝐟𝐨𝐫𝐭 𝐭𝐨 𝐜𝐨𝐦𝐩𝐮𝐭𝐞! ⇒ 𝐌𝐢𝐱𝐭𝐮𝐫𝐞 𝐨𝐟 𝐝𝐞𝐩𝐭𝐡𝐬 🫧🐠

Google Researchers were unhappy with the way current decoding generally works: all tokens go through the same layers, thus requiring exactly the same effort to compute.

Whereas in reality, completing the answer to a difficult math problem for instance should be more computationally intense than completing the text of the Declaration of Independence: 𝗻𝗼𝘁 𝗮𝗹𝗹 𝘁𝗼𝗸𝗲𝗻𝘀 𝗮𝗿𝗲 𝗰𝗿𝗲𝗮𝘁𝗲𝗱 𝗲𝗾𝘂𝗮𝗹!

➡️ 𝗧𝗵𝗲𝘆 𝗵𝗮𝗱 𝘁𝗵𝗶𝘀 𝗴𝗲𝗻𝗶𝘂𝘀 𝗶𝗱𝗲𝗮: 💡 𝗵𝗮𝘃𝗶𝗻𝗴 𝗮 𝘁𝗼𝗸𝗲𝗻 𝗴𝗼 𝘁𝗵𝗿𝗼𝘂𝗴𝗵 𝗮 𝗯𝗹𝗼𝗰𝗸 𝘀𝗵𝗼𝘂𝗹𝗱 𝗯𝗲 𝗼𝗽𝘁𝗶𝗼𝗻𝗮𝗹. The token can go through the block (thus undergoing expensive self-attention computation) or avoid it through a skip connection.
The routing decision is taken on the block level: each block selects from the total sequence the top-k tokens that will go through it, and the others tokens will skip it. 𝘛𝘩𝘪𝘴 𝘢𝘭𝘭𝘰𝘸𝘴 𝘵𝘰 𝘤𝘩𝘰𝘰𝘴𝘦 𝘵𝘩𝘦 𝘦𝘹𝘢𝘤𝘵 𝙘𝙖𝙥𝙖𝙘𝙞𝙩𝙮 𝘰𝘧 𝘢 𝘣𝘭𝘰𝘤𝘬, 𝘪.𝘦. 𝘵𝘩𝘦 𝘱𝘳𝘰𝘱𝘰𝘳𝘵𝘪𝘰𝘯 𝘰𝘧 𝘵𝘰𝘬𝘦𝘯𝘴 𝘵𝘩𝘢𝘵 𝘨𝘰 𝘵𝘩𝘳𝘰𝘶𝘨𝘩 𝘪𝘵, 𝘸𝘩𝘪𝘤𝘩 𝘥𝘪𝘳𝘦𝘤𝘵𝘭𝘺 𝘪𝘯𝘧𝘭𝘶𝘦𝘯𝘤𝘦𝘴 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘪𝘯𝘵𝘦𝘯𝘴𝘪𝘵𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘧𝘰𝘳𝘸𝘢𝘳𝘥 𝘱𝘢𝘴𝘴.

This yields Mixture-of-Depths (MoD), with spectacular results.

✨ 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:
🎚️ 𝗖𝗮𝗽𝗮𝗰𝗶𝘁𝘆 𝗰𝗮𝗻 𝗯𝗲 𝘁𝘂𝗻𝗲𝗱 𝗮𝗹𝗹 𝘁𝗵𝗲 𝘄𝗮𝘆 𝗱𝗼𝘄𝗻 𝘁𝗼 𝟭𝟮.𝟱% for every second block: thus 87.5% of tokens just skip the block!
🚀 For the same training time and performance, >𝟲𝟬% 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝗽𝗲𝗲𝗱!
🤝 𝗖𝗮𝗻 𝗯𝗲 𝗰𝗼𝗺𝗯𝗶𝗻𝗲𝗱 𝘄𝗶𝘁𝗵 𝗠𝗶𝘅𝘁𝘂𝗿𝗲-𝗼𝗳-𝗘𝘅𝗽𝗲𝗿𝘁𝘀 for further improvements.

📄 𝗣𝗮𝗽𝗲𝗿 𝗵𝗲𝗿𝗲 👉 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)
📚 I added it to my paper collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7

1 reply

·

posted an update 6 months ago

Post

1877

𝟐𝟎𝟐𝟒, 𝐭𝐡𝐞 𝐲𝐞𝐚𝐫 𝐨𝐟 𝐚𝐠𝐞𝐧𝐭 𝐰𝐨𝐫𝐤𝐟𝐥𝐨𝐰𝐬 🔧🦾🤖

I've just watched Andrew Ng's talk at Sequoia last week.
If you're interested in Agents, you should really watch it!

𝗪𝗵𝘆 𝘂𝘀𝗲 𝗮𝗴𝗲𝗻𝘁 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀?
The current LLM task solving workflow is not very intuitive:
We ask it “write an essay all in one shot, without ever using backspace.”

Why not allow the LLM a more similar process to what we would do?
- “Write an essay outline”
- “Do you need wen research?”
- “Write a first draft”
- “Consider improvements”
…

This is called an Agentic workflow. Existing ones bring a huge performance boost. With HumanEval: GPT-4 zero-shot gets 67% score, agentic with either one of tool use or reflection goes over 90%, and the combination of the two scores even higher!

𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗱𝗲𝘀𝗶𝗴𝗻 𝗽𝗮𝘁𝘁𝗲𝗿𝗻𝘀
On the following two points, the tech is robust:

⚙️ 𝗥𝗲𝗳𝗹𝗲𝘅𝗶𝗼𝗻: For instance: add a critic step after the writing step
🛠️ 𝗧𝗼𝗼𝗹 𝘂𝘀𝗲: extends the capabilities of the LLM by allowing it to call tools, like search or calculator

The next two will be needed to go further, but the tech for them is more emerging and not reliable yet:
🗺️ 𝗣𝗹𝗮𝗻𝗻𝗶𝗻𝗴 forward to decompose task into subtasks. This allows great behaviours like an AI Agent re-routing after a failure
🐝 𝗠𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝗰𝗼𝗹𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗶𝗼𝗻: Program a flock of agents with tasks.
Improving the two above points will unlock huge performance boosts!

Andrew NG says Research agents are already part of his workflow!

𝗖𝗹𝗼𝘀𝗶𝗻𝗴 𝘁𝗵𝗼𝘂𝗴𝗵𝘁𝘀
Andrew speculates that through agentic workflows, maybe generating many tokens fast from a small LLM will give better results than slower throughput from a powerful LLM like GPT-5.

🎬 Watch the talk here 👉 https://www.youtube.com/watch?v=sal78ACtGTc
📚 I've added his recommended reads to m-ric/agents-65ba776fbd9e29f771c07d4e

1 reply

·

posted an update 6 months ago

Post

1796

𝐓𝐡𝐞 𝐫𝐞𝐭𝐮𝐫𝐧 𝐨𝐟 𝐭𝐡𝐞 𝐑𝐍𝐍𝐬 ⚔ 𝐍𝐞𝐰 𝐌𝐚𝐦𝐛𝐚-𝐛𝐚𝐬𝐞𝐝 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 "𝐉𝐚𝐦𝐛𝐚"

Since the release of BERT by Google in 2019, Transformers architecture have taken over machine learning thanks to their 𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺, that gives them the ability to focus on important points of the input. But 𝙖𝙩𝙩𝙚𝙣𝙩𝙞𝙤𝙣 𝙘𝙤𝙢𝙥𝙪𝙩𝙖𝙩𝙞𝙤𝙣 𝙞𝙨 𝙦𝙪𝙖𝙙𝙧𝙖𝙩𝙞𝙘 𝙞𝙣 𝙩𝙝𝙚 𝙞𝙣𝙥𝙪𝙩 𝙡𝙚𝙣𝙜𝙩𝙝.

💫 The Mamba paper, published in December 2023, announced the return of the RNNs: it has no attention, but integrates a selection mechanism, which should be able to reproduce the “focus” ability of attention, in an architecture for which the compute requirements 𝗴𝗿𝗼𝘄 𝗼𝗻𝗹𝘆 𝗹𝗶𝗻𝗲𝗮𝗿𝗹𝘆 𝗶𝗻 𝗶𝗻𝗽𝘂𝘁 𝗹𝗲𝗻𝗴𝘁𝗵!
🤔 Would this work? We had yet to see a large Mamba model recovering the performance of Attention-based Transformers.

💥 But now it's done! A (Mamba + Transformers) hybrid just beat Transformers!

The AI21 Labs team just released Jamba.
They insert a few Transformer layers to inject some attention in a big pile of Mamba layers, thus getting the best of both worlds.

𝙏𝙇;𝘿𝙍:
🏗️ 𝗡𝗲𝘄 𝗠𝗼𝗘 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲: 4 Jamba blocks, each of these being 7 Mamba layers for 1 Transformer.
🏋️ 𝟱𝟮𝗕 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿𝘀, 𝟭𝟮𝗕 𝗮𝗰𝘁𝗶𝘃𝗲 𝗮𝘁 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲: This reduction is enabled by Mixture of Experts, and similar to Mixtral (47B parameters - 13B active).
🏎️ 𝗦𝗽𝗲𝗲𝗱: 𝘅𝟯 𝘁𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁. Jamba is much faster than similar-sized Transformer models on long contexts.
📏 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗹𝗲𝗻𝗴𝘁𝗵: 𝟭𝟰𝟬𝗞 𝘁𝗼𝗸𝗲𝗻𝘀 on a single 80GB A100!
💪 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: 𝘀𝘁𝗮𝘁𝗲-𝗼𝗳-𝘁𝗵𝗲-𝗮𝗿𝘁 𝗳𝗼𝗿 𝘁𝗵𝗶𝘀 𝘀𝗶𝘇𝗲. The small injection of attention seems sufficient since Jamba beats the open-source reference Mixtral-8x7B on many benchmarks!

Try it here 👉 ai21labs/Jamba-v0.1

posted an update 6 months ago

Post

1700

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗯𝗲𝗮𝗺 𝘀𝗲𝗮𝗿𝗰𝗵 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘄𝗼𝗿𝗸? ➡️ 𝙉𝙚𝙬 𝙫𝙞𝙨𝙪𝙖𝙡𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙩𝙤𝙤𝙡! 👀

In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

𝐒𝐨 𝐡𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐧𝐞𝐱𝐭 𝐭𝐨𝐤𝐞𝐧 𝐬𝐞𝐥𝐞𝐜𝐭𝐞𝐝?

📊 Given its input sentence like "𝘞𝘩𝘢𝘵 𝘪𝘴 𝘵𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳? 𝘛𝘩𝘦 7𝘵𝘩 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "𝙞𝙨" gets score 0.56, and "𝙘𝙖𝙣" gets score 0.35.

🤑 𝐆𝐫𝐞𝐞𝐝𝐲 𝐝𝐞𝐜𝐨𝐝𝐢𝐧𝐠 is the naive option where you simply take the next most probable token at each step. But this creates paths that maximize very short-term rewards, thus may overlook better paths for the long term (like this time when you played FIFA all evening and arrived unprepared to your school exam on the next day).
In our example, the next highest score token might be "𝙞𝙨", but this will strongly bias the LLM towards giving an hasty response. On the opposite, starting with "𝙘𝙖𝙣" could have been completed with "𝘣𝘦 𝘰𝘣𝘵𝘢𝘪𝘯𝘦𝘥 𝘧𝘳𝘰𝘮 𝘤𝘰𝘮𝘱𝘶𝘵𝘪𝘯𝘨 𝘱𝘳𝘦𝘷𝘪𝘰𝘶𝘴 𝘍𝘪𝘣𝘰𝘯𝘢𝘤𝘤𝘪 𝘯𝘶𝘮𝘣𝘦𝘳𝘴 𝘧𝘪𝘳𝘴𝘵", which steers the LLM towards a correct reasoning!

🗺️ 𝐁𝐞𝐚𝐦 𝐬𝐞𝐚𝐫𝐜𝐡 improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "𝙞𝙨" and the "𝙘𝙖𝙣" completion could be tested. ✅

👉 I've created a tool to let you visualize it, thank you @joaogante for your great help!
𝙏𝙧𝙮 𝙞𝙩 𝙝𝙚𝙧𝙚: m-ric/beam_search_visualizer

posted an update 7 months ago

Post

2036

𝗨𝘀𝗶𝗻𝗴 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 🧑‍⚖️ 𝗳𝗼𝗿 𝗮𝗻 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗮𝗻𝗱 𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

Evaluating LLM outputs is often hard, since many tasks require open-ended answers for which no deterministic metrics work: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is then human evaluation, but it is very time-consuming, thus costly.

🤔 Then 𝘄𝗵𝘆 𝗻𝗼𝘁 𝗮𝘀𝗸 𝗮𝗻𝗼𝘁𝗵𝗲𝗿 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻, by providing it relevant rating criteria? 👉 This is the idea behind LLM-as-a-judge.

⚙️ To implement a LLM judge correctly, you need a few tricks.
✅ So 𝗜'𝘃𝗲 𝗷𝘂𝘀𝘁 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱 𝗮 𝗻𝗲𝘄 𝗻𝗼𝘁𝗲𝗯𝗼𝗼𝗸 𝘀𝗵𝗼𝘄𝗶𝗻𝗴 𝗵𝗼𝘄 𝘁𝗼 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗶𝘁 𝗽𝗿𝗼𝗽𝗲𝗿𝗹𝘆 𝗶𝗻 𝗼𝘂𝗿 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲 𝗖𝗼𝗼𝗸𝗯𝗼𝗼𝗸! (you can run it instantly in Google Colab)
➡️ 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸: https://huggingface.co/learn/cookbook/llm_judge

The Cookbook is a great collection of notebooks demonstrating recipes (thus the "cookbook") for common LLM usages. I recommend you to go take a look!
➡️ 𝗔𝗹𝗹 𝗰𝗼𝗼𝗸𝗯𝗼𝗼𝗸𝘀: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!

2 replies

·

posted an update 7 months ago

Post

Interesting paper: 𝐆𝐚𝐋𝐨𝐫𝐞: 𝐭𝐫𝐚𝐢𝐧 𝟕𝐁 𝐦𝐨𝐝𝐞𝐥𝐬 𝐨𝐧 𝐜𝐨𝐧𝐬𝐮𝐦𝐞𝐫-𝐠𝐫𝐚𝐝𝐞 𝐆𝐏𝐔𝐬 💪
It's now possible to 𝙛𝙪𝙡𝙡𝙮 𝙥𝙧𝙚-𝙩𝙧𝙖𝙞𝙣 a 7B model on a consumer-grade GPU of 24Gb RAM, without any performance loss!

The memory usage of training models has always been an acute issue. For instance full pre-training of a 7B model used to eat ~50Gb of RAM!

The common workarounds to reduce memory load are:
- separate models on multiple GPUs ("sharding")
- quantize models: encode weights on fewer bits

Another technique is to 𝙥𝙧𝙤𝙟𝙚𝙘𝙩 𝙩𝙝𝙚 𝙬𝙚𝙞𝙜𝙝𝙩 𝙢𝙖𝙩𝙧𝙞𝙭 𝙩𝙤 𝙡𝙤𝙬𝙚𝙧-𝙧𝙖𝙣𝙠 𝙨𝙥𝙖𝙘𝙚𝙨, (since sometimes the weights do not really vary on all dimensions): this can save a lot of space!
This low-rank projection can be done on adapters to preserve the original weights (go check out LoRA), but it still generally hurts the performance too much for pre-training.

➡️ Enter the authors of 𝘎𝘢𝘓𝘰𝘳𝘦: 𝘔𝘦𝘮𝘰𝘳𝘺-𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵 𝘓𝘓𝘔 𝘛𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘣𝘺 𝘎𝘳𝘢𝘥𝘪𝘦𝘯𝘵 𝘓𝘰𝘸-𝘙𝘢𝘯𝘬 𝘗𝘳𝘰𝘫𝘦𝘤𝘵𝘪𝘰𝘯. They gather (and prove) interesting insights:
⛔ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!

Based on these insights, 𝘁𝗵𝗲𝘆 𝗯𝘂𝗶𝗹𝗱 𝗚𝗮𝗟𝗼𝗿𝗲, that projects the gradient to lower ranks.
🗺️ 𝗚𝗿𝗲𝗮𝘁 𝗶𝗱𝗲𝗮: to leave the optimization free to explore more space, they periodically re-build the low-rank projection throughout the training (a nice illustration is in the paper).

🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8-bit).

➡️ 𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
📉 Of course, huge reduction in memory footprint allowing the training on consumer-grade GPU (cf figure).
💪 No reduction in performance: this scales well up to 7B parameters (and was independently confirmed since) ⇒ this is essential, it confirms that the method is viable!

Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)

posted an update 8 months ago

Post

📚🔎 If you're building RAG applications, you should check this out:

⚙️ I've built a new space to let you visualize the chunks you get with different text splitting methods!

➡️ Visualize your chunks here:
m-ric/chunk_visualizer

2 replies

·

Aymeric Roucher

AI & ML interests

Articles

Our Transformers Code Agent beats the GAIA benchmark!

Extracting Concepts from LLMs: Anthropic’s recent discoveries 📖

License to Call: Introducing Transformers Agents 2.0

Open-source LLMs as LangChain Agents

Organizations

m-ric's activity