David Berenstein
davidberenstein1957's activity
Dataset: argilla/FinePersonas-v0.1
Example usage: https://distilabel.argilla.io/dev/sections/pipeline_samples/examples/fine_personas_social_network/
The first meetup coming up: setting up a text classification project using Argilla and SetFit!
Deploy Argilla on Spaces
Vibe check your dataset
Configure and create an Argilla dataset
Add records
Add zero-shot suggestions
Evaluate model suggestions in Argilla
Train a SetFit model
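To give a flavour of the final step, here's a minimal sketch of training a SetFit model on a couple of labeled examples (the texts and labels are illustrative; in the meetup the training data comes from Argilla):

from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Tiny illustrative training set; in practice this comes from the Argilla dataset
train_ds = Dataset.from_dict({
    "text": [
        "The product broke after two days",
        "Absolutely love it, works like a charm",
    ],
    "label": [0, 1],  # 0 = negative, 1 = positive
})

# Any Sentence Transformers checkpoint can serve as the SetFit backbone
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

print(model.predict(["This is amazing!"]))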
Hope to see you all there; we're looking forward to your questions and AI use cases. Don't be shy about bringing your own issues and questions to the table. We would love to answer them.
Sign up here: https://lu.ma/31mecp34
Notebook: https://colab.research.google.com/drive/1nHNXUbgwRMyjFeBZbQvNLqaXGkqE4Wcs?usp=sharing
Model: hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF
Dataset: argilla/FinePersonas-v0.1
Library: https://github.com/argilla-io/distilabel
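If you'd rather run the quantized model locally than in the notebook, a minimal llama-cpp-python sketch could look like this (the GGUF filename pattern is an assumption; check the model repo for the exact file name):

from llama_cpp import Llama

# Downloads the GGUF weights from the Hub and loads them with llama.cpp
llm = Llama.from_pretrained(
    repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
    filename="*q8_0.gguf",  # glob pattern, assumed to match a single file
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Generate a persona for a data engineer."}]
)
print(out["choices"][0]["message"]["content"])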
We're thrilled to announce the release of Argilla 2.2.0, packed with powerful new features to enhance your data annotation and LLM workflow:
🗨️ ChatField: Work with text conversations natively in Argilla. Perfect for building datasets for conversational LLMs!
⚖️ Adjustable Task Distribution: Modify settings on the fly and automatically recalculate completed and pending records.
📊 Progress Tracking: Monitor annotation progress directly from the SDK, including user-specific metrics.
🧠 Automatic Settings Inference: Importing datasets from the Hugging Face Hub just got easier with automatic settings detection.
📋 Task Templates: Jump-start your projects with pre-built templates for common dataset types.
🔧 Background Jobs Support: Improved performance for long-running tasks (requires Redis).
Upgrade now and supercharge your data workflows!
Check out our full changelog for more details: https://github.com/argilla-io/argilla/compare/v2.1.0...v2.2.0
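To give a feel for the new ChatField, here's a minimal sketch of configuring and logging a conversational dataset (the dataset name, question, and credentials are illustrative):

import argilla as rg

client = rg.Argilla(api_url="<your_api_url>", api_key="<your_api_key>")

settings = rg.Settings(
    fields=[rg.ChatField(name="chat")],  # renders conversations natively
    questions=[rg.RatingQuestion(name="quality", values=[1, 2, 3, 4, 5])],
)
dataset = rg.Dataset(name="conversations", settings=settings)
dataset.create()

# Chat values are lists of role/content messages
dataset.records.log([
    {"chat": [
        {"role": "user", "content": "What is Argilla?"},
        {"role": "assistant", "content": "An open-source tool for dataset curation."},
    ]}
])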
⚙️ Workflow
- Write down your custom GenAI usecase
- Automatically generate system prompts
- Create sample datasets for quick iteration
- Produce full-scale datasets with customizable parameters
- Push generated datasets directly to the Hugging Face Hub
⚡️ Powered by Argilla's distilabel and open source LLMs
🆓 Uses Free Serverless HF Inference Endpoints
💡 Use Cases:
- Fine-tuning language models for specific domains
- Creating diverse datasets for robust model training
- Rapid prototyping of AI applications
- Generating synthetic data for privacy-sensitive projects
🚀 Start crafting your custom datasets today and do it quicker, easier and more privately with distilabel DataCraft!
https://huggingface.co/spaces/argilla/distilabel-datacraft
davidberenstein1957/text-to-sql-hub-datasets
We've been doing some interviews with community members to understand the needs surrounding synthetic data. Many thanks to the participants. Note that the interviewees were sourced from our community, so the results will likely reflect that group.
Things distilabel does well
- security and reliability, thanks to cached generations and serializable pipelines
- scaling up generation by parallelising inference, with Anyscale Ray support
- solid implementations of state-of-the-art research papers
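To make the serialization point concrete, a minimal distilabel pipeline can be saved to YAML and re-run later, reusing cached generations (repo and model ids are illustrative, and the exact save call may differ slightly between versions):

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="demo") as pipeline:
    # Expects a dataset with an "instruction" column
    load = LoadDataFromHub(repo_id="<your-prompts-dataset>", split="train")
    gen = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    )
    load >> gen

pipeline.save("pipeline.yaml")  # serializable: share it or re-run it via the CLI
distiset = pipeline.run()       # re-running reuses cached generations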
Things to improve
- communication about the fact that we support structured generation
- customization of existing prompt implementations is difficult
- creation of new tasks proves difficult
- arguments and parameters for tasks aren't visible at first glance
- the learning curve can be steep
- more tutorials that reflect real-life usage
Things to note
- people create datasets at both small and large scale, up to millions of records
- people use synthetic data to move away from frontier model providers
- people mostly use 7B or 70B models for generation
Participate here: https://github.com/argilla-io/distilabel/issues
With the rise of recent interest in Vision Language Models (VLMs), we decided to make a push to include an ImageField within Argilla! This means any open source developer can now work on better models for vision ML tasks too, and we would like to show you how.
We would love to introduce this new feature to you, so we've prepared a set of notebooks to go over some common image scenarios:
fine-tune a CLIP retrieval model with Sentence Transformers
use ColPali + Qwen VL for RAG and log the results to Argilla
image-generation preference: creating multi-modal preference datasets for free using Hugging Face Inference Endpoints
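As a small preview of the new ImageField, logging an image record to Argilla looks roughly like this (names, credentials, and the image URL are illustrative):

import argilla as rg

client = rg.Argilla(api_url="<your_api_url>", api_key="<your_api_key>")

settings = rg.Settings(
    fields=[rg.ImageField(name="image")],
    questions=[rg.TextQuestion(name="caption")],
)
dataset = rg.Dataset(name="image-captioning", settings=settings)
dataset.create()

# Images can be passed as URLs, among other formats
dataset.records.log([{"image": "https://example.com/cat.png"}])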
See you on Thursday!
https://lu.ma/x7id1jqu
🖼️ Image Field: Seamlessly work with multimodal datasets
🌙 Dark Mode: Reduce eye strain with our sleek new look
🤗 Enhanced Hugging Face Hub import with the SDK
🇪🇸 Spanish UI: Breaking language barriers
Plus more improvements to supercharge your model curation workflow!
Check out the full announcement for details and code examples: https://github.com/argilla-io/argilla/compare/v2.0.1...v2.1.0
Why? You don't need anything fancy or formal just to iterate on your data and get familiar with your prompts and the produced data. Under the hood, it relies on Hugging Face Inference Endpoints and the latest LLMs and VLMs, like Meta Llama 3.1 and Black Forest Labs' Flux models.
It's an addition to the other interfaces that are already supported:
- CollectorInterface: Lazily collect data of model interactions without human annotation.
- AnnotatorInterface: Walk through your data and annotate it with models in the loop.
- Synthesizer: Synthesize data with distilabel in the loop.
- BulkInterface: Explore your data distribution and annotate in bulk.
⭐️ Give some good vibes: https://github.com/davidberenstein1957/dataset-viber
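Going by the repo's README, usage looks roughly like the following; treat it as a sketch, since the API of a young project like this may shift:

from dataset_viber import AnnotatorInterFace  # capitalization as in the package

interface = AnnotatorInterFace.for_text_classification(
    texts=["I really like this movie", "The plot was disappointing"],
    labels=["positive", "negative"],
)
interface.launch()  # launches a Gradio app, also works inside a notebook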
Released an initial version a while ago
Archived it because of a cleaner solution described in a blog by Philipp Schmid
Reimplemented it based on that cleaner solution
Unarchived the project
Packaged it up
Released a 0.5 version
pip install fast-sentence-transformers
https://github.com/davidberenstein1957/fast-sentence-transformers
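It's meant as a drop-in replacement for sentence-transformers; usage is roughly as follows (the model name is illustrative):

from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# Same encode() interface as sentence-transformers, but backed by ONNX
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["hello world", "fast inference with ONNX"])
print(embeddings.shape)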
Want to see more? ⭐️ the repo https://github.com/davidberenstein1957/dataset-viber
Some new features!
- manual import from a CSV or the Hugging Face Hub
- manual export to CSV or the Hub
- improved automated export to the Hub and CSV
- limit interaction with specific components
- stream data with custom next_input features (shout-out to Ben Burtenshaw for the suggestions)
- model in-the-loop support for all tasks
dataset-viber/gradio-annotators-66c5ce73d5e3bf99caa445b1
In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You'll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.
This session is perfect for you:
- if you're getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for **free**
- if you want to discover new functionalities
- if you want to provide us with new feedback
Sign up here: https://lu.ma/dt0c7jru
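For a taste of the second use case, here's a hedged sketch of generating candidate pairs for DPO/ORPO with distilabel and a free serverless Inference Endpoint (repo and model ids are illustrative):

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="preference-demo") as pipeline:
    # Expects a dataset with an "instruction" column
    load = LoadDataFromHub(repo_id="<your-prompts-dataset>", split="train")
    gen = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        num_generations=2,       # two candidates per prompt...
        group_generations=True,  # ...grouped so they can be ranked as a pair
    )
    load >> gen

distiset = pipeline.run()
distiset.push_to_hub("<your-username>/preference-pairs")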
I've cooked up Dataset Viber, a set of cool tools designed to make data preparation for AI models easier, more approachable and enjoyable for standalone AI engineers and enthusiasts.
🧰 What Dataset Viber offers:
- CollectorInterface: Lazily collect model interaction data without human annotation
- AnnotatorInterface: Annotate your data with models in the loop
- BulkInterface: Explore data distribution and annotate in bulk
- Embedder: Efficiently embed data with ONNX-optimized speeds
🎯 Key features:
- Supports various tasks for text, chat, and image modalities
- Runs in .ipynb notebooks
- Logs data to local CSV or directly to Hugging Face Hub
- Easy to install via pip:
pip install dataset-viber
It's not designed for team collaboration or production use, but rather as a fun and efficient toolkit for individual projects.
Want to give it a try? Check out the repository link https://github.com/davidberenstein1957/dataset-viber/.
I'm excited to hear your feedback and learn how you vibe with your data. Feel free to open an issue or reach out if you have any questions or suggestions!
Some shoutouts:
- Gradio for the amazing backbone
- Daniel van Strien for some initial presentations I did on vibe checks
- Emily Omier for the workshop on structuring GitHub repo READMEs
- Hamel Husain for repeatedly reminding people to look at their data
- Philipp Schmid for his code for ONNX feature-extractors
- Ben Burtenshaw for the first PR
Each dataset contains a creation_script.py (at my_dataset_name/tree/main/creation_script.py) showing the full config and creation pipeline.
https://huggingface.co/datasets/argilla/multi-modal-vlm-visit-bench/blob/main/creation_script.py
Or explore them in our UI by logging in with your @huggingface account!
https://huggingface.co/spaces/argilla/argilla-template-space
import argilla as rg

# Load an Argilla-compatible dataset straight from the Hugging Face Hub
ds = rg.Dataset.from_hub(
    "argilla/multi-modal-vlm-visit-bench"
)
argilla/argilla-v20-compatible-datasets-66a8e670f351acac61a0421c
Find your pipeline and use:
$ distilabel pipeline run --config "hugging_face_dataset_url/pipeline.yaml"
Some components I used
- Embedded dataset viewer https://huggingface.co/docs/hub/main/en/datasets-viewer-embed
- Hugging Face fsspec https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
- distilabel https://distilabel.argilla.io/latest/
- Gradio leaderboard by Freddy Boulton freddyaboulton/gradio_leaderboard
- Gradio modal by Ali Abid
Space: davidberenstein1957/distilabel-synthetic-data-pipeline-explorer
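The fsspec integration is handy on its own: once huggingface_hub is installed, hf:// paths work anywhere fsspec does (the path below is illustrative):

import pandas as pd

# huggingface_hub registers the hf:// filesystem with fsspec,
# so pandas can read Hub-hosted files directly
df = pd.read_parquet("hf://datasets/<user>/<dataset>/data/train-00000-of-00001.parquet")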
- Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Data: SFT, KTO and DPO data
- Runs on free ZeroGPU in Hugging Face Spaces
- Might need some human curation in Argilla
- Or provide some AI feedback with distilabel
https://huggingface.co/collections/davidberenstein1957/chatinterface-llm-human-feedback-collectors-66a22859c9e703d2af7500c1
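The underlying mechanism in these collectors is Gradio's like/dislike event on the chatbot; here's a minimal sketch of wiring it up (the response function and logging are placeholders, not the collectors' actual code):

import gradio as gr

def respond(message, history):
    return "Placeholder response; plug in your LLM call here."

def on_like(evt: gr.LikeData):
    # A real collector would log this feedback to a dataset
    print(f"liked={evt.liked}, value={evt.value!r}")

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    gr.ChatInterface(fn=respond, chatbot=chatbot)
    chatbot.like(on_like, None, None)

demo.launch()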
Argilla has moved its community from Slack to the Hugging Face Discord server!
When part of the Hugging Face Discord, you can select “Channels & roles” and pick “Argilla” along with any of the other groups that interest you. “Argilla” covers anything about Argilla and distilabel, and it will give you access to 1) #argilla-distilabel-general, for all general discussions and news, and 2) #argilla-distilabel-help, for any usage-focused questions.
Cool demo! 🖼️
We've also got this playlist of NLP/AI topics applied in practice with Argilla: https://www.youtube.com/watch?v=OO235zLZTT4&list=PLBmuFBJ5cjcbsr49KFoC4DQoo3ZWT7q_d
We are launching our 2.0 SDK these days. Feel free to check our docs and install from main: https://argilla-io.github.io/argilla/dev/.
I created this cool notebook for a workshop @davanstrien and I gave a couple of weeks back. It uses https://distilabel.argilla.io/dev/ and I think it is a good entry point for anyone with a practical interest in the topic.
https://colab.research.google.com/github/davanstrien/data-for-fine-tuning-llms/blob/main/03-synthetic-data-generation.ipynb
KTO formats for:
- UltraFeedback Cleaned Binarized
- Distilabel Intel Orca
- Distilabel Capybara
- DPO mix
argilla/preference-datasets-for-kto-65f98314d7c1b04ab54d41a7
Paper claims :)
https://arxiv.org/abs/2402.01306
KTO matches or exceeds DPO performance at scales from 1B to 30B parameters. That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.
KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.
When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.
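To make the "n pairs into 2n examples" point concrete, here's a minimal sketch of unpacking DPO-style preference pairs into KTO-style examples (the column names follow the common prompt/completion/label convention; adjust to your dataset):

def dpo_pairs_to_kto(pairs):
    """Turn each (prompt, chosen, rejected) pair into two KTO examples."""
    kto_rows = []
    for pair in pairs:
        kto_rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        kto_rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return kto_rows

pairs = [{"prompt": "What is KTO?", "chosen": "A human-aware loss ...", "rejected": "No idea."}]
print(dpo_pairs_to_kto(pairs))  # one DPO pair becomes two KTO examples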
Do you need something custom? Take a look at @davanstrien's guide on creating your own KTO dataset with Argilla and our community.
https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
On which part @Ali-C137?
ref_model
because we can just swap out the LoRA adapters during training. Cool feature 🤗 https://colab.research.google.com/drive/1PGMj7jlkJaCiSNNihA2NtpILsRgkRXrJ#scrollTo=wXqoH2TMnjjp
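For context, the trick is that with a PEFT model you can pass ref_model=None and the trainer computes the reference log-probs by temporarily disabling the LoRA adapters on the same base weights; here's a hedged sketch with TRL (argument names vary a bit across TRL versions, and the model name is illustrative):

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "<base-model>"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = Dataset.from_dict({
    "prompt": ["What is DPO?"],
    "chosen": ["Direct Preference Optimization is ..."],
    "rejected": ["I don't know."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # no second model copy: reference log-probs come from
                     # the same weights with the LoRA adapters disabled
    args=DPOConfig(output_dir="dpo-out"),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called tokenizer= in older TRL versions
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()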