# Model Card for DiVA Llama 3

This is an end-to-end Voice Assistant Model which can handle speech and text as inputs. It is trained using distillation loss. More details will be in a paper [COMING SOON]!

See the model in action compared to SALMONN and Qwen-Audio at [diva-audio.github.io](https://diva-audio.github.io).

## Citation

There is no publication yet. If you use this model, please cite the following.

**BibTeX:**

```
@misc{held2024diva,
  author="Held, Will and Zhang, Yanzhe and Ryan, Michael and Shi, Weiyan and Li, Ella and Yang, Diyi",
  title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
  year="2024",
  publisher="HuggingFace",
}
```

## Table of Contents

- [Model Card for DiVA Llama 3](#model-card-for-diva-llama-3)
- [Citation](#citation)
- [Table of Contents](#table-of-contents)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications [optional]](#technical-specifications-optional)
- [Model Architecture and Objective](#model-architecture-and-objective)
- [Compute Infrastructure](#compute-infrastructure)
- [Hardware](#hardware)
- [Software](#software)
- [Model Card Contact](#model-card-contact)

## Training Details

### Training Data

This model was trained on the [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) corpus.

### Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 recordings. After a linear warmup of 70 steps, the learning rate decayed linearly from 5e-5 to zero.

### Environmental Impact

- **Hardware Type:** V4-32 TPU
- **Hours used:** 8 hours
- **Cloud Provider:** Google Cloud
- **Compute Region:** US Central C

### Hardware

This model was trained on a V4 TPU on Google Cloud.

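The learning-rate schedule described under Training Procedure (a 70-step linear warmup, then a linear decay from 5e-5 to zero over 7k gradient steps) can be sketched in plain Python. This is a minimal illustration and not the configuration actually used for training. It assumes that warmup starts from zero, that "7k" means exactly 7,000 steps, and that the rate reaches zero on the final step. The names `learning_rate`, `PEAK_LR`, `WARMUP_STEPS`, and `TOTAL_STEPS` are hypothetical.

```python
# Values taken from the Training Procedure section.
PEAK_LR = 5e-5        # learning rate at the end of warmup
WARMUP_STEPS = 70     # length of the linear warmup
TOTAL_STEPS = 7_000   # "7k gradient steps", assumed to mean exactly 7,000


def learning_rate(step: int) -> float:
    """Return the learning rate at a given gradient step.

    Assumption: warmup rises linearly from 0 to PEAK_LR, after which the
    rate falls linearly and reaches 0 at TOTAL_STEPS.
    """
    if step < 0 or step > TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}], got {step}")
    if step < WARMUP_STEPS:
        # Linear warmup phase.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay phase.
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, 35, 70, 3_535, 7_000):
        print(f"step {step:>5}: lr = {learning_rate(step):.3e}")
```

Under these assumptions the rate peaks at step 70 and falls to half of its peak at step 3,535, the midpoint of the decay phase.
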
### Software

This model was trained with [Levanter](https://github.com/stanford-crfm/levanter).

## Model Card Authors [optional]

Will Held

## Model Card Contact

held@stanford.edu