---
datasets:
- DeSTA-ntu/DeSTA2-Llama3-8B-Instruct
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
- openai/whisper-small
---

## DeSTA2

[📑 Paper](https://arxiv.org/pdf/2409.20007) | [🌐 Website](https://kehanlu.github.io/DeSTA2/) | [👩‍💻 GitHub](https://github.com/kehanlu/DeSTA2) | [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct)


## Quickstart

```python

from transformers import AutoModel

# Your Hugging Face token, needed to download Llama 3 from the official Meta repo
HF_TOKEN = "hf_..."

model = AutoModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", trust_remote_code=True, token=HF_TOKEN)

messages = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "audio", "content": "<path_to_audio_file>"},
    {"role": "user", "content": "Describe the audio."},
]

generated_ids = model.chat(
    messages, 
    max_new_tokens=128, 
    do_sample=True, 
    temperature=0.6, 
    top_p=0.9
)

response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
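
The `messages` list above follows a fixed three-role pattern (`system`, `audio`, `user`). When querying many audio files, it may be convenient to build that list with a small helper that also checks the audio path before the model is called. The sketch below is an illustration only: `build_messages` and its `check_exists` parameter are hypothetical names and are not part of the DeSTA2 code.

```python
from pathlib import Path


def build_messages(
    audio_path,
    instruction,
    system_prompt="You are a helpful voice assistant.",
    check_exists=True,
):
    """Hypothetical helper (not part of DeSTA2): build the message list
    in the system / audio / user format used in the Quickstart."""
    path = Path(audio_path)
    # Fail early with a clear error instead of inside the model call.
    if check_exists and not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "audio", "content": str(path)},
        {"role": "user", "content": instruction},
    ]
```

The returned list can then be passed to `model.chat(...)` exactly as in the Quickstart, for example `model.chat(build_messages("<path_to_audio_file>", "Describe the audio."), max_new_tokens=128)`.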


## Citation

If you find our work useful, please consider citing the papers:

```
@article{lu2024developing,
  title={Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2409.20007},
  year={2024}
}

@inproceedings{lu24c_interspeech,
  title     = {DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment},
  author    = {Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4159--4163},
  doi       = {10.21437/Interspeech.2024-457},
  issn      = {2958-1796},
}
```