root committed on
Commit
18f9df3
•
1 Parent(s): 424a94c

new README.md

Files changed (1)
  1. README.md +9 -249
README.md CHANGED
@@ -1,249 +1,9 @@
- <p align="center" width="100%">
- <a target="_blank"><img src="figs/video_llama_logo.jpg" alt="Video-LLaMA" style="width: 50%; min-width: 200px; display: block; margin: auto;"></a>
- </p>
-
-
-
- # Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- <!-- **Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding** -->
-
- This is the repo for the Video-LLaMA project, which aims to empower large language models with video and audio understanding capabilities.
-
- <div style='display:flex; gap: 0.25rem; '>
- <a href='https://modelscope.cn/studios/damo/video-llama/summary'><img src='https://img.shields.io/badge/ModelScope-Demo-blueviolet'></a>
- <a href='https://www.modelscope.cn/models/damo/videollama_7b_llama2_finetuned/summary'><img src='https://img.shields.io/badge/ModelScope-Checkpoint-blueviolet'></a>
- <a href='https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>
- <a href='https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-blue'></a>
- <a href='https://arxiv.org/abs/2306.02858'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>
- </div>
-
- ## News
- - [08.03] 🚀🚀 Release **Video-LLaMA-2** with [Llama-2-7B/13B-Chat](https://huggingface.co/meta-llama) as the language decoder
-   - **NO** more delta weights or separate Q-Former weights; the full weights needed to run Video-LLaMA are all here :point_right: [[7B](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned)][[13B](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Finetuned)]
-   - Further customization is possible starting from our pre-trained checkpoints [[7B-Pretrained](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Pretrained)] [[13B-Pretrained](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Pretrained)]
- - [06.14] **NOTE**: The current online interactive demo is primarily for English chat; it may **NOT** be a good choice for Chinese questions, since Vicuna/LLaMA does not represent Chinese text very well.
- - [06.13] **NOTE**: Audio support is currently available **ONLY** for Vicuna-7B, although we have several VL checkpoints available for other decoders.
- - [06.10] **NOTE**: We have NOT updated the HF demo yet because the whole framework (with the audio branch) cannot run normally on an A10-24G GPU. The currently running demo is still the previous version of Video-LLaMA. We will fix this issue soon.
- - [06.08] 🚀🚀 Release the checkpoints of the audio-supported Video-LLaMA. Documentation and example outputs are also updated.
- - [05.22] 🚀🚀 The interactive demo is online: try our Video-LLaMA (with **Vicuna-7B** as the language decoder) on [Hugging Face](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) and [ModelScope](https://pre.modelscope.cn/studios/damo/video-llama/summary)!
- - [05.22] ⭐️ Release **Video-LLaMA v2** built with Vicuna-7B
- - [05.18] 🚀🚀 Support video-grounded chat in Chinese
-   - [**Video-LLaMA-BiLLA**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth): we introduce [BiLLa-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT) as the language decoder and fine-tune the video-language-aligned model (i.e., the stage-1 model) with machine-translated [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data) instructions.
-   - [**Video-LLaMA-Ziya**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth): same as Video-LLaMA-BiLLA, but with the language decoder changed to [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1).
- - [05.18] ⭐️ Create a Hugging Face [repo](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series) to store the model weights of all the variants of our Video-LLaMA.
- - [05.15] ⭐️ Release [**Video-LLaMA v2**](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth): we use the training data provided by [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data) to further enhance the instruction-following capability of Video-LLaMA.
- - [05.07] Release the initial version of **Video-LLaMA**, including its pre-trained and instruction-tuned checkpoints.
-
- <p align="center" width="100%">
- <a target="_blank"><img src="figs/architecture_v2.png" alt="Video-LLaMA" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a>
- </p>
-
- ## Introduction
-
-
- - Video-LLaMA is built on top of [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) and [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4). It is composed of two core components: (1) a Vision-Language (VL) Branch and (2) an Audio-Language (AL) Branch.
- - **VL Branch** (Visual encoder: ViT-G/14 + BLIP-2 Q-Former)
-   - A two-layer video Q-Former and a frame embedding layer (applied to the embeddings of each frame) are introduced to compute video representations (a rough sketch of this computation follows the list below).
-   - We train the VL Branch on the Webvid-2M video caption dataset with a video-to-text generation task. We also add image-text pairs (~595K image captions from [LLaVA](https://github.com/haotian-liu/LLaVA)) to the pre-training dataset to enhance the understanding of static visual concepts.
-   - After pre-training, we further fine-tune the VL Branch using the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything).
- - **AL Branch** (Audio encoder: ImageBind-Huge)
-   - A two-layer audio Q-Former and an audio segment embedding layer (applied to the embedding of each audio segment) are introduced to compute audio representations.
-   - As the audio encoder (i.e., ImageBind) is already aligned across multiple modalities, we train the AL Branch on video/image instruction data only, just to connect the output of ImageBind to the language decoder.
- - Only the Video/Audio Q-Former, positional embedding layers, and linear layers are trainable during cross-modal training.
-
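For illustration only, here is a minimal, hypothetical sketch of the VL-branch computation described above. The module names, dimensions, and the generic Transformer encoder standing in for the two-layer video Q-Former are assumptions for clarity, not the repository's actual code:

```python
# Illustrative sketch only -- names, shapes, and the encoder stand-in are assumptions,
# not the actual Video-LLaMA implementation.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    """Turns per-frame query embeddings into soft video prompts for the frozen LLM."""

    def __init__(self, num_frames=8, vis_dim=768, llm_dim=4096, nhead=8):
        super().__init__()
        # Trainable frame position embedding (one vector per sampled frame).
        self.frame_pos = nn.Embedding(num_frames, vis_dim)
        # Stand-in for the two-layer video Q-Former.
        layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=nhead, batch_first=True)
        self.video_qformer = nn.TransformerEncoder(layer, num_layers=2)
        # Trainable linear projection into the LLM's embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_queries: torch.Tensor) -> torch.Tensor:
        # frame_queries: (batch, num_frames, queries_per_frame, vis_dim),
        # produced per frame by the frozen ViT-G/14 + BLIP-2 Q-Former.
        b, t, q, d = frame_queries.shape
        pos = self.frame_pos(torch.arange(t, device=frame_queries.device))
        x = frame_queries + pos[None, :, None, :]   # inject frame order
        x = x.reshape(b, t * q, d)                  # merge time and query axes
        x = self.video_qformer(x)                   # temporal fusion across frames
        return self.proj(x)                         # video tokens fed to the LLM
```

Under this sketch, the frozen visual encoder supplies `frame_queries`, and only the modules defined here (position embedding, video Q-Former, projection) would receive gradients, matching the trainable-parameter list above.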
-
-
- ## Example Outputs
-
-
- - **Video with background sound**
-
- <p float="left">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/7f7bddb2-5cf1-4cf4-bce3-3fa67974cbb3" style="width: 45%; margin: auto;">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/ec76be04-4aa9-4dde-bff2-0a232b8315e0" style="width: 45%; margin: auto;">
- </p>
-
-
- - **Video without sound effects**
- <p float="left">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/539ea3cc-360d-4b2c-bf86-5505096df2f7" style="width: 45%; margin: auto;">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/7304ad6f-1009-46f1-aca4-7f861b636363" style="width: 45%; margin: auto;">
- </p>
-
- - **Static image**
- <p float="left">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/a146c169-8693-4627-96e6-f885ca22791f" style="width: 45%; margin: auto;">
- <img src="https://github.com/DAMO-NLP-SG/Video-LLaMA/assets/18526640/66fc112d-e47e-4b66-b9bc-407f8d418b17" style="width: 45%; margin: auto;">
- </p>
-
-
-
- ## Pre-trained & Fine-tuned Checkpoints
-
- The following checkpoints store the learnable parameters (positional embedding layers, Video/Audio Q-Former, and linear projection layers) only; a quick way to inspect one of them is sketched after the tables below.
-
- #### Vision-Language Branch
- | Checkpoint | Link | Note |
- |:------------|-------------|-------------|
- | pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b-v2.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
- | finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
- | pretrain-vicuna13b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-vicuna13b.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
- | finetune-vicuna13b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
- | pretrain-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-ziya13b-zh.pth) | Pre-trained with Chinese LLM [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
- | finetune-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
- | pretrain-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-billa7b-zh.pth) | Pre-trained with Chinese LLM [BiLLA-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT) |
- | finetune-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
-
- #### Audio-Language Branch
- | Checkpoint | Link | Note |
- |:------------|-------------|-------------|
- | pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b_audiobranch.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
- | finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
-
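Because only these modules are stored, each `.pth` file is small and easy to inspect. The snippet below is a purely illustrative sanity check; the top-level layout of the file (e.g., whether the state dict sits under a `model` key) is an assumption:

```python
# Illustrative check: list what a downloaded Video-LLaMA checkpoint contains.
# The top-level structure of the .pth file is an assumption -- adjust the key
# access if the file is organized differently.
import torch

ckpt = torch.load("finetune-vicuna7b-v2.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name in list(state.keys())[:20]:
    print(name)  # expect Q-Former / position-embedding / projection parameters
```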
-
- ## Usage
- #### Environment Preparation
-
- First, install ffmpeg:
- ```
- apt update
- apt install ffmpeg
- ```
- Then, create a conda environment:
- ```
- conda env create -f environment.yml
- conda activate videollama
- ```
-
-
- ## Prerequisites
-
- Before using the repository, make sure you have obtained the following checkpoints:
-
- #### Pre-trained Language Decoder
-
- - Get the original LLaMA weights in the Hugging Face format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
- - Download the Vicuna delta weights :point_right: [[7B](https://huggingface.co/lmsys/vicuna-7b-delta-v0)][[13B](https://huggingface.co/lmsys/vicuna-13b-delta-v0)] (Note: we use the **v0 weights** instead of the v1.1 weights).
- - Use the following command to add the delta weights to the original LLaMA weights to obtain the Vicuna weights:
-
- ```
- python apply_delta.py \
-     --base /path/to/llama-13b \
-     --target /output/path/to/vicuna-13b --delta /path/to/vicuna-13b-delta
- ```
-
- #### Pre-trained Visual Encoder in Vision-Language Branch
- - Download the MiniGPT-4 model (trained linear layer) from this [link](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view).
-
- #### Pre-trained Audio Encoder in Audio-Language Branch
- - Download the ImageBind weights from this [link](https://github.com/facebookresearch/ImageBind).
-
- ## Download Learnable Weights
- Use `git-lfs` to download the learnable weights of our Video-LLaMA (i.e., the positional embedding layer + Q-Former + linear projection layer):
- ```bash
- git lfs install
- git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series
- ```
- The above commands download the model weights of all the Video-LLaMA variants. You can also download only the weights you need. For example, to run Video-LLaMA locally with Vicuna-7B as the language decoder, the following two files are sufficient:
- ```bash
- wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth
- wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth
- ```
-
-
- ## How to Run the Demo Locally
-
- First, set `llama_model`, `imagebind_ckpt_path`, `ckpt`, and `ckpt_2` in [eval_configs/video_llama_eval_withaudio.yaml](./eval_configs/video_llama_eval_withaudio.yaml).
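The exact contents and nesting of that config file may differ from this sketch, but the four entries to edit look roughly like the following (all paths are placeholders):

```yaml
# Illustrative placeholders only -- point these at your local files.
llama_model: "/path/to/llama-2-7b-chat-hf"               # or a Vicuna checkpoint directory
imagebind_ckpt_path: "/path/to/imagebind_huge.pth"
ckpt: "/path/to/finetune-vicuna7b-v2.pth"                # VL-branch weights
ckpt_2: "/path/to/finetune_vicuna7b_audiobranch.pth"     # AL-branch weights
```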
157
- Then run the script:
158
- ```
159
- python demo_audiovideo.py \
160
- --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
161
- --model_type llama_v2 \ # or vicuna
162
- --gpu-id 0
163
- ```
164
-
165
- ## Training
166
-
167
- The training of each cross-modal branch (i.e., VL branch or AL branch) in Video-LLaMA consists of two stages,
168
-
169
- 1. Pre-training on the [Webvid-2.5M](https://github.com/m-bain/webvid) video caption dataset and [LLaVA-CC3M]((https://github.com/haotian-liu/LLaVA)) image caption dataset.
170
-
171
- 2. Fine-tuning using the image-based instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)/[LLaVA](https://github.com/haotian-liu/LLaVA) and the video-based instruction-tuning data from [VideoChat](https://github.com/OpenGVLab/Ask-Anything).
172
-
173
- ### 1. Pre-training
174
- #### Data Preparation
175
- Download the metadata and video following the instructions from the official Github repo of [Webvid](https://github.com/m-bain/webvid).
176
- The folder structure of the dataset is shown below:
177
- ```
178
- |webvid_train_data
179
- |──filter_annotation
180
- |────0.tsv
181
- |──videos
182
- |────000001_000050
183
- |──────1066674784.mp4
184
- ```
185
- ```
186
- |cc3m
187
- |──filter_cap.json
188
- |──image
189
- |────GCC_train_000000000.jpg
190
- |────...
191
- ```
192
- #### Script
193
- Config the the checkpoint and dataset paths in [video_llama_stage1_pretrain.yaml](./train_configs/video_llama_stage1_pretrain.yaml).
194
- Run the script:
195
- ```
196
- conda activate videollama
197
- torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage1_pretrain.yaml
198
- ```
199
-
200
- ### 2. Instruction Fine-tuning
201
- #### Data
202
- For now, the fine-tuning dataset consists of:
203
- * 150K image-based instructions from LLaVA [[link](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/raw/main/llava_instruct_150k.json)]
204
- * 3K image-based instructions from MiniGPT-4 [[link](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_2_STAGE.md)]
205
- * 11K video-based instructions from VideoChat [[link](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data)]
206
-
207
- #### Script
208
- Config the checkpoint and dataset paths in [video_llama_stage2_finetune.yaml](./train_configs/video_llama_stage2_finetune.yaml).
209
- ```
210
- conda activate videollama
211
- torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage2_finetune.yaml
212
- ```
213
-
214
- ## Recommended GPUs
215
- * Pre-training: 8xA100 (80G)
216
- * Instruction-tuning: 8xA100 (80G)
217
- * Inference: 1xA100 (40G/80G) or 1xA6000
218
-
219
- ## Acknowledgement
220
- We are grateful for the following awesome projects our Video-LLaMA arising from:
221
- * [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4): Enhancing Vision-language Understanding with Advanced Large Language Models
222
- * [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
223
- * [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
224
- * [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
225
- * [ImageBind](https://github.com/facebookresearch/ImageBind): One Embedding Space To Bind Them All
226
- * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
227
- * [VideoChat](https://github.com/OpenGVLab/Ask-Anything): Chat-Centric Video Understanding
228
- * [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant
229
- * [WebVid](https://github.com/m-bain/webvid): A Large-scale Video-Text dataset
230
- * [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl/tree/main): Modularization Empowers Large Language Models with Multimodality
231
-
232
- The logo of Video-LLaMA is generated by [Midjourney](https://www.midjourney.com/).
233
-
234
-
235
- ## Term of Use
236
- Our Video-LLaMA is just a research preview intended for non-commercial use only. You must **NOT** use our Video-LLaMA for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.
237
-
238
- ## Citation
239
- If you find our project useful, hope you can star our repo and cite our paper as follows:
240
- ```
241
- @article{damonlpsg2023videollama,
242
- author = {Zhang, Hang and Li, Xin and Bing, Lidong},
243
- title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
244
- year = 2023,
245
- journal = {arXiv preprint arXiv:2306.02858},
246
- url = {https://arxiv.org/abs/2306.02858}
247
- }
248
- ```
249
-
 
+ title: Video LLaMA2 IBM
+ emoji: 🚀
+ colorFrom: purple
+ colorTo: gray
+ sdk: gradio
+ sdk_version: 3.29.0
+ app_file: app.py
+ pinned: false
+ license: other
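The new README is a Hugging Face Space metadata header rather than project documentation: it declares a Gradio Space (SDK 3.29.0) whose entry point is `app.py`. This commit does not include `app.py` itself, so purely for illustration, a minimal Gradio entry point compatible with such a header might look like the sketch below; the callback is a placeholder, not the Space's actual inference code:

```python
# Hypothetical minimal app.py for a Gradio Space -- not part of this commit.
# A real Space would load the Video-LLaMA checkpoints and run inference here.
import gradio as gr

def answer(video_path: str, question: str) -> str:
    # Placeholder: replace with Video-LLaMA inference over the uploaded video.
    return f"(demo stub) video={video_path!r}, question={question!r}"

with gr.Blocks(title="Video LLaMA2 IBM") as demo:
    video = gr.Video(label="Input video")
    question = gr.Textbox(label="Question about the video")
    output = gr.Textbox(label="Model response")
    gr.Button("Ask").click(answer, inputs=[video, question], outputs=output)

demo.launch()
```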