opengvlab-admin committed on
Commit 3ec3d5c
1 Parent(s): f2f20b3

Update README.md

Files changed (1)
  1. README.md +7 -4
README.md CHANGED
@@ -21,16 +21,19 @@ pipeline_tag: visual-question-answering
 
 You can run multimodal large models using a 1080Ti now.
 
-We are delighted to introduce Mini-InternVL-Chat-2B-V1-5. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M and used InternLM2-Chat-1.8B as our language model. This resulted in a small multimodal model with excellent performance.
+We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
-As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
+As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
+
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/SmtrR0c5NVhmSTvnwwCE9.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/rDyoe66Sqev44T0wsP5Z7.png)
+
+
 
 ## Model Details
 - **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
-- Architecture: InternViT-300M-448px + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
+- Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
 - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
 - Params: 2.2B
 
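
For readers who want to try the card's claim that the model runs on a single 1080 Ti, here is a minimal loading sketch. It assumes the checkpoint is published at `OpenGVLab/Mini-InternVL-Chat-2B-V1-5` and ships remote code compatible with `AutoModel`/`AutoTokenizer`, as other InternVL releases do; the repository ID and dtype choice are assumptions, not part of this commit.

```python
# Minimal loading sketch (assumed repo ID and settings, not part of this commit).
# float16 keeps the ~2.2B-parameter model comfortably within an 11 GB 1080 Ti.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"  # assumed Hub repository ID
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # InternVL modeling code is shipped with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```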
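The "dynamic resolution, up to 40 tiles of 448 x 448" stat above means a high-resolution input is cut into ViT-sized tiles before encoding. The snippet below is only an illustrative sketch of that idea using a simple grid heuristic with PIL; it is not the model's actual preprocessing, and the function name and example file are hypothetical.

```python
# Illustrative sketch of dynamic-resolution tiling: split an image into at most
# `max_tiles` square tiles of `tile_size` px. A simplified stand-in for the real
# preprocessing, shown only to make the "40 tiles of 448 x 448" stat concrete.
from PIL import Image


def tile_image(img: Image.Image, tile_size: int = 448, max_tiles: int = 40):
    w, h = img.size
    # Choose a cols x rows grid that roughly preserves the aspect ratio.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    # Shrink the grid until it fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = img.resize((cols * tile_size, rows * tile_size))
    return [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]


tiles = tile_image(Image.open("example.jpg"))  # hypothetical input file
print(f"{len(tiles)} tiles of size {tiles[0].size}")
```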