opengvlab-admin committed on
Commit 3ec3d5c
1 Parent(s): f2f20b3

Update README.md

Files changed (1)
  1. README.md +7 -4
README.md CHANGED
@@ -21,16 +21,19 @@ pipeline_tag: visual-question-answering
 
 You can run multimodal large models using a 1080Ti now.
 
-We are delighted to introduce Mini-InternVL-Chat-2B-V1-5. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M and used InternLM2-Chat-1.8B as our language model. This resulted in a small multimodal model with excellent performance.
+We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
-As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
+As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
+
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/SmtrR0c5NVhmSTvnwwCE9.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/rDyoe66Sqev44T0wsP5Z7.png)
+
+
 
 ## Model Details
 - **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
-- Architecture: InternViT-300M-448px + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
+- Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
 - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
 - Params: 2.2B
 
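
For readers who want to try the card's claim that the model runs on a single 1080 Ti, here is a minimal loading sketch. It assumes the checkpoint is published at `OpenGVLab/Mini-InternVL-Chat-2B-V1-5` and ships remote code compatible with `AutoModel`/`AutoTokenizer`, as other InternVL releases do; the repository ID and dtype choice are assumptions, not part of this commit.

```python
# Minimal loading sketch (assumed repo ID and settings, not part of this commit).
# float16 keeps the ~2.2B-parameter model comfortably within an 11 GB 1080 Ti.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"  # assumed Hub repository ID
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # InternVL modeling code is shipped with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```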
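The "dynamic resolution, up to 40 tiles of 448 x 448" stat above means a high-resolution input is cut into ViT-sized tiles before encoding. The snippet below is only an illustrative sketch of that idea using a simple grid heuristic with PIL; it is not the model's actual preprocessing, and the function name and example file are hypothetical.

```python
# Illustrative sketch of dynamic-resolution tiling: split an image into at most
# `max_tiles` square tiles of `tile_size` px. A simplified stand-in for the real
# preprocessing, shown only to make the "40 tiles of 448 x 448" stat concrete.
from PIL import Image


def tile_image(img: Image.Image, tile_size: int = 448, max_tiles: int = 40):
    w, h = img.size
    # Choose a cols x rows grid that roughly preserves the aspect ratio.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    # Shrink the grid until it fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = img.resize((cols * tile_size, rows * tile_size))
    return [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]


tiles = tile_image(Image.open("example.jpg"))  # hypothetical input file
print(f"{len(tiles)} tiles of size {tiles[0].size}")
```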