czczup committed
Commit 0b5b24e
1 Parent(s): eb1380e

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,55 +1,57 @@
 ---
 license: mit
 datasets:
-- laion/laion2B-en
-- laion/laion-coco
-- laion/laion2B-multi
-- kakaobrain/coyo-700m
-- conceptual_captions
-- wanng/wukong100m
+- laion/laion2B-en
+- laion/laion-coco
+- laion/laion2B-multi
+- kakaobrain/coyo-700m
+- conceptual_captions
+- wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
 
 # Model Card for Mini-InternVL-Chat-2B-V1-5
+
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
 
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 
 You can run multimodal large models using a 1080Ti now.
 
 We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
-
 As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
 
-
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/rDyoe66Sqev44T0wsP5Z7.png)
 
-
 ## Model Details
+
 - **Model Type:** multimodal large language model (MLLM)
+
 - **Model Stats:**
+
   - Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 2.2B
 
 - **Training Strategy:**
+
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
-  - For more details on training hyperparameters, take a look at our code: [pretrain]() | [finetune]()
-
+  - For more details on training hyperparameters, take a look at our code: [pretrain](<>) | [finetune](<>)
+
 ## Released Models
 
-| Model | Vision Foundation Model | Release Date |Note |
-| :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
-| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) |2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)|
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.21 | more SFT data and stronger |
-| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.11 | scaling up LLM to 34B |
-| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) |InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) |2024.01.24 | support Chinese and stronger OCR |
+| Model | Vision Foundation Model | Release Date | Note |
+| :----------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------: | :----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
+| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
+| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
+| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
 
 ## Performance
 
@@ -59,7 +61,7 @@ As shown in the figure below, we adopted the same model architecture as InternVL
 
 We provide an example code to run Mini-InternVL-Chat-2B-V1.5 using `transformers`.
 
-You can also use our [online demo](https://internvl.opengvlab.com/) to get a quick experience of this model.
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
@@ -150,7 +152,6 @@ def load_image(image_file, input_size=448, max_num=6):
     pixel_values = torch.stack(pixel_values)
     return pixel_values
 
-
 path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
 model = AutoModel.from_pretrained(
     path,
@@ -222,12 +223,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 
 ## License
 
-This project is released under the MIT license.
+This project is released under the MIT license.
 
 ## Acknowledgement
 
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
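
The usage section touched by the hunks above pins `transformers==4.37.2` and loads the checkpoint with `AutoModel.from_pretrained`. A minimal loading sketch, assuming the repository's remote code and the `load_image()` / `model.chat()` interface shown in the card's full example (argument names here are illustrative, not the card's exact script):

```python
# Minimal sketch: load Mini-InternVL-Chat-2B-V1-5 via the remote code shipped in this repo.
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/Mini-InternVL-Chat-2B-V1-5'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # matches the "torch_dtype": "bfloat16" now set in config.json
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # needed because the architecture lives in modeling_internvl_chat.py
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# pixel_values would come from the card's load_image() helper (448x448 tiles, bfloat16, on GPU):
# pixel_values = load_image('./example.jpg', max_num=6).to(torch.bfloat16).cuda()
# generation_config = dict(max_new_tokens=512, do_sample=False)
# response = model.chat(tokenizer, pixel_values, '<image>\nDescribe the image.', generation_config)
```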
config.json CHANGED
@@ -6,13 +6,14 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
   "force_image_size": 448,
   "llm_config": {
-    "_name_or_path": "./pretrained/internlm2-chat-1_8b",
+    "_name_or_path": "pretrained/internlm2-chat-1_8b",
     "add_cross_attention": false,
     "architectures": [
       "InternLM2ForCausalLM"
@@ -113,12 +114,16 @@
   "use_llm_lora": 0,
   "use_thumbnail": true,
   "vision_config": {
-    "_name_or_path": "",
+    "_name_or_path": "OpenGVLab/InternViT-300M-448px",
     "add_cross_attention": false,
     "architectures": [
       "InternVisionModel"
     ],
     "attention_dropout": 0.0,
+    "auto_map": {
+      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
+      "AutoModel": "modeling_intern_vit.InternVisionModel"
+    },
     "bad_words_ids": null,
     "begin_suppress_tokens": null,
     "bos_token_id": null,
@@ -189,11 +194,11 @@
     "tokenizer_class": null,
     "top_k": 50,
     "top_p": 1.0,
-    "torch_dtype": "float32",
+    "torch_dtype": "bfloat16",
     "torchscript": false,
     "transformers_version": "4.36.2",
     "typical_p": 1.0,
-    "use_bfloat16": false,
+    "use_bfloat16": true,
     "use_flash_attn": true
   }
 }
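
Beyond whitespace, the config changes above add `auto_map` entries (so `AutoModel`, `AutoModelForCausalLM`, and the nested vision config resolve to this repo's custom classes) and flip the vision tower to bfloat16. A quick, hypothetical check, assuming the composite config exposes `llm_config` and `vision_config` sub-configs as in other InternVL releases:

```python
# Load only the configuration (cheap; no weights) and inspect the fields changed above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    'OpenGVLab/Mini-InternVL-Chat-2B-V1-5',
    trust_remote_code=True,  # resolved through the "AutoConfig" entry in auto_map
)
print(type(config).__name__)               # expected: InternVLChatConfig
print(config.llm_config.architectures)     # expected: ['InternLM2ForCausalLM']
print(config.vision_config.torch_dtype)    # expected: torch.bfloat16 (was float32 before this commit)
```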
conversation.py CHANGED
@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )
-
modeling_intern_vit.py CHANGED
@@ -26,9 +26,9 @@ try:
     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
+
     from flash_attn.bert_padding import pad_input, unpad_input
-
+
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
-
+
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
-
+
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
-
+
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
-
+
         return output, None
 
 
@@ -160,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False).\
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
 
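
The final hunk only reformats a line continuation in `_get_pos_embed`, but that line is what enables dynamic resolution: the learned position embedding is bicubically resized to the current patch grid. A standalone sketch of the same operation, with illustrative shapes and a hypothetical helper name:

```python
# Sketch of the position-embedding interpolation performed by _get_pos_embed.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, grid: int, H: int, W: int) -> torch.Tensor:
    """Resize a (1, grid*grid, C) position embedding to an H x W patch grid."""
    target_dtype = pos_embed.dtype
    pos_embed = pos_embed.float().reshape(1, grid, grid, -1).permute(0, 3, 1, 2)   # (1, C, grid, grid)
    pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False)
    return pos_embed.reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)       # (1, H*W, C)

# A 32x32 grid (1024 tokens, 1024 channels) resized to a 16x48 grid: (1, 768, 1024).
print(interpolate_pos_embed(torch.randn(1, 32 * 32, 1024), grid=32, H=16, W=48).shape)
```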
modeling_internlm2.py CHANGED
@@ -48,16 +48,13 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
-
 try:
     from flash_attn import flash_attn_func as _flash_attn_func
-    from flash_attn import \
-        flash_attn_varlen_func as _flash_attn_varlen_func
-    from flash_attn.bert_padding import \
-        index_first_axis as _index_first_axis
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
     from flash_attn.bert_padding import pad_input as _pad_input
     from flash_attn.bert_padding import unpad_input as _unpad_input
-
+
     flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
     pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
     has_flash_attn = True
@@ -164,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -193,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -223,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
 
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -810,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
 
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
 
@@ -870,7 +870,7 @@ class InternLM2Model(InternLM2PreTrainedModel):
 
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
-        if self.config.attn_implementation == 'flash_attention_2' and has_flash_attn:
+        if self.config.attn_implementation == 'flash_attention_2':
            _import_flash_attn()
 
         # retrieve input_ids and inputs_embeds
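
Two changes above are functional rather than cosmetic: the rotary-embedding caches now create positions with `torch.arange(...)` and cast to `inv_freq`'s dtype afterwards, and the model falls back to eager attention at construction time when flash-attn is missing (which is why the extra `and has_flash_attn` guard in `forward` could be dropped). A self-contained sketch of the cos/sin cache built the new way; `build_rope_cache` is a hypothetical stand-in for `_set_cos_sin_cache`, which registers the results as buffers instead of returning them:

```python
# Sketch: LLaMA-style rotary cos/sin cache, with positions cast after creation as in the patched code.
import torch

def build_rope_cache(dim: int, seq_len: int, base: float = 10000.0, dtype=torch.bfloat16):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    t = torch.arange(seq_len).to(dtype=inv_freq.dtype)                   # positions, cast after arange
    freqs = torch.einsum('i,j->ij', t, inv_freq)                         # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                              # (seq_len, dim)
    return emb.cos().to(dtype), emb.sin().to(dtype)

cos, sin = build_rope_cache(dim=128, seq_len=8192)   # 8K context, as used for this model's training
print(cos.shape, sin.shape)                          # torch.Size([8192, 128]) twice
```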
modeling_internvl_chat.py CHANGED
@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                    return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                    IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print("Now multi-turn chat is not supported in batch_chat.")
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,9 +241,9 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
-
+
         from .conversation import get_conv_template
-
+
         queries = []
         image_bs = pixel_values.shape[0]
         # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
-
+
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
preprocessor_config.json CHANGED
@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}
+}