khang119966 committed on
Commit
87dbe75
1 Parent(s): dfe03a7

Update README.md

Files changed (1)
  1. README.md +282 -190
README.md CHANGED
@@ -1,199 +1,291 @@
1
  ---
2
  library_name: transformers
3
- tags: []
4
  ---
5
 
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
10
 
 
11
 
12
  ## Model Details
13
 
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
-
175
- **BibTeX:**
176
-
177
- [More Information Needed]
178
-
179
- **APA:**
180
-
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
-
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
-
187
- [More Information Needed]
188
-
189
- ## More Information [optional]
190
-
191
- [More Information Needed]
192
-
193
- ## Model Card Authors [optional]
194
-
195
- [More Information Needed]
196
-
197
- ## Model Card Contact
198
-
199
- [More Information Needed]
1
  ---
2
  library_name: transformers
3
+ datasets:
4
+ - 5CD-AI/Viet-OCR-VQA
5
+ - 5CD-AI/Viet-Doc-VQA
6
+ - 5CD-AI/Viet-Doc-VQA-II
7
+ - Vi-VLM/Vista
8
+ - 5CD-AI/Viet-Receipt-VQA
9
+ - 5CD-AI/Viet-Sketches-VQA
10
+ - 5CD-AI/Viet-Geometry-VQA
11
+ - 5CD-AI/Viet-Wiki-Handwriting
12
+ - 5CD-AI/Viet-ComputerScience-VQA
13
+ - 5CD-AI/Viet-Handwriting-gemini-VQA
14
+ - 5CD-AI/Viet-Menu-gemini-VQA
15
+ - 5CD-AI/Viet-Vintext-gemini-VQA
16
+ - 5CD-AI/Viet-OpenViVQA-gemini-VQA
17
+ - 5CD-AI/Viet-Resume-VQA
18
+ - 5CD-AI/Viet-ViTextVQA-gemini-VQA
19
+ language:
20
+ - vi
21
+ - en
22
+ pipeline_tag: visual-question-answering
23
  ---
24
 
25
+ <div align="center">
26
+ <img src="Vintern_logo.png" width="700"/>
27
+ </div>
28
 
29
+ ## Vintern-1B-v2 ❄️ (Viet-InternVL2-1B-v2) [\[🤗 HF Demo\]](https://huggingface.co/spaces/khang119966/Vintern) - The LLaVA 🌋 Challenger
30
 
31
+ We are excited to introduce **Vintern-1B-v2**, a Vietnamese 🇻🇳 multimodal model that combines the language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the visual model [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2] (CVPR 2024). The model excels at tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it is fine-tuned from the Viet-InternVL-1B model on over 3 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general QA. Its small size allows it to be integrated into various on-device applications 📱.
32
 
33
  ## Model Details
34
 
35
+ | Model Name | Vision Part | Language Part |
36
+ | :------------------: | :---------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
37
+ | Vintern-1B-v2 | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |
38
+
39
+
40
+ Vintern-1B-v2 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. Vintern-1B-v2 consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).
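
As a rough back-of-the-envelope check on how image tiles relate to the 4096-token context length, the sketch below counts the visual tokens produced per 448×448 tile. It assumes a ViT patch size of 14 and a pixel-shuffle downsampling ratio of 0.5, as in the upstream InternVL2 configuration; neither value is stated in this card, and `visual_tokens` is a hypothetical helper, so check the model's `config.json` before relying on these numbers.

```python
def visual_tokens(num_tiles, image_size=448, patch_size=14, downsample_ratio=0.5):
    """Estimate the visual tokens passed to the language model for `num_tiles` tiles.

    patch_size and downsample_ratio are assumptions taken from the upstream
    InternVL2 configuration, not values confirmed by this model card.
    """
    patches_per_side = image_size // patch_size                  # 448 // 14 = 32
    tokens_per_side = int(patches_per_side * downsample_ratio)   # 32 * 0.5 = 16
    return num_tiles * tokens_per_side ** 2                      # 16 * 16 = 256 per tile


CONTEXT_LENGTH = 4096  # context length stated in this card

for tiles in (1, 7, 13):
    used = visual_tokens(tiles)
    print(f"{tiles:>2} tiles -> {used} visual tokens, {CONTEXT_LENGTH - used} left for text")
```

Under these assumptions, 13 tiles (12 tiles plus a thumbnail, matching `max_num=12` in the Quickstart) would occupy 3328 of the 4096 tokens, so lowering `max_num` may be worthwhile when long prompts or long answers are expected.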
41
+
42
+ ## Training details 📚
43
+
44
+ The fine-tuning dataset was meticulously sampled in part from the following datasets:
45
+ - [Viet-OCR-VQA](https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA) 📄
46
+ - [Viet-Doc-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA) 📚
47
+ - [Viet-Doc-VQA-II](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA-II) 📑
48
+ - [Vista](https://huggingface.co/datasets/Vi-VLM/Vista) 🌄
49
+ - [Viet-Receipt-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Receipt-VQA) 🧾
50
+ - [Viet-Sketches-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Sketches-VQA) ✏️
51
+ - [Viet-Geometry-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Geometry-VQA) 📐
52
+ - [Viet-Wiki-Handwriting](https://huggingface.co/datasets/5CD-AI/Viet-Wiki-Handwriting) 📝
53
+ - [Viet-ComputerScience-VQA](https://huggingface.co/datasets/5CD-AI/Viet-ComputerScience-VQA) 💻
54
+ - [Viet-Handwriting-gemini-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Handwriting-gemini-VQA) ✍️
55
+ - [Viet-Menu-gemini-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Menu-gemini-VQA) 🍽️
56
+ - [Viet-Vintext-gemini-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Vintext-gemini-VQA) 🏷️
57
+ - [Viet-OpenViVQA-gemini-VQA](https://huggingface.co/datasets/5CD-AI/Viet-OpenViVQA-gemini-VQA) 🔍
58
+ - [Viet-Resume-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Resume-VQA) 📃
59
+ - [Viet-ViTextVQA-gemini-VQA](https://huggingface.co/datasets/5CD-AI/Viet-ViTextVQA-gemini-VQA) 📑
60
+
61
+ ## Benchmarks 📈
62
+
63
+ Since many metrics still need to be tested, we first chose a quick and simple metric to guide the development of our model. Our metric is inspired by LaVy[4]. For the time being, we use GPT-4 to evaluate answer quality on two datasets: OpenViVQA and ViTextVQA. Detailed results can be found at the provided . The judge receives the image, the question, the ground-truth label, and the predicted answer, and returns a score from 0 to 10 for the quality of that answer. The results are shown in the table below.
64
+
65
+ <table border="1" cellspacing="0" cellpadding="5">
66
+ <tr align="center">
67
+ <td rowspan="2"><b>Model</b></td>
68
+ <td colspan="2"><b>GPT-4o Score</b></td>
69
+ </tr>
70
+ <tr align="center">
71
+ <td><b>OpenViVQA-dev</b></td>
72
+ <td><b>ViTextVQA-dev</b></td>
73
+ </tr>
74
+ <tr align="center">
75
+ <td align="left">Vintern-1B</td>
76
+ <td>7.1/10</td>
77
+ <td>7.6/10</td>
78
+ </tr>
79
+ <tr align="center">
80
+ <td align="left">Vintern-1B-v2</td>
81
+ <td>....</td>
82
+ <td>....</td>
83
+ </tr>
84
+ </table>
85
+
86
+ We are still working on more detailed benchmarks.
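
The scoring scheme described above (a judge returns a 0 to 10 score per answer, and scores are reported per dataset) can be sketched as follows. This is only an illustration of the aggregation step: the `x/10` reply format, the regular expression, and the helper names `parse_score` and `average_score` are assumptions, not the authors' evaluation code, and the call to the judge model itself is left out.

```python
import re
from statistics import mean

# Assumed reply format: the judge writes its score as "<number>/10" somewhere in the text.
SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*10")


def parse_score(judge_reply):
    """Return the first 'x/10' score in a judge reply, or None if missing or out of range."""
    match = SCORE_PATTERN.search(judge_reply)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 10.0 else None


def average_score(judge_replies):
    """Average all parseable scores, rounded to one decimal; None if nothing parses."""
    scores = [s for s in map(parse_score, judge_replies) if s is not None]
    return round(mean(scores), 1) if scores else None


# Hypothetical judge replies, for illustration only.
replies = ["Score: 7/10", "8/10 - mostly correct", "unable to judge"]
print(average_score(replies))
```

Replies without a parseable score are skipped rather than counted as zero; whether that is the right policy depends on how the judge is prompted.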
87
+
88
+ ## Examples
89
+
90
+ <div align="center">
91
+ <img src="ex_images/1.jpg" width="500"/>
92
+ </div>
93
+
94
+ ```
95
+ User: <image>
96
+ Mô tả hình ảnh một cách chi tiết.
97
+ Assistant: Bức ảnh là một giấy tờ có tiêu đề "SƠ YẾU LÝ LỊCH" được in trên nền màu xanh dương nhạt. Phía trên cùng bên trái là dòng chữ "GIÁO HỘI PHẬT GIÁO VIỆT NAM", bên phải là dòng chữ "CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM". Dưới tiêu đề là thông tin cá nhân của người được sơ yếu lý lịch, bao gồm:
98
+ - Họ và tên khai sinh: VƯƠNG TẤN VIỆT
99
+ - Họ và tên gia pha: HỒ CHÍ VIỆT
100
+ - Chân danh, Pháp hiệu: CHÂN QUANG
101
+ - Ngày nơi sinh: 09/12/1959 tại Buôn Thuột
102
+ - Giấy CMND số: 273038848... cấp tại CA Vũng Tàu ngày 04/05/2005.
103
+ - Quê quán: Nghệ An
104
+ - Hộ khẩu thường trú: Chùa Phật Quang - Chu Hai - Tân Hải - Tân Hải - Tân Thành - BR. VT
105
+ - Phòng ở: Chùa Phật Quang - Chu Hai - Tân Hải - Tân Thành - BR. VT
106
+ - Trình độ: + Văn hóa: Cử nhân Anh Văn
107
+ - + Ngoại ngữ: + Phát học (chuyên môn): Diễn giảng
108
+ - Ngày và nơi xuất trình: 15/01/1980 tại Thiên Viện Thường Chiểu
109
+ ```
110
+
111
+ <div align="center">
112
+ <img src="ex_images/2.jpg" width="500"/>
113
+ </div>
114
+
115
+ ```
116
+ User: <image>
117
+ Mô tả hình ảnh một cách chi tiết.
118
+ Assistant: Bức ảnh chụp một hóa đơn của cửa hàng Circle K Việt Nam. Hóa đơn có tiêu đề "CIRCLE K VIETNAM" ở đầu trang, bên dưới là thông tin liên lạc của cửa hàng: "113 Trần Đại Nghĩa - Hà Nội", "Date: Oct 5 2019 06:39 PM SFB",
119
+ "Cashier: 5360-Linh Luong Thi Thuy", "Description: 1 DASANI Bottled Water 1.5L+1PT 12.000", "Item(s) (VAT included): 12.000", "CASH", "CHANGE", "Chỉ xuất hóa đơn trong ngày", "Xin cảm ơn quy khách!"
120
+ ```
121
+
122
+ <div align="center">
123
+ <img src="ex_images/3.jpg" width="400"/>
124
+ </div>
125
+
126
+ ```
127
+ User: <image>
128
+ Mô tả hình ảnh một cách chi tiết.
129
+ Assistant: Bức ảnh là một thông báo tuyển dụng công tác viên kinh doanh của FPT Telecom Bình Thuận. Bố cục được chia thành 4 phần chính:
130
+
131
+ 1. **Phần tiêu đề:** Gồm logo FPT Telecom Bình Thuận và dòng chữ "FPT TELECOM BÌNH THUẬN TUYỂN DỤNG CÔNG TÁC VIÊN KINH DOANH".
132
+ 2. **Phần lịch phỏng vấn:** Liệt thời gian và ngày tháng cụ thể cho lịch phỏng vấn (9h00 - 10h00 thứ Sáu hàng tuần).
133
+ 3. **Phần chi tiết công việc:** Liệt kê các nhiệm vụ liên quan đến công việc, bao gồm:
134
+ - Tìm kiếm thông tin, tiếp cận khách hàng tiềm năng.
135
+ - Tư vấn, giải thích cho khách hàng về dịch vụ của FPT Telecom.
136
+ - Thực hiện thủ tục ký kết hợp đồng với khách hàng.
137
+ 4. **Phần yêu cầu công việc:** Liệt kê các yêu cầu cần thiết cho ứng viên, bao gồm:
138
+ - Nam, nữ không giới hạn độ tuổi.
139
+ - Đam mê kinh doanh.
140
+ - Không cần kinh nghiệm.
141
+ - Phù hợp với Sinh viên, người cần kiếm thêm thu nhập, làm online,....
142
+
143
+ 5. **Phần quyền lợi:** Liệt kê các quyền lợi mà ứng viên sẽ nhận được, bao gồm:
144
+ - Thời gian linh động.
145
+ - Thu nhập hấp dẫn.
146
+ - Môi trường làm việc thân thiện, được hướng dẫn tận tình.
147
+ - Cơ hội trở thành nhân viên chính thức của FPT và được hưởng các chế độ, quyền lợi dành cho nhân viên chính thức.
148
+
149
+ 6. **Phần hồ sơ chuẩn bị:** Liệt kê các giấy tờ cần chuẩn bị cho ứng viên, bao gồm:
150
+ - 01 Bản sao hộ khẩu có xác thực của địa phương.
151
+ - 01 Bản sao CMND có xác thực của địa phương.
152
+
153
+ 7. **Phần thông tin:**
154
+ - Địa chỉ VP: 07 Tuyên Quang - Phan Thiết - Bình Thuận.
155
+ - Liên hệ: Ms.Lộc - 0908 391 305.
156
+
157
+ 8. **Phần lưu ý:** Nêu rõ FPT Telecom không thu bất kỳ khoản phí nào từ ứng viên.
158
+ ```
159
+
160
+ ## Quickstart
161
+
162
+ The code snippet below shows how to load the tokenizer and model and how to generate a response.
163
+
164
+ [**Quickstart with Google Colab📒 ⬅️ ⬅️**](https://colab.research.google.com/drive/1ZD1oB56PF0lF66RCuTVJYLTEV0tM3CFf?usp=sharing)
165
+
166
+ ```python
+ import numpy as np
+ import torch
+ import torchvision.transforms as T
+ # from decord import VideoReader, cpu
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+ def build_transform(input_size):
+     # Convert to RGB, resize to a square tile, and normalize with ImageNet statistics.
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     # Pick the (columns, rows) grid whose aspect ratio is closest to the image's.
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             # On a tie, prefer the larger grid only if the image has enough pixels for it.
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+     # calculate the existing image aspect ratio
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # enumerate every (columns, rows) grid with between min_num and max_num tiles
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the closest aspect ratio to the target
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         # crop box (left, upper, right, lower) of the i-th tile, in row-major order
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     # optionally append a downscaled copy of the whole image as a global view
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+ def load_image(image_file, input_size=448, max_num=12):
+     # Load an image, tile it, and stack the tiles into a (num_tiles, 3, H, W) tensor.
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+ # Load the model in bfloat16 on the GPU; trust_remote_code is needed for the custom model class.
+ model = AutoModel.from_pretrained(
+     "5CD-AI/Viet-InternVL2-1B",
+     torch_dtype=torch.bfloat16,
+     low_cpu_mem_usage=True,
+     trust_remote_code=True,
+ ).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Viet-InternVL2-1B", trust_remote_code=True, use_fast=False)
+
+ test_image = 'test-image.jpg'
+
+ pixel_values = load_image(test_image, max_num=12).to(torch.bfloat16).cuda()
+ generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)
+
+ # Prompt (Vietnamese): "Describe the image in detail."
+ question = '<image>\nMô tả hình ảnh một cách chi tiết.'
+
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')
+
+ # Follow-up turn: pass the returned history back in ("Câu hỏi khác" means "another question").
+ #question = "Câu hỏi khác ......"
+ #response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ #print(f'User: {question}\nAssistant: {response}')
+ ```
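
To see how the tiling step in the Quickstart behaves without loading any image or model, the sketch below re-implements only the grid selection from `dynamic_preprocess` and `find_closest_aspect_ratio` in plain Python. The helper names `candidate_grids`, `pick_grid`, and `tile_count` are hypothetical, and the sort key adds a deterministic tie-break that the Quickstart code does not have, so treat it as an approximation for reasoning about tile counts.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count; the grid itself is an added, deterministic tie-break.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Choose the grid whose aspect ratio is closest to that of the image."""
    aspect_ratio = width / height
    best_grid, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_grid, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # Same ratio, more tiles: only upgrade if the image has enough pixels.
            best_grid = (cols, rows)
    return best_grid


def tile_count(width, height, use_thumbnail=True, **kwargs):
    """Number of tiles produced, including the optional thumbnail."""
    cols, rows = pick_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


for size in [(448, 448), (896, 448), (1792, 1792)]:
    print(size, pick_grid(*size), tile_count(*size))
```

A small square image stays a single tile, while a large square image is upgraded to a 3×3 grid plus a thumbnail, which is why high-resolution documents cost noticeably more memory and context than small photos.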
271
+
272
+ ## Citation
273
+
274
+ ```
275
+ @article{vintern1b,
276
+ title={Vintern-1B: An Efficient Multimodal Model for Vietnamese},
277
+ author={Doan, Thanh-Khang and Huynh, Gia-Bao and Pham, Dinh-Thuc and Hoang, Tien-Dung and Pham, Huynh-Nhat},
278
+ year={2024},
279
+ url={https://huggingface.co/5CD-AI/Viet-InternVL2-1B}
280
+ }
281
+ ```
282
+
283
+ ## References
284
+
285
+ [1] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).
286
+
287
+ [2] Chen, Zhe, et al. "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
288
+
289
+ [3] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).
290
+
291
+ [4] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).