Manli committed
Commit 7325a90 • 1 Parent(s): 9a9d8d8

Update README.md

Files changed (1):
  1. README.md: +40 -27
README.md CHANGED
@@ -5,39 +5,52 @@ language:
 pipeline_tag: image-text-to-text
 ---

-
 # Model description
- We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
-
- `XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
+ `xGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

- In the v1.1 (08/2024) release, we present a series of XGen-MM models including:
- - Base model `xgen-mm-phi3-mini-base-r-v1.5`
- - Single-image instruct model `xgen-mm-phi3-mini-instruct-r-v1.5`
- - Multi-image instruct model `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
- - DPO instruct model `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
+ In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
+ - [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
+ - [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-singleimg-r-v1.5): `xgen-mm-phi3-mini-instruct-singleimg-r-v1.5`
+ - [🤗 xGen-MM-instruct-interleave (our main instruct model)](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-interleave-r-v1.5`
+ - [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`

- In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
- - [MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
- - BLIP3-OCR-200M: a dataset with dense OCR annotations.
- - BLIP3-GROUNDING-50M: a dataset for enhancing the ability to ground semantic concepts in images.
- - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
+ In addition to the models, our team also released a series of datasets for multi-modal pre-training, including:
+ - [🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
+ - [🤗 BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m): a dataset with dense OCR annotations.
+ - [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
+ - BLIP3-KALE (stay tuned): a large-scale curated high-quality caption dataset.
+
+ For more details, check out our [tech report](https://arxiv.org/pdf/2408.08872), [fine-tuning code](https://github.com/salesforce/LAVIS/tree/xgen-mm), and project page (coming soon).

- # Data
+ # Data
+ The instruct model is fine-tuned on a mixture of around 1 million samples from multiple domains. All the fine-tuning data are from public sources, most of which are covered in [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).

 # Results

- ### Base model (without instruction tuning)
-
- ### Instruct model
-
- ### DPO model
+ ### Single-image benchmarks
+
+ | Model (Size) | SEED-IMG | SEED v2 | MMB (dev) | MMStar | MME (norm) | CVB-2D | CVB-3D | RealWorldQA | MMMU (val) | MathVista | SciQA | POPE | TextVQA | Avg. all | Avg. perc. |
+ |--------------------------------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | Closed-source models | | | | | | | | | | | | | | | |
+ | GPT-4V<sup>&ast;</sup> | 72.0 | - | 80.8 | 49.7 | 63.3 | 64.3 | 73.8 | 56.5 | 53.8 | 48.2 | 82.1 | 75.4 | - | - | - |
+ | MM1-3B-Chat (3B) | 68.8 | - | 67.8 | - | 62.9 | - | - | - | 33.9 | - | - | 87.4 | - | - | - |
+ | Open-source models | | | | | | | | | | | | | | | |
+ | HPT-1.5-edge (4B) | **72.3** | - | 74.6 | 45.8 | - | - | - | - | 42.6 | **45.1** | 85.4 | **91.0** | - | - | - |
+ | VILA-1.5-3B (3B) | 67.9 | - | 63.4 | - | - | - | - | - | 33.3 | - | 69.0 | 85.9 | - | - | - |
+ | VILA-1.5-3B<sup>&ast;&ast;</sup> (3B) | 67.9 | 51.9 | 62.4 | 40.3 | 58.5 | 50.1 | 60.3 | 53.3 | 34.1 | 30.6 | 68.9 | 86.9 | 58.1 | 55.6 | 59.1 |
+ | phi-3-vision (4B) | - | - | 80.5 | - | - | - | - | - | - | 44.5 | 90.8 | 85.8 | 70.9 | - | - |
+ | phi-3-vision<sup>&ast;&ast;</sup> (4B) | 71.0 | 52.7 | 74.2 | <u>47.9</u> | 55.3 | 60.7 | 68.2 | 59.1 | **46.1** | **45.1** | **90.2** | 83.5 | **73.3** | 63.6 | 63.6 |
+ | **<u>xGen-MM-inst. (4B)</u>** | 71.8 | <u>53.9</u> | <u>76</u> | 46.7 | <u>63.8</u> | <u>66.2</u> | **75.4** | **61.6** | <u>42.8</u> | 39.2 | 85.6 | 87.0 | <u>72.0</u> | <u>64.8</u> | <u>66.9</u> |
+ | xGen-MM-inst.-interleave (4B) | <u>72.2</u> | **55.5** | **76.8** | **48.1** | **64.4** | **69.3** | <u>72.3</u> | <u>60.5</u> | 41.1 | <u>39.6</u> | <u>88.3</u> | 87.0 | 71.0 | **65.1** | **67.3** |
+
+ &ast; GPT-4V (gpt-4-1106-preview) results are taken from this third-party [leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard).
+ &ast;&ast; Model results are tested with our evaluation code for a fair comparison.

 # How to use

- Please check out our [inference notebook](demo.ipynb) for example code to use our model. We also provide example script for [batch inference](batch_inference.ipynb).
+ Please check out our [inference notebook](demo.ipynb) for example code to use our model. We also provide an example script for [batch inference](batch_inference.ipynb).

 # Reproducibility:
@@ -53,23 +66,23 @@ We strongly recommend users assess safety and fairness before applying to downst

 # License

- Our code and weights are released under the Creative Commons Attribution Non Commercial 4.0 [LICENSE](LICENSE.txt). Please fill out a form at [here](https://forms.gle/ffPc9oZC2ZGeJ1N68) to consult the commercial use of model weights.
+ Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.

- # Code acknowledgement
+ # Code acknowledgment
 Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
- Our evaluation code is based on [VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)](https://github.com/open-compass/VLMEvalKit).
+ The evaluation code for the instruct models is based on [VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)](https://github.com/open-compass/VLMEvalKit).

 We thank the authors for their open-source implementations.


 # Citation
 ```
- @misc{xgen_mm_phi3_mini,
-     title={xgen-mm-phi3-mini-instruct Model Card},
-     url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
-     author={Salesforce AI Research},
-     month={May},
-     year={2024}
+ @article{blip3-xgenmm,
+     author  = {Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu},
+     title   = {xGen-MM(BLIP-3): A Family of Open Large Multimodal Models},
+     journal = {arXiv preprint},
+     month   = {August},
+     year    = {2024},
 }
 ```
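
For quick reference alongside the "How to use" section above: the notebooks remain the authoritative usage examples, but a minimal loading sketch typically looks like the following. It assumes the released checkpoints load through the standard 🤗 Transformers auto-classes with `trust_remote_code=True` and uses one of the checkpoint names listed in the model description; prompt formatting, image preprocessing, and generation arguments are model-specific and are covered in `demo.ipynb` / `batch_inference.ipynb`.

```python
# Minimal loading sketch (an assumption-based example, not the official snippet).
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

# One of the released checkpoints listed in the model description above.
MODEL_ID = "Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5"

# trust_remote_code pulls in the repository's custom xGen-MM modeling/processing code.
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

model.eval()  # inference only; move to GPU / lower precision as needed

# Image preprocessing, the chat prompt template, and model.generate(...) arguments
# are defined by the repo's custom code -- follow demo.ipynb / batch_inference.ipynb.
```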