---
license: apache-2.0
---

**English** | [中文](./README_zh.md)

<!-- [Arxiv PDF](https://arxiv.org/pdf/2407.19669), [HF paper page](https://huggingface.co/papers/2407.19669)
-->

## Code implementation of Qwen2-based embeddings

This model code is for Qwen2-based embedding models.

We enable bidirectional attention by default.

### Usage

1. Download `configuration.py` and `modeling.py` into your saved `gte-Qwen2` model directory.
2. Replace `modeling_qwen.` with `modeling.` in the `auto_map` field of `config.json` (see the sketch below).
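
For illustration, step 2 can be done with a small script like the following. This is a sketch, not part of this repo: the `config_path` is an assumed local path, and you should check the actual class names in your own `config.json`.

```python
import json

# Hypothetical path to your saved model directory; adjust as needed.
config_path = 'gte-Qwen2-1.5B-instruct/config.json'

with open(config_path) as f:
    config = json.load(f)

# Rewrite every auto_map entry to point at the downloaded modeling.py,
# e.g. 'modeling_qwen.Qwen2Model' -> 'modeling.Qwen2Model'.
config['auto_map'] = {
    name: ref.replace('modeling_qwen.', 'modeling.')
    for name, ref in config['auto_map'].items()
}

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
```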

### Recommendation: Enable Unpadding and Acceleration with `xformers`

This code supports accelerating attention computation with `xformers`,
which can automatically choose the optimal implementation based on the device type (e.g. `flash_attn`).
As a result, it also achieves significant acceleration on older devices such as the V100.

First, install `xformers` (with `pytorch` already installed):
```bash
# If pytorch was installed with conda:
conda install xformers -c xformers

# If pytorch was installed with pip:
# CUDA 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
For more information, refer to [Installing xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).

Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`,
and set `torch_dtype` to `torch.float16` (or `torch.bfloat16`) to enable the acceleration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'Alibaba-NLP/gte-Qwen2-1.5B-instruct'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16
).to(device)

inputs = tokenizer(['test input'], truncation=True, max_length=8192, padding=True, return_tensors='pt')

with torch.inference_mode():
    outputs = model(**inputs.to(device))
```
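
The snippet above stops at the raw transformer outputs. To turn them into sentence embeddings, gte-style instruct models typically pool the hidden state of each sequence's last non-padding token and L2-normalize it. A minimal sketch of that step, continuing the example above (the pooling logic here is illustrative, not part of this repo's code):

```python
import torch.nn.functional as F

# Last-token pooling: take the hidden state at each sequence's final
# non-padding position (assumes right padding, as produced above).
last_pos = inputs['attention_mask'].sum(dim=1) - 1
batch_ids = torch.arange(last_pos.size(0), device=device)
embeddings = outputs.last_hidden_state[batch_ids, last_pos]

# L2-normalize so dot products equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
```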

Alternatively, you can set `unpad_inputs` and `use_memory_efficient_attention` to `true` directly in the model's `config.json`,
eliminating the need to set them in code.
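
The relevant entries would then look like this (a sketch showing only these two fields; the rest of your `config.json` stays unchanged):

```json
{
  "unpad_inputs": true,
  "use_memory_efficient_attention": true
}
```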

## Citation

```bibtex
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}
```