score mteb french

#2
by abhamadi - opened

Hello,
Thanks for the great open-source model! However, there seems to be a discrepancy in the model's score on MTEB-French: I ran the evaluation locally and got an average of 59.92, which differs from the 66.6 reported on the leaderboard.

Alibaba-NLP org

Could you please share your evaluation results, or at least the scores on some of the datasets? We would like to compare them with ours. Please note that this model is trained with instructions (instruct training), so when encoding text you must concatenate the task instruction on the query side.
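For reference, here is a minimal sketch of what "concatenating the instruction on the query side" looks like with sentence-transformers. The `Instruct: {task}\nQuery: {query}` prompt format follows the model card; the task description and the French texts are made-up examples:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Prompt format used by the gte-Qwen2 instruct models (see the model card)
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "comment fonctionne la photosynthèse ?")]
documents = ["La photosynthèse convertit la lumière du soleil en énergie chimique."]  # documents get no instruction

# Normalize so the dot product below behaves like cosine similarity
query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
print(query_embeddings @ document_embeddings.T)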

"""Example script for benchmarking all datasets constituting the MTEB French leaderboard & average scores"""

from future import annotations
import os
import logging
import torch
import gc
from sentence_transformers import SentenceTransformer
device = torch.device('cuda:0')
torch.cuda.set_device(device)
from mteb import MTEB

logging.basicConfig(level=logging.INFO)

logger = logging.getLogger("main")

TASK_LIST_CLASSIFICATION = [
"AmazonReviewsClassification",
"MasakhaNEWSClassification",
"MassiveIntentClassification",
"MassiveScenarioClassification",
"MTOPDomainClassification",
"MTOPIntentClassification",
]

TASK_LIST_CLUSTERING = [
"AlloProfClusteringP2P",
"AlloProfClusteringS2S",
"HALClusteringS2S",
"MasakhaNEWSClusteringP2P",
"MasakhaNEWSClusteringS2S",
"MLSUMClusteringP2P",
"MLSUMClusteringS2S",
]

TASK_LIST_PAIR_CLASSIFICATION = [
"OpusparcusPC",
"PawsX",
]

TASK_LIST_RERANKING = ["SyntecReranking", "AlloprofReranking"]

TASK_LIST_RETRIEVAL = [
"AlloprofRetrieval",
"BSARDRetrieval",
"SyntecRetrieval",
"XPQARetrieval",
"MintakaRetrieval",
]

TASK_LIST_STS = ["SummEvalFr", "STSBenchmarkMultilingualSTS", "STS22", "SICKFr"]

TASK_LIST = (
TASK_LIST_CLASSIFICATION
+ TASK_LIST_CLUSTERING
+ TASK_LIST_PAIR_CLASSIFICATION
+ TASK_LIST_RERANKING
+ TASK_LIST_RETRIEVAL
+ TASK_LIST_STS
)

model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"

model = SentenceTransformer(model_name, trust_remote_code=True)

logger.info(f"Task list : {TASK_LIST}")
for task in TASK_LIST:
logger.info(f"Running task: {task}")
evaluation = MTEB(
tasks=[task], task_langs=["fr"]
) # Remove "fr" for running all languages
evaluation.run(model, batch_size = 1, output_folder=f"results/{model_name}")

These are the results from running the code above, packaged as 26 JSON files:
https://www.dropbox.com/scl/fi/7is59edlapzdnhacp2ysf/Alibaba-NLP__gte-Qwen2-1.5B-instruct.zip?rlkey=pv0hppw7dvdbb25e7rftybd2c&st=867jjbh0&dl=0
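For anyone who wants to cross-check the average, here is a rough sketch of recomputing it from those result files. The recursive search for `main_score` and the x100 rescaling are assumptions made to cope with differing mteb output schemas, not an official API:

import json
from pathlib import Path

def iter_main_scores(obj):
    # Walk the whole JSON structure; the exact layout of mteb result files
    # varies across versions, so we don't rely on a fixed schema.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "main_score" and isinstance(value, (int, float)):
                yield value
            else:
                yield from iter_main_scores(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_main_scores(item)

results_dir = Path("results/Alibaba-NLP/gte-Qwen2-1.5B-instruct")
scores = {}
for path in sorted(results_dir.glob("*.json")):
    values = list(iter_main_scores(json.loads(path.read_text())))
    if values:
        # Average over splits/languages within a task; x100 assumes scores are
        # stored as fractions, to match the leaderboard scale
        scores[path.stem] = 100 * sum(values) / len(values)

for task, score in sorted(scores.items()):
    print(f"{task}: {score:.2f}")
print(f"Average over {len(scores)} tasks: {sum(scores.values()) / len(scores):.2f}")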
