Results from the SentenceTransformer library sometimes differ from those produced by FlagModel

by Anaudia - opened Aug 6, 2023

Aug 6, 2023

Long sentences or paragraphs seem to be split apart in the FlagModel implementation and then embedded piece by piece, so the output contains two or more embeddings for a single input. This does not happen with the SentenceTransformer implementation: there, one input produces one output containing one embedding.

Shitao

Beijing Academy of Artificial Intelligence org Aug 6, 2023

We don't configure the tokenizer to split long sentences, and I cannot reproduce this error. Could you provide a script that reproduces it?

Anaudia

Aug 6, 2023

Of course:

text = "In a bustling town where shadows whispered secrets, a cat named Mira gained the ability to speak human language. One evening, Mira whispered a forgotten legend about a hidden treasure beneath the town's oldest tree to a young, curious adventurer. Together, they embarked on a moonlit quest, forging an unbreakable bond while unearthing mysteries of the past."

from FlagEmbedding import FlagModel
query_instruction_for_retrieval='Represent this sentence for searching relevant passages: '
model = FlagModel('BAAI/bge-large-en', query_instruction_for_retrieval=query_instruction_for_retrieval)
model.encode(text)

The output is:

array([[ 0.01307663, 0.01230509, -0.02265706, ..., 0.01391939,
-0.03492293, -0.00590275],
[ 0.00093336, 0.02802942, -0.03854717, ..., -0.0290005 ,
-0.00743153, 0.00360901]], dtype=float32)

If I run:

from sentence_transformers import SentenceTransformer
instruction = query_instruction_for_retrieval
model2 = SentenceTransformer('BAAI/bge-large-en')
p_embeddings = model2.encode(text, normalize_embeddings=True)

The output is:

array([ 0.01453966, 0.01963736, -0.02570449, ..., 0.01102038,
-0.03386544, -0.00838666], dtype=float32)

Hope that is helpful!

Shitao

Beijing Academy of Artificial Intelligence org Aug 6, 2023

Thanks!
FlagEmbedding did not support a plain string as input, which is what caused this error.
We have updated the FlagEmbedding repo; you can install the new version with:
pip install -U FlagEmbedding
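
For readers wondering why a single string could yield two embeddings, here is a minimal sketch of one plausible mechanism. It assumes the encoder batches its input by slicing `sentences[i:i + batch_size]` with a batch size of 256; both `make_batches` and that default are illustrative assumptions, not the library's confirmed code. Under that assumption a string is sliced by characters, so a text of roughly 360 characters becomes two chunks, while the same text wrapped in a list stays a single item.

```python
def make_batches(sentences, batch_size=256):
    """Slice the input into consecutive batches (hypothetical helper).

    Works on any sequence: a list is sliced by items, a str by characters.
    """
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


# Stand-in for a long paragraph: 72 * 5 = 360 characters.
text = "word " * 72

as_string = make_batches(text)    # the string itself is sliced into character chunks
as_list = make_batches([text])    # a one-element list stays one batch of one sentence

print(len(as_string), [len(chunk) for chunk in as_string])
print(len(as_list), len(as_list[0]))
```

On versions that still show the problem, passing `[text]` instead of `text` to `encode` is a reasonable workaround to try.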

Anaudia

Aug 6, 2023

Perfect! Thank you

Anaudia

Aug 6, 2023

Can I ask another question: what is the best way to compare two sentences with each other? I have job requirements and people's skills, and I want to find out whether a person has the required skill. However, I have noticed that the best matches often depend more on the similarity of the sentence structure (length, grammar, etc.) than on the actual content. Which instruction would you recommend using with your model, and do you have any general tips?

Shitao

Beijing Academy of Artificial Intelligence org Aug 6, 2023

If you are searching for passages that answer a short query, add the provided instruction to the query; in all other cases no instruction is needed, and you can use the original text directly.
The bge models focus on general-purpose ability. Since your scenario differs from the classical retrieval or similarity tasks, it is better to fine-tune the model on your own data; you can use this tool to fine-tune it.
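
A minimal sketch of the rule above: the instruction is prepended to short queries only, and passages are embedded as they are. The helper name `prepare_inputs` and the sample texts are hypothetical; the instruction string is the one quoted earlier in this thread.

```python
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_inputs(queries, passages, instruction=QUERY_INSTRUCTION):
    """Prefix each query with the instruction; leave passages unchanged."""
    prefixed_queries = [instruction + q for q in queries]
    return prefixed_queries, list(passages)


# Hypothetical example data.
queries = ["experience with Python web frameworks"]
passages = ["Built and maintained REST services using Flask for three years."]

model_queries, model_passages = prepare_inputs(queries, passages)
print(model_queries[0])
print(model_passages[0])
```

The two returned lists can then be passed to whichever encoder you use. For sentence-to-sentence similarity, where both sides are of the same kind, the advice above suggests skipping the prefix on both sides.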

Anaudia

Aug 6, 2023

Thank you so much! That is incredibly helpful – and I see the GitHub repo is now also online :)

Shitao

Beijing Academy of Artificial Intelligence org Aug 6, 2023

In addition, you can select negatives that have the same sentence structure as your sentences, so that the model depends more on the actual content.
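
The suggestion above can be sketched as a small training file with structure-matched hard negatives. This assumes the JSON Lines layout with `query`, `pos`, and `neg` keys that the FlagEmbedding fine-tuning instructions describe; please check the current documentation before relying on it. All sentences below are invented for illustration.

```python
import json

# Hypothetical requirement/skill pairs, for illustration only.
examples = [
    {
        "query": "Must be able to build REST APIs in Python.",
        # Positive: same meaning, deliberately different structure and length.
        "pos": ["Five years of backend work, mostly Flask and FastAPI services."],
        # Hard negative: near-identical structure and length, different skill.
        "neg": ["Must be able to build brick walls in winter."],
    },
]


def to_jsonl(records):
    """Serialize one training record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


jsonl = to_jsonl(examples)
print(jsonl)
```

Pairing a structurally different positive with a structurally similar negative means sentence shape alone cannot separate them, so the model has to rely on content.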

Anaudia

Aug 6, 2023

Yes, I agree – I can probably create sentences that are very diverse in structure and length, so that only the meaning stands out. I'll let you know if it works :D Thanks again – really cool project!

Anaudia

Aug 6, 2023

For the fine-tuning, would I use some kind of prompt, such as 'Create this sentence so that it can be compared in meaning with other sentences', similar to how you use "Represent this sentence for searching relevant passages:" for retrieval?

Shitao

Beijing Academy of Artificial Intelligence org Aug 6, 2023

You can try both with and without a prompt. In fact, I'm not sure which is better for your task.

Anaudia

Aug 6, 2023

Ok, will try. Thanks again
