Feature Request: change request file format to disambiguate chat and non-chat models?

#954
by CombinHorizon - opened

For example, instead of:
ModelName-SizeB_eval_request_False_bfloat16_Original.json
perhaps:

ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
ModelName-SizeB_eval_request_False_bfloat16_ChatOff_Original.json

  • so that requests don't overwrite each other on reruns; the scores also seem to differ significantly (especially for some models) depending on this configuration
  • so that chat-on and chat-off entries are listed separately on the leaderboard
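A minimal sketch (with a hypothetical helper and field names, just to illustrate the proposed naming scheme) of how the chat-template flag could be folded into the request filename so the two configurations get distinct files:

```python
def request_filename(model, precision, use_chat_template, weight_type="Original"):
    # Hypothetical helper, only to illustrate the proposed naming scheme;
    # "False" mirrors the fixed flag already present in current request filenames.
    chat_flag = "ChatOn" if use_chat_template else "ChatOff"
    return f"{model}_eval_request_False_{precision}_{chat_flag}_{weight_type}.json"

# The two configurations no longer collide on reruns:
print(request_filename("ModelName-SizeB", "bfloat16", use_chat_template=True))
# -> ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
print(request_filename("ModelName-SizeB", "bfloat16", use_chat_template=False))
# -> ModelName-SizeB_eval_request_False_bfloat16_ChatOff_Original.json
```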

From what I'm seeing, for some similar models the chat template seems to raise IFEval scores (⇈) and lower MUSR scores (⇊), but by how much?

If this is updated, it may help to look at the request files' commit history, and at the multiple result files (which don't overwrite each other), to disambiguate and sort out the existing entries.
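If useful, a rough sketch of pulling that commit history with huggingface_hub (the repo id here is an assumption; adjust it to the actual requests dataset):

```python
from huggingface_hub import HfApi

api = HfApi()

# Assumed repo id for the leaderboard's request files; adjust if it differs.
commits = api.list_repo_commits("open-llm-leaderboard/requests", repo_type="dataset")

# Commit titles typically mention the request file that was added or updated,
# which can help reconstruct whether a given rerun had the chat template on.
for commit in commits:
    print(commit.created_at, commit.title)
```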

The chat template's effect on scores seems to be more significant than that of bfloat16 vs float16.

Question: what determines which chat template is used? Which file or process (e.g. generation_config.json)? What other assumptions or defaults apply?
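For context, here is how I would check which template a given model ends up with in transformers; as far as I can tell the template normally comes from the chat_template field of tokenizer_config.json rather than generation_config.json (the model id below is just a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder model id; any model that ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("org/ModelName-SizeB")

# The raw Jinja template, usually loaded from tokenizer_config.json ("chat_template").
print(tokenizer.chat_template)

# What a chat-enabled evaluation harness would actually feed to the model.
messages = [{"role": "user", "content": "What is 2 + 2?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```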

Open LLM Leaderboard org

Hi @CombinHorizon ,

Thank you for your suggestion!

We agree that this modification can help compare a model with and without the chat template. We're actually in the process of revamping our request naming system, as some current parameters are no longer relevant.

We'll come back to you as soon as we have decided on a new simpler format!
