Eval time vs. score diagram

#950
by HenkPoley - opened

On the Portuguese version of the old/'v1' Open LLM Leaderboard I saw an interesting plot.

See the Metrics tab, and look at the bottom: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard

There you can kind of ogle the scaling laws. It also shows that models at around 9B can ace these older-style tests.

Maybe add something like that, or a size vs. score plot instead of evaluation time.
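To make the size vs. score idea a bit more concrete, here is a rough sketch of the trend such a plot would show: a least-squares fit of average score against log10 of the parameter count. The rows are made-up placeholder values, not leaderboard results, and the names `ROWS` and `fit_log_trend` are hypothetical; nothing here reflects how either leaderboard is actually implemented.

```python
import math

# Hypothetical placeholder rows, NOT real leaderboard results:
# (model name, parameters in billions, average score)
ROWS = [
    ("model-a", 1.0, 30.0),
    ("model-b", 3.0, 40.0),
    ("model-c", 9.0, 50.0),
    ("model-d", 27.0, 60.0),
]


def fit_log_trend(rows):
    """Least-squares fit of score = slope * log10(params_b) + intercept."""
    xs = [math.log10(params_b) for _, params_b, _ in rows]
    ys = [score for _, _, score in rows]
    n = len(rows)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_log_trend(ROWS)
print(f"score ~ {slope:.2f} * log10(params in B) + {intercept:.2f}")
```

The same (log10 size, score) pairs could be handed to whatever plotting library the Space already uses to draw the scatter plot; the fitted line is just one way to make the trend visible.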

Open LLM Leaderboard org

Hi @HenkPoley ,

This is a very good idea! We're a bit short on time at the moment. Would you be interested in contributing this feature?

Some of the notable models that performed well in Portuguese are:

THUDM/glm-4-9b-chat-1m
THUDM/glm-4-9b-chat
THUDM/glm-4-9b

Unfortunately, they trigger the error message: “needs to be launched with trust_remote_code=True”

Could the model be changed to somehow mitigate this? What are the prospects?

Open LLM Leaderboard org

Hi @CombinHorizon ,

We currently have results for THUDM/glm-4-9b and THUDM/glm-4-9b-chat, which we added manually; you can find them on the Leaderboard. If you're interested, we can add THUDM/glm-4-9b-chat-1m as well.
