imone committed
Commit
0d6a884
1 Parent(s): 700b6b3

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -144,6 +144,7 @@ assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 42
 <div align="center">
 <h2> (Experimental) Evaluator / Feedback Capabilities </h2>
 </div>
+
 We've included evaluator capabilities in this release to advance open-source models as evaluators. You can use `Default Mode (GPT4 Correct)` with the following prompt (same as [Prometheus](https://huggingface.co/datasets/kaist-ai/Feedback-Collection)) to evaluate a response.
 
 ```
@@ -191,6 +192,7 @@ Score 5: {orig_score5_description}
 
 <details>
 <summary>Evaluation Details(click to expand)</summary>
+
 *: ChatGPT (March) results are from [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774), [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub), and our evaluation. Please note that ChatGPT is not a fixed baseline and evolves rapidly over time.
 
 ^: Zephyr-β often fails to follow few-shot CoT instructions, likely because it was aligned with only chat data but not trained on few-shot data.
@@ -198,6 +200,7 @@ Score 5: {orig_score5_description}
 **: Mistral and Open-source SOTA results are taken from reported results in instruction-tuned model papers and official repositories.
 
 All models are evaluated in chat mode (e.g. with the respective conversation template applied). All zero-shot benchmarks follow the same setting as in the AGIEval paper and Orca paper. CoT tasks use the same configuration as Chain-of-Thought Hub, HumanEval is evaluated with EvalPlus, and MT-bench is run using FastChat. To reproduce our results, follow the instructions in [our repository](https://github.com/imoneoi/openchat/#benchmarks).
+
 </details>
 <div>
 <h3>HumanEval+</h3>
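
For context on the evaluator section touched by the first hunk: the README's Prometheus-style evaluation prompt is sent through the model's single-turn `GPT4 Correct` conversation format. Below is a minimal sketch of that wrapping, not part of this commit; the use of `transformers`, the model id `openchat/openchat_3.5`, and the placeholder prompt body are assumptions, and the actual Prometheus-format prompt from the README is elided.

```python
# Minimal sketch (assumed setup, not part of this commit): wrap an evaluation
# request in the Default Mode (GPT4 Correct) single-turn conversation format.
from transformers import AutoTokenizer

# Assumption: the model id; any OpenChat-3.5-style checkpoint with this
# conversation template would work the same way.
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")

# Placeholder: paste the Prometheus-format prompt from the README here
# (task description, instruction, response to evaluate, score rubric).
evaluation_request = "..."

# Default Mode (GPT4 Correct): the model completes the assistant turn with
# feedback and a score for the evaluated response.
prompt = (
    f"GPT4 Correct User: {evaluation_request}<|end_of_turn|>"
    f"GPT4 Correct Assistant:"
)
tokens = tokenizer(prompt).input_ids  # ready to pass to the model for generation
```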