sariola committed
Commit 3ca3f58
1 Parent(s): 0cb362c

Update README.md

Files changed (1): README.md (+28 -23)

README.md CHANGED
model_type: phi3.5
quantized_by: Flow AI
---

# Flow-Judge-v0.1-GGUF
- Original model: [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1)
- Model collection: [Flow-Judge-v0.1 models](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)

This repo contains GGUF quants for [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1).
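If you prefer to fetch the quantized file programmatically instead of cloning the repo, a minimal sketch with the `huggingface_hub` client is shown below (this is not part of the original card; the repo id and filename are taken from this page):

```python
# Minimal sketch: download the Q4_K_M quant and get a local path you can pass
# to llama.cpp or any other GGUF runtime. Requires `pip install huggingface_hub`.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="flowaicom/Flow-Judge-v0.1-GGUF",
    filename="flow-judge-v0.1-Q4_K_M.gguf",
)
print(gguf_path)
```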

## Quantization config

Version used: github:ggerganov/llama.cpp/8e6e2fbe1458ac91387266241262294a964d6b95?narHash=sha256-Z3Rg43p8G9MdxiGvSl9m43KsJ1FvvhQwtzRy/grg9X0%3D

```shell
llama-convert-hf-to-gguf ./flowaicom/Flow-Judge-v0.1 --outfile flow-judge-v0.1-bf16.gguf
llama-quantize flow-judge-v0.1-bf16.gguf flow-judge-v0.1-Q4_K_M.gguf Q4_K_M
```
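As a quick sanity check that the resulting quant loads and generates, a minimal sketch with `llama-cpp-python` follows (the card itself only shows the llama.cpp CLI, so treat this as an illustration rather than the documented workflow):

```python
# Minimal smoke test (requires `pip install llama-cpp-python`): load the quantized
# file produced above and generate a short completion.
from llama_cpp import Llama

llm = Llama(
    model_path="flow-judge-v0.1-Q4_K_M.gguf",
    n_gpu_layers=33,  # mirrors the -ngl 33 used with llama-server below
    n_ctx=4096,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=8,
)
print(out["choices"][0]["message"]["content"])
```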

## Running the GGUF file

```shell
llama-server -ngl 33 -t 16 -m Flow-Judge-v0.1-GGUF/flow-judge-v0.1-Q4_K_M.gguf -
```
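Once the server is up it exposes an OpenAI-compatible chat completions endpoint (by default on 127.0.0.1:8080), so you can send a Flow Judge evaluation prompt over HTTP. A minimal sketch, with the prompt left as a placeholder:

```python
# Minimal sketch: query the llama-server started above. Assumes the default
# host/port (127.0.0.1:8080); adjust if you pass --host/--port. EVAL_PROMPT is a
# placeholder for a Flow Judge prompt built from your rubric, inputs and output.
import requests

EVAL_PROMPT = "..."  # your evaluation prompt here

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": EVAL_PROMPT}],
        "temperature": 0.1,
        "max_tokens": 1000,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```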

# Original model card: Flow-Judge-v0.1

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png" alt="Centered image">
</p>
<p align="center">🚀 <a href="https://www.flow-ai.com/judge">Flow Judge</a> | 📄 <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a> | 💻 <a href="https://github.com/flowaicom/flow-judge">flow-judge</a></p>

## Model Summary

Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.

__Quantized weights__
- [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
- [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
 
- 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.

- Easy to interpret results:
  - Flow Judge produces structured evaluations with `<feedback>` and `<score>` tags; a minimal parsing sketch follows this list.
  - Qualitative feedback: Flow Judge detects errors, grades outputs, and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
  - Score: Based on the grading rubric, Flow Judge returns a numerical score on a binary, 3-Likert, or 5-Likert scale.
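Because the tags are plain text in the completion, they are easy to extract downstream. An illustrative parser (a sketch, not the one shipped in the `flow-judge` library):

```python
# Minimal sketch: pull the <feedback> and <score> blocks out of a Flow Judge
# completion with regular expressions.
import re

def parse_judge_output(completion: str) -> tuple[str, int]:
    """Return (feedback, score) from a tagged evaluation output."""
    feedback = re.search(r"<feedback>\s*(.*?)\s*</feedback>", completion, re.DOTALL)
    score = re.search(r"<score>\s*(\d+)\s*</score>", completion)
    if feedback is None or score is None:
        raise ValueError("completion is missing <feedback> or <score> tags")
    return feedback.group(1), int(score.group(1))

example = "<feedback>The response cites the context correctly.</feedback>\n<score>5</score>"
print(parse_judge_output(example))  # ('The response cites the context correctly.', 5)
```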
 
 
Flow-Judge-v0.1 has been trained on synthetically generated datasets. This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.

Read more about the dataset construction [here](https://www.flow-ai.com/blog/flow-judge#dataset-construction).

### Fine-tuning

For fine-tuning, we used Axolotl's preprocessing to ensure the input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRA. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge#fine-tuning).
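For readers unfamiliar with rank-stabilized LoRA, the sketch below expresses the same idea with the 🤗 PEFT API; it is an illustration only, not the authors' Axolotl configuration, and the rank and target modules are placeholders:

```python
# Illustrative only: attach a rank-stabilized LoRA adapter to the base model.
# Hyperparameters and target modules are assumptions, not the published recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    use_rslora=True,  # rank-stabilized scaling (alpha / sqrt(r))
    target_modules=["qkv_proj", "o_proj"],  # assumed Phi-3.5 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```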

## Usage
120
 
 
_(usage instructions, hardware requirements, and the preceding benchmark table are not shown in this diff)_

\* _Reported in model paper_


### RAGTruth

_(table not shown in this diff)_

\* _reported in model paper_


### HaluEval, Covid-QA, PubMedQA

_(table not shown in this diff)_

\* _reported in model paper_

### Feedback Bench

_(table not shown in this diff)_

\* _reported in model paper using reference answers_

## License
We opted for the Apache 2.0 license for Flow Judge to provide the community with an open, small yet powerful LM evaluator. Our goal is to support the wider adoption of rigorous evaluation techniques in LLM system development, making them more accessible to practitioners and researchers.

## Limitations and future work
Multilingual evaluation: Flow Judge has been fine-tuned exclusively on English data. While the foundation model (Phi-3.5-mini-instruct [17]) may possess multilingual capabilities, we have not systematically evaluated Flow Judge's performance in non-English contexts. We plan to explore multilingual LM evaluators in the future.

Long context and structured inputs: Our training dataset encompasses a wide range of custom metrics relevant to evaluating LLM systems. However, it does not include examples with long context inputs or structured data formats such as JSON, since these are harder to synthetically generate. This limitation may impact Flow Judge's performance when evaluating responses that require processing extensive context or parsing structured input. Extending our model's capabilities to handle these input types represents an important area for future research.

Math and coding: The current version has not been trained on specific task domains such as arithmetic problems or code evaluation. As a result, its performance in these specialized areas may be limited. Future iterations of the model should address these gaps.

Domain-specific knowledge and complex multi-step evaluations: Flow Judge may struggle with highly specialized domain knowledge or proprietary data outside the training scope of its foundation model. Additionally, evaluation tasks requiring multi-step reasoning or complex logical processes may challenge the model's capabilities. We strongly recommend conducting meta-evaluations of the model's performance before deploying it in specialized or highly complex evaluation scenarios.