flowaicom/Flow-Judge-v0.1

@@ -25,27 +25,15 @@ base_model:
 - microsoft/Phi-3.5-mini-instruct
 ---
-# Flow-Judge-v0.1
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/NgFJqVmUgrhOnphd47VEm.png)
-<div class="center-content">
-    <div class="links">
-        <a href="https://github.com/flowaicom/flow-judge">flow-judge library</a>
-      |
-        <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a>
-    </div>
-</div>
 ## Model Summary
 Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models on both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
-__More information__
-- [Flow Judge website](https://www.flow-ai.com/judge)
-- [Technical report](https://www.flow-ai.com/blog/flow-judge)
-- [Github repo](https://github.com/flowaicom/flow-judge)
 __Quantized weights__
 - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
 - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
@@ -64,7 +52,7 @@ Flow Judge is intended to be used on custom LLM system evaluation tasks.
     - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
 - Easy to interpret results:
-    - Flow Judge produces structured evaluations with <feedback> and <score> tags.
         - Qualitative feedback: Flow Judge detects errors, grades outputs, and provides qualitative feedback that explains why it assigned a particular score from the rubric, highlighting problematic parts of the response.
         - Score: Based on a grading rubric, Flow Judge returns a numerical score on a binary, 3-Likert, or 5-Likert scale.
@@ -86,12 +74,12 @@ Flow-Judge-v0.1 has been trained on synthetically generated datasets. The constr
 This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
-Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge)
 ### Fine-tuning
-For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge).
 ## Usage
@@ -376,7 +364,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
   </tbody>
 </table>
-\* _not suitable for 3 likert_
 ### RAGTruth
@@ -496,7 +484,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
   </tr>
 </table>
-\* _reported in Galileo luna paper_
 ### HaluEval, Covid-QA, PubMedQA
@@ -677,7 +665,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
   </tbody>
 </table>
-\* _reported in lynx paper_
 ### Feedback Bench
 <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
@@ -728,4 +716,4 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
   </tr>
 </table>
-\* _reported in prometheus paper using reference answer. Note the rest of the models have been evaluated without reference answer_

 - microsoft/Phi-3.5-mini-instruct
 ---
+<p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png" alt="Centered image">
+</p>
+<p align="center">🚀 <a href="https://www.flow-ai.com/judge">Flow Judge</a> | 📄 <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a> | 💻 <a href="https://github.com/flowaicom/flow-judge">flow-judge</a></p>
 ## Model Summary
 Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models on both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
 __Quantized weights__
 - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
 - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
     - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
 - Easy to interpret results:
+    - Flow Judge produces structured evaluations with `<feedback>` and `<score>` tags.
         - Qualitative feedback: Flow Judge detects errors, grades outputs, and provides qualitative feedback that explains why it assigned a particular score from the rubric, highlighting problematic parts of the response.
         - Score: Based on a grading rubric, Flow Judge returns a numerical score on a binary, 3-Likert, or 5-Likert scale.
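
Because the output is wrapped in `<feedback>` and `<score>` tags, it can be consumed programmatically. The following is a minimal sketch, not part of the flow-judge library: the `Evaluation` type and `parse_evaluation` helper are hypothetical names, and the code assumes that each tag has a matching closing tag (`</feedback>`, `</score>`) and that the score is a bare integer.

```python
import re
from typing import NamedTuple, Optional


class Evaluation(NamedTuple):
    """Hypothetical container for one parsed judge response."""
    feedback: str
    score: Optional[int]


# Assumption: the model closes each tag, e.g. <score>3</score>.
_FEEDBACK_RE = re.compile(r"<feedback>(.*?)</feedback>", re.DOTALL)
_SCORE_RE = re.compile(r"<score>\s*(-?\d+)\s*</score>")


def parse_evaluation(text: str) -> Evaluation:
    """Extract the feedback text and integer score from raw model output.

    Missing feedback becomes an empty string and a missing score becomes
    None, so callers can decide how to handle malformed generations.
    """
    feedback_match = _FEEDBACK_RE.search(text)
    score_match = _SCORE_RE.search(text)
    return Evaluation(
        feedback=feedback_match.group(1).strip() if feedback_match else "",
        score=int(score_match.group(1)) if score_match else None,
    )
```

Returning `None` for a missing score, rather than raising, keeps batch evaluation runs from failing on a single malformed generation; a stricter pipeline might prefer to raise instead.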
 This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
+Read more about the dataset construction [here](https://www.flow-ai.com/blog/flow-judge#dataset-construction).
 ### Fine-tuning
+For fine-tuning, we used Axolotl's preprocessing to keep the input training data consistent. We then ran supervised fine-tuning on microsoft/Phi-3.5-mini-instruct using rsLoRA (rank-stabilized LoRA). More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge#fine-tuning).
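
As background on the rank-stabilized LoRA variant mentioned above, the difference from standard LoRA lies in how the low-rank update is scaled. The symbols below ($W_0$ for the frozen base weight, $B$ and $A$ for the low-rank factors, $\alpha$ for the scaling hyperparameter, $r$ for the rank) are the conventional LoRA notation rather than anything taken from this model card, and the rank and $\alpha$ actually used for Flow-Judge are not stated here.

$$
W_{\text{LoRA}} = W_0 + \frac{\alpha}{r}\, B A
\qquad\text{versus}\qquad
W_{\text{rsLoRA}} = W_0 + \frac{\alpha}{\sqrt{r}}\, B A
$$

Dividing by $\sqrt{r}$ instead of $r$ keeps the magnitude of the update from shrinking as the rank grows, which is the motivation usually given for the rank-stabilized variant.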
 ## Usage
   </tbody>
 </table>
+\* _reported in model paper_
 ### RAGTruth
   </tr>
 </table>
+\* _reported in model paper_
 ### HaluEval, Covid-QA, PubMedQA
   </tbody>
 </table>
+\* _reported in model paper_
 ### Feedback Bench
 <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
   </tr>
 </table>
+\* _reported in model paper using reference answers_