Commit 222d23e by bergr7f (verified) · 1 Parent(s): 640c93b

Update README.md

Files changed (1)
  1. README.md +11 -23
README.md CHANGED
@@ -25,27 +25,15 @@ base_model:
25
  - microsoft/Phi-3.5-mini-instruct
26
  ---
27
 
28
- # Flow-Judge-v0.1
29
-
30
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/NgFJqVmUgrhOnphd47VEm.png)
31
-
32
- <div class="center-content">
33
- <div class="links">
34
- <a href="https://github.com/flowaicom/flow-judge">flow-judge library</a>
35
- |
36
- <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a>
37
- </div>
38
- </div>
39
 
40
  ## Model Summary
41
 
42
  Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
43
 
44
- __More information__
45
- - [Flow Judge website](https://www.flow-ai.com/judge)
46
- - [Technical report](https://www.flow-ai.com/blog/flow-judge)
47
- - [Github repo](https://github.com/flowaicom/flow-judge)
48
-
49
  __Quantized weights__
50
  - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
51
  - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
@@ -64,7 +52,7 @@ Flow Judge is intended to be used on custom LLM system evaluation tasks.
64
  - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
65
 
66
  - Easy to interpret results:
67
- - Flow Judge produces structured evaluations with <feedback> and <score> tags.
68
  - Qualitative feedback: Flow Judge detects errors and grades outputs and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
69
  - Score: Based on a grading rubric Flow Judge will return a numerical score on binary, likert-3 or likert-5 scale.
70
 
@@ -86,12 +74,12 @@ Flow-Judge-v0.1 has been trained on synthetically generated datasets. The constr
86
 
87
  This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
88
 
89
- Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge)
90
 
91
 
92
  ### Fine-tuning
93
 
94
- For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge).
95
 
96
  ## Usage
97
 
@@ -376,7 +364,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
376
  </tbody>
377
  </table>
378
 
379
- \* _not suitable for 3 likert_
380
 
381
 
382
  ### RAGTruth
@@ -496,7 +484,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
496
  </tr>
497
  </table>
498
 
499
- \* _reported in Galileo luna paper_
500
 
501
 
502
  ### HaluEval, Covid-QA, PubMedQA
@@ -677,7 +665,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
677
  </tbody>
678
  </table>
679
 
680
- \* _reported in lynx paper_
681
  ### Feedback Bench
682
 
683
  <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
@@ -728,4 +716,4 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
728
  </tr>
729
  </table>
730
 
731
- \* _reported in prometheus paper using reference answer. Note the rest of the models have been evaluated without reference answer_
 
25
  - microsoft/Phi-3.5-mini-instruct
26
  ---
27
 
28
+ <p align="center">
29
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png" alt="Centered image">
30
+ </p>
31
+ <p align="center">🚀 <a href="https://www.flow-ai.com/judge">Flow Judge</a> | 📄 <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a> | 💻 <a href="https://github.com/flowaicom/flow-judge">flow-judge</a></p>
 
 
 
 
 
 
 
32
 
33
  ## Model Summary
34
 
35
  Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
36
 
 
 
 
 
 
37
  __Quantized weights__
38
  - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
39
  - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
 
52
  - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
53
 
54
  - Easy to interpret results:
55
+ - Flow Judge produces structured evaluations with `<feedback>` and `<score>` tags.
56
  - Qualitative feedback: Flow Judge detects errors and grades outputs and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
57
  - Score: Based on a grading rubric Flow Judge will return a numerical score on binary, likert-3 or likert-5 scale.
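
Because the evaluation is wrapped in `<feedback>` and `<score>` tags, it can be split into its two parts with a small parser. The following is a minimal sketch, not part of the flow-judge library: the `JudgeResult` class, the `parse_judge_output` helper, and the sample string are hypothetical, and the exact whitespace the model emits around the tags is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    """Hypothetical container for one parsed evaluation."""
    feedback: str
    score: Optional[int]


# Non-greedy match so only the text between the first tag pair is captured.
_FEEDBACK_RE = re.compile(r"<feedback>(.*?)</feedback>", re.DOTALL)
# Scores on the binary, 3-Likert and 5-Likert scales are small integers.
_SCORE_RE = re.compile(r"<score>\s*(-?\d+)\s*</score>")


def parse_judge_output(text: str) -> JudgeResult:
    """Extract feedback and score; missing tags give '' and None."""
    feedback_match = _FEEDBACK_RE.search(text)
    score_match = _SCORE_RE.search(text)
    return JudgeResult(
        feedback=feedback_match.group(1).strip() if feedback_match else "",
        score=int(score_match.group(1)) if score_match else None,
    )


# Assumed output layout, for illustration only.
sample = (
    "<feedback>\nThe answer omits the refund window.\n</feedback>\n"
    "<score>\n2\n</score>"
)
result = parse_judge_output(sample)
print(result.feedback)
print(result.score)
```

Returning `None` for a missing or malformed score, rather than raising, lets a caller count unparseable generations separately from low scores.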
58
 
 
74
 
75
  This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
76
 
77
+ Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge#dataset-construction)
78
 
79
 
80
  ### Fine-tuning
81
 
82
+ For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge#fine-tuning).
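
For readers unfamiliar with rank-stabilized LoRA, it differs from standard LoRA only in how the low-rank update is scaled. The sketch below uses the usual LoRA notation, which does not appear in this README: $W_0$ is the frozen base weight, $B$ and $A$ are the trained low-rank factors, $r$ is the adapter rank, and $\alpha$ is the scaling hyperparameter. The rank and $\alpha$ used for Flow-Judge are not stated here.

```latex
W = W_0 + \gamma_r \, B A,
\qquad
\gamma_r^{\text{LoRA}} = \frac{\alpha}{r},
\qquad
\gamma_r^{\text{rsLoRA}} = \frac{\alpha}{\sqrt{r}}
```

Dividing by $\sqrt{r}$ rather than $r$ keeps the magnitude of the update from shrinking as the rank grows.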
83
 
84
  ## Usage
85
 
 
364
  </tbody>
365
  </table>
366
 
367
+ \* _Reported in model paper_
368
 
369
 
370
  ### RAGTruth
 
484
  </tr>
485
  </table>
486
 
487
+ \* _reported in model paper_
488
 
489
 
490
  ### HaluEval, Covid-QA, PubMedQA
 
665
  </tbody>
666
  </table>
667
 
668
+ \* _reported in model paper_
669
  ### Feedback Bench
670
 
671
  <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
 
716
  </tr>
717
  </table>
718
 
719
+ \* _reported in model paper using reference answers_