Flow Judge | Technical report | flow-judge
## Model Summary

Flow-Judge-v0.1 is a compact yet powerful 3.8B-parameter model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models on both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a comparatively small synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective, rapid evaluations with custom rubrics.

__Quantized weights__
- [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
- [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)

__Quickstart__
- [Quickstart](https://github.com/flowaicom/flow-judge/examples/1_quickstart.ipynb)

## Intended Use Case

Flow Judge is intended to be used on custom LLM system evaluation tasks.

- Customizable evaluations: Users can define their own evaluation criteria and rubrics, tailoring Flow Judge to their specific needs and requirements. This flexibility allows for the creation of highly targeted assessments that accurately measure the performance of their LLM system.
- Flow Judge supports three different scoring scales:
  - Pass/fail: Suitable for binary assessments, such as determining whether a piece of text meets a specific standard or contains errors.
  - 3-Likert: Allows for more granular evaluations, with scores ranging from negative to neutral to positive. Useful for assessing the overall quality or sentiment of a piece of text.
  - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
- Easy-to-interpret results: Flow Judge produces structured evaluations with qualitative feedback.
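The custom-rubric setup described above can be sketched as follows. Note that the prompt layout, rubric text, and helper names here are illustrative assumptions for this card, not the model's official prompt template; the actual format is shown in the Quickstart notebook linked above.

```python
# Illustrative sketch of assembling a rubric-style evaluation prompt.
# NOTE: this template is an assumption for demonstration purposes; see the
# flow-judge Quickstart notebook for the model's actual prompt format.

# A hypothetical pass/fail rubric (one of the three supported scale types).
PASS_FAIL_RUBRIC = {
    0: "Fail: the response does not meet the evaluation criteria.",
    1: "Pass: the response meets the evaluation criteria.",
}

def build_eval_prompt(criteria: str, rubric: dict, query: str, response: str) -> str:
    """Combine custom criteria, a scoring rubric, and the inputs to evaluate."""
    rubric_text = "\n".join(f"{score}: {desc}" for score, desc in sorted(rubric.items()))
    return (
        f"# Evaluation criteria\n{criteria}\n\n"
        f"# Scoring rubric\n{rubric_text}\n\n"
        f"# Query\n{query}\n\n"
        f"# Response to evaluate\n{response}\n"
    )

prompt = build_eval_prompt(
    criteria="Is the response faithful to the provided context?",
    rubric=PASS_FAIL_RUBRIC,
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(prompt)
```

The same helper would accept a 3-Likert or 5-Likert rubric dictionary unchanged, which is what makes the scale choice a pure configuration decision.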
__Pass / Fail Held-out Test set__

Evaluator | Precision | Recall | F1 |
---|---|---|---|
microsoft/Phi-3.5-mini-instruct | 0.685 | 1.000 | 0.813 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 0.870 | 0.982 | 0.923 |
mistralai/Mistral-Nemo-Instruct-2407 | 0.709 | 0.994 | 0.827 |
gpt-4o-mini | 0.834 | 1.000 | 0.910 |
flowaicom/Flow-Judge-v0.1 | 0.940 | 0.972 | 0.955 |
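The F1 column above is the harmonic mean of precision and recall. As a quick sanity check, Phi-3.5-mini-instruct's 0.685 precision and 1.000 recall reproduce its reported 0.813 F1 (other rows may differ in the last digit because the published precision and recall are themselves rounded):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# microsoft/Phi-3.5-mini-instruct on the Pass/Fail held-out test set
print(round(f1(0.685, 1.000), 3))  # -> 0.813
```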
__3-Likert and 5-Likert Held-out Test sets__

Evaluator | 3-Likert pearsonr | 3-Likert spearmanr | 3-Likert kendall-tau | 5-Likert pearsonr | 5-Likert spearmanr | 5-Likert kendall-tau |
---|---|---|---|---|---|---|
microsoft/Phi-3.5-mini-instruct | 0.756 | 0.749 | 0.695 | 0.808 | 0.819 | 0.739 |
prometheus-eval/prometheus-7b-v2.0* | - | - | - | 0.910 | 0.908 | 0.838 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 0.836 | 0.833 | 0.789 | 0.854 | 0.868 | 0.791 |
mistralai/Mistral-Nemo-Instruct-2407 | 0.813 | 0.807 | 0.758 | 0.870 | 0.867 | 0.789 |
gpt-4o-mini | 0.890 | 0.888 | 0.851 | 0.923 | 0.923 | 0.864 |
flowaicom/Flow-Judge-v0.1 | 0.888 | 0.888 | 0.852 | 0.919 | 0.919 | 0.856 |
__RAGTruth benchmarks__

Evaluator | QA Precision | QA Recall | QA F1 | Data-to-Text Precision | Data-to-Text Recall | Data-to-Text F1 | Summarization Precision | Summarization Recall | Summarization F1 |
---|---|---|---|---|---|---|---|---|---|
microsoft/Phi-3.5-mini-instruct | 0.817 | 0.963 | 0.884 | 0.356 | 1.000 | 0.525 | 0.776 | 1.000 | 0.874 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 0.844 | 0.986 | 0.910 | 0.382 | 0.537 | 0.447 | 0.797 | 0.940 | 0.863 |
mistralai/Mistral-Nemo-Instruct-2407 | 0.821 | 0.995 | 0.900 | 0.357 | 1.000 | 0.526 | 0.775 | 1.000 | 0.873 |
gpt-4o-mini | 0.830 | 0.966 | 0.893 | 0.398 | 0.994 | 0.569 | 0.786 | 0.997 | 0.879 |
Luna* | 0.378 | 0.800 | 0.513 | 0.649 | 0.912 | 0.759 | 0.400 | 0.765 | 0.525 |
RAGAS Faithfulness* | 0.312 | 0.419 | 0.357 | 0.792 | 0.508 | 0.619 | 0.642 | 0.299 | 0.408 |
Trulens Groundedness* | 0.228 | 0.925 | 0.366 | 0.669 | 0.965 | 0.790 | 0.402 | 0.500 | 0.445 |
flowaicom/Flow-Judge-v0.1 | 0.835 | 0.961 | 0.894 | 0.541 | 0.249 | 0.341 | 0.834 | 0.836 | 0.835 |
__HaluEval, Covid-QA, and PubMedQA benchmarks__

Evaluator | HaluEval Precision | HaluEval Recall | HaluEval F1 | HaluEval Accuracy | Covid-QA Precision | Covid-QA Recall | Covid-QA F1 | Covid-QA Accuracy | PubMedQA Precision | PubMedQA Recall | PubMedQA F1 | PubMedQA Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|---|
microsoft/Phi-3.5-mini-instruct | 0.730 | 0.914 | 0.812 | 0.788 | 0.617 | 0.964 | 0.752 | 0.681 | 0.623 | 0.986 | 0.764 | 0.696 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 0.864 | 0.891 | 0.878 | 0.874 | 0.663 | 0.976 | 0.790 | 0.734 | 0.681 | 0.962 | 0.797 | 0.750 |
mistralai/Mistral-Nemo-Instruct-2407 | 0.655 | 0.993 | 0.789 | 0.735 | 0.651 | 0.982 | 0.783 | 0.728 | 0.602 | 0.994 | 0.750 | 0.669 |
gpt-4o-mini | 0.846 | 0.940 | 0.891 | 0.885 | 0.795 | 0.964 | 0.872 | 0.858 | 0.791 | 0.904 | 0.843 | 0.832 |
flowaicom/Flow-Judge-v0.1 | 0.826 | 0.895 | 0.859 | 0.854 | 0.767 | 0.877 | 0.818 | 0.807 | 0.874 | 0.624 | 0.728 | 0.767 |
gpt-4o* | - | - | - | 0.879 | - | - | - | 0.821 | - | - | - | 0.821 |
Claude 3 Sonnet* | - | - | - | 0.845 | - | - | - | 0.829 | - | - | - | 0.829 |
RAGAS Faithfulness* | - | - | - | 0.706 | - | - | - | 0.750 | - | - | - | 0.669 |
Lynx 8B* | - | - | - | 0.857 | - | - | - | 0.963 | - | - | - | 0.852 |
Lynx 70B* | - | - | - | 0.884 | - | - | - | 0.975 | - | - | - | 0.904 |
__Feedback Bench__

Evaluator | pearsonr | spearmanr | kendall-tau |
---|---|---|---|
microsoft/Phi-3.5-mini-instruct | 0.710 | 0.721 | 0.622 |
prometheus-eval/prometheus-7b-v2.0* | 0.878 | 0.909 | 0.773 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 0.742 | 0.749 | 0.654 |
mistralai/Mistral-Nemo-Instruct-2407 | 0.720 | 0.724 | 0.632 |
gpt-4o-mini | 0.797 | 0.795 | 0.701 |
flowaicom/Flow-Judge-v0.1 | 0.787 | 0.789 | 0.688 |
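The pearsonr, spearmanr, and kendall-tau columns measure how closely a judge's scores track the reference scores. A minimal illustration on toy data, using scipy's standard implementations rather than the actual benchmark harness (the score lists here are invented for demonstration):

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical judge scores vs. reference scores on a 5-Likert scale
reference = [1, 2, 3, 4, 5, 3, 2]
judge = [1, 2, 4, 4, 5, 3, 1]

r, _ = pearsonr(reference, judge)      # linear correlation
rho, _ = spearmanr(reference, judge)   # rank correlation
tau, _ = kendalltau(reference, judge)  # pairwise ordering agreement (tau-b)
print(f"pearsonr={r:.3f} spearmanr={rho:.3f} kendall-tau={tau:.3f}")
```

Kendall's tau is typically the most conservative of the three, which matches the pattern in the tables above, where the kendall-tau column is consistently lower than pearsonr and spearmanr.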