---
license: llama3
---
This is a Llama 3 model fine-tuned on the training set of PromptEvals (https://huggingface.co/datasets/user104/PromptEvals) to generate high-quality assertion criteria for prompt templates.

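As a usage illustration, the sketch below loads the model with the Hugging Face `transformers` library and asks it for assertion criteria. The repository id, chat formatting, and prompt wording are illustrative assumptions rather than details taken from this card; adjust them to wherever the fine-tuned weights are hosted and to the instruction format used during fine-tuning.

```python
# Minimal usage sketch; the repo id and prompt wording below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama3-8b-prompteval-criteria"  # hypothetical repo id for this fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = "Summarize the following support ticket in three bullet points: {ticket}"
messages = [{
    "role": "user",
    "content": f"Generate assertion criteria for outputs of this prompt template:\n{prompt_template}",
}]

# Format the conversation with the model's chat template and generate the criteria.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
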
**Model Card**

**Model Details**

- Person or organization developing model: Meta (base model); fine-tuned by [Redacted for submission]
- Model date: the base model was released on April 18, 2024; fine-tuning was performed in July 2024
- Model version: 3.1
- Model type: decoder-only Transformer
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 8 billion parameters, fine-tuned by us using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)
- Paper or other resource for more information: https://arxiv.org/abs/2310.06825
- Citation details: Redacted for submission
- License: Meta Llama 3 Community License
- Where to send questions or comments about the model: [Redacted for submission]

**Intended Use.** Use cases that were envisioned during development (primary intended uses, primary intended users, out-of-scope use cases).

This model is intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate such criteria.

**Factors.** Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3 of the model card framework.

We do not collect demographic or phenotypic data, or data on the other factors listed in Section 4.3, in our dataset.

**Metrics.** Metrics should be chosen to reflect potential real-world impacts of the model (model performance measures, decision thresholds, variation approaches).

| **Statistic** | **Base Mistral** | **Mistral (FT)** | **Base Llama** | **Llama (FT)** | **GPT-4o** |
|---------------|------------------|------------------|----------------|----------------|------------|
| **p25**  | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| **p50**  | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| **Mean** | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| **p75**  | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |

*Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.*
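
This card does not spell out how semantic F1 is computed, so the snippet below is only an illustrative sketch: it embeds generated and ground-truth criteria with `sentence-transformers` (an assumed choice of embedding model), treats a criterion as matched when its best cosine similarity exceeds a threshold, and combines the resulting precision and recall. The threshold and matching rule are assumptions, not the exact evaluation procedure.

```python
# Illustrative semantic-F1 sketch; the embedding model, threshold, and matching rule are assumptions.
from sentence_transformers import SentenceTransformer, util

def semantic_f1(generated: list[str], reference: list[str], threshold: float = 0.7) -> float:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    sim = util.cos_sim(encoder.encode(generated), encoder.encode(reference))  # generated x reference

    # Precision: fraction of generated criteria close to some reference criterion.
    # Recall: fraction of reference criteria covered by some generated criterion.
    precision = sum(float(sim[i].max()) >= threshold for i in range(len(generated))) / len(generated)
    recall = sum(float(sim[:, j].max()) >= threshold for j in range(len(reference))) / len(reference)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(semantic_f1(["Output must be valid JSON"], ["The response should be formatted as JSON"]))
```
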
| **Statistic** | **Mistral (FT)** | **Llama (FT)** | **GPT-4o** |
|---------------|------------------|----------------|------------|
| **p25**  | **1.8717** | 2.3962 | 6.5596 |
| **p50**  | **2.3106** | 3.0748 | 8.2542 |
| **Mean** | **2.5915** | 3.6057 | 8.7041 |
| **p75**  | **2.9839** | 4.2716 | 10.1905 |

*Latency for criteria generation, in seconds. We compared the runtimes of all three models and report the 25th, 50th, and 75th percentiles along with the mean; our fine-tuned Mistral model had the lowest runtime on every metric.*
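
As a rough sketch of how latency numbers like these can be collected, the snippet below times repeated generation calls and reports the same statistics. `generate_criteria` is a placeholder for whichever model call is being benchmarked; it is not a function shipped with this model.

```python
# Timing sketch; generate_criteria is a placeholder for the model call being benchmarked.
import time
import numpy as np

def benchmark_latency(generate_criteria, prompts):
    runtimes = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_criteria(prompt)  # one end-to-end criteria-generation call
        runtimes.append(time.perf_counter() - start)
    p25, p50, p75 = np.percentile(runtimes, [25, 50, 75])
    return {"p25": p25, "p50": p50, "mean": float(np.mean(runtimes)), "p75": p75}
```
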
| **Model** | **Average** | **Median** | **75th percentile** | **90th percentile** |
|--------------------|--------------|------------|---------------------|---------------------|
| **Base Mistral**   | 14.5012 | 14 | 18.5 | 23 |
| **Mistral (FT)**   | **6.28640** | **5** | **8** | **10** |
| **Base Llama**     | 28.2458 | 26 | 33.5 | 46 |
| **Llama (FT)**     | 5.47255 | **5** | **6** | 9 |
| **GPT-4o**         | 7.59189 | 6 | 10 | 14.2 |
| *Ground Truth*     | *5.98568* | *5* | *7* | *10* |

*Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.*
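
Criteria counts like those above depend on how a raw model response is split into individual criteria. A simple way to do this, assuming the model emits one criterion per bulleted or numbered line (an assumption about the output format), is sketched below.

```python
# Count criteria in a response, assuming one criterion per bulleted or numbered line.
import re

def count_criteria(response: str) -> int:
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    criteria = [re.sub(r"^(\d+[.)]|[-*•])\s*", "", line) for line in lines]
    return len(criteria)

print(count_criteria("1. Output must be valid JSON\n2. Tone must remain professional"))  # 2
```
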
**Evaluation Data:** evaluated on the PromptEvals test set.

**Training Data:** fine-tuned on the PromptEvals train set.

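Both splits are available on the Hugging Face Hub and can be loaded with the `datasets` library; the split names below are an assumption and should be checked against the dataset card.

```python
# Load PromptEvals from the Hub; split and column names should be checked against the dataset card.
from datasets import load_dataset

dataset = load_dataset("user104/PromptEvals")
print(dataset)  # shows the available splits and columns

# "train" is assumed here; fall back to the first split if it is named differently.
train_split = dataset["train"] if "train" in dataset else next(iter(dataset.values()))
```
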
**Quantitative Analyses** (unitary results, intersectional results):

| **Domain** | **Similarity** | **Precision** | **Recall** |
|------------------------------|----------------|---------------|------------|
| General-Purpose Chatbots | 0.8140 | 0.8070 | 0.8221 |
| Question-Answering | 0.8104 | 0.8018 | 0.8199 |
| Text Summarization | 0.8601 | 0.8733 | 0.8479 |
| Database Querying | 0.8362 | 0.8509 | 0.8228 |
| Education | 0.8388 | 0.8498 | 0.8282 |
| Content Creation | 0.8417 | 0.8480 | 0.8358 |
| Workflow Automation | 0.8389 | 0.8477 | 0.8304 |
| Horse Racing Analytics | 0.8249 | 0.8259 | 0.8245 |
| Data Analysis | 0.7881 | 0.7940 | 0.7851 |
| Prompt Engineering | 0.8441 | 0.8387 | 0.8496 |

*Fine-tuned Llama score averages per domain (for the 10 most represented domains in our test set).*
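
Per-domain averages like those above can be reproduced from per-example scores with a simple group-by. The DataFrame layout below (`domain`, `similarity`, `precision`, and `recall` columns with made-up rows) is a hypothetical illustration, not a file shipped with this model.

```python
# Per-domain score averages from per-example results; the column layout and rows are hypothetical.
import pandas as pd

results = pd.DataFrame([
    {"domain": "Text Summarization", "similarity": 0.86, "precision": 0.87, "recall": 0.85},
    {"domain": "Data Analysis", "similarity": 0.79, "precision": 0.79, "recall": 0.78},
])
per_domain = results.groupby("domain")[["similarity", "precision", "recall"]].mean()
print(per_domain.round(4))
```
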
**Ethical Considerations:**

PromptEvals is open source and is intended to be used as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. However, because it is open source, it may be included in model pre-training data, which can reduce the benchmark's effectiveness.
Additionally, PromptEvals contains prompts contributed by a variety of users, and these prompts may not represent all domains equally.
Despite this, we believe our benchmark still provides value and can be useful for evaluating models on generating assertion criteria.

**Caveats and Recommendations:** None.