reyavir committed on
Commit 66b1a42 · verified · 1 Parent(s): 7e83df9

Update README.md

Files changed (1): README.md (+75 -3)
README.md CHANGED
---
license: llama3
---
This model is a fine-tuned Llama 3 model, trained on the training set of PromptEvals (https://huggingface.co/datasets/user104/PromptEvals). It is fine-tuned to generate high-quality assertion criteria for prompt templates.

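Below is a minimal usage sketch. The repo id, the instruction wording, and the use of the standard Llama 3 chat template are assumptions for illustration; the exact prompt format used during fine-tuning is not documented here.

```python
# Hedged sketch: "your-org/llama3-promptevals-ft" is a hypothetical placeholder repo id,
# and the instruction wording is illustrative, not the exact fine-tuning format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama3-promptevals-ft"  # replace with this model's actual Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A prompt template we want assertion criteria for.
prompt_template = (
    "You are a support assistant. Answer the user's question using only the "
    "provided documentation.\nDocs: {docs}\nQuestion: {question}"
)
messages = [{
    "role": "user",
    "content": "Generate assertion criteria for the following prompt template:\n"
               + prompt_template,
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```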
Model Card:

Model Details
– Person or organization developing model: Meta, and fine-tuned by [Redacted for submission]
– Model date: Base model was released on April 18, 2024, and fine-tuned in July 2024
– Model version: 3.1
– Model type: decoder-only Transformer
– Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 8 billion parameters, fine-tuned by us using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)
– Paper or other resource for more information: https://arxiv.org/abs/2310.06825
– Citation details: Redacted for submission
– License: Meta Llama 3 Community License
– Where to send questions or comments about the model: [Redacted for submission]

Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases)
Intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate these assertion criteria.

Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3.
We do not collect demographic, phenotypic, or any of the other data listed in Section 4.3 in our dataset.

Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)

| | **Base Mistral** | **Mistral (FT)** | **Base Llama** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|------------------|----------------|----------------|------------|
| **p25** | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| **p50** | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| **Mean** | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| **p75** | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |

*Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.*

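For reference, below is a sketch of one way such a semantic F1 between generated and ground-truth criteria could be computed, via a greedy best match over sentence embeddings; the exact metric definition behind the numbers above may differ, and the embedding model named here is an arbitrary choice.

```python
# Hedged sketch (not the exact evaluation code): score generated criteria against
# ground-truth criteria with a greedy embedding match.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence embedder

def semantic_f1(generated: list[str], reference: list[str]) -> float:
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    ref_emb = embedder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)              # shape: (n_generated, n_reference)
    precision = sims.max(dim=1).values.mean().item()   # each generated criterion vs. best reference match
    recall = sims.max(dim=0).values.mean().item()      # each reference criterion vs. best generated match
    return 2 * precision * recall / (precision + recall)
```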

| | **Mistral (FT)** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|----------------|-------------|
| **p25** | **1.8717** | 2.3962 | 6.5596 |
| **p50** | **2.3106** | 3.0748 | 8.2542 |
| **Mean** | **2.5915** | 3.6057 | 8.7041 |
| **p75** | **2.9839** | 4.2716 | 10.1905 |

*Latency for criteria generation. We compared the runtimes (in seconds) for all three models and report the 25th, 50th, and 75th percentiles along with the mean. We found that our fine-tuned Mistral model had the lowest runtime on all metrics.*

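Latency percentiles like those above could be gathered with a simple timing loop, as in the sketch below; `generate_criteria` is a hypothetical wrapper around whichever model is being measured.

```python
# Hedged sketch: time each criteria-generation call and summarize the latencies.
import time
import numpy as np

def latency_summary(prompts, generate_criteria):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_criteria(prompt)          # one end-to-end criteria-generation call
        latencies.append(time.perf_counter() - start)
    return {
        "p25": float(np.percentile(latencies, 25)),
        "p50": float(np.percentile(latencies, 50)),
        "mean": float(np.mean(latencies)),
        "p75": float(np.percentile(latencies, 75)),
    }
```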

| | **Average** | **Median** | **75th percentile** | **90th percentile** |
|--------------------|--------------|------------|---------------------|---------------------|
| **Base Mistral** | 14.5012 | 14 | 18.5 | 23 |
| **Mistral (FT)** | **6.28640** | **5** | **8** | **10** |
| **Base Llama** | 28.2458 | 26 | 33.5 | 46 |
| **Llama (FT)** | 5.47255 | **5** | **6** | 9 |
| **GPT-4o** | 7.59189 | 6 | 10 | 14.2 |
| *Ground Truth* | *5.98568* | *5* | *7* | *10* |

*Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.*

Evaluation Data: Evaluated on the PromptEvals test set
Training Data: Fine-tuned on the PromptEvals train set

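The splits can be pulled directly from the Hub; the snippet below is a sketch that assumes the dataset exposes standard "train" and "test" splits (the actual split and column names may differ).

```python
# Hedged sketch: load the PromptEvals splits used for fine-tuning and evaluation.
from datasets import load_dataset

promptevals = load_dataset("user104/PromptEvals")
print(promptevals)                   # inspect the available splits and columns
train_split = promptevals["train"]   # assumption: split name used for fine-tuning
test_split = promptevals["test"]     # assumption: split name used for evaluation
```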
Quantitative Analyses (Unitary results, Intersectional results):

| **Domain** | **Similarity** | **Precision** | **Recall** |
|------------------------------|----------------|---------------|------------|
| General-Purpose Chatbots | 0.8140 | 0.8070 | 0.8221 |
| Question-Answering | 0.8104 | 0.8018 | 0.8199 |
| Text Summarization | 0.8601 | 0.8733 | 0.8479 |
| Database Querying | 0.8362 | 0.8509 | 0.8228 |
| Education | 0.8388 | 0.8498 | 0.8282 |
| Content Creation | 0.8417 | 0.8480 | 0.8358 |
| Workflow Automation | 0.8389 | 0.8477 | 0.8304 |
| Horse Racing Analytics | 0.8249 | 0.8259 | 0.8245 |
| Data Analysis | 0.7881 | 0.7940 | 0.7851 |
| Prompt Engineering | 0.8441 | 0.8387 | 0.8496 |

*Fine-Tuned Llama Score Averages per Domain (for the 10 most represented domains in our test set).*

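Such per-domain averages could be tabulated from per-example scores with a simple group-by, as in the sketch below; the `results` columns are hypothetical stand-ins for whatever the evaluation script records.

```python
# Hedged sketch: average per-example scores for the 10 most represented domains.
import pandas as pd

results = pd.DataFrame([
    {"domain": "Question-Answering", "similarity": 0.81, "precision": 0.80, "recall": 0.82},
    {"domain": "Text Summarization", "similarity": 0.86, "precision": 0.87, "recall": 0.85},
    # ... one row per test-set example
])

top_domains = results["domain"].value_counts().head(10).index
per_domain = (
    results[results["domain"].isin(top_domains)]
    .groupby("domain")[["similarity", "precision", "recall"]]
    .mean()
)
print(per_domain.round(4))
```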
Ethical Considerations:
PromptEvals is open-source and is intended to be used as a benchmark to evaluate models' ability to identify and generate assertion criteria for prompts. However, because it is open-source, it may be used in pre-training models, which can impact the effectiveness of the benchmark.
Additionally, PromptEvals uses prompts contributed by a variety of users, and the prompts may not represent all domains equally.
Despite this, we believe our benchmark still provides value and can be useful for evaluating models on generating assertion criteria.

Caveats and Recommendations: None