Ryan committed on
Commit adf119a · 1 Parent(s): 39135b5
.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
README.md CHANGED
@@ -12,3 +12,32 @@ short_description: LLM Response Comparator
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# Gradio App User Guide
+
+This is the User Guide for my Gradio App homework assignment.
+
+Here is also a link to the video demo:
+
+# Introduction
+
+This Gradio app lets you compare the responses of two different LLMs (Large Language Models) to the same input prompt. It provides a simple interface: enter a prompt, select two LLMs from a dropdown menu, and submit; the responses from both models are then displayed side by side for easy comparison.
+
+The app is built with the Gradio library, which provides a user-friendly way to create web applications in Python. It uses the `gr.Interface` class to create the interface, with `gr.inputs.Textbox` and `gr.outputs.Textbox` defining the input and output components. A `gr.Button` component submits the prompt, and the `gr.update` method fills the output components with the responses from the selected LLMs.
+
+The app is designed to be easy to use and offers a simple way to compare how different LLMs respond to the same prompt, which can be useful for researchers and developers evaluating model behavior on a shared task.
+
+# Usage
+
+# Documentation
+
+# Contributions
+
+# Limitations
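The Introduction above describes a prompt-in, two-responses-out comparator. A minimal sketch of that pattern follows; this is not the actual app code, and the model names and the stubbed `query_model` helper are placeholders. Note that current Gradio releases use `gr.Textbox`/`gr.Dropdown` directly rather than the older `gr.inputs`/`gr.outputs` namespaces the README mentions:

```python
# Sketch of the comparator described in the README (stubs, not the real app).
def query_model(model_name: str, prompt: str) -> str:
    # In the real app this would call out to the selected LLM; stubbed here.
    return f"[{model_name}] response to: {prompt}"

def compare(prompt: str, model_a: str, model_b: str) -> tuple[str, str]:
    # Returns the two responses that the app shows side by side.
    return query_model(model_a, prompt), query_model(model_b, prompt)

# The Gradio wiring (requires `pip install gradio`) might look like:
#
# import gradio as gr
# demo = gr.Interface(
#     fn=compare,
#     inputs=[gr.Textbox(label="Prompt"),
#             gr.Dropdown(["ExaOne3.5", "Granite3.2"], label="Model A"),
#             gr.Dropdown(["ExaOne3.5", "Granite3.2"], label="Model B")],
#     outputs=[gr.Textbox(label="Model A response"),
#              gr.Textbox(label="Model B response")],
# )
# demo.launch()
```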
dataset/.DS_Store ADDED
Binary file (6.15 kB)
 
dataset/summary-econ.txt DELETED
@@ -1 +0,0 @@
-Test 1 2 3

dataset/summary-fp.txt DELETED
@@ -1 +0,0 @@
-Test 1 2 3
 
dataset/summary-harris.txt CHANGED
@@ -1 +1,120 @@
-Test 1 2 3
+# Analysis of LLM Responses Comparing ExaOne3.5 and Granite3.2
+
+Thanks for sharing these comparison results analyzing how two LLMs (ExaOne3.5 by LG and Granite3.2 by IBM) responded to the prompt about Kamala Harris's political views. Let me interpret the key differences for you.
+
+## Content and Focus Differences
+
+Looking at the top words and 2-grams:
+- **ExaOne3.5** emphasized Harris's legal background and policy implementation, using terms like "attorney general," "social justice," and "centrist approach" more frequently.
+- **Granite3.2** focused more on her political positions and party affiliation, using terms like "support," "political views," "vice president," and "democratic party."
+
+ExaOne appears to have framed Harris more through her professional background and specific policy areas, while Granite focused more directly on her political identity and positions.
+
+## Similarity Metrics
+
+The models show moderate similarity:
+- **Cosine similarity (0.67)**: Their word frequency patterns overlap somewhat but aren't identical.
+- **Jaccard similarity (0.22)**: Only about a fifth of the unique words appeared in both responses.
+- **Semantic similarity (0.53)**: The overall meaning was moderately similar.
+
+This suggests the models presented somewhat different portraits despite covering the same person.
+
+## Political Framing and Bias Analysis
+
+Both models show a liberal-leaning framing:
+- **ExaOne3.5** used more liberal-associated terms (11 liberal vs. 2 conservative terms).
+- **Granite3.2** used exclusively liberal-associated terms (7 liberal, 0 conservative).
+
+However, the overall bias difference was minor (0.15/1.0), suggesting neither model took a dramatically different political stance than the other.
+
+## Stylistic Differences
+
+The models differed significantly in communication style:
+- **ExaOne3.5**: More informal and complex language.
+- **Granite3.2**: More neutral tone with average complexity.
+
+This could affect how authoritative or approachable the responses feel to readers.
+
+## Overall Interpretation
+
+These LLMs presented moderately different portraits of Harris's political views despite addressing the same prompt. ExaOne3.5 created a more detailed, nuanced picture with higher linguistic complexity and focused more on Harris's background and specific policy areas. Granite3.2 took a more straightforward, neutral approach that centered on her political identity and party positions.
+
+Neither model showed dramatic political bias relative to the other, though both framed Harris through terms more commonly associated with liberal perspectives.
+
+The differences highlight how LLMs can present varied portraits of the same political figure based on their training data, internal architecture, and potential alignment methods.
+
+ACTUAL ANALYSIS RESULTS
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Kamala Harris...."
+Comparing responses from ExaOne3.5 and Granite3.2
+Top Words Used by ExaOne3.5
+harris (8), policy (8), justice (5), attorney (4), issue (4), measure (4), political (4), aimed (3), approach (3), general (3)
+
+Top Words Used by Granite3.2
+harris (7), support (6), view (6), issue (5), right (5), policy (4), party (3), political (3), president (3), progressive (3)
+
+Similarity Metrics
+Cosine Similarity: 0.67 (higher means more similar word frequency patterns)
+Jaccard Similarity: 0.22 (higher means more word overlap)
+Semantic Similarity: 0.53 (higher means more similar meaning)
+Common Words: 71 words appear in both responses
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Kamala Harris...."
+2-grams Analysis: Comparing responses from ExaOne3.5 and Granite3.2
+Top 2-grams Used by ExaOne3.5
+attorney general (3), social justice (3), centrist approach (2), climate change (2), criminal justice (2), gun control (2), human rights (2), justice issues (2), measures like (2), middle class (2)
+
+Top 2-grams Used by Granite3.2
+political views (3), vice president (3), criminal justice (2), democratic party (2), foreign policy (2), harris advocated (2), lgbtq rights (2), president harris (2), social issues (2), 2019 proposed (1)
+
+Similarity Metrics
+Common 2-grams: 24 2-grams appear in both responses
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Kamala Harris...."
+Bias Analysis: Comparing responses from ExaOne3.5 and Granite3.2
+Bias Detection Summary
+Partisan Leaning: ExaOne3.5 appears liberal, while Granite3.2 appears liberal. (Minor difference)
+
+Overall Assessment: Analysis shows a 0.15/1.0 difference in bias patterns. (Minor overall bias difference)
+
+Partisan Term Analysis
+ExaOne3.5:
+
+Liberal terms: progressive, progressive, progressive, climate, climate, reform, justice, justice, justice, justice, justice
+Conservative terms: values, security
+Granite3.2:
+
+Liberal terms: progressive, progressive, progressive, climate, reform, justice, justice
+Conservative terms: None detected
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Kamala Harris...."
+Classifier Analysis for ExaOne3.5 and Granite3.2
+Classification Results
+ExaOne3.5:
+
+Formality: Informal
+Sentiment: Positive
+Complexity: Complex
+Granite3.2:
+
+Formality: Neutral
+Sentiment: Positive
+Complexity: Average
+Classification Comparison
+Formality: Model 1 is informal, while Model 2 is neutral
+Complexity: Model 1 uses complex language, while Model 2 uses average language
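The cosine and Jaccard figures reported in the summaries above follow the standard bag-of-words definitions. A minimal sketch (not the app's actual implementation, which may tokenize and preprocess differently) is:

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    # Cosine over word-frequency (bag-of-words) vectors.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(a: str, b: str) -> float:
    # Ratio of shared unique words to all unique words.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(jaccard_similarity("harris policy justice", "harris support view"))  # 0.2
```

A Jaccard score of 0.22, as above, therefore means roughly a fifth of all unique words appeared in both responses.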
dataset/summary-trump.txt CHANGED
@@ -1 +1,97 @@
-Test 1 2 3
+I'll analyze the differences between LG ExaOne and IBM Granite in their responses to the prompt about Donald Trump's political views.
+
+## Key Differences Between LG ExaOne and IBM Granite
+
+### Content Focus
+- **ExaOne** tends to emphasize policy-oriented aspects (more mentions of "policy," "trade," "agreement") and frequently uses qualifiers like "often".
+- **Granite** places more focus on Trump himself (more mentions of "Trump") and his administration.
+
+### Language Style
+- **ExaOne** uses more complex language, according to the classifier analysis.
+- **Granite** uses language of more average, accessible complexity.
+
+### Political Framing
+- **ExaOne** appears to have a slightly conservative-leaning framing (more conservative terms than liberal terms were detected).
+- **Granite** maintains a more balanced approach with fewer ideologically charged terms.
+
+### Topical Coverage
+- **ExaOne** emphasizes phrases like "tax cuts," "climate change," "executive orders," "free speech," and "mainstream media".
+- **Granite** focuses more on "administration," "foreign policy," "political stance," and "United States".
+
+### Similarity
+- The responses show moderate similarity (0.58 cosine similarity, 0.45 semantic similarity).
+- Only 16% word overlap (Jaccard similarity of 0.16).
+- They share 72 common words and 26 common two-word phrases.
+
+### Overall Assessment
+The results suggest that while both models provide factual information about Trump's political views with a positive sentiment and informal tone, ExaOne presents this information with more complex language and a slightly more conservative framing, while Granite offers a more balanced perspective in more accessible language. ExaOne focuses more on specific policy positions and ideological frameworks, while Granite presents a more administration-focused overview of Trump's political stances.
+
+ACTUAL ANALYSIS RESULTS
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Donald Trump...."
+Comparing responses from ExaOne3.5 and Granite3.2
+Top Words Used by ExaOne3.5
+policy (8), trade (8), often (7), agreement (6), like (5), free (4), immigration (4), issue (4), medium (4), order (4)
+
+Top Words Used by Granite3.2
+trump (7), administration (4), agreement (4), policy (4), political (4), trade (4), ban (3), certain (3), stance (3), view (3)
+
+Similarity Metrics
+Cosine Similarity: 0.58 (higher means more similar word frequency patterns)
+Jaccard Similarity: 0.16 (higher means more word overlap)
+Semantic Similarity: 0.45 (higher means more similar meaning)
+Common Words: 72 words appear in both responses
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Donald Trump...."
+2-grams Analysis: Comparing responses from ExaOne3.5 and Granite3.2
+Top 2-grams Used by ExaOne3.5
+tax cuts (3), climate change (2), executive orders (2), free speech (2), free trade (2), law order (2), legal immigration (2), mainstream media (2), political views (2), skepticism climate (2)
+
+Top 2-grams Used by Granite3.2
+administration took (2), foreign policy (2), political stance (2), political views (2), social issues (2), trump generally (2), united states (2), 2017 2021 (1), 2021 known (1), 45th president (1)
+
+Similarity Metrics
+Common 2-grams: 26 2-grams appear in both responses
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Donald Trump...."
+Bias Analysis: Comparing responses from ExaOne3.5 and Granite3.2
+Bias Detection Summary
+Partisan Leaning: ExaOne3.5 appears conservative, while Granite3.2 appears balanced. (Minor difference)
+
+Overall Assessment: Analysis shows a 0.20/1.0 difference in bias patterns. (Minor overall bias difference)
+
+Partisan Term Analysis
+ExaOne3.5:
+
+Liberal terms: climate, climate, justice
+Conservative terms: traditional, traditional, freedom, individual, deregulation, deregulation, security
+Granite3.2:
+
+Liberal terms: climate
+Conservative terms: deregulation
+
+Analysis Results
+Analysis of Prompt: "Tell me about the political views of Donald Trump...."
+Classifier Analysis for ExaOne3.5 and Granite3.2
+Classification Results
+ExaOne3.5:
+
+Formality: Informal
+Sentiment: Positive
+Complexity: Complex
+Granite3.2:
+
+Formality: Informal
+Sentiment: Positive
+Complexity: Average
+Classification Comparison
+Complexity: Model 1 uses complex language, while Model 2 uses average language
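The "2-grams" in these tables are adjacent word pairs. A minimal counting sketch (the app's own tokenization, lowercasing, and stop-word handling may differ) is:

```python
from collections import Counter

def top_bigrams(text: str, n: int = 10):
    # Count adjacent lowercase word pairs, most frequent first.
    words = [w.strip('.,"') for w in text.lower().split()]
    pairs = Counter(zip(words, words[1:]))
    return [(" ".join(pair), count) for pair, count in pairs.most_common(n)]

print(top_bigrams("free trade and free trade deals", 2))
```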
processors/topic_modeling.py CHANGED
@@ -289,11 +289,9 @@ def extract_topics(texts, n_topics=3, n_top_words=10, method="lda"):
     # Create document-term matrix
     if method == "nmf":
         # For NMF, use TF-IDF vectorization
-        # FIXED: Modified min_df and max_df for small document sets
         vectorizer = TfidfVectorizer(max_features=1000, min_df=1, max_df=1.0)
     else:
         # For LDA, use CountVectorizer
-        # FIXED: Modified min_df and max_df for small document sets
         vectorizer = CountVectorizer(max_features=1000, min_df=1, max_df=1.0)
 
     X = vectorizer.fit_transform(preprocessed_texts)
ui/main_screen.py CHANGED
@@ -23,19 +23,15 @@ def create_main_screen():
     This application allows you to compare how different Large Language Models (LLMs) respond
     to the same political prompts or questions. Using various NLP techniques, the tool analyzes:
 
-    - **Topic Modeling**: What key topics do different LLMs emphasize?
     - **N-gram Analysis**: What phrases and word patterns are characteristic of each LLM?
     - **Bias Detection**: Are there detectable biases in how LLMs approach political topics?
     - **Text Classification**: How do responses cluster or differentiate?
-    - **Key Differences**: What specific content varies between models?
 
     ### How to Use
 
     1. Navigate to the **Dataset Input** tab
     2. Enter prompts and corresponding LLM responses, or load an example dataset
-    3. Run various analyses to see how the responses compare
-    4. Explore visualizations of the differences
-    5. Generate a comprehensive report of findings
+    3. Run various analyses to see how the responses compare
 
     This tool is for educational and research purposes to better understand how LLMs handle
     politically sensitive topics.
visualization/bow_visualizer.py CHANGED
@@ -156,7 +156,7 @@ def process_and_visualize_analysis(analysis_results):
     print("Processing Bag of Words visualization")
     components.append(gr.Markdown("### Bag of Words Analysis"))
     bow_results = analyses["bag_of_words"]
-
+
     # Display models compared
     if "models" in bow_results:
         models = bow_results["models"]
@@ -170,32 +170,16 @@ def process_and_visualize_analysis(analysis_results):
     print(f"Creating word list for model {model}")
     word_list = [f"{item['word']} ({item['count']})" for item in words[:10]]
     components.append(gr.Markdown(f"**{model}**: {', '.join(word_list)}"))
-
-    # Add visualizations for word frequency differences
-    if "differential_words" in bow_results and "word_count_matrix" in bow_results and len(
-            bow_results["models"]) >= 2:
-        diff_words = bow_results["differential_words"]
-        word_matrix = bow_results["word_count_matrix"]
-        models = bow_results["models"]
-
-        if diff_words and word_matrix and len(diff_words) > 0:
-            components.append(gr.Markdown("### Words with Biggest Frequency Differences"))
-
-            # Create dataframe for plotting
-            model1, model2 = models[0], models[1]
-            diff_data = []
-
-            for word in diff_words[:10]:  # Limit to top 10 for readability
-                if word in word_matrix:
-                    counts = word_matrix[word]
-                    model1_count = counts.get(model1, 0)
-                    model2_count = counts.get(model2, 0)
-
-                    # Only include if there's a meaningful difference
-                    if abs(model1_count - model2_count) > 0:
-                        components.append(gr.Markdown(
-                            f"- **{word}**: {model1}: {model1_count}, {model2}: {model2_count}"
-                        ))
+
+    # Add the detailed BOW visualization using the create_bow_visualization function
+    print("Adding detailed BOW visualization components")
+    bow_visualization_components = create_bow_visualization(
+        {"analyses": {prompt: {"bag_of_words": bow_results}}}
+    )
+
+    # Skip the first component since it's a duplicate header
+    if len(bow_visualization_components) > 1:
+        components.extend(bow_visualization_components[1:])
 
     # Check for N-gram analysis
     if "ngram_analysis" in analyses:
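The word-list formatting retained in the hunk above is a plain comprehension over word/count dicts. In isolation it behaves like this (the sample data here is invented for illustration):

```python
# Same expression as in the hunk: "word (count)" strings joined with
# commas, capped at the ten most frequent entries.
words = [{"word": "harris", "count": 8}, {"word": "policy", "count": 8}]
word_list = [f"{item['word']} ({item['count']})" for item in words[:10]]
line = ", ".join(word_list)
print(line)  # harris (8), policy (8)
```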