raymondEDS committed
Commit faeb953 · 1 Parent(s): b1b0b70

test week 5

Reference files/Copy_Lab_5_hands_on_peer_review.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
app/pages/week_5.py ADDED
@@ -0,0 +1,271 @@
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from sklearn.linear_model import LinearRegression
+ from sklearn.metrics import r2_score
+ import scipy.stats as stats
+ from nltk.tokenize import word_tokenize
+
+ def show():
+     st.title("Week 5: Introduction to Machine Learning and Linear Regression")
+
+     # Introduction Section
+     st.header("Course Overview")
+     st.write("""
+ This week, we'll explore machine learning through a fascinating real-world challenge: The Academic Publishing Crisis.
+
+ Imagine you're the program chair for a prestigious AI conference. You've just received 5,000 paper submissions, and you need to:
+ - Decide which papers to accept (only 20% can be accepted)
+ - Ensure fair and consistent reviews
+ - Understand what makes reviewers confident in their assessments
+
+ The Problem: Human reviewers are inconsistent. Some are harsh, others lenient. Some write detailed reviews, others just a few sentences.
+ How can we use data to understand and improve this process?
+
+ **Your Mission: Build a machine learning system to analyze review patterns and predict paper acceptance!**
+     """)
+
+     # Learning Path
+     st.subheader("Key Concepts You'll Master")
+     st.write("""
+ 1. **Linear Regression (线性回归):**
+    - Definition: A statistical method that models the relationship between a dependent variable and one or more independent variables
+    - Real-world example: Predicting house prices based on size and location
+
+ 2. **Correlation Analysis (相关性分析):**
+    - Definition: A statistical measure of how strongly two variables are related
+    - Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
+
+ 3. **Reading Linear Regression Output (解读线性回归结果):**
+    - R-squared (R²): Proportion of variance explained by the model (0-1)
+    - p-value: Probability of seeing a relationship at least this strong if no true relationship existed
+    - Coefficients (系数): How much the dependent variable changes with a one-unit change in the independent variable
+    - Standard errors: Uncertainty in the coefficient estimates
+    - Confidence intervals: The range where the true coefficient likely lies
+     """)
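+
+     # A minimal worked example of a correlation coefficient, using the scipy.stats
+     # module imported above; the numbers are hypothetical and for illustration only.
+     st.code("""
+ import scipy.stats as stats
+
+ word_counts = [120, 340, 80, 510, 260]   # hypothetical review lengths
+ ratings = [5, 7, 4, 8, 6]                # hypothetical paper ratings
+
+ r, p_value = stats.pearsonr(word_counts, ratings)
+ print(f"Pearson r = {r:.2f}, p-value = {p_value:.3f}")  # r always lies between -1 and +1
+     """)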
+
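+     # A minimal sketch of reading a full regression output (R², coefficients,
+     # standard errors, p-values, confidence intervals). It assumes the statsmodels
+     # package is available; the data are synthetic.
+     st.code("""
+ import numpy as np
+ import statsmodels.api as sm
+
+ rng = np.random.default_rng(42)
+ x = rng.uniform(0, 10, 100)          # e.g. review length
+ y = 2 * x + rng.normal(0, 2, 100)    # e.g. rating, with noise
+
+ X = sm.add_constant(x)               # add an intercept column
+ results = sm.OLS(y, X).fit()
+ print(results.summary())             # reports R², coef, std err, P>|t|, [0.025, 0.975]
+     """)
+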
+     # Module 1: Setting Up Your Data Science Toolkit
+     st.header("Module 1: Setting Up Your Data Science Toolkit")
+     st.write("""
+ Let's start by importing the necessary libraries for our analysis:
+     """)
+
+     st.code("""
+ import numpy as np
+ import pandas as pd
+ import scipy.stats as stats
+ import matplotlib.pyplot as plt
+ import sklearn
+ from nltk.tokenize import word_tokenize
+ import seaborn as sns
+
+ # Set up visualization style
+ sns.set_style("whitegrid")
+ sns.set_context("poster")
+     """)
+
+     # Module 2: Loading and Understanding Data
+     st.header("Module 2: Loading and Understanding Data")
+     st.write("""
+ Before diving into analysis, we need to understand our data structure. What information do we have about each review? Each submission?
+     """)
+
+     if st.button("Load Sample Data"):
+         # Create sample data for demonstration
+         sample_reviews = pd.DataFrame({
+             'rating_int': [6, 6, 5, 6, 8],
+             'confidence_int': [4.0, 4.0, 4.0, 3.0, 3.0],
+             'review': [
+                 'There is a lot of recent work on link-prediction...',
+                 'Pros: The different attention techniques...',
+                 'Overview of the paper: This paper studies...',
+                 'Summary: The authors propose a near minimax...',
+                 'This paper introduces a GPU-friendly variant...'
+             ],
+             'forum': ['tGZu6DlbreV', 'uKhGRvM8QNH', 'IrM64DGB21', 'ww-7bdU6GA9', 'r1VGvBcxl']
+         })
+
+         st.write("Sample Reviews Data:")
+         st.dataframe(sample_reviews)
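+
+         # A minimal sketch of inspecting the data: describe() reports count, mean,
+         # std and quartiles for the numeric columns; the same call works on the
+         # full reviews dataset.
+         st.write("Summary statistics for the numeric columns:")
+         st.dataframe(sample_reviews.describe())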
+
+     # Module 3: Feature Engineering
+     st.header("Module 3: Feature Engineering")
+     st.write("""
+ We'll create features from our text data that can help predict paper acceptance (see the sketch below):
+ - Review length (word count)
+ - Review rating
+ - Reviewer confidence
+ - Number of keywords in the paper
+     """)
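+
+     # A minimal sketch of building these features on a tiny DataFrame. The column
+     # names ('review', 'keywords') and the semicolon-separated keyword format are
+     # assumptions for illustration; the real dataset may differ.
+     st.code("""
+ import pandas as pd
+ from nltk.tokenize import word_tokenize
+
+ df = pd.DataFrame({
+     'review': ['Clear paper, strong results.', 'Interesting idea but weak evaluation.'],
+     'keywords': ['GNN;link prediction', 'attention;transformers;NLP'],
+ })
+
+ df['word_count'] = df['review'].apply(lambda x: len(word_tokenize(x)))
+ df['keyword_count'] = df['keywords'].apply(lambda x: len(x.split(';')))
+ print(df[['word_count', 'keyword_count']])
+     """)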
+
+     # Interactive Feature Engineering
+     st.subheader("Try Feature Engineering")
+     st.write("""
+ Let's create some features from a review:
+     """)
+
+     review_text = st.text_area(
+         "Enter a review to analyze:",
+         "This paper introduces a novel approach to machine learning. The methodology is sound and the results are promising.",
+         key="review_text"
+     )
+
+     if st.button("Extract Features"):
+         # Calculate features (naive sentence count: split on '.' and drop empty pieces)
+         word_count = len(word_tokenize(review_text))
+         sentence_count = len([s for s in review_text.split('.') if s.strip()])
+
+         st.write("Extracted Features:")
+         st.write(f"Word Count: {word_count}")
+         st.write(f"Sentence Count: {sentence_count}")
+
+     # Module 4: Linear Regression Analysis
+     st.header("Module 4: Linear Regression Analysis")
+     st.write("""
+ Let's build a simple linear regression model to predict paper ratings based on review features.
+     """)
+
+     # Interactive Regression
+     st.subheader("Try Linear Regression")
+     st.write("""
+ Let's create a simple regression model:
+     """)
+
+     if st.button("Run Sample Regression"):
+         # Create synthetic sample data
+         np.random.seed(42)
+         X = np.random.rand(100, 1) * 10  # Review length
+         y = 2 * X + np.random.randn(100, 1) * 2  # Rating with some noise
+
+         # Fit regression model
+         model = LinearRegression()
+         model.fit(X, y)
+
+         # Create visualization
+         fig = plt.figure(figsize=(10, 6))
+         plt.scatter(X, y, color='blue', alpha=0.5)
+         plt.plot(X, model.predict(X), color='red', linewidth=2)
+         plt.xlabel('Review Length')
+         plt.ylabel('Rating')
+         plt.title('Linear Regression: Review Length vs Rating')
+         st.pyplot(fig)
+
+         # Show model metrics
+         st.write(f"R-squared: {r2_score(y, model.predict(X)):.3f}")
+         st.write(f"Coefficient: {model.coef_[0][0]:.3f}")
+         st.write(f"Intercept: {model.intercept_[0]:.3f}")
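+
+         # How to read these numbers (the data are synthetic, so the exact values
+         # are illustrative only).
+         st.caption("Interpretation: the coefficient is the slope (expected change in rating "
+                    "for a one-unit increase in review length); R-squared is the share of "
+                    "rating variance explained by the fitted line.")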
+
+     # Practice Exercises
+     st.header("Practice Exercises")
+
+     with st.expander("Exercise 1: Feature Engineering"):
+         st.write("""
+ 1. Load the reviews dataset
+ 2. Create features from review text
+ 3. Calculate correlation between features
+ 4. Visualize relationships
+         """)
+
+         st.code("""
+ # Solution
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from nltk.tokenize import word_tokenize
+
+ # Load data
+ df_reviews = pd.read_csv('reviews.csv')
+
+ # Create features
+ df_reviews['word_count'] = df_reviews['review'].apply(
+     lambda x: len(word_tokenize(x)))
+ df_reviews['sentence_count'] = df_reviews['review'].apply(
+     lambda x: len([s for s in x.split('.') if s.strip()]))
+
+ # Calculate correlation
+ correlation = df_reviews[['word_count', 'rating_int',
+                           'confidence_int']].corr()
+
+ # Visualize
+ sns.heatmap(correlation, annot=True)
+ plt.show()
+         """)
+
+     with st.expander("Exercise 2: Building a Predictive Model"):
+         st.write("""
+ 1. Prepare features for modeling
+ 2. Split data into training and test sets
+ 3. Train a linear regression model
+ 4. Evaluate model performance
+         """)
+
+         st.code("""
+ # Solution
+ from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LinearRegression
+
+ # Prepare features
+ X = df_reviews[['word_count', 'confidence_int']]
+ y = df_reviews['rating_int']
+
+ # Split data
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.2, random_state=42)
+
+ # Train model
+ model = LinearRegression()
+ model.fit(X_train, y_train)
+
+ # Evaluate: compare R² on the training set and the held-out test set
+ train_score = model.score(X_train, y_train)
+ test_score = model.score(X_test, y_test)
+
+ print(f"Training R²: {train_score:.3f}")
+ print(f"Testing R²: {test_score:.3f}")
+         """)
+
+     # Weekly Assignment
+     username = st.session_state.get("username", "Student")
+     st.header(f"{username}'s Weekly Assignment")
+
+     if username == "manxiii":
+         st.markdown("""
+ Hello **manxiii**, here is your Assignment 5: Machine Learning Analysis.
+ 1. Complete the feature engineering pipeline for the ICLR dataset
+ 2. Build a linear regression model to predict paper ratings
+ 3. Analyze the relationship between review features and acceptance
+ 4. Submit your findings in a Jupyter notebook
+
+ **Due Date:** End of Week 5
+         """)
+     elif username == "zhu":
+         st.markdown("""
+ Hello **zhu**, here is your Assignment 5: Machine Learning Analysis.
+ 1. Implement the complete machine learning workflow
+ 2. Create insightful visualizations of model results
+ 3. Draw conclusions from your analysis
+ 4. Submit your work in a Jupyter notebook
+
+ **Due Date:** End of Week 5
+         """)
+     elif username == "WK":
+         st.markdown("""
+ Hello **WK**, here is your Assignment 5: Machine Learning Analysis.
+ 1. Complete the feature engineering pipeline
+ 2. Build and evaluate a linear regression model
+ 3. Analyze patterns in the data
+ 4. Submit your findings
+
+ **Due Date:** End of Week 5
+         """)
+     else:
+         st.markdown(f"""
+ Hello **{username}**, here is your Assignment 5: Machine Learning Analysis.
+ 1. Complete the feature engineering pipeline
+ 2. Build and evaluate a linear regression model
+ 3. Analyze patterns in the data
+ 4. Submit your findings
+
+ **Due Date:** End of Week 5
+         """)