Upload 4 files
- ReadMe.md +170 -0
- app.py +378 -0
- gitignore.txt +55 -0
- requirements.txt +8 -3
ReadMe.md
ADDED
@@ -0,0 +1,170 @@
Transaction Purpose Classification System
A machine learning system that classifies financial transactions by their purpose text, comparing three algorithms: Naive Bayes, Logistic Regression, and a linear Support Vector Machine.
🌟 Features

Multiple ML Models: Compare performance across different algorithms
Text Preprocessing: Text cleaning with NLTK (stopword removal and lemmatization)
Interactive Web Interface: Built with Streamlit
Real-time Classification: Classify new transactions instantly
Model Comparison: Side-by-side comparison of model accuracy and predictions
LLM Integration Guide: Conceptual approach for transformer-based models

🚀 Live Demo
Visit the live demo on Hugging Face Spaces: [Your Space URL]
📊 Model Performance
The system trains and compares three different models:

Naive Bayes: Fast and effective for text classification
Logistic Regression: Good baseline with interpretable results
Support Vector Machine: Often achieves high accuracy

🛠️ Local Development
Prerequisites

Python 3.8+
pip

Installation

Clone the repository:

git clone https://github.com/yourusername/transaction-classification.git
cd transaction-classification

Install dependencies:

pip install -r requirements.txt

Run the application:

streamlit run app.py

Open your browser and go to http://localhost:8501

📁 Project Structure
transaction-classification/
├── app.py              # Main Streamlit application
├── requirements.txt    # Python dependencies
├── README.md           # Project documentation
└── .gitignore          # Git ignore file
🔧 How It Works
1. Data Preprocessing
The system preprocesses transaction text by:

Converting to lowercase
Removing punctuation and digits
Removing stopwords
Dropping words of two characters or fewer
Lemmatizing the remaining words

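The preprocessing steps above can be sketched without any external dependencies. This is a minimal illustration, not the app's code: it assumes a tiny hand-written stopword set (`STOP_WORDS` is hypothetical; the app uses NLTK's English list) and it skips lemmatization, which needs the WordNet corpus.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOP_WORDS = {"a", "an", "the", "at", "for", "to", "of", "and", "in", "on"}


def preprocess_sketch(text: str) -> str:
    """Lowercase, strip punctuation and digits, drop stopwords and short tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)


print(preprocess_sketch("Rent for Apartment #12, December 2023!"))
# prints: rent apartment december
```

Dropping digits means amounts and dates never reach the classifier, so only the wording of the purpose text influences the prediction.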
2. Feature Extraction
Uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization over unigrams and bigrams to convert text into numerical features suitable for machine learning.
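To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It assumes three made-up documents, unigrams only, and the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default behavior of scikit-learn's `TfidfVectorizer`. The names `docs`, `doc_freq`, and `tfidf_vector` are illustrative.

```python
import math
from collections import Counter

docs = ["monthly rent payment", "water bill payment", "monthly gym membership"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Return an L2-normalized TF-IDF vector over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


vec = tfidf_vector(tokenized[0])
for term, weight in zip(vocab, vec):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, "rent" appears in only one document and therefore gets a larger weight than "monthly" or "payment", which each appear in two.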
3. Model Training
Three models are trained and compared:

Naive Bayes: Probabilistic classifier based on Bayes' theorem
Logistic Regression: Linear model for classification
SVM: Linear Support Vector Machine (LinearSVC), well suited to high-dimensional sparse text features

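As an illustration of the Naive Bayes idea, the sketch below scores each class by its log prior plus the summed log likelihoods of the words, with Laplace smoothing. It is a simplified stand-in that assumes raw word counts and a four-row toy training set; the app itself fits scikit-learn's `MultinomialNB` on TF-IDF features. All names here (`train`, `word_counts`, `predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

train = [
    ("monthly rent payment", "rent"),
    ("apartment rent", "rent"),
    ("water bill payment", "utilities"),
    ("electric bill", "utilities"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}


def predict(text, alpha=1.0):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / len(train))
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen during training
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(predict("rent payment"))   # prints: rent
print(predict("electric bill"))  # prints: utilities
```

The smoothing term `alpha` keeps a word that never occurred in a class from driving that class's probability to zero.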
4. Classification
The model with the highest test accuracy is used to classify new transactions into categories such as:

Rent
Groceries
Utilities
Subscriptions
Transportation
Dining
Shopping
Healthcare
Fitness

🤖 Large Language Model Approach
Conceptual Implementation
For improved performance, the system could be enhanced using transformer models:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_classes = 9  # assumed: one label per transaction category listed above

# Load pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_classes
)

# Fine-tune on transaction data
# (the app's Model Comparison page shows a fuller Trainer-based sketch)
Benefits of LLM Approach

Better Context Understanding: Captures semantic meaning beyond exact word matches
Higher Accuracy: Often outperforms TF-IDF baselines when enough labeled data is available
Transfer Learning: Leverages pre-trained knowledge
Robust to Variations: Handles different phrasings better

Implementation Considerations

Computational Requirements: A GPU is strongly recommended for fine-tuning
Training Time: Longer than traditional ML
Model Size: Larger deployment footprint
Complexity: More complex pipeline

📈 Model Evaluation
Models are evaluated using:

Accuracy: Share of all predictions that are correct
Precision: Share of predictions for a class that are correct
Recall: Share of a class's actual examples that are found
F1-Score: Harmonic mean of precision and recall

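The metrics above can be computed by hand for a single class, which helps when reading the classification report. The following is a small sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not results from the app), treating one class as "positive" at a time.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["rent", "rent", "dining", "utilities", "dining"]
y_pred = ["rent", "dining", "dining", "utilities", "rent"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")                 # prints: accuracy: 0.60
print(per_class_metrics(y_true, y_pred, "rent"))   # prints: (0.5, 0.5, 0.5)
```

With a test split as small as the sample data allows, a single misclassified row moves these numbers a lot, so they are best read as rough indicators.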
🔄 API Usage
While the main interface is web-based, the core functionality can be adapted for API usage:
# Example classification
# Assumes preprocess_text, a fitted vectorizer, and a trained model (as built in app.py) are in scope
def classify_transaction(purpose_text):
    cleaned_text = preprocess_text(purpose_text)
    vectorized = vectorizer.transform([cleaned_text])
    prediction = model.predict(vectorized)[0]
    return prediction

# Usage
result = classify_transaction("Monthly apartment rent payment")
print(f"Predicted category: {result}")
🚀 Deployment to Hugging Face Spaces
Step 1: Create Space

Go to Hugging Face Spaces
Click "Create new Space"
Choose "Streamlit" as the SDK
Set space name and visibility

Step 2: Upload Files
Upload these files to your space:

app.py
requirements.txt
README.md

Step 3: Configuration
The space will automatically detect the Streamlit app and deploy it.
Step 4: Access
Your app will be available at: https://huggingface.co/spaces/yourusername/yourspacename
📝 Sample Data
The system includes sample transaction data covering common categories:
| Purpose Text | Transaction Type |
| --- | --- |
| "Monthly apartment rent payment" | rent |
| "Grocery shopping at walmart" | groceries |
| "Electric bill payment" | utilities |
| "Netflix monthly subscription" | subscription |
| "Gas station fuel" | transportation |
🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🔮 Future Enhancements

Add more transaction categories
Implement ensemble methods
Add confidence scoring
Include data upload functionality
Add model retraining capability
Implement A/B testing framework
Add logging and monitoring
app.py
ADDED
@@ -0,0 +1,378 @@
import streamlit as st
import pandas as pd
import numpy as np
import joblib
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Download required NLTK data
@st.cache_resource
def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
        nltk.data.find('corpora/stopwords')
        nltk.data.find('corpora/wordnet')
    except LookupError:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
        nltk.download('omw-1.4', quiet=True)

download_nltk_data()

# Initialize preprocessing tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Clean and preprocess text for classification"""
    if pd.isna(text):
        return ""

    text = str(text).lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)  # remove digits

    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and len(word) > 2]

    return ' '.join(words)

# Sample data for demonstration
@st.cache_data
def create_sample_data():
    """Create sample transaction data"""
    sample_data = [
        ("Monthly apartment rent payment", "rent"),
        ("Grocery shopping at walmart", "groceries"),
        ("Electric bill payment", "utilities"),
        ("Netflix monthly subscription", "subscription"),
        ("Gas station fuel", "transportation"),
        ("Restaurant dinner", "dining"),
        ("Apartment rent for december", "rent"),
        ("Weekly grocery shopping", "groceries"),
        ("Water bill payment", "utilities"),
        ("Spotify premium subscription", "subscription"),
        ("Bus fare to work", "transportation"),
        ("Coffee shop breakfast", "dining"),
        ("Monthly rent payment", "rent"),
        ("Food shopping at target", "groceries"),
        ("Internet bill", "utilities"),
        ("Amazon Prime membership", "subscription"),
        ("Uber ride home", "transportation"),
        ("Pizza delivery", "dining"),
        ("Rent for apartment", "rent"),
        ("Supermarket groceries", "groceries"),
        ("Phone bill payment", "utilities"),
        ("YouTube premium", "subscription"),
        ("Train ticket", "transportation"),
        ("Fast food lunch", "dining"),
        ("Office supplies", "shopping"),
        ("Medical appointment", "healthcare"),
        ("Gym membership", "fitness"),
        ("Book purchase", "shopping"),
        ("Doctor visit", "healthcare"),
        ("Fitness class", "fitness"),
        ("Clothing purchase", "shopping"),
        ("Pharmacy prescription", "healthcare"),
        ("Personal trainer", "fitness"),
        ("Electronics store", "shopping"),
        ("Dentist appointment", "healthcare"),
        ("Yoga class", "fitness"),
        ("Gift for friend", "shopping"),
        ("Eye exam", "healthcare"),
        ("Swimming pool fee", "fitness"),
        ("Home improvement", "shopping")
    ]

    df = pd.DataFrame(sample_data, columns=['purpose_text', 'transaction_type'])
    return df

@st.cache_resource
def train_models(df):
    """Train multiple models and return the best one"""
    # Preprocess data
    df['cleaned_purpose'] = df['purpose_text'].apply(preprocess_text)

    X = df["cleaned_purpose"]
    y = df["transaction_type"]

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # TF-IDF Vectorization
    vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train models
    models = {
        "Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
        "SVM (LinearSVC)": LinearSVC(random_state=42)
    }

    results = {}
    trained_models = {}

    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        y_pred = model.predict(X_test_vec)
        acc = accuracy_score(y_test, y_pred)
        results[name] = {
            'accuracy': acc,
            'predictions': y_pred,
            'actual': y_test
        }
        trained_models[name] = model

    # Find best model
    best_model_name = max(results, key=lambda x: results[x]['accuracy'])
    best_model = trained_models[best_model_name]

    return best_model, vectorizer, results, trained_models

def main():
    st.set_page_config(
        page_title="Transaction Classification System",
        page_icon="💳",
        layout="wide"
    )

    st.title("💳 Transaction Purpose Classification")
    st.markdown("---")

    # Sidebar
    st.sidebar.title("Navigation")
    page = st.sidebar.radio("Choose a page:", ["🏠 Home", "📊 Model Training", "🔍 Classification", "📈 Model Comparison"])

    # Load data
    df = create_sample_data()

    if page == "🏠 Home":
        st.header("Welcome to Transaction Classification System")

        col1, col2 = st.columns(2)

        with col1:
            st.subheader("📖 Project Overview")
            st.write("""
            This system classifies financial transactions based on their purpose text using machine learning.

            **Features:**
            - Multiple ML models (Naive Bayes, Logistic Regression, SVM)
            - Text preprocessing with NLTK
            - Interactive model comparison
            - Real-time transaction classification
            """)

        with col2:
            st.subheader("📊 Sample Data")
            st.dataframe(df.head(10))

        st.subheader("🏷️ Transaction Types")
        type_counts = df['transaction_type'].value_counts()
        fig = px.pie(values=type_counts.values, names=type_counts.index, title="Distribution of Transaction Types")
        st.plotly_chart(fig, use_container_width=True)

    elif page == "📊 Model Training":
        st.header("Model Training & Evaluation")

        # Train models
        with st.spinner("Training models..."):
            best_model, vectorizer, results, trained_models = train_models(df)

        col1, col2 = st.columns(2)

        with col1:
            st.subheader("📈 Model Performance")

            # Create results dataframe
            results_df = pd.DataFrame({
                'Model': list(results.keys()),
                'Accuracy': [results[model]['accuracy'] for model in results.keys()]
            })

            fig = px.bar(results_df, x='Model', y='Accuracy', title="Model Accuracy Comparison")
            fig.update_layout(yaxis_range=[0, 1])
            st.plotly_chart(fig, use_container_width=True)

            st.dataframe(results_df)

        with col2:
            st.subheader("🎯 Best Model Details")
            best_model_name = max(results, key=lambda x: results[x]['accuracy'])
            st.success(f"**Best Model:** {best_model_name}")
            st.metric("Accuracy", f"{results[best_model_name]['accuracy']:.3f}")

            # Classification report
            st.subheader("📋 Classification Report")
            y_test = results[best_model_name]['actual']
            y_pred = results[best_model_name]['predictions']

            report = classification_report(y_test, y_pred, output_dict=True)
            report_df = pd.DataFrame(report).transpose()
            st.dataframe(report_df.round(3))

        # Store models in session state
        st.session_state.best_model = best_model
        st.session_state.vectorizer = vectorizer
        st.session_state.trained_models = trained_models

    elif page == "🔍 Classification":
        st.header("Classify New Transaction")

        # Check if models are trained
        if 'best_model' not in st.session_state:
            st.warning("Please train the models first by visiting the 'Model Training' page.")
            return

        # Input form
        with st.form("classification_form"):
            purpose_text = st.text_area("Enter transaction purpose:",
                                        placeholder="e.g., Monthly apartment rent payment",
                                        height=100)

            submitted = st.form_submit_button("Classify Transaction")

        if submitted and purpose_text:
            # Preprocess input
            cleaned_text = preprocess_text(purpose_text)

            # Make prediction
            model = st.session_state.best_model
            vectorized_text = st.session_state.vectorizer.transform([cleaned_text])
            prediction = model.predict(vectorized_text)[0]
            if hasattr(model, "predict_proba"):
                prediction_proba = model.predict_proba(vectorized_text)[0]
            else:
                # LinearSVC has no predict_proba; use a softmax over its decision
                # scores as an uncalibrated confidence proxy.
                scores = model.decision_function(vectorized_text)[0]
                exp_scores = np.exp(scores - scores.max())
                prediction_proba = exp_scores / exp_scores.sum()

            # Get class labels
            classes = model.classes_

            # Display results
            col1, col2 = st.columns(2)

            with col1:
                st.subheader("🎯 Classification Result")
                st.success(f"**Predicted Type:** {prediction}")
                st.info(f"**Original Text:** {purpose_text}")
                st.info(f"**Processed Text:** {cleaned_text}")

            with col2:
                st.subheader("📊 Prediction Confidence")
                proba_df = pd.DataFrame({
                    'Transaction Type': classes,
                    'Probability': prediction_proba
                }).sort_values('Probability', ascending=False)

                fig = px.bar(proba_df, x='Probability', y='Transaction Type',
                             orientation='h', title="Prediction Probabilities")
                st.plotly_chart(fig, use_container_width=True)

    elif page == "📈 Model Comparison":
        st.header("Detailed Model Comparison")

        # Check if models are trained
        if 'trained_models' not in st.session_state:
            st.warning("Please train the models first by visiting the 'Model Training' page.")
            return

        # Model comparison
        st.subheader("🔍 Model Analysis")

        # Get sample predictions for comparison
        sample_texts = [
            "Monthly rent payment",
            "Grocery shopping",
            "Netflix subscription",
            "Gas station",
            "Restaurant dinner"
        ]

        comparison_data = []
        for text in sample_texts:
            cleaned = preprocess_text(text)
            vectorized = st.session_state.vectorizer.transform([cleaned])

            row = {'Text': text, 'Cleaned': cleaned}
            for model_name, model in st.session_state.trained_models.items():
                prediction = model.predict(vectorized)[0]
                row[model_name] = prediction

            comparison_data.append(row)

        comparison_df = pd.DataFrame(comparison_data)
        st.dataframe(comparison_df, use_container_width=True)

        # LLM/Transformer approach explanation
        st.subheader("🤖 Large Language Model Approach")

        with st.expander("Click to see LLM implementation strategy"):
            st.markdown("""
            ### Using Transformer Models for Transaction Classification

            **Approach:**
            1. **Pre-trained Model Selection**: Use `bert-base-uncased` or `distilbert-base-uncased`
            2. **Tokenization**: Use HuggingFace's tokenizer for the selected model
            3. **Model Architecture**: Add a classification head on top of the transformer
            4. **Fine-tuning**: Train on labeled transaction data

            **Code Example:**
            ```python
            from transformers import AutoTokenizer, AutoModelForSequenceClassification
            from transformers import Trainer, TrainingArguments

            # Load pre-trained model
            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
            model = AutoModelForSequenceClassification.from_pretrained(
                'bert-base-uncased',
                num_labels=len(unique_labels)
            )

            # Tokenize data
            def tokenize_function(examples):
                return tokenizer(examples['purpose_text'], truncation=True, padding=True)

            # Fine-tune model
            training_args = TrainingArguments(
                output_dir='./results',
                num_train_epochs=3,
                per_device_train_batch_size=16,
                per_device_eval_batch_size=64,
                warmup_steps=500,
                weight_decay=0.01,
            )

            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
            )

            trainer.train()
            ```

            **Benefits:**
            - Better semantic understanding
            - Handles context better than TF-IDF
            - Can capture complex patterns
            - State-of-the-art performance

            **Drawbacks:**
            - Requires more computational resources
            - Longer training time
            - More complex deployment
            """)

if __name__ == "__main__":
    main()
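A note on the Classification page's confidence chart: scikit-learn's `LinearSVC` exposes `decision_function` (one raw margin per class) rather than `predict_proba`, so a per-class confidence for that model has to be derived from the margins. One common approximation is a softmax over the decision scores. The sketch below is pure Python and uses made-up scores (`classes` and `decision_scores` are hypothetical); the resulting values are not calibrated probabilities.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["dining", "rent", "utilities"]
decision_scores = [-0.8, 1.2, -0.3]  # hypothetical margins for one transaction

confidences = softmax(decision_scores)
ranked = sorted(zip(classes, confidences), key=lambda pair: pair[1], reverse=True)
for label, value in ranked:
    print(f"{label}: {value:.3f}")
```

If calibrated probabilities matter, wrapping the SVM in scikit-learn's `CalibratedClassifierCV` is the more principled route, at the cost of needing enough samples per class for its internal cross-validation.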
gitignore.txt
ADDED
@@ -0,0 +1,55 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Jupyter Notebook
.ipynb_checkpoints

# Model files
*.pkl
*.joblib

# Data files
*.csv
*.json
data/

# Logs
*.log

# Streamlit
.streamlit/
requirements.txt
CHANGED
@@ -1,3 +1,8 @@
-
-
-
+
+streamlit==1.28.1
+pandas==2.0.3
+numpy==1.24.3
+scikit-learn==1.3.0
+nltk==3.8.1
+plotly==5.17.0
+joblib==1.3.2