|
Transaction Purpose Classification System
|
|
A machine learning system that classifies financial transactions based on their purpose text using multiple algorithms including Naive Bayes, Logistic Regression, and Support Vector Machines.
|
|
Features
|
|
|
|
Multiple ML Models: Compare performance across different algorithms
|
|
Text Preprocessing: Advanced text cleaning with NLTK
|
|
Interactive Web Interface: Built with Streamlit
|
|
Real-time Classification: Classify new transactions instantly
|
|
Model Comparison: Detailed analysis of model performance
|
|
LLM Integration Guide: Conceptual approach for transformer-based models
|
|
|
|
Live Demo
|
|
Visit the live demo on Hugging Face Spaces: [Your Space URL]
|
|
Model Performance
|
|
The system trains and compares three different models:
|
|
|
|
Naive Bayes: Fast and effective for text classification
|
|
Logistic Regression: Good baseline with interpretable results
|
|
Support Vector Machine: Often achieves high accuracy on sparse, high-dimensional text features
|
|
|
|
Local Development
|
|
Prerequisites
|
|
|
|
Python 3.8+
|
|
pip
|
|
|
|
Installation
|
|
|
|
Clone the repository:
|
|
|
|
    git clone https://github.com/yourusername/transaction-classification.git
    cd transaction-classification
|
|
|
|
Install dependencies:
|
|
|
|
    pip install -r requirements.txt
|
|
|
|
Run the application:
|
|
|
|
    streamlit run app.py
|
|
|
|
Open your browser and go to http://localhost:8501
|
|
|
|
Project Structure
|
|
transaction-classification/
├── app.py             # Main Streamlit application
├── requirements.txt   # Python dependencies
├── README.md          # Project documentation
└── .gitignore         # Git ignore file
|
|
How It Works
|
|
1. Data Preprocessing
|
|
The system preprocesses transaction text by:
|
|
|
|
Converting to lowercase
|
|
Removing punctuation and digits
|
|
Removing stopwords
|
|
Lemmatizing words
|
|
Filtering short words
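
The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the code in app.py: the stopword set is a tiny hypothetical stand-in for NLTK's English list, the `lemmatize` parameter defaults to a no-op, and the `min_len` threshold of 3 is an assumption.

```python
import re

# Tiny illustrative stopword set (assumption); the app presumably uses NLTK's list.
STOPWORDS = {"a", "an", "the", "at", "for", "of", "to", "in", "on", "and", "my"}


def preprocess_text(text, stopwords=STOPWORDS, lemmatize=lambda word: word, min_len=3):
    """Clean one transaction purpose string and return it as space-joined tokens."""
    # 1. Convert to lowercase
    text = text.lower()
    # 2. Replace everything that is not a-z or whitespace with a space
    #    (this drops punctuation, digits, and any non-ASCII characters)
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    # 3. Remove stopwords
    tokens = [t for t in tokens if t not in stopwords]
    # 4. Lemmatize each remaining word (no-op unless a lemmatizer is passed in)
    tokens = [lemmatize(t) for t in tokens]
    # 5. Filter out short words
    tokens = [t for t in tokens if len(t) >= min_len]
    return " ".join(tokens)


cleaned = preprocess_text("Payment #42 for the Netflix subscription!")
```

With NLTK installed and its `stopwords` and `wordnet` corpora downloaded, `set(stopwords.words("english"))` and `WordNetLemmatizer().lemmatize` could be passed in place of the defaults.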
|
|
|
|
2. Feature Extraction
|
|
The system uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert the cleaned text into numerical features suitable for machine learning.
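
To make the weighting concrete, here is a small sketch of the textbook formula, tf × log(N / df), in plain Python. It is an illustration only: the function name `tfidf` and the sample documents are made up for this example, and scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Term frequency (share of the document) times inverse document frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["monthly rent payment", "electric bill payment", "grocery shopping"]
vectors = tfidf(docs)
```

In the first document, "payment" also occurs in the second document, so it receives a lower weight than "rent", which occurs nowhere else.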
|
|
3. Model Training
|
|
Three models are trained and compared:
|
|
|
|
Naive Bayes: Probabilistic classifier based on Bayes' theorem
|
|
Logistic Regression: Linear model for classification
|
|
SVM: Support Vector Machine for high-dimensional data
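
A comparison along these lines could look like the sketch below, assuming scikit-learn. The toy data, the hyperparameters, and the choice of `LinearSVC` as the SVM variant are assumptions for illustration; the app may configure its models differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only; not the project's dataset.
train_texts = [
    "monthly apartment rent payment", "rent for downtown flat",
    "grocery shopping at walmart", "weekly grocery store run",
    "electric bill payment", "water utility bill",
]
train_labels = ["rent", "rent", "groceries", "groceries", "utilities", "utilities"]
test_texts = ["apartment rent", "grocery store", "electric bill"]
test_labels = ["rent", "groceries", "utilities"]

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
}

scores = {}
fitted = {}
for name, classifier in candidates.items():
    # Each model gets its own TF-IDF vectorizer inside a pipeline
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    # Pipeline.score reports accuracy on the held-out examples
    scores[name] = pipeline.score(test_texts, test_labels)
    fitted[name] = pipeline

# Keep the model with the highest held-out accuracy
best_name = max(scores, key=scores.get)
best_model = fitted[best_name]
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted on training data only, which avoids leaking test text into the features.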
|
|
|
|
4. Classification
|
|
The best-performing model is used to classify new transactions into categories like:
|
|
|
|
Rent
|
|
Groceries
|
|
Utilities
|
|
Subscriptions
|
|
Transportation
|
|
Dining
|
|
Shopping
|
|
Healthcare
|
|
Fitness
|
|
|
|
Large Language Model Approach
|
|
Conceptual Implementation
|
|
For improved performance, the system could be enhanced using transformer models:
|
|
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Number of target categories (assumption: the 9 example categories listed above)
    num_classes = 9

    # Load the pre-trained BERT tokenizer and model with a new classification head
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModelForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=num_classes
    )

    # Fine-tune on transaction data
    # (See full implementation in the app)
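
Sequence-classification models work with integer class ids, so the text labels need a mapping before fine-tuning. The sketch below shows one way to derive it; the category list is taken from the examples in this README, and the names `label2id`, `id2label`, and `train_labels` are illustrative assumptions.

```python
# Categories from the examples in this README (assumed to be the full label set)
categories = [
    "rent", "groceries", "utilities", "subscriptions", "transportation",
    "dining", "shopping", "healthcare", "fitness",
]

# Sort so the id assignment is stable across runs
label2id = {label: idx for idx, label in enumerate(sorted(categories))}
id2label = {idx: label for label, idx in label2id.items()}

# This is the value that num_labels expects
num_classes = len(label2id)

# Encode a few hypothetical training labels as integer ids
train_labels = ["rent", "groceries", "rent"]
encoded = [label2id[label] for label in train_labels]
```

Both dictionaries can also be passed to `from_pretrained` as `id2label=` and `label2id=` so that predictions are reported with readable category names.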
|
|
Benefits of LLM Approach
|
|
|
|
Better Context Understanding: Captures semantic meaning
|
|
Higher Accuracy: Fine-tuned transformers often outperform bag-of-words models on text classification, given enough labeled data
|
|
Transfer Learning: Leverages pre-trained knowledge
|
|
Robust to Variations: Handles different phrasings better
|
|
|
|
Implementation Considerations
|
|
|
|
Computational Requirements: Fine-tuning is practical mainly on a GPU
|
|
Training Time: Longer than traditional ML
|
|
Model Size: Larger deployment footprint
|
|
Complexity: More complex pipeline
|
|
|
|
Model Evaluation
|
|
Models are evaluated using:
|
|
|
|
Accuracy: Overall correctness
|
|
Precision: Share of predictions for a category that are correct
|
|
Recall: Share of a category's actual transactions that the model finds
|
|
F1-Score: Harmonic mean of precision and recall
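
As a worked illustration of these metrics, the sketch below computes them by hand for a single category, treated as the positive class. The function names and the example labels are hypothetical; in practice a library routine such as scikit-learn's `classification_report` would normally be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a category is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["rent", "rent", "groceries", "groceries", "utilities"]
y_pred = ["rent", "groceries", "groceries", "groceries", "utilities"]

overall = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="groceries")
```

Here four of five predictions are correct, so accuracy is 4/5. For "groceries" there are two true positives and one false positive, giving precision 2/3, recall 1, and an F1 of 0.8.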
|
|
|
|
API Usage
|
|
While the main interface is web-based, the core functionality can be adapted for API usage:
|
|
    # Example classification
    # Assumes preprocess_text, vectorizer (a fitted TF-IDF vectorizer), and
    # model (a trained classifier) are already defined in scope.
    def classify_transaction(purpose_text):
        cleaned_text = preprocess_text(purpose_text)       # clean the raw text
        vectorized = vectorizer.transform([cleaned_text])  # one-row feature matrix
        prediction = model.predict(vectorized)[0]          # label for that row
        return prediction

    # Usage
    result = classify_transaction("Monthly apartment rent payment")
    print(f"Predicted category: {result}")
|
|
Deployment to Hugging Face Spaces
|
|
Step 1: Create Space
|
|
|
|
Go to Hugging Face Spaces
|
|
Click "Create new Space"
|
|
Choose "Streamlit" as the SDK
|
|
Set space name and visibility
|
|
|
|
Step 2: Upload Files
|
|
Upload these files to your space:
|
|
|
|
app.py
|
|
requirements.txt
|
|
README.md
|
|
|
|
Step 3: Configuration
|
|
The space will automatically detect the Streamlit app and deploy it.
|
|
Step 4: Access
|
|
Your app will be available at: https://huggingface.co/spaces/yourusername/yourspacename
|
|
Sample Data
|
|
The system includes sample transaction data covering common categories:
|
|
| Purpose Text | Transaction Type |
|---|---|
| "Monthly apartment rent payment" | rent |
| "Grocery shopping at walmart" | groceries |
| "Electric bill payment" | utilities |
| "Netflix monthly subscription" | subscription |
| "Gas station fuel" | transportation |
|
|
Contributing
|
|
|
|
Fork the repository
|
|
Create a feature branch
|
|
Make your changes
|
|
Add tests if applicable
|
|
Submit a pull request
|
|
|
|
License
|
|
This project is licensed under the MIT License - see the LICENSE file for details.
|
|
Future Enhancements
|
|
|
|
Add more transaction categories
|
|
Implement ensemble methods
|
|
Add confidence scoring
|
|
Include data upload functionality
|
|
Add model retraining capability
|
|
Implement A/B testing framework
|
|
Add logging and monitoring