SpencerCPurdy committed
Commit eb2f588 · verified · 1 Parent(s): 071390e

Update README.md

Files changed (1):
  1. README.md +34 -47

README.md CHANGED
@@ -12,68 +12,55 @@ short_description: End-to-End Automated MLOps Framework
 
 # End-to-End Automated MLOps Framework
 
- This project is a comprehensive, production-ready MLOps platform designed to automate the entire machine learning lifecycle. It provides an enterprise-grade solution for training, versioning, deploying, and monitoring models, complete with automated drift detection, retraining, and A/B testing capabilities.
 
- The entire system is managed through a sophisticated, multi-tabbed dashboard, offering a transparent and interactive view into every stage of the model lifecycle. This framework is built to demonstrate best practices in MLOps and to serve as a robust foundation for real-world machine learning systems.
 
- ## Core Features
-
- This platform integrates a full suite of MLOps tools into a single, cohesive system:
-
- * **Automated Training & Hyperparameter Tuning**: Employs a custom PyTorch neural network and leverages `Optuna` for sophisticated, automated hyperparameter optimization to find the best-performing model architecture.
-
- * **Model Registry & Versioning**: A robust model registry, backed by a persistent SQLite database, tracks every model version, its associated metrics, metadata, and artifacts. It supports clear versioning and promotion of models to a production state.
-
- * **Data and Concept Drift Detection**: Integrates powerful libraries like `Evidently` and `Alibi-Detect` to continuously monitor for data drift. It provides detailed reports on drift scores and identifies which features are most affected.
 
- * **Automated Retraining on Drift**: The system can be configured to automatically trigger a model retraining pipeline when significant data drift is detected, ensuring that production models remain accurate and relevant.
-
- * **Live A/B Testing Framework**: A built-in A/B testing manager allows for controlled, live comparison between a champion (production) model and a challenger. It routes traffic, records performance, and uses statistical tests to determine a winner.
-
- * **Comprehensive Performance Monitoring**: Tracks key performance indicators in real-time using `Prometheus` metrics. It monitors prediction latency, throughput, and model accuracy, providing alerts for performance degradation.
-
- * **Detailed Cost Tracking**: An integrated cost tracker estimates the financial impact of the ML system, breaking down costs for training (compute), inference (API calls), and model storage.
-
- * **Automated Model Card Generation**: Generates detailed, shareable model cards that document a model's architecture, performance metrics, training data characteristics, and intended use cases, promoting transparency and responsible AI.
 
- * **One-Click Hugging Face Deployment**: Seamlessly exports any registered model version, along with its model card, to the Hugging Face Hub, making it easy to share and collaborate.
 
 ## How It Works
 
- The MLOps Engine orchestrates a continuous, automated loop for managing the model lifecycle:
 
- 1. **Initial Training**: The system begins by training an initial model on a baseline dataset. This process includes hyperparameter optimization with Optuna to find the most effective architecture.
- 2. **Model Registration**: The trained model, its performance metrics, training duration, and metadata are logged in the Model Registry. The best-performing initial model is automatically promoted to "Production."
- 3. **Inference & Monitoring**: The production model serves predictions via the interactive UI. The `PerformanceMonitor` and `CostTracker` log every prediction, tracking latency, confidence, and associated costs.
- 4. **Drift Detection**: On a configurable schedule or by manual trigger, the `DriftDetector` compares incoming data to the reference dataset used for training.
- 5. **Automated Retraining & A/B Testing**:
-     * If significant drift is detected, the system automatically triggers a retraining job on the new data.
-     * The newly trained model becomes a "challenger" and is placed into an A/B test against the current "champion" production model.
-     * The `ABTestManager` splits live traffic between the two models, and the winner is automatically promoted to production after reaching statistical significance.
- 6. **Analysis & Reporting**: At any point, users can generate detailed performance reports, cost breakdowns, and model cards directly from the dashboard.
 
 ## Technical Stack
 
- * **Machine Learning & Deep Learning**: PyTorch, Scikit-learn
- * **MLOps & Experiment Tracking**: MLflow, Optuna, Evidently, Alibi-Detect, SHAP
- * **Data Processing & Storage**: Pandas, NumPy, SQLite, Joblib
- * **Monitoring**: Prometheus Client
- * **Deployment & UI**: Gradio, Hugging Face Hub
- * **Core Language**: Python
 
 ## How to Use the Demo
 
- The Gradio interface is organized into logical tabs that cover the entire MLOps lifecycle.
 
- 1. **Model Training**: Generate synthetic data and train a new model. Choose whether to run hyperparameter optimization. The training results and performance metrics will be displayed.
- 2. **Model Registry**: View all registered model versions. Select a model and promote it to the production environment.
- 3. **Make Predictions**: Input feature values to get a real-time prediction from the current production model.
- 4. **Drift Detection**: Manually trigger a drift check to compare the current data distribution against the model's training data.
- 5. **A/B Testing**: Start a new A/B test between the production model and a new challenger, check the status of an active test, or complete a test to promote the winner.
- 6. **Performance Monitoring & Cost Tracking**: View dashboards summarizing model performance and operational costs over various time windows.
- 7. **Model Card**: Select any model version and generate a complete documentation card with its metrics and metadata.
- 8. **Settings**: Configure system-level parameters, such as enabling or disabling the automated retraining loop.
 
 ## Disclaimer
 
- This project is a comprehensive demonstration of an MLOps framework and operates on synthetically generated data. The models and workflows are designed for educational and illustrative purposes and should be adapted and validated for use in real-world production environments.
 
 
 # End-to-End Automated MLOps Framework
 
+ **Author**: Spencer Purdy
 
+ This project is a comprehensive, enterprise-grade MLOps platform that demonstrates a complete, automated lifecycle for machine learning models. It handles everything from automated training and hyperparameter optimization to versioning, production deployment, drift detection, A/B testing, and ongoing performance monitoring.
 
+ The entire system is orchestrated by a central engine and managed through a powerful, multi-tab Gradio interface, providing a single pane of glass for all MLOps activities.
 
+ ## Core Features
 
+ * **Automated Model Training**: The system features a `ModelTrainer` that automatically trains a custom PyTorch neural network on tabular data. It includes support for handling class imbalance with SMOTE and integrates `Optuna` for sophisticated hyperparameter optimization.
+ * **Model Registry and Versioning**: A robust `ModelRegistry` tracks all trained model versions, their performance metrics, and metadata. Models are persisted to disk and logged in a SQLite database, with functionality to promote any version to the "production" stage.
+ * **Data and Concept Drift Detection**: The platform integrates both `Evidently` and `Alibi-Detect` (with a statistical fallback) to continuously monitor for data drift between the reference training data and live inference data. Drift scores are tracked over time. A minimal sketch of such a statistical check appears after this list.
+ * **Automated Retraining**: A background process can be enabled to periodically check for significant data drift. If the drift threshold is exceeded, it automatically triggers a new model training cycle and initiates an A/B test against the current production model.
+ * **Live A/B Testing**: The `ABTestManager` allows for controlled experiments between the current production model and a challenger. It routes inference traffic, records performance metrics for both models, and determines a statistical winner.
+ * **Comprehensive Monitoring & Cost Tracking**:
+   * **Performance**: The `PerformanceMonitor` uses Prometheus-compatible metrics to track prediction latency, accuracy, and throughput. It also logs detailed performance data to a database for historical analysis (see the instrumentation sketch after this list).
+   * **Cost**: The `CostTracker` provides reports on estimated operational costs, breaking them down by training, inference, and model storage based on configurable rates.
+ * **Model Cards and Explainability**: The system can generate detailed model cards that consolidate metadata, performance metrics, and operational history. It also includes `SHAP` as a dependency for future explainability features.
+ * **Hugging Face Hub Integration**: Models can be exported directly from the registry to the Hugging Face Hub, with an automatically generated model card (`README.md`).
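 
+ The statistical drift fallback mentioned above is not spelled out in this README, so the following is only a rough sketch of how such a check could work, assuming per-feature two-sample Kolmogorov-Smirnov tests; the function name, threshold, and drift-share rule are illustrative assumptions, not the project's actual `DriftDetector` logic:
 
+ ```python
+ import numpy as np
+ from scipy.stats import ks_2samp
+ 
+ def simple_drift_check(reference: np.ndarray, current: np.ndarray,
+                        p_threshold: float = 0.05, drift_share: float = 0.3) -> dict:
+     """Flag drift when enough individual features fail a two-sample KS test."""
+     drifted = []
+     for i in range(reference.shape[1]):
+         _, p_value = ks_2samp(reference[:, i], current[:, i])
+         if p_value < p_threshold:
+             drifted.append(i)
+     share = len(drifted) / reference.shape[1]
+     return {"drift_detected": share >= drift_share,
+             "drift_share": share,
+             "drifted_features": drifted}
+ 
+ # Synthetic example: shift four of ten features in the "live" batch.
+ rng = np.random.default_rng(0)
+ reference = rng.normal(size=(1000, 10))
+ live = rng.normal(size=(1000, 10))
+ live[:, :4] += 1.5
+ print(simple_drift_check(reference, live))
+ ```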
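 
+ Likewise, the Prometheus-compatible performance metrics can be pictured with a short `prometheus_client` sketch. The metric names, labels, and helper below are assumptions for illustration, not necessarily what the `PerformanceMonitor` defines:
 
+ ```python
+ from prometheus_client import Counter, Histogram, start_http_server
+ 
+ PREDICTION_LATENCY = Histogram(
+     "prediction_latency_seconds", "Time spent serving a single prediction")
+ PREDICTIONS_TOTAL = Counter(
+     "predictions_total", "Number of predictions served", ["model_version"])
+ 
+ def serve_prediction(model, features, model_version: str = "v1"):
+     # Time the inference call and count it against the serving model version.
+     with PREDICTION_LATENCY.time():
+         prediction = model.predict([features])[0]
+     PREDICTIONS_TOTAL.labels(model_version=model_version).inc()
+     return prediction
+ 
+ if __name__ == "__main__":
+     start_http_server(8000)  # metrics become scrapeable at :8000/metrics
+ ```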
 
 ## How It Works
 
+ The platform operates as a cohesive system of specialized components orchestrated by the main `MLOpsEngine`:
 
+ 1. **Training**: A user initiates a training job from the UI. The `ModelTrainer` uses `Optuna` to find the best hyperparameters and then trains a `CustomNeuralNetwork` model (a small, illustrative Optuna sketch follows this list).
+ 2. **Registration**: The newly trained model, along with its performance metrics and metadata, is registered in the `ModelRegistry`. The model artifact is saved, and its details are recorded in the SQLite database.
+ 3. **Promotion**: A user can review all registered models and promote a specific version to be the active "production" model via the UI.
+ 4. **Prediction**: When a prediction request is made, the engine retrieves the current production model (or routes to an A/B test model if one is active) to perform inference. Latency and other performance metrics are logged by the `PerformanceMonitor`.
+ 5. **Monitoring & Drift Detection**: In the background, the `DriftDetector` continuously compares incoming data against a reference dataset. If drift is detected and auto-retraining is enabled, it triggers the training of a new "challenger" model.
+ 6. **A/B Testing**: The new challenger model is automatically placed into an A/B test against the current production model. Live traffic is split between them until a statistically significant winner is found, which can then be automatically promoted (see the routing sketch after this list).
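 
+ Step 1's hyperparameter search is easiest to picture with a small `Optuna` example. This is a sketch under stated assumptions: a scikit-learn MLP stands in for the project's custom PyTorch network, and the search space is invented rather than taken from the `ModelTrainer`:
 
+ ```python
+ import optuna
+ from sklearn.datasets import make_classification
+ from sklearn.model_selection import cross_val_score
+ from sklearn.neural_network import MLPClassifier
+ 
+ X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
+ 
+ def objective(trial: optuna.Trial) -> float:
+     # Search over learning rate and hidden-layer width, scored by CV accuracy.
+     model = MLPClassifier(
+         learning_rate_init=trial.suggest_float("lr", 1e-4, 1e-1, log=True),
+         hidden_layer_sizes=(trial.suggest_int("hidden_units", 16, 128),),
+         max_iter=300, random_state=42)
+     return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
+ 
+ study = optuna.create_study(direction="maximize")
+ study.optimize(objective, n_trials=20)
+ print("Best trial:", study.best_params, study.best_value)
+ ```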
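 
+ Step 6's traffic split and winner selection can be illustrated with a deterministic hash-based router plus a simple significance check. This, too, is a hedged sketch: the routing rule and the chi-square test are assumptions about how an A/B test manager might behave, not the `ABTestManager` implementation:
 
+ ```python
+ import hashlib
+ from scipy.stats import chi2_contingency
+ 
+ def route(request_id: str, challenger_share: float = 0.5) -> str:
+     """Deterministically assign a request to the champion or the challenger."""
+     bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
+     return "challenger" if bucket < challenger_share * 100 else "champion"
+ 
+ def ab_winner(champ_correct, champ_total, chall_correct, chall_total, alpha=0.05):
+     """Declare a winner only once the accuracy gap is statistically significant."""
+     table = [[champ_correct, champ_total - champ_correct],
+              [chall_correct, chall_total - chall_correct]]
+     _, p_value, _, _ = chi2_contingency(table)
+     if p_value >= alpha:
+         return None  # not significant yet; keep the test running
+     return ("challenger" if chall_correct / chall_total > champ_correct / champ_total
+             else "champion")
+ 
+ print(route("request-123"))           # e.g. "champion" or "challenger"
+ print(ab_winner(430, 500, 465, 500))  # "challenger" for this sample outcome
+ ```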
 
 
 
 ## Technical Stack
 
+ * **Machine Learning & Training**: scikit-learn, PyTorch, imbalanced-learn
+ * **MLOps & Experiment Tracking**: MLflow, Optuna, Hugging Face Hub, W&B
+ * **Drift & Anomaly Detection**: Evidently, Alibi-Detect, SHAP
+ * **Web Interface & Visualization**: Gradio, Matplotlib, Seaborn, Plotly, Yellowbrick
+ * **Infrastructure & Utilities**: Prometheus Client, Joblib, SQLite
 
 ## How to Use the Demo
 
+ The Gradio interface is organized into tabs that follow a logical MLOps workflow.
 
+ 1. **Train a Model**: Navigate to the **Model Training** tab, select the number of training samples, and click **Train New Model**. This will create the first version in the registry.
+ 2. **Manage Models**: Go to the **Model Registry** tab. Click **Refresh Model List** to see all trained models. Select a version from the dropdown and click **Promote to Production** to make it active.
+ 3. **Make Predictions**: In the **Make Predictions** tab, enter values for the features and click **Predict**. The result from the current production model will be displayed.
+ 4. **Detect Drift**: Go to the **Drift Detection** tab and click **Check for Data Drift** to simulate checking a new batch of data against the original training data.
+ 5. **Run an A/B Test**: In the **A/B Testing** tab, click **Start New A/B Test**. This will train a new challenger model and run it against the current production model. To generate results, make several predictions in the "Make Predictions" tab with the "Use A/B Test" checkbox ticked.
+ 6. **Monitor Performance**: Check the **Performance Monitoring** and **Cost Tracking** tabs to see live operational dashboards for the system.
 
 
 ## Disclaimer
 
+ This project is an advanced demonstration of MLOps principles and is intended for educational and portfolio purposes. It uses synthetically generated data for its training and drift detection processes. While built to be robust, it is not intended for direct use in a live production environment without extensive testing and validation.