---
title: "Blueberry Yield Regression"
emoji: 🌾
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: true
license: mit
tags:
- regression
- machine-learning
- streamlit
- kaggle
- agriculture
---
# Blueberry Yield Prediction with Machine Learning
This project is an end-to-end machine learning pipeline that predicts the **yield of wild blueberries** from environmental and biological features such as pollinator counts, rainfall, and fruit measurements.
## Project Type
- Supervised Learning
- Regression Problem
---
## Problem Description
Predicting agricultural yield is a crucial component in planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:
- Different species of pollinators (honeybee, bumblebee, osmia...)
- Environmental conditions (rainfall days, temperature ranges...)
- Fruit attributes (fruit mass, fruit set, seed count...)
🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.
---
## Dataset Info
- `train.csv`: 15,289 rows and 18 columns (including `id` and the `yield` target)
- `test.csv`: same structure, no target
- No missing values, clean numerical data
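
The checks listed above (row count, missing values, numeric-only columns) can be sketched with pandas. This is a minimal illustration, not the project's actual notebook code: the `summarize` helper is hypothetical, and the tiny frame uses made-up values in place of `train.csv`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic sanity-check facts about a tabular dataset."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }


# Tiny stand-in frame with made-up values; in the project this would be
# pd.read_csv("train.csv") instead.
demo = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "fruitset": [0.42, 0.51, 0.47],
        "fruitmass": [0.40, 0.45, 0.43],
        "seeds": [31.0, 36.5, 33.8],
        "yield": [4100.0, 6200.0, 5300.0],
    }
)
summary = summarize(demo)
print(summary)
```

Running the same helper on the real `train.csv` should report zero missing values if the dataset is as clean as described.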
---
## What We Did (Pipeline Summary)
1. **EDA (Exploratory Data Analysis)**
   - Checked for missing values ✅
   - Analyzed feature distributions & target (`yield`)
   - Built correlation heatmaps → strongest positive correlations:
     - `fruitmass`, `fruitset`, `seeds`
2. **Data Preprocessing**
   - Removed the `id` column
   - Selected features based on their correlation with `yield`
   - No categorical encoding needed (all features are numerical)
3. **Model Training**
   - Model: `RandomForestRegressor`
   - Train-Test Split: 80/20
   - **Results**:
     - RMSE ≈ **573.8**
     - R² Score ≈ **0.81** ✅
4. **Test Prediction & Submission**
   - Predictions made on `test.csv`
   - `submission.csv` generated for Kaggle submission
5. **Streamlit App**
   - Users input bee counts, rain days, and fruit measurements
   - Predicts blueberry yield in kg/ha
   - Uses trained model (`rf_model.pkl`) behind the scenes
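
The preprocessing and training steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (so the printed metrics will not match the RMSE and R² reported above), `rain_days` is a hypothetical column name, and `n_estimators=100` and `random_state=42` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv (made-up relationship, not real data).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame(
    {
        "id": np.arange(n),
        "fruitset": rng.uniform(0.2, 0.7, n),
        "fruitmass": rng.uniform(0.3, 0.55, n),
        "seeds": rng.uniform(22, 46, n),
        "rain_days": rng.integers(1, 35, n).astype(float),  # hypothetical name
    }
)
df["yield"] = (
    9000 * df["fruitset"]
    + 4000 * df["fruitmass"]
    + 50 * df["seeds"]
    + rng.normal(0, 300, n)
)

# EDA: rank features by their correlation with the target.
corr = df.drop(columns="id").corr()["yield"].drop("yield")
print(corr.sort_values(ascending=False))

# Preprocessing: drop the identifier and separate features from the target.
X = df.drop(columns=["id", "yield"])
y = df["yield"]

# 80/20 train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the random forest and evaluate on the held-out 20%.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_val)

rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
r2 = float(r2_score(y_val, pred))
print(f"RMSE: {rmse:.1f}  R2: {r2:.2f}")
```

RMSE is computed as the square root of `mean_squared_error` rather than through a keyword argument, which keeps the sketch independent of the installed scikit-learn version.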
---
## Try It Online
You can try the app live here:
[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)
---
## What Could Be Improved?
| Area | Suggestion |
|------|------------|
| Feature Engineering | Create interaction terms, try log/ratio features |
| Model | Try LightGBM, XGBoost, or stacking |
| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
| Visualization | Add interactive charts in Streamlit app |
| Real-World Data | Add satellite weather data, soil types, historical trends |
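
As one example of the tuning suggestion in the table, a grid search over a random forest might look like the sketch below. The parameter grid, the 3-fold cross-validation, and the synthetic data are all assumptions chosen to keep the example small; a real search would run on the training data with a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (made-up data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 5000 * X[:, 0] + 2000 * X[:, 1] + rng.normal(0, 100, 200)

# Deliberately tiny grid so the search finishes quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:.1f}")
```

Because scikit-learn maximizes scores, the RMSE scorer is negated; flipping the sign of `best_score_` recovers the cross-validated RMSE.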
---
## Project Structure
📦 blueberry-yield-regression
├── app.py
├── rf_model.pkl
├── model_columns.pkl
├── requirements.txt
├── submission.csv
└── README.md
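
The two pickle files suggest that `app.py` loads the trained model together with the column order used during training. The sketch below shows one way that could work. It assumes `model_columns.pkl` holds a plain list of feature names, which this README does not confirm; `load_pickle` and `build_feature_row` are hypothetical helpers.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled artifact, e.g. rf_model.pkl or model_columns.pkl."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def build_feature_row(user_values, model_columns):
    """Order user inputs to match the training columns; fail loudly on gaps."""
    missing = [c for c in model_columns if c not in user_values]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    return [[float(user_values[c]) for c in model_columns]]


# Hypothetical column list; the real model_columns.pkl may differ.
model_columns = ["fruitset", "fruitmass", "seeds"]
row = build_feature_row(
    {"seeds": 35, "fruitset": 0.5, "fruitmass": 0.45}, model_columns
)
print(row)  # [[0.5, 0.45, 35.0]]

# In app.py this might be followed by something like:
#   model = load_pickle("rf_model.pkl")
#   prediction = model.predict(row)[0]
```

Reordering inputs by the saved column list guards against the app passing features in a different order than the model was trained on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.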
---
## License
MIT License: free to use, modify, and distribute.
---