---

title: "🍇 Blueberry Yield Regression"
emoji: 🌾
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: true
license: mit
tags:
  - regression
  - machine-learning
  - streamlit
  - kaggle
  - agriculture
---


# 🍇 Blueberry Yield Prediction with Machine Learning

This project is an end-to-end machine learning pipeline that predicts the **yield of wild blueberries** from environmental and biological features such as pollinator counts, rainfall, and fruit measurements.

## 📌 Project Type

- Supervised Learning
- Regression Problem

---

## 🔍 Problem Description

Accurate yield prediction supports agricultural planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:

- Different species of pollinators (honeybee, bumblebee, osmia...)
- Environmental conditions (rainfall days, temperature ranges...)
- Fruit attributes (fruit mass, fruit set, seed count...)

🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.

---

## 📊 Dataset Info

- `train.csv`: 15,289 rows and 18 columns (including `id` and the target `yield`)
- `test.csv`: same structure, no target
- No missing values, clean numerical data
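
The points above can be checked with a few lines of pandas. This is a minimal sketch rather than the project's actual code: the helper name `summarize` is hypothetical, and the tiny in-memory frame only stands in for `train.csv` so the snippet runs on its own.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts listed in the dataset summary (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }


# Made-up stand-in rows; with the real data use: df = pd.read_csv("train.csv")
demo = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "fruitset": [0.42, 0.51, 0.47],
        "yield": [4500.0, 6100.0, 5300.0],
    }
)

summary = summarize(demo)
print(summary)
```

Run against the real `train.csv`, `n_missing` should be 0 and `all_numeric` should be `True` if the summary above holds.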

---

## 📈 What We Did (Pipeline Summary)

1. **EDA (Exploratory Data Analysis)**  
   - Checked for missing values ✅  
   - Analyzed feature distributions & target (`yield`)  
   - Built correlation heatmaps; the features most positively correlated with `yield` were:  
     - `fruitmass`, `fruitset`, `seeds`

2. **Data Preprocessing**  
   - Removed `id` column  
   - Selected features based on their correlation with `yield`  
   - No categorical encoding needed (all numerical)

3. **Model Training**  
   - Model: `RandomForestRegressor`  
   - Train-Test Split: 80/20  
   - **Results**:  
     - RMSE ≈ **573.8**  
     - R² Score ≈ **0.81** ✅

4. **Test Prediction & Submission**  
   - Predictions made on `test.csv`  
   - `submission.csv` generated for Kaggle submission

5. **Streamlit App**  
   - Users input bee counts, rain days, and fruit measurements  
   - Predicts blueberry yield in kg/ha  
   - Uses trained model (`rf_model.pkl`) behind the scenes
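
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training script: the helper names `train_and_evaluate` and `make_submission`, the `n_estimators` and `random_state` values, and the synthetic stand-in data are all assumptions, so the printed scores will not match the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(df: pd.DataFrame, target: str = "yield", seed: int = 42):
    """Drop `id`, split 80/20, fit a random forest, return (model, RMSE, R^2)."""
    X = df.drop(columns=[target, "id"], errors="ignore")
    y = df[target]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Hyperparameters are assumptions; the README does not state them.
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
    r2 = float(r2_score(y_val, pred))
    return model, rmse, r2


def make_submission(model, test_df: pd.DataFrame) -> pd.DataFrame:
    """Predict on test rows and return a frame with `id` and `yield` columns."""
    features = test_df.drop(columns=["id"])
    return pd.DataFrame({"id": test_df["id"], "yield": model.predict(features)})


# Synthetic stand-in data so the sketch runs without train.csv / test.csv.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame(
    {
        "id": np.arange(n),
        "fruitset": rng.uniform(0.2, 0.7, n),
        "fruitmass": rng.uniform(0.3, 0.6, n),
        "seeds": rng.uniform(20, 45, n),
    }
)
demo["yield"] = (
    8000 * demo["fruitset"] + 3000 * demo["fruitmass"] + rng.normal(0, 200, n)
)

model, rmse, r2 = train_and_evaluate(demo)
print(f"RMSE: {rmse:.1f}, R^2: {r2:.2f}")

# Pretend the first five rows (without the target) are the test set.
test_demo = demo.drop(columns=["yield"]).head(5)
submission = make_submission(model, test_demo)
# With the real data: submission.to_csv("submission.csv", index=False)
```

RMSE is computed as the square root of `mean_squared_error` so the sketch does not depend on version-specific keyword arguments.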

---

## 🚀 Try it Online

🌐 You can try this app live here:  
[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)

---

## 🔮 What Could Be Improved?

| Area | Suggestion |
|------|------------|
| Feature Engineering | Create interaction terms, try log/ratio features |
| Model | Try LightGBM, XGBoost, or stacking |
| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
| Visualization | Add interactive charts in Streamlit app |
| Real-World Data | Add satellite weather data, soil types, historical trends |
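
As an illustration of the tuning suggestion, here is a small `GridSearchCV` sketch. The parameter grid, the fold count, and the synthetic data are assumptions chosen only to keep the example fast; they are not tuned values from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in features and target so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 5000 * X[:, 0] + 2000 * X[:, 1] + rng.normal(0, 100, 200)

# Deliberately tiny grid (an assumption); a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, round(best_rmse, 1))
```

The same pattern applies to the real training data by replacing `X` and `y` with the preprocessed features and `yield`.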

---

## 📁 Project Structure

📦 blueberry-yield-regression
├── app.py
├── rf_model.pkl
├── model_columns.pkl
├── requirements.txt
├── submission.csv
└── README.md
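
One plausible way `app.py` could use `rf_model.pkl` and `model_columns.pkl` is sketched below. It assumes that `model_columns.pkl` stores the ordered list of training column names and that both files were written with the standard `pickle` module; the README does not confirm either point, and the helper names `load_artifact` and `predict_yield` are hypothetical.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def load_artifact(path: str):
    """Unpickle one saved artifact, e.g. "rf_model.pkl". Only load trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_yield(model, columns: list, inputs: dict) -> float:
    """Build a one-row frame in the training column order and predict."""
    missing = [c for c in columns if c not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    row = pd.DataFrame([inputs])[columns]  # reorder to match training
    return float(model.predict(row)[0])


# Stand-in model trained on made-up rows, so the sketch runs without the .pkl files.
columns = ["fruitset", "fruitmass", "seeds"]
train = pd.DataFrame(
    {
        "fruitset": [0.3, 0.4, 0.5, 0.6],
        "fruitmass": [0.35, 0.40, 0.45, 0.50],
        "seeds": [25.0, 30.0, 35.0, 40.0],
    }
)
target = [3000.0, 4500.0, 6000.0, 7500.0]
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train[columns], target)

# Inputs may arrive in any order, e.g. from form widgets.
inputs = {"seeds": 33.0, "fruitset": 0.45, "fruitmass": 0.42}
prediction = predict_yield(model, columns, inputs)
print(f"Predicted yield: {prediction:.0f} kg/ha")
```

In the Streamlit app, the `inputs` dictionary would be filled from widgets such as `st.number_input`, with `model` and `columns` coming from `load_artifact("rf_model.pkl")` and `load_artifact("model_columns.pkl")`.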


---

## 📜 License

MIT License – Free to use, modify and distribute.

---