---

title: "🍇 Blueberry Yield Regression"
emoji: 🌾
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: true
license: mit
tags:
  - regression
  - machine-learning
  - streamlit
  - kaggle
  - agriculture
---


# 🍇 Blueberry Yield Prediction with Machine Learning

This project is a complete machine learning pipeline that predicts the **yield of wild blueberries** using various environmental and biological features such as pollinator counts, rainfall, and fruit measurements.

## 📌 Project Type

- Supervised Learning
- Regression Problem

---

## 🔍 Problem Description

Predicting agricultural yield is a crucial component in planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:

- Different species of pollinators (honeybee, bumblebee, osmia...)
- Environmental conditions (rainfall days, temperature ranges...)
- Fruit attributes (fruit mass, fruit set, seed count...)

🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.

---

## 📊 Dataset Info

- `train.csv`: 15,289 samples with 18 columns (including `id` and the `yield` target)
- `test.csv`: same structure, no target
- No missing values, clean numerical data
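
The checks described here (no missing values, all-numeric columns) and a ranking of features by correlation with `yield` can be sketched as follows. This is a minimal illustration, not the project's actual notebook: the inline CSV is a made-up stand-in for `train.csv`, the column subset and the name `RainingDays` are assumptions, and `rank_by_correlation` is a hypothetical helper.

```python
import io

import pandas as pd

# Tiny made-up stand-in for train.csv (the real file has 15,289 rows).
# Column names are assumed for illustration; values are invented.
SAMPLE_CSV = """id,honeybee,RainingDays,fruitset,fruitmass,seeds,yield
0,0.50,16.0,0.42,0.41,32.1,4400.0
1,0.50,1.0,0.56,0.48,38.9,6800.0
2,0.25,24.0,0.38,0.39,30.2,3900.0
3,0.75,34.0,0.33,0.36,27.5,3100.0
4,0.25,3.8,0.52,0.46,36.8,6200.0
"""


def rank_by_correlation(df: pd.DataFrame, target: str = "yield") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    features = df.drop(columns=["id", target])
    corr = features.corrwith(df[target])
    # Sort by absolute value so strong negative correlations are not buried.
    return corr.sort_values(key=abs, ascending=False)


df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Sanity checks: total count of missing cells and whether every column is numeric.
missing_total = int(df.isna().sum().sum())
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)

ranking = rank_by_correlation(df)
print(f"missing values: {missing_total}, all numeric: {all_numeric}")
print(ranking)
```

To run this against the real data, replace the `io.StringIO(SAMPLE_CSV)` argument with the path to `train.csv`.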

---

## 📈 What We Did (Pipeline Summary)

1. **EDA (Exploratory Data Analysis)**  
   - Checked for missing values ✅  
   - Analyzed feature distributions & target (`yield`)  
   - Built correlation heatmaps; strongest positive correlations:  
     - `fruitmass`, `fruitset`, `seeds`

2. **Data Preprocessing**  
   - Removed `id` column  
   - Standard feature selection based on correlation  
   - No categorical encoding needed (all numerical)

3. **Model Training**  
   - Model: `RandomForestRegressor`  
   - Train-Test Split: 80/20  
   - **Results**:  
     - RMSE ≈ **573.8**  
     - R² Score ≈ **0.81** ✅

4. **Test Prediction & Submission**  
   - Predictions made on `test.csv`  
   - `submission.csv` generated for Kaggle submission

5. **Streamlit App**  
   - Users input bee counts, rain days, and fruit measurements  
   - Predicts blueberry yield in kg/ha  
   - Uses trained model (`rf_model.pkl`) behind the scenes
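
The training and evaluation step (80/20 split, `RandomForestRegressor`, RMSE and R²) might look roughly like the sketch below. It runs on synthetic data, so its scores will not match the numbers reported above; the three feature columns, the target formula, `n_estimators=100`, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption, not the real dataset).
rng = np.random.default_rng(42)
n_samples = 500
X = rng.uniform(0.0, 1.0, size=(n_samples, 3))  # three hypothetical features
y = 3000 + 4000 * X[:, 0] + 1500 * X[:, 1] + rng.normal(0, 300, size=n_samples)

# 80/20 train-test split, as described in the pipeline summary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE is the square root of the mean squared error; R² is the share of
# target variance explained by the model on the held-out split.
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"RMSE: {rmse:.1f}  R2: {r2:.3f}")
```

Taking the square root of `mean_squared_error` by hand keeps the sketch independent of which scikit-learn version is installed.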

---

## 🚀 Try it Online

🌐 You can try this app live here:  
[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)

---

## 🔮 What Could Be Improved?

| Area | Suggestion |
|------|------------|
| Feature Engineering | Create interaction terms, try log/ratio features |
| Model | Try LightGBM, XGBoost, or stacking |
| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
| Visualization | Add interactive charts in Streamlit app |
| Real-World Data | Add satellite weather data, soil types, historical trends |
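
As one illustration of the tuning suggestion, a small `GridSearchCV` run over a random forest could be sketched like this. The data is synthetic, and the parameter grid, fold count, and scoring choice are assumptions picked to keep the example fast, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, stands in for the blueberry features).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 4))
y = 5000 * X[:, 0] + 1000 * X[:, 1] ** 2 + rng.normal(0, 200, size=200)

# Deliberately tiny grid: 2 x 2 combinations, 3-fold cross-validation each.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence negated RMSE
    cv=3,
)
search.fit(X, y)

# Flip the sign back to get the cross-validated RMSE of the best setting.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 1))
```

For larger search spaces, a sampler-based tool such as Optuna usually explores more efficiently than an exhaustive grid.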

---

## 📁 Project Structure

📦 blueberry-yield-regression  
├── app.py  
├── rf_model.pkl  
├── model_columns.pkl  
├── requirements.txt  
├── submission.csv  
└── README.md
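
The presence of `model_columns.pkl` next to `rf_model.pkl` suggests the app stores the training column order so that user input can be arranged to match it before prediction. The README does not confirm this, so the sketch below is only one plausible way to do it: `build_feature_row` is a hypothetical helper, the column names are assumed, and filling unspecified features with `0.0` is a placeholder choice.

```python
from typing import Mapping, Sequence


def build_feature_row(
    user_inputs: Mapping[str, float],
    model_columns: Sequence[str],
    default: float = 0.0,
) -> list[float]:
    """Order user-supplied values to match the column order used in training.

    Features the user did not supply are filled with `default`; unknown
    feature names are rejected so typos do not pass silently.
    """
    unknown = set(user_inputs) - set(model_columns)
    if unknown:
        raise KeyError(f"Unexpected feature(s): {sorted(unknown)}")
    return [float(user_inputs.get(col, default)) for col in model_columns]


# Hypothetical column order, as it might be stored in model_columns.pkl.
model_columns = ["honeybee", "RainingDays", "fruitset", "fruitmass", "seeds"]

row = build_feature_row(
    {"seeds": 35.0, "honeybee": 0.5, "fruitset": 0.5}, model_columns
)
print(row)  # [0.5, 0.0, 0.5, 0.0, 35.0]
```

In a real app, the resulting row would be wrapped in a two-dimensional structure (one row, one column per feature) before being passed to the model's `predict` method, and training-set means would likely be a better fallback than zero.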


---

## 📜 License

MIT License: free to use, modify, and distribute.

---