Upload 5 files
- README.md +116 -20
- app.py +50 -0
- model_columns.pkl +3 -0
- requirements.txt +5 -3
- rf_model.pkl +3 -0
README.md
CHANGED
@@ -1,20 +1,116 @@
----
-title: Blueberry Yield Regression
-emoji:
-colorFrom:
-colorTo:
-sdk:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+---
+title: "🍇 Blueberry Yield Regression"
+emoji: 🌾
+colorFrom: indigo
+colorTo: green
+sdk: streamlit
+app_file: app.py
+pinned: true
+license: mit
+tags:
+- regression
+- machine-learning
+- streamlit
+- kaggle
+- agriculture
+---
+
+# 🍇 Blueberry Yield Prediction with Machine Learning
+
+This project is a complete machine learning pipeline that predicts the **yield of wild blueberries** using various environmental and biological features such as pollinator counts, rainfall, and fruit measurements.
+
+## 📌 Project Type
+
+- Supervised Learning
+- Regression Problem
+
+---
+
+## 🔍 Problem Description
+
+Predicting agricultural yield is a crucial component in planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:
+
+- Different species of pollinators (honeybee, bumblebee, osmia...)
+- Environmental conditions (rainfall days, temperature ranges...)
+- Fruit attributes (fruit mass, fruit set, seed count...)
+
+🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.
+
+---
+
+## 📊 Dataset Info
+
+- `train.csv`: 15,289 samples with 18 features
+- `test.csv`: same structure, no target
+- No missing values, clean numerical data
+
+---
+
+## 📈 What We Did (Pipeline Summary)
+
+1. **EDA (Exploratory Data Analysis)**
+   - Checked for missing values ✅
+   - Analyzed feature distributions & target (`yield`)
+   - Built correlation heatmaps; strongest positive correlations:
+     - `fruitmass`, `fruitset`, `seeds`
+
+2. **Data Preprocessing**
+   - Removed `id` column
+   - Standard feature selection based on correlation
+   - No categorical encoding needed (all numerical)
+
+3. **Model Training**
+   - Model: `RandomForestRegressor`
+   - Train-Test Split: 80/20
+   - **Results**:
+     - RMSE ≈ **573.8**
+     - R² Score ≈ **0.81** ✅
+
+4. **Test Prediction & Submission**
+   - Predictions made on `test.csv`
+   - `submission.csv` generated for Kaggle submission
+
+5. **Streamlit App**
+   - Users input bee counts, rain days, and fruit measurements
+   - Predicts blueberry yield in kg/ha
+   - Uses trained model (`rf_model.pkl`) behind the scenes
+
+---
+
+## 🚀 Try it Online
+
+🌐 You can try this app live here:
+[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)
+
+---
+
+## 🔮 What Could Be Improved?
+
+| Area | Suggestion |
+|------|------------|
+| Feature Engineering | Create interaction terms, try log/ratio features |
+| Model | Try LightGBM, XGBoost, or stacking |
+| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
+| Visualization | Add interactive charts in Streamlit app |
+| Real-World Data | Add satellite weather data, soil types, historical trends |
+
+---
+
+## 📁 Project Structure
+
+📦 blueberry-yield-regression
+├── app.py
+├── rf_model.pkl
+├── model_columns.pkl
+├── requirements.txt
+├── submission.csv
+└── README.md
+
+
+---
+
+## 📜 License
+
+MIT License – Free to use, modify and distribute.
+
+---
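Review note: the README reports RMSE ≈ 573.8 and R² ≈ 0.81 on the 20% hold-out, but the evaluation code is not part of this commit. As a reminder of what those two numbers measure, here is a minimal dependency-free sketch. The function names and the toy values are assumptions made for illustration; scikit-learn provides equivalent metrics in `sklearn.metrics`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in target units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy hold-out values (hypothetical, not taken from the competition data).
y_val = [4000.0, 5000.0, 6000.0, 7000.0]
y_hat = [4200.0, 4900.0, 6100.0, 6600.0]
print(f"RMSE = {rmse(y_val, y_hat):.1f}, R2 = {r2(y_val, y_hat):.3f}")
```

Because RMSE is expressed in the target's units, the README's 573.8 is only meaningful relative to the spread of `yield` in the validation split, which is why it is reported together with R².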
app.py
ADDED
@@ -0,0 +1,50 @@
+import streamlit as st
+import pandas as pd
+import numpy as np
+import joblib
+
+# Title
+st.title("🍇 Blueberry Yield Prediction App")
+st.write("This app predicts wild blueberry yield from environmental and biological factors.")
+
+# Input fields
+clonesize = st.slider("Clone Size", 0.0, 10.0, 1.0)
+honeybee = st.slider("Honeybee Count", 0.0, 10.0, 1.0)
+bumbles = st.slider("Bumblebee Count", 0.0, 10.0, 1.0)
+andrena = st.slider("Andrena Count", 0.0, 10.0, 1.0)
+osmia = st.slider("Osmia Count", 0.0, 10.0, 1.0)
+RainingDays = st.slider("Raining Days", 0.0, 100.0, 20.0)
+AverageRainingDays = st.slider("Average Raining Days", 0.0, 100.0, 30.0)
+fruitset = st.slider("Fruit Set", 0.0, 1.0, 0.5)
+fruitmass = st.slider("Fruit Mass", 0.0, 10.0, 5.0)
+seeds = st.slider("Seed Count", 0.0, 100.0, 50.0)
+
+# Convert to a DataFrame
+user_input = pd.DataFrame([{
+    "clonesize": clonesize,
+    "honeybee": honeybee,
+    "bumbles": bumbles,
+    "andrena": andrena,
+    "osmia": osmia,
+    "RainingDays": RainingDays,
+    "AverageRainingDays": AverageRainingDays,
+    "fruitset": fruitset,
+    "fruitmass": fruitmass,
+    "seeds": seeds
+}])
+
+# Load the model and its column list
+model = joblib.load("rf_model.pkl")
+model_columns = joblib.load("model_columns.pkl")
+
+# Add any missing columns (filled with 0)
+for col in model_columns:
+    if col not in user_input.columns:
+        user_input[col] = 0
+
+user_input = user_input[model_columns]
+
+# Prediction
+if st.button("Show Prediction"):
+    pred = model.predict(user_input)[0]
+    st.success(f"🌱 Predicted Blueberry Yield: {pred:.2f} kg/ha")
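Review note: `app.py` fills any column listed in `model_columns.pkl` that has no slider with 0 and then reorders the frame. The same alignment can be done in one step with `DataFrame.reindex`, and returning the list of filled columns makes silent zero-filling visible. This is a sketch, not the author's code: the helper name `align_to_model_columns` is hypothetical, and `MaxOfUpperTRange` is only an assumed example of a column the model might expect.

```python
import pandas as pd


def align_to_model_columns(row, model_columns, fill_value=0.0):
    """Return a one-row frame in the model's column order, plus the columns that were filled.

    Columns missing from `row` are created with `fill_value`;
    columns not listed in `model_columns` are dropped.
    """
    frame = pd.DataFrame([row])
    missing = [col for col in model_columns if col not in frame.columns]
    aligned = frame.reindex(columns=model_columns, fill_value=fill_value)
    return aligned, missing


# Hypothetical example: the model expects one column that the UI does not collect.
expected = ["fruitset", "MaxOfUpperTRange", "seeds"]
aligned, missing = align_to_model_columns({"seeds": 30.0, "fruitset": 0.5}, expected)
print(list(aligned.columns), missing)
```

If `missing` is non-empty, the app could warn the user, since a tree-based model trained on real values may not respond sensibly to a constant 0 in those features.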
model_columns.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2f8f2353d1c8c3d79295e022ad6bd9a36aa8bc6bb2ce3f6b597b67cc2fea59ac
+size 255
requirements.txt
CHANGED
@@ -1,3 +1,5 @@
-
-pandas
-
+streamlit
+pandas
+numpy
+scikit-learn
+joblib
rf_model.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:68b74682bb46d81c2aa0e680cea3abae0a97da6f372a366babe5a3bebd77e300
+size 108065345
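Review note: the two `.pkl` entries in this diff are Git LFS pointer files, not the pickles themselves. Each pointer is three `key value` lines giving the spec version, the content hash, and the size in bytes. The sketch below reads the `rf_model.pkl` pointer shown above; the helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:68b74682bb46d81c2aa0e680cea3abae0a97da6f372a366babe5a3bebd77e300
size 108065345
"""
info = parse_lfs_pointer(pointer)
size_mib = info["size_bytes"] / (1024 ** 2)
print(f"{info['hash_algo']} object, {size_mib:.1f} MiB")
```

The size field shows the pickled forest is a little over 103 MiB, while `model_columns.pkl` is only 255 bytes.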