Crop Yield Predictor
Introduction
This project implements a comprehensive pipeline for predicting cereal crop yields in Nepal using machine learning techniques. The workflow begins with raw data extraction via Optical Character Recognition (OCR) and culminates in model training and performance evaluation. The pipeline is built using Python and leverages powerful tools such as Tesseract OCR, Pandas, NumPy, scikit-learn, Matplotlib, and Seaborn.
Methodology
OCR Data Extraction (Tesseract)
- Extract tabular data from scanned government reports and PDFs.
- Convert raster text into machine-readable formats.
Data Cleaning & Preparation (Pandas, NumPy)
- Cleaning Tasks:
- Removal of null values and duplicate records
- Standardization of column names and date formats
- Normalization and outlier treatment
- Feature Engineering:
- Derived new features such as rainfall deviation, average temperature bands, and yield-per-hectare
- Aggregated data across districts and crop types
Exploratory Data Analysis (Matplotlib, Seaborn)
- Visualizations Created:
- Crop yield trends over time
- Correlation heatmaps between climatic variables and yield
- Boxplots and scatter plots for yield distribution and anomalies
- Purpose:
- Gain domain insights
- Identify potential feature importance for model training
Machine Learning Model Training (scikit-learn)
- Algorithms Used:
- Random Forest Regressor
- Support Vector Regressor (SVR)
- Linear Regression
- Process:
- Split data into training and testing sets
- Applied feature scaling where necessary
- Hyperparameter tuning using GridSearchCV
- Performance Metrics:
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R² Score
Key Features
- OCR-driven data ingestion using Tesseract
- Flexible and robust data manipulation with Pandas
- Statistical transformations and numerical analysis via NumPy
- Visual analytics powered by Matplotlib and Seaborn
- ML prediction using Random Forest, SVR, and Linear Regression
- Model explainability through feature importance plots
Outcome
- Best Performing Model: Random Forest Regressor
- Top Influencing Factors: Rainfall, Minimum Temperature, Cultivated Area
- Use Case: Insight into climatic impacts on crop productivity and policy formulation
Roadmap
- Integrate real-time weather APIs
- Build a REST API for serving predictions
- Incorporate deep learning models (e.g., LSTM for time-series prediction)
- Extend to other crops and integrate soil profile data
The final trained RF model can be found in this here.