Crop Yield Predictor

Introduction

This project implements an end-to-end pipeline for predicting cereal crop yields in Nepal with machine learning. The workflow starts with raw data extraction via Optical Character Recognition (OCR) and ends with model training and performance evaluation. The pipeline is written in Python and uses Tesseract OCR, Pandas, NumPy, scikit-learn, Matplotlib, and Seaborn.

Methodology

OCR Data Extraction (Tesseract)

Extract tabular data from scanned government reports and PDFs.
Convert raster text into machine-readable formats.
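
The extraction step can be sketched roughly as follows. This is an illustration, not the project's actual code: it assumes the `pytesseract` wrapper and Pillow are installed and the Tesseract binary is on the PATH, and the row layout (district, year, area, production) and the names `ocr_page` and `parse_rows` are hypothetical.

```python
import re

# Hypothetical row layout: "<district> <year> <area> <production>".
# Numbers may contain thousands separators and an optional decimal part.
NUMBER = r"\d[\d,]*(?:\.\d+)?"
ROW_PATTERN = re.compile(
    rf"^(?P<district>[A-Za-z ]+?)\s+(?P<year>\d{{4}})\s+"
    rf"(?P<area>{NUMBER})\s+(?P<production>{NUMBER})$"
)


def ocr_page(image_path):
    """Run Tesseract on one scanned page and return the raw text."""
    # Imported lazily so parse_rows() can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    # --psm 6 tells Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(Image.open(image_path), config="--psm 6")


def parse_rows(raw_text):
    """Turn raw OCR text into a list of dict records, skipping unparseable lines."""
    rows = []
    for line in raw_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # headers, footers, and lines the OCR garbled
        rows.append(
            {
                "district": match["district"].strip(),
                "year": int(match["year"]),
                "area_ha": float(match["area"].replace(",", "")),
                "production_t": float(match["production"].replace(",", "")),
            }
        )
    return rows
```

A call such as `parse_rows(ocr_page("report_page.png"))` (file name hypothetical) would yield records that can be passed straight to `pandas.DataFrame`. Real scans usually need per-report adjustments to the pattern.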

Data Cleaning & Preparation (Pandas, NumPy)

Cleaning Tasks:
- Removal of null values and duplicate records
- Standardization of column names and date formats
- Normalization and outlier treatment
Feature Engineering:
- Derived new features such as rainfall deviation, average temperature bands, and yield-per-hectare
- Aggregated data across districts and crop types
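
A minimal sketch of a few of these steps, under stated assumptions: the column names (`district`, `rainfall_mm`, `area_ha`, `production_t`) are hypothetical, and the 1.5 x IQR clipping rule is one common choice for outlier treatment, not necessarily the one used in this project.

```python
import pandas as pd


def clean_and_engineer(df):
    """Clean a raw yield table and add two derived features."""
    out = df.copy()

    # Standardize column names: trim, lower-case, spaces to underscores.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )

    # Remove duplicate records, then rows with missing values.
    out = out.drop_duplicates().dropna()

    # Derived feature: yield per hectare.
    out["yield_t_per_ha"] = out["production_t"] / out["area_ha"]

    # Derived feature: rainfall deviation from each district's own mean.
    district_mean = out.groupby("district")["rainfall_mm"].transform("mean")
    out["rainfall_deviation"] = out["rainfall_mm"] - district_mean

    # Outlier treatment (assumed rule): clip yield to the 1.5 * IQR fences.
    q1, q3 = out["yield_t_per_ha"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["yield_t_per_ha"] = out["yield_t_per_ha"].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
    )

    return out.reset_index(drop=True)
```

Aggregation across districts or crop types could then follow with `groupby(...).agg(...)` on the cleaned frame.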

Exploratory Data Analysis (Matplotlib, Seaborn)

Visualizations Created:
- Crop yield trends over time
- Correlation heatmaps between climatic variables and yield
- Boxplots and scatter plots for yield distribution and anomalies
Purpose:
- Gain domain insights
- Identify candidate features for model training
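
As an illustration of the correlation heatmap, the sketch below computes pairwise Pearson correlations and renders them with Seaborn. The function name, output file name, and column names are assumptions made for the example.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df, columns, out_path="correlation_heatmap.png"):
    """Save a heatmap of pairwise correlations and return the matrix."""
    corr = df[columns].corr()  # Pearson correlation by default

    fig, ax = plt.subplots(figsize=(6, 5))
    # Fix the color scale to [-1, 1] so colors are comparable across plots.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation between climatic variables and yield")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return corr
```

Returning the matrix as well as saving the figure makes it easy to inspect strong correlations numerically before deciding which variables to carry into training.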

Machine Learning Model Training (scikit-learn)

Algorithms Used:
- Random Forest Regressor
- Support Vector Regressor (SVR)
- Linear Regression
Process:
- Split the data into training and testing sets
- Applied feature scaling where necessary
- Tuned hyperparameters with GridSearchCV
Performance Metrics:
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R² Score
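
The training and evaluation loop might look like the sketch below. The parameter grids, the 80/20 split, the 3-fold cross-validation, and the function name `train_and_evaluate` are illustrative assumptions rather than the project's actual settings. Scaling is applied only to SVR, which is sensitive to feature magnitudes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def train_and_evaluate(X, y, random_state=42):
    """Tune three regressors and report RMSE, MAE, and R2 on a held-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # (estimator, parameter grid) pairs; the grids are deliberately small.
    candidates = {
        "random_forest": (
            RandomForestRegressor(random_state=random_state),
            {"n_estimators": [50, 100], "max_depth": [None, 5]},
        ),
        # The scaler lives inside the pipeline so it is fit on training folds only.
        "svr": (
            make_pipeline(StandardScaler(), SVR()),
            {"svr__C": [1, 10], "svr__epsilon": [0.1, 0.5]},
        ),
        # An empty grid simply cross-validates the default model.
        "linear_regression": (LinearRegression(), {}),
    }

    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(
            estimator, grid, cv=3, scoring="neg_root_mean_squared_error"
        )
        search.fit(X_train, y_train)
        pred = search.predict(X_test)  # uses the refit best estimator
        results[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "mae": float(mean_absolute_error(y_test, pred)),
            "r2": float(r2_score(y_test, pred)),
            "best_params": search.best_params_,
        }
    return results
```

Keeping the test set out of the grid search means the reported metrics reflect data the tuned models have never seen.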

Key Features

OCR-driven data ingestion using Tesseract
Flexible and robust data manipulation with Pandas
Statistical transformations and numerical analysis via NumPy
Visual analytics powered by Matplotlib and Seaborn
ML prediction using Random Forest, SVR, and Linear Regression
Model explainability through feature importance plots
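
A feature importance plot for a fitted tree ensemble can be produced along these lines. This is a sketch: the function name and output path are hypothetical, and it relies on scikit-learn's impurity-based `feature_importances_` attribute, which may differ from whatever importance measure the project reports.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_importances(model, feature_names, out_path="feature_importances.png"):
    """Save a bar chart of importances and return them, largest first."""
    importances = pd.Series(model.feature_importances_, index=feature_names)

    fig, ax = plt.subplots(figsize=(6, 4))
    # Sort ascending for plotting so the largest bar ends up at the top.
    importances.sort_values().plot.barh(ax=ax)
    ax.set_xlabel("Impurity-based importance")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

    return importances.sort_values(ascending=False)
```

Impurity-based importances can favor high-cardinality features, so permutation importance (`sklearn.inspection.permutation_importance`) is a useful cross-check.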

Outcome

Best Performing Model: Random Forest Regressor
Top Influencing Factors: Rainfall, Minimum Temperature, Cultivated Area
Use Case: Insight into climatic impacts on crop productivity, to support policy formulation

Roadmap

Integrate real-time weather APIs
Build a REST API for serving predictions
Incorporate deep learning models (e.g., LSTM for time-series prediction)
Extend to other crops and integrate soil profile data

The final trained Random Forest model for the Crop Yield Predictor can be found here.

GitHub Repository

Thesis of my Research