Pharmaceutical Sales Analysis And Prediction
This project is based on the Rossmann Pharmaceuticals sales data posted on Kaggle, which was also part of the 10 Academy week 3 challenge. It is a machine learning project that covers data preprocessing, exploration, regression, and deep learning. If you are interested, continue reading.
Objective of the project
Rossmann Pharmaceuticals' finance team wants to forecast sales in all their stores across several cities six weeks ahead of time. Managers in individual stores rely on their years of experience as well as their personal judgment to forecast sales. The data team identified factors such as promotions, competition, school and state holidays, seasonality, and locality as necessary for predicting the sales across the various stores. The task is to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.
Given Data
We are given three datasets:
- Store.csv (1,115 rows): store information
Columns include store type, competition distance, and promotion data
- Train.csv (1,017,209 rows) and test.csv (41,088 rows)
Columns include store, sales, number of customers, promotions, and holidays
Tools Used in the Project
MLflow is a framework that plays an essential role in the end-to-end machine learning lifecycle. It helps you track ML experiments, including models, parameters, hyperparameters, and datasets, and reproduce them when needed.
- It is easy to set up a model tracking mechanism in MLflow.
- It offers intuitive APIs for serving.
- It supports the steps from model training through to taking the model to production.
Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that we are familiar with (Git, CI/CD, etc.).
- Along with data versioning, DVC also allows model and pipeline tracking.
- With DVC, you can return to a past state of your data, models, and results without rebuilding them from scratch.
Continuous Machine Learning (CML) is a set of tools and practices that brings widely used CI/CD processes to a Machine Learning workflow.
Pre-Processing and Exploratory Data Analysis
At this stage I cleaned the three given datasets. The tasks done at this stage include fixing data types, handling missing values, removing duplicates, and adding features.
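The cleaning tasks above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the project's actual code: the column names (`Date`, `StateHoliday`, `CompetitionDistance`) follow the Kaggle dataset, while the helper `clean_sales`, the median fill, and the derived features (`Year`, `Month`, `IsWeekend`) are assumptions chosen for the example.

```python
import pandas as pd

def clean_sales(train: pd.DataFrame, store: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fix types, fill gaps, drop duplicates, add features."""
    df = train.copy()
    # Fix data types: parse dates and make the holiday flag a consistent string
    df["Date"] = pd.to_datetime(df["Date"])
    df["StateHoliday"] = df["StateHoliday"].astype(str)
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Handle missing values (assumption: fill with the median distance)
    store = store.copy()
    median_distance = store["CompetitionDistance"].median()
    store["CompetitionDistance"] = store["CompetitionDistance"].fillna(median_distance)
    # Attach store information to every sales row
    df = df.merge(store, on="Store", how="left")
    # Add simple calendar features
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["IsWeekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)
    return df

# Tiny made-up sample standing in for Train.csv and Store.csv
train = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2015-07-31", "2015-07-31", "2015-08-01", "2015-08-02"],
    "Sales": [5263, 5263, 6064, 0],
    "StateHoliday": [0, 0, "0", "a"],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "CompetitionDistance": [1270.0, None],
})
cleaned = clean_sales(train, store)
print(cleaned)
```

On the real data the same function would be applied to the frames loaded with `pd.read_csv`, and the choice of fill value deserves a closer look than a blanket median.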
Comparing promotion distribution of train and test sets
Sales and Customer numbers by state holiday
Sales per Assortment Analysis
Correlation of 10 selected variables of the train data
Distribution of Sales and Customers by Day of the Week
Sales Prediction Using Random Forest Regression
Random Forest Regression is a supervised learning algorithm that uses an ensemble learning method for regression. Ensemble learning combines the predictions of multiple models to make a more accurate prediction than a single model.
For this prediction I chose sklearn's RandomForestRegressor. Because the model needed a lot of tuning, I used RandomizedSearchCV to search over its hyperparameters. The entire training, fitting, and predicting workflow runs in an ML pipeline that includes column transformation. The results are shown below.
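A pipeline of this kind might look like the sketch below. It runs on synthetic data, and the feature names, the scaler and encoder choices, and the parameter grid are all assumptions made for illustration; the project's actual columns and search space may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the merged train/store data (assumed columns)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Promo": rng.integers(0, 2, n),
    "CompetitionDistance": rng.uniform(100, 5000, n),
    "StoreType": rng.choice(["a", "b", "c"], n),
})
y = 5000 + 1500 * X["Promo"] - 0.2 * X["CompetitionDistance"] + rng.normal(0, 100, n)

# Column transformation: scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Promo", "CompetitionDistance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["StoreType"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=0)),
])

# Randomized search samples a few hyperparameter combinations
# instead of trying every one
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "model__n_estimators": [20, 50, 100],
        "model__max_depth": [3, 5, None],
        "model__min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
predictions = search.predict(X)
importances = search.best_estimator_.named_steps["model"].feature_importances_
print(best_params)
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the scaler and encoder only on its own training portion, which avoids leaking information from the held-out fold.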
Actual value vs Predicted value
Feature Importance
The Best Model's Results
Sales Prediction with Deep Learning
There are various forecasting approaches that help with sales prediction, such as a naive model, LSTM, and N-BEATS. For this project we will use the LSTM algorithm. LSTM stands for long short-term memory, a type of network used in deep learning. It is a variety of recurrent neural network (RNN) that is capable of learning long-term dependencies, especially in sequence prediction problems.
Steps Taken to Predict Using LSTM
1. Preprocess the data for the algorithm
- clean the data
- scale the data
- normalize the data
2. Visualize the data to understand it better
3. Split the dataset into training and validation sets and format it to feed into the LSTM.
4. Define and estimate the LSTM.
5. Forecast with the LSTM on the validation set and assess accuracy.
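To make steps 1 and 3 more concrete, here is a small NumPy sketch of scaling a series and turning it into the windowed, three-dimensional input that LSTM layers typically expect. The window size, the 80/20 split, and the helper names `min_max_scale` and `make_windows` are assumptions for illustration; the model definition itself is left out.

```python
import numpy as np

def min_max_scale(series):
    """Scale values to the [0, 1] range; also return the bounds for inverting."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo), lo, hi

def make_windows(series, window):
    """Build (samples, timesteps, features) inputs and next-step targets."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

# Made-up daily sales standing in for the real series
sales = np.arange(10, dtype=float) * 100.0
scaled, lo, hi = min_max_scale(sales)

# Each sample holds 3 past values; the target is the value that follows
X, y = make_windows(scaled, window=3)

# Chronological split: earlier windows for training, later ones for validation
split = int(len(X) * 0.8)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
print(X_train.shape, X_val.shape)
```

The split is chronological rather than random because shuffling a time series would let the model see the future during training.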
Descriptive Statistics and Visualizations of the Data
This stage is all about visualizing the dataset and preparing the data for the algorithm.
Time Series Plots with and without Scaled Data
Histogram view of Sales Data with and without Scaled Data
Autocorrelation Plot
Partial Autocorrelation Plot
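For readers unfamiliar with what an autocorrelation plot shows, the sketch below computes the sample autocorrelation directly with NumPy. The synthetic weekly signal and the helper name `autocorrelation` are assumptions for illustration; plotting libraries compute the same quantity before drawing it.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([
        np.dot(x[:len(x) - k], x[k:]) / denom
        for k in range(max_lag + 1)
    ])

# Synthetic signal with a 7-day cycle, standing in for daily sales
t = np.arange(70)
weekly = np.sin(2 * np.pi * t / 7)

acf = autocorrelation(weekly, max_lag=14)
for lag in (0, 3, 7, 14):
    print(lag, round(float(acf[lag]), 3))
```

A series with a weekly pattern shows peaks at lags 7 and 14, which is the kind of structure that helps when choosing how many past days to feed into the LSTM.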
Defining and Estimating LSTM
Two key variables are used to draw the graph below:
1. Loss (training loss): a metric used to assess how well a deep learning model fits the training data.
2. Val_loss (validation loss): a metric used to assess how well the model performs on the validation set, which is data it was not trained on.
LSTM Model Forecast Compared to Test Data
Mean Absolute Error of LSTM
Other Numerical results
Our final MAE was 0.13843362033367157, with an MSE (mean squared error) of 0.044740352779626846, which are good results.
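As a reminder of what these two metrics measure, here is a short NumPy sketch. The arrays are made-up values, not the project's predictions.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Made-up actual and predicted values
y_true = np.array([0.2, 0.5, 0.9])
y_pred = np.array([0.1, 0.7, 0.9])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(mae, mse)
```

Both metrics are expressed in the units of the values they are computed on, so errors measured on scaled data are not directly comparable to errors measured on raw sales.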
Conclusion
Throughout this project we performed data preprocessing, exploratory data analysis, prediction using machine learning, and prediction using LSTM, a deep learning algorithm for time series. We got pretty good results from the deep learning algorithm.
The code can be found on GitHub:
- Pharmaceutical Sales prediction (tesfayealex/Pharmaceutical-Sales-prediction)