
Pharmaceutical Sales Analysis And Prediction

This project is based on the Rossmann Pharmaceuticals dataset posted on Kaggle, which was also included in the 10 Academy week 3 challenge. It is a machine learning project that covers data preprocessing, exploration, regression, and deep learning. If you are interested, continue reading.

Objective of the project

Rossmann Pharmaceutical's finance team wants to forecast sales in all their stores across several cities six weeks ahead of time. Managers in individual stores rely on their years of experience as well as their personal judgment to forecast sales. The data team identified factors such as promotions, competition, school and state holidays, seasonality, and locality as necessary for predicting the sales across the various stores. The task is to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.

Given Data

We are given three datasets:

  • Store.csv (1,115 rows) - contains store information

Columns include store type, competition distance, and promotion data.

  • train.csv (1,017,209 rows) and test.csv (41,088 rows)

Columns include store, sales, number of customers, promotions, and holidays.

Tools Used in the Project

MLflow is a framework that plays an essential role in the end-to-end machine learning lifecycle. It helps you track your ML experiments, including your models, model parameters, datasets, and hyperparameters, and reproduce them when needed.

  • It is easy to set up a model tracking mechanism in MLflow.
  • It offers very intuitive APIs for serving.
  • It supports the full lifecycle: data collection, data preparation, model training, and taking the model to production.

Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that we are familiar with (Git, CI/CD, etc.).

  • Along with data versioning, DVC also allows model and pipeline tracking.
  • With DVC, you don't need to rebuild previous models or data modeling techniques to reproduce past results.

Continuous Machine Learning (CML) is a set of tools and practices that brings widely used CI/CD processes to a Machine Learning workflow.

Pre-Processing and Exploratory Data Analysis

At this stage I cleaned the three given datasets. The tasks done at this stage include fixing datatypes, filling missing values, removing duplicates, and adding features.
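A minimal sketch of this cleaning stage, assuming pandas and illustrative column names (Date, Sales) in the spirit of the Rossmann schema; the actual cleaning steps in the project may differ:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass: fix dtypes, missing values, duplicates, features."""
    df = df.copy()
    # Fix datatypes: parse the date column into real datetimes.
    df["Date"] = pd.to_datetime(df["Date"])
    # Fix missing values: fill numeric gaps with the column median.
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Add features derived from the date.
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["DayOfWeek"] = df["Date"].dt.dayofweek
    return df
```

The same function can be applied to each of the three datasets before merging them for modeling.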

Comparing promotion distribution of train and test sets


Sales and Customer numbers by state holiday


Sales per Assortment Analysis


Correlation of 10 selected variables in the train data


Distribution of Sales and Customers by Day of the Week


Sales Prediction Using Random Forest Regression

Random Forest Regression is a supervised learning algorithm that uses an ensemble learning method for regression. Ensemble learning combines predictions from multiple machine learning models to produce a more accurate prediction than any single model. Using sklearn's RandomForestRegressor, I obtained the results below.

For this prediction I chose sklearn's RandomForestRegressor and used RandomizedSearchCV to tune its hyperparameters. The entire training, fitting, and predicting is done through an ML pipeline that includes column transformation.
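A minimal sketch of that setup, assuming scikit-learn; the feature names and search space below are illustrative, since the post does not show the actual ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_search(numeric_cols, categorical_cols):
    """Pipeline: column transformation -> random forest, tuned via RandomizedSearchCV."""
    transform = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    pipe = Pipeline([
        ("transform", transform),
        ("model", RandomForestRegressor(random_state=42)),
    ])
    # Randomly sample parameter combinations instead of an exhaustive grid search.
    param_distributions = {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 10, 20],
    }
    return RandomizedSearchCV(pipe, param_distributions, n_iter=4, cv=3, random_state=42)
```

Calling `fit` on the returned object trains the whole pipeline, so the same column transformation is applied consistently at training and prediction time.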


Actual value vs Predicted value


Feature Importance


The Best Model's Results


Sales Prediction with Deep Learning

There are various deep learning algorithms that help with sales forecasting, such as the naïve model, LSTM, N-BEATS, and so on. For this project we will use the LSTM algorithm. LSTM stands for long short-term memory, a type of recurrent neural network (RNN) used in deep learning that is capable of learning long-term dependencies, especially in sequence prediction problems.

Steps Taken to Predict Using LSTM

1. Preprocess the data for the algorithm:

- clean the data

- scale the data

- normalize the data

2. Visualize the data to understand it better.

3. Split the dataset into training and validation sets, and reshape it into the format an LSTM expects.

4. Define and estimate the LSTM model.

5. Forecast with the LSTM on the validation set and assess its accuracy.
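The scaling and windowing steps above can be sketched with plain NumPy: scale the series to [0, 1] and slice it into fixed-length windows shaped (samples, timesteps, features), which is the 3-D input an LSTM layer expects. The window length and split ratio here are illustrative assumptions, not the project's actual values:

```python
import numpy as np

def make_lstm_windows(series, timesteps=14, train_frac=0.8):
    """Min-max scale a 1-D series and frame it as supervised LSTM input."""
    series = np.asarray(series, dtype=float)
    # Scale to [0, 1] (min-max normalization).
    lo, hi = series.min(), series.max()
    scaled = (series - lo) / (hi - lo)
    # Each window of `timesteps` values is used to predict the next value.
    X, y = [], []
    for i in range(len(scaled) - timesteps):
        X.append(scaled[i:i + timesteps])
        y.append(scaled[i + timesteps])
    X = np.array(X)[..., np.newaxis]  # shape: (samples, timesteps, 1)
    y = np.array(y)
    # Chronological train/validation split (no shuffling for time series).
    split = int(len(X) * train_frac)
    return (X[:split], y[:split]), (X[split:], y[split:])
```

The resulting arrays can be fed directly to an LSTM layer in a framework such as Keras.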

Descriptive Statistics and Visualizations of the Data.

At this stage, it is all about visualizing the dataset and preparing the data for the algorithm.

Time Series Plots with and without Scaled Data


Histogram View of Sales Data with and without Scaling


Autocorrelation Plot


Partial Autocorrelation Plots
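The autocorrelation values behind plots like these can be computed directly; a minimal NumPy sketch of the standard sample estimator (for the actual figures, libraries such as statsmodels provide `plot_acf` and `plot_pacf`):

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a 1-D series at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    # Covariance between the series and its lagged copy, normalized by variance.
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))
```

High autocorrelation at weekly lags, for example, is what motivates day-of-week features in the sales data.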


Defining and Estimating LSTM

Two key variables are used to draw the graph below:

1. Loss (training loss) - a metric used to assess how well a deep learning model fits the training data.

2. Val_loss (validation loss) - a metric used to assess how well the model fits data held out from training (the validation set); if val_loss rises while the training loss keeps falling, the model is overfitting.


LSTM Model Forecast Compared to Test Data


Mean Absolute Error of LSTM


Other Numerical results

Our final MAE value came to 0.13843362033367157, with our MSE (mean squared error) at 0.044740352779626846, which are good results.
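For reference, the two metrics reported above are straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```

MSE penalizes large errors more heavily than MAE, which is why both are worth reporting for a forecast.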


Throughout this project we have performed data preprocessing, exploratory data analysis, prediction using machine learning, and prediction using LSTM, a time-series deep learning algorithm. The deep learning algorithm gave us quite good results.

The code can be found on
