Predicting the Sale Price of Bulldozers using Machine Learning

Introduction:

In this article, I'll discuss a project I worked on that aims to predict the sale price of bulldozers using machine learning models. Since the output variable we are trying to predict is a number, this kind of problem is known as a regression problem.

You can find the complete code in the GitHub repository. The approach is inspired by zerotomastery.io.

Data:

The data comes from the Kaggle Bluebook for Bulldozers competition. The evaluation metric is root mean squared log error (RMSLE), because that is the metric the competition uses.
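For reference, RMSLE is commonly written as below. The symbols are illustrative notation, not taken from the competition page: n is the number of sales, y_i the actual price and \hat{y}_i the predicted price.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^{2}}
```

Because the metric compares logarithms, it penalizes relative differences rather than absolute ones, so being off by $1,000 on a cheap machine counts for more than being off by $1,000 on an expensive one.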

Looking at the dataset, we can see that it has a time attribute (the sale date), so this is a time series problem.

There are 3 datasets:

Train.csv - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including SalePrice which is the target variable).
Valid.csv - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as Train.csv).
Test.csv - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the SalePrice attribute, as this is what we'll be trying to predict).

Data Exploration:

($) Parsing dates: When working with time series data, it's a good idea to make sure any date data is stored as a datetime object (a Python data type which encodes specific information about dates).
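As a minimal sketch of date parsing, the snippet below reads a tiny made-up CSV (standing in for Train.csv, with only two of its columns) and asks pandas to parse saledate while loading. The sample rows and their date format are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up sample standing in for Train.csv; values and date format are illustrative.
csv_text = """SalePrice,saledate
66000,11/16/2006 0:00
57000,3/26/2004 0:00
10000,2/26/2004 0:00
"""

# parse_dates tells pandas which columns to convert to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(df["saledate"].dtype)
# Once parsed, date parts are available through the .dt accessor.
print(df["saledate"].dt.year.tolist())
```

With the real file, the same parse_dates argument would be passed together with the path to Train.csv.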

($) Let's compare SalePrice and saledate using a scatter plot

[Figure: scatter plot of SalePrice against saledate]

We can see that there are very few sales in 2005 and more sales between 2007 and 2008.

($) Let's check the spread of SalePrice

[Figure: spread of SalePrice]
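The two plots described above could be drawn along these lines. This is only a sketch on a handful of made-up rows; the bin count and figure size are arbitrary choices, not the settings used for the original screenshots.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2004-02-26", "2006-11-16", "2007-05-03", "2007-09-21", "2008-01-10"]
    ),
    "SalePrice": [10000, 66000, 38500, 24000, 27000],
})

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Left: sale price against sale date.
ax_scatter.scatter(df["saledate"], df["SalePrice"])
ax_scatter.set_xlabel("saledate")
ax_scatter.set_ylabel("SalePrice")

# Right: distribution (spread) of sale prices.
ax_hist.hist(df["SalePrice"], bins=5)
ax_hist.set_xlabel("SalePrice")
ax_hist.set_ylabel("count")

# In a notebook the figure renders inline; in a script, call plt.show()
# or fig.savefig(...) to view it.
```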

($) As we're working on a time series problem and trying to predict future examples given past examples, it makes sense to sort our data by date.

df.sort_values(by=["saledate"], inplace=True, ascending=True)

($) Now let's see if we have any missing values

# Check for missing values
df_tmp.isna().sum()

Yes, there are some values missing from some columns.

As we know, before a machine learning model can be fit:

All of our data has to be numerical
There can't be any missing values

Convert strings to categories

One way to help turn all of our data into numbers is to convert the columns with the string datatype into a category datatype.
To do this we can use the pandas types API, which lets us inspect and manipulate the types of data.
Once the data is converted to categories, turn the categories into their numeric codes, and fill missing values with the median of that column.
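The steps above can be sketched as follows on a few made-up rows. The column names follow the dataset, but the values are illustrative, and adding 1 to the category codes (so that missing categories become 0 instead of -1) is an assumption of this sketch rather than something stated above.

```python
import pandas as pd

# Made-up rows; values are illustrative only.
df_tmp = pd.DataFrame({
    "UsageBand": ["Low", None, "High", "Medium"],
    "state": ["Texas", "Ohio", "Texas", None],
    "MachineHoursCurrentMeter": [68.0, None, 2838.0, 4640.0],
    "SalePrice": [66000, 57000, 10000, 38500],
})

for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        # Numeric column: fill gaps with the median of that column.
        if content.isna().any():
            df_tmp[label] = content.fillna(content.median())
    else:
        # String column: convert to a category, then to integer codes.
        # pandas encodes a missing category as -1, so +1 maps it to 0.
        df_tmp[label] = content.astype("category").cat.codes + 1

print(df_tmp)
print(df_tmp.isna().sum().sum())  # total number of missing values left
```

The median is used here because it is less sensitive to extreme values than the mean.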

Once our data is in numeric format and there are no missing values, we should be able to build a machine learning model!

Splitting data into train/valid sets

As this is a time series problem, we will split our data into training, validation and test sets using the dates.

In our case:

Training = all samples up until 2011
Valid = all samples from January 1, 2012 - April 30, 2012
Test = all samples from May 1, 2012 - November 2012
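A date-based split might look like the sketch below. It assumes the training and validation rows sit in one DataFrame with a parsed saledate column; if you load Train.csv and Valid.csv separately, the files already give you this split. The rows and the cutoff variable are illustrative.

```python
import pandas as pd

# Made-up rows spanning the train/validation boundary.
df_tmp = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2010-06-01", "2011-12-30", "2012-01-15", "2012-04-30"]
    ),
    "YearMade": [1996, 2001, 2004, 2007],
    "SalePrice": [21000, 35000, 40000, 52000],
})

# Everything before 2012 is training data; 2012 onwards is validation data.
cutoff = pd.Timestamp("2012-01-01")
df_train = df_tmp[df_tmp["saledate"] < cutoff]
df_valid = df_tmp[df_tmp["saledate"] >= cutoff]

# Separate the features from the target. The raw datetime column is dropped
# because the model needs numeric inputs.
X_train = df_train.drop(columns=["SalePrice", "saledate"])
y_train = df_train["SalePrice"]
X_valid = df_valid.drop(columns=["SalePrice", "saledate"])
y_valid = df_valid["SalePrice"]

print(len(X_train), len(X_valid))
```

Splitting by date rather than at random keeps future sales out of the training set, which matches how the model will be used.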

Modeling:

I've used a RandomForestRegressor model.

Fit the model to the training data.
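Fitting looks roughly like this. The sketch uses small synthetic data in place of the preprocessed bulldozer features, and the n_estimators and random_state values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the preprocessed training features and prices.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(200, 3))
y_train = 1000 * X_train[:, 0] + 5000 + rng.normal(0, 100, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# score() returns R^2; on the training data it will look optimistic.
print(model.score(X_train, y_train))
```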

The evaluation function used here is root mean squared log error (RMSLE).
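scikit-learn provides mean_squared_log_error, so one way to get RMSLE is to take its square root. The helper name rmsle and the example prices below are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root mean squared log error between true and predicted values."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))


# Made-up prices, purely to show the call.
y_true = np.array([10000.0, 20000.0, 30000.0])
y_pred = np.array([11000.0, 19000.0, 33000.0])

print(rmsle(y_true, y_pred))
```

Note that mean_squared_log_error rejects negative values, which is not an issue for sale prices.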

Hyperparameter tuning with RandomizedSearchCV

We can increase n_iter to try more combinations of hyperparameters, but in this case I've tried 20.
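A search of this kind could be set up as in the sketch below. The search space rf_grid is hypothetical (the grid actually used is not listed here), the data is synthetic, and n_iter and cv are kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 4))
y = 1000 * X[:, 0] + 500 * X[:, 1] + 5000 + rng.normal(0, 100, size=300)

# Hypothetical search space, for illustration only.
rf_grid = {
    "n_estimators": np.arange(10, 100, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    "max_features": [0.5, 1.0, "sqrt"],
    "max_samples": [100],  # must not exceed the rows in each training fold
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=rf_grid,
    n_iter=5,  # number of random combinations to try
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
```

Each increment of n_iter adds one more sampled combination, fitted once per cross-validation fold, so the run time grows roughly linearly with it.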

[Screenshot: hyperparameter search with RandomizedSearchCV]

The best parameters found were {'n_estimators': 90, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_samples': 10000, 'max_features': 0.5, 'max_depth': None}

Now train a model with the best parameters.

The output is as follows:

{'Training MAE': 2538.086916839285, 'Valid MAE': 5921.566785123334,
 'Training RMSLE': 0.12938511430324895, 'Valid RMSLE': 0.24299588986662657,
 'Training R^2': 0.9672433268997953, 'Valid R^2': 0.8812702169749176}
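A dictionary with these keys could be produced by a small helper like the one sketched below. The helper name show_scores and the synthetic data are illustrative. The model uses the best parameters listed above, except that max_samples is left out because the synthetic training set has far fewer than 10,000 rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_log_error


def show_scores(model, X_train, y_train, X_valid, y_valid):
    """Return MAE, RMSLE and R^2 on the training and validation sets."""
    train_preds = model.predict(X_train)
    valid_preds = model.predict(X_valid)
    return {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Valid MAE": mean_absolute_error(y_valid, valid_preds),
        "Training RMSLE": np.sqrt(mean_squared_log_error(y_train, train_preds)),
        "Valid RMSLE": np.sqrt(mean_squared_log_error(y_valid, valid_preds)),
        "Training R^2": model.score(X_train, y_train),
        "Valid R^2": model.score(X_valid, y_valid),
    }


# Synthetic stand-in for the real train/validation sets.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 4))
y = 1000 * X[:, 0] + 500 * X[:, 1] + 5000 + rng.normal(0, 100, size=400)
X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

# Best parameters from the search, minus max_samples (see the note above).
ideal_model = RandomForestRegressor(
    n_estimators=90,
    min_samples_split=4,
    min_samples_leaf=1,
    max_features=0.5,
    max_depth=None,
    random_state=1,
)
ideal_model.fit(X_train, y_train)

scores = show_scores(ideal_model, X_train, y_train, X_valid, y_valid)
print(scores)
```

Comparing the training and validation values side by side, as in the output above, gives a quick read on how much the model overfits.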

Make predictions on test data

Our model has been trained on data that went through a specific set of preprocessing steps.
This means that in order to make predictions on the test data, we need to apply the same steps we used to preprocess the training data.
Also check that the test and training data have the same columns, in the same order, to avoid errors at prediction time.
Once the test data has been preprocessed, use the model trained with the best parameters to make predictions on it.
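The column check and prediction step might look like the sketch below. The data is made up, the column example_is_missing is a hypothetical preprocessing artifact, and filling an absent column with False is an assumption of this sketch; the right fill value depends on what the column means.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up preprocessed training data.
X_train = pd.DataFrame({
    "YearMade": [1996, 2001, 2004, 2007],
    "MachineHoursCurrentMeter": [4640.0, 2838.0, 68.0, 722.0],
    "example_is_missing": [False, True, False, False],  # hypothetical column
})
y_train = pd.Series([21000, 35000, 40000, 52000])

# Made-up preprocessed test data: one column absent, others in a different order.
df_test = pd.DataFrame({
    "MachineHoursCurrentMeter": [1500.0, 300.0],
    "YearMade": [2003, 2008],
})

# Find columns the model was trained on that the test data lacks.
missing_cols = set(X_train.columns) - set(df_test.columns)
print(missing_cols)

# Assumption: an absent indicator column means nothing was missing.
for col in missing_cols:
    df_test[col] = False

# Match the training column order exactly.
df_test = df_test[X_train.columns]

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Predict on the test data; the model is not fit again.
test_preds = model.predict(df_test)
print(test_preds)
```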

Now we've built a model which is able to make predictions.