Machine Learning – Part 3 – Regression

Regression is a Supervised Machine Learning technique for modelling a target value based on independent predictors. A Regression algorithm builds a model from the features of the training data and then uses that model to predict the value for new data. Regression is mostly used to perform forecasting, trend analysis, time-series prediction, response modelling etc., by finding the cause-and-effect relationship between the variables. The variables can be of two types:

  • Input Variables (also known as “Features”, “Explanatory Variables” or “Independent Variables”)
  • Output Variables (also known as “Target Variables” or “Dependent Variables”)

There are various types of Regression techniques which usually differ based on the number of Features and the type of relationship between the Features and the Target Variables. Some of them are:

  • Simple Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Support Vector Regression
  • Decision Tree Regression
  • Random Forest Regression
  • Stepwise Regression
  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression

SIMPLE LINEAR REGRESSION (SLR)

The simplest method among the Linear Regression algorithms is the Simple Linear Regression (SLR). It is a statistical method which helps us to study the relationship between two continuous quantitative variables (input and output variable). The simple linear model assumes a linear relationship between one independent variable (x) and one dependent variable (y) and hence the name. Examples of SLR usage include Salary Forecasting, Real Estate Prediction, Financial Portfolio Prediction etc.

SLR Intuition
Let’s understand the math behind Simple Linear Regression. Suppose we want to predict the salary of an employee (a continuous-valued output) based on his/her age. We know that there is usually a linear correlation between employee age and employee salary: generally, the higher the age, the higher the salary (though there are exceptions). Now consider the following parameters:

m = Number of training examples fed to the SLR model
x = Input Variable / Feature / Independent Variable / Explanatory Variable
y = Output Variable / Target Variable / Dependent Variable
(x, y) = One training example
(xi, yi) = The i-th training example
ypred – yi = Error difference between the predicted value and the actual value

Then, the hypothesis of the SLR will be:

y = a*x + b

where the dependent variable ‘y’ is a linear function of one independent variable ‘x’; since there is a single input variable, this is also called “Univariate Linear Regression”.

Cost Function and Gradient Descent
The motive of Linear Regression is to find the best possible values for ‘a’ and ‘b’ in the above equation. The Cost Function helps us find the values of ‘a’ and ‘b’ that provide the best-fit regression line for the data points. For this, we convert the problem into a minimization problem, where we minimize the error between the predicted value and the actual value; in other words, we need to find the values of ‘a’ and ‘b’ which minimize the average of the sum of squared errors. The formula is:

J(a, b) = (1/(2m)) * Σ (ypred(i) – yi)², where the sum runs over all m training examples and ypred(i) = a*xi + b

This is also known as the “Squared Error Cost Function” or “Mean Squared Error Function” because it gives the average squared error over all the data points. Gradient Descent is a method to update ‘a’ and ‘b’ iteratively so as to minimize the Cost Function J(a,b). Here, we start with some initial values of ‘a’ and ‘b’ and gradually reduce the cost function in discrete steps; the size of each step is controlled by the “Learning Rate” of Gradient Descent, which decides how fast the hypothesis converges to the minima.
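
Below is a minimal, illustrative sketch (not tied to the dataset used later) of how Gradient Descent could update ‘a’ and ‘b’ for this cost function; the toy data, learning rate and iteration count are arbitrary choices made only for demonstration.

# Illustrative sketch: Gradient Descent for the hypothesis y = a*x + b on toy data
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # toy inputs (for illustration only)
y = np.array([40.0, 45.0, 55.0, 58.0, 66.0])  # toy targets (for illustration only)

a, b = 0.0, 0.0        # initial values of the slope and the intercept
learning_rate = 0.01   # size of each Gradient Descent step
m = len(x)             # number of training examples

for _ in range(5000):
    y_pred = a * x + b
    error = y_pred - y
    # Partial derivatives of J(a,b) = (1/(2m)) * sum((y_pred - y)^2)
    gradient_a = (1.0 / m) * np.sum(error * x)
    gradient_b = (1.0 / m) * np.sum(error)
    a = a - learning_rate * gradient_a
    b = b - learning_rate * gradient_b

print(a, b)  # learned slope and intercept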

Now, let’s train a sample dataset with a Simple Linear Regression model in Python (using PyCharm IDE) and then predict output for some data. Here, the dataset is of Number of Defects found (y) against Test Execution duration in months (x). We will split this dataset into “Training Set” and “Test Set”. Then, we will train the SLR model with the training data and then test the model with the test data. The dataset is as follows:

Test Execution Duration (in Months) | No. of Defects Found
1.1 | 39
1.3 | 46
1.5 | 37
2.0 | 43
2.2 | 39
2.9 | 56
3.0 | 60
3.2 | 54
3.2 | 64
3.7 | 57
3.9 | 63
4.0 | 55
4.0 | 56
4.1 | 57
4.5 | 61
4.9 | 67
5.1 | 66
5.3 | 83
5.9 | 81
6.0 | 93
6.8 | 91
7.1 | 98

Below is the Python code:

#Simple Linear Regression

#Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plotter

#Import the collected Defect Dataset from csv file
collectedDataset = pd.read_csv('MonthlyDefects_Data.csv')

#Create Matrix of features for independent variable-X(Test Execution Duration(in Months)) 
#and for dependent variable-Y(No. of Defects Found))
X=collectedDataset.iloc[:,:-1].values
Y=collectedDataset.iloc[:,1].values

#Split the collected dataset into Training set and Test set (with split ratio 1/3)
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=1/3,random_state=0)

#Fit the Simple Linear Regression model to the Training set
from sklearn.linear_model import LinearRegression
slr=LinearRegression()
slr.fit(X_train,Y_train)

#Predict the Test Set Results and a sample value of independent variable
#Create a vector of predictions of the Dependent variable
Y_pred=slr.predict(X_test)
sample=slr.predict([[4.2]])  # predict expects a 2D array: one row, one feature
print(sample)

#Visualize the Training set result
plotter.scatter(X_train,Y_train,color='green')
plotter.plot(X_train,slr.predict(X_train),color='red')
plotter.title('Defects Found vs Months of Execution (Training Set)')
plotter.xlabel('Months of Execution')
plotter.ylabel('Defects Found')
plotter.show()

#Visualizing the Test set result
plotter.scatter(X_test,Y_test,color='green')
plotter.plot(X_train,slr.predict(X_train),color='red')
plotter.title('Defects Found vs Months of Execution (Test Set)')
plotter.xlabel('Months of Execution')
plotter.ylabel('Defects Found')
plotter.show()
[Figure: SLR plot of the Training Set and the Test Set]
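
As a quick follow-up (reusing the variables from the code above), we can also inspect the learned slope and intercept and score the fit on the Test Set; r2_score is the standard scikit-learn metric for the coefficient of determination.

# Follow-up sketch (assumes the variables from the SLR code above):
# inspect the learned slope 'a' and intercept 'b', and score the Test Set fit
from sklearn.metrics import r2_score

print(slr.coef_)                 # slope 'a' of the fitted line
print(slr.intercept_)            # intercept 'b' of the fitted line
print(r2_score(Y_test, Y_pred))  # R-squared on the Test Set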

MULTIPLE LINEAR REGRESSION (MLR)

Multiple Linear Regression is a statistical method which uses several continuous (not categorical/discrete/qualitative) Independent Variables to predict the outcome of a continuous (not categorical/discrete/qualitative) Dependent Variable. If your independent variables are “Categorical Variables”, you have to encode them into numeric (continuous) variables before using MLR. It is the most common form of Linear Regression Analysis. The main aim of the MLR technique is to model a linear relationship between two or more independent variables (also referred to as predictor variables/regressors) and one dependent variable (also referred to as the outcome variable/regressand). It is an extension of the OLS (Ordinary Least Squares) Regression technique.

MLR Assumptions:
1) The regression residuals should be normally distributed, homoscedastic and approximately rectangular-shaped.
2) A linear relationship is assumed between the independent variables and the dependent variable.
3) Lack of multicollinearity, which means that the independent variables should not be highly correlated with each other (a quick way to check this is shown in the sketch after this list).
4) Note that adding too many independent variables will increase the amount of explained variance in the dependent variable (R-squared, R², also known as the coefficient of determination), but it can result in an over-fitted model.
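
As a minimal sketch of how assumption 3 could be checked, the Variance Inflation Factor (VIF) from the statsmodels library (assuming it is installed) flags highly correlated independent variables; the random feature matrix below is only a placeholder.

# Hedged sketch: check multicollinearity with Variance Inflation Factors (VIF)
# A VIF well above ~10 is a common rule of thumb for a problematic variable.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = np.random.rand(100, 3)  # placeholder numeric feature matrix for illustration
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vif)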

MLR Uses:
1) MLR can be used to identify the impact of the independent variables on the dependent variable.
2) MLR can be used to forecast the change in the dependent variable with the changes in the independent variables.
3) MLR can be used to predict trends and future values in the market.

MLR Intuition:
The formula for Multiple Linear Regression is:
y = b0 + b1x1 + b2x2 + b3x3 + ……. + bnxn

where, n = number of Independent Variables
y = the Dependent Variable
x1, x2, x3 …….. xn = the n (two or more) Independent Variables
b0 = y-intercept (constant)
b1, b2, b3 …….. bn = Slope coefficients for each Independent Variable

The MLR is used to determine the linear mathematical relationship among a number of random variables, in the form of a hyperplane (a straight line in two dimensions) that best approximates all the individual data points in a multidimensional space.

Let us consider the “50-startup” dataset (You can get the dataset from Kaggle).

Below is the python code to perform Multiple Linear Regression on the dataset:

# Multiple Linear Regression

# Import the required pandas library and collected companies dataset
import pandas as pd
collectedDataset = pd.read_csv('50-startups.csv')
X = collectedDataset.iloc[:, :-1].values
Y = collectedDataset.iloc[:, 4].values

# Encode the categorical column (index 3) into numeric dummy variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder = 'passthrough')
X = columnTransformer.fit_transform(X)

# Avoid the Dummy Variable Trap by dropping one of the dummy columns
X = X[:, 1:]

# Split the collected dataset into the Training set and Test set (with split ratio 0.3)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)


# Fit the Multiple Linear Regression model to the Training set
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(X_train, Y_train)

# Predict the Test set results and compare with Y_test
Y_pred = mlr.predict(X_test)
print (Y_pred)
print (Y_test)
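
To see how well the fitted MLR model generalizes, a short follow-up sketch (reusing Y_test and Y_pred from the block above) can compute standard regression metrics:

# Follow-up sketch (assumes the variables from the MLR code above):
# quantify the fit on the Test set
from sklearn.metrics import r2_score, mean_absolute_error

print(r2_score(Y_test, Y_pred))             # coefficient of determination (R-squared)
print(mean_absolute_error(Y_test, Y_pred))  # average absolute prediction error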

POLYNOMIAL LINEAR REGRESSION (PLR)

Polynomial Linear Regression is a statistical method in which the relationship between the independent variable (x) and the dependent variable (y) is modelled as an nth degree polynomial in x. Though Polynomial Linear Regression fits a non-linear relationship between x and the corresponding value of y, it is referred to as “linear” because the model is still linear in its coefficients. Hence, Polynomial Linear Regression is considered to be a special case of Multiple Linear Regression. The aim of the model is to find the best possible values of the coefficients to fit the data points.

PLR Intuition
The hypothesis of a Polynomial Linear Regression is given by the below formula:
y = b0 + b1x1 + b2x1² + b3x1³ + …….. + bnx1ⁿ

where, y = Dependent Variable
x1, x1², x1³ …… x1ⁿ = the Independent Variable and its higher-degree terms
b0, b1, b2 …… bn = Coefficients of the Independent Variable and its higher-degree terms (b0 being the intercept)

PLR is used when a straight line from SLR or MLR does not fit the data points well and we want more of a parabolic curve to fit the data points. A polynomial term, such as a quadratic (squared) or cubic (cubed) term, turns a linear regression model into a curved one. When you observe that the data points of the independent and dependent variables are scattered in a curvilinear relationship, it is best to use PLR, since a linear model on that kind of data will produce many large positive and negative residuals.
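
To make the idea of polynomial terms concrete, here is a tiny illustrative sketch (toy values only) of how scikit-learn’s PolynomialFeatures expands a single column x1 into [1, x1, x1²] for degree 2:

# Illustrative sketch: PolynomialFeatures turns a single column into its polynomial terms
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])  # toy column of independent variable values
print(PolynomialFeatures(degree=2).fit_transform(x))
# Output columns are [1, x1, x1^2]:
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]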

Consider the following dataset (JobRole_Salaries) which displays the salaries of different job roles in a test consultation company:

Job Role | Job Position Level | Salary
Graduate Test Analyst | 1 | 35000
Junior Test Associate | 2 | 45000
Senior Test Associate | 3 | 58000
Senior Test Analyst | 4 | 65000
Test Automation Architect | 5 | 90000
Junior SDET | 6 | 110000
Senior SDET | 7 | 150000
Lead SDET | 8 | 200000
Associate Manager | 9 | 300000
Senior Manager | 10 | 1000000

Below is the Python code to perform Simple Linear Regression Analysis and then Polynomial Regression Analysis on the same dataset:

# Polynomial Linear Regression

# Import the required libraries and collected dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plotter
collectedDataset = pd.read_csv('JobRole_Salaries.csv')
X = collectedDataset.iloc[:, 1:2].values
Y = collectedDataset.iloc[:, 2].values

# Split the collected dataset into Training set and Test set (with split ratio of 1/5)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Fit Simple Linear Regression Model to the dataset
from sklearn.linear_model import LinearRegression
linearRegression1 = LinearRegression()
linearRegression1.fit(X, Y)

# Visualize the Simple Linear Regression Model result
plotter.scatter(X, Y, color = 'green')
plotter.plot(X, linearRegression1.predict(X), color = 'red')
plotter.title('Salary Prediction (Simple Linear Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()

# Fit Polynomial Linear Regression Model to the dataset
from sklearn.preprocessing import PolynomialFeatures
polynomialRegression = PolynomialFeatures(degree = 4)
X_polynomial = polynomialRegression.fit_transform(X)
polynomialRegression.fit(X_polynomial, Y)
linearRegression2 = LinearRegression()
linearRegression2.fit(X_polynomial, Y)

# Visualize the Polynomial Linear Regression Model results
plotter.scatter(X, Y, color = 'green')
plotter.plot(X, linearRegression2.predict(polynomialRegression.fit_transform(X)), color = 'red')
plotter.title('Salary Prediction (Polynomial Linear Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()

# Visualize the Polynomial Regression Model results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
plotter.scatter(X, Y, color = 'green')
plotter.plot(X_grid, linearRegression2.predict(polynomialRegression.fit_transform(X_grid)), color = 'red')
plotter.title('Salary Prediction (Polynomial Linear Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()

# Predicting and printing a new result with Simple Linear Regression
print(linearRegression1.predict([[6.5]]))

# Predicting and printing a new result with Polynomial Linear Regression
print(linearRegression2.predict(polynomialRegression.fit_transform([[6.5]])))
[Figure: SLR, PLR and PLR (with higher resolution and smoother curve) plots]

SUPPORT VECTOR REGRESSION (SVR)

Support Vector Regression (SVR) is a variant of the Support Vector Machine (SVM) which can be used to perform regression analysis on continuous-valued data, instead of the classification that SVM is usually used for. The difference between other regression techniques and SVR is that in the other techniques we try to minimize the error rate, whereas in SVR we try to fit the error within a certain threshold.

Consider the below diagram:

[Figure: Support Vector Regression (SVR)]

Here, the red lines are the Boundary Lines and the blue line is the Hyperplane.

SVR supports both linear and non-linear regression and tries to fit as many data instances as possible on the street formed between the boundary lines while limiting margin violations. SVR performs regression in a higher-dimensional space where each data point represents its own dimension. The width between the red boundary lines and the blue hyperplane is controlled by a hyperparameter ε (epsilon). The main aim of the SVR model is to consider the data points within the red boundary lines, and the best-fit line is the blue hyperplane that contains the maximum number of data points.

When you evaluate the kernel between a data point in the training set and a data point in the test set, the resulting value gives the co-ordinate of the test data point in that dimension. The vector (k) produced when the test point is evaluated against all the training set data points is the representation of the test point in the higher-dimensional space. This vector can then be used to perform linear regression. The vectors closest to the test point are referred to as Support Vectors.
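
As a rough illustration of this idea (toy values and an assumed gamma parameter, not the exact computation scikit-learn performs internally), the RBF kernel vector of a single test point against the training points can be computed as:

# Illustrative sketch: RBF kernel values between one test point and each training point
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy training points (assumed)
x_test = np.array([2.5])                          # toy test point (assumed)
gamma = 0.5                                       # kernel width parameter (assumed)

# One kernel value per training point: the representation 'k' of the test point
k = np.exp(-gamma * np.sum((X_train - x_test) ** 2, axis=1))
print(k)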

To train an SVR model, we need a training set which covers the domain of interest and is accompanied by solutions on that domain. The job of the SVR is to approximate the function we used to generate the training set. After collecting the training set, we need to choose a kernel, its parameters and any regularization needed. Then, we need to form the correlation matrix and train the model to obtain the contraction coefficients. Using these coefficients, we can create our estimator.

Let’s consider the same dataset (JobRole_Salaries) that we used for our Polynomial Linear Regression example. Below is the Python code where we have used SVR model on the dataset to predict:

#Support Vector Regression

# Import the required libraries and the collected dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plotter
collectedDataset=pd.read_csv('JobRole_Salaries.csv')
X=collectedDataset.iloc[:,1:2].values
Y=collectedDataset.iloc[:,2:3].values

# Split the Dataset into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

# Perform Feature Scaling of the collected dataset
from sklearn.preprocessing import StandardScaler
standardScaler_X=StandardScaler()
standardScaler_Y=StandardScaler()
X=standardScaler_X.fit_transform(X)
Y=standardScaler_Y.fit_transform(Y)

# Fit the Support Vector Regression Model to the dataset
from sklearn.svm import SVR
supportVectorRegressor=SVR(kernel='rbf')
supportVectorRegressor.fit(X,Y.ravel())  # ravel Y into the 1D shape expected by fit

# Predict a new Result with the built SVR model (scale the input, then
# inverse-transform the scaled prediction back to the original salary units)
Y_predicted=standardScaler_Y.inverse_transform(
    supportVectorRegressor.predict(standardScaler_X.transform(np.array([[6.5]]))).reshape(-1,1))
print(Y_predicted)

#Visualize the SVR Results
plotter.scatter(X,Y,color='green')
plotter.plot(X,supportVectorRegressor.predict(X),color='red')
plotter.title('Salary Prediction (Support Vector Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()
[Figure: SVR results plot]
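
As a side note, the manual feature scaling above can also be wrapped in a scikit-learn Pipeline. A minimal sketch, reusing collectedDataset from the block above and leaving the target unscaled for simplicity (so the predicted number will differ somewhat from the value printed above):

# Optional sketch: scale X automatically inside a Pipeline instead of by hand
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
svr_pipeline.fit(collectedDataset.iloc[:, 1:2].values,
                 collectedDataset.iloc[:, 2].values)
print(svr_pipeline.predict([[6.5]]))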

DECISION TREE REGRESSION

A Decision Tree is one of the most important supervised machine learning algorithms which can be used to predict a target value by learning decision rules from the features. It is also known as CART (Classification And Regression Trees) and it provides a foundation for some other important ML algorithms like Bagged Decision Trees, Random Forest and Boosted Decision Trees. The main concept behind Decision Tree is to break down the data by making decisions after asking a series of questions to the data. It can work on both Categorical and Numerical Data.

A decision tree can build both Regression and Classification models in the form of a tree structure. It essentially breaks down the data into smaller subsets and at the same time the associated decision tree is incrementally developed. The final outcome is a tree with Decision Nodes and Leaf Nodes. Here, we will learn the Decision Tree Regression technique which is a non-linear, non-continuous regression model. Decision Tree Classification will be covered in the Classification section of the tutorial.

A decision node can have two or more branches, each representing the tested attribute values. A leaf node represents a decision on the numerical target. The topmost decision node in a tree that represents the best predictor is called the Root Node (First Parent).

A decision tree is constructed by the process of Recursive Partitioning, starting from the Root Node. Each node can be split into left and right child nodes, and these nodes can be split further, becoming the parent nodes of their resulting child nodes. This splitting procedure continues until the samples at each node all belong to the same class (or some other stopping criterion is reached).

Let us consider the below diagram:

[Figure: Decision Tree example]

In this diagram, the decision tree is based on categorical targets (classification), but the same concept applies to continuous-valued numbers too (Regression). Here, the Root Node (First Parent) is the node “Test Execution Pending?” and it splits into the child nodes “Do Test Execution” and “Test Artifacts in place?”. The node “Test Artifacts in place?” is further split into the child nodes “Organize Artifacts” and “Review Test Automation Framework”.

Let’s consider the same dataset (JobRole_Salaries) that we used for our Polynomial Linear Regression and Support Vector Regression examples and see how we can use Decision Tree Regression on that dataset using Python:

# Decision Tree Regression

# Import the required libraries and the collected dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plotter
collectedDataset=pd.read_csv('JobRole_Salaries.csv')
X=collectedDataset.iloc[:,1:2].values
Y=collectedDataset.iloc[:,2].values

# Split the Dataset into the Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

# Fit the Decision Tree Regression Model to the dataset
from sklearn.tree import DecisionTreeRegressor
decisionTreeRegressor=DecisionTreeRegressor(random_state=0)
decisionTreeRegressor.fit(X,Y)

#Predicting a new value with the trained model
Y_predict=decisionTreeRegressor.predict([[6.5]])  # predict expects a 2D array
print(Y_predict)

#Visualize the Decision Tree Regression Results
X_grid=np.arange(min(X),max(X),0.01)
X_grid=X_grid.reshape((len(X_grid),1))
plotter.scatter(X,Y,color='green')
plotter.plot(X_grid,decisionTreeRegressor.predict(X_grid),color='red')
plotter.title('Salary Prediction (Decision Tree Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()
[Figure: Decision Tree Regression results plot]
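
To make the learned splits visible (reusing decisionTreeRegressor from the block above), scikit-learn’s export_text can print the tree’s decision rules; the feature name passed here is just a readable label.

# Follow-up sketch (assumes the variables from the Decision Tree code above):
# print the split rules learned by the fitted regressor
from sklearn.tree import export_text
print(export_text(decisionTreeRegressor, feature_names=['Job Position Level']))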

RANDOM FOREST REGRESSION

Before trying to understand what Random Forest Regression is, let’s first understand “Ensemble Learning”, a very powerful technique for improving Machine Learning model performance. Ensemble Learning (also known as “Model Ensembling”) is a method of combining multiple Machine Learning models to produce a more powerful, optimal predictive model. We will talk about Ensemble Learning in detail later in this tutorial.

The Random Forest can be considered an Ensemble Learning technique which can perform both Regression and Classification tasks by combining the predictions of multiple base Decision Tree models. In terms of mathematical functions, we can represent it as:
g(x) = (1/N) * (f1(x) + f2(x) + …… + fN(x))

where, the final ensemble Random Forest Regression model g(x) is the average of the N simple base Decision Tree models fi. The individual Decision Tree models are constructed independently using different subsamples of the training data, and this process of training each decision tree on a different data sample, where sampling is done with replacement, is known as Bagging (Bootstrap Aggregation).

Random Forest Regression Uses
The Random Forest Regression model is very useful for handling tabular data with numerical/categorical features compared to many of its counterparts. Also, unlike the linear regression models, Random Forest Regression can capture non-linear interactions between the independent variables and the dependent variable.

Random Forest Regression does not work well with high-dimensional sparse input data, such as categorical features with a large number of levels. For such data, you need to either pre-process those features to generate suitable numerical values or use a linear model.

Steps to build a Random Forest Regression Model
Step 1: Pick ‘n’ data points at random from the training set.
Step 2: Build a base Decision Tree model associated with those ‘n’ data points.
Step 3: Choose the number of Decision Trees you want to build (N).
Step 4: Repeat Step 1 and Step 2 ‘N’ times.
Step 5: For a new data point, make each of the ‘N’ decision trees predict the output value and then assign the average of all the predicted output values as the predicted output for the new data point.
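
The following minimal sketch mirrors these steps on a toy dataset (all names and values below are illustrative only), building N decision trees on bootstrap samples and averaging their predictions:

# Illustrative sketch of the steps above: a hand-rolled bagging ensemble of decision trees
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = np.arange(1, 11, dtype=float).reshape(-1, 1)     # toy feature values (assumed)
y_toy = X_toy.ravel() ** 2 + rng.normal(0, 5, size=10)   # toy target values (assumed)

N = 50                        # Step 3: number of decision trees to build
predictions = []
for _ in range(N):            # Step 4: repeat Steps 1 and 2, N times
    idx = rng.choice(len(X_toy), size=len(X_toy), replace=True)   # Step 1: sample with replacement
    tree = DecisionTreeRegressor().fit(X_toy[idx], y_toy[idx])    # Step 2: fit a base tree
    predictions.append(tree.predict([[6.5]])[0])

print(np.mean(predictions))   # Step 5: average the N predicted values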

Let’s consider our previous dataset (JobRole_Salaries) and perform Random Forest Regression on it:

# Random Forest Regression

# Import the required libraries and the collected dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plotter
collectedDataset=pd.read_csv('JobRole_Salaries.csv')
X=collectedDataset.iloc[:,1:2].values
Y=collectedDataset.iloc[:,2].values

# Split the collected dataset into the Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

# Fit the Random Forest Regression Model to the dataset with n_estimators=10,
# where n_estimators is the number of trees in the forest
from sklearn.ensemble import RandomForestRegressor
randomForestRegressor=RandomForestRegressor(n_estimators=10,random_state=0)
randomForestRegressor.fit(X,Y)

# Fit the Random Forest Regression Model to the dataset
# from sklearn.ensemble import RandomForestRegressor
# randomForestRegressor=RandomForestRegressor(n_estimators=100,random_state=0)
# randomForestRegressor.fit(X,Y)

# Fit the Random Forest Regression Model to the dataset
# from sklearn.ensemble import RandomForestRegressor
# randomForestRegressor=RandomForestRegressor(n_estimators=300,random_state=0)
# randomForestRegressor.fit(X,Y)


# Predict and print the output for a new data based on the formed Random Forest Regression Model
Y_predict=randomForestRegressor.predict([[6.5]])  # predict expects a 2D array
print(Y_predict)

# Visualize the Random Forest Regression Model Results
X_grid=np.arange(min(X),max(X),0.01)
X_grid=X_grid.reshape((len(X_grid),1))
plotter.scatter(X,Y,color='green')
plotter.plot(X_grid,randomForestRegressor.predict(X_grid),color='red')
plotter.title('Salary Prediction (Random Forest Regression)')
plotter.xlabel('Job Position level')
plotter.ylabel('Salary')
plotter.show()
[Figure: Random Forest Regression results plot]