In my previous blog post, I discussed the idea of linear regression and solved a problem using the simple linear regression approach. There, we had to find the dependent variable's value using a single independent variable. In this blog post, I will be talking about multiple linear regression in Python. The multiple linear regression technique is used for solving problems with multiple independent variables. Let's look at the dataset below.
The above dataset contains the amount spent by each department in an organisation, the state it is located in, and the profit it gained. Here profit is our dependent variable (which we want to predict); all the others are independent variables. Our target model should be able to predict the profit when given the required independent variables. This is where multiple linear regression comes in.
There is little difference between simple and multiple regression apart from the number of independent variables. All other procedures are the same as in simple regression except the processing of the data. In simple linear regression, the independent variable did not need a pre-processing stage, as it was already ready for modelling. But when you look at the above table, you can easily see that the categorical variable `state` needs modification before we can proceed with modelling.
Encoding Categorical Data with OneHotEncoder
Categorical data encoding is a data pre-processing technique that replaces non-numeric labels with numeric values. So here in our dataset, the `state` variable cannot be passed to our model directly; it would cause issues in the model. Many machine learning algorithms cannot operate on label data directly: they require all input and output variables to be numeric. So we have to represent the labels with integer values or in a binary form. Plain integer encoding is problematic because it implies an ordering between categories that does not exist; the binary form avoids this. This type of encoding can be done by one-hot encoding.
In the one-hot encoding method, a label is represented by 0s and 1s. Here we have 3 states: New York, Florida and California. After encoding, the table looks like this:

| State      | California | Florida | New York |
|------------|------------|---------|----------|
| New York   | 0          | 0       | 1        |
| California | 1          | 0       | 0        |
| Florida    | 0          | 1       | 0        |

You can easily see that each `state` is represented by a different combination of 0s and 1s, so from the model's perspective every state is distinguishable. This can be done in code: the module `sklearn.preprocessing` is the right tool for the job. Let's look into the code. Import the necessary modules first:
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
```
You might have noticed the import of a new module called `pandas`. Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Most of our datasets come in the form of an Excel sheet or a CSV data file. We can make use of the pandas library to import those files into our environment. Some of the key features of pandas are
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
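As a quick sketch of loading data with pandas (using an inline CSV string here instead of a file, and made-up numbers, just for illustration):

```python
import io
import pandas as pd

# A tiny inline CSV standing in for a file like firms.csv (illustrative numbers)
csv_data = io.StringIO(
    "R&D Spend,Administration,State,Profit\n"
    "165349.2,136897.8,New York,192261.83\n"
    "162597.7,151377.6,California,191792.06\n"
)
dataset = pd.read_csv(csv_data)
print(dataset.shape)  # (2, 4): two rows, four columns
```

`pd.read_csv` accepts either a file path or any file-like object, which is why the `StringIO` wrapper works here.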
I have already saved the above dataset into a CSV file called firms.csv (you can download the dataset here). So I have to import it into our model using pandas. Here it is,
```python
# Importing the dataset
dataset = pd.read_csv('firms.csv')

# Independent variables go into X (all columns except the profit column)
X = dataset.iloc[:, :-1].values

# Dependent variable goes into y (profit, at column index 4)
y = dataset.iloc[:, 4].values
```
If you run the above code, you can print X and y successfully. But there is a problem: if you call `type(X)` (a function which prints the type of a variable) you can see that it is of type `numpy.ndarray`, and if you check `X.dtype` you will find it is an object array. Object arrays are ndarrays with a datatype of `np.object`, whose elements are Python objects; they enable the use of numpy's vectorized operations and broadcasting rules with arbitrary Python types, and they have certain special rules to resolve ambiguities that arise between Python types and numpy types. In our case the dtype is object because the `state` column still holds strings, so we need to convert the array into a purely numeric form. This can be solved as follows:
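The sketch below (with made-up numbers) mimics what `dataset.iloc[:, :-1].values` returns while one column still holds strings:

```python
import numpy as np

# Mixing floats and strings leaves the array with dtype object
X = np.array([[165349.2, 136897.8, 471784.1, "New York"],
              [162597.7, 151377.6, 443898.5, "California"]], dtype=object)
print(X.dtype)        # object
print(type(X[0, 3]))  # <class 'str'> -- the state column is still text
```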
```python
# Encoding categorical data using one-hot encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Creating a LabelEncoder instance
labelencoder = LabelEncoder()

# Label-encode the state column (column index 3)
X[:, 3] = labelencoder.fit_transform(X[:, 3])

# One-hot encode the (now integer-labelled) state column
onehotencoder = OneHotEncoder(categorical_features = [3])

# Converting to a float array
X = onehotencoder.fit_transform(X).toarray()
```
Now when you print X's dtype you can clearly see that it has been converted to a float64 array. Now our dataset is ready to go. But wait! There is another problem. If you look at the one-hot encoded table, each `state` is represented by 3 numbers, each either 0 or 1. But in fact there is no need for 3 columns; 2 are enough, because with 3 states the third column is fully determined by the other two (if the first two are 0, the state must be the third one). Keeping all 3 columns leads to what is called the dummy variable trap: a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. Let us remove the first column using the code below,
# Avoiding the Dummy Variable Trap X = X[:, 1:]
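A small sketch of why one dummy column is redundant (pure Python, for illustration only):

```python
# One-hot rows for a few sample states (columns: California, Florida, New York)
dummies = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Each row sums to 1, so any one column is fully determined by the other two:
# this is exactly the multicollinearity the dummy variable trap refers to
for row in dummies:
    assert row[0] == 1 - row[1] - row[2]

# Dropping the first column removes the linear dependence
trimmed = [row[1:] for row in dummies]
print(trimmed)  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```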
Fitting the model
Yay! Now our data preprocessing is complete! Our dataset is ready to go for regressor modelling. It is the same as in Simple linear regression.
```python
# Fitting Multiple Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
```
Now our model is ready for predictions. You can make predictions by calling the `regressor.predict()` function. You have to pass it an array containing the independent variables.
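As a minimal sketch of how `predict()` is called (toy numbers, not the firms dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2*x1 + 3*x2 exactly (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
y = 2 * X[:, 0] + 3 * X[:, 1]

regressor = LinearRegression().fit(X, y)

# predict() takes a 2-D array: one row per observation to predict
pred = regressor.predict(np.array([[4.0, 1.0]]))
print(pred)  # close to [11.]
```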
The sklearn library has a module to split our dataset into a training set and a testing set, so that you can pick the testing data from your actual dataset itself.
```python
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```
As I said before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We keep the test dataset (or subset) in order to check our model's predictions on data it has not seen, so that we can estimate the accuracy of the model.
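Putting the split together with a score on the held-out data (synthetic numbers, just a sketch; `score()` returns the R² value):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noiseless linear data (made up), so the model should score almost perfectly
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on unseen data; ~1.0 for this noiseless toy
```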
I will be talking about model accuracy in upcoming posts. Thank you.