Apply Linear Regression in Machine Learning using Python
This blog will help in applying Linear Regression using python.
(It is a sequel of blog available at
A problem which is solved using Data Science and Machine Learning involves data and requires data cleaning.
Once the data cleaning is done, the data is divided following the 70/30 rule. This means that 70% of the overall data is used for training and 30% of the data is kept aside to test the model.
There is no hard-and-fast rule to take 70/30 distribution. Partitioning ratio is an important aspect and experience along with domain knowledge and amount of records available can help to make a good choice of this ratio.
Why to divide data into train and test data?
This is done to avoid overfitting. Imagine the data set is not divided into train and test data and complete data is used to train the model. Once the model is ready for testing then the testing data will come from the training data itself. The model has already learned this data, and it will always give correct answer. This means that the model is never tested correctly before deploying and there is a high chance that it will perform bad on new data set.
Testing will give the right picture of how good the model is by using a separate set of test data because the model has never seen this data during training.
SKLearn library in python helps to achieve this.
from sklearn.cross_validation import train_test_split X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.7)
Instantiate an object
from sklearn.linear_model import LinearRegression linReg = LinearRegression()
Fit the model using fit function
Predict the labels using test data
y_pred = linReg.predict(X_test)
Here, y_pred has the predicted values and y_test has the actual results.
Now, let’s compare y_pred and y_test to see the error.
Compute RMSE and R-square
from sklearn.metrics import mean_squared_error, r2_score mse = mean_squared_error(y_test,y_pred) r_squared = r2_score(y_test,y_pred)
Mean square error and r2_score metrics can be used to evaluate the model performance.