Linear regression in simple words
Linear regression is one of the simplest machine learning techniques for predicting an outcome from data.
What is SLR?
Simple linear regression (SLR) is a machine learning technique that involves one independent variable x and one dependent variable y. By fitting a straight line to the (x, y) data points, we can estimate the value of y for any value of x.
The equation for SLR is given as
y_hat = b_not + b_one * x_i
Here,
y_hat:- these are the predicted values
b_not:- it is the intercept on the y axis
b_one:- it is the coefficient of x
x_i:- it is the independent variable
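To make the roles of b_not and b_one concrete, here is a minimal sketch of how the two values can be computed by ordinary least squares, the usual way a straight line is fitted to data. The function name and the sample points are made up for illustration; they are not part of the data set used later.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b_not, b_one) for the line y_hat = b_not + b_one * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b_one = sxy / sxx
    # Intercept: makes the line pass through the point (mean_x, mean_y).
    b_not = mean_y - b_one * mean_x
    return b_not, b_one


# Made-up points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b_not, b_one = fit_simple_linear_regression(xs, ys)
print(b_not, b_one)
```

Because the sample points sit exactly on a line, the sketch recovers that line's intercept and slope.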
Let me explain this through an example,
Given below is the “countries of the world.csv” data set
First, we create a correlation matrix of the data set.
From it we observe a strong correlation between the columns Service and Phones (per 1000), and also between birthrate and infant mortality rate.
Both are fairly intuitive; for instance, a country with a strong service sector is likely to have more infrastructure and therefore more people using phones.
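Each entry of a correlation matrix is typically a Pearson correlation coefficient between two columns, a number between -1 and 1. The sketch below computes one such entry by hand. The helper name and the phone and GDP numbers are invented for illustration and are not taken from the data set.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values: phones per 1000 people and GDP per capita for four countries.
phones = [10.0, 50.0, 200.0, 400.0]
gdp = [800.0, 2000.0, 9000.0, 21000.0]
r = pearson_correlation(phones, gdp)
print(r)
```

A value close to 1 means the two columns rise together almost linearly, and a value close to -1 means one falls as the other rises.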
While these correlations are visible in the matrix, we can also look for relationships suggested by our intuition.
For example, we can apply SLR to the two columns Phones (per 1000) and GDP ($ per capita).
First, we read the data with pandas using pd.read_csv().
We take Phones (per 1000) as x and GDP ($ per capita) as y and apply linear regression.
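Here is a rough sketch of the loading step, with phones as X and GDP as Y so that it matches the regression equation later in the article. The three rows are made up and stand in for the real file. The decimal="," argument and the dropna() call are assumptions about how the file stores decimals and missing values, so adjust them to your copy of the data.

```python
import io

import pandas as pd

# Three made-up rows standing in for the real file (values are illustrative only).
sample_csv = io.StringIO(
    'Country,Phones (per 1000),GDP ($ per capita)\n'
    'A,"3,2",700\n'
    'B,"71,2",4500\n'
    'C,,6000\n'
)

# Assumption: decimals are written with a comma, as in "3,2".
df = pd.read_csv(sample_csv, decimal=",")

# Keep the two columns of interest and drop rows with missing values.
pair = df[["Phones (per 1000)", "GDP ($ per capita)"]].dropna()

# Double brackets keep the data two-dimensional, with shape (n_rows, 1).
X = pair[["Phones (per 1000)"]].values
Y = pair[["GDP ($ per capita)"]].values
print(X.shape, Y.shape)
```

For the real data you would pass the file name "countries of the world.csv" to pd.read_csv() instead of the in-memory sample.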
We perform a train/test split using the sklearn library, holding out 20% of the data for testing:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
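If you are wondering what a train/test split actually does, the idea can be sketched in a few lines of plain Python: shuffle the rows, set 20% aside for testing, and keep each x paired with its y. This is a simplified illustration with a hypothetical helper name, not sklearn's actual implementation.

```python
import math
import random


def simple_train_test_split(xs, ys, test_size=0.2, seed=0):
    """Shuffle the row indices, then hold out a fraction of the rows for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = math.ceil(len(xs) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_train = [ys[i] for i in train_idx]
    y_test = [ys[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up data: ten x values and their matching y values.
xs = list(range(10))
ys = [2 * x for x in xs]
x_train, x_test, y_train, y_test = simple_train_test_split(xs, ys)
print(len(x_train), len(x_test))
```

The model only sees the training rows while fitting, so the test rows give an honest check of how well it generalizes.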
We now fit a linear regression model on the training data for the two columns:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train , Y_train)
This is the result we get,
The blue line is the regression line, and the red dots are the data points from the x and y columns.
The closer the dots lie to the line, the better the model fits the data.
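A quick way to convince yourself that the fitting step works is to run it on data where you already know the answer. The sketch below uses made-up points that follow y = 3x + 5 exactly (not the countries data) and checks that sklearn recovers that line. Note that sklearn expects X to be a two-dimensional array, hence the reshape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data that follows y = 3x + 5 exactly.
x = np.arange(20, dtype=float)
y = 3.0 * x + 5.0

# sklearn expects X with shape (n_samples, n_features).
X = x.reshape(-1, 1)
Y = y.reshape(-1, 1)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# With a 2-D Y, coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = float(regressor.coef_[0][0])
intercept = float(regressor.intercept_[0])
print(slope, intercept)
```

On real data the points will not sit exactly on a line, so the fitted slope and intercept describe the best straight-line approximation instead.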
Using the coef_ and intercept_ attributes of the fitted model, we get the coefficient of x (b_one) and the intercept (b_not) discussed earlier.
We also compute the r2 score, which measures how much of the variation in y the model explains.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
print('coefficient\n', regressor.coef_)
print('intercept', regressor.intercept_)
Y_pred = regressor.predict(X_test)  # predictions for the held-out test set
print("mean squared error: {}".format(mean_squared_error(Y_test, Y_pred)))
print("r2 score: {}".format(r2_score(Y_test, Y_pred)))
# Root mean squared error: square root of the MSE, in the same units as y
print("root mean squared error: {}".format(np.sqrt(mean_squared_error(Y_test, Y_pred))))
These are the results:
coefficient
[[36.27195242]]
intercept [1121.74507482]
r2 score: 0.6296860984628494
This gives us an r2 score of about 0.63, meaning the model explains roughly 63% of the variation in GDP on the test data. The regression equation is
y_hat = 1121.74 + 36.27 * x
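The r2 score and the mean squared error reported in the results can also be computed by hand, which helps to see what they measure. The sketch below uses the standard definitions with hypothetical helper names and four made-up values, not the actual test set.

```python
import math


def mse_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (squared error of the model / squared error of always predicting the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]

mse = mse_manual(y_true, y_pred)
rmse = math.sqrt(mse)
r2 = r2_score_manual(y_true, y_pred)
print(mse, rmse, r2)
```

An r2 score of 1 would mean perfect predictions, while 0 would mean the model does no better than always predicting the average of y.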
Now, for any value of x (phones per 1000), we can estimate the GDP per capita of a country, including for values that are not in our data.
For example, for a country with x = 978, the estimated GDP per capita is 1121.74 + 36.27 * 978 = 36593.80.
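The same calculation can be wrapped in a small function, using the rounded intercept and coefficient from the results. The function and constant names are made up for this sketch.

```python
# Rounded values taken from the fitted model's output in this article.
B_NOT = 1121.74  # intercept
B_ONE = 36.27    # coefficient of x


def predict_gdp(phones_per_1000):
    """Estimate GDP ($ per capita) from phones per 1000 people."""
    return B_NOT + B_ONE * phones_per_1000


estimate = predict_gdp(978)
print(round(estimate, 2))
```

With a fitted sklearn model you would get the unrounded version of this estimate by calling regressor.predict() on a two-dimensional input such as [[978]].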
I hope this explanation helped you to understand SLR in a new and fun way.