Hi everyone, my name is Ja, and today I'm going to teach you how to use scikit-learn's linear regression to make a simple prediction model in Python. So linear regression here means using ordinary least squares linear regression from scikit-learn to predict values.
It is found in the sklearn.linear_model module, and the LinearRegression class fits a linear model that minimizes the residual sum of squares between your observations and your predictions. In a nutshell, linear regression tries to fit a line through the points so as to minimize the residual sum of squares: we fit an intercept and a slope, which together define a line, then we calculate each residual, square them, and add them up. That sum, the RSS, is what linear regression tries to minimize.
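To make the RSS concrete, here is a tiny numeric sketch with made-up points: for a candidate line y = intercept + slope * x, we square each residual and sum them, which is exactly the quantity ordinary least squares minimizes.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 1 + 2x,
# purely to illustrate the residual sum of squares (RSS).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def rss(intercept, slope):
    # residual = observed y minus the line's prediction at x
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

print(rss(1.0, 2.0))  # the perfect line: RSS is 0
print(rss(0.0, 2.0))  # a worse line: every residual is 1, so RSS is 4
```

Ordinary least squares searches over all intercept/slope pairs for the one with the smallest RSS.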
So the first thing you have to do is pip install scikit-learn and matplotlib for this tutorial, and what we'll do is an example on the diabetes dataset.
The diabetes dataset has a few physiological features per patient, so you have things like sex, age, body mass index, and blood pressure, and we try to predict a measure of diabetes disease progression one year later.
So we will load the data, and by loading it we'll find that we have a few features, the ones I've just told you about. Here we want to do a simple linear regression, so we will only use the body mass index to try to predict diabetes progression.
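The loading and feature selection described above can be sketched like this, assuming scikit-learn is installed; `as_frame=True` returns pandas objects so we can select the BMI column by name.

```python
from sklearn.datasets import load_diabetes

# Load the diabetes dataset as pandas DataFrames/Series.
diabetes = load_diabetes(as_frame=True)

# Keep only BMI as a single-column DataFrame for the simple regression.
X = diabetes.data[["bmi"]]
# Target: a measure of disease progression one year after baseline.
y = diabetes.target

print(diabetes.feature_names)  # age, sex, bmi, bp, plus six serum measurements
print(X.shape, y.shape)
```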
Generally when we do machine learning, we try to set a test set apart so that you can look at your accuracy score later, and that's what we do with train_test_split. I'm not going into train_test_split in this tutorial, I have a video on it, but essentially we are getting training data and testing data from this: we're splitting the data and keeping one part for testing later.
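The split can be sketched as follows; the `test_size` and `random_state` values here are illustrative choices, not ones stated in the video.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi"]]  # BMI-only simple regression

# Hold out a portion of the data for testing later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```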
Now what we want to do is actually train a linear regression. LinearRegression comes from scikit-learn: you go into the sklearn library, into the linear_model module, and take the LinearRegression class. To train the model, we assign a LinearRegression instance to a variable and then use that variable to fit the data. Fitting means training, so we train on the X train and y train data; you can see I'm not using all of X and y, just the training portions. Then I take the trained model and call predict with X test, so it will take X test and try to predict y based on our regression.
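Fit and predict look like this; the split parameters are again illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[["bmi"]], y, test_size=0.2, random_state=42
)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)      # fitting means training, on the training data only
y_pred = lin_reg.predict(X_test)   # predict y for the held-out X test
print(y_pred[:5])
```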
I told you a linear regression tries to fit a line, so we are going to look at the intercept, which was the star on my graph: we get the intercept_ attribute from our trained model, then the coef_ attribute, from which we select the first coefficient. Essentially we print both: the intercept, which is the predicted value at the point BMI = 0, and the BMI coefficient, meaning the disease progression changes by about 988 for one unit of BMI.
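Inspecting the fitted line is just reading two attributes that `fit()` sets. This sketch reuses the BMI-only split from above; since the random split is an assumption here, the exact numbers can differ slightly from the video's 150 and 988.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[["bmi"]], y, test_size=0.2, random_state=42
)

lin_reg = LinearRegression().fit(X_train, y_train)
print("intercept:", lin_reg.intercept_)      # predicted progression at BMI = 0
print("BMI coefficient:", lin_reg.coef_[0])  # change in progression per unit of BMI
```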
So let's try to plot this. I'm going to use seaborn's regplot; I have a tutorial on regplot that you can look at, but essentially I set x to the BMI column of X test and y to y test, do the regplot, and then add some information like the title, the x label, and the y label to put axis labels on our chart, and then we show the plot.
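The plotting step can be sketched as below, assuming seaborn and matplotlib are installed; the title and label strings are illustrative, and the Agg backend is used here only so the script also runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to show an interactive window
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[["bmi"]], y, test_size=0.2, random_state=42
)

# regplot draws the scatter of test points plus a fitted regression line.
ax = sns.regplot(x=X_test["bmi"], y=y_test)
ax.set_title("Diabetes progression vs BMI")
ax.set_xlabel("Body mass index (normalized)")
ax.set_ylabel("Disease progression")
plt.savefig("regplot.png")
```

With an interactive backend you would call `plt.show()` instead of `plt.savefig()`.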
Here I have my linear regression with an intercept of about 150 at x = 0 and a slope of about 988. You can see the line stops around 350 because of our data, and the x axis only runs from about 0 to 0.15; if we expanded it from 0 to 1, the line would rise by about 988 over that range, which is what the slope means.
6:01
accuracy score for this uh simple linear
6:05
regression we will import metric from
6:09
psyched learn so we use psych learn
6:13
metrics and we will import
6:16
numpy as NP so here I'm not going too
I'm not going too deep into what these metrics are, there are tutorials on the topic, but essentially, to evaluate our model, we will figure out the mean absolute error, the root mean squared error, and the coefficient of determination, the R2 score.
6:41
we take metrics uh we use mean
6:52
absolute error and then we pass in the Y
6:56
test and the Y PR so it will compare uh
7:01
test and our prediction of Y and it will
7:05
look at this absolute error we will want
7:09
to look at the range of Y so we look at
7:12
the minimum value and we use MP Max of Y
7:17
uh to Showcase what is the range so that
7:20
we can can compare the mean absolute
7:23
error to AR range and then we will look
Then we look at the root mean squared error. Initially you would call mean_squared_error with squared=False to get it, but that parameter was deprecated, so now we use root_mean_squared_error with y test and y pred.
Then we calculate the R2 score, again using metrics.r2_score. So there are multiple convenient ways of calculating these without us having to go through all the math behind them.
If we print all of that, we get that the mean absolute error here is about 50 units. Essentially, the mean absolute error looks at how much the predictions deviate from the actual values: if you compare your prediction and the actual value, how far off are we on average? In this case we are off by about 50 units, not of BMI but of disease progression. And if we look at the range of disease progression, it goes from 25 to 346, so being wrong by 50 on average is quite significant; not completely out of range, but quite significant.
If we look at the root mean squared error, that's the typical magnitude of the error: what we're saying here is we are off by about 62 units, so it's somewhat accurate, but still a high error.
And we have a coefficient of determination of 0.28, which says how much of the variance in the target we can explain with BMI, and it's only 28%, so it's quite weak. It is a value from 0 to 1, and 28% signifies a weak relationship.
So in order to fix this, we can add more features to get better performance. We can improve the score by doing some data preprocessing that we haven't done in this tutorial; we can do feature engineering to generate new features; we can do hyperparameter tuning and cross-validation to select our hyperparameters better; and we can use different models instead of linear regression, or simply select better features.
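As a quick illustration of the first suggestion, this sketch fits on all ten features instead of BMI alone and compares R2 on the same illustrative test split; it is not a tuned model, just a baseline comparison.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# BMI-only model vs a model using all ten features, scored on the same split.
r2_bmi = r2_score(
    y_test,
    LinearRegression().fit(X_train[["bmi"]], y_train).predict(X_test[["bmi"]]),
)
r2_all = r2_score(
    y_test,
    LinearRegression().fit(X_train, y_train).predict(X_test),
)
print(f"R^2 with BMI only: {r2_bmi:.2f}, with all features: {r2_all:.2f}")
```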
So this is it for simple linear regression with scikit-learn. Feel free to follow me and help me by subscribing to this channel, or ask me any questions you want in the comments. See you next time!