0:00
Hi everyone, my name is Ja. In this Python tutorial I'm going to show you how to use scikit-learn's KNeighborsClassifier and how to perform K-nearest neighbors in scikit-learn. So what is KNN? It's an algorithm used in supervised learning for both classification and regression, and essentially you try to classify data into two or more categories based on the closest neighbors. So if you have a point to predict, you can say: if I look at the three nearest neighbors in that circle, I get this classification. If K equals three, then I classify it as green, or Class B; and if K equals six, since the circle is bigger and there are four items in Class A and only two in green, then I classify it as Class A. In a nutshell, that's how KNN works. If you want to learn more, make sure you look at my tutorials online. Otherwise, let's get started.
1:11
I have loaded this data for you. Essentially, what I'm doing is loading the breast cancer dataset: I have all these features, 30 of them, and what I'm trying to predict is whether a breast cancer is malignant or benign. Since scikit-learn requires us to use numerical data arrays, I'm going to assign the dataset to X and y, and then I'm going to get started with data preparation and splitting.
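Here is a minimal sketch of that loading step, assuming the built-in load_breast_cancer dataset from scikit-learn (the variable names are my own):

```python
from sklearn.datasets import load_breast_cancer

# Built-in breast cancer dataset: 30 numeric features,
# target is 0 (malignant) or 1 (benign)
data = load_breast_cancer()
X = data.data    # feature matrix as a NumPy array
y = data.target  # target labels as a NumPy array
```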
1:54
Whenever we run a machine learning model, we want to make sure that we have a training set and a test set so that we can look at accuracy, and that's essentially what this does. You can look at my tutorial on train test split if you want to understand it in detail, but essentially it splits my data into a 30% test group. So let's run this.
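Under those assumptions, the split would look something like this (the 30% test size comes from the video; the random_state is just my own choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```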
2:25
Now let's dive into the meat of the subject, which is the actual K-nearest neighbors. In order to run a KNN algorithm with scikit-learn, we need the neighbors module: from the neighbors module you import KNeighborsClassifier. To train the model we generally use the naming convention knn, and we use KNeighborsClassifier. The first parameter we want to set is the number of neighbors, and that's the K value I was talking about; in this case we'll arbitrarily select eight, and I'll show you how to select that value later. Then we do knn.fit on X_train and y_train, and knn.predict on X_test. The fit method is the actual training of the model, which is why we put the training data in there, and the predict method is where we put in the test features, because we're trying to predict y_test. That's why we split out X_test and y_train: we're going to try to predict y_test. To compute the score, we compare y_test to the actual predictions, and that's how we compute accuracy: we do knn.score and pass X_test and y_test. If we look at that information, our prediction accuracy is currently 92.9%. So 92.9% of the time we are right, and that happens because this is a clean dataset, so we don't have too much preprocessing to do.
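As a sketch, assuming the variable names from the split above, that first model would look roughly like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# K = 8 neighbors, an arbitrary starting value
knn = KNeighborsClassifier(n_neighbors=8)

# fit() trains the model on the training data
knn.fit(X_train, y_train)

# predict() returns the predicted classes for the test features
y_pred = knn.predict(X_test)

# score() compares predictions on X_test against y_test (accuracy)
print(knn.score(X_test, y_test))  # around 0.93 in the video
```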
4:41
But let's dig into this a bit more. What we want to know is how many neighbors we should actually use, so I'm going to make a plot that tells us that. I have the same split here, training on the same data, and I'm going to create variables to store the information. We start from the same starting point as before, but this time we loop through multiple values. We can say np.arange(1, 26): essentially, np.arange will create a range of values 1, 2, 3, 4, up to 25, and we're going to loop over each of those values. If I print each neighbor, you'll see that I'm actually looping through all the values from 1 to 25, and what we want to do is train a model using that value every time. So we use KNeighborsClassifier with n_neighbors equal to the current number of neighbors: we'll train it at one, train it at two, and go up to 25, in order to try to find the best value for n_neighbors. We train the model on X_train and y_train, and then with the trained model we record the training accuracy and the test accuracy. For each neighbor we assign the knn.score, so that's our accuracy score, on X_train and y_train, and then I do the same but with X_test and y_test.
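A runnable sketch of that loop, with train_accuracy and test_accuracy as my own names for the storage arrays:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Candidate values of K: 1, 2, ..., 25
neighbors = np.arange(1, 26)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Train one KNN model per value of K and record both accuracies
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
```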
7:06
So I'm going to plot this. I have these values here, and what I want to do is plot this information. I'm going to set my title with matplotlib, hence why I've imported matplotlib. Then I plot the neighbors: I take all the neighbors as the x values and plot the training accuracy values against them, with the label "Train accuracy". I can do the same with the test accuracies and label them "Test accuracy". Then I add a legend, an x label, and call show. Let me show you what this does: it plots this information and tells us that somewhere around here, just before the test accuracy levels off, we have the best possible n_neighbors; with very few neighbors the model is overfitting the data, and far off to the right it becomes too coarse. So this is a plot that shows you the accuracy by number of neighbors.
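A sketch of that plot, reusing the arrays from the loop above (exact labels and title are my own wording):

```python
import matplotlib.pyplot as plt

# Accuracy as a function of the number of neighbors
plt.title('KNN: varying number of neighbors')
plt.plot(neighbors, train_accuracy, label='Train accuracy')
plt.plot(neighbors, test_accuracy, label='Test accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.show()
```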
8:29
We can also use GridSearchCV to find the right n_neighbors. What we're going to do is set up a param grid. I'm not going to go too deep into GridSearchCV, because this is a KNN tutorial, but let me show you: I'm going to use n_neighbors from np.arange(1, 50), and I'm going to build knn_cv with GridSearchCV from the KNN estimator, the param grid, and cv equal to five, that is, five-fold cross-validation. Let's not dive too deep into what that means, but we can now call knn_cv.fit on X_train and y_train, and what this will do is go through all the n_neighbors values and fit the model. Then we can print the best parameters, and we can print the best score as well. Essentially, by doing this we know what the best parameters are and what the accuracy score is for that best set of parameters. In this case it tells us that we should use six neighbors, with an accuracy of almost 95%. When I showed you the plot earlier we were around six neighbors, so we were about right by looking at the plot, but GridSearchCV is a much better way. If we look at our previous run we had about 92% accuracy, and now we are showing 95% accuracy, which is fantastic.
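A minimal sketch of that search, assuming the names param_grid and knn_cv:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search K from 1 to 49 with 5-fold cross-validation
param_grid = {'n_neighbors': np.arange(1, 50)}
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_cv.fit(X_train, y_train)

print(knn_cv.best_params_)  # e.g. {'n_neighbors': 6} in the video
print(knn_cv.best_score_)   # close to 0.95 in the video
```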
10:38
Now we can evaluate the model, and one way to do this is with a confusion matrix. If you don't know what a confusion matrix is, I have a tutorial on the topic here on my website, but essentially it shows you the true positives, true negatives, false negatives and false positives; in other words, how often you are right and how often you are wrong. I won't dive too deep into this, but you can look at that by computing cm with confusion_matrix. Again, this is a tutorial on KNN, not on confusion matrices, but it's helpful when you look at your KNN accuracy: you pass your labels, and then the classifier's classes. We're going to show this in colour with a display: I'm going to use ConfusionMatrixDisplay, because if I just run confusion_matrix what I get is a plain array, which you can interpret, but if you want to make it more visual you use ConfusionMatrixDisplay with display_labels equal to the KNN classes. Essentially you're just making a nicer plot of the same thing, and then you can set the title and plot the confusion matrix, and you can see your true labels. If you want to interpret this, it means that you have 102 instances where you correctly predicted benign and 57 instances where you correctly predicted malignant. Let's not dive in too much, but you can interpret this.
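A sketch of that evaluation, assuming knn_cv from the grid search above and y_pred as the name for the predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predictions from the tuned model
y_pred = knn_cv.predict(X_test)

# Plain array: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)

# The same information as a colour plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=knn_cv.classes_)
disp.plot()
plt.title('KNN confusion matrix')
plt.show()
```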
12:55
The way to go further is by looking at the classification report, and that's where you get your actual metrics. You go: from sklearn.metrics import classification_report, and then you can print it with your y_test and y_pred, so you're comparing your predictions against your original test labels, and you get the report. Again, I have a tutorial here that explains what the classification report is, but essentially what you want to read off is your accuracy, which here is 93%, and that says how often your predictions are correct. The other information you can look at is the row for the malignant class: if we come back to our problem, we're trying to define how often we can predict a cancer as malignant, and what we look at here is precision and recall. What the precision says is: when you predicted cancer, how often was it really cancer? The answer is 94% of the time: when we predicted someone had cancer, it was really cancer. The recall is: of all the real cancers that existed, the true positives plus the false negatives, what percentage could we predict? That's 95%. In a case like cancer these numbers should be really, really high, and that's what we have in this case.
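As a short sketch, with the same assumed names:

```python
from sklearn.metrics import classification_report

# Precision, recall, f1-score per class, plus overall accuracy
print(classification_report(y_test, y_pred))
```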
15:06
Now let's dive into feature scaling. Here we have a specific case where the dataset is quite clean, but KNN is really sensitive to the scale of the data: if some features are on a very large scale and other features are on a tiny scale, that can really skew the calculations, because the distances won't mean the same thing. One way to scale the data is to shift each feature so that its mean is zero and its standard deviation is one, so essentially you have one standard deviation everywhere, and one way to do this is with StandardScaler. To do this you create the scaler, then you run fit_transform on X_train, and for X_test you run the scaler with just transform, because you don't want to fit on X_test: you don't want to train on the test data, which is why you only transform it. If we print the results, here we have the original mean and standard deviation, and once we have transformed the data, we have this mean and standard deviation.
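A sketch of that scaling step, with X_train_scaled and X_test_scaled as assumed names:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit_transform learns the mean/std from the training data and applies them;
# transform reuses those statistics on the test data without refitting
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```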
17:00
In order to do this within a pipeline with KNeighborsClassifier, we start with the same initial steps that we've done so far: we run the train test split, and then we create a steps variable. The steps variable should be a list, and each step should describe what you're trying to do: here we have a scaler step, where we run StandardScaler, and the second element of the list should be the KNN step, which is KNeighborsClassifier with n_neighbors equal to six, because we learned that value earlier. Then we train the model by running a pipeline: we instantiate pipeline equals Pipeline, and within the Pipeline we give the steps as tuples. Instead of using knn.fit, this time we use pipeline.fit and pass in X_train and y_train, then we predict with the fitted, scaled model on X_test again, and finally we can print the pipeline score. And that's how you train a model using a pipeline.
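A runnable sketch of that pipeline, with the step names and knn_scaled as assumed identifiers:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Each step is a (name, transformer-or-estimator) tuple
steps = [
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=6)),
]
pipeline = Pipeline(steps)

# fit() scales X_train and trains the KNN model in one call,
# so we pass the unscaled data and let the pipeline handle scaling
knn_scaled = pipeline.fit(X_train, y_train)

# predict() applies the same scaling to X_test before predicting
y_pred_scaled = knn_scaled.predict(X_test)

# Accuracy of the scaled pipeline on the test set
print(pipeline.score(X_test, y_test))
```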
18:44
So thank you very much, that was all for this KNeighborsClassifier with scikit-learn tutorial. Make sure to subscribe to my channel, visit my website, or follow me on social media. Thank you very much, and see you next time.