0:00
Hi everyone. My name is Jacques Christopheena
0:02
And today we're going to learn about Scikit-learn's train_test_split. So train_test_split is one of the functions
0:09
that you can find inside Scikit-learn. It's within the model_selection module
0:14
And it's used to create training and testing data. And why you want to create testing data
0:21
is because you want to be able to measure your machine learning model's performance
0:26
So instead of training on all your data, you hold out one subset of the data in order to be able to identify what the model gets right or wrong
0:37
And how train_test_split works is, essentially, you create a training set and a test set
0:43
and then you train your model only on the training set, and then you make a prediction
0:48
and then you compare that prediction to your test set in order to calculate the accuracy
0:57
So if you want to get started with train_test_split, you first need to install the Scikit-learn library
1:06
I have a small dataset that I'm loading from Scikit-learn, which is the breast cancer data set
1:13
in order to showcase how train_test_split works. So in order to use train_test_split, you follow these steps
1:22
So essentially you split a dataset into a training and a testing set
1:26
And then you will provide the test size that you want to have based on the population that you have
1:34
After that, you train your model and then you make a prediction
1:38
Once you have your prediction, you can compute the accuracy using a metric like accuracy_score
1:46
in order to know whether your model is doing fine on the unseen data
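To make the accuracy step concrete, here is a minimal sketch of what accuracy_score computes. The labels and predictions are made up for illustration and do not come from the breast cancer data set.

```python
from sklearn.metrics import accuracy_score

# Hypothetical held-out labels and model predictions (invented for this example)
y_test = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Accuracy is the fraction of predictions that match the true labels:
# here 3 of the 4 predictions are correct
acc = accuracy_score(y_test, y_pred)
print(acc)  # 0.75
```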
1:52
So let me show you what the data set would look like when loaded
1:56
into a DataFrame. Essentially you have all your features here and, right at the end, you have what you want to predict,
2:05
the target. And what we are going to do is we are gonna split that data set
2:13
into a training set and a testing set using the Scikit-learn library
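A minimal sketch of loading the data set as a DataFrame, assuming pandas is installed. The variable names data and df are my own, and the video may load the data differently.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
data = load_breast_cancer(as_frame=True)

# data.frame holds the 30 feature columns followed by a final "target" column
df = data.frame

print(df.shape)
print(df.columns[-1])
print(df.head())
```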
2:18
So the first thing that you wanna know is that Scikit-learn expects you to have an array-like data
2:26
type, which is an array in this case. So we are showing here that you have an array
2:33
And in order to be able to do this, you're going to have two arrays, one for the features and one for the target
2:46
And you want to make sure that the length of your features and the length of your target are the same
2:56
And your target is a one-dimensional array, whereas the features are two-dimensional. So the first step to use train_test_split is actually to import it
3:08
So you import, no, sorry, from sklearn.model_selection you import train_test_split
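The import and the array requirements described above could be checked like this. It is a sketch that assumes the arrays come from load_breast_cancer with return_X_y=True, and the names X and y follow the usual convention.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split  # the import described above

# return_X_y=True gives the features and the target as two NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

print(X.ndim, X.shape)  # features: two-dimensional (samples, features)
print(y.ndim, y.shape)  # target: one-dimensional (samples,)

# Both arrays must describe the same number of samples
print(len(X) == len(y))
```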
3:24
You can run this. And then, if you want to split
3:33
your data, you will unpack four values from the train_test_split function
3:43
So essentially you do X_train, X_test. By convention we use an uppercase X and a lowercase y
3:56
And then you do y_train, y_test. And then you call train_test_split
4:06
And what you need is to pass your features array and your target array
4:13
And you need to define your test_size. Here I'm going to use 0.3
4:19
And then you can specify random_state. random_state is used to be able to reproduce your results
4:26
So it will seed the pseudo-random number generator in order to be able to reproduce this split
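Put together, the call might look like the sketch below. The seed value 42 is an assumption at this point in the video (any fixed integer works), and the second call is only there to show that the same seed gives the same split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Four values come back, always in this order
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold out 30% of the rows for testing
    random_state=42,  # assumed seed; any fixed integer makes the split repeatable
)

# Splitting again with the same seed returns the same rows
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print((X_test == X_test2).all())
```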
4:35
So that's the main way to run train_test_split. And what this does is essentially it splits your data into training and testing sets
4:47
It splits the X array into a training and a testing array, so X_train and X_test
4:59
and then you have your target split into a training and a testing array too
5:04
If you look at the shape, then you will see that the number of rows for X_train and y_train is the same
5:13
and the number of rows for X_test and y_test is the same. And the test set essentially represents about 30% of your full data set because we have set the test size to 0.3
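A quick way to check those shapes, as a sketch using the same 0.3 test size. The seed is again an assumed value, and test_fraction is a name I introduce for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed seed
)

# Row counts match within each pair; X keeps its feature columns, y stays 1-D
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Share of the full data set that ended up in the test set
test_fraction = len(X_test) / len(X)
print(round(test_fraction, 2))
```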
5:28
How do you choose your test size? So the best option in order to choose a test size would be to do some cross-validation
5:36
but most of the time we use a rule of thumb
5:41
If you have a large data set, so that you can spare plenty of training data
5:46
then as a rule of thumb you would do a 70-30 or 80-20 split for your training-to-testing ratio
5:54
If you have limited data and you need to keep a larger amount of your data assigned to the training set
6:02
then you probably want to look into a 90-10 split. Otherwise, as I mentioned, you do some cross-validation in order to define it
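The video does not show the cross-validation step, so the following is only one possible reading of it: instead of relying on a single split, cross_val_score evaluates the model on several train/test splits. The model and the number of folds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical model choice, used only to have something to score
model = KNeighborsClassifier(n_neighbors=5)

# cv=5 makes five folds; each fold serves once as the test set,
# so every round is effectively an 80-20 split
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Comparing the spread of these scores for different fold counts gives a feel for how stable the estimate is with more or less test data.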
6:12
One last thing I want to tell you about train_test_split is to use
6:16
stratified sampling. Why you use stratify is that in your data set, sometimes the classes you try to predict do not have the same volume
6:30
So for example, with the breast cancer data set, there are two outcomes that we want to predict
6:36
Either it's malignant or benign. But when you split that data into training and testing set
6:45
sometimes one of the outcomes might appear more often than the other
6:49
For example, there might be more malignant than benign. And in that case, you have something that we call class imbalance
7:00
So you can use stratification in order to make sure that each outcome
7:06
is represented in the same proportion in your training and testing sets
7:12
So this can help improve the model accuracy and performance. So for example, if I look at my original data frame target
7:23
and I can show you that if you do value_counts, you will find that one class appears more often than the other
7:35
and you want that to be represented. So what you can do in order to stratify your data
7:41
you can do X_train again, X_test, y_train, y_test equals
7:51
train_test_split. And then you pass X and y, your two arrays. You set up your test size as we just mentioned. You can set up your random_state
8:10
equal to 42 for reproducibility, and then you can set up the stratify parameter
8:18
to equal y. So that's how you're gonna
8:25
stratify, and then you have your testing and training sets that you can use to train your model
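As a sketch of the stratified split just described, the block below prints the class counts and then compares the share of class 1 in the full target, the training set and the test set. Wrapping y in a pandas Series and the ratio variable names are my own choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Class counts in the full target: the two classes are not equally frequent
print(pd.Series(y).value_counts())

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The labels are 0 and 1, so the mean is the share of class 1
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(full_ratio, train_ratio, test_ratio)
```

The three ratios should come out nearly identical, which is what stratification is meant to guarantee.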
8:35
And so that's it. That's how you use it. There are other parameters here that you can use
8:42
If you want to go further, you can go ahead and start training your model
8:48
And the way to train your model here, you
8:55
could use a classifier, for example, and to do so what we will do is take one of the
9:03
modules, which is sklearn.neighbors, so we can do a nearest-neighbors classification. So you
9:12
import from sklearn.neighbors, and then you set up your knn equal to KNeighborsClassifier
9:20
with n_neighbors. I'm not going into how exactly this works; we have separate tutorials for how classification algorithms work
9:34
but you call fit with X_train, y_train, so that's how you pass your training data set. So right now
9:40
we're fitting the model with the training data set, and then we try to
9:45
make the prediction: knn.predict on X_test
9:55
So then we try to make the prediction on the testing data and then we can run this
10:02
And in order to know if it works well, you can compute a score by doing knn.score
10:10
with X_test, y_test, and then you run this and you can compute your accuracy score and see here what you have
10:22
So that's how you use train_test_split in order to fit a model and compute the accuracy of it
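The whole workflow might look like this sketch. The n_neighbors value is an assumption because the transcript does not say which number was used, and the accuracy_score call is added only to show that it agrees with knn.score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# n_neighbors=3 is an assumed value, not taken from the video
knn = KNeighborsClassifier(n_neighbors=3)

# Fit only on the training data
knn.fit(X_train, y_train)

# Predict on the held-out test features
y_pred = knn.predict(X_test)

# knn.score returns the mean accuracy on the given test data
score = knn.score(X_test, y_test)
print(score)

# Same number, computed from the predictions directly
print(accuracy_score(y_test, y_pred))
```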
10:29
There are other parameters that can be used. Here are the definitions
10:34
And if you want to learn more, make sure you subscribe to this channel and look at one of my profiles
10:39
in order to learn more about what I do. Thank you very much. See you next time