Understand Train and Test Split in your Data Set (ML Process… cont..)

Shanthababu Pandian
Published in Analytics Vidhya
5 min read · Aug 9, 2020

Let’s discuss how to train and test your model with a given data set. Think of how students are prepared for their board exams by great teachers in school or college: they sit different kinds of tests, such as unit tests, term exams, revisions and surprise tests, and practise on various combinations of questions and mixed scenarios. I hope you have all come across these situations many times in your studies. A data set in Data Science is no exception :) because we need to build a very strong model before we go into the production environment, and this is a really important step after EDA and Feature Engineering.

Similarly, in the Data Science domain, the model is trained on sample data and then predicts values for the held-back part of the available data set, before it is deployed to the production environment and meets real-time/streaming data.

This process helps us understand the data and decide which model we could use for our data set/source to address the business problem. Here we must take care that the data set matches the real-time/streaming data feed the model will see in the production environment, covering all the combinations that can occur. So, the choice of data set (data preparation) is really key before the T&T (Train and Test) process. Otherwise the model ends up in a pathetic situation… as in the picture below 😊. That can mean a huge loss of effort, an impact on the project cost and an unhappy customer.

Here you might ask me the questions below.

Why do you split data into Training and Test Sets?

What is a good train test split?

How do you split data into training and testing?

What is training and testing accuracy?

How do you split data into train and test in Python?

What are X_train and y_train, X_test and y_test?

Is train test split random?

What is the difference between training set and test set?

Questions!

How do you split data into training and testing?

80/20 is certainly a good starting point. Later you can adjust it based on your model performance and the volume of the data. 75/25 is also a commonly used split.

Training data is the data set on which you train the model.

It is the data from which the model learns its experience.

Training sets are used to fit and tune your models.

Test data is the data used to check whether the model has learnt well enough from the experience it gained on the training data.

Test sets are “unseen” data used to evaluate your models.
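As a minimal sketch of the 80/20 split described above, assuming scikit-learn's `train_test_split` is the tool in use (the article does not name the library explicitly) and using made-up toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 rows, 3 feature columns, one binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# test_size=0.2 holds out 20% of the rows as the test set (an 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Changing `test_size` to 0.25 would give the 75/25 split instead.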

Train and Test Split
T&T Process

What is training and testing accuracy?

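a. Training accuracy is the accuracy we get when we apply the model to the training data it was fitted on.

b. Testing accuracy is the accuracy we get on the testing data, which the model has not seen.

It is useful to compare the two to see how the model is doing on the training and test sets during the ML process. A training accuracy far above the testing accuracy is a common sign of overfitting.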
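The comparison of the two accuracies could look like the sketch below. The synthetic data set from `make_classification` and the choice of `DecisionTreeClassifier` are my own assumptions for illustration; the article does not prescribe a model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# Fit on the training set only.
model = DecisionTreeClassifier(random_state=40)
model.fit(X_train, y_train)

# Training accuracy: score on the rows the model was fitted on.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
# Testing accuracy: score on rows the model has never seen.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Testing accuracy:  {test_accuracy:.3f}")
```

An unpruned decision tree tends to fit its training rows almost perfectly, so the gap between the two numbers is easy to see with this model.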

Accuracy

What are X_train and y_train, X_test and y_test?

1). X_train: This holds all your independent variables for the training portion of the data (I will share a detailed note on independent and dependent variables); these are used to train the model.

2). X_test: This is the remaining portion of the independent variables, which is not used in training. It is mainly used to make predictions to test the accuracy of the model.

3). y_train: This is your dependent variable, the one the model needs to predict, for the training portion; it holds the category labels for the rows in X_train.

4). y_test: This is the remaining portion of the dependent variable. These labels are used to test the accuracy by comparing actual and predicted categories.

NOTE: We need to specify our dependent and independent variables before training/fitting the model. Identifying those variables is a big challenge, and the choice should come from the business problem statement we are going to address.
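To make the four names concrete, here is a small sketch on a pandas DataFrame. The column names (`hours_studied`, `mock_tests_taken`, `passed`) and the values are hypothetical, chosen only to echo the student example; in a real project they come from the business problem.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical student data (illustrative only).
df = pd.DataFrame({
    "hours_studied":    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "mock_tests_taken": [0, 1, 1, 2, 2, 3, 3, 4, 4, 5],
    "passed":           [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Independent variables (features) and dependent variable (label).
X = df[["hours_studied", "mock_tests_taken"]]
y = df["passed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# Rows stay aligned: X_train and y_train share the same row index.
print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(X_test.shape, y_test.shape)    # (2, 2) (2,)
```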

Is train test split random?

The importance of the random split is explained clearly and simply in the picture below!

In simple terms, a random split helps the model see the different combinations of data that exist in the given data set.

The random_state parameter seeds the internal random number generator, which decides how the data is split into train and test.

Let's say random_state=40: then you will get the same split every time you run the code, not only the first time. This is very useful if you want reproducible results while finalizing the model.
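A short sketch of that reproducibility, again assuming scikit-learn's `train_test_split` and toy data of my own:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 rows, labels 0..9 so rows are easy to track.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two separate calls with the same seed.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=40)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=40)

# The same rows land in the test set both times.
same_seed_matches = np.array_equal(y_test_a, y_test_b)
print(y_test_a, y_test_b, same_seed_matches)
```

Leaving `random_state` unset lets the split vary from run to run, which makes results harder to compare.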

Random Split
Actual Vs Predicted

After the Train and Test split, we fit the algorithm/model on the training set (X_train and y_train), then predict on X_test and compare the predictions with y_test.
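The fit, predict and compare steps can be sketched as follows. The Iris data set bundled with scikit-learn and `LogisticRegression` are stand-ins I picked for illustration, not the article's own choice.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled example data: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# Fit on the training portion only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the unseen test portion.
y_pred = model.predict(X_test)

# Actual vs predicted, side by side.
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
comparison["match"] = comparison["actual"] == comparison["predicted"]

print(comparison.head())
print("Testing accuracy:", accuracy_score(y_test, y_pred))
```

The share of `True` values in the `match` column is the testing accuracy.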

Thanks for taking the time to read this article! I hope you all got an idea of the Train and Test Split in the ML process. On top of this, there are a few more concepts in the same line, like the validation set, Overfitting and Underfitting, Hyperparameters and many more! I will cover them all one by one! See you again!

Shanthababu Pandian
Data & Analytics Technical Delivery Manager; Data Scientist; Machine Learning Engineer; Azure Data Engineer. https://www.linkedin.com/in/shanthababu-pandian-b2a9259/