Gradient Boosting

No.13679429
Question about CatBoost.

You have a dataset and you split it into 80% training and 20% testing. You want to find the optimal hyperparameters of the model, so you do cross validation on the training set while searching the parameter space.

CatBoost has something called the eval set which is used to help avoid overfitting, but I'm confused. My question is how to use it during cross validation.

Say you do 5-fold CV. Each iteration trains on 80% of the training data and predicts the other 20%.

Is it fair, in every iteration, to use that 20% fold as the eval set to avoid overfitting, and then still predict on that same fold and report the result? Or have we cheated in a way by making that 20% of the training data the eval set, because we stopped training early as a function of that set?
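For concreteness, here's roughly the pattern I mean. This is just a sketch: I'm using scikit-learn's GradientBoostingClassifier with staged_predict to stand in for CatBoost's eval_set / early stopping, and the data is made up. The point is that the best iteration is chosen on the same fold that gets scored:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy stand-in for the 80% training portion of the dataset.
X, y = make_classification(n_samples=500, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    model = GradientBoostingClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)

    # "Early stopping" as a function of the validation fold:
    # pick the boosting iteration that scores best on X_val ...
    staged_acc = [accuracy_score(y_val, pred)
                  for pred in model.staged_predict(X_val)]
    best_iter = int(np.argmax(staged_acc))

    # ... and then report the score on that SAME fold.
    fold_scores.append(staged_acc[best_iter])

print(np.mean(fold_scores))
```

So the question is whether those reported fold scores are optimistically biased, since the stopping point was tuned on the very data being scored.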

The same concern applies to the test set. After finding the optimal hyperparameters, can I use the test set as the eval set when training the final model? Or is this again cheating?
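The final-model version of the question would look something like this (again a sketch with a scikit-learn stand-in rather than CatBoost itself, and toy data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the full dataset and its 80/20 split.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

final = GradientBoostingClassifier(n_estimators=200, random_state=0)
final.fit(X_train, y_train)

# Using the TEST set to choose when to stop boosting -- the held-out
# score is now a function of the data it is supposed to measure.
staged_acc = [accuracy_score(y_test, pred)
              for pred in final.staged_predict(X_test)]
best_iter = int(np.argmax(staged_acc))
print(best_iter, staged_acc[best_iter])
```

That is, the test set would be doing double duty: picking the stopping iteration and measuring generalization.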

Thank you so much!