Gradient Boosting

No.13679429
Question about CatBoost.

You have a dataset and you split it into 80% training and 20% testing. You want to find the optimal hyperparameters of the model, so you do cross validation on the training set while searching the parameter space.

CatBoost has something called the eval set which is used to help avoid overfitting, but I'm confused. My question is how to use it during cross validation.

Say you do 5-fold CV. Each iteration trains on 80% of the training data and predicts the other 20%.

Is it fair, in every iteration, to use that 20% fold as the eval set to avoid overfitting, and then still predict on that same fold and report the result? Or have we cheated in a way by making that 20% of the training data the eval set, because we stopped training early as a function of that set?
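For concreteness, here's roughly the pattern I mean. This is just a sketch: I'm using scikit-learn's GradientBoostingClassifier with staged_predict to stand in for CatBoost's eval_set / early stopping, and the data is made up. The point is that the best iteration is chosen on the same fold that gets scored:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy stand-in for the 80% training portion of the dataset.
X, y = make_classification(n_samples=500, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    model = GradientBoostingClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)

    # "Early stopping" as a function of the validation fold:
    # pick the boosting iteration that scores best on X_val ...
    staged_acc = [accuracy_score(y_val, pred)
                  for pred in model.staged_predict(X_val)]
    best_iter = int(np.argmax(staged_acc))

    # ... and then report the score on that SAME fold.
    fold_scores.append(staged_acc[best_iter])

print(np.mean(fold_scores))
```

So the question is whether those reported fold scores are optimistically biased, since the stopping point was tuned on the very data being scored.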

The same concern applies to the test set. After finding the optimal hyperparameters, can I use the test set as the eval set when training the final model? Or is this again cheating?
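The final-model version of the question would look something like this (again a sketch with a scikit-learn stand-in rather than CatBoost itself, and toy data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the full dataset and its 80/20 split.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

final = GradientBoostingClassifier(n_estimators=200, random_state=0)
final.fit(X_train, y_train)

# Using the TEST set to choose when to stop boosting -- the held-out
# score is now a function of the data it is supposed to measure.
staged_acc = [accuracy_score(y_test, pred)
              for pred in final.staged_predict(X_test)]
best_iter = int(np.argmax(staged_acc))
print(best_iter, staged_acc[best_iter])
```

That is, the test set would be doing double duty: picking the stopping iteration and measuring generalization.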

Thank you so much!