No.10407741
Good-Turing Smoothing

I'm not sure if this is the right board for this, but I've been having some problems trying to understand Good-Turing estimation and how to implement it in a bigram model.
Basically my main question is: how big of a role does the testing dataset play in estimating the c* counts?
Like, is the number of occurrences based on the tokens we see in the training dataset or the testing dataset?
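For reference, here's a minimal sketch of how I currently understand the formula from the Stanford slides: c* = (c+1) * N_{c+1} / N_c, where N_c (the "frequency of frequencies") is the number of distinct bigram types seen exactly c times, computed from the training data only. The function name and fallback behavior here are my own choices, not from any of the links:

```python
from collections import Counter

def good_turing_adjusted_counts(bigram_counts):
    """Good-Turing adjusted counts: c* = (c+1) * N_{c+1} / N_c.

    N_c is the number of distinct bigram types that occur exactly
    c times -- counted over the TRAINING data only, as I read it.
    """
    # N_c: how many bigram types were seen exactly c times
    freq_of_freq = Counter(bigram_counts.values())
    adjusted = {}
    for bigram, c in bigram_counts.items():
        n_c = freq_of_freq[c]
        n_c1 = freq_of_freq.get(c + 1, 0)
        if n_c1 > 0:
            adjusted[bigram] = (c + 1) * n_c1 / n_c
        else:
            # No types seen c+1 times: naive fallback to the raw
            # count (real implementations smooth the N_c curve).
            adjusted[bigram] = float(c)
    return adjusted

# Toy training corpus: all counts come from training data alone.
tokens = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(tokens, tokens[1:]))
cstar = good_turing_adjusted_counts(bigrams)
```

On this toy corpus there are six bigram types seen once (N_1 = 6) and one seen twice (N_2 = 1), so every singleton bigram gets c* = 2 * 1/6 ≈ 0.33, and the freed-up probability mass goes to unseen bigrams.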
Here are some useful links if you want to read more about it.

https://en.m.wikipedia.org/wiki/Good–Turing_frequency_estimation

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

https://www.grsampson.net/AGtf1.html