>>12614093
This is counterintuitive, but very large neural networks often generalize better when they have massively more parameters than training examples. This has been demonstrated experimentally time and time again, so we know something different is happening with these kinds of models.
I don't think there is a definitive answer to this question, but the current hypothesis seems to be that having massively more parameters creates "flatter" optima in the loss landscape over your parameters. Large flat optima give you good generalization because the network can't bounce around chasing each individual data point: in a flat region there just isn't enough gradient to follow.