>>10532667
>We have weights which approximate the actual result. So we have two dots: the approximation and the actual; the approximation - the actual = the error/the loss. But no, we actually have a fuckton of pairs of dots, because the weights approximate more than one case of the same categories of values, but with different results, and they do this in order to generalize so that they can predict something in the future, not just the current set of data (that's overfitting). The loss function is different depending on how hipster you are, but generally it's the mean squared error. This means: take the error from each example, square it, add it all together, then divide by the number of examples to get the average loss.

suppose your data points are generated by some function f(x, m, b) = mx + b + e, where e is random gaussian noise, and you want to find m and b. the important part here is the noise term. without it, your points would all lie on the same line.
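to make that setup concrete, here's a quick python sketch of the generative process. m_true, b_true, and noise_std are made-up values just for illustration:

import numpy as np

rng = np.random.default_rng(0)

# ground-truth parameters we pretend not to know
m_true, b_true = 2.0, -1.0
noise_std = 0.5  # standard deviation of the gaussian noise e

# n points: y = m*x + b + e, with e ~ N(0, noise_std^2)
n = 100
x = rng.uniform(-3, 3, size=n)
y = m_true * x + b_true + rng.normal(0, noise_std, size=n)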
you want m and b that maximize p(m,b | data)
p(m,b | data) = p(data | m,b)p(m,b)/p(data).
p(data) is a constant with respect to m and b. so if you assume a uniform prior p(m,b), a maximizer of the likelihood p(data | m,b) also maximizes the posterior p(m,b | data). that's why maximum likelihood estimation is enough here.
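you can check that equivalence numerically on the synthetic data from the sketch above (reusing x, y, noise_std from it). with a uniform prior, the unnormalized posterior is the likelihood times a constant, so they peak at the same (m, b) on a grid:

from scipy.stats import norm

# grid over candidate (m, b) pairs
ms = np.linspace(0, 4, 81)
bs = np.linspace(-3, 1, 81)

def log_likelihood(m, b):
    # log p(data | m, b), assuming the noise_std from the sketch above
    return norm.logpdf(y, loc=m * x + b, scale=noise_std).sum()

ll = np.array([[log_likelihood(m, b) for b in bs] for m in ms])
prior = 1.0  # uniform prior: the same constant for every (m, b)
log_post_unnorm = ll + np.log(prior)  # adding a constant can't move the argmax

i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(ms[i], bs[j])  # lands close to the true m and b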
for a single point, the density p(y | x, m, b, var) is N(y | mx+b, var), the normal distribution centered around mx+b with the noise variance var.
assuming your y data were drawn independently, the joint likelihood factors: p(y1, y2, ..., yn | x1, ..., xn, m, b, var) = N(y1 | m*x1+b, var) * N(y2 | m*x2+b, var) * ... * N(yn | m*xn+b, var). note each yi is centered at its own m*xi + b.
instead of maximizing this, you can equivalently minimize the negative logarithm (log is monotonic, so the argmax doesn't move). the product becomes a sum, the logarithm undoes the exponential in the normal PDF, and the minus sign cancels the one in the exponent, so each term becomes a*(yi - (m*xi + b))^2 with a = 1/(2*var), plus an additive constant from the gaussian normalization that doesn't depend on m or b and can be dropped. up to the positive constant a, that's exactly the sum of squared errors.
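that algebra is easy to verify numerically, again reusing x, y, n, noise_std, and norm from the sketches above. the negative log likelihood should equal a * (sum of squared errors) plus a constant, with a = 1/(2*var):

def neg_log_likelihood(m, b):
    return -norm.logpdf(y, loc=m * x + b, scale=noise_std).sum()

def sum_squared_errors(m, b):
    return ((y - (m * x + b)) ** 2).sum()

a = 1.0 / (2 * noise_std**2)
const = n * np.log(noise_std * np.sqrt(2 * np.pi))  # from the gaussian normalization

for m, b in [(2.0, -1.0), (0.5, 0.0), (3.0, 2.0)]:
    lhs = neg_log_likelihood(m, b)
    rhs = a * sum_squared_errors(m, b) + const
    print(lhs, rhs)  # equal up to floating-point error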
so if you assume a linear relation between x and y, and that the only reason the points don't all lie on the same line is normally distributed noise, then minimizing mean squared error is the same thing as maximum likelihood, and you can see how your loss function is justified.
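to close the loop, here's the fit itself on the synthetic data from the first sketch. np.polyfit solves the least squares problem in closed form (just one way to do it), and minimizing MSE is maximizing the gaussian likelihood:

# least squares fit of degree 1: returns [slope, intercept]
m_hat, b_hat = np.polyfit(x, y, deg=1)
print(m_hat, b_hat)  # should land near m_true = 2.0 and b_true = -1.0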