model calibration
- A neural network for classification produces a number for each class. Do these numbers actually correspond to probabilities?
- Consider a set of examples that all have the same feature values but different labels: \(a\) percent carry label 1 and \(b\) percent carry label 2. You can imagine a model and decision rule that shoves everything into label 1 with a score of 1. This would not be well calibrated: the model reports full confidence but is correct only \(a\) percent of the time.
- Then, imagine a model and decision rule that still shoves everything into label 1, but with a score of \(a\) %. Its classification accuracy would be the same, yet it would be better calibrated, because the score reflects the proportion of examples that actually have label 1.
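The contrast above can be sketched numerically. This is a minimal illustration with assumed values (\(a = 70\)%, 1000 examples): both models predict label 1 for every example, so their accuracy is identical, but the gap between reported score and empirical accuracy differs.

```python
import numpy as np

# Assumed toy data: 1000 examples with identical features,
# where a = 70% carry label 1 and the rest carry label 2.
a = 0.70
rng = np.random.default_rng(0)
labels = np.where(rng.random(1000) < a, 1, 2)

# Both models always predict label 1, so their accuracy is the same.
accuracy = np.mean(labels == 1)

# Calibration compares the reported score against that empirical accuracy.
overconfident_score = 1.0  # shoves everything into label 1 with score 1
calibrated_score = a       # reports the class proportion instead

print(f"accuracy:          {accuracy:.2f}")
print(f"overconfident gap: {abs(overconfident_score - accuracy):.2f}")
print(f"calibrated gap:    {abs(calibrated_score - accuracy):.2f}")
```

The second model's score sits near the observed label-1 frequency, so its calibration gap is small even though its predictions (and thus its accuracy) are unchanged.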