The correlation of a variable and an affine transformation of it is either \(-1\) or \(1\) (the sign of the scale factor, assuming the factor is nonzero).
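
A quick check, assuming \(Y = aX + b\) with \(a \neq 0\) and \(\operatorname{Var}(X) > 0\):

\[\operatorname{corr}(X, Y) = \frac{\operatorname{Cov}(X, aX + b)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(aX + b)}} = \frac{a\operatorname{Var}(X)}{|a|\,\operatorname{Var}(X)} = \operatorname{sign}(a).\]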

Sigmoid and softmax learn an equivalent classifier.

Let \[\sigma(wx + b) = 0.5\] be the decision boundary for a sigmoid classifier.

Then, for a softmax, \[\exp(w_1 x + b_1) / [\exp(w_1 x + b_1) + \exp(w_2 x + b_2)] = 0.5\] implies \[\exp(w_1 x + b_1) = \exp(w_2 x + b_2),\] which implies \[w_1 x + b_1 = w_2 x + b_2,\] which implies \[(w_1 - w_2)x + (b_1 - b_2) = 0.\] That is exactly the sigmoid boundary with \(w = w_1 - w_2\) and \(b = b_1 - b_2\).

Source for this toot: stats.stackexchange.com/questi
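
A minimal numerical check of the equivalence (toy parameter values chosen arbitrarily): a two-class softmax with weights \(w_1, w_2\) gives the same probabilities as a sigmoid with \(w = w_1 - w_2\), \(b = b_1 - b_2\).

import numpy as np

# Arbitrary toy parameters
w1, b1 = 2.0, -1.0
w2, b2 = 0.5, 0.3
w, b = w1 - w2, b1 - b2   # equivalent sigmoid parameters

x = np.linspace(-5, 5, 101)

sigmoid = 1.0 / (1.0 + np.exp(-(w * x + b)))

# Probability of class 1 under a two-class softmax
z1, z2 = w1 * x + b1, w2 * x + b2
softmax1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))

print(np.allclose(sigmoid, softmax1))  # True: same classifier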

Why does SGD work? We might finally know.

(1) A sufficient representation of the data is invariant to nuisance factors if and only if it is minimal, i.e. it retains the least possible information about the data.
(2) The information in the representation is bounded by the information in the weights.
(3) The information in the weights is lower at flat minima than at sharp minima. Because SGD tends toward flat minima, it implicitly pushes the network toward representations that generalize (toy sketch of point (3) below).

arxiv.org/abs/1706.01350
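
A toy sketch of point (3), not from the paper: a made-up 1D loss with one sharp and one flat minimum. The same weight noise barely raises the loss at the flat minimum, so the weight there can be specified less precisely, i.e. with less information.

import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D loss: sharp minimum at w = -2, flat minimum at w = +2
def loss(w):
    return np.minimum(50.0 * (w + 2.0) ** 2, 0.5 * (w - 2.0) ** 2)

sigma = 0.3                                  # identical weight noise at both minima
noise = sigma * rng.standard_normal(10_000)

sharp_rise = loss(-2.0 + noise).mean() - loss(-2.0)
flat_rise = loss(2.0 + noise).mean() - loss(2.0)

# The flat minimum tolerates the noise, so fewer bits are needed
# to pin its weight down (less information in the weights).
print(f"sharp minimum: expected loss rise {sharp_rise:.3f}")
print(f"flat minimum:  expected loss rise {flat_rise:.3f}")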

Hey guys. I'm a master's student in data science. I post about statistical learning theory and neural networks. 🙂
