The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happening. Assume that we don't know p exactly, but we do know that p should lie within approximately [0.1, 0.35] and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean is 20/(20+80) = 0.2 and its mass lies almost entirely within [0.1, 0.35].
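A quick numerical check of the claim, assuming scipy is available (Beta(20, 80) is from the post; scipy.stats is just one convenient way to verify it):

```python
from scipy.stats import beta

# Beta(20, 80): mean should be 20 / (20 + 80) = 0.2
dist = beta(20, 80)
print(dist.mean())

# How much probability mass falls inside [0.1, 0.35]?
mass = dist.cdf(0.35) - dist.cdf(0.1)
print(mass)  # close to 1
```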

More along these lines: https://stats.stackexchange.com/a/47782

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra

Let A be an n × m matrix with n > m that has linearly independent columns.

Consider the equation Ax = b, where b is *not* in the column space of A. Then Ax = b cannot be solved. Instead, we can aim to minimize the error (b - Ax).

The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.

Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².
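A minimal numpy sketch of this (the matrix A and vector b here are made up for illustration): solving the normal equations AᵀA x̂ = Aᵀb gives the x̂ that minimizes ||b - Ax||², and p = A x̂ is the projection of b onto the column space.

```python
import numpy as np

# A tall matrix (n > m) with independent columns; b is not in the column space.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat   # projection of b onto the column space of A
e = b - p       # error component, lies in the nullspace of A^T
print(A.T @ e)  # ~[0, 0]: e is orthogonal to the columns of A
```

The same x̂ is what `np.linalg.lstsq(A, b)` returns, since both minimize the squared error.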

Fig. from Strang (1993)

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) of why Random Search is often superior to Grid Search for hyperparameter selection: with the same trial budget, Random Search gives by far the better coverage of the important univariate parameter distributions.
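The core of the paper's argument can be sketched in a few lines (the 2-D search space and trial budget here are made up): with 9 trials, a 3×3 grid tests only 3 distinct values of each parameter, while 9 random points test 9 distinct values of each — so if only one parameter really matters, random search probes it far more finely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 9

# Grid search: a 3x3 grid gives only 3 distinct values per parameter.
grid_vals = np.linspace(0.0, 1.0, 3)
grid = [(x, y) for x in grid_vals for y in grid_vals]
distinct_x_grid = len({x for x, _ in grid})

# Random search: 9 uniform points give 9 distinct values per parameter.
rand = rng.uniform(0.0, 1.0, size=(n_trials, 2))
distinct_x_rand = len(set(rand[:, 0]))

print(distinct_x_grid, distinct_x_rand)  # 3 9
```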

Turns out an ancient paper(*) has the answer.

If z = u₁ + iv₁ and w = u₂ + iv₂, where u₁, u₂, v₁, v₂ ~ N(0,1) (and independent), then the probability density of

r := |wz|

is given by

rK₀(r),

where K₀ denotes the modified Bessel function of the second kind with order 0.

(*) Wells, Anderson, Cell (1962) "The Distribution of the Product of Two Central or Non-Central Chi-Square Variates"
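A quick sanity check of the claimed density, assuming scipy: r·K₀(r) should integrate to 1 over (0, ∞), and its mean should equal E|zw| = E|z|·E|w| = π/2, since each modulus is Rayleigh-distributed with mean √(π/2).

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import k0

pdf = lambda r: r * k0(r)

total, _ = quad(pdf, 0, np.inf)                   # should be 1
mean, _ = quad(lambda r: r * pdf(r), 0, np.inf)   # should be pi/2
print(total, mean, np.pi / 2)
```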

Let z = u₁ + iv₁ and

w = u₂ + iv₂,

where u₁, v₁, u₂, v₂ are independent standard normal random variables (N(0,1)).

Then what is the probability distribution of the absolute value of the product |zw|?

Some empirical investigation (simulation) shows that the distribution looks like this:
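The simulation itself is only a few lines of numpy (the sample size is arbitrary); the sample mean of |zw| comes out near π/2, since |z| and |w| are independent Rayleigh variables with mean √(π/2) each.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# z = u1 + i*v1, w = u2 + i*v2 with independent N(0, 1) components
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)

r = np.abs(z * w)

# E|zw| = E|z| * E|w| = sqrt(pi/2)^2 = pi/2 ~= 1.5708
print(r.mean())
```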

Consider a matrix \(A\in\mathbb{R}^{m\times n}\). Then \(Ax\) is a linear mapping from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). There is an \(r\leq n\), such that:

1. \(A\) maps an \(r\)-dim. subspace of \(\mathbb{R}^n\) to an \(r\)-dim. subspace of \(\mathbb{R}^m\) (the column space or image of \(A\)).

2. The complementary \((n-r)\)-dimensional subspace of \(\mathbb{R}^n\), called the null space of \(A\), is mapped to \(0\).
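A small numerical illustration (the matrix here is made up): numpy/scipy can exhibit \(r\) and the \((n-r)\)-dimensional null space directly.

```python
import numpy as np
from scipy.linalg import null_space

# A 3x4 matrix of rank 2 (third row = first row + second row)
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 3.0, 2.0]])

r = np.linalg.matrix_rank(A)  # r = 2
N = null_space(A)             # orthonormal basis of the null space, shape (4, n - r)
print(r, N.shape[1])          # 2 2: rank + nullity = n = 4

# Every null-space vector is mapped to 0
print(np.allclose(A @ N, 0))  # True
```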

Figure from Strang (1993) "The Fundamental Theorem of Linear Algebra".

Toots of random math/stats/ML trivia that I find interesting.

Joined Aug 2018