Multilayer (vanilla) RNN.

hₜˡ = tanh[ Wˡ ⋅ (hₜˡ⁻¹, hₜ₋₁ˡ)ᵀ ].
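A minimal sketch of that update in plain Python, just to make the indices concrete. The function names are made up for illustration, there is no bias term (as in the formula), and I am assuming the usual convention that the "layer 0" state is the input xₜ.

```python
import math

def rnn_layer_step(W, h_below, h_prev):
    """One step of layer l: tanh(W · (h_below, h_prev)ᵀ)."""
    v = h_below + h_prev  # list concatenation stacks the two vectors
    return [math.tanh(sum(w * x for w, x in zip(row, v))) for row in W]

def multilayer_step(weights, x_t, h_prev_all):
    """Advance every layer by one time step and return the new hidden states."""
    h_below = x_t  # assumption: the state below layer 1 is the input x_t
    new_states = []
    for W, h_prev in zip(weights, h_prev_all):
        h_below = rnn_layer_step(W, h_below, h_prev)
        new_states.append(h_below)
    return new_states

# Two layers, hidden size 2, input size 2, so each W is 2 x 4 (toy numbers).
weights = [
    [[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2]],
    [[0.1, 0.1, 0.4, -0.3], [-0.6, 0.2, 0.0, 0.7]],
]
h_prev_all = [[0.0, 0.0], [0.0, 0.0]]
x_t = [1.0, -1.0]
h_new = multilayer_step(weights, x_t, h_prev_all)
print(h_new)
```

Each layer reads the fresh state of the layer below (same time step) and its own state from the previous time step.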

The Beta distribution is the probability distribution of a probability.

For example, let p be the probability of some event happening. Assume that we don't know p exactly, but know that p should lie within approximately [0.1, 0.35], and is most likely about 0.2. Then we may use the Beta(20, 80) distribution to represent this knowledge, because its mean value is 20/(20+80) = 0.2 and it lies almost entirely within [0.1, 0.35].
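A quick way to check the "almost entirely within [0.1, 0.35]" part is to integrate the Beta(20, 80) density numerically. This is only a sketch with a simple midpoint rule; the helper names are mine.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_mass(lo, hi, a, b, steps=10_000):
    """Probability mass of Beta(a, b) on [lo, hi], midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(beta_pdf(lo + (k + 0.5) * h, a, b) for k in range(steps))

a, b = 20, 80
mean = a / (a + b)               # 0.2
mode = (a - 1) / (a + b - 2)     # most likely value, slightly below the mean
mass = beta_mass(0.1, 0.35, a, b)
print(mean, mode, mass)
```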

More along these lines:

Oh, actually I meant to attach this figure.

Source: Strang (1993) The Fundamental Theorem of Linear Algebra


Let A be an n × m matrix with n > m that has linearly independent columns.
Consider the equation Ax = b, where b is *not* in the column space of A. Then Ax = b cannot be solved exactly. Instead we can aim at making the error b - Ax as small as possible.
The vector b can be decomposed as b = p + e, where p is in the column space of A and e is in the nullspace of Aᵀ.
Now we can approximate the "solution" to Ax = b by solving Ax = p. In fact, the solution to Ax = p minimizes the squared error ||b - Ax||².
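A small worked example of this decomposition, as a sketch in plain Python. The 3 × 2 matrix and the vector b are toy values I picked, and the 2 × 2 normal equations AᵀAx = Aᵀb are solved by Cramer's rule only because the system is tiny.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

A = [[1, 0], [1, 1], [1, 2]]   # n = 3 rows, m = 2 independent columns
b = [6, 0, 0]                  # not in the column space of A

At = transpose(A)
AtA = [[sum(p * q for p, q in zip(ci, cj)) for cj in At] for ci in At]
Atb = matvec(At, b)

# Solve the 2 x 2 system (AᵀA) x = Aᵀb by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [
    (Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
    (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det,
]

p = matvec(A, x)                        # projection of b onto the column space
e = [bi - pi for bi, pi in zip(b, p)]   # error, should lie in the nullspace of Aᵀ
print("x =", x, "p =", p, "e =", e, "Aᵀe =", matvec(At, e))
```

For anything bigger than a toy problem one would normally reach for a library routine such as `numpy.linalg.lstsq` instead of forming AᵀA by hand.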

Fig. from Strang (1993)

Grid Search no more!

Here is a very nice illustration from Bergstra & Bengio (2012) of why Random Search is often superior to Grid Search for hyperparameter tuning: with the same number of trials, Random Search tries many more distinct values of each individual parameter, so it covers the few parameters that really matter much better.
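The counting argument behind that picture fits in a few lines. This is a toy sketch on the unit square with 9 trials, not the experiment from the paper.

```python
import random

random.seed(0)

n_per_axis = 3
grid_values = [(i + 0.5) / n_per_axis for i in range(n_per_axis)]
grid = [(x, y) for x in grid_values for y in grid_values]            # 9 trials
rand = [(random.random(), random.random()) for _ in range(len(grid))]  # 9 trials

# If only the first parameter matters, count how many distinct values of it we tried.
distinct_grid_x = len({x for x, _ in grid})
distinct_rand_x = len({x for x, _ in rand})
print("grid:", distinct_grid_x, "random:", distinct_rand_x)
```

With the same budget of 9 trials, the grid only ever probes 3 values of each parameter, while random search probes 9.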

Turns out an ancient paper(*) has the answer.
If z = u₁ + iv₁ and w = u₂ + iv₂, where u₁, u₂, v₁, v₂ ~ N(0,1) are independent, then the probability density of
r := |wz|
is given by
f(r) = r K₀(r), r > 0,
where K₀ denotes the modified Bessel function of the second kind of order 0.

(*) Wells, Anderson, Cell (1962) "The Distribution of the Product of Two Central or Non-Central Chi-Square Variates"
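A numerical sanity check, as a sketch: I am assuming the density has the form f(r) = r K₀(r) for r > 0, which is what you get when |z| and |w| are treated as independent Rayleigh variables. To stay within the standard library, K₀ is evaluated from the integral representation K₀(r) = ∫₀^∞ exp(-r cosh t) dt with a midpoint rule; the cut-offs and step sizes are ad hoc choices.

```python
import math

def k0(r, t_max=12.0, steps=600):
    """Modified Bessel function K_0(r) via integral of exp(-r cosh t), midpoint rule."""
    h = t_max / steps
    return h * sum(math.exp(-r * math.cosh((k + 0.5) * h)) for k in range(steps))

def density(r):
    """Assumed density of r = |zw|."""
    return r * k0(r)

dr = 0.025
grid = [(k + 0.5) * dr for k in range(1000)]   # midpoints covering (0, 25)
vals = [density(r) for r in grid]

total = dr * sum(vals)                               # should be close to 1
mean = dr * sum(r * f for r, f in zip(grid, vals))   # should be close to pi / 2
print(total, mean, math.pi / 2)
```

The mean can be cross-checked independently: E|zw| = E|z| · E|w| = (√(π/2))² = π/2.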


Consider two random complex numbers
z = u₁ + iv₁ and
w = u₂ + iv₂,
where u₁, v₁, u₂, v₂ are independent standard normal random variables (N(0,1)).
Then what is the probability distribution of the absolute value of the product |zw|?
Some empirical investigation (simulation) shows that the distribution looks like this:
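A simulation along these lines can be sketched with the standard library alone. Sample size, seed, and bin width are arbitrary choices of mine, and the output is a crude text histogram rather than a proper plot.

```python
import math
import random

random.seed(1)
n = 100_000

def sample_abs_product():
    """Draw z and w with independent N(0,1) real and imaginary parts, return |zw|."""
    z = complex(random.gauss(0, 1), random.gauss(0, 1))
    w = complex(random.gauss(0, 1), random.gauss(0, 1))
    return abs(z * w)

samples = [sample_abs_product() for _ in range(n)]
sample_mean = sum(samples) / n

# Coarse text histogram on [0, 5) with bin width 0.5; larger values are left out.
bin_width = 0.5
counts = [0] * 10
for r in samples:
    k = int(r / bin_width)
    if k < len(counts):
        counts[k] += 1

for k, c in enumerate(counts):
    print(f"{k * bin_width:3.1f} {'#' * (60 * c // n)}")
print("sample mean:", sample_mean, " pi/2:", math.pi / 2)
```

Since |z| and |w| are independent with mean √(π/2) each, the sample mean should land near π/2.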

Consider a matrix \(A\in\mathbb{R}^{m\times n}\). Then \(x\mapsto Ax\) is a linear mapping from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). There is an \(r\leq\min(m,n)\), the rank of \(A\), such that:

1. \(A\) maps an \(r\)-dim. subspace of \(\mathbb{R}^n\) (the row space of \(A\)) one-to-one onto an \(r\)-dim. subspace of \(\mathbb{R}^m\) (the column space or image of \(A\)).

2. The orthogonal complement of the row space, an \((n-r)\)-dimensional subspace of \(\mathbb{R}^n\) called the null space of \(A\), is mapped to 0.
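Here is a small sketch that computes \(r\) and a null space basis for a toy matrix using exact row reduction. The matrix is an arbitrary example and the helper names are mine.

```python
from fractions import Fraction

def rref(rows):
    """Reduced row echelon form with exact arithmetic; returns (matrix, pivot columns)."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivots, r = [], 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it is a free column
        m[r], m[pivot] = m[pivot], m[r]
        lead = m[r][c]
        m[r] = [x / lead for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    return m, pivots

def null_space(rows):
    """One basis vector of the null space per free column."""
    m, pivots = rref(rows)
    n = len(rows[0])
    basis = []
    for f in (c for c in range(n) if c not in pivots):
        x = [Fraction(0)] * n
        x[f] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            x[p] = -m[row_idx][f]
        basis.append(x)
    return basis

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]   # second row is twice the first
_, pivots = rref(A)
rank = len(pivots)
basis = null_space(A)
print("rank r =", rank, " null space dimension =", len(basis))
print("A maps the basis vector to", matvec(A, basis[0]))
```

Here \(n = 3\) and \(r = 2\), so the null space has dimension \(n - r = 1\).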

Figure from Strang (1993) "The Fundamental Theorem of Linear Algebra".

