Deep neural networks (DNN) off the top of my head [1/3]

A DNN is basically a function $$\mathbb{R}^p \to \mathbb{R}^c : x \mapsto y$$ that is evaluated as follows.

Input: $$x \in \mathbb{R}^p$$
$$z_1 = W_1 x + b_1 \in \mathbb{R}^{h_1}$$
$$z_2 = W_2 f_1(z_1) + b_2 \in \mathbb{R}^{h_2}$$
$$\vdots$$
$$z_r = W_r f_{r-1}(z_{r-1}) + b_r \in \mathbb{R}^{c}$$
Output: $$y = f_r(z_r) \in \mathbb{R}^c$$

where we need to optimize the weights $$W_1, \dots, W_r$$ and the biases $$b_1, \dots, b_r$$.
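The forward evaluation above can be sketched in a few lines of NumPy. The layer sizes, the ReLU hidden activation, and the identity output activation are my choices for illustration; the text above leaves the $f_k$ unspecified.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activations):
    """Evaluate z_k = W_k f_{k-1}(z_{k-1}) + b_k, then y = f_r(z_r)."""
    a = x
    for W, b, f in zip(weights, biases, activations):
        z = W @ a + b
        a = f(z)
    return a

# Tiny example: p=3 inputs, one hidden layer with h_1=4, c=2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
activations = [relu, lambda z: z]  # identity on the output layer
y = forward(rng.standard_normal(3), weights, biases, activations)
```

Note that each $W_k$ has shape $h_k \times h_{k-1}$, so the matrix-vector products line up layer by layer.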

Deep neural networks (DNN) off the top of my head [2/3]

Suppose we have a "cost" function $$C(y_{\text{true}}, y)$$, which quantifies the error between the true and predicted values. One typically uses stochastic gradient descent to find the "best" weights $$W_1, \dots, W_r$$ and biases $$b_1, \dots, b_r$$. To do gradient descent one needs the derivatives of C with respect to all weights and biases. The "backpropagation" algorithm (an application of the chain rule) is used to obtain these derivatives.
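As a minimal illustration of the gradient-descent idea, here is a hypothetical one-parameter version: fit $y = w x$ to a single data point by repeatedly stepping $w$ against the gradient of a squared-error cost. The data, cost, learning rate, and iteration count are all my assumptions.

```python
# Fit y = w * x to one data point by gradient descent on
# C = 0.5 * (w*x - y_true)^2 (a toy stand-in for the DNN case).
x, y_true = 2.0, 6.0   # so the optimal w is 3
w, eta = 0.0, 0.05     # initial weight, learning rate
for _ in range(200):
    y = w * x
    dC_dw = (y - y_true) * x  # chain rule: dC/dy * dy/dw
    w -= eta * dC_dw          # the gradient-descent update
# w has converged close to the optimum 3
```

For a real DNN the same update is applied to every entry of every $W_k$ and $b_k$, which is why we need all those derivatives.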

DNNs off the top of my head [3/3]

"Backpropagation":

$$\delta_r = dC / d z_r \in \mathbb{R}^c$$
$$\delta_{r-1} = dC / d z_{r-1} = (d z_r / d z_{r-1})^T \delta_r \in \mathbb{R}^{h_{r-1}}$$
$$\vdots$$
$$\delta_1 = dC / d z_1 = (d z_2 / d z_1)^T \delta_2 \in \mathbb{R}^{h_1}$$

(each δ is computed from the one after it, working from the last layer back to the first - hence "backpropagation").
Finally,

$$dC / d W_k = \delta_k \cdot f_{k-1}(z_{k-1})^T \in \mathbb{R}^{h_k \times h_{k-1}}$$
$$dC / d b_k = \delta_k \in \mathbb{R}^{h_k}$$

(with the convention $$f_0(z_0) = x$$ for the first layer).
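The whole backward pass can be sketched for a 2-layer net. I pick a ReLU hidden activation, identity output, and squared-error cost (assumptions, not fixed by the post), and verify one weight gradient against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sizes: p=3 inputs, h_1=4 hidden units, c=2 outputs.
p, h1, c = 3, 4, 2
W1, b1 = rng.standard_normal((h1, p)), rng.standard_normal(h1)
W2, b2 = rng.standard_normal((c, h1)), rng.standard_normal(c)
x, y_true = rng.standard_normal(p), rng.standard_normal(c)

# Forward pass, keeping the intermediate z's for the backward pass.
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0.0)  # f_1 = ReLU
z2 = W2 @ a1 + b2
y = z2                    # f_2 = identity

# Backward pass: delta_r first, then propagate towards the input.
delta2 = y - y_true                  # dC/dz2 for C = 0.5*||y - y_true||^2
delta1 = (W2.T @ delta2) * (z1 > 0)  # dC/dz1 = (dz2/dz1)^T delta2
dW2 = np.outer(delta2, a1)           # dC/dW2 = delta2 * f_1(z_1)^T
db2 = delta2                         # dC/db2
dW1 = np.outer(delta1, x)            # dC/dW1 (here f_0(z_0) = x)
db1 = delta1                         # dC/db1

# Sanity check: compare dW1[0,0] with a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
yp = W2 @ np.maximum(W1p @ x + b1, 0.0) + b2
num = (0.5 * np.sum((yp - y_true)**2) - 0.5 * np.sum((y - y_true)**2)) / eps
```

The outer products here are exactly $dC/dW_k = \delta_k f_{k-1}(z_{k-1})^T$, which is why reusing the δs makes the backward pass as cheap as the forward one.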
