Deep neural networks (DNN) off the top of my head [1/3]

A DNN is basically a function \(\mathbb{R}^p \to \mathbb{R}^c : x \mapsto y\) that is evaluated as follows.

Input: \( x \in \mathbb{R}^p \)
\( z_1 = W_1 x + b_1 \in \mathbb{R}^{h_1} \)
\( z_2 = W_2 f_1(z_1) + b_2 \in \mathbb{R}^{h_2} \)
\( \vdots \)
\( z_r = W_r f_{r-1}(z_{r-1}) + b_r \in \mathbb{R}^{c} \)
Output: \( y = f_r(z_r) \in \mathbb{R}^c \)

where we need to optimize the weights \(W_1, \dots, W_r\) and the biases \(b_1, \dots, b_r\).
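The forward pass above can be sketched in a few lines of NumPy. This is a toy illustration, not part of the thread: the layer sizes and the choice of ReLU / identity activations are my own assumptions.

```python
import numpy as np

def forward(x, Ws, bs, fs):
    """Evaluate z_k = W_k f_{k-1}(z_{k-1}) + b_k layer by layer; return y = f_r(z_r)."""
    a = x
    for W, b, f in zip(Ws, bs, fs):
        z = W @ a + b  # affine step
        a = f(z)       # activation step
    return a

# Toy sizes: p=3 inputs, one hidden layer h_1=4, c=2 outputs (hypothetical).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
fs = [lambda z: np.maximum(z, 0),  # ReLU for the hidden layer
      lambda z: z]                 # identity for the output layer
y = forward(np.ones(3), Ws, bs, fs)
```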


Deep neural networks (DNN) off the top of my head [2/3]

Suppose we have a "cost" function \( C(y_{\text{true}}, y) \), which quantifies the prediction error between the true and predicted values. One typically uses stochastic gradient descent to find the "best" weights \(W_1, \dots, W_r\) and biases \(b_1, \dots, b_r\). To do gradient descent one needs to differentiate \(C\) w.r.t. all the weights and biases. The "backpropagation" algorithm (a.k.a. the chain rule) is used to obtain these derivatives.
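One SGD step then just nudges every weight and bias against its gradient. A minimal sketch, assuming the per-layer gradients \(dC/dW_k\) and \(dC/db_k\) have already been computed (e.g. by backprop) and a hypothetical learning rate `lr`:

```python
import numpy as np

def sgd_step(Ws, bs, grads_W, grads_b, lr=0.01):
    """In-place SGD update: W_k -= lr * dC/dW_k, b_k -= lr * dC/db_k."""
    for k in range(len(Ws)):
        Ws[k] -= lr * grads_W[k]
        bs[k] -= lr * grads_b[k]
    return Ws, bs
```

In "stochastic" gradient descent the gradients are computed on a single example (or a small minibatch) rather than the whole dataset, which makes each step cheap.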

DNNs off the top of my head [3/3]


\( \delta_r = dC / d z_r \in \mathbb{R}^c \)
\( \delta_{r-1} = dC / d z_{r-1} = \delta_r \cdot d z_r / d z_{r-1} \in \mathbb{R}^{h_{r-1}} \)
\( \vdots \)
\( \delta_1 = dC / d z_1 = \delta_2 \cdot d z_2 / d z_1 \in \mathbb{R}^{h_1} \)

(each δ is computed from the one after it, working from the output layer back to the input - hence "backpropagation").

\( dC / d W_k = \delta_k \cdot f_{k-1}(z_{k-1})^T \in \mathbb{R}^{h_k \times h_{k-1}} \)   (with \( f_0(z_0) := x \))
\( dC / d b_k = \delta_k \in \mathbb{R}^{h_k} \)
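Putting the δ recursion and the gradient formulas together for a concrete 2-layer net. My assumptions here (not from the thread): ReLU hidden activation, identity output, and squared-error cost \(C = \tfrac{1}{2}\|y - y_{\text{true}}\|^2\), so \(\delta_2 = y - y_{\text{true}}\). The finite-difference check at the end compares \(dC/dW_1\) against a numerical derivative:

```python
import numpy as np

def backprop(x, y_true, W1, b1, W2, b2):
    """Return (dC/dW2, dC/db2, dC/dW1, dC/db1) for a ReLU->identity net with squared-error cost."""
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0)               # f_1 = ReLU
    y = W2 @ a1 + b2                     # f_2 = identity, so y = z2
    delta2 = y - y_true                  # dC/dz2
    delta1 = (W2.T @ delta2) * (z1 > 0)  # dC/dz1 = W2^T delta2 ⊙ ReLU'(z1)
    return np.outer(delta2, a1), delta2, np.outer(delta1, x), delta1

# Numerical check of dC/dW1[0,0] via a forward difference:
rng = np.random.default_rng(1)
x, y_true = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
dW2, db2, dW1, db1 = backprop(x, y_true, W1, b1, W2, b2)

def cost(W1):
    a1 = np.maximum(W1 @ x + b1, 0)
    return 0.5 * np.sum((W2 @ a1 + b2 - y_true) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
err = abs((cost(W1p) - cost(W1)) / eps - dW1[0, 0])
```

Note the shapes match the formulas above: `np.outer(delta2, a1)` is \(h_2 \times h_1\), and `np.outer(delta1, x)` is \(h_1 \times p\).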
