Kaiming He Initialization in Neural Networks — Math Proof

Initialization techniques are one of the prerequisites for successfully training a deep learning architecture. Traditionally, weight initialization methods need to be compatible with the choice of activation function, as a mismatch can negatively affect training.

ReLU is one of the most commonly used activation functions in deep learning. Its properties make it a very convenient choice for scaling to large neural networks. On the one hand, it is inexpensive to compute its derivative during backpropagation, since it is a piecewise linear function with a step-function derivative. On the other hand, ReLU helps reduce feature correlation, as it is a non-negative activation function, i.e. features can only contribute positively to subsequent layers. It is a prevalent choice in convolutional architectures, where the input dimension is large and neural networks tend to be very deep.

In “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”⁽¹⁾, He et al. (2015) present a method to optimally initialize neural network layers that use a ReLU activation function. This method lets the neural network start in a regime with constant variance between inputs and outputs, in terms of both the forward and backward passes, which empirically showed meaningful improvements in training stability and speed. In the following sections, we provide a detailed and complete derivation of the He initialization technique.

Notation

  • A layer in a neural network, composed of a weight matrix Wₖ and a bias vector bₖ, applies two consecutive transformations. The first transformation is yₖ = xₖWₖ + bₖ, and the second is xₖ₊₁ = f(yₖ)
  • xₖ is the activation layer and yₖ is the pre-activation layer
  • A layer has nₖ units, thus xₖ ∈ ℝ^(nₖ), Wₖ ∈ ℝ^(nₖ × nₖ₊₁), bₖ ∈ ℝ^(nₖ₊₁)
  • xₖWₖ + bₖ has dimension (1 × nₖ) × (nₖ × nₖ₊₁) + (1 × nₖ₊₁) = 1 × nₖ₊₁
  • The activation function f is applied element-wise and does not change the shape of a vector. As a result, xₖ₊₁ = f(xₖWₖ + bₖ) ∈ ℝ^(nₖ₊₁)
  • For a neural network of depth n, the input layer is represented by x₀ and the output layer by xₙ
  • The loss function of the network is represented by L
  • Δx = ∂L/∂x denotes gradients of the loss function with respect to vector x

Assumptions

  • Assumption 1:
    We assume for this initialization setup the non-linear activation function ReLU, defined as f(x) = ReLU(x) = max(0, x). As a function defined piecewise on two intervals, its derivative has a value of 1 on the strictly positive half-line and 0 on the strictly negative half-line. Technically, the derivative of ReLU is not defined at 0, because the limits from either side are not equal, that is f’(0⁻) = 0 ≠ 1 = f’(0⁺). In practice, for backpropagation purposes, ReLU’(0) is taken to be 0.
  • Assumption 2:
    It is assumed that all inputs, weights, and layers in the neural network are independent and identically distributed (iid) at initialization, as are the gradients.
  • Assumption 3:
    The inputs are assumed to be normalized with zero mean, and the weights and biases are initialized from a symmetric distribution centered at zero, i.e. 𝔼[x₀] = 𝔼[Wₖ] = 𝔼[bₖ] = 0. This means that both xₖ and yₖ have an expectation of zero at initialization, and yₖ has a symmetric distribution at initialization because f(0) = 0.

Motivation

The aim of this proof is to determine the distribution of the weight matrix by finding Var[W] given two constraints:

  1. ∀k, Var[yₖ] = Var[yₖ₋₁], i.e. constant variance in the forward signal
  2. ∀k, Var[Δxₖ] = Var[Δxₖ₊₁], i.e. constant variance in the backward signal

Ensuring that the variance of both layers and gradients is constant throughout the network at initialization helps prevent exploding and vanishing gradients. If the gain is above one, it will result in exploding gradients and optimization divergence, while if the gain is below one, it will result in vanishing gradients and halt learning. The above two equations ensure that the signal gain is exactly one.

The motivation as well as the derivations in this article follow the Xavier Glorot initialization⁽²⁾ paper published five years prior. While that earlier work uses post-activation layers for constant variance in the forward signal, the He initialization proof uses pre-activation layers. Similarly, for the backward signal, He’s derivation uses post-activation layers instead of the pre-activation layers used in Glorot’s initialization. Given that these two proofs share some similarities, looking at both helps gain insight into why controlling the weights’ variance is so important in any neural network. (See “Xavier Glorot Initialization in Neural Networks — Math Proof” for more details.)

I. Forward Pass

We are looking for Wₖ such that the variance of each subsequent pre-activation layer y is equal, i.e. Var[yₖ] = Var[yₖ₋₁].

We know that yₖ = xₖWₖ + bₖ.

For simplicity, we look at the i-th element of the pre-activation layer yₖ and apply the variance operator to both sides of the previous equation, proceeding in five steps (the resulting chain of equalities is sketched after the list below).

  • In the first step, we remove bₖ entirely, as, following Assumption 3, it is initialized at zero. Moreover, we leverage the independence of W and x to turn the variance of the sum into a sum of variances, following Var[X + Y] = Var[X] + Var[Y] for X ⊥ Y.
  • In the second step, as W and x are i.i.d., each term in the sum is equal, hence the sum is simply nₖ repetitions of Var[xW].
  • In the third step, we use the formula for X ⊥ Y, which states that Var[XY] = E[X²]E[Y²] − E[X]²E[Y]². This allows us to separate the W and x contributions to the pre-activation layer’s variance.
  • In the fourth step, we leverage Assumption 3 of zero expectation for weights and layers at initialization. This leaves us with a single term involving a squared expectation.
  • In the fifth step, we transform the squared expectation of the weights into a variance, since Var[X] = E[(X − E[X])²] = E[X²] if X has zero mean. We can now express the pre-activation layer’s variance as a product of the layer’s squared expectation and the weights’ variance.
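A compact sketch of these five steps, written for the i-th unit of yₖ (with the bias dropped and all terms independent and zero-mean, as stated above):

\[
\operatorname{Var}\big[y_k^i\big]
= \operatorname{Var}\Big[\sum_{j=1}^{n_k} x_k^j W_k^{ji}\Big]
= \sum_{j=1}^{n_k} \operatorname{Var}\big[x_k^j W_k^{ji}\big]
= n_k \operatorname{Var}[x_k W_k]
\]
\[
= n_k \big(\mathbb{E}[x_k^2]\,\mathbb{E}[W_k^2] - \mathbb{E}[x_k]^2\,\mathbb{E}[W_k]^2\big)
= n_k\, \mathbb{E}[x_k^2]\,\mathbb{E}[W_k^2]
= n_k\, \mathbb{E}[x_k^2]\,\operatorname{Var}[W_k]
\]

where, from the third expression onwards, xₖ and Wₖ denote a single (scalar) entry of the corresponding vector and matrix.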

Finally, in order to link Var[yₖ] to Var[yₖ₋₁], we express the squared expectation E[xₖ²] in terms of Var[yₖ₋₁] in the next steps, using the Law of the Unconscious Statistician (LOTUS); the resulting chain is sketched after the list below.

The theorem states that we can write the expectation of a function of a random variable as an integral of that function against the variable’s probability density p. As we know that xₖ = max(0, yₖ₋₁), we can rewrite the squared expectation of xₖ as an integral over y.
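In symbols (a sketch, writing p for the density of yₖ₋₁):

\[
\mathbb{E}[g(Y)] = \int_{\mathbb{R}} g(y)\, p(y)\, dy
\quad\Longrightarrow\quad
\mathbb{E}\big[x_k^2\big] = \int_{\mathbb{R}} \max(0, y)^2\, p(y)\, dy
\]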

  • In the sixth step, we simplify the integral using the fact that the integrand max(0, y)² is zero on ℝ⁻.
  • In the seventh step, we leverage the statistical property of y as a symmetric random variable, which hence has a symmetric density function p, and note that the entire integrand is an even function. Even functions are symmetric with respect to 0 on ℝ, which means that integrating from 0 to a is the same as integrating from −a to 0. We use this trick to rewrite the integral over ℝ⁺ as half of an integral over all of ℝ.
  • In the ninth and tenth steps, we rewrite this integral as an integral of a function of a random variable. By applying LOTUS — this time from right to left — we can turn this integral back into an expectation of that function of the random variable y. As the squared expectation of a zero-mean variable, this is simply a variance.
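Chaining steps six through ten gives (a sketch, using the symmetry of p and 𝔼[yₖ₋₁] = 0):

\[
\mathbb{E}\big[x_k^2\big]
= \int_{\mathbb{R}} \max(0, y)^2\, p(y)\, dy
= \int_{0}^{+\infty} y^2\, p(y)\, dy
= \frac{1}{2} \int_{\mathbb{R}} y^2\, p(y)\, dy
= \frac{1}{2}\, \mathbb{E}\big[y_{k-1}^2\big]
= \frac{1}{2}\, \operatorname{Var}[y_{k-1}]
\]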

We can finally put it all together using the results of steps five and ten — the variance of a pre-activation layer is directly linked to the previous pre-activation variance as well as to the variance of the layer’s weights. Since we require that Var[yₖ] = Var[yₖ₋₁], this confirms that a layer’s weight variance Var[Wₖ] needs to be 2/nₖ.

In summary, here, once again, is the complete derivation of the forward propagation reviewed in this section:
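Combining the two partial results above (a minimal summary of the forward pass):

\[
\operatorname{Var}[y_k]
= n_k\, \operatorname{Var}[W_k]\, \mathbb{E}\big[x_k^2\big]
= \frac{n_k}{2}\, \operatorname{Var}[W_k]\, \operatorname{Var}[y_{k-1}],
\qquad
\operatorname{Var}[y_k] = \operatorname{Var}[y_{k-1}]
\;\Longrightarrow\;
\operatorname{Var}[W_k] = \frac{2}{n_k}
\]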

II. Backward Pass

We are looking for Wₖ such that Var[Δxₖ] = Var[Δxₖ₊₁].

Here, xₖ₊₁ = f(yₖ) and yₖ = xₖWₖ + bₖ.

Before applying the variance operator, let us first calculate the partial derivatives of the loss L with respect to x and y, namely Δxₖ and Δyₖ (the resulting identities are sketched after the list below).

  • First, we use the chain rule and the fact that the derivative of a linear product is its linear coefficient — in this case, Wₖ.
  • Second, we leverage Assumption 2, stating that gradients and weights are independent of each other. Using independence, the expectation of the product becomes the product of expectations, which is equal to zero since the weights are initialized with zero mean. Hence, the expectation of the gradient of L w.r.t. x is zero.
  • Third, we use the chain rule to link Δyₖ and Δxₖ₊₁, as the partial derivative of x w.r.t. y is ReLU’s derivative evaluated at y.
  • Fourth, recalling the derivative of ReLU, we compute the expectation of Δyₖ using the previous equation. As f’ splits into two parts, each occurring with probability ½, we can write the expectation as a sum of two terms: one over ℝ⁺ and one over ℝ⁻. From the previous calculation, we know that the expectation of Δxₖ₊₁ is zero, and we can thus confirm that both gradients have a mean of 0.
  • Fifth, we use the same rule as before to write a squared expectation as a variance, here with Δyₖ.
  • Sixth, we leverage Assumption 2, stating that gradients are independent at initialization, to separate the variances of the two factors Δxₖ₊₁ and f’(yₖ). Further simplification stems from Assumption 3, and we can finally compute the squared expectation of f’ given its even split between the positive and negative intervals.
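A sketch of the identities described above (all quantities taken at initialization; f’(yₖ) equals 1 or 0, each with probability ½, and is independent of Δxₖ₊₁):

\[
\Delta x_k = \Delta y_k\, W_k^{\top}
\;\Longrightarrow\;
\mathbb{E}[\Delta x_k] = \mathbb{E}[\Delta y_k]\, \mathbb{E}[W_k] = 0
\]
\[
\Delta y_k = f'(y_k)\, \Delta x_{k+1}
\;\Longrightarrow\;
\mathbb{E}[\Delta y_k] = \tfrac{1}{2}\, \mathbb{E}[\Delta x_{k+1}] + \tfrac{1}{2}\cdot 0 = 0
\]
\[
\operatorname{Var}[\Delta y_k]
= \mathbb{E}\big[\Delta y_k^2\big]
= \mathbb{E}\big[f'(y_k)^2\big]\, \mathbb{E}\big[\Delta x_{k+1}^2\big]
= \tfrac{1}{2}\, \operatorname{Var}[\Delta x_{k+1}]
\]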

Finally, using the results gathered above, and reapplying the iid assumption, we conclude that the result of the backpropagation pass is similar to that of the forward pass, i.e. given Var[Δxₖ] = Var[Δxₖ₊₁], the variance of any layer’s weights Var[Wₖ] equals 2/nₖ.

To summarize, here is a reminder of the important step-by-step calculations included within this backward-pass section:
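Putting the pieces together (a sketch; note that in this article’s notation the sum defining Δxₖ runs over the nₖ₊₁ units of layer k + 1, so the fan-out appears in the final expression — for layers of equal width, nₖ₊₁ = nₖ, this coincides with the 2/nₖ quoted above):

\[
\operatorname{Var}[\Delta x_k]
= n_{k+1}\, \operatorname{Var}[W_k]\, \operatorname{Var}[\Delta y_k]
= \frac{n_{k+1}}{2}\, \operatorname{Var}[W_k]\, \operatorname{Var}[\Delta x_{k+1}],
\qquad
\operatorname{Var}[\Delta x_k] = \operatorname{Var}[\Delta x_{k+1}]
\;\Longrightarrow\;
\operatorname{Var}[W_k] = \frac{2}{n_{k+1}}
\]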

III. Weight Distribution

In the two previous sections, we concluded the following for both the forward and the backward setups:
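Restating the requirement on the weights that both derivations lead to (with nₖ the fan-in of layer k, as used in the rest of this section):

\[
\operatorname{Var}[W_k] = \frac{2}{n_k}
\]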

It is interesting to note that this result differs from the Glorot initialization⁽²⁾, where the authors essentially have to average the two distinct results obtained in the forward and backward passes. Moreover, we observe that the variance in the He method is doubled, which, intuitively, is due to the fact that ReLU’s zeroed-out negative part reduces the variance by a factor of two.

Consequently, knowing the variance of the distribution, we can now initialize the weights with either a normal distribution N(0, 𝜎²) or a uniform distribution U(−a, a). Empirically, there is no evidence that one distribution is superior to the other, and it appears that the performance improvement comes down solely to the symmetry and scale properties of the chosen distribution. Moreover, we do have to take Assumption 3 into account, which restricts the choice to distributions that are symmetric and centered at 0.

If X ~ N(0, 𝜎²), then Var[X] = 𝜎², thus the variance and standard deviation of the weight matrix can be written as:
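That is, using the Var[Wₖ] = 2/nₖ result derived above:

\[
\operatorname{Var}[W_k] = \sigma^2 = \frac{2}{n_k}
\quad\Longrightarrow\quad
\sigma = \sqrt{\frac{2}{n_k}}
\]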

We can therefore conclude that Wₖ follows a normal distribution with coefficients:
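In distributional notation, this reads:

\[
W_k \sim \mathcal{N}\!\left(0,\; \frac{2}{n_k}\right)
\]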

As a reminder, nₖ is the number of inputs of layer k.

If X ~ U(−a, a), then, using the formula below for the variance of a uniformly distributed random variable, we can find the bound a:
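For a uniform distribution on (−a, a), the general formula Var[U(α, β)] = (β − α)²/12 gives a²/3, which we match to the required variance:

\[
\operatorname{Var}[X] = \frac{\big(a - (-a)\big)^2}{12} = \frac{a^2}{3} = \frac{2}{n_k}
\quad\Longrightarrow\quad
a = \sqrt{\frac{6}{n_k}}
\]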

Finally, we can conclude that Wₖ follows a uniform distribution with coefficients:
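In distributional notation:

\[
W_k \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_k}},\; \sqrt{\frac{6}{n_k}}\right)
\]

For illustration, here is a minimal NumPy sketch of both variants. The helper names he_normal and he_uniform, and the choice of NumPy, are illustrative rather than a specific library API; n_in plays the role of nₖ (the layer’s fan-in).

import numpy as np

def he_normal(n_in: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    # Weights ~ N(0, 2 / n_in): standard deviation sqrt(2 / fan-in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

def he_uniform(n_in: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    # Weights ~ U(-a, a) with a = sqrt(6 / n_in), so Var = a^2 / 3 = 2 / n_in
    a = np.sqrt(6.0 / n_in)
    return rng.uniform(low=-a, high=a, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = he_normal(1024, 512, rng)
print(W.var())  # empirically close to 2 / 1024 ≈ 0.00195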

Conclusion

This article provides a step-by-step derivation of why the He initialization method is optimal for neural networks that use ReLU activation functions, given the constraints that the forward and backward passes have constant variance.

The methodology of this proof also extends to the broader family of linear rectifiers, such as PReLU (discussed in ⁽¹⁾ by He et al.) or Leaky ReLU (which allows a small gradient to flow on the negative interval). Similar optimal variance formulas can be derived for these variants of the ReLU activation function.
