The Channel in Convolutional Neural Nets

THOUGHTS

By LI Haoyang 2020.12.24

Content

Definitions
Channel
Convolution
Pooling
Activation
Convolutional Neural Networks
The Bound of Channel Number
Linear Algebra Perspective
Information Theory Perspective
Signal Processing Perspective
Conflicts

Definitions

Channel

Reference: https://en.wikipedia.org/wiki/Channel_(digital_image)

For a digital image, a channel is a grayscale image of the same size as the color image, made of just one of the primary colors.

A typical color image has three channels, for red, green and blue respectively. A grayscale image has only one channel, representing the intensity of the pixels.

A channel is a projection of the image onto one axis of a color space.

Convolution

Reference: https://pytorch.org/docs/master/generated/torch.nn.Conv2d.html#torch.nn.Conv2d

The "convolution" operation used in convolutional neural networks is actually a cross-correlation, not a true convolution as used in signal processing.

Taken from the PyTorch documentation for torch.nn.Conv2d, it is defined as

out(N_i,C_{out_j})=bias(C_{out_j})+\sum_{k=0}^{C_{in}-1}weight(C_{out_j},k)* input(N_i,k)

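where $*$ is the valid 2D cross-correlation operator, the input has size $(N,C_{in},H,W)$, the output has size $(N,C_{out},H_{out},W_{out})$ and the weight has size $(C_{out}, C_{in},H_{weight}, W_{weight})$.

For an input with $C_{in}$ channels and an output with $C_{out}$ channels, each of the $C_{out}$ output channels is the sum of $C_{in}$ cross-correlations of corresponding channels of weight and input, shifted by a bias.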

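It can also be written as a matrix multiplication¹, in which each element of the output $Y$ is a linear combination of the elements of the input $X$, i.e.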

y_j=\sum_{i}A_{ji}x_i+b_j

The convolution/cross-correlation operation is therefore linear (strictly speaking, affine when the bias is non-zero).
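As a minimal sketch of the definition above, the following pure-Python code computes a multi-channel cross-correlation on nested lists and checks linearity numerically with the bias set to zero. The helper names (`cross_correlate2d`, `conv2d`, `scale_add`) and the toy values are assumptions made for illustration; this is not PyTorch's implementation.

```python
def cross_correlate2d(x, w):
    """Valid 2D cross-correlation of one channel x with one kernel w."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(w[m][n] * x[i + m][j + n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def conv2d(inp, weight, bias):
    """inp: [C_in][H][W], weight: [C_out][C_in][kH][kW], bias: [C_out]."""
    out = []
    for j, b in enumerate(bias):
        # One cross-correlation per input channel, summed, then shifted by the bias.
        maps = [cross_correlate2d(inp[k], weight[j][k]) for k in range(len(inp))]
        out.append([[b + sum(m[r][c] for m in maps)
                     for c in range(len(maps[0][0]))]
                    for r in range(len(maps[0]))])
    return out


def scale_add(a, x, y):
    """Element-wise a * x + y for tensors stored as [C][H][W] nested lists."""
    return [[[a * p + q for p, q in zip(rx, ry)] for rx, ry in zip(cx, cy)]
            for cx, cy in zip(x, y)]


# Toy example (hypothetical values): C_in = 2, C_out = 1, 3x3 input, 2x2 kernels.
inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
       [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
weight = [[[[1, 0], [0, -1]],
           [[1, 1], [1, 1]]]]
out = conv2d(inp, weight, bias=[0.5])

# Linearity check with zero bias: f(3 * x + y) == 3 * f(x) + f(y).
other = [[[0, 1, 0], [2, 0, 2], [0, 1, 0]],
         [[1, 1, 1], [0, 0, 0], [2, 2, 2]]]
lhs = conv2d(scale_add(3, inp, other), weight, bias=[0])
rhs = scale_add(3, conv2d(inp, weight, bias=[0]), conv2d(other, weight, bias=[0]))
print(out, lhs == rhs)
```

With a non-zero bias the equality fails by a constant offset, which is why the map is affine rather than strictly linear.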

Pooling

Reference: https://pytorch.org/docs/master/generated/torch.nn.AvgPool2d.html?highlight=avgpool2d#torch.nn.AvgPool2d

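Taken from the PyTorch documentation for torch.nn.AvgPool2d, it is defined as:

out(N_i,C_j,h,w)=\frac{1}{kH\cdot kW}\sum_{m=0}^{kH-1}\sum_{n=0}^{kW-1}input(N_i,C_j,stride[0]\times h+m,stride[1]\times w+n)

where $(kH,kW)$ is the kernel size, the input has size $(N,C,H,W)$ and the output has size $(N,C,H_{out},W_{out})$. With a square kernel of size $k$, the default stride (equal to the kernel size) and no padding, $H_{out}=\lfloor H/k\rfloor$ and $W_{out}=\lfloor W/k\rfloor$; with stride one, the indices reduce to $h+m$ and $w+n$.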

Other poolings are defined similarly.

Pooling does not change the number of channels; it only changes the spatial size of each channel.

Average pooling is linear, while other poolings, such as max pooling, are non-linear.
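The difference between average and max pooling can be checked on a toy example. The sketch below assumes non-overlapping $k\times k$ windows (stride equal to the kernel size); `pool2d`, `mean` and `add` are hypothetical helper names, and the values are made up.

```python
def pool2d(x, k, reduce):
    """Non-overlapping k x k pooling (stride = k) over a single channel x."""
    return [[reduce([x[i + m][j + n] for m in range(k) for n in range(k)])
             for j in range(0, len(x[0]) - k + 1, k)]
            for i in range(0, len(x) - k + 1, k)]


def mean(values):
    return sum(values) / len(values)


def add(x, y):
    """Element-wise sum of two 2D nested lists."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


a = [[1, -2, 0, 4],
     [3, 0, -1, 2],
     [0, 0, 5, -5],
     [2, -2, 1, 1]]
b = [[-v for v in row] for row in a]  # b = -a, so a + b is all zeros

# Average pooling commutes with addition ...
avg_is_additive = pool2d(add(a, b), 2, mean) == add(pool2d(a, 2, mean), pool2d(b, 2, mean))
# ... max pooling does not.
max_is_additive = pool2d(add(a, b), 2, max) == add(pool2d(a, 2, max), pool2d(b, 2, max))

# Pooling acts on each channel separately, so the channel count is unchanged.
image = [a, b, a]  # 3 channels of size 4x4
pooled = [pool2d(channel, 2, mean) for channel in image]
print(avg_is_additive, max_is_additive, len(pooled), len(pooled[0]), len(pooled[0][0]))
```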

Activation

Reference: https://pytorch.org/docs/master/generated/torch.nn.ReLU.html?highlight=relu#torch.nn.ReLU

Activation is used to introduce non-linearity into neural networks. An activation function is an element-wise operation that only changes the value of each element, not the shape of the tensor.

Taken from the PyTorch documentation for torch.nn.ReLU, it is defined as:

ReLU(x)=(x)^{+}=max(0,x)

Other popular activation functions include:

Sigmoid (torch.nn.Sigmoid): $Sigmoid(x)=\sigma(x)=\frac{1}{1+\exp(-x)}$
Tanh (torch.nn.Tanh): $Tanh(x)=\tanh(x)=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$

Activation is designed to be non-linear.
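A short sketch of the three activations above, written with Python's math module rather than PyTorch; the function names are illustrative only. The last lines show that ReLU is not additive, which is enough to make it non-linear.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def apply_elementwise(fn, channel):
    """Apply an activation to every element of a 2D nested list; the shape is unchanged."""
    return [[fn(v) for v in row] for row in channel]


channel = [[-1.0, 2.0], [0.5, -3.0]]
activated = apply_elementwise(relu, channel)

# A linear map f would satisfy f(a + b) == f(a) + f(b); ReLU does not.
a, b = -1.0, 2.0
relu_of_sum = relu(a + b)        # relu(1.0)
sum_of_relu = relu(a) + relu(b)  # 0.0 + 2.0
print(activated, relu_of_sum, sum_of_relu)
```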

Convolutional Neural Networks

Reference: https://en.wikipedia.org/wiki/Convolutional_neural_network

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

A typical convolutional neural network consists of multiple cascaded blocks, each made of a convolutional layer, an activation layer and a pooling layer.

The Bound of Channel Number

Linear Algebra Perspective

Setting the activation layer aside, consider a convolutional layer followed by an average pooling layer. Since both layers are linear, their combined output is a linear projection of the input.

For an input $X\in\R^{C\times H\times W}$, one channel of the output $y\in\R^{H^\prime\times W^\prime}$ is

y=AX+b

where $A$ and $b$ are the coefficients of this linear projection.

To recover an input with $C\times H\times W$ unknowns, we need at least $C\times H\times W$ linearly independent equations.

Each output channel provides $H^\prime\times W^\prime$ equations, so we need $C^\prime$ channels such that

C^\prime\times H^\prime\times W^\prime-M\ge C\times H\times W\implies C^\prime\ge\frac{C\times H\times W+M}{H^\prime\times W^\prime}

where $M$ is the number of linearly dependent equations among those obtained from the output.

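On the other hand, in a space of dimension $C\times H\times W$, we only need $C\times H\times W$ one-dimensional basis vectors to represent every point in this space, as long as all of them are linearly independent.

A layer with $C^\prime$ output channels has $C^\prime$ filters, which can be viewed as $C^\prime$ different bases.

Since a space of dimension $C\times H\times W$ needs no more than $C\times H\times W$ basis vectors, and each channel already contributes $H^\prime\times W^\prime$ outputs, the number of channels $C^\prime$ we need should satisfy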

C^\prime< C\times H\times W

Combining both, there should be

\frac{C\times H\times W}{H^\prime\times W^\prime}\le C^\prime< C\times H\times W

But the picture may be completely different once we consider the non-linear activation layer.

$M$ will be drastically reduced if we take the activation layer into consideration.
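To make the combined bound concrete, here is a small sketch that evaluates it for given shapes. The function name `channel_bounds` and the example shapes are assumptions chosen for illustration; `m` plays the role of $M$ and defaults to the best case of zero dependent equations.

```python
def channel_bounds(c, h, w, h_out, w_out, m=0):
    """Return (lower, upper) for C' under the linear-algebra argument.

    lower is the smallest integer C' with C' * h_out * w_out - m >= c * h * w;
    upper is c * h * w and is exclusive (C' < c * h * w).
    """
    n_in = c * h * w
    n_per_channel = h_out * w_out
    lower = -(-(n_in + m) // n_per_channel)  # integer ceiling division
    return lower, n_in


# Hypothetical example: a 3 x 32 x 32 input reduced to 16 x 16 feature maps.
lower, upper = channel_bounds(3, 32, 32, 16, 16)
lower_with_m, _ = channel_bounds(3, 32, 32, 16, 16, m=256)
print(lower, upper, lower_with_m)
```

In this example 12 channels already give as many equations as unknowns when all of them are independent, and every 256 dependent equations cost one extra channel.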

Information Theory Perspective

The information contained in an input $X\in\R^{C\times H\times W}$ satisfies

H(X)\le \log(C\times H\times W)

For an output $Y\in\R^{C^\prime\times H^\prime\times W^\prime}$, the information contained satisfies

H(Y)\le\log(C^\prime\times H^\prime\times W^\prime)

If we want the network to be capable of preserving all possible information, it should at least satisfy

\log(C^\prime\times H^\prime\times W^\prime)\ge\log (C\times H\times W)\implies C^\prime\ge \frac{C\times H\times W}{H^\prime\times W^\prime}

From this perspective, for data that carries less information, the number of channels can be reduced proportionally.

Intuitively this bound is very loose, since no data reaches the information upper bound; on the other hand, no output reaches the information upper bound either, so perhaps the two effects together make the bound reasonable.
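One rough way to see how far real data sits below the upper bound is to estimate the entropy of its pixel values. The sketch below computes the empirical Shannon entropy of the value histogram, which treats pixels as independent samples; that is a simplifying assumption and not the quantity $H(X)$ bounded above. The function name and the toy data are made up for illustration.

```python
import math
from collections import Counter


def empirical_entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    # sum of p * log2(1 / p) over the observed values
    return sum((c / n) * math.log2(n / c) for c in counts.values())


flat_image = [7] * 8                        # constant image: no information
uniform_image = [0, 1, 2, 3, 0, 1, 2, 3]    # 4 equally likely levels
skewed_image = [0, 0, 0, 0, 0, 0, 1, 1]     # mostly one level

h_flat = empirical_entropy_bits(flat_image)
h_uniform = empirical_entropy_bits(uniform_image)
h_skewed = empirical_entropy_bits(skewed_image)
print(h_flat, h_uniform, h_skewed)
```

Only the uniform case reaches the maximum of $\log_2 4 = 2$ bits per pixel for four levels; the more redundant the data, the lower the estimate.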

Signal Processing Perspective

According to the Nyquist-Shannon sampling theorem, if we want to be able to recover the original signal without aliasing, the sampling frequency should be at least twice the highest frequency of the original signal.

Although the theorem was originally proposed for the discretization of analog signals, most networks are designed such that every fourfold reduction of the feature size (halving both height and width) is accompanied by a doubling of the channel number.

From this perspective, the number of output channels $C^\prime$ should satisfy

C^\prime\ge\frac{1}{2}\cdot\frac{C\times H\times W}{H^\prime\times W^\prime}

This seems to be the most popular perspective.
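The bound above can be checked against the common "halve the resolution, double the channels" design. The stage shapes below are illustrative values in the style of typical image classifiers, not taken from any specific network, and `nyquist_style_bound` is a hypothetical helper name.

```python
def nyquist_style_bound(c, h, w, h_out, w_out):
    """Lower bound on C' from the signal-processing perspective."""
    return (c * h * w) / (2 * h_out * w_out)


# (channels, height, width) per stage; illustrative values only.
stages = [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]

checks = []
for (c, h, w), (c_out, h_out, w_out) in zip(stages, stages[1:]):
    bound = nyquist_style_bound(c, h, w, h_out, w_out)
    checks.append((c_out, bound, c_out >= bound))
print(checks)
```

Each transition meets the bound with equality: halving both spatial dimensions divides $H\times W$ by four, and the factor $\frac{1}{2}$ leaves exactly a doubling of channels.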

Conflicts

The first two perspectives seem to give the same bound, but the third perspective differs from the other two. Which one should we take?

It is widely agreed that images are highly redundant, so a network with far fewer channels, even fewer than the bound, can still achieve competitive results. If we want to decide how many channels to adopt, the ultimate way seems to be measuring how redundant the images are in terms of information.

Personally, I take the information theory perspective.

¹ Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, Stella X. Yu. Orthogonal Convolutional Neural Networks. CVPR 2020.