Feature Purification: An Explanation for Adversarial Training

By LI Haoyang 2020.12.10

Content

- Feature Purification: How Adversarial Training Performs Robust Deep Learning - 2020
- The Principle of Feature Purification
- Why does clean training learn non-robust features? Which part of the features is “purified” during adversarial training?
- Computational Complexity
- Preliminaries
- Notations
- Sparse coding model
- Clean and robust error
- Warmup Intuitions
- Linear learners are not robust
- High-complexity models are more robust
- Learning robust classifier using neural network
- Learner Network and Adversarial Training
- Clean Training
- Adversarial Training
- Main Results
- Clean Training
- Feature Purification
- Adversarial Training
- ℓ∞ Robustness and Lower Bound for Low-Complexity Models
- Overview of the Training Process
- Lottery ticket winning
- Dense mixture
- More experimental results
- Inspirations

Feature Purification: How Adversarial Training Performs Robust Deep Learning - 2020

Zeyuan Allen-Zhu, Yuanzhi Li. Feature Purification: How Adversarial Training Performs Robust Deep Learning. arXiv preprint 2020. arXiv:2005.10190

In this paper, we present a principle that we call feature purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights.

They study the learning process of adversarial training and propose the idea of feature purification as a principle of robust deep learning, i.e. trying to answer the following questions:

Why do adversarial examples exist when we train the neural networks using the original training data set? How can adversarial training further “robustify” the trained neural networks against these adversarial attacks?

For a simplified standard model, they prove that:

  1. Training over the original data is indeed non-robust to small adversarial perturbations of some radius.
  2. Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against any perturbations of the same radius.

They also give a complexity lower bound, showing that a model under this complexity bound cannot yield robustness regardless of training algorithms.

A similar idea to feature purification was actually adopted as a defense in itself several years ago by PixelDefend, which was later broken and classified as a gradient-masking method.

This paper seems to indicate that adversarial training also utilizes feature purification, but in a better way.

The Principle of Feature Purification

The Principle of Feature Purification: During adversarial training, the neural network will neither learn new, robust features nor remove existing, non-robust features learned over the original data set. Most of the work of adversarial training is done by purifying a small part of each learned feature after clean training.

Using the following notations:

And the following provisional measure of correlation between "features":

i.e. the absolute value of cosine similarity.
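
A quick sketch of how I would measure this correlation in code (my own helper, not from the paper; `w_init`, `w_clean`, `w_adv` are random placeholders standing in for a neuron's weight at initialization, after clean training, and after adversarial training):

```python
import numpy as np

def feature_correlation(w1, w2):
    # Absolute cosine similarity between two weight vectors.
    return abs(np.dot(w1, w2)) / (np.linalg.norm(w1) * np.linalg.norm(w2) + 1e-12)

rng = np.random.default_rng(0)
w_init, w_clean, w_adv = rng.normal(size=(3, 100))
# The paper's claim, in these terms: feature_correlation(w_init, w_clean) is small
# (training moves far from initialization), while feature_correlation(w_clean, w_adv)
# is large (adversarial training only "purifies" each feature locally). The random
# placeholders above do not show this by themselves; plug in real trained weights.
print(feature_correlation(w_init, w_clean), feature_correlation(w_clean, w_adv))
```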

They have the following claims:

A small cosine similarity implies a large difference.

As a result, this principle implies that both clean training and adversarial training discover hidden weights fundamentally different from the prescribed initialization .

They propose that clean training already discovers a large portion of the robust features, while adversarial training further purifies them.

Well, the two weights both differ from initialization while staying relatively similar to each other, and this is taken as evidence that adversarial training purifies the weights... It's a little shaky.

Why does clean training learn non-robust features? Which part of the features is “purified” during adversarial training?

As we shall formally discuss in Section 6.2, training algorithms such as gradient descent will, at every step, add to the current parameters a direction that maximally correlates with the labeling function on average.

They refer to this part of the weights, which correlates with the average of the training data, as the dense mixture. Since under natural assumptions on the data, such as the sparse coding model (i.e. inputs come from sparse combinations of hidden dictionary words/vectors), such dense mixtures cannot have a high correlation with any individual clean example, the network can still generalize well on the original dataset.

However, we show that these portions of the features are extremely vulnerable to small, adversarial perturbations along the “dense mixture” directions. As a result, one of the main goals of adversarial training, as we show, is to purify the neurons by removing such dense mixtures.

Moreover, in our setting, these dense mixtures in the hidden weights of the network come from the sparse coding structure of the data and the gradient descent algorithm.

They state that the cleanly trained model tends to learn an average of the training data, namely the dense mixture, which leads to adversarial vulnerability.

Computational Complexity

We also prove a lower bound that, for the same sparse coding data model, even when the original data is linearly-separable, any linear classifier, any low-degree polynomial, or even the corresponding neural tangent kernel (NTK) of our studied two-layer neural network, cannot achieve meaningful robust accuracy (although they can easily achieve high clean accuracy).

The main intuition is that low-complexity models, including the neural tangent kernel, lack the power to zero out low-magnitude signals to improve model robustness, as illustrated in Figure 3 and Section 3.

Preliminaries

Notations

What is this ?

Sparse coding model

They consider the training data generated from a sparse coding model

$X = M z + \xi$

for a dictionary $M \in \mathbb{R}^{d \times d}$, where $z \in \mathbb{R}^{d}$ is the hidden vector and $\xi$ is the noise.

They focus on the case where the dictionary $M$ is a unitary matrix.

They assume the hidden vector $z$ is "sparse", i.e. it has roughly $k \ll d$ non-zero coordinates:

Assumption 2.1 (distribution of hidden vector ). The coordinates of are independent, symmetric random variables, such that . Moreover,

The first condition is a regularity condition, requiring that .

The second and third conditions say that there is a non-trivial probability that a coordinate of $z$ attains the maximum value, and a (much) larger probability that it is non-zero but has a small value.

???

Fact 2.2 Under Assumption 2.1, w.h.p., $z$ is a sparse vector.

They study the simplest binary classification problem, where the labeling function is linear over the hidden vector $z$, i.e.

For simplicity, they assume that all the coordinates of the labeling weight vector have relatively equal contributions.
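
A toy sampler for this data model as I read it, assuming the form $X = Mz + \xi$ with a unitary dictionary $M$, a $k$-sparse symmetric hidden vector $z$, Gaussian noise, and a linear labeling over $z$; the names and the $\pm 1$ non-zero values are my simplifications of Assumption 2.1:

```python
import numpy as np

def sample_sparse_coding(n, d, k, sigma, rng):
    # Toy sparse coding model: X = M z + xi with a unitary dictionary M,
    # a k-sparse hidden vector z with symmetric +-1 non-zeros, Gaussian noise xi,
    # and label y = sign(<w_star, z>) with equal-magnitude coordinates of w_star.
    M, _ = np.linalg.qr(rng.normal(size=(d, d)))      # random unitary dictionary
    Z = np.zeros((n, d))
    for i in range(n):
        support = rng.choice(d, size=k, replace=False)
        Z[i, support] = rng.choice([-1.0, 1.0], size=k)
    X = Z @ M.T + sigma * rng.normal(size=(n, d))
    w_star = np.ones(d)
    y = np.sign(Z @ w_star)
    y[y == 0] = 1.0                                   # break ties arbitrarily
    return X, y, Z, M, w_star

rng = np.random.default_rng(0)
X, y, Z, M, w_star = sample_sparse_coding(n=1000, d=100, k=5, sigma=0.1, rng=rng)
```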

Clean and robust error

The goal of clean training is to learn a model $f$ so that $f(X)$ is as close to the label $y$ as possible. The classification error on the original data set is defined as

$\Pr_{(X, y)}\big[\operatorname{sign}(f(X)) \neq y\big]$

Consider the robust error against adversarial perturbations. For a radius $\tau > 0$ and a norm $\|\cdot\|$, the robust error of the model is defined as

$\Pr_{(X, y)}\big[\exists \delta,\ \|\delta\| \leq \tau:\ \operatorname{sign}(f(X + \delta)) \neq y\big]$
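
In code, these two quantities look roughly like the following (my helpers; `f` is assumed to score a batch of inputs, `attack` stands for any concrete perturbation algorithm, and the attack-based number only lower-bounds the true robust error, which quantifies over all perturbations within the radius):

```python
import numpy as np

def clean_error(f, X, y):
    # Fraction of clean examples whose predicted sign disagrees with the label.
    return float(np.mean(np.sign(f(X)) != y))

def robust_error_against(f, X, y, attack, radius):
    # Fraction of examples misclassified after applying a concrete perturbation
    # algorithm of the given radius; a lower bound on the true robust error.
    delta = attack(f, X, y, radius)
    return float(np.mean(np.sign(f(X + delta)) != y))
```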

Warmup Intuitions

Linear learners are not robust

This is counter-intuitive...

A direct approach to classify the given dataset is to use a linear classifier as

and use the sign to predict the label. They point out two issues with this classifier:

  1. When is as large as , such a classifier cannot even achieve good accuracy, i.e.

    while typically and by assumption. Thus when , the signal is overwhelmed by the noise.

  2. Even when , i.e. the original data is perfectly linearly classifiable, the linear classifier is still not robust to small perturbations, i.e.

    one can design an adversarial perturbation for a large constant that can change the sign of the linear classifier for most inputs, given that .

    Well, I think this is essentially DeepFool: a perturbation orthogonal to the decision boundary.

    Since , this linear classifier is not even robust to adversarial perturbations of norm .

    It's obtained by taking the norm into the form of perturbation.

I think the first claim is weird here: the first issue states that when the noise is comparable to the signal, the classifier fails, which is not a problem of linearity as far as I can see. For data with that much noise, a complicated classifier also fails.
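
A toy numerical check of the second issue (my own construction on synthetic linearly separable data, not the paper's): a perturbation whose $\ell_2$ norm is much smaller than $\|X\|$, taken along the classifier direction, flips most predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 500
w = rng.normal(size=d); w /= np.linalg.norm(w)   # a unit-norm linear classifier
X = rng.normal(size=(n, d))                      # synthetic inputs, ||X_i|| ~ sqrt(d) = 10
y = np.sign(X @ w)                               # labels given by the classifier itself

margins = np.abs(X @ w)                          # typical margin is O(1)
eps = 1.1 * margins.mean()                       # perturbation budget ~ 0.9 << ||X_i||
X_adv = X - (y * eps)[:, None] * w               # push each point against its label

print("clean error:", np.mean(np.sign(X @ w) != y))                   # 0.0
print("error under perturbation:", np.mean(np.sign(X_adv @ w) != y))  # most points flip
```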

High-complexity models are more robust

Another approach is to use a higher-complexity model:

Since , as long as the signal is non-zero, with high probability.

By Fact 2.2, w.h.p. the signal is -sparse, i.e. there are at most coordinates whose indicators are 1; therefore this high-complexity model has robust accuracy against any adversarial perturbation of radius .

How ???
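
My attempt at the intuition, as a toy calculation (the identity dictionary and the threshold 0.5 are my choices, not the paper's construction): thresholding each dictionary coordinate zeroes out low-magnitude signals, so a perturbation that is spread densely over many coordinates gets filtered out, while the linear model sums it all up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 10
M = np.eye(d)                                 # identity dictionary for simplicity
z = np.zeros(d); z[rng.choice(d, k, replace=False)] = 1.0
w_star = np.ones(d)
X = M @ z

def low_complexity(x):
    # Linear model: sum_j w*_j <M_j, x>; no way to suppress small coordinates.
    return w_star @ (M.T @ x)

def high_complexity(x, thresh=0.5):
    # Zero out coordinates whose magnitude is below the threshold, then sum.
    c = M.T @ x
    return w_star @ (c * (np.abs(c) >= thresh))

delta = 0.03 * np.ones(d)                     # dense perturbation, ||delta||_2 ~ 0.95
print(low_complexity(X),  low_complexity(X + delta))   # 10.0 -> 40.0 (shifted by d * 0.03)
print(high_complexity(X), high_complexity(X + delta))  # 10.0 -> 10.3 (essentially unchanged)
```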

Learning robust classifier using neural network

Their goal is to show that a two-layer neural network can learn a robust function after adversarial training, such as

in which:

In this paper, we present a theorem stating that adversarial training of a (wlog. symmetric) two-layer neural network can indeed recover a neural network of this form.

In other words, after adversarial training, the features learned by the hidden layer of a neural network can indeed form a basis (namely, ) of the input where the coefficients are sparse.

It learns a set of basis vectors! Are they orthogonal? If so, then it connects back to Parseval Networks and Lipschitz regularization.

Learner Network and Adversarial Training

They consider a simple, two-layer symmetric neural network with ReLU activation, i.e.

where , and

They use to denote .

They fix throughout the training for simplicity.
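
A minimal sketch of such a learner, assuming the symmetric form $f(x) = \sum_i [\mathrm{ReLU}(s_i - b) - \mathrm{ReLU}(-s_i - b)]$ with $s_i = \langle w_i, x\rangle$ plus optional pre-activation (smoothing) noise and a fixed bias $b$; the paper's exact parameterization may differ:

```python
import numpy as np

def two_layer_symmetric(x, W, b, rho=0.0, rng=None):
    # Pre-activations s_i = <w_i, x> (+ optional smoothing noise of scale rho).
    s = W @ x
    if rng is not None and rho > 0:
        s = s + rho * rng.normal(size=s.shape)
    # Symmetric ReLU pair for each neuron, summed with fixed output weights (+1).
    return float(np.sum(np.maximum(s - b, 0.0) - np.maximum(-s - b, 0.0)))
```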

They use to denote the hidden weights at time , and to denote the network at iteration , i.e.

Given a training set together with one sample of pre-activation noise for each , they define:

The first one is the standard logistic loss, the second one is the population risk and the third one is the empirical risk.
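
For concreteness, a sketch of these objects in code (my phrasing; `f` stands for the network at the current iteration, and the population risk would replace the sample average by an expectation over the data distribution):

```python
import numpy as np

def logistic_loss(margin):
    # Standard logistic loss log(1 + exp(-y * f(x))) as a function of the margin y * f(x).
    return np.log1p(np.exp(-margin))

def empirical_risk(f, X, y):
    # Average logistic loss over the training set; the population risk replaces
    # this sample average by an expectation over the data distribution.
    margins = y * np.array([f(x) for x in X])
    return float(np.mean(logistic_loss(margins)))
```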

They consider a strong regularizer:

and a fixed :

for simplicity.

Definition 4.1 In our case, the (clean, population) classification error at iteration is

We also introduce a notation (the logistic gradient)

and observe

What is this ???

Clean Training

The details seem redundant, so they are omitted here.

Adversarial Training

Definition 4.2. An (adversarial) perturbation algorithm (a.k.a. attacker) maps the current network (which includes hidden weights , output weights , bias and smoothing parameter ), an input , a label , and some internal random string , to satisfying

for some norm. We say it is an $\ell_2$ perturbation algorithm of radius $\tau$ if the perturbation it produces always has $\ell_2$ norm at most $\tau$, and an $\ell_\infty$ perturbation algorithm of radius $\tau$ if the perturbation always has $\ell_\infty$ norm at most $\tau$.

For simplicity, they assume satisfies for fixed , either , or is a -Lipschitz continuous function in .

The FGM attack satisfies the above properties, defined as

where is the dual norm of .
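
A generic sketch of the fast gradient method in this sense (my version, not necessarily the paper's exact algorithm; `grad_fn(x, y)` is assumed to return the gradient of the loss with respect to the input, and the returned perturbation is the dual-norm maximizer scaled to the radius):

```python
import numpy as np

def fgm(grad_fn, x, y, radius, norm="l2"):
    # Single-step attack: maximize the first-order approximation of the loss
    # within the given radius by moving along the dual-norm maximizer of the
    # input gradient.
    g = grad_fn(x, y)
    if norm == "l2":        # dual of l2 is l2: normalized gradient direction
        return radius * g / (np.linalg.norm(g) + 1e-12)
    if norm == "linf":      # dual of l_inf is l1: sign of the gradient
        return radius * np.sign(g)
    raise ValueError(f"unknown norm: {norm}")
```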

Definition 4.3. The robust error at iteration , against arbitrary perturbation of radius , is

In contrast, the empirical robust classification error against algorithm is
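
Putting Definitions 4.2 and 4.3 together, the adversarial training phase is, schematically, the loop below (my sketch of the idea behind Algorithm 2, with plain weight decay standing in for the regularizer; `attack` is any perturbation algorithm in the sense of Definition 4.2, here taking the current weights, one input, its label, and the radius, and `grad_fn(W, X, y)` returns the gradient of the empirical loss in the weights):

```python
import numpy as np

def adversarial_training(W, X, y, attack, radius, grad_fn, lr=0.1, wd=1e-3, steps=100):
    # At every step: perturb each training input with the perturbation algorithm,
    # then take one gradient step on the loss evaluated at the perturbed inputs.
    for _ in range(steps):
        deltas = np.stack([attack(W, xi, yi, radius) for xi, yi in zip(X, y)])
        W = W - lr * (grad_fn(W, X + deltas, y) + wd * W)
    return W
```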

Main Results

Clean Training

Theorem 5.1 (clean training, sketched). There exist absolute constants such that for every constant , every and with , given many training data, for every random initialization weight , for every learning rate , if we define , then for every , the following holds with high probability:

The network with hidden weights learned by clean training Algorithm 1 satisfies:

  1. Global feature learning: for every ,

    and

    It states that the weights are different from initialization, and that the differences among the weights across layers are proportional to the scale of the weights (?)

  2. Clean training has good clean accuracy: for every ,

  3. Clean training is not robust to small adversarial perturbations: for every , every , using perturbation (which does not depend on ),

Theorem 5.1 indicates that in our setting, clean training of the neural network has good clean accuracy but terrible robust accuracy. Such terrible robust accuracy is not due to over-fitting, as it holds even when super-polynomially many iterations and infinitely many training examples are used to train the neural network.

Theorem 5.2 (clean training features, sketched). For every neuron , there is a fixed subset of size , such that, for every ,

where:

  1. for some small constant

  2. for at least many neurons , it satisfies and . Moreover,

???

Theorem 5.2 says that each neuron will learn constantly many (allegedly large) components in the directions , and its components in the remaining directions are all small.

Theorem 5.2 shows that clean training does not learn the pure, robust features . Intuitively, ignoring the small factors, focusing only on those neurons with , and assuming for simplicity that all the 's are of similar (positive) magnitude, clean training will instead learn neurons of the form:

It seems to say that in clean training the model learns the bias of the dataset. If so, this would also explain poor out-of-distribution (o.o.d.) performance.

Feature Purification

One critical observation is that such a dense mixture has low correlation with almost all inputs from the original distribution, so it has a negligible effect on clean accuracy.

However, such a dense mixture is extremely vulnerable to small but dense adversarial perturbations of the input along this direction , making the model non-robust.

As shown in Figure 5:

As we point out, such “dense adversarial perturbation” directions do not exist in the original data. Thus, one has to rely on adversarial training to remove dense mixtures to make the model robust.

They also explain the source of the dense mixture:

At a high level, in each iteration, the gradient will bias towards the direction that correlates with the labeling function ; and since in our model such directions should be , it is a dense mixture direction and will be accumulated across time.

It seems to indicate that the labeling function misguides the learning algorithm into learning the dense mixture.
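
A toy check of this point under the sparse coding model sketched earlier (my simulation): although every individual input is sparse in the dictionary basis, the direction $\mathbb{E}[yX]$ that the label correlates with on average is dense in that basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 100, 5, 20000
M, _ = np.linalg.qr(rng.normal(size=(d, d)))          # unitary dictionary
w_star = np.ones(d)

Z = np.zeros((n, d))
for i in range(n):
    S = rng.choice(d, k, replace=False)
    Z[i, S] = rng.choice([-1.0, 1.0], k)
X = Z @ M.T                                           # sparse-coding inputs (no noise)
y = np.sign(Z @ w_star); y[y == 0] = 1.0

g = (y[:, None] * X).mean(axis=0)                     # ~ E[y X], the direction gradient steps favor
coeffs = M.T @ g                                      # its coordinates in the dictionary basis
print("nonzeros per input in basis M:", k, "out of", d)
print("fraction of non-negligible coords in E[yX]:",
      np.mean(np.abs(coeffs) > 0.2 * np.abs(coeffs).max()))   # ~1.0: a dense mixture
```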

Adversarial Training

Theorem 5.3 (adversarial training, sketched). In the same setting as Theorem 5.1, suppose is an $\ell_2$ perturbation algorithm with radius . Suppose we run clean training Algorithm 1 for iterations followed by adversarial training Algorithm 2 for iterations.

The following holds with high probability:

  1. Empirical robust accuracy: for

  2. Provable robust accuracy: when is the fast gradient method (FGM), for

  3. Feature (local) purification: for every ,

    ???

More surprisingly, Theorem 5.3b says that when a good perturbation algorithm such as FGM is used, not only does the robustness generalize to unseen examples, it also generalizes to any worst-case perturbation algorithm.

As shown in Figure 6:

Our theorem shows one of the main goals of adversarial training is to remove dense mixtures to make the network more robust. Therefore, before adversarial training, the adversarial perturbations are dense in the basis of ; and after adversarial training, the adversarial perturbations ought to be more sparse and aligned with inputs from the original data set.

It says that adversarial training does not eliminate adversarial perturbations; instead, it spreads them out, making the required perturbations semantically meaningful and thereby violating imperceptibility.

ℓ∞ Robustness and Lower Bound for Low-Complexity Models

Theorem 5.4 (ℓ∞ adversarial training, sketched). Suppose . There exists a constant , such that, in the same setting as Theorem 5.3, except now for any , and is an ℓ∞-perturbation algorithm with radius . Then, the same Theorem 5.3 and Theorem 5.1 still hold and imply:

This gives a gap because can be made arbitrarily small.

???

Definition 5.5 The feature mapping of the neural tangent kernel for our two-layer network is

Therefore, given weights , the NTK (neural tangent kernel) function is given as

Without loss of generality, assume each .
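
One generic way to write such a feature map for a two-layer symmetric ReLU network like the one sketched earlier is the gradient of the output with respect to the hidden weights, evaluated at the (random) initialization $W_0$; the paper's Definition 5.5 may differ in details and normalization.

```python
import numpy as np

def ntk_features(x, W0, b):
    # Phi(x): for each neuron i, the input x gated by the ReLU activation pattern
    # at initialization, i.e. x * (1[<w_i, x> > b] + 1[<w_i, x> < -b]) for the
    # symmetric pair of ReLUs. Returns an (m, d) array.
    s = W0 @ x
    gate = (s > b).astype(float) + (s < -b).astype(float)
    return gate[:, None] * x[None, :]

def ntk_function(x, W0, b, coeffs):
    # A linear function over the frozen NTK features: <coeffs, Phi(x)>.
    return float(np.sum(coeffs * ntk_features(x, W0, b)))
```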

???

They prove the following:

Theorem 5.6 (lower bound). For every constant , suppose , then there is a constant such that when , considering -perturbation with radius , we have w.h.p. over the choice of , for every in the above definition, the robust error

And the following corollary:

Corollary 5.7 (lower bound). In the same setting as Theorem 5.6, if is a constant degree polynomial, then we have the robust error .

I cannot make much sense of this...

Overview of the Training Process

Lottery ticket winning

In other words, even with very mild over-parameterization , by the property of random Gaussian initialization, there will be some "potentially lucky neurons" in (ii), where the maximum correlation to one of the features is slightly higher than usual.

Lottery ticket winning process: For every , at every iteration , if , then will grow faster than for every , until becomes sufficiently larger than all the others.

In other words, if neuron wins the lottery ticket at random initialization, then eventually it will deviate from random initialization and grow into a feature that is closer to (a scaling of) .

Some neurons happen to resemble a feature the most at the beginning, and they grow into the neurons responsible for that feature.
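
A quick numeric look at why such lucky neurons exist at random Gaussian initialization (my toy check; the paper quantifies the gap precisely): with $m$ neurons, the maximal correlation with a fixed feature direction behaves like $\sqrt{2\log m}$, noticeably above the typical value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 2000
W0 = rng.normal(size=(m, d)) / np.sqrt(d)             # Gaussian initialization
M, _ = np.linalg.qr(rng.normal(size=(d, d)))          # feature directions M_j

corr = np.abs(W0 @ M) * np.sqrt(d)                    # |<w_i, M_j>|, rescaled to N(0,1) size
print("typical correlation:", np.median(corr))                 # ~0.67
print("luckiest neuron per feature:", corr.max(axis=0).mean()) # ~ sqrt(2 log m) ~ 4
```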

Dense mixture

We shall prove that in this phase, gradient descent will also accumulate, in each neuron, a small “dense mixture” that is extremely vulnerable to small but adversarial perturbations.

If a neuron i wins the lottery ticket for feature near random initialization, then it will keep this “lottery ticket” throughout the training.

But they show that even for the "lucky neuron" that wins the lottery ticket, the hidden weight of this neuron will look like:

i.e. the weight will be like , where is a "dense mixture":

Their key observation is that is small and dense, i.e. with high probability:

This dense mixture will not be correlated with any particular natural input, and thus the existence of these mixtures will have negligible contribution to the output of $f_t$ on clean data.

But it's possible to create a correlated perturbation along the dense direction , i.e.

Thus, at this phase, even when the network has a good clean accuracy, it is still non-robust to these small yet dense adversarial perturbations.
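
A toy illustration of this picture (my own numbers, not the paper's constants): a feature contaminated with a small dense mixture is barely affected by clean, sparse inputs, yet a small perturbation aligned with the mixture shifts its pre-activation by the same order as a genuine feature would.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 10
M, _ = np.linalg.qr(rng.normal(size=(d, d)))

pure = M[:, 0]                                            # a purified feature
mixture = (M @ rng.choice([-1.0, 1.0], d)) / np.sqrt(d)   # small, dense in the basis of M
dirty = pure + mixture                                    # feature + dense mixture

z = np.zeros(d); z[rng.choice(d, k, replace=False)] = 1.0
x = M @ z                                                 # a clean, k-sparse input
print("mixture vs clean input:", mixture @ x)             # ~ sqrt(k/d), negligible

delta = 0.5 * mixture / np.linalg.norm(mixture)           # perturbation of l2 norm 0.5
print("shift of the pure  neuron:", pure  @ delta)        # ~ 1/sqrt(d), negligible
print("shift of the dirty neuron:", dirty @ delta)        # ~ 0.5, same order as a present feature's signal (~1)
```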

Moreover, this perturbation direction is “universal”, in the sense that it does not depend on the randomness of the model at initialization, or the randomness we use during the training. This explains transfer attacks in practice: that is, the adversarial perturbation found in one model can also attack other models that are independently trained.

Since Eq. (6.2) suggests most original inputs have negligible correlations with each dense mixture, during clean training, gradient descent will have no incentive to remove those mixtures.

They point out that it's a property of gradient descent:

We emphasize that this is indeed a special property of gradient descent.

(Stochastic) gradient descent, as a local update algorithm, only examines the local correlation between the update direction and the labeling function, and it does not examine whether this direction can be used in the final result.

In fact, even if we use as initialization as opposed to random initialization, continuing clean training will still accumulate these small but dense mixtures.

Local learning accumulates noise?

More experimental results

Inspirations

They propose a principle named feature purification to describe what adversarial training does to cleanly trained models.

They point out that gradient descent accumulates, into the learned features, dense mixtures correlated with the labeling function. These mixtures are only weakly correlated with individual inputs, so they do not hurt clean accuracy, but perturbations along the dense-mixture direction lead to mistakes, making the model adversarially vulnerable.

Adversarial training, in contrast, is able to remove these dense mixtures, thus purifying the features learned by the model.

It seems that adversarial training does not actually eliminate adversarial examples; instead, it amplifies the perturbation magnitude required and spreads the perturbations in a semantically meaningful way.

It also seems to indicate that a better labeling function can improve adversarial robustness.