First-order Adversarial Vulnerability

By LI Haoyang 2020.11.16

Content

• First-order Adversarial Vulnerability - ICML 2019
• From Adversarial Examples to Large Gradients
• A new old regularizer (defenses inspired by Lemma 2)
• Link to adversarially augmented training
• Evaluate Adversarial Vulnerability
• One Neuron with Many Inputs
• Formal Statements for Deep Networks
• Empirical Results
• First-Order Approximation, Gradient Penalty, Adversarial Augmentation
• Vulnerability’s Dependence on Input Dimension
• Further Experiments
• Inspirations

First-order Adversarial Vulnerability - ICML 2019

Code: https://github.com/facebookresearch/AdversarialAndDimensionality

Carl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, David Lopez-Paz. First-order Adversarial Vulnerability of Neural Networks and Input Dimension. ICML 2019. arXiv:1802.01421

We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs.

Surprisingly, vulnerability does not depend on network topology.

We empirically show that this dimension dependence persists after either usual or robust training, but gets attenuated with higher regularization.

Their contributions:

• they show that adversarial vulnerability is governed by the input-gradients of the training objective;
• they relate gradient-norm penalties to adversarially augmented training;
• they show, theoretically and empirically, that vulnerability grows with the input dimension under standard initializations.

From Adversarial Examples to Large Gradients

An adversarial example is a small perturbation of the input that creates a large variation of the output.

Given a classifier $C$ and an input image $x$, an adversarial image $\tilde{x} = x + \delta$ is constructed by adding a perturbation $\delta$ such that $\|\delta\|$ is small and $C(\tilde{x}) \neq C(x)$.

Definition 1 Given a distribution $P$ over the input-space, we call adversarial vulnerability of a classifier $C$ to an $\epsilon$-sized $\|\cdot\|$-attack the probability that there exists a perturbation $\delta$ of $x \sim P$ such that

$$\|\delta\| \le \epsilon \quad\text{and}\quad C(x) \ne C(x + \delta).$$

The adversarial damage (of the classifier $C$ to an $\epsilon$-sized $\|\cdot\|$-attack) is defined as the average increase-after-attack $\mathbb{E}_x[\delta\mathcal{L}(x)]$ of a loss $\mathcal{L}$. When $\mathcal{L}$ is the 0-1 loss $\mathcal{L}_{0/1}$, adversarial damage is the accuracy-drop after attack, which lower bounds adversarial vulnerability.

The accuracy-drop rules out natural adversarial examples, thus lower bounds the adversarial vulnerability.

In practice, a smoother surrogate loss is used instead of the non-differentiable 0-1 loss. Hence, a classifier will be robust if, on average over $x$, a small adversarial perturbation $\delta$ of $x$ creates only a small variation $\delta\mathcal{L}$ of the loss. (A small loss variation can, however, also be a symptom of obfuscated gradients.)

If $\|\delta\| \le \epsilon$, then a first-order Taylor expansion in $\epsilon$ shows that

$$\delta\mathcal{L}(x) := \max_{\|\delta\|\le\epsilon} |\mathcal{L}(x+\delta) - \mathcal{L}(x)| \approx \max_{\|\delta\|\le\epsilon} |\partial_x\mathcal{L} \cdot \delta| = \epsilon\,\|\partial_x\mathcal{L}\|_d,$$

where $\partial_x\mathcal{L}$ denotes the gradient of $\mathcal{L}$ with respect to $x$ and $\|\cdot\|_d$ denotes the dual norm of $\|\cdot\|$.

The dual norm is defined as

$$\|z\|_d := \max_{\|\delta\| \le 1} z \cdot \delta,$$

which explains the last equality above.

The dual norm of $\ell_\infty$ is $\ell_1$, and more generally the dual norm of $\ell_p$ is $\ell_q$ such that $1/p + 1/q = 1$.
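
For instance, for an $\ell_\infty$-constrained perturbation the maximization can be carried out coordinate-wise, which makes the dual-norm equality concrete:

$$\max_{\|\delta\|_\infty\le\epsilon} \partial_x\mathcal{L}\cdot\delta = \sum_i \max_{|\delta_i|\le\epsilon} (\partial_x\mathcal{L})_i\,\delta_i = \epsilon\sum_i |(\partial_x\mathcal{L})_i| = \epsilon\,\|\partial_x\mathcal{L}\|_1,$$

attained at $\delta = \epsilon\,\mathrm{sign}(\partial_x\mathcal{L})$, which is exactly the FGSM perturbation.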

This leads to the following lemma.

Lemma 2 At first-order approximation in $\epsilon$, an $\epsilon$-sized adversarial attack generated with norm $\|\cdot\|$ increases the loss $\mathcal{L}$ at point $x$ by $\epsilon\,\|\partial_x\mathcal{L}\|_d$, where $\|\cdot\|_d$ is the dual norm of $\|\cdot\|$. In particular, an $\epsilon$-sized $\ell_p$-attack increases the loss by $\epsilon\,\|\partial_x\mathcal{L}\|_q$, where $1/p + 1/q = 1$ and $p, q \in [1, \infty]$.

Moreover, we will see that the first-order predictions closely match the experiments.

This lemma shows that adversarial vulnerability depends on three main factors: the norm $\|\cdot\|$ used by the attacker, the attack size $\epsilon$, and the input-gradient $\partial_x\mathcal{L}$, i.e. how the network propagates input variations to the loss.

For a given pixel-wise order of magnitude of the perturbations $\delta$, the $\ell_p$-norm of the perturbation will scale like $d^{1/p}$, which suggests that the threshold $\epsilon_p$ used with $\ell_p$-attacks can be written as

$$\epsilon_p = \epsilon_\infty\, d^{1/p},$$

in which $\epsilon_\infty$ denotes a dimension-independent constant.

This scaling also preserves the average signal-to-noise ratio $\|x\|_2 / \|\delta\|_2$.
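
For example, for CIFAR-10-sized inputs, $d = 32 \times 32 \times 3 = 3072$, so an $\ell_\infty$ budget of $\epsilon_\infty$ per pixel corresponds to

$$\epsilon_2 = \epsilon_\infty\sqrt{3072} \approx 55\,\epsilon_\infty \qquad\text{and}\qquad \epsilon_1 = 3072\,\epsilon_\infty,$$

which is the rescaling needed to compare attacks across norms (and across input resolutions) on an equal footing.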

A new old regularizer (defenses inspired by Lemma 2)

Lemma 2 shows that the loss of the network after an $\epsilon$-sized $\|\cdot\|$-attack is

$$\tilde{\mathcal{L}}_{\epsilon,\|\cdot\|}(x) := \mathcal{L}(x) + \epsilon\,\|\partial_x\mathcal{L}\|_d.$$

It is thus natural to take this loss-after-attack as a new training objective.

In the case of $\ell_2$-attacks (whose dual norm is again $\ell_2$), this new loss reduces to an old regularization scheme proposed by Drucker & LeCun (1991) called double backpropagation, which penalizes the $\ell_2$-norm of the input gradients.

A priori, we do not know what will happen for attacks generated with other norms; but our experiments suggest that training with one norm also protects against other attacks.
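
As a minimal sketch (not the official implementation linked above), this loss-after-attack objective can be written in PyTorch as a gradient penalty added to the cross-entropy; the `model`, the batch `(x, y)`, and the hyperparameters below are placeholders:

```python
import torch
import torch.nn.functional as F

def loss_after_attack(model, x, y, eps, q=1):
    """First-order robust objective  L(x) + eps * ||d_x L||_q.

    q = 1 is the dual of l_inf attacks; q = 2 is the dual of l_2 attacks.
    Assumes no cross-sample coupling in the model (e.g. no batch norm).
    """
    x = x.clone().requires_grad_(True)
    per_sample_loss = F.cross_entropy(model(x), y, reduction="none")
    # create_graph=True so the gradient penalty itself can be backpropagated through
    (grad_x,) = torch.autograd.grad(per_sample_loss.sum(), x, create_graph=True)
    grad_norm = grad_x.flatten(1).norm(p=q, dim=1).mean()
    return per_sample_loss.mean() + eps * grad_norm
```

For $q = 2$ this is essentially Drucker & LeCun's double backpropagation (up to squaring the norm); training simply minimizes this objective instead of the plain cross-entropy.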

Link to adversarially augmented training

When augmenting the training set with online-generated $\epsilon$-sized $\|\cdot\|$-attacks $x + \delta$, the training objective becomes

$$\hat{\mathcal{L}}_{\epsilon,\|\cdot\|}(x) := \tfrac{1}{2}\big(\mathcal{L}(x) + \mathcal{L}(x+\delta)\big),$$

referred to as adversarially augmented training by the authors (generally known simply as adversarial training).

They prove that this 'old-plus-post-attack' objective reduces to the loss-after-attack objective $\tilde{\mathcal{L}}$ under a first-order Taylor expansion.

Proposition 3 Up to first-order approximations in $\epsilon$,

$$\hat{\mathcal{L}}_{\epsilon,\|\cdot\|}(x) = \mathcal{L}(x) + \frac{\epsilon}{2}\,\|\partial_x\mathcal{L}\|_d = \tilde{\mathcal{L}}_{\epsilon/2,\|\cdot\|}(x).$$

Said differently, for small enough $\epsilon$, adversarially augmented training with $\epsilon$-sized $\|\cdot\|$-attacks amounts to penalizing the dual norm $\|\partial_x\mathcal{L}\|_d$ of $\partial_x\mathcal{L}$ with weight $\epsilon/2$. In particular, double-backpropagation corresponds to training with $\ell_2$-attacks, while FGSM-augmented training corresponds to an $\ell_1$-penalty on $\partial_x\mathcal{L}$.

This correspondence between training with perturbations and using a regularizer can be compared to Tikhonov regularization: Tikhonov regularization amounts to training with random noise (Bishop, 1995), while training with adversarial noise amounts to penalizing the dual norm $\|\partial_x\mathcal{L}\|_d$.
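
Proposition 3 is easy to check numerically on a small untrained network: generate an FGSM perturbation and compare the augmented loss $\frac{1}{2}(\mathcal{L}(x)+\mathcal{L}(x+\delta))$ with its first-order prediction $\mathcal{L}(x)+\frac{\epsilon}{2}\|\partial_x\mathcal{L}\|_1$. A rough sketch (the toy MLP, batch size, and $\epsilon$ are arbitrary choices, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, eps = 3072, 1e-3                      # input dimension and l_inf attack size
model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 10))

x = torch.randn(8, d, requires_grad=True)
y = torch.randint(0, 10, (8,))

loss = F.cross_entropy(model(x), y)
(g,) = torch.autograd.grad(loss, x)      # d_x L

# FGSM: the eps-sized l_inf perturbation maximizing the first-order loss increase
delta = eps * g.sign()
loss_adv = F.cross_entropy(model(x + delta), y)

augmented = 0.5 * (loss + loss_adv)              # adversarially augmented objective
predicted = loss + 0.5 * eps * g.abs().sum()     # L(x) + (eps/2) * ||d_x L||_1
print(float(augmented), float(predicted))        # nearly equal for small eps
```

The two numbers agree up to higher-order terms in $\epsilon$, and the gap widens as $\epsilon$ grows.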

Evaluate Adversarial Vulnerability

One Neuron with Many Inputs

Consider a single output neuron with weights $w$ acting on an input $x \in \mathbb{R}^d$, i.e. computing $w \cdot x$, so that $\partial_x(w \cdot x) = w$. Suppose for a moment that the coordinates of $w$ have typical magnitude $|w|$; then $\|w\|_q$ scales like $d^{1/q}\,|w|$, and consequently

$$\epsilon_p\,\|\partial_x(w\cdot x)\|_q = \epsilon_p\,\|w\|_q \propto \epsilon_\infty\, d^{1/p}\, d^{1/q}\,|w| = \epsilon_\infty\, d\,|w|.$$

The second and the last expressions are bridged by $\epsilon_p = \epsilon_\infty d^{1/p}$ and $1/p + 1/q = 1$.

It shows that to be robust against any type of $\ell_p$-attack at any input-dimension $d$, the average absolute value $|w|$ of the coefficients of $w$ must grow slower than $1/d$. (Is it true?)

The neural weights are usually initialized with a variance that is inversely proportional to the number of inputs per neuron.

Consider the case where one output neuron is connected to all $d$ input pixels. Given an initialization with variance $1/d$, the average absolute value $|w|$ grows like $1/\sqrt{d}$, thus the adversarial vulnerability $\epsilon_\infty\, d\, |w|$ increases like $\sqrt{d}$.

Intuitively, the above sounds reasonable, but a more detailed illustration makes it more convincing (see the numerical sketch below).

This toy example shows that the standard initialization scheme, which preserves the variance from layer to layer, causes the average coordinate-size of $\partial_x\mathcal{L}$ to grow like $1/\sqrt{d}$ instead of $1/d$. When an $\ell_\infty$-attack tweaks its $\epsilon$-sized input perturbations to align with the coordinate-signs of $\partial_x\mathcal{L}$, all coordinates of $\partial_x\mathcal{L}$ add up in absolute value, resulting in an output-perturbation that scales like $\epsilon\sqrt{d}$ and leaves the network increasingly vulnerable with growing input-dimension.
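
The $\sqrt{d}$ behaviour of this toy example is easy to verify numerically. A small sketch (all values below are arbitrary placeholders): draw a single neuron's weights with variance $1/d$ and measure the worst-case $\ell_\infty$ output perturbation $\epsilon\,\|w\|_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                           # per-pixel l_inf budget

for d in [64, 256, 1024, 4096, 16384]:
    w = rng.normal(0.0, 1.0 / np.sqrt(d), size=d)    # weight variance 1/d
    worst = eps * np.abs(w).sum()                    # max_{||delta||_inf <= eps} w . delta
    ref = eps * np.sqrt(2 / np.pi) * np.sqrt(d)      # expected value, grows like sqrt(d)
    print(f"d={d:6d}   eps*||w||_1 = {worst:6.3f}   ~ {ref:6.3f}")
```

The worst-case output perturbation tracks the $\epsilon\sqrt{2/\pi}\,\sqrt{d}$ reference, i.e. it grows like $\sqrt{d}$ rather than staying bounded.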

Formal Statements for Deep Networks

These statements are based on a set of hypotheses about the network at initialization: roughly, that the weights are i.i.d., zero-mean, with a variance inversely proportional to the in-degree (as in He-initialization), together with technical assumptions on the ReLU activation patterns (see the paper for the exact statements).

Not covering all cases, but reasonable.

Nevertheless, they do not hold after training. That is why all our statements in this section are to be understood as orders of magnitude that are very well satisfied at initialization, both in theory and in practice.

Theorem 4 (Vulnerability of Fully Connected Nets) Consider a succession of fully connected layers with ReLU activations which takes inputs $x$ of dimension $d$, satisfies the assumptions above, and outputs logits $f_k(x)$ that get fed to a final cross-entropy-loss layer $\mathcal{L}$. Then the coordinates of $\partial_x f_k$ grow like $1/\sqrt{d}$, and

$$\|\partial_x\mathcal{L}\|_q \propto d^{\frac{1}{q}-\frac{1}{2}} \quad\text{and}\quad \epsilon_p\,\|\partial_x\mathcal{L}\|_q \propto \sqrt{d}. \tag{7}$$

These networks are thus increasingly vulnerable to $\ell_p$-attacks (for any $p$) with growing input-dimension.

For a given path $p$, let the path-degree $\bar{d}_p$ be the multiset of encountered in-degrees along path $p$. For a fully connected network, this is the unordered sequence of layer-sizes preceding the last path-node, including the input-layer.

Choose a path of computation and collect each node's in-degree along it into a multiset.

Consider the multiset $\{\bar{d}_p\}_p$ of all path-degrees when $p$ varies among all paths from input $i$ to output $o$. The symmetry assumption is:

$(S)$ All input nodes $i$ have the same multiset of path-degrees $\{\bar{d}_p\}_p$ from $i$ to $o$.

Intuitively, this means that the statistics of degrees encountered along paths to the output are the same for all input nodes. This symmetry assumption is exactly satisfied by fully connected nets, almost satisfied by CNNs (up to boundary effects, which can be alleviated via periodic or mirror padding) and exactly satisfied by strided layers, if the layer-size is a multiple of the stride.

Theorem 5 (Vulnerability of Feedforward Nets). Consider any feedforward network with linear connections and ReLU activation functions. Assume the net satisfies the assumptions above and outputs logits $f_k(x)$ that get fed to the cross-entropy-loss $\mathcal{L}$. Then $\|\partial_x\mathcal{L}\|_2$ is independent of the input dimension $d$, and $\epsilon_2\,\|\partial_x\mathcal{L}\|_2 \propto \sqrt{d}$. Moreover, if the net satisfies the symmetry assumption $(S)$, then $\|\partial_x\mathcal{L}\|_q \propto d^{\frac{1}{q}-\frac{1}{2}}$ and $\epsilon_p\,\|\partial_x\mathcal{L}\|_q \propto \sqrt{d}$.

The main proof idea is that in the gradient norm computation, the He initialization exactly compensates the combinatorics of the number of paths in the network, so that this norm becomes independent of the network topology.

Corollary 6 (Vulnerability of CNNs) In any succession of convolution and dense layers, strided or not, with ReLU activations, that satisfies the assumptions above and outputs logits that get fed to the cross-entropy-loss $\mathcal{L}$, the gradients of the logit-coordinates scale like $1/\sqrt{d}$ and (7) (the result of Theorem 4) is satisfied. It is hence increasingly vulnerable with growing input-resolution to attacks generated with any $\ell_p$-norm.
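
A rough numerical check of this prediction at initialization (the toy CNN below is an arbitrary stand-in, not the architecture used in the paper): measure $\|\partial_x\mathcal{L}\|_1$ on random inputs at several resolutions and compare with the predicted $\sqrt{d}$ growth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_cnn(res):
    # two conv layers followed by a dense layer over the flattened feature map
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(16 * res * res, 10),
    )

for res in [16, 32, 64, 128]:
    model = make_cnn(res)                       # freshly initialized for each resolution
    x = torch.rand(8, 3, res, res, requires_grad=True)
    y = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(model(x), y)
    (g,) = torch.autograd.grad(loss, x)
    d = 3 * res * res
    n1 = float(g.abs().sum())
    print(f"d={d:6d}   ||d_x L||_1 = {n1:7.3f}   ||d_x L||_1 / sqrt(d) = {n1 / d**0.5:.4f}")
```

If the order-of-magnitude prediction holds, the last column stays roughly constant while the $\ell_1$-norm itself grows like $\sqrt{d}$.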

Although the principles of our analysis naturally extend to residual nets, they are not yet covered by our theorems.

Current weight initializations (He-, Glorot-, Xavier-initialization) are chosen to preserve the variance from layer to layer, which constrains their scaling to $1/\sqrt{\text{in-degree}}$.

Also note that rescaling all weights by a constant does not change the classification decisions, but it affects cross-entropy and therefore adversarial damage.

Empirical Results

First-Order Approximation, Gradient Penalty, Adversarial Augmentation

We train several CNNs with the same architecture to classify CIFAR-10 images (Krizhevsky, 2009). For each net, we use a specific training method with a specific regularization value $\epsilon$.

Note that our goal here is not to advocate one defense over another, but rather to check the validity of the Taylor expansion, and empirically verify that first order terms (i.e., gradients) suffice to explain much of the observed adversarial vulnerability.

As shown in Figure 1, they draw the following conclusions:

  1. Confirming first order expansion and large first-order vulnerability

    • the efficiency of the first-order defense against iterative (non-first-order) attacks (Fig.1&4a).
    • the striking similarity between the PGD curves (adversarial augmentation with iterative attacks) and the other adversarial training curves (one-step attacks/defenses).
    • the functional-like dependence between any approximation of adversarial vulnerability and the input-gradient norm (Fig. 4b), and its independence of the training method.
    • the excellent correspondence between the gradient regularization and adversarial augmentation curves (see next paragraph).
  2. Gradient regularization matches adversarial augmentation

  3. Confirming correspondence of norm-dependent thresholds

  4. Accuracy-vulnerability trade-off: confirming large first-order component of vulnerability

  5. The regularization-norm does not matter

Vulnerability’s Dependence on Input Dimension

Theorems 4-5 and Corollary 6 predict that the average $\ell_1$-norm of $\partial_x\mathcal{L}$ grows like the square root $\sqrt{d}$ of the input dimension, and therefore that adversarial vulnerability increases with $d$ (Lemma 2).

As shown in Figure 2 and Figure 13, they have the following conclusions:

  1. Gradients and vulnerability increase with $d$ (Figure 2 (a) and (b)).

  2. Accuracies are dimension-independent (Figure 2 (c)).

    This hints at defending by manipulating (e.g. reducing) the input dimension.

  3. PGD effectively recovers the original input dimension (Figure 2 (d)).

    Higher dimensions have a longer plateau to the right, because without regularization, vulnerability increases with input dimension.

    The curves overlap when moving to the left, meaning that the accuracy-vulnerability trade-offs achieved by PGD are essentially independent of the actual input dimension.

  4. PGD training outperforms down-sampling (Figure 13).

Further Experiments

As shown in Figure 8, they have the following conclusions:

  1. Non-equivalence of loss- and accuracy-damage. (Figure 8 a & c)

    The test error continues to decrease throughout training, while the cross-entropy on the test set increases from epoch ≈ 40 onwards.

    Intriguing.

  2. Early stopping dampens vulnerability

    Since cross-entropy overfits, early stopping effectively acts as a defense. (Figure 10)

As shown in Figure 12:

Gradient norms do not generalize well.

The discrepancy between train and test gradient norms increases over training (gradient norms decrease on the training data but increase on the test set).

Inspirations

This paper is very informative. It bridges gradient regularization with adversarial training and gives many fruitful insights.

I think subsequent defenses can cut in from the following perspectives: