Adversarial Defense with Ensemble

By LI Haoyang 2020.11.12 (branched)

Content

- Ensemble + Adversarial Training
  - Ensemble Adversarial Training - ICLR 2018
    - Adversarial Training
    - Ensemble Adversarial Training
    - R+FGSM
    - Experiments
    - Inspirations
- DVERGE - NIPS 2020
  - Method
  - Algorithm
  - Performance
  - Inspirations

Ensemble + Adversarial Training

Ensemble Adversarial Training - ICLR 2018

Paper: https://openreview.net/forum?id=rkZvSe-RZ

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel. Ensemble Adversarial Training: Attacks and Defenses. ICLR 2018. arXiv:1705.07204

We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss.

We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models.

In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks (Kurakin et al., 2017c).

Adversarial Training

The basic adversarial training objective is a variant of standard Empirical Risk Minimization (ERM), i.e.

$$\min_{h} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|x^{adv} - x\|_{\infty} \le \epsilon} L\big(h(x^{adv}), y\big) \Big]$$

Adversarial training has a natural interpretation in this context, where a given attack (see below) is used to approximate solutions to the inner maximization problem, and the outer minimization problem corresponds to training over these examples.

For the inner maximization problem, i.e. generating adversarial examples, they consider single-step methods such as FGSM and its least-likely-class variant Step-LL, as well as iterative variants (I-FGSM / Iter-LL).

Using a single-step attack (e.g. FGSM) to approximate the inner problem, the problem further becomes

$$\min_{h} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ L\big(h\big(x + \epsilon \cdot \operatorname{sign}(\nabla_{x} L(h(x), y))\big),\, y\big) \Big]$$

This alternative optimization problem admits at least two substantially different global minima: the model can become genuinely robust to the perturbations, or it can degenerate its loss surface near the data points (gradient masking) so that the single-step linearization, and hence the attack itself, becomes ineffective without any true robustness being gained.

Thus, the model does not simply learn to resist the particular attack used during training; it can instead learn to make that attack generate weaker perturbations overall.

This phenomenon relates to the notion of Reward Hacking (Amodei et al., 2016), wherein an agent maximizes its formal objective function via unintended behavior that fails to capture the designer's true intent.
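As a rough illustration of the single-step adversarial training objective above, a training step might look like the following PyTorch sketch (the model, optimizer, and ε are placeholders; the paper's actual setup follows Kurakin et al. and may also mix clean examples into each batch):

```python
import torch
import torch.nn.functional as F

def fgsm_adv_training_step(model, optimizer, x, y, eps=8 / 255):
    """One adversarial training step where the inner maximization is
    approximated by a single FGSM step (signed input gradient)."""
    # Inner maximization: linearize the loss around x and step to the
    # corner of the L-inf ball.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()

    # Outer minimization: update the model on the crafted examples.
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()
```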

Ensemble Adversarial Training

Our method, which we call Ensemble Adversarial Training, augments a model’s training data with adversarial examples crafted on other static pre-trained models.

In Domain Adaptation, a model trained on data sampled from one or more source distributions is evaluated on samples from a different target distribution.

Let $\mathcal{A}$ denote an adversarial distribution: sample $(x, y)$ from $\mathcal{D}$, compute an adversarial example $x^{adv}$ for the target model, and output the pair $(x^{adv}, y)$.

In Ensemble Adversarial Training, the source distributions are $\mathcal{D}$ and the distributions $\mathcal{A}_1, \dots, \mathcal{A}_k$ induced by the static adversaries. The target distribution takes the form of an unseen black-box adversary $\mathcal{A}^*$.

Standard generalization bounds for Domain Adaptation (Mansour et al., 2009; Zhang et al., 2012) yield the following result.

Theorem 1 (informal) Let $h$ be a model learned with Ensemble Adversarial Training and static black-box adversaries $\mathcal{A}_1, \dots, \mathcal{A}_k$. If $h$ is robust against the black-box adversaries used at training time, then $h$ has bounded error on attacks from a future black-box adversary $\mathcal{A}^*$, provided $\mathcal{A}^*$ is not "much stronger", on average, than the static adversaries $\mathcal{A}_1, \dots, \mathcal{A}_k$.

In other words, Ensemble Adversarial Training on a set of adversaries can improve robustness against unseen attacks that are, on average, no stronger than the training-time adversaries.

Assume the model is trained on $N$ data points in total, sampled evenly from the distributions $\mathcal{A}_i$, for $1 \le i \le k$.

Denote the resulting empirical (training) distribution by $\hat{\mathcal{A}}$. At test time, the model is evaluated on adversarial examples from $\mathcal{A}^*$.

For a model $h$, they define the empirical risk over the training-time adversaries as

$$\hat{R}(h) = \frac{1}{N} \sum_{(x^{adv}, y) \in \hat{\mathcal{A}}} L\big(h(x^{adv}), y\big),$$

and the risk over the target distribution (i.e. the future adversary) as

$$R^*(h) = \mathbb{E}_{(x^{adv}, y) \sim \mathcal{A}^*} \big[ L\big(h(x^{adv}), y\big) \big].$$

The average discrepancy distance between the distributions $\mathcal{A}_1, \dots, \mathcal{A}_k$ and $\mathcal{A}^*$ with respect to a hypothesis space $\mathcal{H}$ is defined as

$$\operatorname{disc}_{\mathcal{H}}\big(\{\mathcal{A}_i\}, \mathcal{A}^*\big) = \frac{1}{k} \sum_{i=1}^{k} \sup_{h, h' \in \mathcal{H}} \Big| \Pr_{\mathcal{A}_i}\big[h(x^{adv}) = h'(x^{adv})\big] - \Pr_{\mathcal{A}^*}\big[h(x^{adv}) = h'(x^{adv})\big] \Big|.$$

That is, for a fixed pair of hypotheses, one measures the difference in their expected agreement under the two distributions; the discrepancy distance takes the supremum of this difference over all pairs in the hypothesis space, and the average discrepancy distance averages it over the $k$ training-time distributions.

This quantity characterizes how “different” the future adversary is from the train-time adversaries.
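As a rough Monte-Carlo illustration of this quantity (the models, adversarial batches, and the restriction to a finite set of hypotheses are all simplifying assumptions):

```python
import torch

@torch.no_grad()
def avg_discrepancy(hypotheses, train_adv_batches, future_adv_batch):
    """Estimate the average discrepancy distance over a finite set of
    hypotheses: for each static adversary A_i, take the largest gap in
    pairwise prediction agreement between A_i's samples and A*'s samples."""
    def agreement(h1, h2, x):
        # Fraction of inputs on which the two hypotheses predict the same label.
        return (h1(x).argmax(dim=1) == h2(x).argmax(dim=1)).float().mean().item()

    total = 0.0
    for x_i in train_adv_batches:          # one batch per static adversary A_i
        worst = 0.0
        for a in range(len(hypotheses)):
            for b in range(a + 1, len(hypotheses)):
                gap = abs(agreement(hypotheses[a], hypotheses[b], x_i)
                          - agreement(hypotheses[a], hypotheses[b], future_adv_batch))
                worst = max(worst, gap)
        total += worst
    return total / len(train_adv_batches)
```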

Let $\bar{\mathfrak{R}}_N(\mathcal{H})$ be the average Rademacher complexity of $\mathcal{H}$ over the distributions $\mathcal{A}_1, \dots, \mathcal{A}_k$. The following theorem is a corollary of Zhang et al. (2012, Theorem 5.2):

Theorem 5 Assume that $\mathcal{H}$ is a function class consisting of bounded functions. Then, with probability at least $1 - \delta$,

$$R^*(h) \;\le\; \hat{R}(h) + \operatorname{disc}_{\mathcal{H}}\big(\{\mathcal{A}_i\}, \mathcal{A}^*\big) + 2\,\bar{\mathfrak{R}}_N(\mathcal{H}) + O\!\Big(\sqrt{\tfrac{\ln(1/\delta)}{N}}\Big).$$

In our context, this means that the model learned by Ensemble Adversarial Training has guaranteed generalization bounds with respect to future adversaries that are not “too different” from the ones used during training.

R+FGSM

v3adv is short for the Inception v3 model adversarially trained against a Step-LL attack with $\epsilon = 16/256$ (Kurakin et al., 2017).

They visualize the "gradient masking" effect by plotting the loss of v3adv on examples $x^* = x + \epsilon_1 \cdot g + \epsilon_2 \cdot g^{\perp}$, where $g$ is the signed gradient of model v3adv at $x$ and $g^{\perp}$ is a signed vector orthogonal to $g$.

Figure 1 shows that the loss is highly curved in the vicinity of the data point x, and that the gradient poorly reflects the global loss landscape.
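A sketch of how such a loss surface can be probed for a single example (the grid range, resolution, and the way $g^{\perp}$ is drawn are illustrative choices):

```python
import torch
import torch.nn.functional as F

def loss_surface(model, x, y, eps_max=16 / 255, steps=21):
    """Evaluate the loss on x + eps1 * g + eps2 * g_perp for a grid of
    (eps1, eps2), where g is the signed gradient at a single image x
    (with scalar label y) and g_perp is a signed direction
    (approximately) orthogonal to g."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    g = torch.autograd.grad(loss, x)[0].sign()

    # Random direction, orthogonalized against g, then signed.
    r = torch.randn_like(g)
    g_perp = (r - (r * g).sum() / (g * g).sum() * g).sign()

    eps = torch.linspace(-eps_max, eps_max, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, e1 in enumerate(eps):
            for j, e2 in enumerate(eps):
                x_pert = (x + e1 * g + e2 * g_perp).clamp(0, 1).unsqueeze(0)
                surface[i, j] = F.cross_entropy(model(x_pert), y.unsqueeze(0))
    return surface  # e.g. visualize with matplotlib's plot_surface
```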

Table 1 shows error rates for single-step attacks transferred between models. We compute perturbations on one model (the source) and transfer them to all others (the targets). When the source and target are the same, the attack is white-box.

Adversarial training greatly increases robustness to white-box single-step attacks, but incurs a higher error rate in a black-box setting. Thus, the robustness gain observed when evaluating defended models in isolation is misleading.

They suggest prepending single-step attacks with a small random step to "escape" the non-smooth vicinity of the data point before linearizing the model's loss, named R+FGSM (resp. R+Step-LL), i.e.

$$x^{adv} = x' + (\epsilon - \alpha) \cdot \operatorname{sign}\big(\nabla_{x'} L(h(x'), y)\big), \quad \text{where } x' = x + \alpha \cdot \operatorname{sign}\big(\mathcal{N}(\mathbf{0}, \mathbf{I})\big).$$

This method was later re-proposed as FGSM with a random starting point.
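A minimal sketch of R+FGSM under this definition (the ε and α values are illustrative; the model and data handling are placeholders):

```python
import torch
import torch.nn.functional as F

def r_fgsm(model, x, y, eps=16 / 255, alpha=8 / 255):
    """R+FGSM: take a small random signed step first, then a single
    gradient-sign step of size (eps - alpha) from the new point x'."""
    # Random step to escape the non-smooth vicinity of the data point.
    x_prime = (x + alpha * torch.randn_like(x).sign()).clamp(0, 1)
    x_prime = x_prime.detach().requires_grad_(True)

    # Linearize the loss at x' and take the remaining step.
    loss = F.cross_entropy(model(x_prime), y)
    grad = torch.autograd.grad(loss, x_prime)[0]
    return (x_prime + (eps - alpha) * grad.sign()).clamp(0, 1).detach()
```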

As shown in Table 2, the extra random step yields a stronger attack for all models, even those without adversarial training.

Surprisingly, we find that for the adversarially trained Inception v3 model, the R+Step-LL attack is stronger than the two-step Iter-LL attack.

Experiments

We use the Step-LL, R+Step-LL, FGSM, I-FGSM and the PGD attack from Madry et al. (2017) using the hinge-loss function from Carlini & Wagner (2017a).

Convergence of Ensemble Adversarial Training is slower than for standard adversarial training, a result of training on “hard” adversarial examples and lowering the batch size.

For both architectures, the models trained with Ensemble Adversarial Training are slightly less accurate on clean data, compared to standard adversarial training.

Ensemble Adversarial Training is not robust to white-box Iter-LL and R+Step-LL samples: the error rates are similar to those for the v3adv model, and omitted for brevity (see Kurakin et al. (2017b) for Iter-LL attacks and Table 2 for R+Step-LL attacks).

Ensemble Adversarial Training significantly boosts robustness to the attacks we transfer from the holdout models.

Inspirations

This paper is somewhat dated, so it does not offer much new inspiration at this point.

DVERGE - NIPS 2020

Huanrui Yang, Jingyang Zhang, Hongliang Dong, Nathan Inkawhich, Andrew Gardner, Andrew Touchet, Wesley Wilkes, Heath Berry, Hai Li. DVERGE: Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles. NIPS 2020. arXiv:2009.14720

It's an ensemble defense based on adversarial training.

We propose DVERGE, which isolates the adversarial vulnerability in each sub-model by distilling non-robust features, and diversifies the adversarial vulnerability to induce diverse outputs against a transfer attack.

DVERGE is short for Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles.

The decision region of the resulting ensemble appears to be more robust.

Method

They start by isolating the vulnerability of CNN models via distilled non-robust features, defined (following Madry et al.'s line of work on non-robust features) as:

$$x'_{f_l}(x, x_s) = \arg\min_{z} \big\| f_l(z) - f_l(x) \big\|_2^2, \quad \text{s.t. } \|z - x_s\|_{\infty} \le \epsilon$$

Here, $(x, y)$ is a target input pair, $(x_s, y_s)$ is another randomly chosen source pair, and $f_l(\cdot)$ denotes the output of model $f$ before the activation of the $l$-th layer. The distilled feature $x'_{f_l}(x, x_s)$ is designed to be visually similar to $x_s$ while being classified as $y$ (it shares features with $x$), which is expected to reflect the adversarial vulnerability of $f$ when classifying $x$.
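A rough PGD-style sketch of this distillation (the `feature_fn` callable, step size, and iteration count are assumptions; the paper's exact optimizer may differ):

```python
import torch

def distill_feature(feature_fn, x, x_source, eps=8 / 255, step=2 / 255, iters=10):
    """Construct z near x_source (in the L-inf sense) whose layer-l features,
    as computed by feature_fn, match those of the target x."""
    target_feat = feature_fn(x).detach()
    z = x_source.clone().detach()
    for _ in range(iters):
        z.requires_grad_(True)
        # L2 distance between the candidate's and the target's features.
        loss = (feature_fn(z) - target_feat).pow(2).sum()
        grad = torch.autograd.grad(loss, z)[0]
        # Signed gradient descent, projected back into the L-inf ball.
        z = z.detach() - step * grad.sign()
        z = torch.min(torch.max(z, x_source - eps), x_source + eps).clamp(0, 1)
    return z.detach()
```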

Then they define the vulnerability diversity between two models $f_i$ and $f_j$ as:

$$\operatorname{div}(f_i, f_j) = \mathbb{E}\Big[ L_{f_i}\big(x'_{(f_j)_l}(x, x_s),\, y\big) + L_{f_j}\big(x'_{(f_i)_l}(x, x_s),\, y\big) \Big]$$

Here, $L_{f}(x, y)$ denotes the cross-entropy loss of model $f$ for an input pair $(x, y)$. The expectation is taken over independent, uniformly random choices of the pairs $(x, y)$ and $(x_s, y_s)$ and of the layer $l$ in models $f_i$ and $f_j$.

In short, $x'_{(f_j)_l}(x, x_s)$ is "misfeatured" as class $y$ (while remaining visually similar to $x_s$) by layer $l$ of model $f_j$; feeding it to model $f_i$ and computing the loss of the output with respect to $y$ measures how well this vulnerability transfers from $f_j$ to $f_i$, and vice versa.
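Using the hypothetical `distill_feature` sketch above, the diversity between two sub-models could be estimated on a single pair as follows (the per-layer feature extractors `feat_i` and `feat_j` are assumed):

```python
import torch.nn.functional as F

def vulnerability_diversity(model_i, feat_i, model_j, feat_j, x, y, x_s):
    """Estimate div(f_i, f_j) for one (x, y), (x_s, y_s) pair and one fixed
    layer per model: each model is evaluated, with respect to the original
    label y, on the example distilled from the *other* model's features."""
    x_from_j = distill_feature(feat_j, x, x_s)  # carries f_j's view of x
    x_from_i = distill_feature(feat_i, x, x_s)  # carries f_i's view of x
    return (F.cross_entropy(model_i(x_from_j), y)
            + F.cross_entropy(model_j(x_from_i), y))
```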

For each sub-model $f_i$, the diversity objective is added to the training process (with a trade-off coefficient $\alpha$):

$$\min_{f_i} \; \mathbb{E}\big[ L_{f_i}(x, y) \big] - \alpha \sum_{j \ne i} \operatorname{div}(f_i, f_j)$$

For better convergence, the equation above is re-formulated as:

$$\min_{f_i} \; \mathbb{E}\Big[ L_{f_i}(x, y) + \sum_{j \ne i} L_{f_i}\big(x'_{(f_j)_l}(x, x_s),\, y_s\big) \Big]$$

The additional objective now directly encourages the model to correctly classify the examples misfeatured by the other sub-models. It also facilitates the correct classification of $x_s$, since the distilled example is visually close to it.

And since the literature finds that it is not necessary to include the clean data, the problem is further simplified to:

$$\min_{f_i} \; \mathbb{E}\Big[ \sum_{j \ne i} L_{f_i}\big(x'_{(f_j)_l}(x, x_s),\, y_s\big) \Big]$$

This amounts to training each sub-model on the (distilled) adversarial examples generated from the other sub-models.

It can be combined with adversarial training:

$$\min_{f_i} \; \mathbb{E}\Big[ \sum_{j \ne i} L_{f_i}\big(x'_{(f_j)_l}(x, x_s),\, y_s\big) + \max_{\|\delta\|_{\infty} \le \epsilon} L_{f_i}(x + \delta,\, y) \Big]$$

Algorithm

The strategy is self-explanatory. After the formulation above, the final procedure amounts to training each sub-model, in a round-robin fashion, on distilled examples targeted at a randomly chosen layer of the other sub-models.
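A simplified sketch of one training round under this procedure (the sub-model list, per-layer feature extractors, loaders, and hyper-parameters are placeholders, and it again reuses the hypothetical `distill_feature` above; DVERGE + adversarial training would add an adversarial loss term where indicated):

```python
import random
import torch.nn.functional as F

def dverge_epoch(sub_models, feature_fns, optimizers, loader, num_layers):
    """Round-robin DVERGE training: each sub-model f_i is trained on examples
    distilled from a randomly chosen layer of the other sub-models and
    labeled with the source image's class y_s."""
    pair_loader = zip(loader, loader)  # two independent passes -> random pairs
    for (x, y), (x_s, y_s) in pair_loader:
        layer = random.randrange(num_layers)
        # Distilled examples of every sub-model for this batch and layer.
        distilled = [distill_feature(lambda z, f=f: f(z, layer), x, x_s)
                     for f in feature_fns]
        for i, (f_i, opt) in enumerate(zip(sub_models, optimizers)):
            # Train f_i to classify the *other* models' distilled examples as y_s.
            loss = sum(F.cross_entropy(f_i(distilled[j]), y_s)
                       for j in range(len(sub_models)) if j != i)
            # (For DVERGE + AT, add an adversarial loss on (x, y) here.)
            opt.zero_grad()
            loss.backward()
            opt.step()
```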

They average the output probabilities after the softmax layer of each sub-model to obtain the final prediction of the ensemble.
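For completeness, the averaged-softmax prediction might look like this minimal sketch (the sub-model list is a placeholder):

```python
import torch

@torch.no_grad()
def ensemble_predict(sub_models, x):
    """Average post-softmax probabilities over sub-models, then take argmax."""
    probs = torch.stack([m(x).softmax(dim=1) for m in sub_models]).mean(dim=0)
    return probs.argmax(dim=1)
```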

Performance

They use ResNet-20 as the architecture and CIFAR-10 as the dataset for evaluation. As expected, DVERGE reduces the transferability of adversarial examples among the sub-models.

Although DVERGE achieves the highest robustness among ensemble methods, its robustness against white-box attacks and transfer attacks with a large perturbation strength is still quite low. This result is expected because the objective of DVERGE is to diversify the adversarial vulnerability rather than completely eliminate it.

By combining DVERGE and adversarial training, the robustness is further increased.

Inspirations

It is remarkable that such a simple training routine (training each sub-model on the other sub-models' adversarial examples) emerges from the goal of maximizing the diversity between sub-models.

However, as Figure 5 shows, DVERGE alone does not help overall robustness as much as adversarial training does.