By LI Haoyang 2020.11.6
Adversarial Defense by Regularization

Content

- Lipschitz Regularization
  - Parseval Networks - ICML 2017
    - Robustness in neural networks
    - Generalization with adversarial examples
    - Lipschitz constant of neural networks
    - Parseval networks
    - Parseval training
    - Experiments
    - Inspirations
- Linear Regularization
  - Adversarial robustness through local linearization - NIPS 2019
    - Adversarial training
    - Local linearity measure
    - Local linear regularizer (LLR)
    - Experiment
- Logit Regularization
  - Improved Adversarial Robustness via Logit Regularization Methods - 2019
    - Adversarial Logit Pairing and Logit Regularization
    - Label Smoothing
    - Decoupling Adversarial Logit Pairing
    - Experiments
    - Inspirations
- Jacobian Regularization
  - JARN (Jacobian Regularization) - ICLR 2020
    - Jacobian Adversarially Regularized Networks
    - Theoretical analysis of JARN
    - Experiments
    - Inspirations
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. ICML 2017. arXiv:1704.08847
We introduce Parseval networks, a layerwise regularization method for reducing the network’s sensitivity to small perturbations by carefully controlling its global Lipschitz constant.
Our main idea is to control this norm by parameterizing the network with Parseval tight frames (Kovačević & Chebira, 2008), a generalization of orthogonal matrices.
They consider a multiclass prediction setting.
The classifier $\hat{f}$ is a function that maps data $x$ to its correct label $y$, parameterized by weights $W$: $\hat{f}(x) = \arg\max_{y \in \mathcal{Y}} f(x, y; W)$.
The score $f(x, y; W)$ is given to the input-class pair $(x, y)$ by the function $f$, which is assumed to be a neural network, represented by a computation graph $G$.
Each node $n$ of $G$ takes values in $\mathbb{R}^{d_n}$ and computes a function $\phi^{(n)}$ of its children in the graph, with learnable parameters $W^{(n)}$:

$$x^{(n)} = \phi^{(n)}\big(W^{(n)}, (x^{(\tilde{n})})_{\tilde{n} \in \mathrm{children}(n)}\big)$$
The learned function $f$ is the value at the root of $G$. The training data is an i.i.d. sample $\{(x_i, y_i)\}_{i=1}^{m}$ drawn from a distribution $D$, and they assume the input space $\mathcal{X}$ is compact.
A function $\ell$ measures the loss of $f$ on an example $(x, y)$; in a single-label classification setting, a common choice for $\ell$ is the log-loss:

$$\ell\big(f(x, \cdot; W), y\big) = -f(x, y; W) + \log\Big(\sum_{\tilde{y} \in \mathcal{Y}} e^{f(x, \tilde{y}; W)}\Big)$$
Their argument is based on the Lipschitz constant of the loss. Given a $p$-norm of interest $\|\cdot\|_p$, there is a constant $\lambda_p$ such that, for all $x, \tilde{x}$ and $y$,

$$\big|\ell(f(x, \cdot; W), y) - \ell(f(\tilde{x}, \cdot; W), y)\big| \le \lambda_p\, \big\|f(x, \cdot; W) - f(\tilde{x}, \cdot; W)\big\|_p$$

For the log-loss, such a constant exists; in particular it is $2$-Lipschitz with respect to the $\ell_\infty$ norm of the score vector, so one may take $\lambda_\infty = 2$.
Given the network parameters $W$, the structure $G$, and a $p$-norm, the adversarial example for an input $x$ is formally defined as

$$\tilde{x} = \arg\max_{\tilde{x}:\, \|\tilde{x} - x\|_p \le \epsilon} \ell\big(f(\tilde{x}, \cdot; W), y\big)$$

In which, $\epsilon$ is the strength of the adversary. It is proposed to approximate it by linearizing the loss and solving

$$\arg\max_{\delta:\, \|\delta\|_p \le \epsilon} \delta^\top \nabla_x\, \ell\big(f(x, \cdot; W), y\big)$$

If $p = \infty$, this reduces to the fast gradient sign method:

$$\tilde{x} = x + \epsilon\, \mathrm{sign}\big(\nabla_x\, \ell(f(x, \cdot; W), y)\big)$$

If $p = 2$, it becomes

$$\tilde{x} = x + \epsilon\, \frac{\nabla_x\, \ell(f(x, \cdot; W), y)}{\|\nabla_x\, \ell(f(x, \cdot; W), y)\|_2}$$

Based on this, a more involved method is the iterative fast gradient sign method.
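As a quick illustration (my own sketch, not the papers' code), these one-step attacks can be written in PyTorch as follows, assuming a classifier `model`, cross-entropy loss and NCHW image batches:

```python
import torch
import torch.nn.functional as F

def one_step_attack(model, x, y, eps, p="inf"):
    """Single-step attack from the linearized inner maximization (illustrative helper)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    if p == "inf":                        # fast gradient sign method
        delta = eps * grad.sign()
    else:                                 # l2 variant: step along the normalized gradient
        norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = eps * grad / norm
    return (x + delta).detach()
```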
With and without adversarial examples, there are two kinds of generalization errors, the standard one and the adversarial one:

$$L(W) = \mathbb{E}_{(x, y) \sim D}\big[\ell(f(x, \cdot; W), y)\big], \qquad L_{adv}(W, p, \epsilon) = \mathbb{E}_{(x, y) \sim D}\Big[\max_{\|\tilde{x} - x\|_p \le \epsilon} \ell(f(\tilde{x}, \cdot; W), y)\Big]$$

By definition, $L_{adv}(W, p, \epsilon) \ge L(W)$.
For the Lipschitz constants $\lambda_p$ and $\Lambda_p$ of the loss $\ell$ and of the network $f$ respectively, there is

$$L_{adv}(W, p, \epsilon) \le L(W) + \lambda_p\, \Lambda_p\, \epsilon$$
This suggests that the gap between the adversarial and the standard generalization error is controlled both by the Lipschitz constant of the loss function and by that of the network itself.
This suggests that the sensitivity to adversarial examples can be controlled by the Lipschitz constant of the network.
In the robustness framework of (Xu & Mannor, 2012), the Lipschitz constant also controls the difference between the average loss on the training set and the generalization performance.
For each node $n$ in the neural network, seen as a function $\phi^{(n)}(x; W)$ of the network input $x$, the Lipschitz constant is defined as

$$\Lambda_p^{(n)} = \sup_{x \ne \tilde{x}} \frac{\big\|\phi^{(n)}(x; W) - \phi^{(n)}(\tilde{x}; W)\big\|_p}{\|x - \tilde{x}\|_p}$$

The Lipschitz constant of a node, and hence that of the whole network $f$ (denoted by $\Lambda_p$), satisfies the recursive bound

$$\Lambda_p^{(n)} \le \sum_{\tilde{n} \in \mathrm{children}(n)} \Lambda_p\big(\phi^{(n)}, \tilde{n}\big)\, \Lambda_p^{(\tilde{n})}$$

where $\Lambda_p(\phi^{(n)}, \tilde{n})$ is the Lipschitz constant of $\phi^{(n)}$ with respect to the input coming from child $\tilde{n}$.
Thus, the Lipschitz constant of the network can grow exponentially with its depth.
Linear layers
For a single linear layer $\phi(x) = W x + b$, the Lipschitz constant for the $p$-norm is the induced matrix norm of $W$:

$$\Lambda_p = \|W\|_p = \sup_{x \ne 0} \frac{\|W x\|_p}{\|x\|_p}$$

For $p = 2$, there is $\|W\|_2 = \sigma_{\max}(W)$, the spectral norm of $W$, i.e. the maximum singular value of $W$.
For $p = \infty$, there is $\|W\|_\infty = \max_i \sum_j |W_{ij}|$, i.e. the maximum $\ell_1$ norm over the rows of $W$.
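These operator norms are easy to check numerically; a small NumPy sketch of my own:

```python
import numpy as np

W = np.random.randn(64, 128)

spectral_norm = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value = ||W||_2
inf_norm = np.abs(W).sum(axis=1).max()                  # max row-wise l1 norm  = ||W||_inf

# both coincide with the induced operator norms computed by numpy directly
assert np.isclose(spectral_norm, np.linalg.norm(W, 2))
assert np.isclose(inf_norm, np.linalg.norm(W, np.inf))
```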
Convolutional layers
Unfolding operator
For a convolution with kernel width $2k + 1$, the unfolding $U(x)$ of the input $x = (x_1, \ldots, x_T)$ is considered as a matrix (i.e. the input has length $T$ with $c$ input channels); its $t$-th column is

$$U(x)_t = \big(x_{t-k}^\top, \ldots, x_t^\top, \ldots, x_{t+k}^\top\big)^\top$$

Each $x_t$ is a column $c$-dimensional vector and $x_t = 0$ if $t$ is out of bounds (zero padding).
Convolution layer
A single convolutional layer with $q$ output channels is then defined as

$$\phi(x; W) = W\, U(x)$$

In which, $W$ is a $q \times (2k+1)c$ matrix.
For $p = 2$, there is $\Lambda_2 \le \sqrt{2k+1}\, \|W\|_2$, since each input coordinate appears in at most $2k+1$ columns of $U(x)$.
For $p = \infty$, there is $\Lambda_\infty \le \|W\|_\infty$, since unfolding does not increase the $\ell_\infty$ norm of the input.
Aggregation layers/transfer functions
For a node that sums its inputs, $\phi\big(x^{(1)}, \ldots, x^{(K)}\big) = \sum_{k} x^{(k)}$, there is

$$\Lambda_p^{(n)} \le \sum_{k} \Lambda_p^{(k)}$$

For a node that applies a transfer function (e.g. an element-wise non-linearity like ReLU) with a Lipschitz constant less than $1$, there is

$$\Lambda_p^{(n)} \le \Lambda_p^{(\tilde{n})}$$

where $\tilde{n}$ is its single child.
Parseval regularization, which we introduce in this section, is a regularization scheme to make deep neural networks robust, by constraining the Lipschitz constant (5) of each hidden layer to be smaller than one, assuming the Lipschitz constant of children nodes is smaller than one.
To enforce these constraints in practice, Parseval networks use two ideas: maintaining orthonormal rows in linear/convolutional layers, and performing convex combinations in aggregation layers.
For a weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$, Parseval regularization maintains

$$W W^\top \approx I$$

In which case, the rows of $W$ are approximately orthonormal and $W$ is then approximately a Parseval tight frame, so that all of its singular values are close to $1$.
For convolutional layers, the unfolded matrix is constrained to be a Parseval tight frame and the output is rescaled by a factor depending on the kernel size such that the Lipschitz constant of the layer stays below $1$.
Remark 1 (Orthogonality is required). Without orthogonality, constraints on the 2-norm of the rows of weight matrices are not sufficient to control the spectral norm. Parseval networks are thus fundamentally different from weight normalization (Salimans & Kingma, 2016).
For an aggregation layer in Parseval networks, the node takes a convex combination of its inputs:

$$\phi\big(x^{(1)}, \ldots, x^{(K)}\big) = \sum_{k} \alpha_k\, x^{(k)}, \qquad \alpha_k \ge 0,\ \sum_{k} \alpha_k = 1$$

The parameters $\alpha$ are learnt, and the convex-combination constraint guarantees that $\Lambda_p^{(n)} \le 1$ as soon as the children satisfy the same inequality for the same $p$-norm.
In practice, they use the Parseval tightness of the weights as a regularizer, i.e.

$$R_\beta(W) = \frac{\beta}{2}\, \big\|W W^\top - I\big\|_2^2$$

And they actually use two approximations: first, by performing only one gradient step on $R_\beta(W)$ after each main update, i.e.

$$W \leftarrow (1 + \beta)\, W - \beta\, W W^\top W$$

second, by sampling a subset $S$ of rows and performing the update above only on the submatrix $W_S$, thus reducing the overall complexity of the retraction to $O(|S|^2 d_{in})$.
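A minimal PyTorch sketch of this retraction (my own illustration; `beta` and `num_rows` are hypothetical hyperparameter names, and a convolutional weight would first be reshaped into its unfolded matrix):

```python
import torch

@torch.no_grad()
def parseval_retraction(W, beta=3e-4, num_rows=None):
    """Push W W^T toward the identity: W <- (1 + beta) W - beta W W^T W,
    optionally applied only to a random subset of rows."""
    if num_rows is None or num_rows >= W.shape[0]:
        W.copy_((1 + beta) * W - beta * (W @ W.t() @ W))
    else:
        idx = torch.randperm(W.shape[0])[:num_rows]
        Ws = W[idx]
        W[idx] = (1 + beta) * Ws - beta * (Ws @ Ws.t() @ Ws)
    return W

# typical usage after each SGD step:
# for layer in linear_layers: parseval_retraction(layer.weight.data)
```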
To instantiate the convexity constraints in aggregation layers, they project $\alpha$ onto the positive simplex after each gradient update:

$$\alpha \leftarrow \arg\min_{\gamma}\ \|\alpha - \gamma\|_2^2 \quad \text{s.t.} \quad \gamma \ge 0,\ \textstyle\sum_{k} \gamma_k = 1$$

Its solution is of the form $\gamma_k = \max(\alpha_k - \tau, 0)$ for some threshold $\tau$.
Denote $\alpha_{(1)} \ge \cdots \ge \alpha_{(K)}$ as the sorted coefficients and $\rho = \max\big\{j : \alpha_{(j)} - \frac{1}{j}\big(\sum_{r \le j} \alpha_{(r)} - 1\big) > 0\big\}$, the optimal threshold is given by

$$\tau = \frac{1}{\rho}\Big(\sum_{r \le \rho} \alpha_{(r)} - 1\Big)$$

The complexity of the projection is thus $O(K \log K)$, dominated by the sort.
In short: 1) sort the coefficients of $\alpha$ in decreasing order, 2) find the largest $j$ such that $j\,\alpha_{(j)}$ is larger than the sum of the $j$ largest coefficients minus $1$, 3) subtract $1$ from the sum of the $\rho$ largest coefficients and divide by $\rho$ to get the threshold $\tau$, 4) threshold the coefficients of $\alpha$ with the calculated $\tau$.
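A NumPy sketch of this projection (the standard sorting-based algorithm, not the authors' code):

```python
import numpy as np

def project_to_simplex(alpha):
    """Euclidean projection of alpha onto {gamma : gamma >= 0, sum(gamma) = 1}."""
    a_sorted = np.sort(alpha)[::-1]                  # coefficients in decreasing order
    cumsum = np.cumsum(a_sorted) - 1.0
    js = np.arange(1, len(alpha) + 1)
    rho = np.max(js[a_sorted - cumsum / js > 0])     # largest j with a positive gap
    tau = cumsum[rho - 1] / rho                      # optimal threshold
    return np.maximum(alpha - tau, 0.0)

print(project_to_simplex(np.array([0.7, 0.6, -0.1])))  # -> [0.55 0.45 0.  ]
```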
They test Parseval networks on MNIST, CIFAR-10, CIFAR-100 and SVHN.
To do so, we analyze the spectrum of the weight matrices of the different models by plotting the histograms of their singular values, and compare these histograms for Parseval networks to networks trained using standard SGD with and without weight decay (SGD-wd and SGD).
They use FGSM to test the robustness and compute the corresponding Signal to Noise Ratio (SNR), i.e. for a perturbation $\delta x$ on an image $x$:

$$\mathrm{SNR}(x, \delta x) = 20 \log_{10} \frac{\|x\|_2}{\|\delta x\|_2}$$
In the table, we denote Parseval(OC) the Parseval network with orthogonality constraint and without using a convex combination in aggregation layers.
To get Table 2, they compute the activations' empirical covariance matrix for each layer of the fully connected network, i.e.

$$C = \frac{1}{m} \sum_{i=1}^{m} \big(\phi(x_i) - \bar{\phi}\big)\big(\phi(x_i) - \bar{\phi}\big)^\top, \qquad \bar{\phi} = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$$

and obtain its sorted eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d$. At last, they select the smallest integer $k$ such that the top $k$ eigenvalues account for a fixed fraction of the total variance $\sum_i \lambda_i$.
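A small NumPy sketch of this effective-dimensionality computation as I read it (the variance fraction `threshold` is my assumption, not necessarily the paper's exact value):

```python
import numpy as np

def effective_dim(activations, threshold=0.99):
    """Smallest k such that the top-k eigenvalues of the empirical covariance
    capture a `threshold` fraction of the total variance; activations: (m, d)."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(activations)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, threshold) + 1)
```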
These results suggest that Parseval training contracts the data of each class into a lower-dimensional manifold (compared to the intrinsic dimensionality of the whole data), hence making classification easier.
They also observe that Parseval networks converge significantly faster than their vanilla counterparts (as shown in Figure 4).
Thanks to the orthogonalization step following each gradient update, the weight matrices are well conditioned at each step during the optimization.
It's a very reasonable idea to constrain the weights to be orthogonal, i.e. to make the learned features more diverse, but it seems that the robustness gained by this regularization is limited.
Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. In NeurIPS, 2019. arXiv:1907.02610
Adversarial training with more attacker steps is stronger but very computationally expensive, while training with fewer attacker steps leads to a phenomenon named gradient obfuscation, i.e. the loss surface around the training examples becomes highly convoluted and complex, so that weak attacks fail but strong attacks still succeed.
However, this can produce models which are robust against weak attacks, but break down under strong attacks – often due to gradient obfuscation
If the loss surface were linear in the vicinity of the training examples, which is to say well-predicted by local gradient information, gradient obfuscation could not occur.
A classifier function $f(\cdot; \theta)$, parameterized by $\theta$, maps input features $x$ to output logits $f_i(x; \theta)$ for the classes $i$ in a set $\mathcal{Y}$; the probability that the input belongs to class $i$ is then calculated using softmax, i.e. $p_i(x; \theta) = \exp(f_i(x; \theta)) / \sum_j \exp(f_j(x; \theta))$. Adversarial robustness for $f$ is defined as follows:
A network is robust to adversarial perturbations of magnitude $\epsilon$ at an input $x$ if and only if

$$\arg\max_{i \in \mathcal{Y}} f_i(x; \theta) = \arg\max_{i \in \mathcal{Y}} f_i(x + \delta; \theta) \qquad \forall \delta \in B_p(\epsilon) = \{\delta : \|\delta\|_p \le \epsilon\}$$

In this paper, they focus on $p = \infty$ and set $B(\epsilon) = \{\delta : \|\delta\|_\infty \le \epsilon\}$, i.e. the $\ell_\infty$-norm attack.
They intend to make improvements based on adversarial training using PGD, whose objective is

$$\min_\theta\ \mathbb{E}_{(x, y) \sim D}\Big[\max_{\delta \in B(\epsilon)} \ell\big(f(x + \delta; \theta), y\big)\Big]$$

And the inner optimization problem is solved by PGD, with each gradient step as:

$$\delta \leftarrow \mathrm{Proj}_{B(\epsilon)}\Big(\delta + \eta\, \mathrm{sign}\big(\nabla_\delta\, \ell(f(x + \delta; \theta), y)\big)\Big)$$
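A hedged PyTorch sketch of this PGD inner maximization (the standard recipe rather than the authors' exact code; `steps` and `step_size` are assumed hyperparameters, and pixel-range clipping is omitted):

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, step_size, steps):
    """Approximate max_{||delta||_inf <= eps} loss(x + delta) by projected gradient ascent."""
    delta = torch.empty_like(x).uniform_(-eps, eps)   # random start inside the ball
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).detach()

# adversarial training step: minimize the loss on the worst-case perturbation
# x_adv = pgd_linf(model, x, y, eps=8/255, step_size=2/255, steps=10)
# F.cross_entropy(model(x_adv), y).backward(); optimizer.step()
```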
A naive approach is to reduce the number of gradient steps performed by the optimization procedure. Generally, the attack is weaker when we do fewer steps. If the attack is too weak, the trained networks often display gradient obfuscation as highlighted in Fig 1.
If the loss surface is smooth and approximately linear in the input around a perturbation $\delta \in B(\epsilon)$, then the loss should be well approximated by its first-order Taylor expansion. Their difference can be an indicator of how linear the loss is:

$$g(\delta; x) = \big|\ell(x + \delta) - \ell(x) - \delta^\top \nabla_x \ell(x)\big|$$

And for the entire vicinity, we can define a quantity named the local linearity measure:

$$\gamma(\epsilon, x) = \max_{\delta \in B(\epsilon)} g(\delta; x)$$
By experiment, they observe that adversarial training with more attacker steps results in a smaller local linearity measure, suggesting that it has a more linear loss surface.
Proposition 4.1 Consider a loss function $\ell(x)$ that is once-differentiable, and a local neighborhood defined by $B(\epsilon)$. Then for all $\delta \in B(\epsilon)$:

$$\big|\ell(x + \delta) - \ell(x)\big| \le \big|\delta^\top \nabla_x \ell(x)\big| + \gamma(\epsilon, x)$$

The adversarial loss is upper bounded by the clean loss plus the local linearity measure and the change in loss as predicted by the gradient.
The adversarial loss tends to the clean loss, i.e. $\ell(x + \delta) \to \ell(x)$, as both $|\delta^\top \nabla_x \ell(x)| \to 0$ and $\gamma(\epsilon, x) \to 0$ for all $\delta \in B(\epsilon)$. (?)
Based on the analysis above, they propose the Local Linearity Regularization (LLR), which trains with the objective

$$\mathbb{E}_{(x, y) \sim D}\Big[\ell(x) + \lambda\, \gamma(\epsilon, x) + \mu\, \big|\delta_{LLR}^\top \nabla_x \ell(x)\big|\Big]$$

In which, $\delta_{LLR} = \arg\max_{\delta \in B(\epsilon)} g(\delta; x)$, which is a point in $B(\epsilon)$ where the linear approximation is maximally violated. They penalize both its linear violation $g(\delta_{LLR}; x) = \gamma(\epsilon, x)$ and the gradient magnitude term $|\delta_{LLR}^\top \nabla_x \ell(x)|$.
The local linearity measure by itself is a sufficient regularizer, but adding the gradient magnitude term makes it work better, as shown by their experiments.
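A rough PyTorch sketch of how such an objective could be computed (my own batch-averaged simplification, not the authors' implementation; `lam`, `mu` and `steps` are illustrative):

```python
import torch
import torch.nn.functional as F

def taylor_gap(model, x, y, delta, grad_x, loss_x):
    """Batch-averaged g(delta; x) = |loss(x + delta) - loss(x) - <delta, grad_x>|."""
    loss_d = F.cross_entropy(model(x + delta), y)
    lin = (delta * grad_x).flatten(1).sum(1).mean()
    return (loss_d - loss_x - lin).abs()

def llr_objective(model, x, y, eps, steps=2, step_size=None, lam=4.0, mu=3.0):
    step_size = step_size or eps / 2
    x = x.detach().requires_grad_(True)
    loss_x = F.cross_entropy(model(x), y)
    # input gradient, kept differentiable w.r.t. the model parameters (double backprop)
    grad_x = torch.autograd.grad(loss_x, x, create_graph=True)[0]
    # inner maximization: find delta where the linear approximation is most violated
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        gap = taylor_gap(model, x, y, delta, grad_x.detach(), loss_x.detach())
        g = torch.autograd.grad(gap, delta)[0]
        delta = (delta + step_size * g.sign()).clamp(-eps, eps).detach()
    gamma = taylor_gap(model, x, y, delta, grad_x, loss_x)       # linearity violation
    grad_term = (delta * grad_x).flatten(1).sum(1).abs().mean()  # |<delta, grad_x>|
    return loss_x + lam * gamma + mu * grad_term
```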
They evaluate LLR on CIFAR-10 and ImageNet:
CIFAR-10
The perturbation radius we examine is ε = 8/255 and the model architectures we use are Wide-ResNet-28-8, Wide-ResNet-40-8 [29].
ImageNet
The perturbation radii considered are ε = 4/255 and ε = 16/255. The architecture used for this is from [10], which is ResNet-152.
Under three types of attacks:
It's also observed that LLR results in a better distribution of robustness over attacks of different strengths:
ICLR 2019 withdrawal
Cecilia Summers, Michael J. Dinneen. Improved Adversarial Robustness via Logit Regularization Methods. arXiv preprint 2019. arXiv:1906.03749
We also demonstrate that much of the effectiveness of one recent adversarial defense mechanism can in fact be attributed to logit regularization, and show how to improve its defense against both white-box and black-box attacks, in the process creating a stronger black-box attack against PGD-based models.
They empirically evaluate a series of logit regularization techniques for their potential to be used as a defense against adversarial attack.
This paper was withdrawn by the authors, so let's read it critically.
In this work, we show that adversarial logit pairing derives a large fraction of its benefits from regularizing the model’s logits toward zero, which we demonstrate through simple and easy to understand theoretical arguments in addition to empirical demonstration.
Adversarial logit pairing refers to pairing the logits produced by adversarial examples and clean examples, i.e. adding a regularization term of the form

$$\lambda\, \big\|f(x; \theta) - f(\tilde{x}; \theta)\big\|_2^2 = \lambda \sum_{i} \big(f_i(x; \theta) - f_i(\tilde{x}; \theta)\big)^2$$

where $f_i(x; \theta)$ is the logit of the model for class $i$ on the clean example $x$, and $f_i(\tilde{x}; \theta)$ is the respective logit on the corresponding adversarial example $\tilde{x}$.
When $f_i(x; \theta) > f_i(\tilde{x}; \theta)$, this term encourages the original logit to be smaller and the adversarial logit to be larger, and the other way around otherwise, which is essentially regularizing the logits.
If we incorporate a shared scale factor $c$ into the logits, i.e. replace $f(x; \theta)$ by $c\, f(x; \theta)$, the pairing term becomes

$$\lambda\, c^2\, \big\|f(x; \theta) - f(\tilde{x}; \theta)\big\|_2^2$$

implying that the term is reduced simply by shrinking the shared scale of the logits,
indicating $c = 0$ (i.e. all-zero logits) as a global minimizer for this part of the loss. Therefore, adversarial logit pairing also has an effect of logit squeezing.
And it is also demonstrated with experimental results:
As shown in Figure 1 (left), adversarial logit pairing does reduce the magnitude of the logits, i.e. it shows an effect of logit squeezing; and in Figure 1 (right), using logit regularization alone is able to recover slightly more than half of the total improvement from logit pairing.
It's convincing that adversarial logit pairing also squeezes the logits, but unfortunately adversarial logit pairing has since been demonstrated to be broken as a defense.
Label smoothing refers to replacing hard one-hot labels with soft labels:

$$y_i^{LS} = (1 - \alpha)\, y_i + \frac{\alpha}{K}$$

where the number of categories is denoted by $K$, $y$ is the one-hot label, and $\alpha$ is the smoothing strength.
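A tiny PyTorch sketch of this relabeling (standard label smoothing, not specific to this paper):

```python
import torch
import torch.nn.functional as F

def smooth_labels(y, num_classes, alpha=0.1):
    """Replace hard one-hot targets with (1 - alpha) * one_hot + alpha / K."""
    one_hot = F.one_hot(y, num_classes).float()
    return (1.0 - alpha) * one_hot + alpha / num_classes

# cross-entropy against the soft targets:
# loss = -(smooth_labels(y, K) * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```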
Interestingly, Kurakin et al. [15] found that incorporating a small amount of label smoothing in a model trained on ImageNet actually decreased adversarial robustness by roughly 1%.
As shown in Figure 2 left, they discover that aggressive label smoothing can improve adversarial robustness; as shown in Figure 2 right, label smoothing also squeezes the logits, i.e. reducing the dynamic range of logits.
This is also observed in other works.
Other logit regularizations include enforcing linear behavior between pairs of examples and their labels (e.g. mixup).
The regularization term of adversarial logit pairing can be expanded as

$$\big\|f(x; \theta) - f(\tilde{x}; \theta)\big\|_2^2 = \sum_i f_i(x; \theta)^2 - 2 \sum_i f_i(x; \theta)\, f_i(\tilde{x}; \theta) + \sum_i f_i(\tilde{x}; \theta)^2$$

where the first and third terms are explicit logit regularization, and the logit pairing effect is only determined by the middle inner product.
Therefore, the term can be expressed in a more general form

$$D\big(f(x; \theta),\, f(\tilde{x}; \theta)\big)$$

where the similarity between logits is controlled by a metric $D$.
There are several natural choices for $D$, such as the Jensen-Shannon divergence, a cosine similarity, or any similarity metric that does not have a significant regularization effect.
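A PyTorch sketch of this decoupled view (my own illustration; the cosine-similarity pairing and the explicit L2 logit penalty are example choices, not necessarily the authors' exact combination):

```python
import torch
import torch.nn.functional as F

def alp_term(logits_clean, logits_adv):
    """Original adversarial logit pairing: squared L2 distance between logits,
    which implicitly also shrinks the logit magnitudes."""
    return (logits_clean - logits_adv).pow(2).sum(dim=1).mean()

def decoupled_pairing(logits_clean, logits_adv, reg_weight=0.1):
    """Decoupled version: a similarity metric with little magnitude pressure
    (cosine similarity) plus an explicit logit-regularization term."""
    pairing = 1.0 - F.cosine_similarity(logits_clean, logits_adv, dim=1).mean()
    logit_reg = logits_clean.pow(2).sum(dim=1).mean() + logits_adv.pow(2).sum(dim=1).mean()
    return pairing + reg_weight * logit_reg
```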
Their proposal still requires adversarial examples, just as adversarial training does, so it is not more computationally efficient.
It's nice to see that they decoupled adversarial logit pairing and pointed out its logit-squeezing effect.
Unfortunately, the Madry lab showed in 2018 that adversarial logit pairing can be broken, and it offers no clear advantage over adversarial training.
In fact, adversarial logit pairing can be seen as a variant of adversarial training, which makes its failure somewhat puzzling.
Code: https://github.com/alvinchangw/JARN_ICLR2020
Alvin Chan, Yi Tay, Yew Soon Ong, Jie Fu. Jacobian Adversarially Regularized Networks for Robustness. ICLR 2020. arXiv:1912.10185
We propose Jacobian Adversarially Regularized Networks (JARN) as a method to optimize the saliency of a classifier’s Jacobian by adversarially regularizing the model’s Jacobian to resemble natural training images.
The standard cross-entropy loss is

$$L_{cls} = -\sum_{k=1}^{K} y_k \log\big(\mathrm{softmax}(f_{cls}(x; \theta_{cls}))_k\big)$$

for a classifier $f_{cls}$ (parameterized by $\theta_{cls}$) and a label $y$ given as a one-hot vector over the $K$ classes.
With gradient backpropagation to the input layer, i.e. differentiating $L_{cls}$ with respect to $x$, the Jacobian matrix is

$$J(x) = \frac{\partial L_{cls}(x, y; \theta_{cls})}{\partial x}$$

In which, $J(x)$ has the same shape as the input image $x$.
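A minimal PyTorch sketch of obtaining this input Jacobian by backpropagation (`create_graph=True` keeps it differentiable so it can be trained through later):

```python
import torch
import torch.nn.functional as F

def input_jacobian(model, x, y):
    """J(x) = d loss / d x, same shape as the image batch x."""
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x, create_graph=True)[0]
```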
The Jacobians of robust models are empirically observed to be similar to images, and the next part of JARN entails adversarial regularization of Jacobian matrices to induce resemblance with input images.
An adaptor network $a_{apt}$ (parameterized with $\theta_{apt}$) is introduced to map the Jacobian into the domain of input images, i.e.

$$x' = a_{apt}\big(J(x);\ \theta_{apt}\big)$$

The adaptor network and the classifier together can be framed as a generator
learning to model a distribution $p_{x'}$ that resembles the image distribution $p_x$.
A discriminator $d$, parameterized with $\theta_{disc}$, that outputs a single scalar indicating whether its input comes from $p_x$ or $p_{x'}$, is then introduced to form a GAN with this generator. The training of this GAN involves the following adversarial loss:

$$L_{adv} = \mathbb{E}_{x \sim p_x}\big[\log d(x)\big] + \mathbb{E}_{x' \sim p_{x'}}\big[\log\big(1 - d(x')\big)\big]$$
The final loss is a combination of these two losses, and the classifier is learned by

$$\theta_{cls}^* = \arg\min_{\theta_{cls}}\ \big[\, L_{cls} + \lambda_{adv}\, L_{adv} \,\big]$$

While the optimal parameters for $a_{apt}$ and $d$, i.e. $\theta_{apt}^*$ and $\theta_{disc}^*$ respectively, are learned by

$$\theta_{apt}^* = \arg\min_{\theta_{apt}} L_{adv}, \qquad \theta_{disc}^* = \arg\max_{\theta_{disc}} L_{adv}$$
These three networks are optimized simultaneously as shown in Algorithm 1. Similar to GAN training, they add some uniformly sampled noise to the inputs before feeding them in.
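A condensed sketch of one such joint update in PyTorch (my reading of the procedure, heavily simplified; `adaptor`, `disc`, the three optimizers and `lambda_adv` are assumed names, the discriminator outputs raw logits, and the added noise and update schedule are omitted):

```python
import torch
import torch.nn.functional as F

def jarn_step(model, adaptor, disc, opt_cls, opt_apt, opt_disc, x, y, lambda_adv=1.0):
    bce = F.binary_cross_entropy_with_logits

    # classification loss and input Jacobian (kept differentiable w.r.t. model parameters)
    x_in = x.detach().requires_grad_(True)
    loss_cls = F.cross_entropy(model(x_in), y)
    jac = torch.autograd.grad(loss_cls, x_in, create_graph=True)[0]
    fake = adaptor(jac)                                # Jacobian mapped to the image domain

    # discriminator update: real images vs. adapted Jacobians
    d_real, d_fake = disc(x), disc(fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # classifier + adaptor update: fool the discriminator so Jacobians resemble images
    g_out = disc(fake)
    g_loss = bce(g_out, torch.ones_like(g_out))
    total = loss_cls + lambda_adv * g_loss             # gradients flow back through the Jacobian
    opt_cls.zero_grad()
    opt_apt.zero_grad()
    total.backward()                                   # disc grads also accumulate; cleared next call
    opt_cls.step()
    opt_apt.step()
    return loss_cls.item(), d_loss.item(), g_loss.item()
```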
Theorem 3.1 The global minimum of $L_{adv}$ is achieved when $p_{x'}$ maps to $p_x$ itself, i.e. $p_{x'} = p_x$.
It's a theorem similar to that in GAN.
In Etmann et al. (2019), it is shown that the linearized robustness of a model is loosely upper bounded by the alignment between the Jacobian and the input image.
Theorem 3.2 (Linearized Robustness Bound) (Etmann et al., 2019) Defining $i$ and $j$ as the top two predicted classes, we let the Jacobian with respect to the difference in the top two logits be $g = \nabla_x\big(f_i(x) - f_j(x)\big)$. Expressing the alignment between the Jacobian and the input as $\alpha(x) = \frac{|\langle g, x \rangle|}{\|g\|}$, the linearized robustness $\rho(x)$ then satisfies

$$\rho(x) \le \alpha(x) + \frac{C}{\|g\|}$$

where $C$ is a positive constant.
This suggests that a bigger $\alpha(x)$ (a Jacobian $g$ more aligned with the input $x$) and a smaller $\|g\|$ induce a larger upper bound on robustness.
Combining this with what we have in Theorem 3.1, and assuming $a_{apt}(J(x))$ stays close to $J(x)$ up to a fixed constant term, the alignment term in Equation (13) is maximized when $L_{adv}$ reaches its global minimum.
They evaluate their method on three datasets: MNIST, SVHN and CIFAR-10.
For MNIST:
We train a CNN, sequentially composed of 3 convolutional layers and 1 final softmax layer. All 3 convolutional layers have a stride of 5 while each layer has an increasing number of output channels (64-128-256).
PGD attacks are run under a perturbation tolerance $\epsilon$; the "uniform noise" entry denotes data augmentation with uniform noise.
For SVHN and CIFAR-10:
We train the Wide-Resnet model following hyperparameters from (Madry et al., 2017)’s setup for their CIFAR-10 experiments.
PGD attacks are run under a perturbation tolerance $\epsilon$.
To study how JARN-AT1 robustness generalizes, we conduct PGD attacks of varying ε and strength (5, 10 and 20 iterations).
JARN indeed produces Jacobians that are more similar to the original images.
JARN also trains much faster, and it appears to be more robust to transfer attacks (black-box attacks), which indicates that this is not a method relying on gradient masking.
This paper is a nice demonstration of the word "research": the empirical work points out a phenomenon, the theoretical work gives a possible explanation and direction for engineers, and the engineering work proposes a method to solve the problem.
JARN seems to offer robustness comparable to adversarial training while using less training time. The use of a GAN on the Jacobians seems a little redundant; perhaps there is some room for improvement.