About Few-shot Learning

By LI Haoyang 2020.10.13

unfinished

Content

About Few-shot Learning
- Content
- Datasets
  - MiniImageNet
  - CIFAR 100
  - CUB
- Approaches
  - MatchNet - NIPS 2016
    - Model
    - Training strategy
    - N-way K-shot Task
    - Performance
    - Inspiration
  - Diversity Transfer Network - AAAI 2020
    - N-way K-shot task
    - Method
    - Experiment settings
    - Performance
    - Visualization
    - Ablation study

Datasets

MiniImageNet

Dataset: https://drive.google.com/file/d/1HkgrkAwukzEZA0TpO7010PkAOREb2Nuk/view

CIFAR 100

CUB

Approaches

MatchNet - NIPS 2016

Paper: http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra. Matching Networks for One Shot Learning. NIPS 2016.

Humans learn new concepts with very little supervision – e.g. a child can generalize the concept of “giraffe” from a single picture in a book – yet our best deep learning systems need hundreds or thousands of examples. This motivates the setting we are interested in: “one-shot” learning, which consists of learning a class from a single labeled example.

In contrast, many non-parametric models allow novel examples to be rapidly assimilated, whilst not suffering from catastrophic forgetting.

We aim to incorporate the best characteristics from both parametric and non-parametric models – namely, rapid acquisition of new examples while providing excellent generalization from common examples.

Model

Given a small support set $S$, their model defines a function (or classifier) $c_S(\hat{x})$ for each $S$, i.e., a mapping $S \rightarrow c_S(\hat{x})$.

They wish to map from a small support set of $k$ examples of input-label pairs $S = \{(x_i, y_i)\}_{i=1}^{k}$ to a classifier $c_S(\hat{x})$. When a test sample $\hat{x}$ is given to the classifier, it defines a probability distribution over outputs $\hat{y}$:

$$P(\hat{y} \mid \hat{x}, S)$$

When a new support set of examples $S'$ from which to one-shot learn is given, the parametric neural network defined by $P$ can simply be used to make predictions about the appropriate label distribution $\hat{y}$ for each test example $\hat{x}$:

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$$

Perhaps, the $y_i$ here is one-hot encoded?

In which, $a(\hat{x}, x_i)$ is an attention mechanism, subsuming both KDE and kNN methods.

In a more casual interpretation, $a(\hat{x}, x_i)$ computes the similarity of $\hat{x}$ and $x_i$.

The simplest form can be a softmax over the cosine similarity $c$ of the embedded inputs:

$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$

In which, $f$ and $g$ are embedding functions, i.e., neural networks.

In the exact instantiation of this model, they wish the embedding functions to learn from the whole support set $S$, so the embeddings become full context embeddings $f(\hat{x}, S)$ and $g(x_i, S)$: $g$ embeds each $x_i$ with a bidirectional LSTM over the support set, and $f$ embeds $\hat{x}$ with an LSTM that uses read-attention over the support set.
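To make the kNN-like inference concrete, here is a minimal sketch of the simple (non-full-context) form in PyTorch, assuming the query and support embeddings are already computed; `matching_predict` is my own illustrative function, not code from the paper:

```python
import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_emb, support_labels, n_way):
    """Attention-based prediction of Matching Networks (simple, non-FCE form).

    query_emb:      (d,) embedded query f(x_hat)
    support_emb:    (k, d) embedded support set g(x_i)
    support_labels: (k,) integer labels in [0, n_way)
    """
    # Cosine similarity between the query and every support embedding
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_emb, dim=1)   # (k,)
    # Softmax over similarities gives the attention weights a(x_hat, x_i)
    attention = F.softmax(sims, dim=0)                                       # (k,)
    # Weighted sum of one-hot labels yields the predicted label distribution
    one_hot = F.one_hot(support_labels, num_classes=n_way).float()           # (k, n_way)
    return attention @ one_hot                                               # (n_way,)
```

The predicted class is then simply the argmax of the returned distribution, matching the kNN-like reading above.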

Training strategy

The training process aims to obtain the parameters $\theta$ such that:

$$\theta = \arg\max_{\theta} \; \mathbb{E}_{L \sim T}\left[ \mathbb{E}_{S \sim L,\, B \sim L}\left[ \sum_{(x, y) \in B} \log P_{\theta}(y \mid x, S) \right] \right]$$

In which, a task $T$ is defined as a distribution over possible label sets $L$, i.e., a label set $L$ is sampled from the task $T$; the batch $B$ and the support set $S$ are then sampled from the label set $L$.

The obtained $\theta$ yields a model that works well when sampling $S' \sim T'$ from a different distribution $T'$ of novel labels.

In human words, the inner part aims to obtain a model that performs well at classifying samples from the batch based on samples from the support set, while the outer part aims to make the inner part perform well across the various label sets sampled from the entire task distribution.
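A rough sketch of how this nested objective is usually realized as episodic training; `sample_label_set`, `sample_episode`, and `model` are hypothetical placeholders (the model is assumed to return per-class log-probabilities conditioned on the support set, and `batch` is assumed to expose `.inputs` and `.labels` tensors):

```python
def train_episodic(model, optimizer, sample_label_set, sample_episode, num_episodes):
    """Outer loop: sample label sets L ~ T; inner loop: fit the batch B given the support S."""
    for _ in range(num_episodes):
        label_set = sample_label_set()              # L ~ T
        support, batch = sample_episode(label_set)  # S ~ L, B ~ L
        # model is assumed to return log P_theta(y | x, S) of shape (|B|, N)
        log_probs = model(batch.inputs, support)
        # negative log-likelihood of the true labels, averaged over the batch
        loss = -log_probs.gather(1, batch.labels.unsqueeze(1)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```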

N-way K-shot Task

All of our experiments revolve around the same basic task: an N-way k-shot learning task. Each method is provided with a set of k labeled examples from each of N classes that have not previously been trained upon. The task is then to classify a disjoint batch of unlabeled examples into one of these N classes. Thus random performance on this task stands at 1/N.

Performance

Omniglot [14] consists of 1623 characters from 50 different alphabets. Each of these was hand drawn by 20 different people. The large number of classes (characters) with relatively few data per class (20), makes this an ideal data set for testing small-scale one-shot classification.

ImageNet is a notoriously large data set upon which it can be quite a feat of engineering and infrastructure to run experiments, requiring many resources. Thus, as well as using the full ImageNet data set, we devised a new data set – miniImageNet – consisting of 60,000 colour images of size 84 × 84 with 100 classes, each having 600 examples.

Inspiration

This paper defines a new, challenging and interesting task, i.e., the N-way K-shot task. They propose a model that meta-learns a generic metric between samples and uses kNN-like inference to classify test images.

Diversity Transfer Network - AAAI 2020

Code: https://github.com/Yuxin-CV/DTN

Mengting Chen, Yuxin Fang, Xinggang Wang, Heng Luo, Yifeng Geng, Xinyu Zhang, Chang Huang, Wenyu Liu, Bo Wang. Diversity Transfer Network for Few-Shot Learning. AAAI 2020. arXiv:1912.13182

The idea is to use the diversity of the larger training dataset to augment the few-shot dataset by generating new samples with more diversity.

N-way K-shot task

Given two datasets $D_{train}$ and $D_{test}$ with disjoint label spaces $C_{train}$ and $C_{test}$, the problem is to take $N$ classes, each with $K$ images, i.e., $N \times K$ samples, to form a support set from the training/testing set:

$$S = \{(x_i^s, y_i^s)\}_{i=1}^{N \times K}$$

A query sample $(x^q, y^q)$ is then sampled from the remaining images of the $N$ classes.

The goal is to classify the query $x^q$ into one of the $N$ classes correctly, based only on the support set $S$ and the prior meta-knowledge learned from the training set $D_{train}$.

The problem seems to be defined differently from that in MatchNet.

Method

The structure consists of a meta-learning branch and an auxiliary branch.

All of these images are first mapped into normalized feature vectors by the feature extractor $F$. The diversity between paired reference features $(F(\hat{x}_{j,1}), F(\hat{x}_{j,2}))$, measured by a transformed subtraction, is used to augment the support features $F(x_i^s)$, roughly:

$$\tilde{f}_{i,j} = F(x_i^s) + \Psi\big(\Phi\big(F(\hat{x}_{j,1}) - F(\hat{x}_{j,2})\big)\big)$$

where $\Phi$ and $\Psi$ denote the two mapping functions described in the experiment settings below.

For the $N \times K$ support images, with $n_g$ reference pairs per support image, the augmented dataset will have $N \times K \times n_g$ generated samples.
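A minimal sketch of the augmentation step under the reading above; `phi` and `psi` stand for the two mapping functions described in the experiment settings, and the exact way they are composed with the support feature is my assumption rather than the paper's verbatim definition:

```python
import torch
import torch.nn.functional as F

def generate_diverse_features(support_feat, ref_feat_1, ref_feat_2, phi, psi):
    """Transfer the diversity of reference pairs onto one support feature.

    support_feat: (d,) normalized feature of one support image
    ref_feat_1/2: (n_g, d) normalized features of n_g reference image pairs
    phi, psi:     the two mapping functions (callables)
    """
    diversity = phi(ref_feat_1 - ref_feat_2)                 # transformed subtraction
    generated = support_feat.unsqueeze(0) + psi(diversity)   # (n_g, d) augmented features
    return F.normalize(generated, dim=1)                     # keep features on the unit sphere
```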

The meta-classifier is an $N$-way classifier, represented by a matrix $W$, in which each row $w_i$ can be viewed as a proxy of the $i$-th category:

$$w_i = \frac{c_i}{\|c_i\|_2}, \qquad c_i = \frac{1}{|\mathcal{F}_i|} \sum_{f \in \mathcal{F}_i} f$$

where $\mathcal{F}_i$ is the set of support and generated features belonging to the $i$-th category.

It is the normalized center of the cluster of the $i$-th category.
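A small sketch of the proxy computation, assuming all support and generated features of an episode are stacked into one tensor with integer class labels:

```python
import torch
import torch.nn.functional as F

def class_proxies(features, labels, n_way):
    """Each proxy is the L2-normalized mean of all (support + generated) features of a class.

    features: (m, d) normalized features, labels: (m,) integers in [0, n_way)
    """
    proxies = torch.stack([features[labels == c].mean(dim=0) for c in range(n_way)])
    return F.normalize(proxies, dim=1)   # (n_way, d), one proxy per row
```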

The prediction for the extracted query feature $f^q = F(x^q)$ fed to the meta-classifier is:

$$P(y = i \mid x^q, S) = \frac{\exp(w_i^{\top} f^q)}{\sum_{j=1}^{N} \exp(w_j^{\top} f^q)}$$

And during training, the loss of the meta-task, parameterized with $\theta$, is calculated as:

$$L_{meta}(\theta) = -\log P(y = y^q \mid x^q, S)$$

Minimizing this loss increases the cosine similarity between the query feature and the proxy of the query's category.
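Continuing the sketch: with normalized proxies and a normalized query feature, the dot product is exactly the cosine similarity, so the meta loss is cross-entropy over those similarities (any scale/temperature factor the paper may use is omitted here):

```python
import torch
import torch.nn.functional as F

def meta_loss(query_feat, proxies, query_label):
    """Cross-entropy over cosine similarities between the query and the class proxies.

    query_feat: (d,) normalized query feature, proxies: (n_way, d), query_label: scalar long tensor
    """
    logits = proxies @ query_feat                      # (n_way,) cosine similarities
    return F.cross_entropy(logits.unsqueeze(0), query_label.unsqueeze(0))
```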

As the proxy is determined by the augmented support set, does the optimizer solely update $\theta$ in the loss function? It seems to be ditched at inference.

The auxiliary branch is a classifier of the same form as the meta-classifier, trained on mini-batches (i.e., $(x^a, y^a)$ pairs) sampled from $D_{train}$ and optimized according to the same form of loss function:

$$L_{aux} = -\log \frac{\exp\big((w^a_{y^a})^{\top} f^a\big)}{\sum_{j} \exp\big((w^a_j)^{\top} f^a\big)}$$

In which, $f^a = F(x^a)$ and $w^a_j$ is the $j$-th row of the auxiliary classifier's weight matrix $W^a$.

The auxiliary branch is designed to accelerate the optimization of the feature extractor.

An Organized Auxiliary task co-Training (OAT) strategy is proposed to organize the auxiliary branch and the meta-learning branch.

They select several training epochs to form one training unit $U_i$; the $i$-th training unit has meta-training epochs ($T_{meta}$) and auxiliary training epochs ($T_{aux}$). With $n$ training units, the whole training sequence is:

$$U_1, U_2, \dots, U_n$$

In human words, there are $n$ steps of training; in each step, there are some auxiliary epochs and some meta-learning epochs. The whole process is interleaved.
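A sketch of the interleaving; the number of auxiliary versus meta epochs per unit is left to a hypothetical `schedule` function rather than taken from the paper:

```python
def oat_training(n_units, schedule, run_aux_epoch, run_meta_epoch):
    """Organized Auxiliary task co-Training: alternate auxiliary and meta epochs per unit.

    schedule(i) returns (n_aux, n_meta) epoch counts for the i-th training unit;
    run_aux_epoch / run_meta_epoch execute one epoch of the respective branch.
    """
    for i in range(n_units):
        n_aux, n_meta = schedule(i)
        for _ in range(n_aux):
            run_aux_epoch()    # large-way classification on D_train mini-batches
        for _ in range(n_meta):
            run_meta_epoch()   # episodic N-way K-shot meta-training
```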

Experiment settings

Three datasets, miniImageNet, CIFAR100 and CUB, are used for evaluation.

The feature extractor for DTN is a CNN with 4 convolutional modules, each of which contains a convolutional layer with 64 channels followed by a batch normalization layer, a ReLU non-linearity layer and a max-pooling layer. The output is a feature vector of 1024 dimensions.
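A sketch of such a Conv-4 backbone in PyTorch; the paper's exact route to 1024 output dimensions is not spelled out here, so the adaptive pooling to a 4 × 4 map (64 × 4 × 4 = 1024) is my assumption:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> BatchNorm -> ReLU -> MaxPool, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Conv4Extractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64),
            conv_block(64, 64), conv_block(64, 64),
        )
        self.pool = nn.AdaptiveAvgPool2d(4)   # assumption: pool to 4x4 so 64*4*4 = 1024

    def forward(self, x):                     # x: (batch, 3, 84, 84)
        feats = self.pool(self.blocks(x)).flatten(1)
        return nn.functional.normalize(feats, dim=1)   # 1024-d normalized features
```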

The mapping function $\Phi$ is a fully-connected layer with 2048 units, followed by a leaky ReLU activation ($\max(x, 0.2x)$) layer and a dropout layer with a 30% dropout rate.

The mapping function $\Psi$ has the same settings as $\Phi$, except that the number of units of the FC layer is 1024.
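Written out directly from the description above, the two mapping functions could look like this; the names `phi`/`psi` are my own labels (matching the generator sketch earlier), and the input dimension of `psi` is chosen so the composition is dimensionally consistent, which the paper may wire differently:

```python
import torch.nn as nn

# Mapping function phi: FC(2048) -> LeakyReLU(0.2) -> Dropout(0.3)
phi = nn.Sequential(
    nn.Linear(1024, 2048),
    nn.LeakyReLU(0.2),
    nn.Dropout(p=0.3),
)

# Mapping function psi: same structure, but with 1024 units
psi = nn.Sequential(
    nn.Linear(2048, 1024),
    nn.LeakyReLU(0.2),
    nn.Dropout(p=0.3),
)
```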

Performance

Visualization

As shown in the paper's visualization, the generated samples are roughly clustered within the cloud of real samples from the larger dataset, which indicates the effectiveness of the proposed augmentation.

Ablation study

The results gradually become better as the number of generated features increases, but no further improvement was observed once the number of generated features exceeded 64. The authors attribute this to the fact that 64 generated features already fit the real sample distribution well.