By LI Haoyang 2020.10.12 ~ 2020.10.13
This note is summarized from a survey; therefore, no quotes are marked, and most of the following sentences are taken directly from the original article.
Summarized Survey of Few-shot Learning

Content
- Few-shot learning: a survey
  - Problem definition
  - Relevant problems
  - Core issue of FSL
  - Taxonomy of current works
- Data focused approaches
  - Duplicate training data with transformations
  - Borrow from other data sets
- Model focused approaches
  - Multitask learning
  - Embedding learning
  - Learning with external memory
  - Generative models
- Algorithm focused approaches
  - Refining existing parameters
  - Refining meta-learned $\theta$
  - Learning search steps
- Future works
- Inspirations
Yaqing Wang, Quanming Yao. Few-shot Learning: A Survey. 2019. arXiv:1904.05046
Problem definition
Few-shot learning aims to improve a model's ability to generalize when trained on a small dataset. FSL learns a new task from limited supervised information by incorporating prior knowledge.
Start from the definition of machine learning, which is:
A computer program is said to learn from experience $E$ with respect to some classes of task $T$ and performance measure $P$ if its performance can improve with $E$ on $T$ as measured by $P$. (Machine Learning, Mitchell)
FSL is a special case of machine learning, which specifically targets good learning performance given limited supervised information:
Few-Shot Learning (FSL) is a type of machine learning problem (specified by $E$, $T$ and $P$) where $E$ contains little supervised information for the target task $T$.
The scenarios of FSL can be summarized as: acting as a test bed for learning like humans (e.g., character generation), learning for rare cases (e.g., drug discovery), and reducing the data gathering effort and computational cost (e.g., large-scale image classification).
Therefore, FSL methods combine prior knowledge with the available supervised information in $E$ to make the learning of the target task $T$ feasible.
Relevant problems
Semi-supervised learning
It learns the optimal hypothesis from input to output from an experience $E$ consisting of both labeled and unlabeled examples.
Imbalanced learning
It learns from an experience $E$ with a severely skewed distribution for the output $y$.
Transfer learning
It transfers knowledge learned from a source domain and source task, where sufficient training data is available, to a target domain and target task, where training data is limited.
Meta-learning
It improves the performance $P$ on task $T$ using both the data set of the task and the meta-knowledge extracted across tasks by a meta-learner.
Let $p(T)$ be the distribution of tasks $T$. In meta-training, the meta-learner learns from a set of tasks $T_s \sim p(T)$. Each task $T_s$ operates on a data set $D_s$ of $N$ classes, where $D_s = D_s^{\text{train}} \cup D_s^{\text{test}}$ with $D_s^{\text{train}} \cap D_s^{\text{test}} = \emptyset$. Each learner learns from $D_s^{\text{train}}$ and measures its test error on $D_s^{\text{test}}$. The parameter $\theta$ of the meta-learner is learned so as to minimize the error across all learners:

$$\theta^* = \arg\min_{\theta} \; \mathbb{E}_{T_s \sim p(T)} \big[ \mathcal{L}\big(D_s^{\text{test}}; \theta\big) \big]$$
In meta-testing, another disjoint set of tasks $T_t \sim p(T)$ is used to test the generalization ability of the meta-learner. Each $T_t$ works on a data set $D_t$ of $N$ classes, where $D_t = D_t^{\text{train}} \cup D_t^{\text{test}}$ with $D_t^{\text{train}} \cap D_t^{\text{test}} = \emptyset$. Finally, each learner learns from $D_t^{\text{train}}$ and is tested on $D_t^{\text{test}}$ to obtain the meta-learning testing error.
It appears to me that meta-learning is like an ensemble of classifiers, each trained on a partial dataset.
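To make the episodic setup concrete, here is a minimal Python sketch of N-way K-shot episode construction; the class pool, array shapes, and the `sample_episode` helper are illustrative assumptions, not from the survey.

```python
import numpy as np

# A minimal sketch of N-way K-shot episode construction, the sampling
# scheme behind the meta-learning setup described above.
rng = np.random.default_rng(0)
data = {c: rng.normal(size=(20, 8)) for c in range(10)}  # 10 classes, 20 samples each

def sample_episode(n_way=5, k_shot=1, n_query=5):
    classes = rng.choice(len(data), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(len(data[c]))[: k_shot + n_query]
        support += [(data[c][i], label) for i in idx[:k_shot]]  # the task's D^train
        query += [(data[c][i], label) for i in idx[k_shot:]]    # the task's D^test
    return support, query

support, query = sample_episode()
print(len(support), len(query))  # 5 support pairs, 25 query pairs
```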
Core issue of FSL
Start from the machine learning problem of improving $P$ with $E$ on $T$, formulated as:

$$\min_{\theta} \; \sum_{(x, y) \in D} \ell\big(h(x; \theta), y\big)$$
In this form, learning means the algorithm searching in the hypothesis space $\mathcal{H}$ for the parameter $\theta$ that parameterizes the hypothesis $h$, chosen by the model, which best fits the data $D$.
Decomposition of error
Let $\hat{y} = h(x)$ be the prediction of some hypothesis $h$ for input $x$. To solve the problem defined above, the essence is to minimize the expected risk $R$ (i.e., the loss measured with respect to the joint distribution $p(x, y)$), formulated as:

$$R(h) = \int \ell\big(h(x), y\big) \, dp(x, y) = \mathbb{E}_{(x, y) \sim p}\big[\ell\big(h(x), y\big)\big]$$
Since $p(x, y)$ is unknown in practice, the empirical risk $R_I$ is used to estimate the expected risk $R$; it is defined as the average of the sample losses over the training data set $\{(x_i, y_i)\}_{i=1}^{I}$:

$$R_I(h) = \frac{1}{I} \sum_{i=1}^{I} \ell\big(h(x_i), y_i\big)$$
The learning is done by empirical risk minimization.
In short, empirical risk minimization assumes that every sample in the training set has the same probability.
Let:
- $\hat{h} = \arg\min_{h} R(h)$ be the hypothesis that minimizes the expected risk;
- $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ be the hypothesis in $\mathcal{H}$ that minimizes the expected risk;
- $h_I = \arg\min_{h \in \mathcal{H}} R_I(h)$ be the hypothesis in $\mathcal{H}$ that minimizes the empirical risk.
The total error of learning, taken with respect to the random choice of training set, can be decomposed into:

$$\mathbb{E}\big[R(h_I) - R(\hat{h})\big] = \underbrace{\mathbb{E}\big[R(h^*) - R(\hat{h})\big]}_{\mathcal{E}_{\text{app}}(\mathcal{H})} + \underbrace{\mathbb{E}\big[R(h_I) - R(h^*)\big]}_{\mathcal{E}_{\text{est}}(\mathcal{H}, I)}$$

where the approximation error $\mathcal{E}_{\text{app}}(\mathcal{H})$ measures how closely the hypotheses in $\mathcal{H}$ can approximate the optimal $\hat{h}$, and the estimation error $\mathcal{E}_{\text{est}}(\mathcal{H}, I)$ measures the effect of minimizing the empirical risk instead of the expected risk.
Given the analysis above, reducing the total error can be attempted from the perspective of data, which offers $E$; model, which offers $\mathcal{H}$; and algorithm, which searches through $\mathcal{H}$ for the parameter $\theta$ of the best hypothesis.
Sample complexity
For given $(\epsilon, \delta)$, the sample complexity $S$ is the smallest integer such that for any $I \geq S$, we have:

$$\Pr\big(R(h_I) - R(h^*) \leq \epsilon\big) \geq 1 - \delta$$
In short, the sample complexity is the number of samples needed for the estimation error to be small enough with high probability.
For an infinite hypothesis space $\mathcal{H}$, its complexity can be measured in terms of the Vapnik–Chervonenkis (VC) dimension, defined as the size of the largest set of inputs that can be shattered (split in all possible ways) by $\mathcal{H}$. The sample complexity is then tightly bounded as:

$$S = \Theta\!\left(\frac{\mathrm{VC}(\mathcal{H}) + \ln(1/\delta)}{\epsilon^{2}}\right)$$
The sample complexity increases with a more complicated $\mathcal{H}$, a higher required probability $1 - \delta$ that the learned hypothesis is approximately correct, and a higher demanded accuracy $\epsilon$.
Unreliable empirical risk minimizer
For the estimation error, we have:

$$\mathcal{E}_{\text{est}}(\mathcal{H}, I) \to 0 \quad \text{as} \quad I \to \infty,$$

which means more examples can help reduce the estimation error. Besides, by uniform convergence of the empirical risk to the expected risk over $\mathcal{H}$, there is (with probability at least $1 - \delta$):

$$\mathcal{E}_{\text{est}}(\mathcal{H}, I) \leq O\!\left(\sqrt{\frac{\mathrm{VC}(\mathcal{H}) \ln I + \ln(1/\delta)}{I}}\right)$$
Thus, in the common setting of supervised learning tasks with ample data, the empirical risk minimizer $h_I$ can provide a good and stable approximation to the best possible $h^*$ in $\mathcal{H}$.
However, in the setting of few-shot learning, the number of available examples $I$ is smaller than the required sample complexity $S$, so the empirical risk minimizer $h_I$ is no longer reliable. This is the core issue of FSL.
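As a quick illustration (my own toy experiment, not from the survey), the following sketch fits a least-squares model on $I$ samples and evaluates its risk on a large held-out set; with $I$ far below the input dimension, the empirical risk minimizer generalizes poorly.

```python
import numpy as np

# Fit a least-squares model on I samples and approximate the expected risk
# on a large held-out set, as I varies from few-shot to many-shot.
rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=0.5, size=n)
    return X, y

X_test, y_test = make_data(10_000)  # proxy for the true distribution
for n in [5, 20, 100, 1000]:
    X, y = make_data(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # empirical risk minimizer
    risk = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"I={n:5d}  expected risk ~ {risk:.3f}")
```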
Taxonomy of current works
The existing works deal with the problem of the unreliable empirical risk minimizer from the following perspectives:
Data
Use prior knowledge to augment $E$ (i.e., the training data).
Model
Design $\mathcal{H}$ based on prior knowledge in the experience $E$ to constrain its complexity and reduce its sample complexity $S$.
Algorithm
Use prior knowledge for a better initial point or better search steps (meta-learning, etc.).
Data focused approaches
Since few-shot learning suffers from a limited amount of data, the intuitive way to handle it is to augment the training set with more data.
Duplicate training data with transformations
This strategy augments the training set by duplicating each sample into several samples, applying some transformation to bring in variation. These transformations are as follows:
Handcrafted Rule
Translating, flipping, shearing, scaling, reflecting, cropping and rotating, etc., as sketched below.
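A minimal sketch of such handcrafted augmentation on a raw image array; the specific transforms and the `augment` helper are illustrative choices.

```python
import numpy as np

# Duplicate one image into several transformed copies via handcrafted rules.
def augment(img):
    """Yield transformed copies of an (H, W) image array."""
    yield np.fliplr(img)                 # horizontal flip
    yield np.flipud(img)                 # vertical flip
    for k in (1, 2, 3):
        yield np.rot90(img, k)           # 90/180/270-degree rotations
    h, w = img.shape
    yield img[h // 8 : -h // 8, w // 8 : -w // 8]  # central crop

img = np.arange(64.0).reshape(8, 8)
augmented = list(augment(img))
print(len(augmented))  # 6 extra samples from one original
```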
Learned Transformation
This strategy uses transformations learned from prior knowledge to augment the training set.
Directly using handcrafted rules in FSL, without considering the task or the desired data properties available in the training set, can easily make the estimation of the distribution go astray. Hence, such rules can only mitigate rather than solve the FSL problem, and are typically used as a pre-processing step for image data.
Using learned transformations can produce more suitable samples, as the augmentation is data-driven and exploits prior knowledge akin to the task. However, this prior knowledge needs to be extracted from similar tasks, which may not always be available and can be costly to collect.
Borrow from other data sets
This strategy borrows samples from other data sets and adapts them to resemble samples of the target task, so that they can be added to the supervised information in $E$.
Unlabeled Data Set
This strategy uses a large set of unlabeled samples as prior knowledge, which possibly contains samples with the same labels as those in the training set.
The crux is to find the samples with the same labels and add them to augment the training set.
Similar Data Set
This strategy augments the training set by aggregating sample pairs from other similar (in terms of classes) many-shot data sets.
The underlying assumption is that the same hypothesis applies to all classes, so the within-class variation of the many-shot classes can be transferred to the few-shot classes. Therefore, new samples can be generated as a weighted average of sample pairs from the similar data set, where the weight is usually some similarity measure, as in the sketch below.
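A toy sketch of this idea, assuming feature vectors and cosine similarity as the weighting; the `synthesize` helper and all shapes are illustrative, not the survey's exact procedure.

```python
import numpy as np

# Synthesize new few-shot samples as similarity-weighted averages of the
# few-shot sample and its nearest candidates from a similar many-shot pool.
rng = np.random.default_rng(0)
few_shot_x = rng.normal(size=(1, 16))   # the single sample we have
similar_x = rng.normal(size=(100, 16))  # many-shot samples from similar classes

def synthesize(x, pool, n_new=5):
    # Cosine similarity of the few-shot sample to every candidate in the pool.
    sims = pool @ x[0] / (np.linalg.norm(pool, axis=1) * np.linalg.norm(x[0]) + 1e-8)
    nearest = np.argsort(sims)[-n_new:]            # most similar candidates
    w = np.clip(sims[nearest, None], 0.0, 1.0)     # similarity acts as the weight
    return w * pool[nearest] + (1 - w) * x         # weighted average per pair

new_samples = synthesize(few_shot_x, similar_x)
print(new_samples.shape)  # (5, 16) synthesized samples
```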
The use of unlabeled data is cheap, but the resulting samples are of lower quality.
A similar data set is more informative, but collecting it may be laborious, and determining the key property of similarity can be subjective.
Due to the imprecise prior knowledge guiding the augmentation (the true distribution is lacking), the generation of new samples is not precise. The gap between the estimated distribution and the ground truth largely degrades the data quality, possibly even leading to concept drift.
Model focused approaches
Another approach is to constrain the hypothesis space $\mathcal{H}$ determined by the model, in which each hypothesis is parameterized as $h(\cdot; \theta)$.
Multitask learning
This strategy learns multiple learning tasks simultaneously, exploiting the generic information shared across tasks and the specific information of each task. These tasks are usually related.
When the tasks are from different domains, this is also called domain adaptation.
As these tasks are related, they are assumed to have similar or overlapping hypothesis spaces. This is made explicit by sharing parameters among the tasks, which can be viewed as a way to constrain each task's hypothesis space by the other jointly learned tasks.
Hard Parameter Sharing
It explicitly shares parameters among tasks to promote overlapping hypothesis spaces, and additionally learns a task-specific parameter for each task.
It's actually the prevailing methodology in object detection, where the regression head shares the same parameters in the deep layers with the classification head; a sketch follows.
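A minimal PyTorch sketch of hard parameter sharing, assuming a shared trunk with a classification head and a regression head (the module names and sizes are illustrative):

```python
import torch
import torch.nn as nn

# One shared trunk plus small task-specific heads; both task losses
# backpropagate into the trunk, constraining it jointly.
class SharedTrunkModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_cls=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared
        self.cls_head = nn.Linear(hidden, n_cls)  # classification head
        self.reg_head = nn.Linear(hidden, 4)      # e.g., box-regression head

    def forward(self, x):
        z = self.trunk(x)
        return self.cls_head(z), self.reg_head(z)

model = SharedTrunkModel()
logits, boxes = model(torch.randn(8, 32))
print(logits.shape, boxes.shape)
```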
Soft Parameter Sharing
It only encourages the parameters of different tasks to be similar, resulting in similar hypothesis spaces, while each task still has its own hypothesis space and parameter $\theta_t$. This can be done by regularizing the distance between the $\theta_t$'s.
If the relations between tasks are given, this regularizer can become a graph Laplacian regularizer on the similarity graph of tasks; the relations then guide the information flow between tasks. A minimal sketch of the pairwise case follows.
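A minimal PyTorch sketch of soft parameter sharing between two tasks, where an $\ell_2$ penalty pulls the task parameters together instead of tying them (the names and the penalty weight are illustrative):

```python
import torch
import torch.nn as nn

# Each task keeps its own layer; a penalty encourages similar parameters.
net_a = nn.Linear(16, 3)  # task A's parameters
net_b = nn.Linear(16, 3)  # task B's parameters
opt = torch.optim.SGD(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-2)
ce, lam = nn.CrossEntropyLoss(), 0.1

def joint_step(xa, ya, xb, yb):
    opt.zero_grad()
    loss = ce(net_a(xa), ya) + ce(net_b(xb), yb)
    # Encourage similarity instead of enforcing identical parameters.
    for pa, pb in zip(net_a.parameters(), net_b.parameters()):
        loss = loss + lam * ((pa - pb) ** 2).sum()
    loss.backward()
    opt.step()

joint_step(torch.randn(4, 16), torch.randint(0, 3, (4,)),
           torch.randn(4, 16), torch.randint(0, 3, (4,)))
```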
Embedding learning
I think this is also known as metric learning.
This strategy learns an embedding function which maps samples into a smaller embedding space $\mathcal{Z}$, where similar and dissimilar pairs can be easily identified.
The embedding function is mainly learned from prior knowledge, and can additionally use the training set to bring in task-specific information. The training samples and the test sample may be embedded by different functions $f$ and $g$. The prediction is then made by assigning the test sample to the class of its most similar embedded training sample.
Task-specific
It learns an embedding function tailored to the task's data distribution. Given the few shots in the training set, the number of training examples is enlarged by enumerating all pairwise comparisons between examples as input pairs. A model is then learned to verify whether an input pair has the same or different labels.
Task-invariant
It learns the embedding function from a large set of data sets. The assumption is that if many data sets are well separated in the space embedded by this function, it is general enough to also work well for the new task without retraining.
Combine Task-invariant and Task-specific
It learns to adapt the generic task-invariant embedding space, learned from prior knowledge, using the task-specific information contained in the training set.
This also covers the prevailing use of a pretrained CNN as a feature extractor in the literature.
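A minimal sketch in the spirit of prototype-based embedding methods: a stand-in pre-trained embedder maps samples into the embedding space, and queries are assigned to the class with the nearest mean embedding (the random projection and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))   # stands in for a learned embedding function
embed = lambda x: x @ W         # maps raw features to the embedding space

def classify(support_x, support_y, query_x):
    """Assign each query to the class with the nearest mean embedding."""
    z_support, z_query = embed(support_x), embed(query_x)
    classes = np.unique(support_y)
    # One prototype per class: the mean of its support embeddings.
    protos = np.stack([z_support[support_y == c].mean(axis=0) for c in classes])
    dists = ((z_query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# 5-way 1-shot episode with random data, just to show the shapes involved.
support_x, support_y = rng.normal(size=(5, 64)), np.arange(5)
query_x = rng.normal(size=(10, 64))
print(classify(support_x, support_y, query_x))
```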
Learning with external memory
This strategy memorizes the needed knowledge in an external memory, to be retrieved or updated later; this relieves the burden of learning and allows fast generalization.
Denote the memory as $M$, with memory slots $M(i)$. Given a sample $x$, it is first embedded by a function $f$ as a query $f(x)$; the query then attends to each slot through some similarity measure $s(f(x), M(i))$, e.g., cosine similarity, based on which the prediction is made.
When a new sample comes, the relevant contents are extracted from the memory and combined into a local approximation for this sample. The prediction is then made based on this approximation, as in the sketch below.
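A minimal sketch of such memory-based prediction, assuming embedded keys with label values and softmax attention over cosine similarities (all names and shapes are illustrative):

```python
import numpy as np

# Keys are stored embeddings, values are their labels; a query attends to
# every slot by cosine similarity, and the retrieved contents are combined.
rng = np.random.default_rng(0)
keys = rng.normal(size=(20, 8))   # memory slots M(i): stored embeddings
values = rng.integers(0, 5, 20)   # label associated with each slot

def predict(query, n_classes=5):
    sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
    attn = np.exp(sims) / np.exp(sims).sum()  # softmax attention weights
    scores = np.zeros(n_classes)
    for w, v in zip(attn, values):            # combine retrieved contents
        scores[v] += w
    return scores.argmax()

print(predict(rng.normal(size=8)))
```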
It relies on human knowledge to design the desired update rule, and the existing works show no clear winner. How to automatically design or choose update rules for different settings remains an important issue.
It sounds rather traditional to use memory to solve the FSL problem.
Generative models
Here, generative modeling refers to methods that involve learning the distribution $p(x \mid y)$. They use both prior knowledge and the training set to obtain the estimated distribution.
A prior probability distribution over the parameters is learned from a set of data sets, which is usually large and disjoint from the training set; the probability distribution is then updated on the training set for prediction.
By Bayes' rule, we have:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

in which $\theta$ is the parameter of $p(x \mid y; \theta)$. If the data set $D$ is large enough, we can use it to learn a well-peaked posterior $p(\theta \mid D)$, and obtain the predictive distribution using:

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$
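As a worked toy example (my own, assuming a Gaussian class-conditional with known variance and a conjugate Gaussian prior over its mean, learned from other classes), one observation already yields a usable posterior:

```python
# One-shot conjugate Gaussian update: posterior over the class mean theta.
prior_mu, prior_var = 0.0, 4.0  # prior over the class mean, from other classes
obs_var = 1.0                    # known observation noise
x = 2.5                          # the single available example of the new class

post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
post_mu = post_var * (prior_mu / prior_var + x / obs_var)
print(post_mu, post_var)  # a well-informed estimate despite one sample
```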
For FSL, generative models assume that the prior $p(\theta)$ is transferable across different tasks (e.g., classes). Hence, it can instead be learned from a large set of other data sets.
Part and Relation (e.g. Bayesian One-Shot, BPL)
This strategy learns parts and relations from a large set of data sets as prior knowledge.
For a new sample, the model needs to infer the correct combination of parts and relations, then decide which target class this combination belongs to.
Super Class (e.g. HB, HDP-DBM)
This strategy clusters similar classes into super-classes by unsupervised learning, and finds the best parameters for these super-classes as prior knowledge.
A new class is first assigned to a super-class; its best parameters are then found by adapting the super-class's prior.
Latent Variable
This strategy models latent variables, with no explicit meaning, that are shared across classes.
All model-focused methods design $\mathcal{H}$ based on prior knowledge in the experience $E$ to constrain its complexity and reduce its sample complexity $S$.
Algorithm focused approaches
The so-called algorithm here denotes the strategy used to search for the parameter $\theta$ of the best hypothesis $h^*$ in the hypothesis space $\mathcal{H}$. The prevailing algorithms for this search are rooted in gradient descent:

$$\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} \ell\big(h(x_t; \theta_t), y_t\big)$$

in which $\alpha_t$ is the step size and $\ell(h(x_t; \theta_t), y_t)$ measures the loss incurred by $\theta_t$ at the $t$-th iteration.
The unreliable empirical risk minimizer makes this approach unreliable in the few-shot setting. The algorithm-centered methods aim to improve the search for the best $\theta$ by taking advantage of prior knowledge.
Refining existing parameters
This strategy takes a parameter $\theta_0$ from a pre-trained model as a good initialization and adapts it to the training set.
It assumes that, since $\theta_0$ captures general structures learned from large-scale data, it can be adapted within a few iterations to work well on the new task.
Fine-tune with regularization
It fine-tunes the pre-trained parameters on the training set with regularization to prevent overfitting; a sketch follows.
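A minimal PyTorch sketch of this idea, regularizing the fine-tuned weights toward the pre-trained ones rather than toward zero (the backbone, head, and penalty weight are illustrative assumptions):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # stands in for a pre-trained model
head = nn.Linear(32, 5)                                  # fresh head for a 5-way task

pretrained = {n: p.detach().clone() for n, p in backbone.named_parameters()}
opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=1e-2)
ce, lam = nn.CrossEntropyLoss(), 1e-2

def finetune_step(x, y):
    opt.zero_grad()
    loss = ce(head(backbone(x)), y)
    # Regularize toward the pre-trained weights instead of toward zero,
    # so that few-shot updates stay close to the prior solution.
    for n, p in backbone.named_parameters():
        loss = loss + lam * ((p - pretrained[n]) ** 2).sum()
    loss.backward()
    opt.step()
    return loss.item()

finetune_step(torch.randn(10, 64), torch.randint(0, 5, (10,)))
```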
Pick and combine a set of parameters
It picks parameters from a set of models trained for relevant tasks and combines them into a suitable initialization, which is then adapted using the training set.
Fine-tune with new parameters
It fine-tunes the pre-trained model on the training set with extra task-specific parameters attached.
Most of these designs are still heuristic and lack interpretability.
Refining meta-learned $\theta$
This strategy directly refines the meta-learned parameter $\theta$.
Iteratively, the meta-learner parameterized by $\theta$ provides information about parameter updates to each task's learner, and the learner returns error signals to the meta-learner to improve it.
Refining by gradient descent (e.g. MAML)
It simply refines the meta-learned $\theta$ by gradient descent on the new task, as in the sketch below.
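A minimal PyTorch sketch of a MAML-style update on toy regression tasks: the inner loop adapts the shared initialization with one gradient step, and the outer loop backpropagates through that step (the task construction and hyperparameters are illustrative):

```python
import torch

torch.manual_seed(0)

# Toy regression tasks: y = a * x, each task has its own slope `a`.
def sample_task():
    a = torch.randn(1)
    x_tr, x_te = torch.randn(5, 1), torch.randn(5, 1)
    return (x_tr, a * x_tr), (x_te, a * x_te)

w = torch.zeros(1, requires_grad=True)  # meta-learned initialization
meta_opt = torch.optim.SGD([w], lr=1e-2)
inner_lr = 0.1

for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):  # a batch of tasks
        (x_tr, y_tr), (x_te, y_te) = sample_task()
        # Inner loop: one gradient step from the shared initialization w.
        loss_tr = ((x_tr * w - y_tr) ** 2).mean()
        (g,) = torch.autograd.grad(loss_tr, w, create_graph=True)
        w_task = w - inner_lr * g  # task-specific parameters
        # Outer loss: evaluate adapted parameters on the task's test split.
        meta_loss = meta_loss + ((x_te * w_task - y_te) ** 2).mean()
    meta_loss.backward()  # backpropagate through the inner update
    meta_opt.step()
```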
Refining in consideration of uncertainty
Basically, it models the distribution of $\theta$ and tries to pick a task-specific $\theta_t$ based on the task's training set.
For this approach to work, the tasks over which the meta-learner learns should intuitively be related.
Learning search steps
Its idea is to automatically produce proper gradient update steps, a good step size, or even a good initialization to start with.
It appears to me as the cooperative training of a meta-learner and a target learner.
All algorithm-focused methods take advantage of prior knowledge to search for the $\theta$ that parameterizes the best hypothesis $h^*$.
The former two both aim at a better initialization: the first combines pre-trained parameters with customized parts, while the second utilizes the meta-learner for initialization. The latter alters the search steps directly, based on prior knowledge.
Future works
FSL methods attempt to obtain a reliable empirical risk minimizer from the perspectives of data (Section 3), model (Section 4), and algorithm (Section 5) of the survey. Each component of these methods can be replaced by more recent and advanced ones.
Inspirations
The key problem of few-shot learning is the large estimation error introduced by empirical risk minimization under a limited number of samples; the solutions either reduce the required sample complexity or find a way to effectively reach it.
From the perspective of reaching the sample complexity, the direct thought is to augment the training data, so that the few-shot task effectively becomes a many-shot one. These augmentations either construct new samples from the original ones or borrow samples from larger data sets.
From the perspective of reducing the sample complexity, the approach is to constrain parts of the model, e.g., with pre-trained parameters, yielding a smaller hypothesis space.
From the perspective of compensating for unreliable empirical risk minimization, the approach is to adopt a better initialization or to alter the search process.
Solving the FSL problem purely from the few-shot dataset is impossible; all of the methods utilize prior knowledge, explicitly or implicitly.