Summarized Survey of Few-shot Learning

By LI Haoyang 2020.10.12 ~ 2020.10.13

This note summarizes a review; no explicit quotations are marked, and most of the following sentences are taken directly from the original article.

Content

- Summarized Survey of Few-shot Learning
  - Content
  - Few-shot learning: a survey
    - Problem definition
    - Relevant problems
    - Core issue of FSL
    - Taxonomy of current works
    - Data focused approaches
      - Duplicate training data with transformations
      - Borrow from other data sets
    - Model focused approaches
      - Multitask learning
      - Embedding learning
      - Learning with external memory
      - Generative models
    - Algorithm focused approaches
      - Refining existing parameters
      - Refining meta-learned parameters
      - Learning search steps
    - Future works
    - Inspirations

Few-shot learning: a survey

Yaqing Wang, Quanming Yao. Few-shot Learning: A Survey. 2019. arXiv:1904.05046

Few-shot learning (FSL) aims to improve a model's ability to generalize when trained on a small dataset. FSL can learn a new task from limited supervised information by incorporating prior knowledge.

Problem definition

Start from the definition of machine learning, which is:

A computer program is said to learn from experience $E$ with respect to some classes of task $T$ and performance measure $P$ if its performance can improve with $E$ on $T$ measured by $P$. (Machine Learning, Mitchell)

FSL is a special case of machine learning that targets good learning performance given limited supervised information:

Few-Shot Learning (FSL) is a type of machine learning problem (specified by $E$, $T$ and $P$) where $E$ contains only a little supervised information for the target task $T$.

The typical scenarios of FSL can be summarized as:

- acting as a test bed for learning like humans;
- learning for rare cases where supervised examples are hard to acquire;
- reducing the effort and cost of data gathering and labeling.

Therefore, FSL methods combine prior knowledge with the supervised information available in $E$ to make learning the target task $T$ feasible.

Relevant problems

FSL is related to problems such as semi-supervised learning, imbalanced learning, transfer learning, and meta-learning, but it differs in that only a few supervised examples of the target task are available.

Core issue of FSL

Start from the machine learning problem: improving performance $P$ with experience $E$ on task $T$. Given training data $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{I}$, learning is about the algorithm searching the hypothesis space $\mathcal{H}$, provided by the model, for the parameter $\theta$ of the hypothesis $h(\cdot;\theta)$ that best fits the data $D_{\text{train}}$.

Decomposition of error

Let $\hat{y} = h(x;\theta)$ be the prediction of hypothesis $h$ for input $x$. To solve the problem defined above, the essence is to minimize the expected risk $R(h)$ (i.e. the loss measured with respect to the joint distribution $p(x, y)$), formulated as:

$$R(h) = \int \ell\big(h(x;\theta), y\big)\, \mathrm{d}p(x, y) = \mathbb{E}_{(x,y)\sim p}\big[\ell\big(h(x;\theta), y\big)\big].$$

Since $p(x, y)$ is unknown in practice, the empirical risk $R_I(h)$ is used to estimate the expected risk $R(h)$, defined as the average of the sample losses over the training set of $I$ samples:

$$R_I(h) = \frac{1}{I} \sum_{i=1}^{I} \ell\big(h(x_i;\theta), y_i\big).$$

The learning is done by empirical risk minimization.

In short, empirical risk minimization assumes that every sample in the training set has the same probability.
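To make this assumption concrete (a small derivation in the notation above, not taken from the survey): minimizing $R_I(h)$ is the same as minimizing the expected loss under the empirical distribution that puts equal mass $1/I$ on each training sample,

$$\hat{p}(x, y) = \frac{1}{I}\sum_{i=1}^{I}\delta_{(x_i, y_i)}(x, y), \qquad \mathbb{E}_{(x,y)\sim\hat{p}}\big[\ell(h(x;\theta), y)\big] = \frac{1}{I}\sum_{i=1}^{I}\ell\big(h(x_i;\theta), y_i\big) = R_{I}(h).$$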

Let:

$$\hat{h} = \arg\min_{h} R(h), \qquad h^{*} = \arg\min_{h \in \mathcal{H}} R(h), \qquad h_{I} = \arg\min_{h \in \mathcal{H}} R_{I}(h),$$

i.e. $\hat{h}$ is the function that minimizes the expected risk, $h^{*}$ is the best hypothesis in $\mathcal{H}$, and $h_{I}$ is the empirical risk minimizer in $\mathcal{H}$.

The total error of learning, taken with respect to the random choice of training set, can be decomposed into:

$$\mathbb{E}\big[R(h_{I}) - R(\hat{h})\big] = \underbrace{\mathbb{E}\big[R(h^{*}) - R(\hat{h})\big]}_{\mathcal{E}_{\text{app}}(\mathcal{H})} + \underbrace{\mathbb{E}\big[R(h_{I}) - R(h^{*})\big]}_{\mathcal{E}_{\text{est}}(\mathcal{H}, I)},$$

where the approximation error $\mathcal{E}_{\text{app}}(\mathcal{H})$ measures how closely functions in $\mathcal{H}$ can approximate the optimal $\hat{h}$, and the estimation error $\mathcal{E}_{\text{est}}(\mathcal{H}, I)$ measures the effect of minimizing the empirical risk instead of the expected risk.

Given the analysis above, learning to reduce the total error can be attempted from the perspective of data, which offers $D_{\text{train}}$; model, which offers the hypothesis space $\mathcal{H}$; and algorithm, which searches through $\mathcal{H}$ for the parameter $\theta$ of the best hypothesis $h^{*}$.

Sample complexity

For $\epsilon, \delta \in (0, 1)$, the sample complexity $N(\epsilon, \delta)$ is the smallest integer such that, for any number of training samples $I \ge N(\epsilon, \delta)$, we have:

$$\Pr\big[R(h_{I}) - R(h^{*}) \le \epsilon\big] \ge 1 - \delta.$$

In short, sample complexity is the number of samples we need for the estimation error to be small enough.

For an infinite hypothesis space $\mathcal{H}$, its complexity can be measured in terms of the Vapnik–Chervonenkis (VC) dimension, defined as the size of the largest set of inputs that can be shattered (split in all possible ways) by $\mathcal{H}$. The sample complexity is tightly bounded as:

$$N(\epsilon, \delta) = \Theta\!\left(\frac{\mathrm{VCdim}(\mathcal{H}) + \ln(1/\delta)}{\epsilon^{2}}\right).$$

The sample complexity increases with a more complicated $\mathcal{H}$ (a larger VC dimension), a higher required probability $1-\delta$ that the learned $h$ is approximately correct, and a smaller tolerance $\epsilon$ on the estimation error.
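As a rough back-of-the-envelope illustration (treating the constant hidden in $\Theta(\cdot)$ as $1$, which is only an assumption for the arithmetic): with $\mathrm{VCdim}(\mathcal{H}) = 10$, $\epsilon = 0.1$ and $\delta = 0.01$,

$$N(\epsilon, \delta) \approx \frac{10 + \ln(1/0.01)}{0.1^{2}} \approx \frac{10 + 4.6}{0.01} \approx 1460,$$

already far more than the handful of labelled examples per class in a typical few-shot task.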

Unreliable empirical risk minimizer

For the estimation error, with probability at least $1 - \delta$ we have:

$$\mathcal{E}_{\text{est}}(\mathcal{H}, I) \le O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}) + \ln(1/\delta)}{I}}\right),$$

which means more examples help reduce the estimation error. Besides, there is (why?):

$$R(h_{I}) \rightarrow R(h^{*}) \quad \text{as} \quad I \rightarrow \infty.$$

Thus, in the common setting of a supervised learning task with sufficient training data, the empirical risk minimizer $h_{I}$ can provide a good and stable approximation $R(h_{I})$ to the best possible $R(h^{*})$ for hypotheses in $\mathcal{H}$.

However, in the setting of few-shot learning, the number of available examples $I$ is much smaller than the required sample complexity $N(\epsilon, \delta)$. The empirical risk minimizer $h_{I}$ is then no longer reliable, which is the core issue of FSL.

Taxonomy of current works

The existing works deal with the problem of the unreliable empirical risk minimizer from the following perspectives:

- Data: augment $D_{\text{train}}$ using prior knowledge, so that more samples are available.
- Model: constrain the hypothesis space $\mathcal{H}$ using prior knowledge, so that a smaller $\mathcal{H}$ suffices.
- Algorithm: use prior knowledge to guide the search for the parameter $\theta$ of the best hypothesis in $\mathcal{H}$.

Data focused approaches

Since few-shot learning suffers from a limited amount of data, the intuitive way to handle it is to augment the training set $D_{\text{train}}$ with more data.

Duplicate training data with transformations

This strategy augments $D_{\text{train}}$ by duplicating each sample $(x_i, y_i)$ into several samples with some transformation that brings in variation. These transformations can be handcrafted rules or learned transformation functions:

Directly applying handcrafted rules (e.g. translation, flipping, cropping, or scaling of images) in FSL without considering the task or the data properties available in $E$ can easily lead the estimated distribution astray. Hence they can only mitigate rather than solve the FSL problem and are mainly used as a pre-processing step for image data.
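A minimal sketch of such handcrafted duplication on array data (purely illustrative; the flip/shift rules and shapes are assumptions, not taken from the survey):

```python
import numpy as np

def augment(x, y, n_shift=2):
    """Duplicate one labelled image-like array (H, W) into several transformed copies."""
    samples = [(x, y)]
    samples.append((np.fliplr(x), y))                 # horizontal flip
    for s in range(1, n_shift + 1):
        samples.append((np.roll(x, s, axis=1), y))    # small horizontal shift right
        samples.append((np.roll(x, -s, axis=1), y))   # small horizontal shift left
    return samples

# toy usage: one 8x8 "image" with label 0 becomes 1 + 1 + 2*n_shift samples
x, y = np.random.rand(8, 8), 0
print(len(augment(x, y)))   # 6
```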

Using learned transformations can generate more suitable samples, as it is data-driven and exploits prior knowledge akin to the data in $E$ or the task $T$. However, this prior knowledge needs to be extracted from similar tasks, which may not always be available and can be costly to collect.

Borrow from other data sets

This strategy borrows samples from other data sets (e.g. weakly supervised or unlabeled data, or a similar data set) and adapts them to look like samples of the target task, so that they can be added to the supervised information in $E$.


The use of weakly supervised or unlabeled data is cheap, but such samples are of lower quality.

A similar data set is more informative, but collecting it may be laborious, and determining the key property of similarity can be subjective.

Due to the imprecise prior knowledge guiding the augmentation (the lack of samples from the true distribution $p(x, y)$), the generation of new samples is not precise. The gap between the estimated distribution and the ground truth largely harms the data quality, and can even lead to concept drift.

Model focused approaches

Another approach is to constrain the hypothesis space $\mathcal{H}$, which is determined by the model as a family of parameterized hypotheses $h(\cdot;\theta)$.

Multitask learning

This strategy learns multiple tasks simultaneously, exploiting the generic information shared across tasks and the specific information of each task. These tasks are usually related.

When the tasks are from different domains, it's also called domain adaptation.

As these tasks are related, they are assumed to have similar or overlapping hypothesis spaces $\mathcal{H}$. This is done explicitly by sharing parameters among the tasks, which can be viewed as a way to constrain the $\mathcal{H}$ of each task by the other jointly learned tasks.
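A minimal sketch of hard parameter sharing across two hypothetical tasks (the layer sizes, task definitions, and data are assumptions for illustration only):

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Parameters shared by all tasks: this sharing constrains every task's hypothesis space."""
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)

backbone = SharedBackbone()
head_a = nn.Linear(64, 5)    # task A: a 5-way few-shot classification head
head_b = nn.Linear(64, 3)    # task B: a related task with plenty of data

params = list(backbone.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

xa, ya = torch.randn(4, 32), torch.randint(0, 5, (4,))    # few-shot batch for task A
xb, yb = torch.randn(64, 32), torch.randint(0, 3, (64,))  # larger batch for task B

loss = nn.functional.cross_entropy(head_a(backbone(xa)), ya) \
     + nn.functional.cross_entropy(head_b(backbone(xb)), yb)
opt.zero_grad()
loss.backward()   # the shared backbone receives gradients from both tasks
opt.step()
```

The shared backbone is the explicit parameter sharing: gradients from the data-rich task also shape the representation used by the few-shot task.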

Embedding learning

I think it is also known as metric learning.

This strategy learns a hypothesis space whose hypotheses embed each sample $x_i \in \mathcal{X}$ into a smaller embedding space $\mathcal{Z}$, where similar and dissimilar pairs can be easily identified.

The embedding function is mainly learned from prior knowledge and can additionally use $D_{\text{train}}$ to bring in task-specific information. The training sample $x_i$ and the test sample $x_{\text{test}}$ may be embedded by different functions $f$ and $g$. The prediction is then made by assigning $x_{\text{test}}$ to the class of the most similar embedded training sample, according to a similarity measure $s(\cdot,\cdot)$ in $\mathcal{Z}$.

This also corresponds to the prevailing use of a pretrained CNN as a feature extractor in the literature.
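A minimal sketch of nearest-neighbour classification in an embedding space (the fixed random projection stands in for a learned embedding network; the shapes and the cosine similarity are assumptions):

```python
import numpy as np

def embed(x, W):
    """Stand-in embedding function: a fixed linear map followed by L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))                        # pretend these weights were learned from prior tasks

# 5-way 1-shot support set and one query
support_x = rng.normal(size=(5, 32))
support_y = np.arange(5)
query_x = support_x[2] + 0.1 * rng.normal(size=32)   # a noisy copy of class 2

prototypes = embed(support_x, W)                     # one embedded example per class
query_z = embed(query_x, W)
similarity = prototypes @ query_z                    # cosine similarity (vectors are unit norm)
print(support_y[np.argmax(similarity)])              # predicted class, ideally 2
```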

Learning with external memory

This strategy memorizes the needed knowledge in an external memory to be retrieved or updated, hence relieving the burden of learning and allowing fast generalization.

Denote the memory as $M$, with memory slots $M(i)$, $i = 1, \dots, b$. Given a sample $x_{\text{test}}$, it is first embedded by a function $f$ as a query $q = f(x_{\text{test}})$; the query then attends to each slot $M(i)$ through some similarity measure $s(q, M(i))$, e.g. cosine similarity, based on which the prediction is made.

When a new sample arrives, the relevant contents are extracted from the memory and combined into a local approximation for this sample; the prediction is then made based on this approximation.
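A minimal sketch of such a memory read (the slot contents, sizes, and softmax-weighted readout are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
keys = rng.normal(size=(8, 16))      # keys of the 8 memory slots M(i), 16-d embeddings
values = np.eye(8)[:, :5]            # stored label information for 5 classes (toy content)

def read(query):
    """Attend over slots with cosine similarity and return a softmax-weighted readout."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = k @ q                                  # s(q, M(i)) for each slot
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax attention weights
    return weights @ values                       # the combined local approximation

query = rng.normal(size=16)          # embedded test sample f(x_test)
print(read(query).round(3))          # class scores read from memory
```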

It relies on human knowledge to design the desired update rule, and among existing works there is no clear winner. How to design or choose update rules automatically according to different settings is an important open issue.

It sounds rather traditional to use memory to solve the FSL problem.

Generative models

Here, generative modeling refers to methods that involve learning the distribution $p(x, y)$. They use both prior knowledge and $D_{\text{train}}$ to obtain the estimated distribution.

A prior probability distribution over the parameter $\theta$ of each class is learned from a set of other data sets, which is usually large and disjoint from $D_{\text{train}}$; the few-shot samples in $D_{\text{train}}$ then update this distribution into a posterior used for prediction.

By Bayes' rule, we have:

$$p(\theta \mid D_{\text{train}}) = \frac{p(D_{\text{train}} \mid \theta)\, p(\theta)}{p(D_{\text{train}})} \propto p(D_{\text{train}} \mid \theta)\, p(\theta).$$

Here $\theta$ is the parameter of the likelihood $p(x \mid \theta)$. If the other data sets are large enough, we can use them to learn a well-peaked prior $p(\theta)$, and obtain the predictive distribution using:

$$p(x \mid D_{\text{train}}) = \int p(x \mid \theta)\, p(\theta \mid D_{\text{train}})\, \mathrm{d}\theta.$$

For FSL, generative models assume the prior $p(\theta)$ is transferable across different concepts (e.g. classes). Hence it can instead be learned from a large set of other data sets.
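A loose sketch in this spirit (not the survey's exact formulation): class-conditional Gaussians whose shared variance acts as prior knowledge estimated from a large auxiliary data set, while the few shots only supply the class means; prediction then follows Bayes' rule.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Prior knowledge": a per-dimension variance estimated once from a large auxiliary data set.
aux_data = rng.normal(scale=2.0, size=(10000, 8))
shared_var = aux_data.var(axis=0)

# Few-shot task: 3 classes with 1 example each; the shots only provide the class means.
class_means = rng.normal(scale=2.0, size=(3, 8))
prior_y = np.full(3, 1.0 / 3.0)

def log_likelihood(x, mean):
    """log p(x | y) under a diagonal Gaussian with the shared (prior) variance."""
    return -0.5 * np.sum((x - mean) ** 2 / shared_var + np.log(2 * np.pi * shared_var))

x_test = class_means[1] + rng.normal(scale=0.5, size=8)
log_post = np.array([log_likelihood(x_test, m) for m in class_means]) + np.log(prior_y)
print(np.argmax(log_post))   # Bayes-rule prediction, ideally class 1
```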


All methods approaching FSL from model design use prior knowledge in the experience $E$ to constrain the complexity of $\mathcal{H}$ and thereby reduce its sample complexity.

Algorithm focused approaches

The so-called algorithm here denotes the strategy used to search for the parameter $\theta$ of the best hypothesis $h^{*}$ in the hypothesis space $\mathcal{H}$. The prevailing algorithms for this search are rooted in gradient descent:

$$\theta_{t+1} = \theta_{t} - \alpha_{t} \nabla_{\theta_{t}} \ell_{t},$$

where $\alpha_{t}$ is the step size and $\ell_{t}$ measures the loss incurred by $\theta_{t}$ at the $t$-th iteration.
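For reference, a minimal instance of this update rule on a toy quadratic loss (purely illustrative):

```python
# theta_{t+1} = theta_t - alpha_t * grad, on the toy loss (theta - 3)^2
theta, alpha = 5.0, 0.1
for t in range(50):
    grad = 2 * (theta - 3.0)   # gradient of the toy loss at theta_t
    theta -= alpha * grad
print(round(theta, 4))          # converges toward the minimizer 3.0
```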

With few examples the empirical risk minimizer is unreliable, which makes this plain search unreliable as well. The algorithm-focused methods aim to improve the search for the best $\theta$ by taking advantage of prior knowledge.

Refining existing parameters

This strategy takes the parameter $\theta_{0}$ of a pre-trained model as a good initialization and adapts it to $D_{\text{train}}$, e.g. by fine-tuning with a few gradient steps.

It assumes that since $\theta_{0}$ captures general structures by learning from large-scale data, it can be adapted with a few iterations to work well on $D_{\text{train}}$.
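A minimal sketch of one common instance of this strategy, freezing the pre-trained backbone and training only a new task head (the pretrained weights here are random stand-ins; all sizes are assumptions):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # pretend these weights are pre-trained (theta_0)
for p in backbone.parameters():
    p.requires_grad = False                               # keep the general structure fixed

head = nn.Linear(64, 5)                                   # new 5-way classifier for the target task
opt = torch.optim.SGD(head.parameters(), lr=1e-2)

x, y = torch.randn(25, 32), torch.randint(0, 5, (25,))    # the few available labelled examples
for step in range(20):                                    # only a few adaptation iterations
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```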

Most of these designs are still heuristic and lack interpretability.

Refining meta-learned parameters

This strategy directly targets the meta-learned parameter $\theta_{0}$, refining it for the new task.

Iteratively, the meta-learner (parameterized by $\theta_{0}$) provides information about parameter updates to the parameter $\phi_{s}$ of task $T_{s}$'s learner, and the learner returns error signals to the meta-learner to improve it.

Intuitively, for this approach to work, the tasks over which the meta-learner learns should be related.
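A minimal first-order, MAML-style sketch of this loop on a hypothetical toy regression family (the task distribution, model size, and learning rates are illustrative assumptions; MAML is just one concrete instance of meta-learning such an initialization):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

meta_model = nn.Linear(1, 1)                      # the meta-learned initialization theta_0
meta_opt = torch.optim.SGD(meta_model.parameters(), lr=1e-2)

def sample_task(n_support=5, n_query=5):
    """Hypothetical task family: regression y = a * x with a random slope a per task."""
    a = torch.randn(1)
    xs, xq = torch.randn(n_support, 1), torch.randn(n_query, 1)
    return (xs, a * xs), (xq, a * xq)

for step in range(200):
    (xs, ys), (xq, yq) = sample_task()
    learner = copy.deepcopy(meta_model)           # the task learner starts from the shared init
    inner_opt = torch.optim.SGD(learner.parameters(), lr=0.1)
    inner_opt.zero_grad()
    F.mse_loss(learner(xs), ys).backward()        # adapt on the few support examples
    inner_opt.step()

    learner.zero_grad()
    F.mse_loss(learner(xq), yq).backward()        # error signal from the query set
    for p_meta, p_task in zip(meta_model.parameters(), learner.parameters()):
        p_meta.grad = p_task.grad.clone()         # first-order approximation of the meta-gradient
    meta_opt.step()                               # improve the initialization theta_0
```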

Learning search steps

Its idea is to automatically produce the proper gradient update steps, a good step size, or even a good initialization to start with.

It appears to me as a cooperative training of a meta-learner and a target learner.


All algorithm-focused methods take advantage of prior knowledge to search for the $\theta$ that parameterizes the best hypothesis $h^{*}$.

The former two both aim at a better initialization: the first combines pre-trained parameters with customized parts, and the second utilizes a meta-learner to provide the initialization. The last one alters the search steps directly based on prior knowledge.

Future works

FSL methods attempt to obtain a reliable empirical risk minimizer from the perspectives of data (Section 3 of the survey), model (Section 4), and algorithm (Section 5). Each component of these methods can be replaced by more recent and advanced ones.

Inspirations

The key problem of few-shot learning is how to either reduce or reach the required sample complexity, given the estimation error introduced by empirical risk minimization under a limited number of samples.

From the perspective of reaching the sample complexity, the direct thought is to augment the training data, so that the few shots somehow become many shots. The augmentation either constructs new samples from the original ones or borrows samples from larger data sets.

From the perspective of reducing the sample complexity, the approach is to constrain parts of the model with pre-trained parameters for a smaller hypothesis space.

From the perspective of working around the unreliable empirical risk minimizer, the approach is to adopt a better initialization or to alter the search process.

Solving the FSL problem purely from the few-shot dataset itself is impossible; all of the methods utilize prior knowledge, explicitly or implicitly, in some way.

