论文阅读 An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Original Paper Reference：An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

这篇综述跟之前 NLP DA 综述的区别，abstract 部分说的很清楚：

we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.

　　主要这篇文章是从多种现有 DA 方法上基于 empirical study 来完成的。文章的行文思路是第二部分介绍在 NLP 领域现有的 DA 方法梳理，第三部分是 DA 一致性训练，第四部分为实验及分析，第五部分介绍了其他有限数据学习方法，第六部分是讨论和未来研究方向。

Introduction

　　依然是讲存在高质量手工标注数据缺少及语言更新迅速等问题。This highlights a need for learning algorithms that can be trained with a limited amount of labeled data.

　　文章的贡献点：

(1) summarize and categorize recent methods in textual data augmentation;
(2) compare different data augmentation methods through experiments with limited labeled data in supervised and semi-supervised settings on 11 NLP tasks,
(3) discuss current challenges and future directions of data augmentation, as well as learning with limited data in NLP more broadly.

　论文实验的结论是：

no single augmentation works best for every task,
but (i) token-level augmentations work well for supervised learning,
(ii) sentence-level augmentation usually works the best for semisupervised learning,
(iii) augmentation methods can sometimes hurt performance, even in the semi-supervised setting.

Data Augmentation For NLP

这部分主要从 token-level augmentations, sentence-level augmentations, adversarial augmentations and hidden-space augmentations 来分别叙述现有的增强方法。

Token-Level Augmentation

主要分三种，一种是设计的替换策略，一种是随机的增删替换，还有则是组合式的增强策略，具体如下图：

Sentence-Level Augmentation

对抗性DA

Hidden-Space Augmentation

Consistency Training with DA

　　这部分主要是介绍关于半监督中 DA 的应用——consistency regularization。其主要是：

regularizes a model by enforcing that its output doesn’t change significantly when the input is perturbed. In practice, the input is perturbed by applying data augmentation, and consistency is enforced through a loss term that measures the difference between the model’s predictions on a clean input and a corresponding perturbed version of the same input.

　　具体的形式化的描述为 $f_\theta$ be a model with parameters θ，以及$ f_{\hat{\theta}}$ be a fixed copy of the model where no gradients are allowed to flow, $x_l$ be a labeled datapoint with label $y$, $x_u$ be an unlabeled datapoint, and $\alpha(x)$ be a data augmentation method. Then, a typical loss function for consistency training is:

$CE( f_\theta(x_l, y)) + \lambda_u CE(f_{\hat{\theta}}(x_u), f_\theta(\alpha(x_u)) )$

various other measures have been used to minimize the difference between $f_{\hat{\theta}}(x_u)$ and $ f_\theta(\alpha(x_u))$

Empirical Experiments

　　Experiment with 10 labeled data points per class2in a supervised setup, and an additional 5000 unlabeled data points per class in the semi-supervised setup.