
Paper Reading: An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Original Paper Reference: An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

How this survey differs from previous NLP DA surveys is stated clearly in the abstract:

we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.

  This paper is primarily an empirical study over a range of existing DA methods. It is organized as follows: Section 2 reviews existing DA methods in NLP, Section 3 covers consistency training with DA, Section 4 presents the experiments and analysis, Section 5 introduces other limited-data learning methods, and Section 6 discusses challenges and future research directions.

Introduction

  As in most DA papers, the introduction motivates the problem with the scarcity of high-quality manually annotated data and the rapid evolution of language. This highlights a need for learning algorithms that can be trained with a limited amount of labeled data.

  The paper's contributions:

  • (1) summarize and categorize recent methods in textual data augmentation;

  • (2) compare different data augmentation methods through experiments with limited labeled data in supervised and semi-supervised settings on 11 NLP tasks;

  • (3) discuss current challenges and future directions of data augmentation, as well as learning with limited data in NLP more broadly.

  The experimental conclusions are:

  • no single augmentation works best for every task,

  • but (i) token-level augmentations work well for supervised learning,

  • (ii) sentence-level augmentation usually works the best for semi-supervised learning,

  • (iii) augmentation methods can sometimes hurt performance, even in the semi-supervised setting.

Data Augmentation For NLP

This section describes existing augmentation methods in four categories: token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations.

Token-Level Augmentation

Token-level methods fall into three main types: designed substitution strategies, random insertion/deletion/swap, and compositional augmentation strategies, as summarized in the taxonomy figure in the paper.
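As a concrete illustration of the random insertion/deletion/swap family (EDA-style), here is a minimal Python sketch; the function names and the deletion probability are illustrative and not taken from the paper:

```python
import random

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p."""
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = "data augmentation helps when labeled data is limited".split()
print(random_deletion(sentence))
print(random_swap(sentence))
```

Designed substitution strategies (e.g., synonym replacement from a thesaurus or a masked language model) follow the same pattern but replace tokens with semantically related ones instead of perturbing them randomly.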

Sentence-Level Augmentation

Adversarial Augmentation

Hidden-Space Augmentation

Consistency Training with DA

  This section introduces the main application of DA in semi-supervised learning, consistency regularization, whose core idea is:

regularizes a model by enforcing that its output doesn’t change significantly when the input is perturbed. In practice, the input is perturbed by applying data augmentation, and consistency is enforced through a loss term that measures the difference between the model’s predictions on a clean input and a corresponding perturbed version of the same input.

  Formally, let $f_\theta$ be a model with parameters $\theta$, $f_{\hat{\theta}}$ be a fixed copy of the model where no gradients are allowed to flow, $x_l$ be a labeled datapoint with label $y$, $x_u$ be an unlabeled datapoint, and $\alpha(x)$ be a data augmentation method. Then, a typical loss function for consistency training is:
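A typical form (my reconstruction from the definitions above, assuming cross-entropy on the labeled example and a divergence $D$, commonly the KL divergence, as the consistency measure) is:

$$
\mathcal{L}(\theta) = \mathrm{CE}\big(y,\, f_\theta(x_l)\big) + \lambda\, D\big(f_{\hat{\theta}}(x_u) \,\|\, f_\theta(\alpha(x_u))\big)
$$

where $\lambda$ balances the supervised term and the consistency term.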

Various other measures have been used to minimize the difference between $f_{\hat{\theta}}(x_u)$ and $f_\theta(\alpha(x_u))$.
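A minimal PyTorch sketch of this objective, assuming a classification model, KL divergence as the consistency measure, and a hypothetical `augment` function (the names and the `lam` weight are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def consistency_training_loss(model, x_l, y, x_u, augment, lam=1.0):
    """Supervised cross-entropy plus a KL consistency term on unlabeled data."""
    # Supervised term on the labeled batch.
    sup = F.cross_entropy(model(x_l), y)

    # Fixed-copy predictions: no gradients flow through f_theta_hat.
    with torch.no_grad():
        p_clean = F.softmax(model(x_u), dim=-1)

    # Predictions on the augmented unlabeled input.
    log_p_aug = F.log_softmax(model(augment(x_u)), dim=-1)

    # KL(p_clean || p_aug); other measures (e.g., MSE) are also used.
    cons = F.kl_div(log_p_aug, p_clean, reduction="batchmean")
    return sup + lam * cons
```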

Empirical Experiments

  The experiments use 10 labeled data points per class in the supervised setup, and an additional 5,000 unlabeled data points per class in the semi-supervised setup.

Other Limited Data Learning Methods

  Covers transfer learning for low-resourced languages, as well as few-shot learning.

Discussion and Future Directions

Summary

Compared with previous survey papers, this one analyzes current NLP DA methods from an empirical, comparative angle. Next, I plan to read the hidden-space augmentation papers in more detail.
