SuooL's Blog

Original Paper Reference: Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (Ren et al., EMNLP 2021)

This paper clearly borrows the idea of the RandAugment paper from the image domain: searching for a data augmentation policy tailored to a specific dataset and task.

Introduction

Existing NLP data augmentation methods fall into two main categories: generation-based and editing-based methods.

Generation-based methods use conditional generation models to synthesize new, similar text from the original. They have advantages in instance fluency and label preservation, but suffer from the heavy cost of model pre-training and decoding.

Editing-based methods instead apply label-invariant sentence editing operations (swap, delete, etc.) to the raw instance, i.e. they generate new samples by editing the original text, which is simpler and more efficient in practice. The downside is that they are sensitive to the preset hyper-parameters, including the type of operations applied and the proportion of words to be edited.
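As a concrete illustration, two such editing operations can be sketched as follows (function names and default parameters are my own, not the paper's implementation):

```python
import random

def random_swap(tokens, n=1):
    """Randomly swap two token positions n times."""
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, p=0.1):
    """Delete each token independently with probability p; keep at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

sent = "the movie was surprisingly good".split()
print(random_swap(sent, n=1))
print(random_delete(sent, p=0.2))
```

Note that `n` and `p` here correspond exactly to the preset hyper-parameters the paper criticizes: how many operations to apply and what proportion of words to edit.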

A preliminary experiment by the authors on the IMDB dataset shows that different editing operations and parameter settings affect the classification task differently, as shown in the figure below.

To integrate the different editing operations, the authors therefore propose the TAA framework, whose goal is a learnable, compositional DA method that generates a higher-quality augmented dataset and thereby improves text classification models. The effect of the learned policy is illustrated in the figure below:

The paper's overall contributions are twofold:

• We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.

• In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.

The compositional policy is described as follows:
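The shape of such a compositional policy can be sketched as a list of sub-policies, each pairing an editing operation with an apply-probability and a magnitude (the proportion of words edited). The operation set and all numbers below are hypothetical placeholders, not the policies actually learned by TAA:

```python
import random

# Illustrative sub-policies: (operation name, apply-probability, magnitude).
# These values are made up for demonstration, not learned by the paper's search.
POLICY = [
    ("random_swap",   0.5, 0.2),  # swap pairs of words
    ("random_delete", 0.3, 0.1),  # delete a fraction of words
]

def apply_op(tokens, op, magnitude):
    """Apply one editing operation to a copy of the token list.
    `magnitude` is interpreted as the proportion of words to edit."""
    tokens = tokens[:]
    n_edit = max(1, int(len(tokens) * magnitude))
    if op == "random_swap":
        for _ in range(n_edit):
            if len(tokens) < 2:
                break
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    elif op == "random_delete":
        for _ in range(n_edit):
            if len(tokens) > 1:
                tokens.pop(random.randrange(len(tokens)))
    return tokens

def augment(tokens, policy=POLICY):
    """Apply each sub-policy's operation with its probability, in order."""
    for op, prob, mag in policy:
        if random.random() < prob:
            tokens = apply_op(tokens, op, mag)
    return tokens

print(" ".join(augment("this movie is surprisingly good".split())))
```

The point of the framework is that the triples in `POLICY` are searched automatically rather than preset by hand, which is what makes the method learnable and compositional.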

Experiments

Datasets

• No-augmentation model

• Back Translation (BT) model

• Contextual Word Substitute (CWS) model

• Easy Data Augmentation (EDA) model

• Learning Data Manipulation (LDM) model

Effect of the augmentation strength multiplier

Diversity of the augmented data

We evaluate the diversity of the augmented data by computing Dist-2 (Li et al., 2016), the ratio of distinct bi-grams to the total number of bi-grams in the generated sentences. The metric ranges from 0 to 1, where a larger value indicates higher diversity.
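A small sketch of how Dist-2 can be computed, assuming whitespace tokenization (the function name and generalization to Dist-n are my own):

```python
def dist_n(sentences, n=2):
    """Dist-n: ratio of distinct n-grams to total n-grams across sentences."""
    total, distinct = 0, set()
    for sent in sentences:
        tokens = sent.split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0

# 4 bi-grams total, 3 distinct ("the cat" repeats) -> 0.75
print(dist_n(["the cat sat", "the cat ran"]))
```

Because repeated n-grams shrink the numerator but not the denominator, more repetitive augmented corpora score lower, which is exactly the behavior wanted from a diversity metric.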