# SuooL's Blog

Source Code： GT-SALT/HiddenCut (github.com)

## Introduction

Fine-tuning large-scale pre-trained language models (PLMs) has become a dominant paradigm in the natural language processing community, achieving state-of-the-art performances in a wide range of natural language processing tasks.

After task-specific fine-tuning, models are very likely to overfit and make predictions based on spurious patterns, making them less generalizable to out-of-domain distributions.

In order to improve the generalization abilities of over-parameterized models with limited amount of task-specific data, various regularization approaches have been proposed, such as adversarial training that injects label-preserving perturbations in the input space, generating augmented data via carefully-designed rules, and annotating counterfactual examples. Despite substantial improvements, these methods often require significant computational and memory overhead or human annotations.

The underlying reasons are two-fold: (1) the linguistic relations among words in a sentence is ignored while dropping the hidden units randomly. In reality, these masked features could be easily inferred from surrounding unmasked hidden units with the self-attention networks. Therefore, redundant information still exists and gets passed to the upper layers. (2) The standard dropout assumes that every hidden unit is equally important with the random sampling procedure, failing to characterize the different roles these features play in distinct tasks.

Specifically, the approach is based on the linguistic intuition that hidden representations of adjacent words are more likely to contain similar and redundant information.

HiddenCut drops hidden units more structurally by masking the whole hidden information of contiguous spans of tokens after every encoding layer. This would encourage models to fully utilize all the task-related information, instead of learning spurious patterns during training.

To make the dropping process more efficient, we dynamically and strategically select the informative spans to drop by introducing an attention based mechanism.

By performing HiddenCut in the hidden space, the impact of dropped information is only mitigated rather than completely removed, avoiding injecting too much noise to the input.

We further apply a Jensen-Shannon Divergence consistency regularization between the original and these augmented examples to model the consistent relations between them

we conduct experiments to compare our HiddenCut with previous state-of-the-art data augmentation method on 8 natural language understanding tasks from the GLUE benchmark for in-distribution evaluations, and 5 challenging datasets that cover single-sentence tasks, similarity and paraphrase tasks and inference tasks for out-of-distribution evaluations.

We further perform ablation studies to investigate the impact of different selecting strategies on HiddenCut’s effectiveness.

Results show that our method consistently outperforms baselines, especially on out-of-distribution and challenging counterexamples.

## 方法

Intuitively, we can drop the spans of hidden representations that are assigned high attentions by the transformer layers. As a result, the information redundancy is alleviated and models would be encourage to attend to other important information.

## 实验

hidden units 采样策略实验：

## 结论

In this work, we introduced a simple yet effective data augmentation technique, HiddenCut, to improve model robustness on a wide range of natural language understanding tasks by dropping contiguous spans of hidden representations in the hidden space directed by strategic attention based sampling strategies. Through HiddenCut, transformer models are encouraged to make use of all the task-related information during training rather than only relying on certain spurious clues.
Through extensive experiments on in-distribution datasets (GLUE benchmarks) and out-of-distribution datasets (challenging counterexamples), HiddenCut consistently and significantly outperformed state-of-the-art baselines, and demonstrated superior generalization performances.