[Paper Review] XLNet: Generalized Autoregressive Pretraining for Language Understanding

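This paper comes from researchers at Carnegie Mellon University and the Google AI Brain Team. When it was released in June 2019, it reported the best results of the time across a broad set of natural language processing tasks.

It is a follow-up to Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, released in January 2019, and it was written by the same authors.

XLNet outperformed BERT, the strongest model at the time, on 20 NLP tasks and achieved state-of-the-art (SOTA) results on 18 of them. It keeps the autoregressive nature of conventional language models while learning from context in both directions.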

1. Introduction

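Pretraining methods fall broadly into two families: autoregressive (AR) and autoencoding (AE).

AR models

Process the data sequentially (e.g. ELMo, GPT); training is unidirectional (forward or backward)
Take the preceding words as input and output the next word
Perform well on text generation

An AR model predicting the value that comes after the blue portion.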

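AE models

Reconstruct the input (e.g. BERT); training is bidirectional
Learn from the context on both sides of a masked word before predicting it
- (Drawback) the masked tokens must be assumed to be independent of one another
- (Example) in the sentence "New York is a city", if New and York are both masked and treated as independent, the model cannot use the fact that York is far more likely when New precedes it than when it does not
Perform well on language understanding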

An AE model predicting the masked tokens.
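
Written out, the two pretraining objectives differ in what each prediction is conditioned on. The following is a sketch in the notation of the XLNet paper; $\hat{x}$ (the input with masked tokens) and $m_t$ (1 if $x_t$ is masked, 0 otherwise) are symbols introduced here and not used elsewhere in this post.

$$ \text{AR:} \quad \max_{\theta} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t}) $$

$$ \text{AE (BERT):} \quad \max_{\theta} \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{x}) $$

The AR sum only ever conditions on one side of $x_t$. The AE sum sees both sides through $\hat{x}$, but every masked token is predicted separately from the same corrupted input, which is where the independence assumption enters.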

2. Proposed Method

XLNet proposes a new method that overcomes the drawbacks of AR and AE models while keeping the strengths of both.

Permutation Language Modeling

Input sequence: $x = (x_1, x_2, \ldots, x_T)$

Likelihood: $E_{z \sim Z_T} \left[ \prod_{t=1}^{T} p(x_{z_t} \mid x_{z_{<t}}) \right]$

Training objective: $\max_{\theta} \ E_{z \sim Z_T} \left[ \sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \right]$

The AR factorization is applied over all permutations of the input sequence's indices (orderings).

Example:

For the input sequence [$x_1, x_2, x_3, x_4$], the set of index permutations has 4! = 24 elements
$Z_T = \{[1, 2, 3, 4], [1, 2, 4, 3], \ldots\}$
The AR language model objective is applied to each order in $Z_T$

Illustration of the AR language model objective applied to $Z_T$.

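As in the example above, take a four-token sentence and the shuffled order 3->2->4->1 shown in the top-left of the figure. When the model has to predict the first token in that order (3), it is not given the content of token 3, because the problem would otherwise become trivial.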
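
The enumeration of factorization orders can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's code: the helper names `factorization_orders` and `ar_factors` are made up for this post, and a real model samples a single order per sequence instead of enumerating all T! of them.

```python
from itertools import permutations


def factorization_orders(T):
    """Return every permutation of the positions 1..T as a list of tuples."""
    return list(permutations(range(1, T + 1)))


def ar_factors(order):
    """For one factorization order, list (target, context) pairs:
    the target position and the positions that precede it in the order."""
    return [(order[t], order[:t]) for t in range(len(order))]


orders = factorization_orders(4)
print(len(orders))  # number of possible orders for 4 tokens

# Walk through the order 3 -> 2 -> 4 -> 1 used in the example.
for target, context in ar_factors((3, 2, 4, 1)):
    print(f"predict x_{target} given positions {context}")
```

For the order 3->2->4->1, the first factor has an empty context, which matches the statement that token 3 must be predicted without seeing its own content.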

Permutation language modeling predicting the token that comes after the blue line, as a conventional AR model does.

In the actual implementation the tokens are not shuffled; the permutation is realized through attention masks, which is possible because XLNet is a Transformer network.
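
A minimal NumPy sketch of how such masks could be derived from a factorization order is shown here. It assumes 1-based positions and boolean masks where `True` means "may attend"; the function name `permutation_masks` is hypothetical and this is not the XLNet implementation itself.

```python
import numpy as np


def permutation_masks(order):
    """Build attention masks for one factorization order.

    order: 1-based positions in the order they are predicted, e.g. [3, 2, 4, 1].
    Returns (content_mask, query_mask), boolean arrays of shape (T, T) where
    entry [i, j] is True if position i+1 may attend to position j+1.
    """
    T = len(order)
    # rank[p] = step at which position p+1 appears in the factorization order
    rank = np.empty(T, dtype=int)
    for step, pos in enumerate(order):
        rank[pos - 1] = step
    # Only positions strictly earlier in the order are visible.
    query_mask = rank[None, :] < rank[:, None]
    # Same as above, plus the position itself.
    content_mask = rank[None, :] <= rank[:, None]
    return content_mask, query_mask


content_mask, query_mask = permutation_masks([3, 2, 4, 1])
print(query_mask.astype(int))
print(content_mask.astype(int))
```

The tokens stay in their original positions; only the masks change from one sampled order to the next. In the strict mask, the row for position 3 is empty because 3 comes first in the order.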

Architecture: Two-Stream Self-Attention for Target-Aware Representations

Consider again the example sentence "New York is a city".

With MLM (masked language modeling), if both 'New' and 'York' are masked, the two are assumed to be independent, so the model cannot capture that the probability of York given New differs from the probability of York without New. The paper uses a permutation language model to solve this problem.
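
The difference can be written out for this sentence. The following follows the comparison made in the XLNet paper, assuming that New and York are the two prediction targets, that "is a city" is the visible context, and that the sampled factorization order predicts New before York:

$$ J_{BERT} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) $$

$$ J_{XLNet} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New}, \text{is a city}) $$

Only the second term differs: the permutation objective lets York be conditioned on New, a dependency the masked objective drops.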

However, the permutation language model introduces a problem of its own.

As the figure below shows, the model has no information about which position the next token to be predicted occupies.

The paper therefore uses a technique called Two-Stream Self-Attention.

Two-Stream Self Attention

A self-attention technique that combines two streams: query stream attention and content stream attention.

content stream attention (similar to standard self-attention)

Encodes the context together with the actual content of the token at the target position
$h_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{z_t}^{(m-1)}, KV = h_{z_{\leq t}}^{(m-1)}; \theta)$
- z: the index list obtained by randomly shuffling the original sentence order
- $z_t$: the t-th element of z

Content stream attention (similar to standard self-attention):

Figure 3: A detailed illustration of the content stream of the proposed objective with both the joint view and split views based on a length-4 sequence under the factorization order [3, 2, 4, 1]. Note that if we ignore the query representation, the computation in this figure is simply the standard self-attention, though with a particular attention mask.

query stream attention

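A self-attention technique that uses the content of earlier tokens together with the position of the target
$g_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = g_{z_t}^{(m-1)}, KV = h_{z_{<t}}^{(m-1)}; \theta)$
Unlike content stream attention, it predicts from the information of the tokens that precede the target in the order, starting from a position embedding and a randomly initialized value instead of the target's own content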
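
The two update rules can be sketched as one simplified layer in NumPy. This is a rough illustration under several assumptions, not the paper's architecture: a single head, no learned projections, residuals or feed-forward block, absolute position embeddings standing in for XLNet's relative positional encoding, and a zero vector returned for a query that can see no key. The names `build_masks`, `masked_attention` and `two_stream_layer` are hypothetical.

```python
import numpy as np


def build_masks(order):
    """order: 1-based positions in prediction order, e.g. [3, 2, 4, 1]."""
    rank = np.empty(len(order), dtype=int)
    rank[np.array(order) - 1] = np.arange(len(order))
    query_mask = rank[None, :] < rank[:, None]     # strictly earlier in the order
    content_mask = rank[None, :] <= rank[:, None]  # earlier, or the position itself
    return content_mask, query_mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] = True lets query i see key j.
    Rows with no visible key return a zero vector (a choice made for this sketch)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability (0 for fully masked rows).
    row_max = np.max(scores, axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)  # masked entries become exactly 0
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v


def two_stream_layer(h, g, content_mask, query_mask):
    """h: content stream (T, d); g: query stream (T, d)."""
    h_new = masked_attention(h, h, h, content_mask)  # Q = h, KV = h at z_<=t
    g_new = masked_attention(g, h, h, query_mask)    # Q = g, KV = h at z_<t
    return h_new, g_new


rng = np.random.default_rng(0)
T, d = 4, 8
token_emb = rng.normal(size=(T, d))  # stand-in for word embeddings
pos_emb = rng.normal(size=(T, d))    # stand-in for positional information
w = rng.normal(size=(d,))            # shared initial query vector

content_mask, query_mask = build_masks([3, 2, 4, 1])
h0 = token_emb + pos_emb             # content stream sees the token itself
g0 = w + pos_emb                     # query stream starts from position only
h1, g1 = two_stream_layer(h0, g0, content_mask, query_mask)
print(h1.shape, g1.shape)
```

The property to check is that the query-stream output at a position does not change when the token at that same position changes, while the content-stream output does. That is what allows the query stream to be used for prediction without leaking the answer.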

Overview of the permutation language modeling training with two-stream attention.

A detailed illustration of the query stream of the proposed objective with both the joint view and split views based on a length-4 sequence under the factorization order [3, 2, 4, 1]. The dash arrows indicate that the query stream cannot access the token (content) at the same position, but only the location information.

Incorporating Ideas from Transformer-XL

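The original Transformer has the drawback that, once the input exceeds its fixed sequence (context) length, the tokens beyond the maximum length are excluded from training. Transformer-XL uses a segment recurrence technique so that extra-long context can be used.

The context is cut into small segments. The first segment is trained as in the original Transformer and its states are stored in a cache; the second segment is then trained while using the cached information from the first segment (called the memory), and this process is repeated.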

(The memory is not trained while the current segment is computed: when the loss is minimized, no gradient is propagated into the memory.)
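
A rough NumPy sketch of the caching loop follows. It assumes a single attention layer with no learned projections and, for brevity, no causal or permutation mask inside a segment; `attend` and `process_with_memory` are hypothetical names. NumPy has no gradients, so the stop-gradient step is only marked by a comment at the point where a training framework would apply it.

```python
import numpy as np


def attend(q, k, v):
    """Plain scaled dot-product attention without masking."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def process_with_memory(states, segment_len):
    """Process (N, d) states segment by segment.

    Each segment attends to itself plus the cached states of the previous
    segment (the memory)."""
    memory = None
    outputs = []
    for start in range(0, len(states), segment_len):
        segment = states[start:start + segment_len]
        if memory is None:
            context = segment
        else:
            context = np.concatenate([memory, segment], axis=0)
        # Queries come from the current segment only; keys/values include memory.
        outputs.append(attend(segment, context, context))
        # Cache this layer's input for the next segment. In a training
        # framework the cached copy would be detached so that no gradient
        # flows back into it.
        memory = segment.copy()
    return np.concatenate(outputs, axis=0)


x = np.random.default_rng(0).normal(size=(8, 4))
out = process_with_memory(x, segment_len=4)
print(out.shape)
```

With two segments of length 4, the second segment effectively attends over all 8 positions even though only 4 are processed at a time, which is how the usable context grows beyond the fixed segment length.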

Illustration of the Transformer-XL model with a segment length 4

In this figure, mem corresponds to the memory above (the previous segment).

3. Experiments

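Reading comprehension on RACE, a dataset of middle- and high-school-level exam questions from China

Evaluation on SQuAD

Error-rate comparison: XLNet's error rate is 3.79%, which corresponds to an accuracy of 96.21%

Evaluation on the GLUE benchmark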

4. Conclusions

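The paper answers the question of whether Transformer-XL, which is good at language modeling, is also good for downstream tasks
It uses permutation in place of the masking approach that had been performing well
It takes the form of an AR (autoregressive) language model while using the bidirectional context of autoencoding (AE) models, combining the advantages of both AR and AE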

This post was written with reference to:

https://www.youtube.com/watch?v=koj9BKiu1rU , https://ratsgo.github.io/natural%20language%20processing/2019/09/11/xlnet/

