[Quick paper review] Big Bird, MobileBERT, ELECTRA

Papers with Code - Greatest Papers with Code

Big Bird: Transformers for Longer Sequences

NeurIPS 2020

BERT(transformer-based model)

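Problem: quadratic dependency (mainly in terms of memory) on the sequence length
From a blog post:
- This basically means a long input string has to be split into smaller segments before it can be fed to the model. Fragmenting the content this way loses a significant amount of context, which limits the applications.
BERT runs on a full self-attention mechanism, so compute and memory requirements grow quadratically as input tokens are added. The maximum input size is about 512 tokens, so the model cannot be used for larger inputs or for tasks such as summarizing long documents.
full attention mechanism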
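
To get a feel for the quadratic dependency, here is a rough back-of-the-envelope sketch. It only counts the query-key scores of a single attention matrix and assumes 4-byte (float32) scores; the function names are made up for this note, and real memory use depends on heads, layers, batch size, and the implementation.

```python
def full_attention_entries(seq_len: int) -> int:
    """Number of query-key scores in one full self-attention matrix."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for one attention matrix (assumption: 4-byte float32 scores)."""
    return full_attention_entries(seq_len) * bytes_per_score


if __name__ == "__main__":
    for n in (512, 1024, 4096):
        mib = attention_matrix_bytes(n) / 2**20
        print(f"n={n}: {full_attention_entries(n)} scores, about {mib:.0f} MiB per matrix")
```

Making the input 8x longer (512 to 4096 tokens) multiplies the number of scores by 64, which is why simply feeding longer inputs to full attention does not scale.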

Big-Bird

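sparse attention mechanism (quadratic → linear)
- Advantage: the analysis shows the benefit of CLS (a global token that attends to the entire sequence)
  - What is CLS? The first token of every input sequence is always [CLS] (special classification token). After passing through all the Transformer layers, the [CLS] token carries the combined meaning of the token sequence, so attaching a simple classifier to it makes it easy to classify a single sentence or a pair of consecutive sentences. If the task is not classification, this token can simply be ignored.
Turing complete (preserves this property of the quadratic full attention model)
- Turing complete: has the same computational power as a Turing machine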
Handles sequences up to 8x longer than BERT (on the same hardware)
Improves performance on QA and summarization, and proposes applications to genomics data
BIGBIRD gives state-of-the-art performance on a number of NLP tasks such as question answering and long document classification.
The authors propose a contextual language model for DNA
and fine-tune it for downstream tasks such as promoter region prediction and predicting the effects of non-coding variants
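
A minimal sketch of why a sparse pattern like this scales linearly. It combines the three ingredients Big Bird uses (a sliding window, a few random keys per query, and global tokens such as [CLS]) at the level of single tokens. The paper works with blocks of tokens, and the window, random, and global sizes below are illustrative assumptions, not the paper's settings.

```python
import random


def sparse_attention_pairs(seq_len, window=3, num_random=2, num_global=1, seed=0):
    """Return the set of (query, key) pairs kept by a Big Bird-style pattern.

    All default sizes are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = set()
    half = window // 2
    for q in range(seq_len):
        # Sliding window: each query attends to its neighbours.
        for k in range(max(0, q - half), min(seq_len, q + half + 1)):
            pairs.add((q, k))
        # Random attention: a few randomly chosen keys per query.
        for k in rng.sample(range(seq_len), min(num_random, seq_len)):
            pairs.add((q, k))
    # Global tokens (e.g. [CLS] at position 0) attend to everything
    # and are attended to by everything.
    for g in range(min(num_global, seq_len)):
        for i in range(seq_len):
            pairs.add((g, i))
            pairs.add((i, g))
    return pairs


if __name__ == "__main__":
    for n in (64, 128, 256):
        kept = len(sparse_attention_pairs(n))
        print(f"n={n}: kept {kept} pairs, full attention would need {n * n}")
```

Each query keeps at most `window + num_random + num_global` keys, plus one full row per global token, so the number of kept pairs grows in proportion to the sequence length instead of its square.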

ACL 2020

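MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- resource-limited device: e.g. a smartphone without a GPU
Proposed because BERT's large model size and high latency make it hard to use on resource-limited mobile devices
- the model is designed with UX (user experience) in mind
- tasks like translation or sentence generation would take a long time on a smartphone without a GPU, and the work targets that problem
a thin version of BERT_LARGE (i.e. it is as deep as BERT_LARGE, only narrower)
task-agnostic: can be applied to downstream tasks through fine-tuning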
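Training method: the teacher is a BERT_LARGE model that incorporates inverted bottlenecks (IB-BERT). Knowledge is transferred from this teacher to MobileBERT.
- bottleneck
- The Transformer block is wrapped by two linear layers. Depending on whether these two linear layers shrink or expand the dimension, the structure is a bottleneck or an inverted bottleneck
- The authors reason that a compact model needs feature maps that are as small as possible. They ran experiments training IB-BERT while varying the Transformer's input/output dimension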
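
The bottleneck / inverted-bottleneck idea above can be sketched by just tracing feature sizes through "linear, Transformer body, linear". The helper names and the sizes in the usage example are hypothetical and chosen for illustration only; the sketch ignores everything inside the Transformer body.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one dense layer mapping d_in -> d_out."""
    return d_in * d_out + d_out


def wrapped_block(d_inter: int, d_intra: int):
    """Trace feature sizes through linear -> Transformer body -> linear.

    d_inter: feature size passed between blocks.
    d_intra: feature size used inside the block.
    """
    if d_intra < d_inter:
        kind = "bottleneck"           # first linear shrinks, second expands
    elif d_intra > d_inter:
        kind = "inverted-bottleneck"  # first linear expands, second shrinks
    else:
        kind = "plain"                # no change in width
    # Sizes at: block input, after first linear, after body, block output.
    shapes = [d_inter, d_intra, d_intra, d_inter]
    extra_params = linear_params(d_inter, d_intra) + linear_params(d_intra, d_inter)
    return kind, shapes, extra_params


if __name__ == "__main__":
    # Illustrative sizes only (assumptions, not taken from these notes).
    print(wrapped_block(512, 128))
    print(wrapped_block(512, 1024))
```

Because both variants share the same `d_inter`, the block inputs and outputs of a narrow student and a wide teacher line up, which is what makes layer-by-layer transfer between them convenient.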
Result: 4.3x smaller and 5.5x faster than BERT_BASE
GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone.
On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).

ICLR 2020

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Language Modeling
NLU
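Masked LM pre-training methods such as BERT give good results on many downstream tasks, but the problem is that they need a large amount of compute to be effective
Proposes a more efficient pre-training task: replaced token detection
Instead of masking the input, some tokens are replaced with tokens produced by a small generator network, and the corrupted sequence is used as input
- Borrows the idea from GANs, but the generator learns to produce tokens with maximum likelihood
discriminative model: predicts, for each token, whether it was replaced by a generator sample or not
MLM is defined only over the masked positions, whereas the proposed pre-training task is defined over all input tokens, so it is more efficient
Strong performance for small models: a model trained for 4 days on one GPU outperforms GPT (trained with 30x more compute) on the GLUE benchmark
Performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute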
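
Here is a toy sketch of how a replaced token detection example could be built. A uniform random sampler stands in for the small generator network (an assumption for illustration; the real generator is a trained masked LM), and the 15% corruption rate and the function name are also assumptions.

```python
import random


def corrupt_for_rtd(tokens, vocab, mask_prob=0.15, seed=0):
    """Build a replaced-token-detection example from a list of tokens.

    Returns (corrupted_tokens, labels), where label 1 means "replaced"
    and label 0 means "original".
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            sampled = rng.choice(vocab)  # stand-in for the generator's sample
            corrupted.append(sampled)
            # If the sample happens to equal the original token,
            # the position counts as original, not replaced.
            labels.append(int(sampled != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


if __name__ == "__main__":
    tokens = "the chef cooked the meal".split()
    vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
    corrupted, labels = corrupt_for_rtd(tokens, vocab, mask_prob=0.5, seed=1)
    print(corrupted)
    print(labels)
```

Every position gets a label, so the discriminator receives a training signal from all input tokens, not only from the corrupted subset.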

ICWR 2020

QA
community QA (e.g. Stack Overflow, Quora)
Problems on QA sites: the slow handling of violations, the loss of normal and experienced users' time, the low quality of some reports, and discouraging feedback to new users
Solution: to automate moderation actions, predict 20 quality or subjective aspects of questions
data: Google Crowdsource
model: a fine-tuned pre-trained BERT
evaluation: MSE
achieved: MSE of 0.046 after 2 epochs of training
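
For reference, a minimal sketch of the MSE metric for a multi-target setup like this one, where each question has several aspect scores. The numbers in the usage example are made-up toy values with 3 aspects per question instead of 20, and the averaging over all questions and aspects is an assumption about how the score is aggregated.

```python
def mean_squared_error(y_true, y_pred):
    """MSE averaged over all questions and all target aspects."""
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each row must have the same number of targets")
        for t, p in zip(true_row, pred_row):
            total += (t - p) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to score")
    return total / count


if __name__ == "__main__":
    # Toy values: 2 questions x 3 aspects, scores in [0, 1].
    y_true = [[1.0, 0.0, 0.5], [0.3, 0.7, 1.0]]
    y_pred = [[0.9, 0.1, 0.5], [0.5, 0.6, 0.8]]
    print(mean_squared_error(y_true, y_pred))
```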