(NLP) Other Self-supervised Pre-training Models

Boostcamp AI Tech, 4th cohort

쉬엄쉬엄블로그 2023. 7. 12. 17:51

Text in this color is a side note and can be ignored.

Advanced Self-supervised Pre-training Models

GPT-2

GPT-2 : Language Models are Unsupervised Multi-task Learners

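The model architecture is essentially the same as GPT-1
It scales up the Transformer by stacking more layers
Pre-training uses the next-word prediction (language modeling) task
It uses a larger training dataset
While scaling up the dataset, the authors filtered for high-quality text so that the model could learn diverse knowledge effectively
It showed the potential for many down-stream tasks to be handled in a zero-shot setting by casting them as generation tasks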

GPT-2 : Motivation (decaNLP)

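Inspired by the paper "The Natural Language Decathlon: Multitask Learning as Question Answering"
- That paper argues that every kind of NLP task can be recast as a QA task
- It is an example of training on many different tasks together by unifying them in a single natural language generation format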

GPT-2 : Datasets

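Collected web pages linked from the community website Reddit, keeping only outbound links from posts that received at least 3 karma
- The text of the document behind each such link was scraped to build the corpus
Preprocess
- Byte pair encoding (BPE)
  - Like WordPiece, it splits text into subword units, so embeddings are learned at the subword level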
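
To make the BPE step concrete, here is a minimal character-level sketch of how merge rules can be learned and applied. It is a toy illustration, not GPT-2's tokenizer (GPT-2 operates on bytes and adds extra merge restrictions); the function names `learn_bpe` and `apply_bpe` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word is a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
tokens = apply_bpe("lowly", merges)
```

An unseen word such as "lowly" is still representable: frequent fragments become single tokens and the rest falls back to smaller units.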

GPT-2 : Model

https://openai.com/blog/better-language-models/

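Modification
- Layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block
- At random initialization, the weights of the residual layers are made smaller as the model gets deeper: the paper scales them by 1/√N, where N is the number of residual layers
- As a result, the linear transformations on the residual path start closer to 0 the more layers are stacked
- The model is set up so that the contribution of each individual layer shrinks as depth grows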
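
The initialization change can be written as a one-line rule. The GPT-2 paper describes scaling residual-layer weights at initialization by 1/√N, with N the number of residual layers. The sketch below only computes the resulting standard deviation; the base value 0.02 and the way N is counted (two residual sub-layers per block) are assumptions for illustration.

```python
import math


def scaled_init_std(base_std, num_residual_layers):
    """Standard deviation for residual-layer weights after 1/sqrt(N) scaling."""
    return base_std / math.sqrt(num_residual_layers)


# Assumed example: 48 blocks, each with an attention and a feed-forward
# residual sub-layer, so N = 96. base_std = 0.02 is also an assumption.
std_deep = scaled_init_std(0.02, 48 * 2)
std_shallow = scaled_init_std(0.02, 12 * 2)
```

The deeper configuration gets the smaller initial scale, so the sum of many residual contributions does not blow up at the start of training.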

GPT-2 : Question Answering

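Based on the idea that every NLP task can be recast as a QA task, GPT-2 was evaluated on a conversational QA dataset: the model is given the dialogue so far and predicts the answer that comes next. In this zero-shot setting (predicting without using any of the task's training data), it falls short of fine-tuned models but still shows real promise.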

GPT-2 : Summarization

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

GPT-2 : Translation

Given a sentence, appending a cue that names the target language, such as "in French", makes the model translate the preceding sentence
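
The zero-shot tasks above all reduce to choosing the text the model should continue. A minimal sketch of that idea follows. The helper `zero_shot_prompt` is hypothetical; the "in French" cue follows these notes, the "TL;DR:" cue is the one the GPT-2 paper reports for summarization, and the "A:" cue and exact spacing are assumptions for illustration.

```python
def zero_shot_prompt(task, text, target_language="French"):
    """Turn an input text into a prompt that a language model continues."""
    if task == "translation":
        # Cue naming the target language, appended after the sentence.
        return f"{text} in {target_language}:"
    if task == "summarization":
        # "TL;DR:" after the article invites a short summary.
        return f"{text}\nTL;DR:"
    if task == "qa":
        # The context ends with an answer marker (assumed format).
        return f"{text}\nA:"
    raise ValueError(f"unknown task: {task}")


translation_prompt = zero_shot_prompt("translation", "I like cheese.")
summary_prompt = zero_shot_prompt("summarization", "A long article ...")
```

No parameters change between tasks; only the prompt does.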

GPT-3 : Language Models are Few-Shot Learners

Language Models are Few-shot Learners
- Scaling up language models greatly improves task-agnostic, few-shot performance
- An autoregressive language model with 175 billion parameters in the few-shot setting
- 96 Attention layers, Batch size of 3.2M
- Nothing is special about the model architecture itself
- Compared with GPT-2, it stacks many more Transformer self-attention blocks to reach a far larger parameter count
- It improves performance by training on much more data with a large batch size
  
  Language Models are Few-Shot Learners, NeurIPS’20
- Prompt : the prefix given to the model
- Zero-shot : Predict the answer given only a natural language description of the task
- One-shot : See a single example of the task in addition to the task description
  - GPT-3 leaves the model itself completely unchanged; the example is simply supplied as part of the input at inference time, and the resulting predictions are better than in the zero-shot setting
- Few-shot : See a few samples of the task
  - Like the one-shot setting, but with several examples instead of one
- GPT-3 substantially raises the level of performance achievable in the zero-shot setting
- Zero-shot performance improves steadily with model size
- Few-shot performance increases more rapidly
- As model size grows, performance improves in the zero-, one-, and few-shot settings alike, with few-shot performance rising fastest
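
The three settings differ only in how many worked examples are placed in the prompt. A minimal sketch, assuming a simple "source => target" line format similar to the translation example in the GPT-3 paper; `build_prompt` and its separator are hypothetical names chosen for this illustration.

```python
def build_prompt(task_description, examples, query, sep=" => "):
    """Build a zero-, one-, or few-shot prompt depending on len(examples)."""
    lines = [task_description]
    for source, target in examples:
        # Each in-context example is shown in full: input and expected output.
        lines.append(f"{source}{sep}{target}")
    # The query is left open so the model completes the answer.
    lines.append(f"{query}{sep}")
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, demos[:1], "cheese")
few_shot = build_prompt(task, demos, "cheese")
```

The model weights are identical in all three cases; the examples act purely as conditioning text.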

ALBERT : A Lite BERT for Self-supervised Learning of Language Representations

Is having better NLP models as easy as having larger models?
- Obstacles
  - Memory Limitation
  - Training Speed
- Solutions
  - Factorized Embedding Parameterization
  - Cross-layer Parameter Sharing
  - (For Performance) Sentence Order Prediction
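- Pre-trained models have kept growing in parameter count and memory footprint, so they have needed ever more GPU memory, data, and training time
- ALBERT proposes a model that reduces size and training time without degrading the performance of the original BERT, along with a modified sentence-level self-supervised pre-training task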
Factorized Embedding Parameterization

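http://jalammar.github.io/illustrated-transformer/
- In BERT, the embedding layer's dimension must equal the dimension used inside the self-attention blocks (E = H)
- V = Vocabulary size
- H = Hidden-state dimension
- E = Word embedding dimension
- ALBERT proposes a technique for reducing the dimension of the embedding layer
  - The word embedding is stored at a lower dimension and projected up before it enters the model, which cuts the parameters and computation required
    - For example, the word embedding is stored as a 15-dimensional vector, and one extra layer expands it back to 100 dimensions
    - The result can then approximate what a direct 100-dimensional embedding would produce
    - The parameter count drops from 500x100 to 500x15 + 15x100 (50,000 to 9,000)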
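
The saving from factorizing the embedding is simple arithmetic: V x H parameters become V x E + E x H. A minimal sketch that reproduces the toy numbers above; the function name is hypothetical, and the larger configuration (V = 30,000, H = 768, E = 128) is an illustrative assumption rather than something stated in these notes.

```python
def embedding_params(vocab_size, hidden_dim, embed_dim=None):
    """Parameter count of the embedding layer, with optional factorization."""
    if embed_dim is None:
        # BERT-style: one V x H table (E is tied to H).
        return vocab_size * hidden_dim
    # ALBERT-style: a V x E table followed by an E x H projection.
    return vocab_size * embed_dim + embed_dim * hidden_dim


toy_full = embedding_params(500, 100)            # 500 x 100
toy_factorized = embedding_params(500, 100, 15)  # 500 x 15 + 15 x 100

big_full = embedding_params(30000, 768)
big_factorized = embedding_params(30000, 768, 128)
```

The reduction grows with vocabulary size, since V multiplies the small E instead of the large H.
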
Cross-layer Parameter Sharing
- Shared-FFN : Only sharing feed-forward network parameters across layers
- Shared-attention : Only sharing attention parameters across layers
- All-shared : Both of them
  
  ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR’20
- In BERT, each self-attention block has its own separate parameters for Q, K, and V, whereas ALBERT applies the same shared linear transformation matrices across the different self-attention blocks
1. Share the feed-forward network parameters that form the output sub-layer of each block
2. Share the Q, K, V parameters used to compute attention
3. Share both 1 and 2
- In experiments with options 1, 2, and 3, option 3 has the fewest parameters, and its performance drop relative to BERT is not very large
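
A rough parameter count shows why sharing everything is the smallest option. This is a simplified sketch: it counts only the attention projections (Q, K, V, output) and the two feed-forward matrices with biases, ignores layer normalization and embeddings, and uses a BERT-base-like shape (12 layers, H = 768, FFN = 3072) as an assumed example. The function name is hypothetical.

```python
def encoder_params(num_layers, hidden_dim, ffn_dim,
                   share_attention=False, share_ffn=False):
    """Approximate encoder parameter count under cross-layer sharing."""
    # Q, K, V and output projections: four H x H matrices plus biases.
    attention = 4 * (hidden_dim * hidden_dim + hidden_dim)
    # Feed-forward network: H -> FFN -> H, each with a bias.
    ffn = (hidden_dim * ffn_dim + ffn_dim) + (ffn_dim * hidden_dim + hidden_dim)
    # A shared component is stored once; otherwise once per layer.
    attention_copies = 1 if share_attention else num_layers
    ffn_copies = 1 if share_ffn else num_layers
    return attention * attention_copies + ffn * ffn_copies


not_shared = encoder_params(12, 768, 3072)
shared_ffn = encoder_params(12, 768, 3072, share_ffn=True)
shared_attention = encoder_params(12, 768, 3072, share_attention=True)
all_shared = encoder_params(12, 768, 3072, share_attention=True, share_ffn=True)
```

Under these assumptions the all-shared encoder stores one layer's worth of weights, which is 1/12 of the unshared count; the computation per forward pass is unchanged because the shared layer is still applied 12 times.
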
Sentence Order Prediction
- Next Sentence Prediction pretraining task in BERT is too easy
- Predict the ordering of two consecutive segments of text
  - Negative samples: the same two consecutive segments but with their order swapped
- A binary classification task: take two sentences that appear consecutively and predict whether they are in the correct order or swapped
- During pre-training, SOP (Sentence Order Prediction) replaces the NSP (Next Sentence Prediction) task
  - Several studies have reported that NSP contributes little
- Applying SOP during pre-training gives better results
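
Building SOP training pairs needs nothing beyond two consecutive segments. A minimal sketch, assuming a 50/50 split between correctly ordered and swapped pairs; the function name, the label convention (1 = correct order), and the sample sentences are choices made for this example.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Create one SOP example; segment_a precedes segment_b in the source."""
    if rng.random() < 0.5:
        # Positive: the segments are kept in their original order.
        return (segment_a, segment_b), 1
    # Negative: the same two segments with their order swapped.
    return (segment_b, segment_a), 0


rng = random.Random(0)
first = "He opened the door."
second = "Then he walked inside."
examples = [make_sop_example(first, second, rng) for _ in range(50)]
```

Unlike NSP, the negative pair comes from the same document, so the model cannot solve the task by topic alone and has to use discourse order.
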
GLUE Results

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR’20

ELECTRA : Efficiently Learning an Encoder that Classifies Token Replacements Accurately

Efficiently Learning an Encoder that Classifies Token Replacements Accurately
- Learn to distinguish real input tokens from plausible but synthetically generated replacements
- Pre-training text encoders as discriminators rather than generators
- Discriminator is the main network for pre-training
  
  ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, ICLR’20
- The core of ELECTRA is a pair of models: a generator that restores masked words, and a discriminator that predicts whether each word was in the original text or was produced by the generator
- The generator can be thought of as a BERT-style masked language model; the discriminator takes the sentence with the masked words filled in as input and, like BERT and GPT, is a stack of the self-attention blocks proposed in the Transformer
- For each token, the discriminator makes a binary prediction: original, or replaced by the generator's prediction
- The setup is inspired by GANs, with the generator and discriminator playing opposing roles
  - Unlike a GAN, the ELECTRA paper trains the generator with maximum likelihood (the masked language modeling loss) rather than adversarially to fool the discriminator
- After pre-training, the discriminator is the model that is kept as the pre-trained model and fine-tuned on various down-stream tasks
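
The discriminator's training targets can be derived mechanically from the original and corrupted sequences. A minimal sketch with a stand-in generator: `toy_generator` is a hypothetical lookup used only to make the example deterministic, where a real generator would sample from a masked language model. The sentence follows the example used in the ELECTRA paper.

```python
def corrupt(tokens, mask_positions, generator):
    """Mask the chosen positions and fill each one with the generator's token."""
    masked = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
    corrupted = list(tokens)
    for pos in mask_positions:
        corrupted[pos] = generator(masked, pos)
    return corrupted


def rtd_labels(original_tokens, corrupted_tokens):
    """Replaced-token-detection labels: 1 = replaced, 0 = original."""
    # A generator output that happens to equal the original counts as original.
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


def toy_generator(masked_tokens, position):
    # Hypothetical fixed predictions standing in for a sampled MLM output.
    return {0: "the", 2: "ate"}[position]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = corrupt(original, [0, 2], toy_generator)
labels = rtd_labels(original, corrupted)
```

Every position gets a label, not just the masked ones, which is one reason this objective uses each training sentence more fully than masked language modeling.
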
Replaced token detection pre-training vs masked language model pre-training
- Outperforms MLM-based methods such as BERT given the same model size, data, and compute
  
- With the same amount of training compute, ELECTRA achieves better performance than BERT

Light-weight Models

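Research on improving pre-trained models is very active
One direction is making models lighter
- This research develops variants with fewer layers or parameters than the existing models
- The goal is to shrink the model and speed up computation while preserving as much performance as possible
- It targets deployment on small devices
DistilBERT (NeurIPS 2019 Workshop)
- A triple loss combining language modeling, distillation, and cosine-distance losses; the distillation loss is computed over the soft target probabilities of the teacher model, leveraging the full teacher distribution
- There is a teacher model and a student model
- The teacher model's role is to teach the student model
- The student model is a lighter model with fewer layers and parameters than the teacher
- The student is trained to imitate the teacher's outputs closely
- When training the student, the teacher's output probability distribution is used as the target (in place of a ground-truth distribution) for the student's output distribution
- This pushes the student to reproduce the teacher's predictions as closely as possible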
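
The soft-target part of distillation is a cross-entropy between two distributions. A minimal pure-Python sketch: the temperature parameter and its value are assumptions for illustration, and this shows only the distillation term, not the full triple loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher = [3.0, 1.0, 0.0]
loss_matching = distillation_loss(teacher, teacher)
loss_different = distillation_loss([0.0, 0.0, 3.0], teacher)
```

The loss is lowest when the student's distribution equals the teacher's, so the student is rewarded for matching the teacher's relative preferences among all classes, not only its top choice.
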
TinyBERT (Findings of EMNLP 2020)
- Two-stage learning framework, which performs Transformer distillation at both the pre-training and task-specific learning stages
- Here too there is a teacher model and a student model
- Unlike DistilBERT, the student is trained not only to match the teacher's target distribution through a softmax loss, but also to match the embedding layer, the attention matrices computed from Q, K, V in each self-attention block, and the resulting hidden state vectors
- Because the teacher's and student's hidden state vectors have different dimensions, a fully connected layer that converts between the two dimensions is added so that a similarity loss can be applied; its weights are learnable parameters, which resolves the dimension mismatch
- The key feature is that the student learns to match the teacher's intermediate outputs, not just its final predictions
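
The dimension mismatch can be illustrated with a small matrix example. This is a sketch under assumptions: it projects the student's hidden states up to the teacher's size with a matrix `projection` (the direction used in the TinyBERT paper, as far as I recall) and compares them with mean squared error; the helper names and toy numbers are hypothetical, and in real training the projection would be learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hidden_state_loss(student_hidden, teacher_hidden, projection):
    """MSE between projected student hidden states and teacher hidden states."""
    # student_hidden: tokens x d_student, projection: d_student x d_teacher.
    projected = matmul(student_hidden, projection)
    squared = [
        (p - t) ** 2
        for p_row, t_row in zip(projected, teacher_hidden)
        for p, t in zip(p_row, t_row)
    ]
    return sum(squared) / len(squared)


student_hidden = [[1.0, 2.0]]                       # 1 token, d_student = 2
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # 2 x 3, fixed for the demo
loss_match = hidden_state_loss(student_hidden, [[1.0, 2.0, 3.0]], projection)
loss_off = hidden_state_loss(student_hidden, [[1.0, 2.0, 5.0]], projection)
```

The same pattern applies to the embedding-layer loss; attention matrices need no projection because their shape depends on sequence length, not hidden size.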

Fusing Knowledge Graph into Language Model

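Knowledge Graph
- A structured representation that defines the concepts and entities that exist in the world and formalizes the relations between them
BERT is good at extracting information from a given sentence but weak when external knowledge is required, so ongoing research asks whether that knowledge can be defined as a knowledge graph and combined with existing pre-trained language models such as BERT to solve tasks that need external knowledge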
ERNIE : Enhanced Language Representation with Informative Entities (ACL 2019)
- Informative entities in a knowledge graph enhance language representation
- Information fusion layer takes the concatenation of the token embedding and entity embedding
KagNET : Knowledge-Aware Graph Networks for Commonsense Reasoning (EMNLP 2019)
- A knowledge-aware reasoning framework for learning to answer commonsense questions
- For each pair of question and answer candidate, it retrieves a sub-graph from an external knowledge graph to capture relevant knowledge
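
Retrieving a sub-graph for a question and answer candidate can be sketched as a bounded path search over (head, relation, tail) triples. This is a toy illustration, not KagNet's actual pipeline: the function name, the hop limit, the undirected treatment of edges, and the sample triples are all assumptions.

```python
from collections import deque


def retrieve_subgraph(triples, question_concepts, answer_concepts, max_hops=2):
    """Collect triples lying on short paths from question to answer concepts."""
    neighbors = {}
    for triple in triples:
        head, _, tail = triple
        # Edges are followed in both directions (an assumption of this sketch).
        neighbors.setdefault(head, []).append((tail, triple))
        neighbors.setdefault(tail, []).append((head, triple))

    targets = set(answer_concepts)
    subgraph = set()
    for start in question_concepts:
        # Breadth-first search over (node, visited nodes, triples used so far).
        queue = deque([(start, (start,), ())])
        while queue:
            node, visited, path = queue.popleft()
            if node in targets and path:
                subgraph.update(path)
                continue
            if len(path) == max_hops:
                continue
            for nxt, triple in neighbors.get(node, []):
                if nxt not in visited:
                    queue.append((nxt, visited + (nxt,), path + (triple,)))
    return subgraph


triples = [
    ("bird", "CapableOf", "fly"),
    ("fly", "RelatedTo", "sky"),
    ("bird", "IsA", "animal"),
    ("fish", "AtLocation", "water"),
]
subgraph = retrieve_subgraph(triples, ["bird"], ["sky"], max_hops=2)
```

Only triples that connect the question to the candidate answer survive, which is the kind of focused evidence a graph-based reasoning module can then score.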

Source: Boostcamp AI Tech 4th cohort (NAVER Connect Foundation)
