쉬엄쉬엄블로그

(NLP) Basics of Recurrent Neural Network

Boostcamp AI Tech 4th cohort

쉬엄쉬엄블로그 2023. 6. 20. 11:59

Text in this color is a side comment and can be skipped.

Basic structure
- unrolled version diagram
  
  https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Inputs and output of RNNs (rolled version)
- We usually want to predict an output vector at some specific time steps
  http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
How to calculate the hidden state of RNNs

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
- We can process a sequence of vectors by applying a recurrence formula at every time step: $h_t = f_W(h_{t-1}, x_t)$
  - $h_{t-1}$ : old hidden-state vector
  - $x_t$ : input vector at time step t
  - $h_t$ : new hidden-state vector
  - $f_W$ : RNN function with parameters W
  - $y_t$ : output vector at time step t
- Note : The same function and the same set of parameters are used at every time step
  - The parameters W that define the RNN module hold the same values at every time step; they are shared, not learned separately per step
- The state consists of a single “hidden” vector h
  
  http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
  - The dimensionality of the hidden state vector is a hyperparameter
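
To make the recurrence concrete, here is a minimal sketch of one RNN step applied over a short sequence. It assumes the tanh update that appears later in this post for the “hello” example. The bias names `b_h` and `b_y`, the helper `matvec`, and all sizes and weight values are illustrative assumptions, not taken from the lecture. Plain Python lists are used so the sketch has no dependencies.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh, b_h, W_hy, b_y):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) and y_t = W_hy h_t + b_y."""
    rec = matvec(W_hh, h_prev)   # contribution of the old hidden state
    inp = matvec(W_xh, x)        # contribution of the current input
    h = [math.tanh(r + i + b) for r, i, b in zip(rec, inp, b_h)]
    y = [o + b for o, b in zip(matvec(W_hy, h), b_y)]
    return h, y

# Illustrative sizes; hidden_size is the hyperparameter mentioned above.
hidden_size, input_size, output_size = 2, 3, 1
W_hh = [[0.5, -0.1], [0.2, 0.3]]            # hidden_size x hidden_size
W_xh = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]  # hidden_size x input_size
b_h = [0.0, 0.0]
W_hy = [[1.0, -1.0]]                        # output_size x hidden_size
b_y = [0.0]

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0] * hidden_size  # initial hidden state h_0
outputs = []
for x in xs:
    # The same parameters are reused at every time step.
    h, y = rnn_step(h, x, W_hh, W_xh, b_h, W_hy, b_y)
    outputs.append(y)
```

Only `h` changes from step to step; the weight matrices stay fixed inside the loop, which is what parameter sharing means here.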

One-to-one
- Standard Neural Networks
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
One-to-many
- Image Captioning
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  - The input is fed in only at the first time step
  - In that case, every time step from the second onward receives a vector or tensor filled with zeros as its input
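
The zero-input convention for one-to-many models can be sketched as follows. The function name `one_to_many_inputs` and the feature values are hypothetical, chosen only for illustration.

```python
def one_to_many_inputs(first_input, num_steps):
    """Build the per-step inputs for a one-to-many RNN.

    Only step 0 carries a real input; every later step gets an all-zero
    vector of the same size so the recurrence can still be applied.
    """
    zero = [0.0] * len(first_input)
    return [list(first_input)] + [list(zero) for _ in range(num_steps - 1)]

image_feature = [0.7, -1.2, 0.3]  # hypothetical image feature vector
inputs = one_to_many_inputs(image_feature, num_steps=4)
```

Here `inputs[0]` is the image feature and `inputs[1:]` are zero vectors, so the later outputs depend on the image only through the hidden state.
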
Many-to-one
- Sentiment Classification
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Many-to-many (Sequence-to-sequence)
- Machine Translation
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- Video classification on frame level
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  - Another example : POS tagging

Example of training sequence “hello”
- Vocabulary : [h, e, l, o]
- Example training sequence : “hello”
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- $h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t+b)$
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- $\text{Logit}=W_{hy}h_t+b$
  
  http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- At test time, sample characters one at a time and feed each sampled character back into the model as the next input
    
    http://karpathy.github.io/2015/05/21/rnn-effectiveness/
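
The “hello” setup above can be sketched end to end: one-hot inputs over the vocabulary, next-character training pairs, and a test-time loop that samples a character and feeds it back in. This is a rough sketch with untrained random weights, so the generated text is arbitrary. The hidden size, the softmax over the logits, the bias names `b_h` and `b_y`, and every helper name are assumptions made for illustration.

```python
import math
import random

vocab = ["h", "e", "l", "o"]
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Encode a character as a one-hot vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    vec[char_to_idx[ch]] = 1.0
    return vec

# Each character is trained to predict the next one in "hello".
text = "hello"
pairs = list(zip(text[:-1], text[1:]))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(h_prev, x, params):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h); logits = W_hy h_t + b_y."""
    W_hh, W_xh, b_h, W_hy, b_y = params
    rec, inp = matvec(W_hh, h_prev), matvec(W_xh, x)
    h = [math.tanh(r + i + b) for r, i, b in zip(rec, inp, b_h)]
    logits = [o + b for o, b in zip(matvec(W_hy, h), b_y)]
    return h, logits

def sample_text(seed_char, length, params, hidden_size, rng):
    """Sample `length` characters after seed_char, feeding each one back in."""
    h = [0.0] * hidden_size
    ch = seed_char
    out = [ch]
    for _ in range(length):
        h, logits = step(h, one_hot(ch), params)
        ch = rng.choices(vocab, weights=softmax(logits), k=1)[0]
        out.append(ch)
    return "".join(out)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
H, V = 3, len(vocab)  # H is an arbitrary hidden size for this sketch
params = (rand_matrix(H, H, rng), rand_matrix(H, V, rng), [0.0] * H,
          rand_matrix(V, H, rng), [0.0] * V)
generated = sample_text("h", 4, params, H, rng)
```

During training, the logits at each step would be compared against the next character in `pairs`; at test time no target exists, so the model's own sample becomes the next input.
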
Training an RNN on Shakespeare’s plays

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
- If special characters such as spaces and newlines are also encoded, the text can be treated as a 1-dimensional character sequence, and a language model can be trained on it
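
As a small illustration of that encoding, the sketch below maps every distinct character, including the space and the newline, to an integer id. The sample string is made up for this example and is not taken from the training corpus.

```python
# Made-up snippet in a "speaker, colon, newline, line" layout.
text = "SPEAKER:\nhello there"

# Every distinct character gets an id; whitespace is treated like any other symbol.
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

encoded = [char_to_idx[ch] for ch in text]           # 1-dimensional sequence of ids
decoded = "".join(idx_to_char[i] for i in encoded)   # round-trips to the original text
```

Because the mapping is one-to-one, decoding the id sequence recovers the text exactly, line breaks included.
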
Training process of RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Results of trained RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- The model effectively learns patterns in the training data (e.g., a ":" is followed by a line break and then the dialogue)
A paper written by RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
C code generated by RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- It learns patterns such as brackets and indentation

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient
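- This is backpropagation through time (BPTT): the forward pass over the entire sequence computes the loss, and the backward pass computes the gradient over the entire sequence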
  
  http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
Run forward and backward through chunks of the sequence instead of whole sequence
- This is truncated BPTT: the forward and backward passes run over chunks of the sequence rather than the whole sequence
  
  http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
- The hidden state is carried forward all the way to the end, but backpropagation is applied only over a small number of time steps
  
  http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
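
The chunked procedure can be sketched on a deliberately tiny model. The code assumes a scalar RNN, $h_t=\tanh(w\,h_{t-1}+u\,x_t)$, with a squared-error loss on $h_t$; the model, the loss, the function names, and the data are all assumptions chosen so the gradients can be written by hand, not the setup from the lecture.

```python
import math

def chunk_forward_backward(xs, ys, h0, w, u):
    """Forward and backward over one chunk of a scalar RNN.

    Model: h_t = tanh(w * h_{t-1} + u * x_t), loss = sum_t 0.5 * (h_t - y_t)^2.
    h0 is treated as a constant, so no gradient flows into earlier chunks.
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    loss = sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

    grad_w = grad_u = 0.0
    dh_next = 0.0  # gradient arriving from later steps inside this chunk
    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        dh = (h_t - ys[t]) + dh_next
        dpre = dh * (1.0 - h_t ** 2)  # backprop through tanh
        grad_w += dpre * h_prev
        grad_u += dpre * xs[t]
        dh_next = dpre * w
    return loss, grad_w, grad_u, hs[-1]

def train_truncated_bptt(xs, ys, w, u, chunk_size, lr):
    """One pass over the sequence with a gradient step after every chunk."""
    h = 0.0
    total_loss = 0.0
    for start in range(0, len(xs), chunk_size):
        cx = xs[start:start + chunk_size]
        cy = ys[start:start + chunk_size]
        # The hidden state h is carried across chunks; gradients are not.
        loss, grad_w, grad_u, h = chunk_forward_backward(cx, cy, h, w, u)
        w -= lr * grad_w
        u -= lr * grad_u
        total_loss += loss
    return w, u, total_loss

inputs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2]
targets = [0.1, 0.2, -0.1, 0.3, 0.0, -0.2]
w_new, u_new, loss_sum = train_truncated_bptt(inputs, targets, 0.7, 0.4,
                                              chunk_size=2, lr=0.1)
```

Setting `chunk_size` to the full sequence length recovers ordinary BPTT; smaller chunks limit how far back each gradient reaches while the forward state still spans the whole sequence.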

RNN is excellent, but…
- Multiplying by the same matrix at each time step during backpropagation causes vanishing or exploding gradients
- Because $W_{hh}$ is applied repeatedly, the same factor is multiplied in over and over, as in a geometric sequence, so the gradient grows or shrinks exponentially
Toy Example
The reason why the vanishing gradient problem is important

https://i.imgur.com/vaNahKE.mp4
- Because of the vanishing gradient problem, no meaningful gradient signal can reach distant time steps
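
The geometric growth or decay can be seen in a stripped-down sketch. It assumes a scalar linear recurrence $h_t = w\,h_{t-1}$ with no tanh and no input, so the gradient of $h_T$ with respect to $h_0$ is simply $w^T$. In a real RNN each step also multiplies in a tanh derivative of at most 1, which pushes further toward vanishing. The function name `gradient_factor` is made up for this example.

```python
def gradient_factor(w, num_steps):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= w  # the same weight is multiplied in at every step
    return factor

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: gradient factor after 20 steps = {gradient_factor(w, 20)}")
# w=0.5 gives about 9.5e-07 (vanishing); w=2.0 gives 1048576.0 (exploding)
```

Only a weight of magnitude close to 1 keeps the signal usable over 20 steps, which is why long-range dependencies are hard for a plain RNN to learn.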

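The RNN architecture was proposed to handle time-series data well, and by treating language as a kind of time-series data it was applied to translation tasks. The Transformer architecture proposed later, however, processes the entire input sentence at once and achieves better performance.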

Source: Boostcamp AI Tech 4th cohort (NAVER Connect Foundation)