(NLP) LSTM and GRU

Boostcamp AI Tech 4th cohort

쉬엄쉬엄블로그 2023. 6. 21. 15:39

Text shown in this color is a side note and can be skipped.

LSTM, GRU

Long Short-Term Memory (LSTM)

Core Idea : pass cell state information straight through without any transformation
- Proposed to solve the long-term dependency problem: the cell state's information is passed on directly instead of being re-transformed at every time step
What is LSTM (Long Short-Term Memory)?

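http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- cell state vector
  - the vector that holds the information the network needs to keep
- hidden state vector
  - a filtered version of the cell state vector: it is processed once more so that only the information that needs to be exposed at the current time step remains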
Long short-term memory

Hochreiter & Schmidhuber, Long short-term memory, Neural Computation '97
- i : Input gate, Whether to write to cell
- f : Forget gate, Whether to erase cell
- o : Output gate, How much to reveal cell
- g : Gate gate (the candidate values, written $\tilde C_t$ below), How much to write to cell
Gates control how much information can flow from the cell state
Forget gate
- $f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$
  - Of the information carried over from the previous time step, only the fraction $f_t$ is remembered; the rest is forgotten
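The forget gate formula can be tried out on tiny numbers. The following is a minimal sketch in plain Python, not the lecture's code: the function name `forget_gate`, the sizes (hidden size 2, input size 1) and every weight value are assumptions chosen for illustration. The biases are picked so that one dimension is almost fully kept and the other almost fully forgotten.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f), one value per cell dimension."""
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Illustrative values only: hidden size 2, input size 1, so W_f is 2 x 3.
W_f = [[0.1, 0.2, 0.3],
       [0.1, 0.2, 0.3]]
b_f = [5.0, -5.0]        # large positive bias -> keep, large negative bias -> forget
h_prev = [0.3, -0.2]
x_t = [1.0]

f_t = forget_gate(W_f, b_f, h_prev, x_t)

# Elementwise product f_t * C_{t-1}: how much of the old cell state survives.
c_prev = [4.0, 4.0]
kept = [f * c for f, c in zip(f_t, c_prev)]
print("f_t  =", f_t)     # first entry close to 1, second close to 0
print("kept =", kept)
```

Because the sigmoid output always lies strictly between 0 and 1, each entry of `f_t` acts as a per-dimension "keep ratio" for the previous cell state.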
Generate information to be added and scale it with the input gate
- $i_t=\sigma(W_i\cdot [h_{t-1},x_t]+b_i)$
- $\tilde C_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$
  - tanh squashes the values into the range $-1$ to $1$
Generate new cell state by adding current information to previous cell state
- $C_t=f_t\cdot C_{t-1}+i_t\cdot \tilde C_t$
  - multiplying $\tilde C_t$ by the input gate before adding means that only a certain proportion of the new information is added (the products here are elementwise)
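The cell state update can be checked by hand with a few numbers. The sketch below is only an illustration: the helper name `update_cell_state` and all gate values are made up for this example, and the gates are given directly instead of being computed from $h_{t-1}$ and $x_t$.

```python
def update_cell_state(f_t, i_t, c_tilde, c_prev):
    """C_t = f_t * C_{t-1} + i_t * C~_t, applied elementwise."""
    return [f * c + i * g for f, i, g, c in zip(f_t, i_t, c_tilde, c_prev)]

c_prev  = [2.0, -1.0, 0.5]   # previous cell state C_{t-1}
f_t     = [1.0,  0.0, 0.5]   # keep all / forget all / keep half
i_t     = [0.0,  1.0, 0.5]   # write nothing / write all / write half
c_tilde = [0.9,  0.8, -1.0]  # candidate values, each in [-1, 1]

c_t = update_cell_state(f_t, i_t, c_tilde, c_prev)
print(c_t)  # [2.0, 0.8, -0.25]
```

The first dimension is carried over untouched, the second is overwritten by the candidate, and the third is a mix of both.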
Generate hidden state by passing cell state to tanh and output gate
- $o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$
- $h_t=o_t\cdot \tanh(C_t)$
Pass this hidden state to next time step, and output or next layer if needed
- $C_t$ is the vector that holds everything that has to be remembered
- $h_t$ is the vector fed to the output layer that makes the prediction at the current time step, so it holds only the information directly needed for that prediction (it is $C_t$ filtered down to what is needed right now)
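Putting the four equations together, one LSTM time step can be sketched as follows. This is a plain-Python sketch under stated assumptions, not a reference implementation: `lstm_step`, `make_params`, the sizes (hidden size 3, input size 2), the random initialization and the toy input sequence are all hypothetical choices for illustration.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def make_params(hidden, inputs, seed=0):
    """Hypothetical helper: small random weights and zero biases for f, i, o, c."""
    rng = random.Random(seed)
    params = {}
    for name in ("f", "i", "o", "c"):
        params["W_" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
                               for _ in range(hidden)]
        params["b_" + name] = [0.0] * hidden
    return params

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step following the equations above."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], concat)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], concat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], concat)]
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], concat)]
    # C_t = f_t * C_{t-1} + i_t * C~_t (elementwise)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t * tanh(C_t) (elementwise)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

hidden, inputs = 3, 2
params = make_params(hidden, inputs)
h, c = [0.0] * hidden, [0.0] * hidden
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy input sequence
for x_t in sequence:
    h, c = lstm_step(params, h, c, x_t)
print("h_T =", h)
print("C_T =", c)
```

Note that `c` is only ever scaled by `f_t` and added to, while `h` is recomputed from `c` at every step. That matches the division of roles described above: $C_t$ as the memory, $h_t$ as the filtered view of it.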

Gated Recurrent Unit (GRU)

What is GRU?

- $z_t=\sigma(W_z\cdot[h_{t-1},x_t])$
- $r_t=\sigma(W_r\cdot[h_{t-1},x_t])$
- $\tilde h_t=\tanh(W\cdot[r_t\cdot h_{t-1},x_t])$
- $h_t=(1-z_t)\cdot h_{t-1}+z_t\cdot\tilde h_t$
- cf.) $C_t=f_t\cdot C_{t-1}+i_t\cdot\tilde C_t$ (in LSTM)
- Of LSTM's two gate vectors (input and forget), GRU uses only one: the update gate $z_t$ plays the role of the input gate
- In the forget gate's position it uses 1 - input gate, i.e. $1-z_t$
  - so the larger the input-gate value, the smaller the value in the forget-gate position
- Rather than adding two independently gated terms, GRU computes a weighted average of the two pieces of information
  - ex) if 60% of $\tilde h_t$ is kept, only 40% of $h_{t-1}$ is kept and added
- GRU merges LSTM's two kinds of state, $C_t$ and $h_t$, into the single $h_t$
- Because the update that LSTM performs with two independent gates is done with a single gate, GRU is a lighter model that needs less computation and memory than LSTM
  - despite being lighter, its performance is reported to be on par with LSTM on many tasks
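The GRU equations above can be sketched in the same style. As before, this is a hedged illustration rather than a reference implementation: `gru_step`, the sizes and the random weights are assumptions, and biases are omitted because the equations in this post do not include them. The function also returns $z_t$ and $\tilde h_t$ so the weighted-average behaviour can be inspected.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Return W . v for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(params, h_prev, x_t):
    """One GRU time step; returns (h_t, z_t, h_tilde)."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(params["W_z"], concat)]
    r_t = [sigmoid(a) for a in matvec(params["W_r"], concat)]
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t      # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(params["W"], gated)]
    # Weighted average: (1 - z_t) * h_{t-1} + z_t * h~_t
    h_t = [(1.0 - z) * hp + z * ht for z, hp, ht in zip(z_t, h_prev, h_tilde)]
    return h_t, z_t, h_tilde

# Illustrative sizes and weights only.
hidden, inputs = 3, 2
rng = random.Random(0)
params = {name: [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
                 for _ in range(hidden)]
          for name in ("W_z", "W_r", "W")}

h_prev = [0.5, -0.5, 0.0]
x_t = [1.0, -1.0]
h_t, z_t, h_tilde = gru_step(params, h_prev, x_t)
print("z_t     =", z_t)
print("h_tilde =", h_tilde)
print("h_t     =", h_t)
```

Since the two coefficients $1-z_t$ and $z_t$ always sum to 1, every entry of `h_t` lies between the corresponding entries of `h_prev` and `h_tilde`.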

Backpropagation in LSTM? GRU?

Uninterrupted gradient flow!
- The original RNN multiplies by the same $W_{hh}$ at every step. LSTM instead multiplies the previous time step's cell state vector by a forget gate whose values differ from step to step, and brings in the new information through addition rather than multiplication. This mitigates the gradient vanishing / exploding problem
- In backpropagation, an addition simply copies the gradient to its inputs
  - so, compared with an RNN in which the same $W_{hh}$ is multiplied every time, the gradient reaches distant time steps without much distortion, which lets the model capture long-term dependencies that span many time steps
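A scalar toy calculation makes the difference visible. Along the direct cell-state path, and treating the gate values as constants, $\partial C_t / \partial C_{t-1} = f_t$, so the gradient over many steps is scaled by a product of forget-gate values. In a vanilla RNN it is scaled by a repeated factor involving the same $W_{hh}$. The sketch below is a deliberately simplified assumption-laden model: it uses scalars, ignores the tanh derivative and every other gradient path, and the numbers 0.5, 1.5 and 0.99 are arbitrary.

```python
def gradient_scale(factors):
    """Product of the per-step factors a gradient is multiplied by on the way back."""
    scale = 1.0
    for f in factors:
        scale *= f
    return scale

steps = 50

# Vanilla RNN (scalar toy): the same recurrent weight at every step.
vanilla_small = gradient_scale([0.5] * steps)   # |w| < 1 -> shrinks towards 0
vanilla_large = gradient_scale([1.5] * steps)   # |w| > 1 -> blows up

# LSTM cell-state path (scalar toy): a forget gate close to 1 at every step.
lstm_path = gradient_scale([0.99] * steps)

print("vanilla, w = 0.5 :", vanilla_small)
print("vanilla, w = 1.5 :", vanilla_large)
print("LSTM, f_t = 0.99 :", lstm_path)
```

In a real LSTM the forget gate is learned and changes at every step, so the gradient can still shrink when the gates close. The point of the toy is only that the cell-state path does not force the same matrix to be multiplied in over and over.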

Summary on RNN/LSTM/GRU

RNNs allow a lot of flexibility in architecture design
Vanilla RNNs are simple but don't work very well
Backward flow of gradients in RNN can explode or vanish
- this is the gradient vanishing / exploding problem
Common to use LSTM or GRU : their additive interactions improve gradient flow
- LSTM and GRU update their state through additive interactions (addition rather than multiplication), which improves the gradient flow; this is why they are generally used instead of the vanilla RNN

Quiz

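Given the following description of an LSTM model, what is the total number of parameters?

- There is one hidden layer.
- The input ($x_t$) has dimension 25.
- The hidden state ($h_{t-1}$) has dimension 100.
- Each LSTM gate has a bias.

Solution

( (hidden state dimension + input dimension) * hidden state dimension + hidden state dimension(bias) ) * 4
- input dimension : 25
- hidden state dimension : 100
- 4 : the number of weight sets in LSTM (the three gates $f_t$, $i_t$, $o_t$ plus the candidate $\tilde C_t$)
- ((100 + 25) * 100 + 100) * 4 = 50400

The four weight sets being counted:
- $f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$
- $i_t=\sigma(W_i\cdot [h_{t-1},x_t]+b_i)$
- $\tilde C_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$
- $o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$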
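The same count can be written as a small function, which makes it easy to try other sizes. The function name `lstm_param_count` and its `biases_per_set` argument are hypothetical; the default follows the formulation in this post, with one bias vector per weight set.

```python
def lstm_param_count(input_dim, hidden_dim, num_weight_sets=4, biases_per_set=1):
    """Parameters of a single-layer LSTM: (weights + biases) per set, times 4 sets."""
    weights = (hidden_dim + input_dim) * hidden_dim   # W acts on [h_{t-1}, x_t]
    biases = biases_per_set * hidden_dim
    return (weights + biases) * num_weight_sets

print(lstm_param_count(25, 100))  # 50400
```

Some libraries keep two bias vectors per weight set (PyTorch's `nn.LSTM`, for example, has separate input-hidden and hidden-hidden biases). Under that convention, `lstm_param_count(25, 100, biases_per_set=2)` gives 50800 rather than 50400, so the answer to the quiz depends on the stated formulation.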

Source: Boostcamp AI Tech 4th cohort (NAVER Connect Foundation)
