Notice

Recent Posts

Recent Comments

Link

250x250

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

쉬엄쉬엄블로그

(NLP) Beam Search와 BLEU Score 본문

부스트캠프 AI Tech 4기

(NLP) Beam Search와 BLEU Score

쉬엄쉬엄블로그 2023. 6. 23. 13:54

728x90

이 색깔은 주석이라 무시하셔도 됩니다.

Beam search

Greedy decoding

Greedy decoding has no way to undo decisions!
- 현재 time step에서 가장 좋아보이는 단어를 선택하는 접근 방법을 Greedy 접근법이라 함
- Greedy decoding에서는 hit 다음에 me가 나와야 하는데 뒤늦게 단어를 a로 잘못 생성했다고 알게되어도 되돌아갈 수 없음
How can we fix this?
- 이러한 문제를 해결하기 위해 Beam Search가 제안됨

Exhaustive search

Ideally, we want to find a (length $T$) translation $y$ that maximization
- $P(y|x)=P(y_1|x)P(y_2|y_1,x)P(y_3|y_2,y_1,x)...P(y_T|y_1,...,y_{T-1},x)$
  $=\prod^T_1P(y_t|y_1,...,y_{t-1},x)$
- 단어를 하나씩 생성하면서 그 때의 확률값을 순차적으로 곱함
We could try computing all possible sequences $y$
- This means that on each step $t$ of the decoder, we are tracking $V^t$ possible partial translations, where $V$ is the vocabulary size
  - $t$까지의 가능한 모든 경우를 따진다면 매 time step마다 고를 수 있는 단어 수가 vocabulary 사이즈가 되고 그것을 $V$라고 할 때 $V^t$ 만큼 가능한 경우를 모두 고려해야 함
- This $O(V^t)$ complexity is far too expensive!
  - 기하급수적으로 증가하는 경우의 수를 모두 계산하기에는 너무 많은 시간과 계산량이 필요함

Beam search

Core idea
- on each time step of the decoder, we keep track of the $k$ most probable partial translations (which we call hypothesis)
  - 모든 경우를 고려하는 것이 아니라 $k$개 경우만 고려함
  - $k$ is the beam size (in practice around 5 to 10)
A hypothesis $y_1,...,y_t$ has a score of its log probability
- $score(y_1, ..., y_t)=logP_{LM}(y_1,...,y_t|x)=\sum^t_{i=1}P_{LM}(y_i|y_1, ..., y_{i-1},x)$
- 확률 곱에 log를 붙여주면 확률 값들을 다 더하는 연산으로 변경됨
- 여기서 사용된 log 함수는 단조 증가하는 함수이기 때문에 확률 값이 가장 큰 때에는 log를 취한 값도 다른 경우와 비교할 때에 비해서 가장 큰 값으로 유지됨
- Scores are all negative, and a higher score is better
- We search for high-scoring hypotheses, tracking the top $k$ ones on each step
Beam search is not guaranteed to find a globally optimal solution.
But it is much more efficient than exhaustive search!
Beam search가 최적 값을 보장해주지는 않지만 exhaustive search에 비해서 효율적인 결과를 보여준다

Example

Beam size : $k=2$

https://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf

Stopping criterion

In greedy decoding, usually we decode until the model produces a token
- For example
  - <START> he hit me with a pie <END>
- Greedy decoding에서는 모델이 해당 time step에서 토큰을 예측했을 때 종료가 됨
In beam search decoding, different hypotheses may produce tokens on different timesteps
- When a hypothesis produces , that hypothesis is complete
- Place it aside and continue exploring other hypotheses via beam search
- Greedy decoding과 다르게 토큰을 예측한 hypothesis만 종료하여 임시 메모리 공간에 저장하고 토큰이 예측되지 않은 hypotheses들은 토큰이 예측될 때까지 진행됨
Usually we continue beam search until
- We reach timestep $T$ (where $T$ is some pre-defined cutoff), or
  - 사용자가 정한 $T$라는 timestep까지만 decoding하여 beam search 과정을 종료하거나
- We have at least $n$ completed hypotheses (where $n$ is the pre-defined cutoff)
  - 토큰이 명시적으로 발생해서 완료된 hypotheses가 정의한 $n$개 만큼 저장되면 종료시킬 수 있음

Finishing up

We have our list of completed hypotheses
- 완성된 hypotheses의 리스트를 최종적으로 얻게됨
How to select the top one with the highest score?
- 이 중에서 가장 높은 점수를 가지는 하나의 값을 뽑아야 함
Each hypothesis $y_1,...,y_t$ on our list has a score
- $score(y_1, ..., y_t)=logP_{LM}(y_1,...,y_t|x)=\sum^t_{i=1}P_{LM}(y_i|y_1, ..., y_{i-1},x)$
- log joint probability 값이 가장 큰 hypothesis를 최종 결과물(예측값)로 선택해야 함
Problem with this : longer hypotheses have lower scores
- 그런데 예측된 결과 단어의 길이가 서로 다를 때에는 상대적으로 길이가 짧은 단어가 joint probability 값이 높을 것이고 길이가 긴 단어는 joint probability 값이 낮을 것
  - 단어가 순차적으로 생성되면서 동시 사건 확률을 고려하게 되고 매 단어를 생성할 때마다 기존의 log probability 값에 항상 minus 값을 더해주는 형태가 되기 때문에 단어가 많이 생성될 수록 join probability 값이 작아지게 됨
Fix : Normalize by length
- $score(y_1,..,y_t)=\frac{1}{t}\sum^t_{i=1}logP_{LM}(y_i|y_1,...,y_{i-1},x)$
- 위와 같은 의도치않은 부작용을 방지하기 위해 각 hypothesis별로 가지는 단어의 개수로 log joint probability를 나누어줌
- 즉, word당 평균 확률 값으로 각각의 hypothesis에 대한 score를 부여함 ($\frac{1}{t}$를 곱함)
완성된 여러 문장 중 target과 유사한 문장을 골라야 하는데 기존의 측정 방법으로는 순서가 뒤죽박죽인 경우라도 등장한 단어와 길이가 비슷하다면 높은 점수를 받게 되고 적절한 문장을 고르지 못하는 문제가 있음
이러한 문제를 해결하기 위해 translation task에서 BLEU Score라는 측정 방법이 제안됨

BLEU score

Precision and Recall

$precision = \frac{the\ number\ of\ correct\ words}{length\ of\ prediction}=\frac{7}{9}=78%$
$recall = \frac{the\ number\ of\ correct\ words}{length\ of\ reference}=\frac{7}{10}=70%$
$F-measure=\frac{precision\times recall}{\frac{1}{2}(precision + recall)}=\frac{0.78\times 0.7}{0.5\times (0.78 + 0.7)}=73.78%$
precision(정밀도) : 예측된 결과가 노출되었을 때 실질적으로 느끼는 정확도
recall(재현율) : 원하는 의도에 부합하는 정보의 전체 개수 중에서 몇 개를 노출시켜주었는지 보여줌
F-measure : 조화 평균을 채택
- precision과 recall을 평균내서 하나의 값으로 summarize할 때 작은 값에 치중된 값으로 계산한다는 의도

Model 2의 F-measure가 높지만 결과를 보면 문법적으로 말이 되지 않는 문장임

BLEU score

BiLingual Evaluation Understudy (BLEU)
- N-gram overlap between machine translation output and reference sentence
- Compute precision for n-grams of size one to four
- Add brevity penalty (for too short translations)
  - $BLEU = min(1, \frac{length\ of\ prediction}{length\ of\ reference})(\prod^4_{i=1} precision_i)^{\frac{1}{4}}$
    - $(\prod^4_{i=1} precision_i)^{\frac{1}{4}}$ <= 기하평균
  - Typically computed over the entire corpus, not on single sentences
  - brevity penalty를 통해 recall까지 고려함

출처: 부스트캠프 AI Tech 4기(NAVER Connect Foundation)

'부스트캠프 AI Tech 4기' 카테고리의 다른 글

(Data Viz) Seaborn 기초 실습 - 1 (0)	2023.06.24
(Data Viz) Seaborn 소개 (0)	2023.06.24
(NLP) Seq2Seq (0)	2023.06.22
(NLP) LSTM과 GRU (0)	2023.06.21
(NLP) Basics of Recurrent Neural Network (0)	2023.06.20

'부스트캠프 AI Tech 4기' Related Articles

Comments

쉬엄쉬엄블로그

(NLP) Beam Search와 BLEU Score 본문

(NLP) Beam Search와 BLEU Score

Beam search

Greedy decoding

Exhaustive search

Beam search

Example

Stopping criterion

Finishing up

BLEU score

Precision and Recall

BLEU score

'부스트캠프 AI Tech 4기' 카테고리의 다른 글

티스토리툴바