Boostcamp AI Tech, 4th Cohort

쉬엄쉬엄블로그 2023. 5. 29. 13:04



Important Concepts in Optimization

Generalization

How well the learned model will behave on unseen data.
- Generalization refers to how well the model performs on data it has never seen.
- On data that was not used for training, performance can be worse than on the training data.

Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set.
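
A common form of cross-validation is k-fold: the training data is split into k folds, and each fold serves once as the validation set while the model is trained on the rest. The sketch below only shows the index bookkeeping in plain Python; `k_fold_indices` is a hypothetical helper name, not something from the lecture.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))  # 8 training and 2 validation samples per fold
```

The validation scores from the k rounds are typically averaged; the test set stays untouched throughout.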

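The cost being minimized can be decomposed into three parts: bias², variance, and noise.
- Reducing bias tends to increase variance, and reducing variance tends to increase bias.
- The goal is a robust model that generalizes well by balancing bias and variance.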
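
For squared error, this decomposition can be written out explicitly. The notation is an assumption, not taken from the notes: targets are $t = f(x) + \epsilon$ with zero-mean noise $\epsilon$ of variance $\sigma^2$, $\hat{f}$ is the learned predictor (random because the training set is random), and the test noise is independent of $\hat{f}$.

```latex
\mathbb{E}\big[(t - \hat{f})^2\big]
  = \underbrace{\big(f - \mathbb{E}[\hat{f}]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f} - \mathbb{E}[\hat{f}])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The noise term does not depend on the model, so only bias and variance can be traded against each other.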

Bagging (Bootstrap aggregating)
- Multiple models are trained with bootstrapping, i.e. each on a random sample of the training data.
- ex) In classification, base classifiers are fitted on random subsets, and their individual predictions are aggregated (voting or averaging).
Boosting
- It focuses on those specific training samples that are hard to classify.
- A strong model is built by combining weak learners in sequence, where each learner learns from the mistakes of the previous weak learner.
- Training proceeds sequentially: each new model concentrates on the samples the previous model got wrong, so that those mistakes get corrected.
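
Bagging can be sketched in a few lines. This is a toy illustration, not the lecture's code. The assumptions are 1-D inputs, decision stumps as base classifiers, and majority voting; `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical names.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D decision stump: predict 1 if x > threshold, else 0."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

xs = [0.1, 0.4, 0.5, 0.9, 1.1, 1.6, 1.8, 2.3, 2.7, 3.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = bagging_fit(xs, ys)
print(bagging_predict(model, 0.3), bagging_predict(model, 2.5))
```

Boosting differs in that the models cannot be trained independently like this: each one depends on the errors of the one before it.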

Stochastic gradient descent
- Update the weights with the gradient computed from a single sample.
Mini-batch gradient descent
- Update the weights with the gradient computed from a subset of the data (a mini-batch).
Batch gradient descent
- Update the weights with the gradient computed from the whole data set.
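
The three variants differ only in how many samples feed each update. A minimal sketch, assuming a 1-D linear regression with mean squared error; the function name, the toy data, and the hyperparameters are illustrative choices:

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error.

    batch_size = 1      -> stochastic gradient descent
    1 < batch_size < N  -> mini-batch gradient descent
    batch_size = N      -> batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data from y = 2x + 1 without noise, so every variant should approach w = 2, b = 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
for batch_size in (1, 4, len(xs)):
    w, b = gradient_descent(xs, ys, batch_size)
    print(batch_size, round(w, 2), round(b, 2))
```

With the same number of epochs, a smaller batch size means more (and noisier) updates per pass over the data.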

“It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize.”
“We … present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contrast, small-batch methods consistently converge to flat minimizers… this is due to the inherent noise in the gradient estimation.”
- Large batch sizes tend to converge to sharp minima.
- Small batch sizes tend to converge to flat minima.
- At a flat minimum the test loss stays low as well, whereas at a sharp minimum the test loss is comparatively high.
- So a smaller batch size is generally preferable for generalization.
  Source: On Large-batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2017
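
One way to see why flatness matters is a toy 1-D picture. It is purely illustrative and rests on two assumptions: the loss is quadratic, and the test loss is the training loss shifted sideways by a small offset. The same shift costs far more at a sharp minimum than at a flat one.

```python
def loss(w, curvature, shift=0.0):
    """Quadratic loss with its minimum at w = shift."""
    return curvature * (w - shift) ** 2

shift = 0.1    # hypothetical offset between the training and the test loss
w_star = 0.0   # minimizer of the training loss (shift = 0)

for name, curvature in (("flat", 1.0), ("sharp", 100.0)):
    train_loss = loss(w_star, curvature)
    test_loss = loss(w_star, curvature, shift)
    print(name, train_loss, round(test_loss, 4))
# flat 0.0 0.01
# sharp 0.0 1.0
```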

(Stochastic) Gradient descent
Momentum
- Momentum keeps part of the direction the updates have been moving in, so training still progresses well even when the gradient oscillates a lot.
Nesterov Accelerated Gradient
- Can converge to a local minimum faster than plain momentum.
Adagrad
- Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters.
- Because the squared gradients keep accumulating, the effective learning rate only shrinks, and learning eventually stalls late in training.
Adadelta
- Adadelta extends Adagrad to reduce its monotonically decreasing learning rate by restricting the accumulation window.
- There is no learning rate in Adadelta.
- Because there is no learning rate to tune by hand, it is not used much in practice.
RMSprop
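- RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
- Like Adagrad, it scales each update by accumulated squared gradients, but it uses an exponential moving average instead of a running sum and keeps an explicit step size.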
Adam
- Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients.
- Adam effectively combines momentum with adaptive learning rate approach.
- The learning rate adapts to the magnitude of the (squared) gradients, while momentum carries the information from previous gradients.
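
The update rules above can be compared side by side on a toy problem. This sketch rests on three assumptions: a scalar parameter, the loss f(w) = w², and commonly used default hyperparameters. It follows the usual textbook formulations and is not any particular library's implementation.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = w**2 (minimum at w = 0)."""
    return 2.0 * w

def sgd(w, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum(w, lr=0.1, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)                   # accumulate past gradients
        w -= lr * v
    return w

def nesterov(w, lr=0.1, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w - lr * beta * v)   # gradient at the look-ahead point
        w -= lr * v
    return w

def adagrad(w, lr=0.5, eps=1e-8, steps=500):
    g_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g_sum += g * g                           # running sum only grows
        w -= lr * g / (math.sqrt(g_sum) + eps)
    return w

def rmsprop(w, lr=0.01, gamma=0.9, eps=1e-8, steps=500):
    g_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        g_avg = gamma * g_avg + (1 - gamma) * g * g   # moving average, not a sum
        w -= lr * g / (math.sqrt(g_avg) + eps)
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Starting from w = 1.0, every method should end up close to the minimum at 0.
for opt in (sgd, momentum, nesterov, adagrad, rmsprop, adam):
    print(opt.__name__, abs(opt(1.0)))
```

Adadelta is left out to keep the sketch short; it replaces the learning rate in the RMSprop-style update with a moving average of past squared updates.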

More data are always welcome.
However, in most cases, training data are given in advance.
In such cases, we need data augmentation.
- For images, for example, the data can be augmented by rotating or flipping each image.
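
Flipping and rotating are easy to sketch when an image is treated as a list of pixel rows. The helper names below are hypothetical, and a real pipeline would normally use an image library instead.

```python
def flip_horizontal(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate an image (list of rows) by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

image = [[1, 2, 3],
         [4, 5, 6]]
augmented = [image, flip_horizontal(image), rotate_90_clockwise(image)]
for img in augmented:
    print(img)
# [[1, 2, 3], [4, 5, 6]]
# [[3, 2, 1], [6, 5, 4]]
# [[4, 1], [5, 2], [6, 3]]
```

Whether a transform keeps the label valid depends on the task; flipping a handwritten digit, for instance, can change what it depicts.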

Mix-up constructs augmented training examples by mixing both the inputs and the outputs of two randomly selected training samples.
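
In code, this mixing is a convex combination with one shared weight. A minimal sketch, assuming flattened inputs and one-hot labels; the weight is drawn from a Beta(alpha, alpha) distribution as in the mixup paper, and `alpha=1.0` is an illustrative default.

```python
import random

def mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Blend two samples with a weight lam drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
x_a, y_a = [1.0, 0.0, 0.0, 1.0], [1.0, 0.0]   # flattened toy "image", one-hot label
x_b, y_b = [0.0, 1.0, 1.0, 0.0], [0.0, 1.0]
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, rng=rng)
print(lam, x_mix, y_mix)
```

The mixed label is no longer one-hot: it assigns weight `lam` to the first class and `1 - lam` to the second.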

CutMix constructs augmented training examples from two randomly selected training samples: the inputs are mixed by cutting a patch from one image and pasting it onto the other, and the outputs are mixed into soft labels.
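
A rough sketch of the cut-and-paste step on toy images stored as lists of rows. Two simplifications are assumptions of this sketch: the box corners are drawn uniformly (the CutMix paper samples the box size from a Beta-distributed ratio), and the labels are mixed in proportion to the pasted area.

```python
import random

def cutmix(img_a, label_a, img_b, label_b, rng=random):
    """Paste a random rectangle of img_b onto img_a and mix the labels by area."""
    h, w = len(img_a), len(img_a[0])
    # Two distinct cut positions per axis give a non-empty box.
    top, bottom = sorted(rng.sample(range(h + 1), 2))
    left, right = sorted(rng.sample(range(w + 1), 2))
    mixed = [row[:] for row in img_a]            # copy so img_a is not modified
    for r in range(top, bottom):
        mixed[r][left:right] = img_b[r][left:right]
    # lam is the fraction of pixels that still come from img_a.
    lam = 1 - (bottom - top) * (right - left) / (h * w)
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label, lam

rng = random.Random(0)
img_a = [[0] * 4 for _ in range(4)]   # toy 4x4 "image" of class 0
img_b = [[1] * 4 for _ in range(4)]   # toy 4x4 "image" of class 1
mixed, label, lam = cutmix(img_a, [1.0, 0.0], img_b, [0.0, 1.0], rng=rng)
for row in mixed:
    print(row)
print(label)
```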

Dropout: in each forward pass, randomly set some neurons to zero.
- The zeroed neurons take no part in the computation for that pass.
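
This technique is usually called dropout. The sketch below uses the common "inverted" variant, which also rescales the surviving units during training so that nothing needs to change at test time. The rescaling and the drop probability `p=0.5` are assumptions of this sketch, not something stated in the notes.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training or p == 0.0:
        return list(activations)      # at test time every unit is kept
    keep = 1.0 - p
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10
print(dropout(acts, p=0.5, rng=rng))          # a mix of 0.0 and 2.0
print(dropout(acts, p=0.5, training=False))   # unchanged at test time
```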

Batch normalization computes the empirical mean and variance independently for each dimension (in each layer) and normalizes the activations with them.
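
A training-time sketch of this normalization for a batch stored as a list of rows. The scale and shift parameters `gamma` and `beta` and the constant `eps` come from the standard formulation and are assumptions here; the running statistics used at inference are left out.

```python
import math

def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    n, d = len(batch), len(batch[0])
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):                                  # each dimension independently
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n     # biased (population) variance
        for i in range(n):
            out[i][j] = gamma[j] * (col[i] - mean) / math.sqrt(var + eps) + beta[j]
    return out

batch = [[1.0, 10.0],
         [2.0, 20.0],
         [3.0, 30.0]]
for row in batch_norm(batch):
    print([round(v, 3) for v in row])   # each column now has mean ~0 and variance ~1
```

Both columns end up on the same scale even though the raw values differ by a factor of ten.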

Source: Boostcamp AI Tech, 4th Cohort (NAVER Connect Foundation)