쉬엄쉬엄블로그

(NLP) Intro to NLP

Boostcamp AI Tech 4th cohort

쉬엄쉬엄블로그 2023. 6. 17. 12:05

Text shown in this color is a side note and can be skipped.

Intro to NLP

Goal of This Course

Natural language processing (NLP), which aims at properly understanding and generating human language, has emerged as a crucial application of artificial intelligence with the advancement of deep neural networks.
This course covers various deep learning approaches as well as their applications, such as language modeling, machine translation, question answering, document classification, and dialog systems.

Academic Disciplines related to NLP

Natural language processing (major conferences : ACL, EMNLP, NAACL)

includes state-of-the-art deep learning-based models and tasks
Low-level parsing
- Tokenization
- Stemming : morphological analysis, stem extraction
Word and phrase level
- Named entity recognition (NER)
- Part-of-speech (POS) tagging
  - identifying the part of speech of each word
- Noun-phrase chunking
  - extracting phrases from unstructured text
- Dependency parsing
  - identifying the grammatical relations between words in a sentence
- Coreference resolution
  - finding and linking the different noun phrases (mentions) that refer to the same entity
Sentence level
- Sentiment analysis
- Machine translation
Multi-sentence and paragraph level
- Entailment prediction
  - predicting the logical relation between two sentences (entailment or contradiction)
- Question answering (QA)
- Dialog systems
- Summarization

Text mining (major conferences : KDD, The WebConf (formerly WWW), WSDM, CIKM, ICWSM)

Extract useful information and insights from text and document data
- e.g., analyzing the trends of AI-related keywords from massive news data
Document clustering (e.g., topic modeling)
- e.g., clustering news data and grouping it into different subjects
Highly related to computational social science
- e.g., analyzing the evolution of people’s political tendencies based on social media data

Information retrieval (major conferences : SIGIR, WSDM, CIKM, RecSys)

Highly related to computational social science
- This area is not actively studied now
- It has evolved into recommendation systems, which are still an active area of research

Trends of NLP

Text data can be viewed as a sequence of words, and each word can be represented as a vector through a technique such as Word2Vec or GloVe.
RNN-family models (LSTMs and GRUs), which take the sequence of these word vectors as input, have been the main architecture for NLP tasks.
The overall performance of NLP tasks has improved since attention modules and Transformer models, which replace RNNs with self-attention, were introduced.
As with the Transformer, most advanced NLP models were originally developed to improve machine translation.
In the early days, customized models for different NLP tasks were developed separately.
Since the Transformer was introduced, huge models have been built by stacking its basic module, self-attention, and these models are trained on large datasets through language modeling, a self-supervised training setting that does not require additional labels for a particular task.
- e.g., BERT, GPT-3 …
Afterwards, these models were applied to other tasks through transfer learning, and they outperformed the other customized models in each task.
Currently, these models have become an essential part of numerous NLP tasks, so NLP research has become difficult with limited GPU resources, since the models are too large to train.

Bag-of-Words

Bag-of-Words Representation

Step 1. Constructing the vocabulary containing unique words
- Example sentences : “John really really loves this movie”, “Jane really likes this song”
- Vocabulary : {“John”, “really”, “loves”, “this”, “movie”, “Jane”, “likes”, “song”}
Step 2. Encoding unique words to one-hot vectors
- Vocabulary : {“John”, “really”, “loves”, “this”, “movie”, “Jane”, “likes”, “song”}
  - John : [1 0 0 0 0 0 0 0]
  - really : [0 1 0 0 0 0 0 0]
  - loves : [0 0 1 0 0 0 0 0]
  - this : [0 0 0 1 0 0 0 0]
  - movie : [0 0 0 0 1 0 0 0]
  - Jane : [0 0 0 0 0 1 0 0]
  - likes : [0 0 0 0 0 0 1 0]
  - song : [0 0 0 0 0 0 0 1]
- For any pair of distinct words, the Euclidean distance is $\sqrt{2}$
- For any pair of distinct words, the cosine similarity is 0
  - The vector representation gives every pair of words the same relationship, regardless of the words' meanings
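The two properties above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `one_hot`, `euclidean`, and `cosine` are made up for this illustration.

```python
import math

# Vocabulary in the order used in the example above.
vocab = ["John", "really", "loves", "this", "movie", "Jane", "likes", "song"]

def one_hot(word, vocab):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

john, jane = one_hot("John", vocab), one_hot("Jane", vocab)
print(euclidean(john, jane))  # distance between two distinct one-hot vectors
print(cosine(john, jane))     # similarity between two distinct one-hot vectors
```

Because two distinct one-hot vectors differ in exactly two positions and share no non-zero position, every pair gives the same distance and the same similarity, whatever the words mean.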
A sentence/document can be represented as the sum of the one-hot vectors of its words
- Sentence 1 : “John really really loves this movie”
  - John + really + really + loves + this + movie : [1 2 1 1 1 0 0 0]
- Sentence 2 : “Jane really likes this song”
  - Jane + really + likes + this + song : [0 1 0 1 0 1 1 1]
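Summing one-hot vectors is the same as counting how often each vocabulary word occurs. Here is a minimal sketch of both steps, assuming whitespace tokenization and a vocabulary ordered by first appearance; `build_vocab` and `bag_of_words` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(sentences):
    """Collect unique words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bag_of_words(sentence, vocab):
    """Count occurrences of each vocabulary word in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

sentences = [
    "John really really loves this movie",
    "Jane really likes this song",
]
vocab = build_vocab(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]
for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```

Word order is discarded in this representation: any reordering of a sentence's words gives the same vector.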

Naive Bayes Classifier for Document Classification

Bag-of-Words for Document Classification
Bayes’ Rule Applied to Documents and Classes
- For a document d, which consists of a sequence of words w, and a class c
- The probability of a document given a class can be represented by multiplying the probabilities of each word appearing in that class
- $P(d|c)P(c)=P(w_1,w_2,\ldots,w_n|c)P(c)\rightarrow P(c)\prod_{w_i\in W}P(w_i|c)$
  (by the conditional independence assumption)
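As a reminder of where this product comes from, the classifier picks the class with the highest posterior probability. A sketch of the standard derivation follows; the symbols $\hat{c}$ (the predicted class) and $C$ (the set of classes) are notation introduced here, not taken from the notes.

$$\hat{c}=\arg\max_{c\in C}P(c|d)=\arg\max_{c\in C}\frac{P(d|c)P(c)}{P(d)}=\arg\max_{c\in C}P(d|c)P(c)$$

$P(d)$ can be dropped because it does not depend on $c$.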
Example
- For a document d, which consists of a sequence of words w, and a class c
- For each word $w_i$, we can calculate the conditional probability for class c
  - $P(w_k|c_i)=\frac{n_k}{n'}$, where $n_k$ is the number of occurrences of $w_k$ in documents of topic $c_i$ (with $n'$ taken to be the total number of word occurrences in those documents)
- For a test document $d_5$ = “Classification task uses transformer”
  - We calculate the conditional probability of the document for each class
  - We can choose the class that has the highest probability for the document
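The procedure can be sketched in a few lines of Python. Everything below is an illustration under stated assumptions: the training documents are made up (they are not the lecture's data), the function names are hypothetical, and two things are added that the notes do not mention, namely add-one (Laplace) smoothing so that unseen words do not produce a zero probability, and log probabilities to avoid numerical underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set, made up for illustration (not the lecture's data).
TRAIN = [
    ("image recognition uses convolution", "CV"),
    ("transformer is used for image classification", "CV"),
    ("language modeling uses transformer", "NLP"),
    ("document classification task is language task", "NLP"),
]

def fit(train):
    """Return class priors P(c) and per-class word counts n_k."""
    doc_counts = Counter(label for _, label in train)
    word_counts = {label: Counter() for label in doc_counts}
    for text, label in train:
        word_counts[label].update(text.lower().split())
    priors = {label: n / len(train) for label, n in doc_counts.items()}
    return priors, word_counts

def log_score(doc, label, priors, word_counts, alpha=1.0):
    """log P(c) + sum_i log P(w_i|c), with add-alpha smoothing (an assumption)."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    n_total = sum(word_counts[label].values())  # n' in the formula above
    score = math.log(priors[label])
    for word in doc.lower().split():
        n_k = word_counts[label][word]  # Counter returns 0 for unseen words
        score += math.log((n_k + alpha) / (n_total + alpha * len(vocab)))
    return score

def predict(doc, priors, word_counts):
    """Choose the class with the highest (log) score for the document."""
    return max(priors, key=lambda label: log_score(doc, label, priors, word_counts))

priors, word_counts = fit(TRAIN)
test_doc = "Classification task uses transformer"
for label in priors:
    print(label, log_score(test_doc, label, priors, word_counts))
print("predicted:", predict(test_doc, priors, word_counts))
```

With this made-up data the sketch picks NLP, because "task" occurs only in the NLP documents while the other three test words are equally frequent in both classes.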

Source: Boostcamp AI Tech 4th cohort (NAVER Connect Foundation)
