(밑바닥부터 시작하는 딥러닝2) 3장. word2vec

이번 장에서는 2장에서 알아본 ‘통계 기반 기법’보다 강력한 ‘추론 기반 기법’을 알아본다. 추론 기반 기법에서는 신경망을 이용하는데, 여기서 word2vec이 등장한다.

추론 기반 기법과 신경망

통계 기반 기법의 문제점

통계 기반 기법에서는 주변 단어의 빈도를 기초로 단어를 표현했다. 이 방식은 대규모 말뭉치를 다룰 때 문제가 발생한다. 예를 들어 영어의 어휘 수는 100만을 넘는다. 통계 기반으로는 100만 * 100만 크기의 행렬을 만들어야하고, 여기에 SVD를 적용해야한다. SVD를 n * n 행렬에 적용하는 비용은 O(n^3)이다. n이 100만이라면 SVD는 비현실적인 처리 시간이 걸린다.

통계 기반 기법은 말뭉치 전체의 통계를 이용해 단 1회의 처리만에 단어의 분산 표현을 얻는다. 하지만 추론 기반 기법에서는 미니배치로 학습을 반복해서 학습하며 가중치를 갱신한다. 따라서 추론 기반 기법을 이용하면 말뭉치가 크더라도, 조금씩 학습시킬 수 있다.

추론 기반 기법 개요

추론이란 아래와 같이 주변 단어(맥락)이 주어졌을 때 “?”에 어떤 단어가 들어가는지 추측하는 것이다. 즉 신경망 모델에 맥락을 입력하면 각 단어가 정답일 확률을 리턴한다.

신경망에서의 단어 처리

학습을 위해서는 단어를 “고정 길이 벡터”로 변환해야한다. 여기서 사용하는 방법이 단어를 원핫(one-hot) 벡터로 변경하는 것이다. 예를 들어 “you say goodbye and I say hello”라는 문장에는 7개의 단어가 등장한다. 이 중 두 단어를 원핫 벡터로 변경하면 아래와 같다.

이렇게 단어를 원핫 벡터로 변경을 하면 이들을 입력, 출력으로 하여 신경망을 학습시킬 수 있다. 아래와 같은 신경망을 구현해보자.

import numpy as np

c = np.array([[1, 0, 0, 0, 0, 0, 0]]) # 입력
W = np.random.randn(7, 3) # 가중치
h = np.matmul(c, W) # 은닉층
print(h) # [[0.80907959 0.07517276 0.19219561]]

1장에서 구현했었던 Matmul 계층으로도 구현할 수 있다.

from common.layers import MatMul

c = np.array([[1, 0, 0, 0, 0, 0, 0]])
W = np.random.randn(7, 3)
layer = MatMul(W)
h = layer.forward(c)
print(h) # [[-1.44375783  0.80451242  0.69349139]]

단순한 word2vec

여기서는 모델을 구현해본다. 여기서 사용하는 신경망은 CBOW(continuos bag-of-words)모델이다.

CBOW 모델의 추론 처리

CBOW 모델은 맥락으로부터 타깃을 추측하는 용도의 신경망이다. 아래와 같은 구조를 갖는다. 보면 입력은 2개인데, 은닉층은 1개이다. 이 경우에는 각 입력에서 나온 은닉층을 평균을 취한다. 출력층 뉴런은 각 단어가 정답일 점수를 반환한다. 이 점수에 소프트맥스 함수를 적용하면 확률을 얻을 수 있다. CBOW 모델의 추론 처리를 파이썬으로 구현해보자.

import numpy as np
from common.layers import MatMul

# 샘플 맥락 데이터
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])

# 가중치 초기화
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)

# 계층 생성
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# 순전파
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h) 

print(s) # [[-0.15226882 -1.91760307  0.8080126   1.77896494  0.08403568 -1.37749376  1.42677427]]

CBOW 모델의 학습

출력층에서는 각 단어의 점수를 출력했다. 이 점수에 소프트맥스 함수를 적용하면 ‘확률’을 얻을 수 있다. CBOW 모델의 학습에서는 가중치를 조정하는 일을 한다. 이 신경망을 학습하려면 소프트맥스와 교차 엔트로피 오차를 이용한다. 여기서는 소프트맥스 함수를 이용해 점수를 확률로 변환하고, 그 확률과 정답 레이블로부터 교차 엔트로피 오차를 구한 후, 그 값을 손실로 사용해 학습을 진행한다.

학습 데이터 준비

맥락과 타깃

word2vec에서 입력은 맥락이고, 출력은 맥락사이의 단어이다. 학습을 위해서는 말뭉치에서 타깃과 그 타깃의 맥락을 만들어야한다.

먼저 맥락과 타깃을 만들기 전, 말뭉치 텍스트를 단어 ID로 변환해야한다. 이 작업에는 2장에서 구현한 preprocess() 함수를 이용한다.

from common.util import preprocess

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
print(corpus) # [0 1 2 3 4 1 5 6]

print(id_to_word) # {0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}

그 다음 단어 ID의 배열인 corpus로부터 맥락과 타깃을 만든다. 구체적으로는 아래처럼 corpus를 주면 맥락과 타깃을 반환하는 함수를 작성한다. window_size는 좌우 몇 단어만큼 맥락으로 사용할지이다.

def create_contexts_target(corpus, window_size=1):
    target = corpus[window_size:-window_size]
    contexts = []
    
    for idx in range(window_size, len(corpus)-window_size):
        cs = []
        for t in range(-window_size, window_size+1):
            if t == 0:
                continue
            cs.append(corpus[idx+t])
        contexts.append(cs)
        
    return np.array(contexts), np.array(target)

contexts, target = create_contexts_target(corpus, window_size=1)

print(contexts)
#---------------------출력---------------------#
# [[0 2]
#  [1 3]
#  [2 4]
#  [3 1]
#  [4 5]
#  [1 6]]

print(target) # [1 2 3 4 1 5]

원핫 표현으로 변환

맥락과 타깃을 원핫표현으로 바꿔보자. 수행하는 과정은 아래와 같다. convert_one_hot() 함수는 맥락과 타깃을 입력으로 주면 이를 원핫벡터로 변환한 값을 리턴해준다.

def convert_one_hot(corpus, vocab_size):
    N = corpus.shape[0]
    if corpus.ndim == 1: # corpus가 target인 경우
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1
            
    elif corpus.ndim == 2: # corpus가 맥락인경우
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1
                
    return one_hot

지금까지의 데이터 준비과정을 정리해보자.

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

contexts, target = create_contexts_target(corpus, window_size=1)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

CBOW 모델 구현

CBOW 모델을 구현해보자. 구현할 신경망은 아래와 같다. 여기서는 1장에서 구현한 MatMul과 SoftmaxWithLoss 계층을 사용한다.

from common.layers import MatMul, SoftmaxWithLoss

class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size
        
        # 가중치 초기화
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')
        
        # 계층 생성
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()
        
        # 모든 가중치와 기울기를 리스트에 모은다.
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads
            
        # 인스턴스 변수에 단어의 분산 표현을 저장한다.
        self.word_vecs = W_in
        
    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss
    
    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None

학습 코드 구현

CBOW 모델의 학습은 일반적인 신경망의 학습과 같다. 학습 데이터를 신경망에 입력한 다음, 기울기를 구하고, 가중치 매개변수를 순서대로 갱신해간다. 1장에서 구현한 Trainer 클래스를 이용해 학습을 진행한다. 매개변수 갱신 방식은 Adam을 사용했다.

from common.trainer import Trainer
from common.optimizer import Adam

window_size = 1 # 맥락의 크기
hidden_size = 5 # 은닉층의 수
batch_size = 3
max_epoch = 1000

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleCBOW(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

#---------------------출력---------------------#
# | 에폭 1 |  반복 1 / 2 | 시간 0[s] | 손실 1.95
# | 에폭 2 |  반복 1 / 2 | 시간 0[s] | 손실 1.95
# | 에폭 3 |  반복 1 / 2 | 시간 0[s] | 손실 1.95
# ...
# | 에폭 998 |  반복 1 / 2 | 시간 1[s] | 손실 0.26
# | 에폭 999 |  반복 1 / 2 | 시간 1[s] | 손실 0.38
# | 에폭 1000 |  반복 1 / 2 | 시간 1[s] | 손실 0.49

학습을 거듭할수록 손실이 줄어드는 것을 확인할 수 있다.