24. Natural Language Processing 3¶

So far, we have used Tokenizer to tokenize text.

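This page introduces embeddings. An embedding maps each word token to a vector in a high-dimensional space. By training on labeled examples, the vectors can be adjusted so that words with similar meanings point in similar directions in that space.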
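As a rough illustration of the idea, the sketch below treats an embedding as a lookup table in which row i of a matrix is the vector for token i. The three words, their indices, and the 4-dimensional vectors are hand-picked hypothetical values, not learned weights; in a real model the values come out of training.

```python
import numpy as np

# Hypothetical token indices and hand-picked vectors (not learned weights).
word_index = {"good": 0, "great": 1, "terrible": 2}
embedding_matrix = np.array([
    [0.9, 0.1, 0.3, 0.0],    # "good"
    [0.8, 0.2, 0.4, 0.1],    # "great"
    [-0.7, 0.1, -0.5, 0.0],  # "terrible"
])

def embed(word):
    """Look up the vector for a word by indexing a row of the matrix."""
    return embedding_matrix[word_index[word]]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_close = cosine_similarity(embed("good"), embed("great"))
sim_far = cosine_similarity(embed("good"), embed("terrible"))
print(sim_close > sim_far)
```

With these made-up vectors, "good" and "great" point in nearly the same direction, while "good" and "terrible" point in roughly opposite directions.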

Embeddings are the starting point for training a model to recognize sentiment in text.

By training a neural network on movie review text labeled 'positive' or 'negative', the model can learn which words in a sentence carry positive or negative meaning.

This page uses Google Colab (Colaboratory) to write machine learning code in a web browser, with no special environment setup.

For more on Google Colab, see the Google Colab introduction page.

The steps are as follows.

Preparing the dataset
Exploring the dataset
Tokenizing the review sentences
Building the model
Compiling the model
Training the model

Preparing the dataset¶

The IMDB reviews dataset is a dataset for sentiment analysis: learning to classify text by the sentiment it expresses.

It contains 50,000 movie reviews labeled positive or negative; 25,000 are used for training and 25,000 for testing.

First, install TensorFlow Datasets (TFDS).

!pip install -q tensorflow-datasets

Google Colab already has TensorFlow Datasets installed. If it is not installed in your own environment, install it by running the command above at a command prompt (omit the leading ! outside a notebook).

import tensorflow_datasets as tfds
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

Once TensorFlow Datasets is installed, import the tensorflow_datasets module and load the imdb_reviews dataset as shown above.

The first time this cell runs in Google Colab, TFDS downloads and prepares the dataset; once the download output finishes, the dataset is ready.

Exploring the dataset¶

Now we convert the movie review dataset into a form suitable for training and testing, and print the first example of each split.

import numpy as np

train_data, test_data = imdb['train'], imdb['test']
train_sentences = []
train_labels = []
test_sentences = []
test_labels = []

for s, l in train_data:
  train_sentences.append(str(s.numpy()))
  train_labels.append(l.numpy())

for s, l in test_data:
  test_sentences.append(str(s.numpy()))
  test_labels.append(l.numpy())

print(train_sentences[0])
print(train_labels[0])
print(test_sentences[0])
print(test_labels[0])

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
0
b"There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come."
1

The lists train_sentences and train_labels hold the review texts and labels used for training; test_sentences and test_labels hold those used for testing.

A review is labeled 1 if it is positive and 0 if it is negative.
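One detail of the loop above is worth a small illustration: s.numpy() returns a bytes object, and calling str() on bytes keeps the b prefix and the surrounding quotes, which is why each printed review starts with b". The sketch below compares str() with decode(); the UTF-8 encoding is an assumption about the review text.

```python
# A bytes value like the ones returned by s.numpy() for a review.
raw = b"This was an absolutely terrible movie."

as_str = str(raw)               # keeps the b prefix and the quotes
decoded = raw.decode("utf-8")   # plain text, assuming UTF-8 encoded bytes

print(as_str)
print(decoded)
```

Switching the loop to decode() would give cleaner input text, but it would also change the token sequences printed later on this page, so the code here is left as written. The stray b appears to be why both example sequences begin with the same index.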

train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

np.array() converts the label lists into NumPy arrays.

Tokenizing the review sentences¶

Now we tokenize the sentences and convert them to sequences, as described on the first natural language processing page.

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(train_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length)

print(sequences[0])
print(padded[0])

print(test_sequences[0])
print(test_padded[0])

[59, 12, 14, 35, 439, 400, 18, 174, 29, 1, 9, 33, 1378, 3401, 42, 496, 1, 197, 25, 88, 156, 19, 12, 211, 340, 29, 70, 248, 213, 9, 486, 62, 70, 88, 116, 99, 24, 5740, 12, 3317, 657, 777, 12, 18, 7, 35, 406, 8228, 178, 2477, 426, 2, 92, 1253, 140, 72, 149, 55, 2, 1, 7525, 72, 229, 70, 2962, 16, 1, 2880, 1, 1, 1506, 4998, 3, 40, 3947, 119, 1608, 17, 3401, 14, 163, 19, 4, 1253, 927, 7986, 9, 4, 18, 13, 14, 4200, 5, 102, 148, 1237, 11, 240, 692, 13, 44, 25, 101, 39, 12, 7232, 1, 39, 1378, 1, 52, 409, 11, 99, 1214, 874, 145, 10]
[   0    0   59   12   14   35  439  400   18  174   29    1    9   33
 1378 3401   42  496    1  197   25   88  156   19   12  211  340   29
   70  248  213    9  486   62   70   88  116   99   24 5740   12 3317
  657  777   12   18    7   35  406 8228  178 2477  426    2   92 1253
  140   72  149   55    2    1 7525   72  229   70 2962   16    1 2880
    1    1 1506 4998    3   40 3947  119 1608   17 3401   14  163   19
    4 1253  927 7986    9    4   18   13   14 4200    5  102  148 1237
   11  240  692   13   44   25  101   39   12 7232    1   39 1378    1
   52  409   11   99 1214  874  145   10]
[59, 44, 25, 109, 13, 97, 4115, 16, 742, 4370, 10, 14, 316, 5, 2, 593, 354, 16, 1864, 1212, 1, 16, 680, 7499, 5595, 1, 773, 6, 13, 1037, 1, 1, 439, 491, 1, 4, 1, 334, 3610, 20, 229, 3, 15, 5796, 3, 15, 1646, 15, 102, 5, 2, 3597, 101, 11, 1450, 1528, 12, 251, 235, 11, 216, 2, 377, 6429, 3, 62, 95, 11, 174, 105, 11, 1528, 180, 12, 251, 37, 6, 1144, 1, 682, 7, 4452, 1, 4, 1, 334, 7, 37, 8367, 377, 5, 1420, 1, 13, 30, 64, 28, 6, 874, 181, 17, 4, 1050, 5, 12, 224, 3, 83, 4, 353, 33, 353, 5229, 5, 10, 6, 1340, 1160, 2, 5738, 1, 3, 1, 5, 10, 175, 328, 7, 1319, 3989, 4, 798, 1946, 5, 4, 250, 2710, 158, 3, 2, 361, 31, 187, 25, 1170, 499, 610, 5, 2, 122, 2, 356, 1398, 7725, 30, 1, 881, 38, 4, 20, 39, 12, 1, 4, 1, 334, 7, 4, 20, 634, 60, 48, 214]
[  11 1450 1528   12  251  235   11  216    2  377 6429    3   62   95
   11  174  105   11 1528  180   12  251   37    6 1144    1  682    7
 4452    1    4    1  334    7   37 8367  377    5 1420    1   13   30
   64   28    6  874  181   17    4 1050    5   12  224    3   83    4
  353   33  353 5229    5   10    6 1340 1160    2 5738    1    3    1
    5   10  175  328    7 1319 3989    4  798 1946    5    4  250 2710
  158    3    2  361   31  187   25 1170  499  610    5    2  122    2
  356 1398 7725   30    1  881   38    4   20   39   12    1    4    1
  334    7    4   20  634   60   48  214]

First, we set hyperparameters such as the vocabulary size and the maximum sequence length, then import Tokenizer and the pad_sequences() function.

fit_on_texts builds the word index from the training sentences, and texts_to_sequences converts each sentence into a sequence of integers.
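To make these two steps concrete, here is a simplified pure-Python sketch of the same idea. The helpers build_word_index and texts_to_sequences are hypothetical stand-ins written for this page, not the Keras implementation: they only lowercase and split on whitespace, whereas Tokenizer also filters punctuation. The index scheme (OOV token at 1, then words by descending frequency) follows how Tokenizer behaves when oov_token is set.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then 2, 3, ... by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(ranked, start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index; unknown or too-rare words become OOV."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

train = ["the movie was great", "the movie was terrible", "the acting was great"]
word_index = build_word_index(train)
sequences = texts_to_sequences(["the movie was boring"], word_index, num_words=100)
print(word_index)
print(sequences)
```

The word "boring" never appears in the training sentences, so it is mapped to index 1, just as the 1 entries in the printed IMDB sequences mark out-of-vocabulary words.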

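pad_sequences makes all sequences the same length so that they are suitable for training a neural network. In the output above, padded[0] begins with two zeros because padding is added at the front by default and the first review has only 118 tokens. The test sequences are padded without truncating=trunc_type, so they use the default 'pre' truncation and keep the last 120 tokens of each review.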
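The padding and truncation rules can be sketched for a single sequence in plain Python. The helper pad_one is a hypothetical illustration, not part of Keras; its defaults mirror the pad_sequences defaults (padding='pre', truncating='pre').

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops items from the front, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding before the tokens, 'post' after them.
    return pad + list(seq) if padding == "pre" else list(seq) + pad

short = pad_one([59, 12, 14], maxlen=5)
cut_post = pad_one([1, 2, 3, 4, 5, 6], maxlen=4, truncating="post")
cut_pre = pad_one([1, 2, 3, 4, 5, 6], maxlen=4)
print(short, cut_post, cut_pre)
```

The three calls show front padding of a short sequence, 'post' truncation (keep the beginning), and 'pre' truncation (keep the end).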

Building the model¶

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(6, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           160000
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0
_________________________________________________________________
dense (Dense)                (None, 6)                 11526
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
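The parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the hyperparameters set earlier (vocab_size, embedding_dim, max_length) and the layer sizes in the model definition.

```python
vocab_size, embedding_dim, max_length = 10000, 16, 120

embedding_params = vocab_size * embedding_dim   # one vector per token index
flatten_units = max_length * embedding_dim      # Flatten has no weights
dense_params = flatten_units * 6 + 6            # weights + biases
output_params = 6 * 1 + 1                       # weights + bias

total = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total)
```

Most of the parameters sit in the Embedding layer, which stores one 16-dimensional vector for each of the 10,000 token indices.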

The Embedding layer is the key component for sentiment analysis of text.

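The layer's weights form a 2-D array of shape (vocab_size, embedding_dim). For each review, the embedding output is a 2-D array of shape (max_length, embedding_dim), here (120, 16), as the summary shows.

As in image classification, a Flatten layer converts this 2-D array into a 1-D array, here of length 1920.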
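The shapes involved can be traced with NumPy alone. This is a minimal sketch that uses random numbers in place of trained weights and a random index sequence in place of a real padded review; it only illustrates the row lookup and the flattening, not the Keras layer itself.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 10000, 16, 120
rng = np.random.default_rng(0)

# Stand-in for the layer's weights: one row per token index (random, untrained).
weights = rng.normal(size=(vocab_size, embedding_dim))

# Stand-in for one padded review: 120 token indices in [0, vocab_size).
padded_review = rng.integers(0, vocab_size, size=max_length)

embedded = weights[padded_review]   # row lookup -> shape (120, 16)
flattened = embedded.reshape(-1)    # what Flatten does per example -> (1920,)

print(weights.shape, embedded.shape, flattened.shape)
```

The first 16 values of the flattened array are simply the vector of the first token, followed by the vector of the second token, and so on.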

Compiling the model¶

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The compile() method specifies the loss function and the optimizer, along with the metric to report (accuracy).
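For intuition about what 'binary_crossentropy' and 'accuracy' measure, here is a small pure-Python sketch. The function names, the clipping constant eps, and the example labels and probabilities are assumptions made for illustration; Keras computes these quantities internally on tensors.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.5]
loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 4), acc)
```

A confident correct prediction (0.9 for a positive review) adds little to the loss, while a wrong-side prediction (0.4 for a positive review) adds much more; accuracy only counts which side of 0.5 each prediction falls on.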

Training the model¶

num_epochs = 10
model.fit(padded, train_labels, epochs=num_epochs,
        validation_data=(test_padded, test_labels))

Epoch 1/10
782/782 [==============================] - 5s 7ms/step - loss: 0.4902 - accuracy: 0.7495 - val_loss: 0.3442 - val_accuracy: 0.8478
Epoch 2/10
782/782 [==============================] - 5s 7ms/step - loss: 0.2430 - accuracy: 0.9068 - val_loss: 0.3682 - val_accuracy: 0.8388
Epoch 3/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1016 - accuracy: 0.9715 - val_loss: 0.4751 - val_accuracy: 0.8201
Epoch 4/10
782/782 [==============================] - 5s 6ms/step - loss: 0.0269 - accuracy: 0.9962 - val_loss: 0.5274 - val_accuracy: 0.8249
Epoch 5/10
782/782 [==============================] - 5s 6ms/step - loss: 0.0058 - accuracy: 0.9997 - val_loss: 0.5924 - val_accuracy: 0.8276
Epoch 6/10
782/782 [==============================] - 5s 6ms/step - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.6433 - val_accuracy: 0.8300
Epoch 7/10
782/782 [==============================] - 5s 7ms/step - loss: 9.1601e-04 - accuracy: 1.0000 - val_loss: 0.6944 - val_accuracy: 0.8282
Epoch 8/10
782/782 [==============================] - 5s 6ms/step - loss: 5.0897e-04 - accuracy: 1.0000 - val_loss: 0.7287 - val_accuracy: 0.8299
Epoch 9/10
782/782 [==============================] - 5s 6ms/step - loss: 2.8224e-04 - accuracy: 1.0000 - val_loss: 0.7691 - val_accuracy: 0.8293
Epoch 10/10
782/782 [==============================] - 5s 6ms/step - loss: 1.6944e-04 - accuracy: 1.0000 - val_loss: 0.8078 - val_accuracy: 0.8293

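After training for 10 epochs, the model reaches an accuracy of 1.0000 on the training data and 0.8293 on the test data. The gap between the two suggests that the model is overfitting the training set.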
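The loss columns of the log tell the same story. The snippet below copies the loss and val_loss values from the log above and finds the epoch with the lowest validation loss; it is only a way of reading the printed numbers, not part of the original training code.

```python
# Training and validation loss per epoch, copied from the log above.
train_loss = [0.4902, 0.2430, 0.1016, 0.0269, 0.0058,
              0.0019, 9.1601e-04, 5.0897e-04, 2.8224e-04, 1.6944e-04]
val_loss = [0.3442, 0.3682, 0.4751, 0.5274, 0.5924,
            0.6433, 0.6944, 0.7287, 0.7691, 0.8078]

best_epoch = val_loss.index(min(val_loss)) + 1  # epochs are numbered from 1
train_always_falls = all(a > b for a, b in zip(train_loss, train_loss[1:]))
val_rises_after_best = all(
    a < b for a, b in zip(val_loss[best_epoch - 1:], val_loss[best_epoch:])
)
print(best_epoch, train_always_falls, val_rises_after_best)
```

Training loss keeps falling while validation loss rises from the first epoch onward, which is the usual signature of overfitting.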

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
  # tf.keras.layers.Flatten(),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dense(6, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

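model.summary()

# Compile and train as before (assumed: the log below implies the same settings).
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, train_labels, epochs=num_epochs,
        validation_data=(test_padded, test_labels))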

Epoch 1/10
782/782 [==============================] - 6s 7ms/step - loss: 0.5611 - accuracy: 0.7525 - val_loss: 0.4015 - val_accuracy: 0.8394
Epoch 2/10
782/782 [==============================] - 6s 8ms/step - loss: 0.3441 - accuracy: 0.8585 - val_loss: 0.3357 - val_accuracy: 0.8572
Epoch 3/10
782/782 [==============================] - 5s 6ms/step - loss: 0.2784 - accuracy: 0.8872 - val_loss: 0.3306 - val_accuracy: 0.8571
Epoch 4/10
782/782 [==============================] - 5s 6ms/step - loss: 0.2410 - accuracy: 0.9070 - val_loss: 0.3413 - val_accuracy: 0.8545
Epoch 5/10
782/782 [==============================] - 5s 6ms/step - loss: 0.2142 - accuracy: 0.9189 - val_loss: 0.3589 - val_accuracy: 0.8500
Epoch 6/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1928 - accuracy: 0.9301 - val_loss: 0.3845 - val_accuracy: 0.8443
Epoch 7/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1760 - accuracy: 0.9370 - val_loss: 0.4063 - val_accuracy: 0.8393
Epoch 8/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1605 - accuracy: 0.9434 - val_loss: 0.4356 - val_accuracy: 0.8340
Epoch 9/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1470 - accuracy: 0.9494 - val_loss: 0.4679 - val_accuracy: 0.8316
Epoch 10/10
782/782 [==============================] - 5s 6ms/step - loss: 0.1354 - accuracy: 0.9545 - val_loss: 0.5020 - val_accuracy: 0.8264

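A GlobalAveragePooling1D layer can be used in place of the Flatten layer.

Training can be somewhat faster because the pooled output is much smaller, although the per-epoch times in the logs above are about the same. The final validation accuracy is slightly lower (0.8264 versus 0.8293), while the best validation accuracy, 0.8572 at epoch 2, is higher than in any epoch of the Flatten model.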
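The difference between the two layers can be seen on a single embedded review. This NumPy sketch uses random values as a stand-in for the embedding output and assumes no masking, so pooling is a plain mean over the 120 positions.

```python
import numpy as np

max_length, embedding_dim = 120, 16
rng = np.random.default_rng(0)
embedded = rng.normal(size=(max_length, embedding_dim))  # stand-in for one embedded review

flattened = embedded.reshape(-1)  # Flatten: keep every position -> (1920,)
pooled = embedded.mean(axis=0)    # GlobalAveragePooling1D: average over positions -> (16,)

# Parameters of the following Dense(6) layer in each case (weights + biases).
dense_params_flatten = flattened.size * 6 + 6
dense_params_pooled = pooled.size * 6 + 6
print(flattened.shape, pooled.shape, dense_params_flatten, dense_params_pooled)
```

Pooling discards word order but shrinks the input of the Dense layer from 1920 values to 16, so that layer needs far fewer parameters.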
