[ML] Natural Language Processing (2) – Data Preprocessing with NLTK

In this post we walk through a short hands-on exercise with NLTK, the most widely used natural language processing package for Python.

Environment

OS: macOS Catalina
Python version: Python 3.7.3
pip version: 20.0.2

Steps

Data preparation -> Tokenization -> Text normalization (stop word elimination / stemming)

Installing NLTK

pip3 install nltk

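Data preparation

Download the emma document from the gutenberg corpus, a collection of public-domain literary works available through NLTK's corpus subpackage (fetched here as part of the 'book' collection)
Only the opening portion of the emma document is used in this exercise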

import nltk
nltk.download('book', quiet=True)
from nltk.book import *

emma_raw = nltk.corpus.gutenberg.raw('austen-emma.txt')
# Print the characters at indices 0 through 701
print(emma_raw[:702])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Tokenization

Sentence tokenization
- Splits the whole document into sentences
- Uses sent_tokenize provided by NLTK
Word tokenization
- Splits each sentence into words
- Uses word_tokenize provided by NLTK
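
To make the two levels concrete, here is a rough sketch that uses only the standard library `re` module. It is not how NLTK's tokenizers work internally (they treat abbreviations, quotes and contractions far more carefully). The function names `naive_sent_tokenize` and `naive_word_tokenize` and the sample text are hypothetical, chosen only for illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

# Hypothetical sample text, not taken from the corpus.
text = "Emma was rich. She lived with her father; he was indulgent."

# Level 1: document -> sentences. Level 2: each sentence -> tokens.
sentences = naive_sent_tokenize(text)
tokens = [naive_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

The result is a nested list, one inner list of tokens per sentence, which is the same shape the NLTK-based code in this post produces.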

<Sentence tokenization>

from nltk import sent_tokenize

# Split the first 702 characters of the document into sentences
sentences = sent_tokenize(emma_raw[:702])

print('Number of sentences:', len(sentences))
print()
for i, sentence in enumerate(sentences, start=1):
    print('Sentence {}'.format(i))
    print(sentence, '\n')

Number of sentences: 3

Sentence 1
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her. 

Sentence 2
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. 

Sentence 3
Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

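<Word tokenization>

Word tokenization is usually performed after sentence tokenization.

If the order of words in the document does not matter (as in a bag-of-words approach), you can skip sentence tokenization and apply word tokenization alone.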
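
As a small illustration of that point, the sketch below (standard library only, with a deliberately simple regex tokenizer standing in for `word_tokenize`) shows that when only word counts matter, tokenizing the whole text at once gives the same counts as tokenizing sentence by sentence. The `simple_tokens` helper and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter
from itertools import chain

def simple_tokens(text):
    """Lowercase the text and return alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical sample sentences for the demo.
sample_sentences = ["Her mother had died long ago.", "Her place had been supplied."]

# Route 1: tokenize each sentence, then flatten the nested lists.
per_sentence = [simple_tokens(s) for s in sample_sentences]
flat_counts = Counter(chain.from_iterable(per_sentence))

# Route 2: tokenize the whole document in one pass.
direct_counts = Counter(simple_tokens(" ".join(sample_sentences)))

print(flat_counts == direct_counts)
print(direct_counts.most_common(2))
```

Sentence boundaries only become necessary when a later step needs them, for example when features are computed per sentence.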

from nltk.tokenize import word_tokenize

# Tokenize each sentence into words
word_tokens = [word_tokenize(sentence) for sentence in sentences]

for i in word_tokens:
    print('Number of words:', len(i))
    print(i, '\n')

Number of words: 58
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.'] 

Number of words: 38
['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'s", 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.'] 

Number of words: 44
['Her', 'mother', 'had', 'died', 'too', 'long', 'ago', 'for', 'her', 'to', 'have', 'more', 'than', 'an', 'indistinct', 'remembrance', 'of', 'her', 'caresses', ';', 'and', 'her', 'place', 'had', 'been', 'supplied', 'by', 'an', 'excellent', 'woman', 'as', 'governess', ',', 'who', 'had', 'fallen', 'little', 'short', 'of', 'a', 'mother', 'in', 'affection', '.']

Stop word elimination

Stop word: a word that carries little meaning for analysis (in English, words such as is, the, a that appear frequently in sentences but contribute little)

NLTK provides stop word lists, which you can download with nltk.download('stopwords').

import nltk
nltk.download('stopwords')

print('Number of English stop words:', len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

Number of English stop words: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

Next, we remove stop words from the word-tokenized emma document using the stop word list provided by NLTK.

import nltk
stopwords = nltk.corpus.stopwords.words('english')
all_tokens = []

for sentence in word_tokens:
    filtered_words = []
    for word in sentence:
        # Convert to lowercase so the word matches the lowercase stop word list
        word = word.lower()
        # Keep the word in filtered_words only if it is not a stop word
        if word not in stopwords:
            filtered_words.append(word)
    all_tokens.append(filtered_words)

for i in all_tokens:
    print('Number of words:', len(i))
    print(i, '\n')

Number of words: 37
['[', 'emma', 'jane', 'austen', '1816', ']', 'volume', 'chapter', 'emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'rich', ',', 'comfortable', 'home', 'happy', 'disposition', ',', 'seemed', 'unite', 'best', 'blessings', 'existence', ';', 'lived', 'nearly', 'twenty-one', 'years', 'world', 'little', 'distress', 'vex', '.'] 

Number of words: 19
['youngest', 'two', 'daughters', 'affectionate', ',', 'indulgent', 'father', ';', ',', 'consequence', 'sister', "'s", 'marriage', ',', 'mistress', 'house', 'early', 'period', '.'] 

Number of words: 20
['mother', 'died', 'long', 'ago', 'indistinct', 'remembrance', 'caresses', ';', 'place', 'supplied', 'excellent', 'woman', 'governess', ',', 'fallen', 'little', 'short', 'mother', 'affection', '.']

Words such as i, was, and the have been removed, and the word counts have dropped from 58 to 37, 38 to 19, and 44 to 20.
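
The filtered output above still contains punctuation tokens such as ',' and ';'. One possible refinement, sketched here with the standard library only, is to keep the stop words in a `set` for fast membership tests and to drop tokens that contain no alphanumeric character. The tiny `demo_stopwords` set and the `remove_stopwords` helper are hypothetical stand-ins; with NLTK you would build the set from `nltk.corpus.stopwords.words('english')` instead.

```python
def remove_stopwords(tokens, stopword_set):
    """Lowercase tokens, then drop stop words and pure-punctuation tokens."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token in stopword_set:
            continue  # stop word
        if not any(ch.isalnum() for ch in token):
            continue  # punctuation only, e.g. ',' or ';'
        kept.append(token)
    return kept

# Hypothetical mini stop word list, for the demo only.
demo_stopwords = {"she", "was", "the", "of", "a", "and", "had", "in", "her"}

demo_tokens = ['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters',
               ',', 'and', 'had', 'twenty-one', 'years', '.']

filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

Whether punctuation should be removed depends on the downstream task, so treat this as one option rather than a required step.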

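Stemming / Lemmatization

Words take different forms depending on grammatical factors. In English, a base word changes with tense (past/present), third-person singular agreement, and so on.

As a result, several surface forms share the same underlying meaning, which increases complexity.

Stemming and lemmatization both map an inflected word back to a base form. Using only base forms is a kind of text normalization that reduces this complexity.

Both aim to find the base form of a word, but they differ as follows:

Stemming: applies general or simplified rules, so it tends to produce stems with some letters of the original word chopped off
Lemmatization: takes grammatical and semantic information into account to find a correctly spelled base form, but takes longer to run than stemming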
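
The contrast can be illustrated with a toy sketch (standard library only). The suffix rules and the lookup table below are invented for this example and are far simpler than the Porter algorithm or WordNet. They only show why rule-based stemming can yield truncated forms while dictionary-based lemmatization returns real words.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule, without consulting a dictionary."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table mapping inflected forms to their dictionary form.
TOY_LEMMAS = {"flies": "fly", "flying": "fly", "flew": "fly", "flown": "fly"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

demo_words = ["fly", "flies", "flying", "flew", "flown"]
stems = [toy_stem(w) for w in demo_words]
lemmas = [toy_lemmatize(w) for w in demo_words]

print(stems)
print(lemmas)
```

The rule-based function cannot handle irregular forms such as flew and flown, while the lookup handles them at the cost of needing a dictionary.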

from nltk.stem import PorterStemmer, LancasterStemmer

st1 = PorterStemmer()
st2 = LancasterStemmer()

words = ['fly', 'flies', 'flying', 'flew', 'flown']

# Apply each stemmer to every word form
print("Porter Stemmer :", [st1.stem(w) for w in words])
print("Lancaster Stemmer :", [st2.stem(w) for w in words])

Porter Stemmer : ['fli', 'fli', 'fli', 'flew', 'flown']
Lancaster Stemmer : ['fly', 'fli', 'fly', 'flew', 'flown']

Both stemmers fail to recover the base form fly in several cases.

Next, let's try lemmatization with WordNetLemmatizer.

To extract the base form more accurately, pass the word's part of speech through the pos parameter of lemmatize: 'v' for verbs and 'a' for adjectives. If pos is omitted, the word is treated as a noun.

from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
print([lm.lemmatize(w, pos='v') for w in words])

['fly', 'fly', 'fly', 'fly', 'fly']

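The lemmatizer extracts the base form more accurately than the stemmers above.

We have now finished tokenizing and normalizing the document, but the result cannot be fed to a machine learning algorithm as it is.

Machine learning algorithms generally accept only numeric input, so to apply them to text we need to extract words as features and assign each feature a value, for example its frequency.

The next post covers feature extraction and feature vectorization for text data.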
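
As a brief preview of that idea, the sketch below turns token lists into count vectors using only the standard library. The sample documents, the vocabulary construction and the `to_count_vector` helper are assumptions for illustration, not necessarily the method used in the next post.

```python
from collections import Counter

# Hypothetical token lists, in the same shape as the normalized output above.
docs = [
    ["mother", "died", "long", "ago", "mother", "affection"],
    ["youngest", "two", "daughters", "affection"],
]

# Build a sorted vocabulary so every word has a fixed column index.
vocabulary = sorted({word for doc in docs for word in doc})

def to_count_vector(tokens, vocab):
    """Return the frequency of each vocabulary word in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each document becomes a list of integers of the same length, which is the kind of numeric input a learning algorithm can consume.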
