[Python] Building a Term-Document Matrix from Text Using a CSR Matrix

A sparse matrix is a matrix in which most elements are 0, so the non-zero elements are scattered thinly. A dense matrix is the opposite: most of its elements are non-zero.
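To make the definition concrete, here is a minimal sketch that measures how sparse a small NumPy array is. The array `m` is a made-up example, not data from this post.

```python
import numpy as np

# A small, made-up matrix in which most entries are zero.
m = np.array([[0, 0, 3, 0],
              [0, 0, 0, 0],
              [1, 0, 0, 0]])

nonzero = np.count_nonzero(m)      # number of non-zero entries
sparsity = 1 - nonzero / m.size    # fraction of entries that are zero

print(nonzero, round(sparsity, 3))
```

Here 10 of the 12 entries are zero, so the matrix counts as sparse under the definition above.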

In natural language processing, turning the text of documents into a data structure a computer can work with starts with parsing the text and building a Term-Document matrix (or a Document-Term matrix).

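Documents collectively contain a large number of distinct terms, and any individual term usually occurs in only a few of them, so the Term-Document matrix built by parsing terms out of documents is almost always a sparse matrix.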

In this post we will:

(1) build a CSR matrix (Compressed Sparse Row matrix) from the terms in each document, and

(2) use the CSR matrix to build a Term-Document matrix as a NumPy array.

Note that this post focuses on the path from documents to a document-term CSR matrix and then to a Term-Document matrix as a NumPy array. The detailed text-parsing steps (for example, sentence splitting, lowercasing, and stop-word removal) are omitted.

(For parsing text at the word level and one-hot encoding it, see https://rfriend.tistory.com/444.)
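For orientation only, the sketch below shows one simple way raw strings could be turned into the lists of terms that the rest of this post starts from. The `raw_docs` strings and the `tokenize` helper are hypothetical and not part of the original post; real tokenization usually needs more care.

```python
# Hypothetical raw input; the post itself starts from already-tokenized lists.
raw_docs = ["Python is a programming language.",
            "Programming is fun!",
            "Python is easy."]

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip edge punctuation."""
    tokens = (tok.strip(".,!?;:") for tok in text.lower().split())
    return [tok for tok in tokens if tok]  # drop tokens that were only punctuation

docs = [tokenize(text) for text in raw_docs]
print(docs)
```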

(1) Building a CSR matrix (Compressed Sparse Row matrix) from the terms in each document

First, import NumPy and SciPy.

import numpy as np

from scipy.sparse import csr_matrix

Treating each inner list [] as one document, "docs" below holds 3 documents. We parse the terms out of "docs" to build a vocabulary dictionary, and use a for loop to build the indptr, indices, and data objects needed to construct a Compressed Sparse Row matrix.

For an introduction to CSR matrices, the scipy.sparse.csr_matrix() constructor, and how to convert a sparse NumPy matrix into a SciPy Compressed Sparse Row matrix, see https://rfriend.tistory.com/551.

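# To construct a CSR matrix incrementally

docs = [["python", "is", "a", "programming", "language"],
        ["programming", "is", "fun"],
        ["python", "is", "easy"]]

indptr = [0]      # indptr[i]:indptr[i+1] marks the slice of indices/data for document i
indices = []      # column (term) index of each stored element
data = []         # value of each stored element (1 per term occurrence)
vocabulary = {}   # term -> column index

for d in docs:
    for term in d:
        # reuse the term's index if already seen, otherwise assign the next free index
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    # close the current document's row
    indptr.append(len(indices))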

* reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
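One detail of the loop above is easy to miss: it appends a 1 for every occurrence of a term, so a term repeated inside one document produces duplicate (row, column) entries. The sketch below reruns the same loop on a different toy corpus (the `hello`/`world` example from the SciPy reference page) to show that the duplicates are summed into counts when the matrix is converted to a dense array. The `_dup` variable names are only illustrative.

```python
from scipy.sparse import csr_matrix

# "hello" occurs twice in the first document.
docs_dup = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]

indptr_dup, indices_dup, data_dup, vocab_dup = [0], [], [], {}
for d in docs_dup:
    for term in d:
        indices_dup.append(vocab_dup.setdefault(term, len(vocab_dup)))
        data_dup.append(1)
    indptr_dup.append(len(indices_dup))

# Duplicate entries in a row are added together in the dense result.
counts = csr_matrix((data_dup, indices_dup, indptr_dup), dtype=int).toarray()
print(counts)
```

The first row has a 2 in the `hello` column, so the result is a matrix of term counts rather than a 0/1 indicator matrix.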

Let's print the resulting vocabulary as key : value pairs. The 3 documents contain 7 distinct terms in total. Some terms appear in more than one document, so the 11 tokens collapse into 7 vocabulary entries.

for k, v in vocabulary.items():
    print(k, ':', v)

[Out]

python : 0
is : 1
a : 2
programming : 3
language : 4
fun : 5
easy : 6
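The vocabulary maps term to column index. To label the columns of the matrix later, the inverse mapping is often handy. A minimal sketch, with the vocabulary from the output above written out literally so the block runs on its own; the name `terms` is only illustrative:

```python
# Vocabulary as printed above: term -> column index.
vocabulary = {"python": 0, "is": 1, "a": 2, "programming": 3,
              "language": 4, "fun": 5, "easy": 6}

# Invert it into a list where position i holds the term for column i.
terms = [None] * len(vocabulary)
for term, idx in vocabulary.items():
    terms[idx] = term

print(terms)
```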

Now pass the indptr, indices, and data built above to scipy.sparse.csr_matrix() to create the Compressed Sparse Row matrix (CSR matrix).

term_document_csr_mat = csr_matrix((data, indices, indptr), dtype=int)

term_document_csr_mat

[Out] <3x7 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

print(term_document_csr_mat)

[Out]

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (1, 3)	1
  (1, 1)	1
  (1, 5)	1
  (2, 0)	1
  (2, 1)	1
  (2, 6)	1

print('-- SciPy Compressed Sparse Row matrix --')

print('indptr:', term_document_csr_mat.indptr)

print('indices:', term_document_csr_mat.indices)

print('data:', term_document_csr_mat.data)

-- SciPy Compressed Sparse Row matrix --
indptr: [ 0  5  8 11]
indices: [0 1 2 3 4 3 1 5 0 1 6]
data: [1 1 1 1 1 1 1 1 1 1 1]
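To see how these three arrays encode the matrix, the sketch below rebuilds the dense rows by hand: row i owns the slice `indptr[i]:indptr[i+1]` of `indices` and `data`. It is meant only to illustrate the layout, not to replace SciPy's own conversion. The arrays are copied from the output above, and the number of columns is assumed to be the largest column index plus one.

```python
import numpy as np

# The three CSR arrays printed above.
indptr = np.array([0, 5, 8, 11])
indices = np.array([0, 1, 2, 3, 4, 3, 1, 5, 0, 1, 6])
data = np.ones(11, dtype=int)

n_rows = len(indptr) - 1
n_cols = indices.max() + 1   # assumption: no trailing all-zero columns
dense = np.zeros((n_rows, n_cols), dtype=int)

for row in range(n_rows):
    start, end = indptr[row], indptr[row + 1]
    # np.add.at accumulates, so a column index repeated within a row is summed.
    np.add.at(dense[row], indices[start:end], data[start:end])

print(dense)
```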

(2) Using the CSR matrix to build a Term-Document matrix as a NumPy array

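Next, convert the SciPy CSR (Compressed Sparse Row) matrix built in step (1) into a NumPy array with csr_matrix.toarray(). (csr_matrix.todense() yields the same values but returns a numpy.matrix rather than an ndarray.) In the resulting matrix, each row is a document and each column is a term, ordered by vocabulary index.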

# converting SciPy CSR matrix to NumPy array

term_document_arr = term_document_csr_mat.toarray()  # todense() would return a numpy.matrix instead

term_document_arr

[Out]

array([[1, 1, 1, 1, 1, 0, 0],
       [0, 1, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 1]])
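The array above has one row per document and one column per term. If the terms-by-documents orientation is wanted instead, the matrix can be transposed. The sketch below rebuilds the same CSR matrix from its three arrays so that it runs on its own; the names `doc_term` and `term_doc` are only illustrative. It also shows the difference in return type between toarray() and todense().

```python
from scipy.sparse import csr_matrix

# Same CSR arrays as in step (1).
data = [1] * 11
indices = [0, 1, 2, 3, 4, 3, 1, 5, 0, 1, 6]
indptr = [0, 5, 8, 11]
doc_term = csr_matrix((data, indices, indptr), dtype=int)  # documents x terms

# Transpose to get terms x documents, then densify.
term_doc = doc_term.T.toarray()
print(term_doc.shape)

# toarray() gives an ndarray; todense() gives a numpy.matrix.
print(type(doc_term.toarray()).__name__)
print(type(doc_term.todense()).__name__)
```

Row 1 of `term_doc` corresponds to the term "is", which occurs in all three documents.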

I hope this was helpful.


A Friend for R and Python Analysis and Programming (by R Friend)
