Posts tagged 'Python' (Page 13)

[Python] How to connect to PostgreSQL, GPDB, DB2, and Presto DB from Python

Python Analysis and Programming / Python Installation and Basic Usage 2019. 7. 2. 22:40

This post shows, on Windows 10, how to

(1) connect to PostgreSQL or Greenplum DB from Python and fetch query results

(2) connect to MySQL from Python and fetch query results

(3) connect to IBM DB2 from Python and fetch query results

(4) connect to Presto or Hive from Python and fetch query results

(1) Connecting to PostgreSQL or Greenplum DB from Python and fetching query results

First, install the psycopg2 library from the command prompt.

$ pip install psycopg2

Next, in a Python IDE such as Spyder, define a user-defined function (UDF) that connects to PostgreSQL or Greenplum DB, runs a query, and returns the result as a pandas DataFrame.

( * Reference : PostgreSQL Python: Connect to PostgreSQL Database Server )

[ UDF of connecting to Postgresql, GPDB & Getting query result as a DataFrame ]

def postgresql_query(query):

    import psycopg2 as pg
    import pandas as pd

    # Postgresql, Greenplum DB Connect
    connection_string = "postgresql://{user}:{password}@{host}/{db}".\
        format(user='gpadmin', # put your info
                 password='changeme',
                 host='localhost',
                 db='gpadmin')

    conn = pg.connect(connection_string)
    cursor = conn.cursor()
    #conn.autocommit = True

    # execute a query and get it as a pandas' DataFrame
    cursor.execute(query)
    col_names = [desc[0] for desc in cursor.description]
    result = pd.DataFrame(cursor.fetchall(), columns=col_names)

    cursor.close()
    conn.close()

    return result

Below is a simple example of running a query and fetching the result.

query = """
SELECT * FROM mytable WHERE grp = 'A' LIMIT 100;
"""

grp_A = postgresql_query(query)
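The cursor.execute() / cursor.description / cursor.fetchall() pattern used above is part of the Python DB-API 2.0, so it can be tried without a database server. Below is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for PostgreSQL; the helper name run_query and the sample table are assumptions made for illustration. It also passes the filter value as a query parameter instead of writing it into the SQL string.

```python
import sqlite3
from contextlib import closing


def run_query(conn, query, params=()):
    """Run a query on a DB-API 2.0 connection and return (column names, rows)."""
    # closing() makes sure the cursor is closed even if execute() raises
    with closing(conn.cursor()) as cursor:
        cursor.execute(query, params)
        col_names = [desc[0] for desc in cursor.description]
        rows = cursor.fetchall()
    return col_names, rows


# in-memory stand-in database (hypothetical sample data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, grp TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(1, 'A'), (2, 'B'), (3, 'A')])

# sqlite3 uses '?' placeholders; psycopg2 uses '%s' instead
col_names, rows = run_query(
    conn, "SELECT id, grp FROM mytable WHERE grp = ? ORDER BY id LIMIT 100", ('A',))
conn.close()

print(col_names)
print(rows)
```

With pandas installed, pd.DataFrame(rows, columns=col_names) turns the two return values into the same kind of DataFrame that postgresql_query() returns.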

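(2) Connecting to MySQL from Python and fetching query results

First, install the MySQL Connector/Python library, which provides the mysql.connector module used below, from the command prompt.

$ pip install mysql-connector-python

Next, define a UDF that connects to MySQL, runs a query, and returns the result as a DataFrame.

( * Reference : Connecting to MySQL Using Connector/Python )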

def mysql_query(query):

    import mysql.connector
    import pandas as pd

    cnx = mysql.connector.connect(user='userid',
                                  password='changeme',
                                  host='12.34.567.890',
                                  database='mydb')

    cursor = cnx.cursor()

    # execute a query and get it as a pandas' DataFrame
    cursor.execute(query)
    col_names = [desc[0] for desc in cursor.description]
    result = pd.DataFrame(cursor.fetchall(), columns=col_names)

    cursor.close()
    cnx.close()

    return result

The example below uses the UDF defined above to connect to MySQL and store the query result in a DataFrame named result.

query = """

SELECT * FROM mydb WHERE age >= 20 ORDER BY age;

"""

result = mysql_query(query)

(3) Connecting to IBM DB2 from Python and fetching query results

First, install the ibm_db package, which includes the ibm_db_dbi module used below, from the command prompt.

$ pip install ibm_db

Next, define a UDF that connects to DB2, runs a query, and returns the result as a pandas DataFrame.

( * Reference : Connecting to an IBM database server in Python)

def db2_query(query):

    import ibm_db_dbi as db
    import pandas as pd

    # put your own info; user id and password are given inside the string
    conn = db.connect('DATABASE=mydb;'
                      'HOSTNAME=12.34.567.890;'
                      'PORT=50000;'
                      'PROTOCOL=TCPIP;'
                      'UID=secret;'
                      'PWD=changeme;', '', '')

    cursor = conn.cursor()
    cursor.execute(query)
    col_names = [desc[0] for desc in cursor.description]
    result = pd.DataFrame(cursor.fetchall(), columns=col_names)

    cursor.close()
    conn.close()

    return result
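Connection strings of the 'KEY=value;' form are easy to mistype by hand, for example with stray spaces around '='. Below is a minimal sketch of a small helper that assembles such a string from keyword arguments. The name build_db2_dsn is hypothetical and not part of ibm_db; the values are the placeholder values from the function above.

```python
def build_db2_dsn(**params):
    """Join keyword arguments into a 'KEY=value;' connection string."""
    # keyword order is preserved, keys are upper-cased, no spaces are added
    return ''.join('{}={};'.format(key.upper(), value)
                   for key, value in params.items())


dsn = build_db2_dsn(database='mydb', hostname='12.34.567.890', port=50000,
                    protocol='TCPIP', uid='secret', pwd='changeme')
print(dsn)
```

The resulting string could then be passed as the first argument of the connect() call shown above.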

An example of running a query from Python and getting the result as a pandas DataFrame is shown below.

query = """
SELECT school_nm, count(*) as student_cnt
FROM school
WHERE school_nm LIKE 'seoul%'
GROUP BY school_nm;
"""

school = db2_query(query)

(4) Connecting to Presto or Hive from Python and fetching query results

First, install the pyhive library from the command prompt.

$ pip install pyhive

Next, define a UDF that connects to Presto or Hive, runs a query, and returns the result as a pandas DataFrame.

( * Reference : PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive )

def presto_query(query):

    from pyhive import presto
    import pandas as pd

    cursor = presto.connect('12.34.567.890').cursor()

    # execute a query and get a result as a DataFrame
    cursor.execute(query)
    col_names = [desc[0] for desc in cursor.description]
    result = pd.DataFrame(cursor.fetchall(), columns=col_names)

    cursor.close()

    return result

The example below runs a query from Python with the UDF above and gets the result as a DataFrame.

query = """

WITH

t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),

t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)

SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.a = t2.a;

"""

result = presto_query(query)

If pip install fails with an error such as 'error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/', download and install Microsoft Visual C++ from the site given in the message.

I hope this was helpful.


Posted by Rfriend


[Python pandas] Splitting a string column of a DataFrame and making a new column from part of it

Python Analysis and Programming / Python Data Preprocessing 2019. 7. 1. 23:53

This post introduces two ways to split a string column of a pandas DataFrame on a separator (delimiter) and add part of the split result to the DataFrame as a new column.

(1) Splitting a string column of a pandas DataFrame using vectorization

(2) Splitting a string column of a pandas DataFrame using a for loop

Python pandas DataFrame: Split string column and make a new column using part of it.

(1) Splitting a string column of a pandas DataFrame using vectorization (fast ^^)

Let's create an example DataFrame with two columns, a string column 'id' and a numeric column 'val'. Then we split the 'id' column on the separator '_' with the str.split('_') method and take the first part ([0]) to create a new column named 'grp'.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A_001', 'A_002', 'A_003', 'B_001', 'C_001', 'C_002'],
                   'val': np.arange(6)})

print(df)

      id  val
0  A_001    0
1  A_002    1
2  A_003    2
3  B_001    3
4  C_001    4
5  C_002    5

# 1. vectorization
df['grp'] = df.id.str.split('_').str[0]

print(df)

      id  val grp
0  A_001    0   A
1  A_002    1   A
2  A_003    2   A
3  B_001    3   B
4  C_001    4   C
5  C_002    5   C

To get a list instead, call the tolist() method on the split result.

# tolist()
grp_list = df.id.str.split('_').str[0].tolist()
print(grp_list)

['A', 'A', 'A', 'B', 'C', 'C']
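To see what the vectorized call computes for each element, the same split can be written in plain Python without pandas. This is only an illustration of the per-element logic using the example ids above, not a replacement for the vectorized method.

```python
ids = ['A_001', 'A_002', 'A_003', 'B_001', 'C_001', 'C_002']

# element-wise equivalent of taking the first part after split('_')
grp_list = [s.split('_')[0] for s in ids]

# str.partition() splits at the first separator only and returns
# (head, separator, tail); [::2] keeps the head and the tail
head_tail = [s.partition('_')[::2] for s in ids]

print(grp_list)
print(head_tail[:2])
```

str.partition() can be handy when an id may contain more than one '_' and only the first one should act as the separator.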

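(2) Splitting a string column of a pandas DataFrame using a for loop (slow -_-;;;)

The second way uses a for loop to split row by row, taking the first part and filling the 'grp' column one row at a time. Compared with the vectorized approach in (1), which processes the whole column at once, the for loop in (2) takes considerably longer. The difference is hard to notice on a small dataset, but it becomes very noticeable on data with millions to tens of millions of rows.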

# 2. for loop
df = pd.DataFrame({'id': ['A_001', 'A_002', 'A_003', 'B_001', 'C_001', 'C_002'],
                   'val': np.arange(6)})

for i in range(df.shape[0]):
    df.loc[i, 'grp'] = str(df.loc[i, 'id']).split('_')[0]

print(df)

      id  val grp
0  A_001    0   A
1  A_002    1   A
2  A_003    2   A
3  B_001    3   B
4  C_001    4   C
5  C_002    5   C

I hope this was helpful.


Attribution, NonCommercial, NoDerivatives


[Python] Parsing text into words and one-hot encoding it (parsing text and one-hot encoding at word-level)

Deep Learning (TF, Keras, PyTorch)/Natural Language Processing 2019. 5. 22. 00:32

The first step in text analysis is preprocessing documents and text into a form suitable for analysis.

In this post we will (1) parse text into words using Python string methods (parsing text at word-level) and build an index for each word, and then (2) one-hot encode the text at the word level.

one-hot encoding of text at a word-level

1. Parsing text into words with Python string methods and building a token index for each word

The example text is the English introduction to Python taken from Wikipedia.


# import modules
import numpy as np
import os

# set directory
base_dir = '/Users/ihongdon/Documents/Python/dataset'
file_name = 'python_wikipedia.txt'
path = os.path.join(base_dir, file_name)

# open file and print it as an example
file_opened = open(path)
for line in file_opened.readlines():
    print(line)

Python programming language, from wikipedia

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.[26]

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[27]

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3. Due to concern about the amount of code written for Python 2, support for Python 2.7 (the last release in the 2.x series) was extended to 2020. Language developer Guido van Rossum shouldered sole responsibility for the project until July 2018 but now shares his leadership as a member of a five-person steering council.[28][29][30]

Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source[31] reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.
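The file above is opened with open() and never closed explicitly. A minimal sketch of the same read using a with block, which closes the file automatically, is shown below. The temporary file and its contents are stand-ins made up for this sketch so that it runs without python_wikipedia.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # stand-in for python_wikipedia.txt (hypothetical contents)
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\n\nsecond line\n')

    # the file is closed automatically when the with block ends
    with open(path, encoding='utf-8') as file_opened:
        # each line keeps its trailing newline, so strip it before use
        lines = [line.rstrip('\n') for line in file_opened]

print(lines)
```

Because each line read from a file still ends with a newline character, print(line) in the loop above also prints an extra empty line after every line.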

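Below is an example UDF that preprocesses the words parsed from the text using Python string methods. It is a basic example that applies, in order, lowercasing, stop-word removal, symbol removal, and number removal. (The well-defined functions in Python text-analysis modules can of course be used for this as well. ^^;)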

# UDF of word preprocessing
def word_preprocess(word):
    # lower case
    word = word.lower()
        
    # remove stop-words
    stop_words = ['a', 'an', 'the', 'in', 'with', 'to', 'for', 'from', 'of', 'at', 'on',
                  'until', 'by', 'and', 'but', 'is', 'are', 'was', 'were', 'it', 'that', 'this', 
                  'my', 'his', 'her', 'our', 'as', 'not'] # make your own list
    if word in stop_words:
        word = ''
    
    # remove symbols such as comma, period, etc.
    symbols = [',', '.', ':', '-', '+', '/', '*', '&', '%', '[', ']', '(', ')'] # make your own list
    for symbol in symbols:
        word = word.replace(symbol, '')
    
    # remove numbers
    if word.isnumeric():
        word = ''
    
    return word

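Next, we open the python_wikipedia.txt file (open), read it line by line (readlines), strip leading and trailing whitespace (strip), split each line into words (split), preprocess each word with the word_preprocess() UDF defined above, and store each word in the token_idx dictionary with the word as the key and its index as the value.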

# blank dictionary to store
token_idx = {}

# opening the file
file_opened = open(path)

# catching words and storing the index at token_idx dictionary
for line in file_opened.readlines():
    # strip leading and trailing edge spaces
    line = line.strip()
        
    # split the line into word with a space delimiter
    for word in line.split():
        
        word = word_preprocess(word) # UDF defined above
        
        # put word into token_index
        if word not in token_idx:
            if word != '':
                token_idx[word] = len(token_idx) + 1

The resulting token_idx dictionary, with words as keys and indices as values, looks like this.

token_idx

{'"batteries': 48,
'1980s': 56,
'2x': 87,
'abc': 58,
'about': 80,
'aims': 28,
'amount': 81,
'approach': 27,
'available': 104,
'backwardcompatible': 74,
'capable': 67,
'clear': 32,
'code': 18,

.... (middle omitted) ....

'successor': 57,
'support': 83,
'supports': 40,
'system': 66,
'systems': 107,
'the': 84,
'typed': 38,
'unmodified': 78,
'use': 22,
'van': 10,
'whitespace': 24,
'wikipedia': 4,
'write': 31,
'written': 82}

token_idx.values()

dict_values([104, 96, 102, 112, 68, 111, 21, 18, 8, 15, 20, 47, 37, 16, 74, 89, 57, 117, 19, 93, 83, 76, 91, 43, 30, 32, 54, 33, 35, 98, 64, 80, 17, 34, 10, 61, 50, 46, 49, 23, 72, 67, 119, 95, 14, 3, 116, 81, 85, 1, 99, 51, 77, 38, 90, 118, 120, 100, 101, 9, 39, 12, 123, 84, 122, 69, 26, 115, 88, 13, 36, 60, 5, 6, 75, 103, 66, 94, 78, 97, 121, 55, 108, 109, 58, 4, 82, 41, 79, 87, 29, 106, 114, 113, 105, 73, 45, 71, 24, 2, 53, 31, 86, 11, 22, 42, 59, 7, 110, 40, 56, 70, 92, 28, 27, 48, 62, 44, 107, 65, 25, 52, 63])

There are 123 words in total, and the word 'python' is registered in token_idx with index 1.

max(token_idx.values())

123

token_idx.get('python')

1
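The index-building loop from this section can be tried on a small in-memory text. Below is a minimal sketch; the two-line sample text and the short stop-word set are assumptions made for this illustration, and dict.setdefault() is used in place of the explicit "if word not in token_idx" check.

```python
text = """Python is an interpreted language
Python is dynamically typed"""

stop_words = {'is', 'an'}  # hypothetical short list for this sketch
token_idx = {}

for line in text.splitlines():
    for word in line.strip().split():
        word = word.lower()
        if word in stop_words:
            continue
        # setdefault() stores the next index only when the word is new
        token_idx.setdefault(word, len(token_idx) + 1)

print(token_idx)
```

Indices start at 1, as in the code above, so the largest index equals the number of distinct words.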

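2. One-hot encoding text at the word level

We set max_len = 40 as the maximum number of words to consider in each line of text (words from the 41st onward in a line are ignored). We also create an empty multi-dimensional array, one_hot_encoded, with np.zeros() to store the one-hot encoding result.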

# consider only the first max_length words in texts            
max_len = 40

# array to store the one_hot_encoded results
file_opened = open(path)

one_hot_encoded = np.zeros(shape=(len(file_opened.readlines()), 
                                  max_len, 
                                  max(token_idx.values())+1))

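one_hot_encoded is a multi-dimensional array of shape (5, 40, 124): 5 lines of text, at most 40 words per line (max_len), and 124 token index slots (indices 0 to 123, where index 0 is never used because the token indices start at 1). For each word position, the slot of the matching token index is set to '1' and all other slots stay '0'.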

one_hot_encoded.shape

(5, 40, 124)

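The code below opens the file, reads the text line by line, splits and preprocesses the words of each line in a for loop, looks up the token index of each word with token_idx.get(word), and updates the one_hot_encoded array by writing '1' at the position given by the text (i), the word (j), and the token index (idx).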

file_opened = open(path)
for i, line in enumerate(file_opened.readlines()):
    # strip leading and trailing edge spaces
    line = line.strip()
    
    for j, word in list(enumerate(line.split()))[:max_len]:
        
        # preprocess the word
        word = word_preprocess(word)
        
        # put word into token_index
        if word != '':
            idx = token_idx.get(word)
            one_hot_encoded[i, j, idx] = 1.

The resulting one_hot_encoded array looks like this.

one_hot_encoded

array([[[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

type(one_hot_encoded)

numpy.ndarray

To make this easier to follow, the code below prints, for the first 40 words of the first line of python_wikipedia.txt, how token indices 1 to 10 were one-hot encoded, together with each word and its token_idx. (It is hard to explain in words. T_T)

# sort token_idx dictionary by value
import operator
sorted_token_idx = sorted(token_idx.items(), key=operator.itemgetter(1))

# print out 10 words & token_idx of 1st text's 40 words as an example
for i in range(10):
    print('word & token_idx:', sorted_token_idx[i])
    print(one_hot_encoded[0, :, i+1])

word & token_idx: ('python', 1)
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('programming', 2)
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('language', 3)
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('wikipedia', 4)
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('interpreted', 5)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('highlevel', 6)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('generalpurpose', 7)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('created', 8)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('guido', 9)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('van', 10)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
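The encoding loop can also be reproduced on a tiny example with plain nested lists, which makes the three axes (text, word position, token index) easy to inspect. This is a sketch only: the two sample lines, the small token_idx, and max_len = 4 are assumptions for illustration, and the "idx is not None" guard is an addition here because dict.get() returns None for a word that is not in the index.

```python
lines = ["python programming language", "python language"]
token_idx = {'python': 1, 'programming': 2, 'language': 3}
max_len = 4  # hypothetical small limit for this sketch

vocab_size = max(token_idx.values()) + 1  # slot 0 stays unused
# nested lists with shape (number of lines, max_len, vocab_size)
one_hot = [[[0.0] * vocab_size for _ in range(max_len)] for _ in lines]

for i, line in enumerate(lines):
    for j, word in enumerate(line.split()[:max_len]):
        idx = token_idx.get(word)
        if idx is not None:  # skip words that are not in the index
            one_hot[i][j][idx] = 1.0

print(one_hot[0][1])  # second word of the first line
```

Word positions beyond the end of a line keep all-zero vectors, which is why most rows in the printed array above contain only zeros.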

I hope this was helpful.


[Python] Reading a text file and standardizing numeric data (reading csv or text file, standardizing or normalizing of numeric data)

Python Analysis and Programming / Python Data Preprocessing 2019. 5. 21. 22:29

In this post we will (1) open and parse a text file saved in text or csv format using Python string methods and store it as a matrix, and (2) write UDFs that standardize or normalize the numeric data.

The example text file is abalone.txt, which records the sex of abalones along with their length, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, and rings.


1. Reading a text file into a matrix of numeric values and a vector of labels

Of course, text and csv files can be read conveniently with the read_csv() function of the pandas module.

# importing modules
import numpy as np
import pandas as pd
import os

# setting directory
base_dir = '/Users/ihongdon/Documents/Python'
work_dir = 'dataset'
path = os.path.join(base_dir, work_dir)

# reading text file using pandas read_csv() function
df = pd.read_csv(os.path.join(path, 'abalone.txt'), 
                 sep=',', 
                 names=['sex', 'length', 'diameter', 'height', 'whole_weight', 
                        'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'], 
                 header=None)
                 
# check first 5 lines
df.head()
sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

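Instead of the pandas function above, below is a simple hand-written UDF that opens and parses the file using Python string methods.

Based on the layout of abalone.txt shown above, the UDF takes the file name, the number of numeric columns, the start position of the numeric columns, the end position of the numeric columns, and the position of the label column as arguments. By adjusting the arguments and the function body to each dataset you analyze, you can build a UDF that loads files more flexibly and in a way that fits the characteristics of the data.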

def file2matrix(filename, val_col_num, val_col_st_idx, val_col_end_idx, label_idx):
    """
    - filename: directory and file name
    - val_col_num: the number of columns which contains numeric values
    - val_col_st_idx: the index of starting column which contains numeric values
    - val_col_end_idx: the index of ending column which contains numeric values
    - label_idx: the index of label column
    """
    # open file
    file_opened = open(filename)
    lines_num = len(file_opened.readlines())
    
    # blank matrix and vector to store
    matrix_value = np.zeros((lines_num, val_col_num))
    vector_label = []
    
    # splits and appends value and label using for loop statement
    file_opened = open(filename)
    idx = 0
    for line in file_opened.readlines():
        # removes leading and trailing whitespace
        line = line.strip()
        
        # splits string according to delimiter str
        list_from_line = line.split(sep=',')
        
        # appends value to matrix and label to vector
        matrix_value[idx, :] = list_from_line[val_col_st_idx : (val_col_end_idx+1)]
        vector_label.append(list_from_line[label_idx])
        idx += 1
        
    return matrix_value, vector_label

For Python string methods, see https://rfriend.tistory.com/327.

Let's use the file2matrix() UDF above to read abalone.txt and return (a) matrix_value and (b) vector_label.

# run file2matrix() UDF
matrix_value, vector_label = file2matrix(os.path.join(path, 'abalone.txt'), 8, 1, 8, 0)

#--- matrix_value
# type
type(matrix_value)
numpy.ndarray

# shape
matrix_value.shape
(4177, 8)

# samples
matrix_value[:3]
array([[ 0.455 ,  0.365 ,  0.095 ,  0.514 ,  0.2245,  0.101 ,  0.15  , 15.    ],
       [ 0.35  ,  0.265 ,  0.09  ,  0.2255,  0.0995,  0.0485,  0.07  ,  7.    ],
       [ 0.53  ,  0.42  ,  0.135 ,  0.677 ,  0.2565,  0.1415,  0.21  ,  9.    ]])
       
#--- vector_label
# type
type(vector_label)
list

# number of labels
len(vector_label)
4177

# samples
vector_label[:3]
['M', 'M', 'F']
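The same value/label split can be sketched with the standard-library csv module, which handles the delimiter and also copes with blank lines when they are skipped explicitly. The helper name parse_rows is hypothetical; the two sample rows are the first two rows of the df.head() output shown earlier, read from an in-memory string instead of abalone.txt.

```python
import csv
import io

# first two rows of the abalone data shown above, as an in-memory file
sample = ("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
          "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n")


def parse_rows(file_obj, val_col_st_idx, val_col_end_idx, label_idx):
    """Split each row into a list of float values and a label."""
    values, labels = [], []
    for row in csv.reader(file_obj):
        if not row:  # skip blank lines
            continue
        values.append([float(x) for x in row[val_col_st_idx:val_col_end_idx + 1]])
        labels.append(row[label_idx])
    return values, labels


values, labels = parse_rows(io.StringIO(sample), 1, 8, 0)
print(values[0])
print(labels)
```

Unlike file2matrix(), this sketch reads the input only once, because it appends to lists instead of pre-allocating a matrix that needs the line count in advance.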

2-1. Standardizing numeric data (Standardization)

Let's write UDFs that standardize and normalize the numeric matrix_value above using numpy. (Of course, zscore() from scipy.stats or StandardScaler() from sklearn.preprocessing can be used as well.)

The UDF below takes a dataset of numeric data, computes the mean and standard deviation of each column, and standardizes the values as standardized_value = (x - mean) / standard_deviation. It returns the standardized matrix together with the mean and standard deviation of each column.

def standardize(numeric_dataset):

    # standardized_value = (x - mean)/ standard_deviation
    
    # calculate mean and standard deviation per numeric columns
    mean_val = numeric_dataset.mean(axis=0)
    std_dev_val = numeric_dataset.std(axis=0)
    
    # standardization
    matrix_standardized = (numeric_dataset - mean_val)/ std_dev_val
    
    return matrix_standardized, mean_val, std_dev_val

Let's standardize the matrix_value array with the standardize() function above.

# run standardize() UDF
matrix_standardized, mean_val, std_dev_val = standardize(matrix_value)

# matrix after standardization
matrix_standardized
array([[-0.57455813, -0.43214879, -1.06442415, ..., -0.72621157,
        -0.63821689,  1.57154357],
       [-1.44898585, -1.439929  , -1.18397831, ..., -1.20522124,
        -1.21298732, -0.91001299],
       [ 0.05003309,  0.12213032, -0.10799087, ..., -0.35668983,
        -0.20713907, -0.28962385],
       ...,
       [ 0.6329849 ,  0.67640943,  1.56576738, ...,  0.97541324,
         0.49695471, -0.28962385],
       [ 0.84118198,  0.77718745,  0.25067161, ...,  0.73362741,
         0.41073914,  0.02057072],
       [ 1.54905203,  1.48263359,  1.32665906, ...,  1.78744868,
         1.84048058,  0.64095986]])
 
# mean per columns
mean_val
array([0.5239921 , 0.40788125, 0.1395164 , 0.82874216, 0.35936749,
       0.18059361, 0.23883086, 9.93368446])

# standard deviation per columns
std_dev_val
array([0.12007854, 0.09922799, 0.04182205, 0.49033031, 0.22193638,
       0.10960113, 0.13918601, 3.22378307])
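As a quick check of the formula, the standardization can be worked through on a single small column with the standard-library statistics module. The sketch below uses the 'rings' values of the first five rows shown earlier. numpy's std() divides by n by default (population standard deviation), which corresponds to statistics.pstdev(), not to statistics.stdev().

```python
import statistics

rings = [15, 7, 9, 10, 7]  # 'rings' of the first five rows shown earlier

mean_val = statistics.mean(rings)
std_pop = statistics.pstdev(rings)    # divides by n, like numpy's default std()
std_sample = statistics.stdev(rings)  # divides by n - 1

# standardized_value = (x - mean) / standard_deviation
z = [(x - mean_val) / std_pop for x in rings]

print(mean_val)
print([round(v, 4) for v in z])
```

After standardization the column has mean 0 and (population) standard deviation 1, up to floating-point rounding.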

2-2. Normalizing numeric data (Normalization)

Next, let's normalize numeric data with different scales and ranges into values in [0, 1]. It is computed as normalized_value = (x - minimum_value) / (maximum_value - minimum_value).

def normalize(numeric_dataset):
    
    # normalized_value = (x - minimum_value) / (maximum_value - minimum_value)
    
    # calculate minimum, maximum and range per numeric columns
    min_val = numeric_dataset.min(axis=0)
    max_val = numeric_dataset.max(axis=0)
    ranges = max_val - min_val
    
    # normalization, min_max_scaling
    matrix_normalized = (numeric_dataset - min_val)/ ranges
    
    return matrix_normalized, ranges, min_val

Let's apply the normalize() UDF above to the matrix_value array. It returns the normalized array, the ranges (range = max_val - min_val), and the minimum values together.

# run normalize() UDF
matrix_normalized, ranges, min_val = normalize(matrix_value)

# normalized matrix
matrix_normalized
array([[0.51351351, 0.5210084 , 0.0840708 , ..., 0.1323239 , 0.14798206,
        0.5       ],
       [0.37162162, 0.35294118, 0.07964602, ..., 0.06319947, 0.06826109,
        0.21428571],
       [0.61486486, 0.61344538, 0.11946903, ..., 0.18564845, 0.2077728 ,
        0.28571429],
       ...,
       [0.70945946, 0.70588235, 0.18141593, ..., 0.37788018, 0.30543099,
        0.28571429],
       [0.74324324, 0.72268908, 0.13274336, ..., 0.34298881, 0.29347285,
        0.32142857],
       [0.85810811, 0.84033613, 0.17256637, ..., 0.49506254, 0.49177877,
        0.39285714]])
        
# ranges
ranges
array([ 0.74  ,  0.595 ,  1.13  ,  2.8235,  1.487 ,  0.7595,  1.0035,  28.    ])

# minimum value
min_val
array([7.5e-02, 5.5e-02, 0.0e+00, 2.0e-03, 1.0e-03, 5.0e-04, 1.5e-03, 1.0e+00])
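The min-max formula divides by the range, so a column in which every value is the same would cause a division by zero. Below is a minimal pure-Python sketch for a single column with a guard for that case. The function name min_max_scale and the choice to map a constant column to 0.0 are assumptions made for this illustration.

```python
def min_max_scale(column):
    """Scale a list of numbers into [0, 1] with min-max normalization."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # constant column: map every value to 0.0 (an assumption of this sketch)
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


rings = [15, 7, 9, 10, 7]  # 'rings' of the first five rows shown earlier
scaled = min_max_scale(rings)
print(scaled)

# worked check against the output above: rings = 15, minimum 1, range 28
first_row_rings = (15 - 1.0) / 28.0
print(first_row_rings)
```

The worked check reproduces the value 0.5 in the last column of the first row of matrix_normalized.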

The next post will show how to parse a text file and one-hot encode it.

I hope this was helpful.


[Keras] How to upload, preprocess and visualize image files (how to upload, preprocess and visualize images)

Deep Learning (TF, Keras, PyTorch) 2019. 3. 5. 23:51

When modeling image classification with a CNN (Convolutional Neural Network), you have probably practiced on toy projects using functions that simply load images already bundled with the tensorflow or keras libraries, such as MNIST or CIFAR-10.

When you have to analyze real image files, however, you may well have wondered: 'How do I load the images, how do I preprocess them, and how do I visualize them?'

This post answers that question.

Let's import the required Python libraries.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import keras

1. Downloading dog and cat pictures (download dogs and cats images from Kaggle)

Download the dog and cat pictures from the Kaggle site below. You need to sign up for Kaggle before you can download them. You can download 25,000 images, labeled 1 for dogs and 0 for cats.

https://www.kaggle.com/c/dogs-vs-cats/data

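2. Selecting only 30 of the dog and cat images and copying them to a separate path (folder)

Unzip the downloaded archive in the downloads folder.

Previewing it in Windows Explorer shows half cats and half dogs.

Let's import the os library, needed for managing directories and paths, and the shutil library, needed for copying files from a source path to a destination path.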

import os # miscellaneous operating system interfaces

import shutil # high-level file operations

Let's set the path to load the images from. (The train folder was unzipped to 'Downloads/dogs-vs-cats/train'. Check the folder path on your own machine.)

# The path to the directory where the original dataset was uncompressed

base_dir = '/Users/admin/Downloads'

img_dir = '/Users/admin/Downloads/dogs-vs-cats/train'

We confirmed that the train folder contains 25,000 dog and cat images in total, and checked the file names of the first 10 entries listed in the img_dir path.

len(os.listdir(img_dir))

25000

os.listdir(img_dir)[:10]

['dog.8011.jpg',
 'cat.5077.jpg',
 'dog.7322.jpg',
 'cat.2718.jpg',
 'cat.10151.jpg',
 'cat.3406.jpg',
 'dog.1753.jpg',
 'cat.4369.jpg',
 'cat.7660.jpg',
 'dog.5535.jpg']

Let's pick just 30 images as a sample and copy them to another folder. First, let's create a path/folder (cats30_dir) to hold the 30 cat images.

# Directory with 30 cat pictures

cats30_dir = os.path.join(base_dir, 'cats30')

# Make a path directory

os.mkdir(cats30_dir)

Now let's copy just 30 cat images from the source path to the destination path with the shutil.copyfile(src, dst) function.

# Copy first 30 cat images to cats30_dir

fnames = ['cat.{}.jpg'.format(i) for i in range(30)]

for fname in fnames:
    src = os.path.join(img_dir, fname)
    dst = os.path.join(cats30_dir, fname)
    shutil.copyfile(src, dst)

We checked the list of the 30 cat image files copied to the cats30_dir path.

# check if pictures were copied well in cats30 directory

os.listdir(cats30_dir)

['cat.6.jpg',
 'cat.24.jpg',
 'cat.18.jpg',
 'cat.19.jpg',
 'cat.25.jpg',
 'cat.7.jpg',
 'cat.5.jpg',
 'cat.27.jpg',
 'cat.26.jpg',
 'cat.4.jpg',
 'cat.0.jpg',
 'cat.22.jpg',
 'cat.23.jpg',
 'cat.1.jpg',
 'cat.3.jpg',
 'cat.21.jpg',
 'cat.20.jpg',
 'cat.2.jpg',
 'cat.11.jpg',
 'cat.10.jpg',
 'cat.12.jpg',
 'cat.13.jpg',
 'cat.9.jpg',
 'cat.17.jpg',
 'cat.16.jpg',
 'cat.8.jpg',
 'cat.28.jpg',
 'cat.14.jpg',
 'cat.15.jpg',
 'cat.29.jpg']
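The copy step can be rehearsed without the Kaggle download. The sketch below builds a few dummy files in a temporary directory and copies a subset of them. The folder name cats3, the file contents, and the counts are made up for this illustration. It uses os.makedirs(..., exist_ok=True), which does not raise an error when the folder already exists (os.mkdir() does), and sorted(), because os.listdir() returns names in arbitrary order, as the listing above shows.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    # dummy stand-ins for the downloaded images (hypothetical contents)
    img_dir = os.path.join(base_dir, 'train')
    os.makedirs(img_dir)
    for i in range(5):
        with open(os.path.join(img_dir, 'cat.{}.jpg'.format(i)), 'wb') as f:
            f.write(b'dummy')

    # exist_ok=True: re-running this does not fail if the folder exists
    sample_dir = os.path.join(base_dir, 'cats3')
    os.makedirs(sample_dir, exist_ok=True)

    # copy the first 3 files from the source to the destination folder
    for i in range(3):
        fname = 'cat.{}.jpg'.format(i)
        shutil.copyfile(os.path.join(img_dir, fname),
                        os.path.join(sample_dir, fname))

    copied = sorted(os.listdir(sample_dir))

print(copied)
```

Note that sorted() orders the names as strings, so with more files 'cat.10.jpg' would come before 'cat.2.jpg'.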

3. Loading an image file, converting it to a float array, and preprocessing it
(load image file and convert image data to float array format)

After importing the image module from Keras preprocessing, let's load an image file with the load_img() function and convert it to an array with the img_to_array() function. (This can also be done with the Python OpenCV library.)

# a picture of one cat as an example

img_name = 'cat.10.jpg'

img_path = os.path.join(cats30_dir, img_name)

# Preprocess the image into a 4D tensor using keras.preprocessing

from keras.preprocessing import image

img = image.load_img(img_path, target_size=(250, 250))

img_tensor = image.img_to_array(img)

Let's add one dimension to the 3D array with the np.expand_dims() function so that individual image samples can be distinguished. Then we scaled the array by dividing it by 255. so that all values fall within the range [0, 1].

# expand a dimension (3D -> 4D)

img_tensor = np.expand_dims(img_tensor, axis=0)

img_tensor.shape

 (1, 250, 250, 3)

# scaling into [0, 1]

img_tensor /= 255.

Printing the array data of the first cat image gives the output below. It looks just like the scene in the movie The Matrix where numbers stream down the screen.

img_tensor[0]

array([[[0.10196079, 0.11764706, 0.15294118],
        [0.07450981, 0.09019608, 0.1254902 ],
        [0.03137255, 0.04705882, 0.09019608],
        ...,
        [0.5058824 , 0.6313726 , 0.61960787],
        [0.49411765, 0.61960787, 0.60784316],
        [0.49019608, 0.6156863 , 0.6039216 ]],

       [[0.11764706, 0.13333334, 0.16862746],
        [0.13725491, 0.15294118, 0.1882353 ],
        [0.08627451, 0.10196079, 0.13725491],
        ...,
        [0.50980395, 0.63529414, 0.62352943],
        [0.49803922, 0.62352943, 0.6117647 ],
        [0.4862745 , 0.6117647 , 0.6       ]],

       [[0.11372549, 0.14117648, 0.16470589],
        [0.16470589, 0.19215687, 0.22352941],
        [0.15294118, 0.18039216, 0.21176471],
        ...,
        [0.50980395, 0.63529414, 0.62352943],
        [0.5019608 , 0.627451  , 0.6156863 ],
        [0.49019608, 0.6156863 , 0.6039216 ]],

       ...,

       [[0.69411767, 0.6431373 , 0.46666667],
        [0.6862745 , 0.63529414, 0.45882353],
        [0.6627451 , 0.6117647 , 0.4392157 ],
        ...,
        [0.7254902 , 0.70980394, 0.04313726],
        [0.6745098 , 0.6509804 , 0.03921569],
        [0.64705884, 0.6156863 , 0.05490196]],

       [[0.64705884, 0.5921569 , 0.45490196],
        [0.6117647 , 0.5568628 , 0.4117647 ],
        [0.5686275 , 0.5176471 , 0.3529412 ],
        ...,
        [0.7254902 , 0.7137255 , 0.01960784],
        [0.6862745 , 0.67058825, 0.00784314],
        [0.6509804 , 0.6313726 , 0.        ]],

       [[0.6039216 , 0.54901963, 0.4117647 ],
        [0.5882353 , 0.53333336, 0.3882353 ],
        [0.5803922 , 0.5294118 , 0.3647059 ],
        ...,
        [0.7254902 , 0.7137255 , 0.01960784],
        [0.6862745 , 0.67058825, 0.00784314],
        [0.6509804 , 0.6313726 , 0.        ]]], dtype=float32)
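The two preprocessing steps above can be tried without an image file. The following is a minimal sketch, assuming a randomly generated 250x250 RGB array stands in for the output of image.img_to_array(); the name fake_img is hypothetical and not part of the original example.

```python
import numpy as np

# Hypothetical stand-in for image.img_to_array(img): a float32 array
# of shape (height, width, channels) holding pixel values in [0, 255].
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(250, 250, 3)).astype(np.float32)

# Add a leading sample axis: (250, 250, 3) -> (1, 250, 250, 3)
img_tensor = np.expand_dims(fake_img, axis=0)

# Scale pixel values from [0, 255] into [0, 1]; in-place division works
# because the array is already floating point.
img_tensor /= 255.

print(img_tensor.shape)
print(img_tensor.min() >= 0.0, img_tensor.max() <= 1.0)
```

The in-place division would raise an error on an integer array, which is why the sketch casts to float32 first, mirroring what img_to_array() returns.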

4. Visualizing the array of a single image file

Using the matplotlib library, let's visualize the data that was converted to an array and preprocessed in step 3 above. As an example, img_tensor[0] displays the data of the first cat image.

# Image show

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 10) # set figure size

plt.imshow(img_tensor[0])

plt.show()

5. Visualizing 30 images in a 6*5 grid layout

The preprocessing from step 3 above (loading the image file, converting it to an array, adding one dimension, and scaling to the [0, 1] range) is wrapped in a user-defined function (UDF) named preprocess_img().

# UDF of pre-processing image into a 4D tensor
import numpy as np

def preprocess_img(img_path, target_size=100):
    from keras.preprocessing import image

    # load the image resized to (target_size, target_size)
    img = image.load_img(img_path, target_size=(target_size, target_size))
    img_tensor = image.img_to_array(img)

    # expand a dimension (3D -> 4D)
    img_tensor = np.expand_dims(img_tensor, axis=0)

    # scaling into [0, 1]
    img_tensor /= 255.

    return img_tensor

Now let's use the array data of 30 cat images and display them in a grid layout of 6 rows by 5 columns. To improve readability, black separator lines are placed between the cat pictures.

Note that the preprocess_img() user-defined function defined just above is called inside the for loop in the code below.

# layout
n_pic = 30
n_col = 5
n_row = int(np.ceil(n_pic / n_col))

# plot & margin size
target_size = 100
margin = 3

# blank matrix to store results (zeros render as the black separator lines)
total = np.zeros((n_row * target_size + (n_row - 1) * margin,
                  n_col * target_size + (n_col - 1) * margin, 3))

# append the image tensors to the 'total matrix'
img_seq = 0
for i in range(n_row):      # i: grid row
    for j in range(n_col):  # j: grid column
        fname = 'cat.{}.jpg'.format(img_seq)
        img_path = os.path.join(cats30_dir, fname)
        img_tensor = preprocess_img(img_path, target_size)

        # the first array axis runs top to bottom, the second left to right
        row_start = i * target_size + i * margin
        row_end = row_start + target_size
        col_start = j * target_size + j * margin
        col_end = col_start + target_size

        total[row_start : row_end, col_start : col_end, :] = img_tensor[0]
        img_seq += 1

# display the pictures in grid
plt.figure(figsize=(200, 200))
plt.imshow(total)
plt.show()
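The grid arithmetic can be checked without any image files. Below is a minimal sketch, assuming all-white dummy arrays in place of the cat pictures; the helper name tile_images is hypothetical and only illustrates the same layout logic.

```python
import numpy as np

def tile_images(images, n_col, margin=3):
    """Tile equally sized (H, W, 3) images into a grid with black margins."""
    n_pic = len(images)
    n_row = int(np.ceil(n_pic / n_col))
    h, w, c = images[0].shape
    canvas = np.zeros((n_row * h + (n_row - 1) * margin,
                       n_col * w + (n_col - 1) * margin, c))
    for idx, img in enumerate(images):
        i, j = divmod(idx, n_col)      # grid row i, grid column j
        top = i * (h + margin)
        left = j * (w + margin)
        canvas[top:top + h, left:left + w, :] = img
    return canvas

# Hypothetical data: 30 all-white "images" of size 100x100.
images = [np.ones((100, 100, 3)) for _ in range(30)]
total = tile_images(images, n_col=5, margin=3)
print(total.shape)  # → (615, 512, 3)
```

With 6 rows and 5 columns of 100-pixel tiles and a 3-pixel margin, the canvas is 6*100 + 5*3 = 615 pixels tall and 5*100 + 4*3 = 512 pixels wide.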

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


[Python] Creating and removing paths and folders (directory and path management using os), copying files (file copy using shutil)

Python Analysis and Programming / Python Data Preprocessing 2019. 3. 3. 23:57

This post shares a few small tips on managing paths and folders with the os library and copying files with the shutil library.

The documentation page for the os library ( https://docs.python.org/3/library/os.html ) introduces it as 'Miscellaneous operating system interfaces' and lists so many functions that the page takes a great deal of scrolling.

Covering all of them is not practical, so this post picks out only a handful of functions that I use often.

1. Creating, inspecting, and changing paths and folders with the os library

First, let's import the os library.

import os # Miscellaneous operating system interfaces

1-1. Checking the current working directory: os.getcwd()

# os.getcwd(): returns the current working directory

os.getcwd()

'C:\\Users\\admin\\python'

1-2. Listing the files in a working directory: os.listdir(path)

# os.listdir(path): return a list of the entries in the directory given by path

os.listdir(os.getcwd()) # a list of files at current directory

['.ipynb_checkpoints', 'numpy_adding_new_axis.ipynb', 'Numpy_clip.ipynb', 'python_os.ipynb']

1-3. Changing the working directory: os.chdir(path)

# os.chdir(path): change the current working directory to path

base_dir = 'C:/Users/admin'

os.chdir(base_dir)

os.getcwd()

'C:\\Users\\admin'

1-4. Building a sub-path by joining an existing path and a new folder name: os.path.join()

# join one or more path components

path = os.path.join(base_dir, 'os')

path

'C:/Users/admin\\os'

1-5. Creating a new folder: os.mkdir(path)

# create a directory named path with numeric mode

os.mkdir(path)

1-6. Checking whether a path exists: os.path.isdir(path)

# return True if path is an existing directory

os.path.isdir(path)

True

1-7. Renaming a file or path: os.rename(old_path_name, new_path_name)

# rename the file or directory src to dst

# os.rename(src, dst)

dst_path = os.path.join(base_dir, 'os_renamed')

os.rename(path, dst_path)

os.path.isdir(dst_path) # check whether dst_path is renamed or not

True
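The steps in section 1 can be rehearsed safely in a scratch location. This is a minimal sketch, assuming a temporary directory created with tempfile.TemporaryDirectory() in place of 'C:/Users/admin'; that substitution is mine and not part of the original walkthrough.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing on disk is left behind.
with tempfile.TemporaryDirectory() as base_dir:
    path = os.path.join(base_dir, 'os')
    os.mkdir(path)                       # create the folder
    created = os.path.isdir(path)

    dst_path = os.path.join(base_dir, 'os_renamed')
    os.rename(path, dst_path)            # rename the folder
    old_exists = os.path.isdir(path)
    new_exists = os.path.isdir(dst_path)

    entries = os.listdir(base_dir)

print(created, old_exists, new_exists, entries)
```

After the rename, the old path no longer exists and the listing contains only 'os_renamed'. The sketch deliberately avoids os.chdir(), since changing the working directory affects the whole process.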

2. Copying files with the shutil library: shutil.copyfile(src, dst)

First, let's create the source directory (from) that the files will be copied out of and the destination directory (to) that they will be copied into.

# creating src_dir, dst_dir

base_dir = 'C:/Users/admin'

src_dir = os.path.join(base_dir, 'src_dir')

dst_dir = os.path.join(base_dir, 'dst_dir')

os.mkdir(src_dir)

os.mkdir(dst_dir)

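Next, three simple text files written in Notepad were saved in the source directory (src_dir) under the names 'file_1.txt', 'file_2.txt', and 'file_3.txt'. (They were created by hand: open Notepad, type a few characters, and save.)

Using os.listdir(), the names of the three text files in the source directory (src_dir) are collected into a list named fnames.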

# put file_1, file_2, file_3 into src_dir

fnames = os.listdir(src_dir)

fnames

['file_1.txt', 'file_2.txt', 'file_3.txt']

Finally, let's import the shutil library and use shutil.copyfile(src, dst) to copy the three text files in the source directory to the destination directory.

A for loop applies shutil.copyfile(src, dst) to each text file in turn.

# copy files from src to dst directory
import shutil

for fname in fnames:
    src = os.path.join(src_dir, fname)
    dst = os.path.join(dst_dir, fname)
    shutil.copyfile(src, dst)

os.listdir(dst_dir)

['file_1.txt', 'file_2.txt', 'file_3.txt']
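The copy loop can also be reproduced end to end without creating files by hand. The sketch below assumes a temporary directory and writes the three sample files programmatically; both choices are illustrative and differ from the manual Notepad step described above.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    src_dir = os.path.join(base_dir, 'src_dir')
    dst_dir = os.path.join(base_dir, 'dst_dir')
    os.mkdir(src_dir)
    os.mkdir(dst_dir)

    # Create the three sample files programmatically instead of by hand.
    for k in (1, 2, 3):
        with open(os.path.join(src_dir, 'file_{}.txt'.format(k)), 'w') as f:
            f.write('sample text {}\n'.format(k))

    # Copy only regular files; shutil.copyfile() fails on directories.
    for fname in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, fname)
        if os.path.isfile(src):
            shutil.copyfile(src, os.path.join(dst_dir, fname))

    copied = sorted(os.listdir(dst_dir))

print(copied)
```

The os.path.isfile() guard is an addition: os.listdir() returns subfolders as well as files, and passing a folder to shutil.copyfile() raises an error.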

3. Deleting folders and files with the os and shutil libraries

As an example, take the folder at the path 'C:/Users/admin/os', which contains three text files as shown below.

os.listdir('C:/Users/admin/os')

['big_data.txt', 'my_data.txt', 'sample_data.txt']

3-1. Removing a path (folder): os.rmdir(path)

os.rmdir() can only be used when the path (folder) contains no files. If the path (folder) contains files, it raises the error "OSError: [WinError 145] The directory is not empty", as shown below.

# OSError: directory is not empty

os.rmdir('C:/Users/admin/os')

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-11-4b25f55d427c> in <module>()
----> 1 os.rmdir('C:/Users/admin/os')

OSError: [WinError 145] The directory is not empty: 'C:/Users/admin/os'

os.path.isdir(dst_path) # check whether dst_path is removed or not

False

3-2. Deleting a file: os.remove(path)

os.remove() takes a single file path as its argument. Files have to be deleted one at a time, which can be inconvenient.

# delete file

os.remove('C:/Users/admin/os/my_data.txt')

os.remove('C:/Users/admin/os/big_data.txt')

os.remove('C:/Users/admin/os/sample_data.txt')

Since all three files in the 'C:/Users/admin/os' path were deleted above, the folder can now be removed with os.rmdir().

# delete directory only when it is empty

os.rmdir('C:/Users/admin/os')

Let's check with os.path.isdir(path) whether the path (folder) still exists. It returns False because the path was just removed with os.rmdir().

# check whether the directory is present or not

os.path.isdir('C:/Users/admin/os')

False

3-3. Deleting a path (folder) and its files all at once: shutil.rmtree(path)

os.mkdir('C:/Users/admin/os')

# delete directory and files at once

import shutil

shutil.rmtree('C:/Users/admin/os')
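The difference between os.rmdir() and shutil.rmtree() is easiest to see on a folder that still holds a file. The following is a minimal sketch, assuming a scratch directory from tempfile.mkdtemp() rather than 'C:/Users/admin/os'.

```python
import os
import shutil
import tempfile

base_dir = tempfile.mkdtemp()            # scratch location for the demo
target = os.path.join(base_dir, 'os')
os.mkdir(target)
with open(os.path.join(target, 'my_data.txt'), 'w') as f:
    f.write('x')

try:
    os.rmdir(target)                     # fails: directory is not empty
    rmdir_failed = False
except OSError:
    rmdir_failed = True

shutil.rmtree(target)                    # removes the folder and its files
still_there = os.path.isdir(target)

shutil.rmtree(base_dir)                  # clean up the scratch location
print(rmdir_failed, still_there)
```

Because shutil.rmtree() deletes everything under the given path without asking, it is worth double-checking the path before calling it.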

I hope this was helpful.


[Python Numpy] Adding Dimensions to a Numpy Array

Python Analysis and Programming / Python Data Preprocessing 2019. 2. 24. 00:27

This post introduces three ways to add a dimension to a Python Numpy array. This comes up when studying deep learning, for example when loading image files for a computer vision CNN and converting them into multidimensional arrays.

1. Adding a dimension with numpy.reshape()

2. Adding a dimension with numpy.expand_dims()

3. Adding a dimension with numpy.newaxis

As an example, let's create a simple 3D array of shape (4, 3, 2).

import numpy as np

a = np.arange(24).reshape(4, 3, 2)

a

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]]])

a.shape

(4, 3, 2)

Let's add a dimension to array a of shape (4, 3, 2) to turn it into a 4D array of shape (1, 4, 3, 2).

1. Adding a dimension with numpy.reshape()

np.reshape(a, (1, 4, 3, 2))

array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]],

        [[18, 19],
         [20, 21],
         [22, 23]]]])

np.reshape(a, ((1,) + a.shape))

array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]],

        [[18, 19],
         [20, 21],
         [22, 23]]]])

a.reshape((1,) + a.shape)

array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]],

        [[18, 19],
         [20, 21],
         [22, 23]]]])

2. Adding a dimension with numpy.expand_dims()

np.expand_dims(a, axis=0)

array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]],

        [[18, 19],
         [20, 21],
         [22, 23]]]])

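3. Adding a dimension with numpy.newaxis

Note that a[:, np.newaxis] inserts the new axis at position 1, so the result below has shape (4, 1, 3, 2). Writing a[np.newaxis] instead places the new axis first and gives (1, 4, 3, 2).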

a[:, np.newaxis]

array([[[[ 0,  1],
         [ 2,  3],
         [ 4,  5]]],


       [[[ 6,  7],
         [ 8,  9],
         [10, 11]]],


       [[[12, 13],
         [14, 15],
         [16, 17]]],


       [[[18, 19],
         [20, 21],
         [22, 23]]]])
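A quick way to compare the approaches is to look at the resulting shapes rather than the full printouts. The sketch below uses the same array a; the variable names r1 to r4 are only for illustration.

```python
import numpy as np

a = np.arange(24).reshape(4, 3, 2)

r1 = a.reshape((1,) + a.shape)       # reshape
r2 = np.expand_dims(a, axis=0)       # expand_dims
r3 = a[np.newaxis, ...]              # newaxis placed first
r4 = a[:, np.newaxis]                # newaxis placed second

print(r1.shape, r2.shape, r3.shape, r4.shape)
# → (1, 4, 3, 2) (1, 4, 3, 2) (1, 4, 3, 2) (4, 1, 3, 2)
print(np.array_equal(r1, r2) and np.array_equal(r2, r3))
# → True
```

The position of np.newaxis inside the index decides where the new axis lands, which is why r4 differs from the other three.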

I hope this was helpful.


[Python Numpy] Converting values smaller than 0 in an array to 0

Python Analysis and Programming / Python Data Preprocessing 2019. 2. 21. 23:52

This post introduces several ways to convert values smaller than 0 in an array to 0 while leaving the rest unchanged.

1. List Comprehension with for loop

2. Indexing

3. np.where(condition[, x, y])

4. np.clip(a, a_min, a_max, out=None)

1. List Comprehension: [0 if i < 0 else i for i in a]

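Using a list comprehension with a for loop, as below, values smaller than 0 can be converted to 0 without calling any library-specific function. However, because it iterates in a Python for loop, performance can become a problem as the array grows. Note also that the result is a plain Python list rather than a Numpy array. The original array a is left unchanged.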

>>> import numpy as np

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>> [0 if i < 0 else i for i in a]

[0, 0, 0, 0, 0, 0, 1, 2, 3, 4]

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

2. Indexing: a[a < 0] = 0

With indexing, as below, 0 can be assigned directly to the positions holding values smaller than 0, as in a[a < 0] = 0. This modifies the original array a in place.

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>> a[a < 0] = 0

>>> a

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])
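If the original values are still needed, the same indexing can be applied to a copy. This is a small sketch of that variation; the name b is hypothetical.

```python
import numpy as np

a = np.arange(-5, 5)

b = a.copy()        # work on a copy so that a is preserved
b[b < 0] = 0        # boolean-mask assignment changes b only

print(a)
print(b)
```

Without .copy(), writing b = a would only create a second name for the same array, and the assignment would change a as well.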

3. np.where() : np.where(a < 0, 0, a)

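np.where(condition, value if True, value if False) makes it convenient to assign 0 at the positions where the value is smaller than 0. Because it is a vectorized operation, no Python-level for loop runs, so it is typically much faster than the list comprehension on large arrays. The original array a is not modified.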

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>> np.where(a < 0, 0, a)

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

If you want to convert values smaller than 0 to 0 and values larger than 2 to 2, you can nest another np.where() inside np.where() as below, although the code looks a bit complicated.

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>>

>>> np.where(a < 0, 0, np.where(a > 2, 2, a))

array([0, 0, 0, 0, 0, 0, 1, 2, 2, 2])

4. np.clip() : np.clip(a, 0, 4, out=a)

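np.clip(array, minimum, maximum) is very convenient when you want to bound values by a minimum and a maximum and replace every value outside that range with the minimum or maximum in one go. Since the minimum is set to 0, all values smaller than 0 were replaced with 0. The original array a is left unchanged here.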

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>> np.clip(a, 0, 4)

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

By using np.clip(array, minimum, maximum, out array) with the additional setting out=a, the returned values are stored in array a. You can confirm that a has changed: every element that was smaller than 0 has been replaced with 0.

>>> np.clip(a, 0, 4, out=a)

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

>>> a

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

To apply only the minimum and simply turn every value smaller than 0 into 0, you can also use the method form, as in a.clip(0).

>>> a = np.arange(-5, 5)

>>> a

array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])

>>> a.clip(0)

array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])
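As a sanity check, the approaches covered in this post can be compared against each other on the same input. The sketch below also includes np.maximum(a, 0), which is not one of the four methods in the post but is another common way to write the same operation.

```python
import numpy as np

a = np.arange(-5, 5)

by_list = np.array([0 if i < 0 else i for i in a])
by_where = np.where(a < 0, 0, a)
by_clip = a.clip(0)
by_maximum = np.maximum(a, 0)   # not in the original post; element-wise max with 0

results = [by_list, by_where, by_clip, by_maximum]
print(all(np.array_equal(by_where, r) for r in results))
# → True

# Clipping both ends matches the nested np.where() shown earlier.
print(np.array_equal(np.clip(a, 0, 2),
                     np.where(a < 0, 0, np.where(a > 2, 2, a))))
# → True
```

None of these calls modify a, so the same input can be reused for every method.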

I hope this was helpful.


[Python pandas] Correlation coefficients with multiple columns by multiple groups

Python Analysis and Programming / Python Data Preprocessing 2019. 2. 17. 23:30

This post shows how to compute paired correlation coefficients among multiple columns, split by multiple groups.

When you need correlation coefficients among several variables, a correlation matrix usually does the job. Computing them separately 'by multiple groups', however, takes a bit more thought.

Let's build a simple example dataset to illustrate.

(1) Creating an example DataFrame with 3 group variables and 4 continuous variables

import numpy as np

import pandas as pd

Correlation coefficients will be computed separately for the groups in 'group_1' ('A', 'B'), in 'group_2' ('C', 'D', 'E', 'F'), and in 'group_3' ('G', 'H', 'I', 'J', 'K', 'L', 'M', 'N').

# making groups
group_1 = ['A', 'B']*20
group_2 = ['C', 'D', 'E', 'F']*10
group_3 = ['G', 'H', 'I', 'J', 'K', 'L', 'M', 'N']*5

Four continuous variables, 'col_1', 'col_2', 'col_3', and 'col_4', are used for the correlation coefficients.

df = pd.DataFrame({'group_1': group_1,
                   'group_2': group_2,
                   'group_3': group_3,
                   'col_1': np.random.randn(40),
                   'col_2': np.random.randn(40),
                   'col_3': np.random.randn(40),
                   'col_4': np.random.randn(40)})

df.sort_values(by=['group_1', 'group_2', 'group_3'], axis=0)

	col_1	col_2	col_3	col_4	group_1	group_2	group_3
0	-0.351969	0.026318	-1.037910	0.849338	A	C	G
8	-0.163435	-0.175277	-1.349251	0.645246	A	C	G
16	0.728652	1.731762	0.691091	-0.189488	A	C	G
24	-1.490956	0.083991	-0.503727	1.690979	A	C	G
32	0.076380	0.634184	-0.424101	-0.608869	A	C	G
4	0.902027	1.454501	-1.467817	0.448042	A	C	K
12	0.899792	0.833289	0.829877	-0.062950	A	C	K
20	-0.559971	0.539967	0.005397	0.362061	A	C	K
28	-1.052539	0.558581	-0.799314	0.979169	A	C	K
36	0.919377	-1.430321	-1.818365	0.061561	A	C	K
2	-0.030675	-0.168537	-1.341236	-1.149740	A	E	I
10	0.112267	-0.476736	0.967436	-0.222528	A	E	I
18	-0.774158	-0.081231	0.438514	1.611915	A	E	I
26	-0.173712	-1.358414	0.653392	0.053665	A	E	I
34	1.110080	1.175692	-0.867843	1.042837	A	E	I
6	-0.083481	-0.200750	-0.702476	-1.072645	A	E	M
14	0.223843	-1.345315	0.899668	1.126941	A	E	M
22	0.529680	0.062743	1.035399	-0.729469	A	E	M
30	1.456441	-0.403748	-0.446094	0.408010	A	E	M
38	-1.308548	0.367232	-0.963109	0.918776	A	E	M
1	0.579627	-1.720893	-0.798200	-0.107270	B	D	H
9	2.101038	-0.581516	-0.796230	0.324806	B	D	H
17	-0.168765	-1.176664	-0.024593	-0.348601	B	D	H
25	0.166594	-1.418307	0.916661	-0.912822	B	D	H
33	0.889615	0.014690	-0.711458	0.649833	B	D	H
5	1.199802	0.968027	-0.780434	0.884857	B	D	L
13	-0.038637	0.694750	0.219160	-0.693826	B	D	L
21	-1.054844	-0.559508	-0.890659	-0.321867	B	D	L
29	-0.574888	0.812719	-0.823804	-0.382432	B	D	L
37	0.670548	0.178911	0.497704	-0.402953	B	D	L
3	0.477194	-0.355853	-1.441898	1.418857	B	F	J
11	0.965187	0.563026	0.964660	-0.249644	B	F	J
19	-2.318685	0.079057	-0.107432	-1.358502	B	F	J
27	-0.951459	-0.466933	1.141424	-2.860606	B	F	J
35	-0.462823	-0.397081	0.373452	-1.303045	B	F	J
7	0.398693	-0.086113	-0.081445	0.871010	B	F	N
15	0.121970	0.258130	0.654156	-0.497327	B	F	N
23	1.228697	-0.625133	-1.761145	-0.577502	B	F	N
31	1.074855	0.784140	0.529190	0.479893	B	F	N
39	0.341767	0.170529	-0.287884	0.329371	B	F	N

(2) A user-defined function that computes the correlation coefficient between two variables by group

Now that the example dataset is ready, let's define a user-defined function of correlation coefficients with paired variables by groups.

# a user-defined function of correlation coefficients with paired variables by groups
def corr_group(df, var_1, var_2, group_list):
    # correlation function with 2 variables
    corr_func = lambda g: g[var_1].corr(g[var_2])

    # GroupBy operator
    grouped = df.groupby(group_list)

    # calculate correlation coefficient by Group
    corr_coef_df = pd.DataFrame(grouped.apply(corr_func), columns=['corr_coef'])

    # add var_1, var_2 column names
    corr_coef_df['var1'] = var_1
    corr_coef_df['var2'] = var_2

    return corr_coef_df
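To see what such a function returns, it helps to run it on a tiny dataset where the answer is known in advance. The sketch below is a variation rather than the function above: the helper corr_by_group and the DataFrame toy are hypothetical, and the helper selects the two value columns before calling apply() so that the lambda never receives the grouping columns.

```python
import pandas as pd

def corr_by_group(df, var_1, var_2, group_list):
    """Correlation of var_1 and var_2 within each group (hypothetical helper)."""
    # Select only the two value columns so the lambda never sees the group keys.
    corr = df.groupby(group_list)[[var_1, var_2]].apply(
        lambda g: g[var_1].corr(g[var_2]))
    out = corr.to_frame('corr_coef')
    out['var1'] = var_1
    out['var2'] = var_2
    return out

# In group A, y rises exactly with x; in group B, y falls exactly with x.
toy = pd.DataFrame({
    'grp': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x':   [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'y':   [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
res = corr_by_group(toy, 'x', 'y', ['grp'])
print(res)
```

Group A should come out with a coefficient of 1 and group B with -1, which makes it easy to confirm that the grouping and the pairing are wired up correctly.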

(3) Computing pairwise correlation coefficients among multiple variables by multiple groups

For every group combination formed by the three group variables 'group_1', 'group_2', and 'group_3', we compute the correlation coefficient for every pair of the four continuous variables 'col_1', 'col_2', 'col_3', and 'col_4'. That is, for the 4C2 = 6 combinations ('col_1', 'col_2'), ('col_1', 'col_3'), ('col_1', 'col_4'), ('col_2', 'col_3'), ('col_2', 'col_4'), and ('col_3', 'col_4').

The corr_group() function defined in (2) is called in a for loop for each of the 6 variable pairs, and the results are combined into the corr_coef_df_all DataFrame.

# group by list
group_list = ['group_1', 'group_2', 'group_3']

# column lists for correlation matrix
col_list = ['col_1', 'col_2', 'col_3', 'col_4']

# get all combinations of col_list with length 2
from itertools import combinations
comb = combinations(col_list, 2)

# calculate correlation coefficients pair-wise
# (DataFrame.append was removed in pandas 2.0, so collect and concatenate)
corr_list = []
for var in comb:
    corr_tmp = corr_group(df, var[0], var[1], group_list)
    corr_list.append(corr_tmp)

corr_coef_df_all = pd.concat(corr_list)

# result
corr_coef_df_all[['var1', 'var2', 'corr_coef']]

			var1	var2	corr_coef
group_1	group_2	group_3
A	C	G	col_1	col_2	0.703392
	C	K	col_1	col_2	-0.139566
	E	I	col_1	col_2	0.642818
	E	M	col_1	col_2	-0.410050
B	D	H	col_1	col_2	0.511432
	D	L	col_1	col_2	0.569900
	F	J	col_1	col_2	0.247295
	F	N	col_1	col_2	-0.186798
A	C	G	col_1	col_3	0.466368
	C	K	col_1	col_3	-0.167176
	E	I	col_1	col_3	-0.455445
	E	M	col_1	col_3	0.385438
B	D	H	col_1	col_3	-0.615976
	D	L	col_1	col_3	0.362789
	F	J	col_1	col_3	-0.063979
	F	N	col_1	col_3	-0.556404
A	C	G	col_1	col_4	-0.867131
	C	K	col_1	col_4	-0.790912
	E	I	col_1	col_4	-0.052166
	E	M	col_1	col_4	-0.191858
B	D	H	col_1	col_4	0.656101
	D	L	col_1	col_4	0.631548
	F	J	col_1	col_4	0.604571
	F	N	col_1	col_4	-0.144041
A	C	G	col_2	col_3	0.956775
	C	K	col_2	col_3	0.423775
	E	I	col_2	col_3	-0.597295
	E	M	col_2	col_3	-0.506746
B	D	H	col_2	col_3	-0.399239
	D	L	col_2	col_3	0.036270
	F	J	col_2	col_3	0.262685
	F	N	col_2	col_3	0.875746
A	C	G	col_2	col_4	-0.631931
	C	K	col_2	col_4	0.315081
	E	I	col_2	col_4	0.395802
	E	M	col_2	col_4	-0.381141
B	D	H	col_2	col_4	0.789146
	D	L	col_2	col_4	0.363601
	F	J	col_2	col_4	0.216682
	F	N	col_2	col_4	0.406150
A	C	G	col_3	col_4	-0.434402
	C	K	col_3	col_4	-0.250838
	E	I	col_3	col_4	0.274027
	E	M	col_3	col_4	-0.008633
B	D	H	col_3	col_4	-0.874220
	D	L	col_3	col_4	-0.472953
	F	J	col_3	col_4	-0.775485
	F	N	col_3	col_4	0.366142
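For comparison, pandas can also produce a full correlation matrix per group in one call with groupby(...).corr(). This is an alternative to the pairwise loop, not the method used in this post. The sketch below assumes a smaller randomly generated DataFrame with two group columns and three value columns, and cross-checks one cell against a direct computation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'group_1': ['A', 'B'] * 20,
    'group_2': ['C', 'D', 'E', 'F'] * 10,
    'col_1': rng.standard_normal(40),
    'col_2': rng.standard_normal(40),
    'col_3': rng.standard_normal(40),
})
group_list = ['group_1', 'group_2']
col_list = ['col_1', 'col_2', 'col_3']

# Full correlation matrix per group, stacked under a MultiIndex.
corr_mat = df.groupby(group_list)[col_list].corr()

# One cell of that matrix: group (A, C), col_1 versus col_2.
cell = corr_mat.loc[('A', 'C', 'col_1'), 'col_2']

# Cross-check against a direct computation on the same subset.
sub = df[(df['group_1'] == 'A') & (df['group_2'] == 'C')]
direct = sub['col_1'].corr(sub['col_2'])
print(np.isclose(cell, direct))
```

The matrix form repeats each pair twice and includes the diagonal of ones, so the long format built by the loop above can be easier to filter and sort.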

I hope this was helpful.


[Keras] Upgrading tensorflow to fix TypeError: softmax() got an unexpected keyword argument 'axis'

Deep Learning (TF, Keras, PyTorch) 2019. 2. 6. 17:09

If the error TypeError: softmax() got an unexpected keyword argument 'axis' occurs while using Tensorflow and Keras, upgrading Tensorflow resolves it. (In my case, the error below occurred when calling softmax() through Keras on Python 2.7 with Tensorflow 1.4.)

First, let's activate the conda environment in which Tensorflow is installed from the command prompt.

$ conda env list

tensorflow /Users/myid/anaconda3/envs/tensorflow

$ source activate tensorflow # for mac OS

$ activate tensorflow # for Windows OS

(tensorflow) $

For reference, the Python and Tensorflow versions can be checked as follows.

(tensorflow) $ python -V

Python 2.7.14 :: Anaconda custom (64-bit)

(tensorflow) $ python

Python 2.7.14 |Anaconda custom (64-bit)| (default, Oct 5 2017, 02:28:52)

[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

>>> tf.VERSION

'1.4.0'

(1) How to handle the error TypeError: softmax() got an unexpected keyword argument 'axis'

First upgrade pip, the tool used to manage packages in Python, and then upgrade Tensorflow. On Python 3.n use pip3 install package_name, and for GPU builds append -gpu to the name, as in tensorflow-gpu.

# --------------------
# TypeError: softmax() got an unexpected keyword argument 'axis'
# --------------------

# => upgrade tensorflow to the latest version

(tensorflow)$ pip install pip --upgrade # for Python 2.7
(tensorflow)$ pip3 install pip --upgrade # for Python 3.n

(tensorflow)$ pip install tensorflow --upgrade # for Python 2.7
(tensorflow)$ pip3 install tensorflow --upgrade # for Python 3.n
(tensorflow)$ pip install tensorflow-gpu --upgrade # for Python 2.7 and GPU

(tensorflow)$ pip3 install tensorflow-gpu --upgrade # for Python 3.n and GPU

After upgrading Tensorflow, numpy raised the error below, so I upgraded numpy as well, and that resolved the problem.

(2) numpy Traceback (most recent call last) RuntimeError:

module compiled against API version 0xc but this version of numpy is 0xb

# --------------
# Traceback (most recent call last) RuntimeError:

# module compiled against API version 0xc but this version of numpy is 0xb
# ---------------

# => upgrade numpy to the latest version

(tensorflow)$ pip install numpy --upgrade
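After upgrading, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is in the standard library) rather than the Python 2.7 setup used in this post; the helper name installed_version is hypothetical.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Report the packages touched in this post without importing them.
for name in ('pip', 'numpy', 'tensorflow'):
    print(name, installed_version(name))
```

Reading the package metadata avoids importing tensorflow itself, so the check still runs when the import is the thing that fails.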

I hope this was helpful.


A Friend for R and Python Analysis and Programming (by R Friend)

243 posts tagged 'Python'
