'python DataFrame을 Greenplum DB에 쓰기' 태그의 글 목록

'python DataFrame을 Greenplum DB에 쓰기'에 해당되는 글 1건

2019.07.28 [GPDB, PostgreSQL] Python DataFrame을 Sqlalchemy engine을 이용해 DB Table에 직접 쓰기 : df.to_sql()

[GPDB, PostgreSQL] Python DataFrame을 Sqlalchemy engine을 이용해 DB Table에 직접 쓰기 : df.to_sql()

Greenplum and PostgreSQL Database 2019. 7. 28. 19:06

지난 포스팅에서는 Python pandas DataFrame을 csv 파일로 다운로드 한 후에 로컬에서 PostgreSQL, Greenplum DB에 Copy해서 넣는 방법 (https://rfriend.tistory.com/457) 을 소개하였습니다.

이번 포스팅에서는 Python pandas DataFrame을 csv 파일로 다운로드 하는 절차 없이, sqlalchemy engine과 to_sql() 함수를 이용하여 바로 직접 PostgreSQL, Greenplum DB에 쓰는 방법을 소개하겠습니다. 이렇게 하면 다운로드하는 절차가 필요없기 때문에 좀더 간편하고 workflow가 간소화되는 장점은 있는데요, csv로 내려서 copy 하는 것 대비 속도고 조금 더 느리다는 단점이 있습니다.

먼저 예제로 사용할 간단한 DataFrame을 만들어보겠습니다.

# make a sample DataFrame

import pandas as pd

score = pd.DataFrame({

'date': ['2019-07-28']*4,

'name': ['kim', 'lee', 'choi', 'park'],

'age': [19, 20, 19, 20],

'math_score': [91, 95, 92, 70],

'pass_yn': [True, True, True, False]},

columns=['date', 'name', 'age', 'math_score', 'pass_yn'])

이제 'score' 라는 이름의 pandas DataFrame을 Postgresql, Greenplum DB에 'score' 라는 Table 이름으로 public schema에 생성해서 써보겠습니다.

이때 DB connection을 하기 위해 SQLAlchemy로 DB engine 을 생성해줘야 하는데요,
engine = sqlalchemy.create_engine("postgresql://user:password@host:port/database")
의 순서대로 자신의 DB 설정값을 입력해주면 됩니다. (port 가 5432 디폴트 값이면 생략 가능)

import sqlalchemy

from sqlalchemy import create_engine

# engine = sqlalchemy.create_engine("postgresql://user:password@host:port/database")

engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres") # set yours

engine.execute("DROP TABLE IF EXISTS public.score;") # drop table if exists

score.to_sql(name = 'score',

con = engine,

schema = 'public',

if_exists = 'fail', # {'fail', 'replace', 'append'), default 'fail'

index = True,

index_label = 'id',

chunksize = 2,

dtype = {

'id': sqlalchemy.types.INTEGER(),

'date': sqlalchemy.DateTime(),

'name': sqlalchemy.types.VARCHAR(100),

'age': sqlalchemy.types.INTEGER(),

'math_score': sqlalchemy.types.Float(precision=3),

'pass_yn': sqlalchemy.types.Boolean()

})

'if_exists' 옵션에는 {'fail', 'replace', 'append'}의 3개가 존재하고, 디폴트는 'fail' 옵션입니다.

if_exists = 'fail' : 같은 이름의 Table이 존재할 경우 ValueError 가 남
if_exists = 'replace': 같은 이름의 Table이 존재할 경우 기존 Table을 Drop하고 새로운 값을 Insert함
if_exists = 'append': 같은 이름의 Table이 존재할 경우 기존 Table에 추가로 새로운 값을 Insert함

index = True 로 설정해주면 pandas DataFrame의 Index도 DB Table에 insert 해주며, index_label = 'xxx'로 index의 칼럼 이름을 부여해줄 수 있습니다.

chunksize = xx 를 설정해주면 pandas DataFrame 데이터를 xx row 개수 만큼 DB table 에 insert를 해줍니다. 설정해주지 않으면 pandas DataFrame을 통째로 한꺼번에 insert를 합니다.

dtype 은 pandas DataFrame의 각 변수별로 DB table에 넣어줄 Data Type을 사전형(Dictionary)으로 {'column': data_type} 형식으로 설정해줄 수 있습니다. 위의 예시에서 INTEGER, DateTime(), VARCHAR(), Float(), Boolean 데이터 형태 지정하는 것을 보여주었는데요, 대/소문자, 괄호() 여부를 위의 예시처럼 똑같이 사용해야 합니다. (괄호를 빼먹거나, 대/소문자가 틀리면 에러가 납니다)

참고로, to_sql() 에서 dtype 을 칼럼 별로 설정하지 않으면 전부 'text' 데이터 형태로 해서 DB table에 입력됩니다.

확인차, DBeaver로 PostgreSQL에 score table을 조회해보겠습니다. Python pandas의 'score' DataFrame이 PostgreSQL의 score table로 데이터가 잘 들어갔네요!

많은 도움이 되었기를 바랍니다.

이번 포스팅이 도움이 되었다면 아래의 '공감~'를 꾹 눌러주세요.

728x90

저작자표시 비영리 변경금지

'Greenplum and PostgreSQL Database' 카테고리의 다른 글

[Postgres, Greenplum] 현재 수행중인 query 조회하고 강제 종료 (kill) 하는 방법 (0)	2019.11.28
[Greenplum] 오픈소스 Greenplum DB와 Apache MADlib을 활용한 그래프 분석, 네트워크 분석 (Graph analytics with Greenplum and Apache MADlib) (0)	2019.08.25
[GPDB, Postgres] Python DataFrame을 Postgresql, Greenplum DB에 Copy 해서 넣는 방법 (0)	2019.07.17
[Greenplum DB] GPDB에 PL/R Language Extension, R 패키지 수동 설치 방법 (0)	2019.05.16
[Greenplum DB] PostGIS - 공간지리 테이블 백업하기, 백업 다시 불러오기 (Backup and Restore geospatial table using pg_dump, pg_restore) (0)	2019.04.27

Posted by Rfriend

이전 1 다음

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'python DataFrame을 Greenplum DB에 쓰기'에 해당되는 글 1건

[GPDB, PostgreSQL] Python DataFrame을 Sqlalchemy engine을 이용해 DB Table에 직접 쓰기 : df.to_sql()

'Greenplum and PostgreSQL Database' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바