'Jupyter Notebook에서 Greenplum DB에 SQL query한 결과를 pandas DataFrame으로 가져오기' 태그의 글 목록

'Jupyter Notebook에서 Greenplum DB에 SQL query한 결과를 pandas DataFrame으로 가져오기'에 해당되는 글 1건

2020.11.21 [PostgreSQL, Greenplum] ipython-sql로 Jupyter Notebook에서 SQL query한 결과를 pandas DataFrame으로 가져오는 3가지 방법

[PostgreSQL, Greenplum] ipython-sql로 Jupyter Notebook에서 SQL query한 결과를 pandas DataFrame으로 가져오는 3가지 방법

Greenplum and PostgreSQL Database 2020. 11. 21. 14:52

지난번 포스팅에서는 Python의 ipython-sql 패키지를 이용하여 Jupyter Notebook 의 로컬변수(variable in local namespace)로 PostgreSQL, Greenplum database에 동적으로 SQL query를 하는 3가지 방법을 소개하였습니다.(https://rfriend.tistory.com/578)

이번 포스팅에서는 Python의 ipython-sql 패키지를 이용하여 Jupyter Notebook에서 PostgreSQL, Greenplum database에 SQL query한 결과를 pandas DataFrame으로 가져오는 (즉, 객체로 할당하는) 3가지 방법을 소개하겠습니다.

(1) ResultSet = %sql SELECT ... (single-line query)

(2) %%sql ResultSet << SELECT ... (multi-line queries)

(3) %%sql SELECT ... 를 먼저 실행한 후에,

ResultSet = _ (select 결과를 사후에 할당)

==> (공통) sql.run.ResultSet을 pandas DataFrame으로 변환: df = ResultSet.DataFrame()

(0) Python 패키지 사전 설치 및 PostgreSQL, Greenplum database 접속

명령 프롬프트 창에서 pip로 PostgreSQL, Greenplum database에 접속하여 SQL query를 할 수 있는 Python 패키지를 설치해주세요.

-- (명령 프롬프트 창에서)

$ pip install --upgrade pip

$ pip install sqlalchemy

$ pip install psycopg2

$ pip install ipython-sql==0.3.9

$ pip install pgspecial

$ pip install sql_magic

ipython-sql 패키지를 이용하여 Jupyter Notebook에서 Greenplum database에 접속해보겠습니다.

%load_ext sql

# %sql postgresql://username:passowrd@host:port/database

%sql postgresql://gpadmin:changeme@localhost:5432/demo

[Out]

* postgresql://gpadmin:***@localhost:5432/demo 1 rows affected.

예제로 사용할 간단한 houses 테이블을 만들고 데이터를 넣어보겠습니다.

%%sql
DROP TABLE IF EXISTS houses;
CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
            size INT, lot INT, region TEXT);
INSERT INTO houses VALUES
(1 , 590 ,       2 ,    1 , 50000 , 770 , 22100, 'seoul'),
(2 , 1050 ,       3 ,    2 , 85000 , 1410 , 12000, 'seoul'),
(3 ,   20 ,       3 ,    1 , 22500 , 1060 , 3500, 'busan'),
(4 , 870 ,       2 ,    2 , 90000 , 1300 , 17500, 'seoul'),
(5 , 1320 ,       3 ,    2 , 133000 , 1500 , 30000, 'bundang'),
(6 , 1350 ,       2 ,    1 , 90500 , 820 , 25700, 'bundang'),
(7 , 2790 ,       3 , 2.5 , 260000 , 2130 , 25000, 'busan'),
(8 , 680 ,       2 ,    1 , 142500 , 1170 , 22000, 'busan'),
(9 , 1840 ,       3 ,    2 , 160000 , 1500 , 19000, 'inchon'),
(10 , 3680 ,       4 ,    2 , 240000 , 2790 , 20000, 'seoul'),
(11 , 1660 ,       3 ,    1 , 87000 , 1030 , 17500, 'inchon'),
(12 , 1620 ,       3 ,    2 , 118600 , 1250 , 20000, 'busan'),
(13 , 3100 ,       3 ,    2 , 140000 , 1760 , 38000, 'bundang'),
(14 , 2070 ,       2 ,    3 , 148000 , 1550 , 14000, 'bundang'),
(15 , 650 ,       3 , 1.5 , 65000 , 1450 , 12000, 'inchon');

(1) ResultSet = %sql SELECT ... (single-line query)

SQL query가 1줄 짜리인 경우에 일반적인 IPython assignment 방식(result = %sql SELECT ...)으로 간단하게 사용할 수 있습니다.

SQL query 결과를 할당받으면 'sql.run.ResultSet' 데이터 형태입니다. 이를 pandas DataFrame으로 변환하려면 pandas의 DataFrame() 메소드를 사용하면 됩니다. ResultSet.DataFrame() 처럼 해서 일단 pandas DataFrame으로 변환을 해주고 나면, pandas의 데이터전처리 및 시각화 함수를 사용할 수 있습니다.

# for 1 line SQL query
house_sub_1 = %sql SELECT * FROM houses WHERE price <= 80000;

[Out]

 * postgresql://gpadmin:***@localhost:5432/demo
3 rows affected.

house_sub_1

[Out]

id	tax	bedroom	bath	price	size	lot	region
3	20	3	1.0	22500	1060	3500	busan
1	590	2	1.0	50000	770	22100	seoul
15	650	3	1.5	65000	1450	12000	inchon

type(house_sub_1)

[Out] sql.run.ResultSet

# converting 'sql.run.ResultSet into pandas DataFrame

df_house_sub_1 = house_sub_1.DataFrame()
type(df_house_sub_1)

[Out] pandas.core.frame.DataFrame

# plot using pandas function

import matplotlib.pyplot as plt
df_house_sub_1['price'].plot(kind='bar')
plt.show()

(2) %%sql ResultSet << SELECT ... (multi-line queries)

2줄 이상의 SQL quries 결과를 할당하려면 %%sql result_name 그리고 이어서 '<<' 연산자(<< operator)를 사용합니다.

(참고로, %sql 은 1줄 짜리 SQL query를 쓸 때, %%sql 은 2줄 이상의 SQL query를 쓸 때 사용함)

'sql.run.ResultSet' 데이터 형태로 할당받은 결과를 DataFrame() 메소드를 사용해서 pandas DataFrame으로 변환하는 것은 위의 (1)번과 동일합니다.

%%sql house_sub_2 << SELECT *
FROM houses
WHERE price <= 80000;

[Out]

* postgresql://gpadmin:***@localhost:5432/demo 3 rows affected. Returning data to local variable house_sub_2

house_sub_2

[Out]

id	tax	bedroom	bath	price	size	lot	region
1	590	2	1.0	50000	770	22100	seoul
3	20	3	1.0	22500	1060	3500	busan
15	650	3	1.5	65000	1450	12000	inchon

type(house_sub_2)

[Out] sql.run.ResultSet

# Converting 'sql.run.ResultSet' into pandas DataFrame

df_house_sub_2 = house_sub_2.DataFrame()
type(df_house_sub_2)

[Out] pandas.core.frame.DataFrame

(3) %%sql SELECT ... 를 먼저 실행한 후에,

ResultSet = _ (select 결과를 사후에 할당)

경우에 따라서는 먼저 %sql 로 1줄 짜리 SQL query를 먼저 하거나, 혹은 %%sql 로 2줄 이상의 SQL query 를 Jupyter Notebook에서 먼저 실행한 후에, 이의 결과를 확인하고 나서 나중에 그 결과를 Python의 pandas DataFrame으로 가져오고 싶을 때도 있을 것입니다.

이럴 경우에 직전에 실행한 %sql SELECT ... 또는 %%sql SELECT ... 결과를 result = _ 를 사용해서 'sql.run.ResultSet'으로 할당받을 수 있습니다.

'sql.run.ResultSet' 데이터 형태로 할당받은 결과를 DataFrame() 메소드를 사용해서 pandas DataFrame으로 변환하는 것은 위의 (1)번과 동일합니다.

%sql SELECT * FROM houses WHERE price <= 80000;

[Out]

* postgresql://gpadmin:***@localhost:5432/demo 3 rows affected.

Out[22]:

id	tax	bedroom	bath	price	size	lot	region
1	590	2	1.0	50000	770	22100	seoul
3	20	3	1.0	22500	1060	3500	busan
15	650	3	1.5	65000	1450	12000	inchon

# Capturing the quries results and assigning to ResultSet

house_sub_3 = _

house_sub_3

[Out]

id	tax	bedroom	bath	price	size	lot	region
1	590	2	1.0	50000	770	22100	seoul
3	20	3	1.0	22500	1060	3500	busan
15	650	3	1.5	65000	1450	12000	inchon

type(house_sub_3)

[Out] sql.run.ResultSet

df_house_sub_3 = house_sub_3.DataFrame()
type(df_house_sub_3)

[Out] pandas.core.frame.DataFrame

* Reference: https://pypi.org/project/ipython-sql/

이번 포스팅이 많은 도움이 되었기를 바랍니다.

행복한 데이터 과학자 되세요! :-)

728x90

저작자표시 비영리 변경금지

'Greenplum and PostgreSQL Database' 카테고리의 다른 글

[PostgreSQL, Greenplum] 상관관계 분석 (Correlation Coefficients using SQL, MADlib) (0)	2020.11.29
[PostgreSQL, Greenplum] 연속형 데이터의 집계/ 요약 통계량 구하기 (0)	2020.11.28
[PostgreSQL, Greenplum] ipython-sql 로 Jupyter Notebook 의 로컬변수로 동적으로 SQL query 하는 3가지 방법 (0)	2020.11.15
[PostgreSQL, Greenplum] ipython-sql 로 Jupyter Notebook에서 DB 접속하는 4가지 방법 (0)	2020.11.15
[PostgreSQL, Greenplum] ipython-sql, pgspecial로 Jupyter Notebook에서 DB접속하고 SQL query 하기 (0)	2020.10.20

Posted by Rfriend

이전 1 다음

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'Jupyter Notebook에서 Greenplum DB에 SQL query한 결과를 pandas DataFrame으로 가져오기'에 해당되는 글 1건

[PostgreSQL, Greenplum] ipython-sql로 Jupyter Notebook에서 SQL query한 결과를 pandas DataFrame으로 가져오는 3가지 방법

'Greenplum and PostgreSQL Database' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바