A Friend for R and Python Analysis and Programming (by R Friend)

2 posts matching 'hist()'


[Python] Drawing histograms and kernel density curves for multiple groups and variables (Multiple histograms)

Python Analysis and Programming / Python Graphs_Visualization 2018. 12. 28. 12:10

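In the previous post, I showed how to draw a histogram and kernel density curve for a single group and a single variable.

This post covers how to draw

(1) histograms and kernel density curves for multiple groups

(2) histograms and kernel density curves for multiple variables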

First, let's import the matplotlib.pyplot and seaborn packages and load the iris dataset that serves as the example.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# loading 'iris' dataset
iris = sns.load_dataset('iris')

iris.shape

(150, 5)

iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

iris.groupby('species').size()

species
setosa        50
versicolor    50
virginica     50
dtype: int64

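Iris is a genus of flowering plants. As in the picture below, the dataset records four variables (petal length and width, sepal length and width) for 50 flowers from each of three species groups: versicolor, setosa, and virginica.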

* image source: https://www.datacamp.com/community/tutorials/machine-learning-in-r

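(1) Drawing histograms and kernel density curves for multiple groups

For the petal_length variable, let's draw the histograms and kernel density curves of the three groups (setosa, versicolor, and virginica), using a different color for each group.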

# 1-1. Multiple histograms on the same axis
sns.distplot(iris[iris.species == "setosa"]["petal_length"],
             color="blue", label="setosa")
sns.distplot(iris[iris.species == "versicolor"]["petal_length"],
             color="red", label="versicolor")
sns.distplot(iris[iris.species == "virginica"]["petal_length"],
             color="green", label="virginica")

plt.legend(title="Species")
plt.show()
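To make concrete what each distplot call draws, here is a minimal pure-Python sketch of the two ingredients: histogram bars rescaled so that their total area is 1, and a Gaussian kernel density estimate. The helper names `density_histogram` and `gaussian_kde`, the bin count, and the bandwidth `h = 0.1` are assumptions chosen for illustration; seaborn picks bins and bandwidth by its own rules, so its bars and curves will not match these numbers. The sample values are the five setosa petal lengths shown in the iris.head() output.

```python
import math

def density_histogram(values, bins):
    """Return (edges, heights) with heights scaled so the bar areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # Dividing by n * width turns counts into a density.
    heights = [c / (n * width) for c in counts]
    edges = [lo + k * width for k in range(bins + 1)]
    return edges, heights

def gaussian_kde(values, x, h):
    """Density estimate at x: the average of Gaussian bumps of width h."""
    n = len(values)
    total = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
    return total / (n * h * math.sqrt(2 * math.pi))

# First five petal_length values from iris.head()
petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]

edges, heights = density_histogram(petal_length, bins=2)
bar_area = sum(hgt * (edges[k + 1] - edges[k]) for k, hgt in enumerate(heights))

kde_at_mode = gaussian_kde(petal_length, x=1.4, h=0.1)  # where the data cluster
kde_in_tail = gaussian_kde(petal_length, x=2.0, h=0.1)  # far from the data

print(bar_area, kde_at_mode, kde_in_tail)
```

Because both the bars and the curve are on a density scale, they can share one y-axis, which is why the axis reads "Density" rather than a count.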

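If there are many groups, writing out each call by hand as above takes time and leaves long runs of repeated code; in that case you can use a for loop as shown below.

The example also shows how to set the plot title, the x-axis label, the y-axis label, and the legend title.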

# 1-2. Via for loop
grp_col_dict = {'setosa': 'blue',
                'versicolor': 'red',
                'virginica': 'green'}

# for loop of species group
for group in grp_col_dict:
    # subset of group
    subset = iris[iris['species'] == group]

    # histogram and kernel density curve
    sns.distplot(subset['petal_length'],
                 hist = True, # histogram
                 kde = True, # density curve
                 kde_kws = {'linewidth': 2},
                 color = grp_col_dict[group],
                 label = group)

# setting plot format
plt.title('Histogram & Density Plot by Groups')
plt.xlabel('Petal Length(unit:cm)')
plt.ylabel('Density')
plt.legend(prop={'size': 12}, title = 'Group')
plt.show()

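(2) Drawing histograms and kernel density curves for multiple variables

This time we draw histograms and kernel density curves for four variables: sepal_width, sepal_length, petal_width, and petal_length. (All observations are used, without separating by species.)

A for loop is used here as well; note carefully that the variable indexing in this loop differs from the group indexing in the loop above.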

# 2-1. Multiple histograms on the same axis
var_color_dict = {'sepal_length': 'blue',
                  'sepal_width': 'red',
                  'petal_length': 'yellow',
                  'petal_width': 'green'}

# for loop
for var in var_color_dict:
    sns.distplot(iris[var],
                 color = var_color_dict[var],
                 hist_kws = {'edgecolor': 'gray'},
                 label = var)

plt.legend(title = 'Variables')
plt.show()
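The difference between the group loop and the variable loop comes down to which axis of the DataFrame is indexed: the group loop keeps a subset of rows through a boolean mask, while the variable loop picks a whole column by its name. A small sketch, assuming a made-up five-row frame (the values are illustrative, not taken from iris):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 4.5, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 1.5, 1.8, 2.1],
})

# Group indexing: a boolean mask selects ROWS; every column is kept.
mask = toy["species"] == "setosa"
setosa_rows = toy[mask]

# Variable indexing: a column label selects one COLUMN; every row is kept.
petal_width = toy["petal_width"]

print(setosa_rows.shape)   # (2, 3)
print(petal_width.shape)   # (5,)
```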

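In plot (2-1) above, the histograms and density curves of the four variables share one figure and one set of axes, so they overlap and become somewhat confusing and hard to read.

In such cases, a good alternative is to separate the variables and draw each one in its own subplot, four in total, for comparison. ax=axes[0, 0] is the top-left subplot, ax=axes[0, 1] the top-right, ax=axes[1, 0] the bottom-left, and ax=axes[1, 1] the bottom-right.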

# 2-2. Multiple histograms at separate windows
f, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)

sns.distplot(iris["sepal_length"], color="blue", ax=axes[0, 0])
sns.distplot(iris["sepal_width"], color="red", ax=axes[0, 1])
sns.distplot(iris["petal_length"], color="yellow", ax=axes[1, 0])
sns.distplot(iris["petal_width"], color="green", ax=axes[1, 1])

plt.show()

To draw the same figure with a for loop, refer to the code below.

var_color_dict = {'sepal_length': 'blue',
                  'sepal_width': 'red',
                  'petal_length': 'yellow',
                  'petal_width': 'green'}

# row and column position of each subplot
rows = [0, 0, 1, 1]
cols = [0, 1, 0, 1]

# for loop
f, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)

# separate loop names (r, c) keep the position lists from being overwritten
for var, r, c in zip(var_color_dict, rows, cols):
    sns.distplot(iris[var],
                 color = var_color_dict[var],
                 ax = axes[r, c])

plt.show()
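The hand-written row and column lists can also be derived from a running index: in a grid with `ncols` columns that is filled row by row, subplot number `k` sits at row `k // ncols` and column `k % ncols`. A minimal sketch in plain Python; the helper name `grid_position` is hypothetical:

```python
def grid_position(k, ncols):
    """Map a running index k to (row, col) in a grid filled row by row."""
    return divmod(k, ncols)

variables = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
positions = {var: grid_position(k, ncols=2) for k, var in enumerate(variables)}

for var, (row, col) in positions.items():
    # With matplotlib, this is the axes[row, col] slot that `var` would use.
    print(var, "->", (row, col))
```

This keeps the loop correct if the grid shape changes later, for example to 3 columns, without retyping the position lists.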

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

R graphs, Base Graphics plotting system: hist(), boxplot(), stem(), barplot(), dotchart(), pie(), plot()

R Analysis and Programming / R Graphs_Visualization 2015. 12. 24. 01:20

R has three main plotting systems: (1) Base Graphics, (2) Lattice, and (3) ggplot2. Earlier posts covered drawing graphs with the ggplot2 plotting system; from this post on, we turn to the Base Graphics plotting system, which lets you build up a graph quickly, easily, and intuitively, step by step and interactively.

In the Base Graphics system you draw a graph interactively, step by step, by combining (1) high-level graphics functions (High Level Graphics facilities), which provide the basic skeleton; (2) low-level graphics functions (Low Level Graphics facilities), which add flesh to that skeleton piece by piece; and (3) graphical parameters (Graphic Parameters), which set options for graph properties such as color, shape, line type, and margins.

Below, a scatter plot serves as an example of what these terms mean.

> library(MASS)
> attach(Cars93)
> 
> # high level graphics facility : plot()
> # graphics parameters : type, pch, col, etc.
> plot(MPG.highway ~ Weight, type = "p", pch = 19, col = "black")
> 
> # low level graphics facility : abline(), title(), text()
> # graphics parameters : labels, cex, pos, col, etc.
> abline(lm(MPG.highway ~ Weight))
> text(Weight, MPG.highway, labels = abbreviate(Manufacturer, minlength = 5), 
+      cex = 0.6, pos = 2, col = "blue")

> 
> detach(Cars93)

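The R code for the graph above breaks down into high-level graphics functions, low-level graphics functions, and graphical parameters as follows. The high-level graphics function plot() first lays down the skeleton; the low-level graphics function abline() then adds the regression line fitted between car weight (Weight) and highway fuel economy (MPG.highway), and text() labels each point with the car manufacturer's name. Along the way, graphical parameters set options such as the plot type (type), the point symbol (pch), the color (col), the labels (labels), the magnification relative to the default (cex), and the position of each label relative to its point (pos; 2 places the label to the left).

The high-level graphics functions (High Level Graphics facilities) are summarized in the table below.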
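For readers curious what `lm(MPG.highway ~ Weight)` hands to `abline()`, the fitted line is the ordinary least-squares intercept and slope. Below is a rough Python sketch of that calculation. It uses only the first ten Weight and MPG.highway values that appear in the str(Cars93) output later in this post, so the coefficients differ from the full 93-car fit; the function name `fit_line` is made up for this example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# First ten rows only, as printed by str(Cars93)
weight = [2705, 3560, 3375, 3405, 3640, 2880, 3470, 4105, 3495, 3620]
mpg_highway = [31, 25, 26, 26, 30, 31, 28, 25, 27, 25]

intercept, slope = fit_line(weight, mpg_highway)
residuals = [y - (intercept + slope * x) for x, y in zip(weight, mpg_highway)]

print(f"MPG.highway = {intercept:.2f} + ({slope:.5f}) * Weight")
```

The slope comes out negative for these rows, in line with the downward-sloping line that abline() draws: heavier cars get fewer highway miles per gallon.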

Graph	High Level Graphics Functions of Base Graphics system
Histogram	hist()
Box-and-Whiskers Plot	boxplot()
Stem and Leaf Plot	stem()
Bar Plot	barplot()
Cleveland Dot Plot	dotchart()
Pie Plot	pie()
Scatter Plot	plot(x, y)
Scatter Plot Matrix	plot(dataframe) (cf. scatterplotMatrix() in the car package)
Line Plot	plot(x, y, type="l")
High Density Needle Plot	plot(x, y, type="h")
Both Dot and Line Plot	plot(x, y, type="b")
Overlapped Dot and Line Plot	plot(x, y, type="o")
Step Plot	plot(x, y, type="s")
Empty Plot	plot(x, y, type="n")

Let's go through a few simple example graphs drawn with the default parameters.

Univariate continuous data plots (plot for 1 variable, continuous data)

Histogram : hist()

> # histogram : hist()
> hist(Cars93$MPG.highway, main = "histogram : hist()")

box-and-whisker plot : boxplot()

> # box-and-whisker plot : boxplot()
> boxplot(Cars93$MPG.highway, main = "box-and-whisker plot : boxplot()")

stem and leaf plot : stem()

> # stem and leaf plot : stem()
> stem(Cars93$MPG.highway)

  The decimal point is 1 digit(s) to the right of the |

  2 | 00112233334444
  2 | 55555555666666666667777778888888888999999
  3 | 000000000111111123333333444
  3 | 6667778
  4 | 13
  4 | 6
  5 | 0
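The display above splits every stem into two rows: leaves 0 to 4 go on the first row and leaves 5 to 9 on the second. A simplified Python sketch of that layout follows. It is not R's actual stem() algorithm (which also chooses the scale automatically), the function name `stem_and_leaf` is hypothetical, and the input is just the first ten MPG.highway values shown in the str(Cars93) output.

```python
def stem_and_leaf(values, split=True):
    """Build stem-and-leaf rows for two-digit integers.

    With split=True each stem gets two rows: leaves 0-4, then leaves 5-9.
    """
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        half = 1 if (split and leaf >= 5) else 0
        rows.setdefault((stem, half), []).append(str(leaf))

    first, last = min(rows), max(rows)
    lines = []
    stem, half = first
    # Walk every row from the first to the last, printing empty ones too.
    while (stem, half) <= last:
        leaves = "".join(rows.get((stem, half), []))
        lines.append(f"{stem} | {leaves}")
        if split and half == 0:
            half = 1
        else:
            stem, half = stem + 1, 0
    return lines

mpg_highway = [31, 25, 26, 26, 30, 31, 28, 25, 27, 25]
lines = stem_and_leaf(mpg_highway)
for line in lines:
    print(line)
# 2 | 5556678
# 3 | 011
```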

Univariate categorical data plots (plot for 1 variable, categorical data)

bar plot : barplot()

> ##-------- plot for one variable, categorical data
> # bar plot : barplot()
> table_cyl <- table(Cars93$Cylinders)
> barplot(table_cyl, main = "bar plot : barplot()")

Cleveland dot plot : dotchart()

> # cleveland dot plot : dotchart()
> table_cyl <- table(Cars93$Cylinders) # frequency table
> Cylinders <- names(table_cyl) # names for label
> 
> dotchart(as.numeric(table_cyl), labels = Cylinders, main = "cleveland dot plot")

pie chart : pie()

> # pie chart : pie()
> table_cyl <- table(Cars93$Cylinders) # frequency table
> Cylinders <- names(table_cyl) # names for label
> 
> pie(table_cyl, labels = Cylinders, main = "pie chart")

Bivariate continuous data plots (plot for 2 variables, continuous data)

Scatter plot

> ##----- plot for 2 variables, continuous data
> # scatter plot : plot(x, y)
> with(Cars93, plot(Weight, MPG.highway, main = "scatter plot : plot(x, y)"))

scatter plot matrix : plot(dataframe), pairs(), scatterplotMatrix(dataframe)

> # scatter plot matrix : plot(dataframe)

> str(Cars93)
'data.frame':	93 obs. of 27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int 25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int 31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int 140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int 5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int 177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int 102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int 68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int 37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int 11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
>

> Cars93_subset <- Cars93[,c("Weight", "Horsepower", "MPG.highway", "MPG.city")]
> plot(Cars93_subset, main = "scatter plot matrix : plot(dataframe)")
>





> 
> # scatter plot matrix : scatterplotMatrix(dataframe)
> library(car)
> scatterplotMatrix(Cars93_subset, main = "scatter plot matrix : scatterplotMatrix(dataframe)")
>

plot by various types : plot(x, y, type = "l, h, b, o, s, n")

> ##--------
> # plot by various type : l, h, b, o, s, n
> # order by Weight
> Cars93_1 <- Cars93[order(Cars93$Weight),]
> 
> # dividing window frame
> par(mfrow = c(3, 2))
> 
> # plots by type
> attach(Cars93_1)

> # line plot
> plot(MPG.highway ~ Weight, type = "l", main = "type = l") 
> 
> # high density needle plot
> plot(MPG.highway ~ Weight, type = "h", main = "type = h") 
> 
> # both dot and line plot
> plot(MPG.highway ~ Weight, type = "b", main = "type = b") 
> 
> # overlapped dot and line plot
> plot(MPG.highway ~ Weight, type = "o", main = "type = o") 
> 
> # step plot
> plot(MPG.highway ~ Weight, type = "s", main = "type = s") 
> 
> # empty plot
> plot(MPG.highway ~ Weight, type = "n", main = "type = n") 
> 
> detach(Cars93_1)

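The graphs above were drawn with the parameters left almost entirely at their defaults, in order to give a brief introduction to the high-level graphics functions. The low-level graphics functions and how to set the main parameters will be covered in the next post.

I hope this was helpful.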
