Posts in the 'All Categories' category (Page 32)

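[R] Autocorrelation Coefficients and the Autocorrelation Plot

R Analysis and Programming / R Statistical Analysis, 2020. 3. 22. 23:58

Autocorrelation coefficients and the autocorrelation plot are used (a) to check whether data are random, that is, whether there is no time dependence in the data. They are also used (b) in the model identification stage of Box-Jenkins ARIMA (Autoregressive Integrated Moving Average) modeling in time series analysis, where the autocorrelation plot is usually read together with the partial autocorrelation plot to help choose the AR and MA orders.

In this post I will use R to show how to

(1) compute autocorrelation coefficients,

(2) compute the 95% confidence interval of the autocorrelation coefficients, and

(3) draw an autocorrelation plot.

Before getting into the details, let's compare correlation coefficients with autocorrelation coefficients.

What they have in common is that both measure the relationship between two continuous variables. In both cases the basic idea is to standardize the covariance by dividing it by the standard deviations, so the (auto)correlation coefficient takes a value between -1 and 1.

Where they differ is that a correlation coefficient takes a cross-section at a single point in time and measures the relationship between Y and other variables X1, X2, ... An autocorrelation coefficient, on the other hand, measures the relationship of one and the same variable with itself (Yt, Yt-1, Yt-2, ...) across different time lags.

If you are used to cross-sectional correlation analysis between Y and X variables, the idea of a variable being correlated with its own lagged values can be confusing when you first study time series analysis. Only the point of view has changed; the concept itself is not difficult, so it is worth understanding it precisely before moving on.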

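(1) Computing autocorrelation coefficients and drawing the autocorrelation plot

The autocorrelation coefficient at lag k is computed with the formula below.

$$r_k = \frac{\sum_{t=1}^{N-k} (Y_t - \bar{Y})(Y_{t+k} - \bar{Y})}{\sum_{t=1}^{N} (Y_t - \bar{Y})^2}$$

As a simple example, for time t from 1 to 50, let's generate a dataset Y by taking a periodic sine wave and adding random numbers drawn from the normal distribution N(0, 1) and scaled by 0.5, and then use R to compute the autocorrelation coefficient at each time lag with the formula above.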

# sine wave with noise

set.seed(123)

noise <- 0.5*rnorm(50, mean=0, sd=1)

t <- seq(1, 50, 1)

y <- sin(t) + noise

plot(y, type="b")

The simplest way to compute the autocorrelation coefficients and draw the autocorrelation plot is to use the acf() function in R's stats package.

z <- acf(y, type=c("correlation"), plot=TRUE)

# autocorrelation coefficients per time lag k

round(z$acf, 4)

[,1]

[1,] 1.0000

[2,] 0.3753

[3,] -0.3244

[4,] -0.5600

[5,] -0.4216

[6,] 0.2267

[7,] 0.5729

[8,] 0.3551

[9,] -0.0952

[10,] -0.5586

[11,] -0.4582

[12,] 0.0586

[13,] 0.3776

[14,] 0.4330

[15,] 0.0961

[16,] -0.4588

[17,] -0.4759

# autocorrelation plot

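It is a bit more involved, but you can also write a user-defined function in R from the formula above and use a for loop to compute the autocorrelation coefficient for each time lag k. The function standardizes the autocovariance by dividing it by the variance. At lag 0 the series is correlated with itself, so the autocorrelation coefficient is 1, and because of the standardization the autocorrelation coefficient satisfies -1 <= autocorr(Yt, Yt-k) <= 1.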

# User-defined function for the autocorrelation coefficient
acf_func <- function(y, lag_k){
  # y: input vector
  # lag_k: lag order k
  N = length(y) # total number of observations
  y_bar = mean(y)
  # Variance
  var = sum((y - y_bar)^2) / N
  # Autocovariance
  auto_cov = sum((y[1:(N-lag_k)] - y_bar) * (y[(1+lag_k):(N)] - y_bar)) / N
  # Autocorrelation coefficient = Autocovariance / Variance
  r_k = auto_cov / var
  return(r_k)
}

# Compute the autocorrelation for every lag (0 to N-1, i.e. 0 to 49 in this case)
acf <- data.frame()
for (k in 0:(length(y)-1)){
  acf_k <- round(acf_func(y, lag_k = k), 4)
  acf[k+1, 'lag'] = k
  acf[k+1, 'ACF'] = acf_k
}

> print(acf)

lag ACF

1 0 1.0000

2 1 0.3753

3 2 -0.3244

4 3 -0.5600

5 4 -0.4216

6 5 0.2267

7 6 0.5729

8 7 0.3551

9 8 -0.0952

10 9 -0.5586

11 10 -0.4582

12 11 0.0586

13 12 0.3776

14 13 0.4330

15 14 0.0961

16 15 -0.4588

17 16 -0.4759

18 17 -0.0594

19 18 0.2982

20 19 0.3968

21 20 0.1227

22 21 -0.2544

23 22 -0.3902

24 23 -0.1981

25 24 0.1891

26 25 0.3701

27 26 0.1360

28 27 -0.1339

29 28 -0.2167

30 29 -0.1641

31 30 0.0842

32 31 0.2594

33 32 0.1837

34 33 0.0139

35 34 -0.1695

36 35 -0.1782

37 36 -0.0363

38 37 0.1157

39 38 0.1754

40 39 0.0201

41 40 -0.1428

42 41 -0.1014

43 42 0.0000

44 43 0.0750

45 44 0.0641

46 45 -0.0031

47 46 -0.0354

48 47 -0.0405

49 48 -0.0177

50 49 -0.0054

Because the data in this example follow a sine wave over time, the series naturally shows large positive (+) and large negative (-) autocorrelations that alternate periodically as the lag increases.
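For readers who want to check the formula outside R, here is a minimal pure-Python sketch of the same lag-k computation. The function name `autocorr` and the noise-free sine input are illustrative assumptions, not part of the R session above, so the printed numbers will differ from the R table.

```python
import math

def autocorr(y, lag_k):
    """Lag-k autocorrelation: autocovariance divided by variance (both over N)."""
    n = len(y)
    y_bar = sum(y) / n
    var = sum((v - y_bar) ** 2 for v in y) / n
    auto_cov = sum((y[t] - y_bar) * (y[t + lag_k] - y_bar)
                   for t in range(n - lag_k)) / n
    return auto_cov / var

# Noise-free sine wave for t = 1..50 (illustrative input, not the R data)
y = [math.sin(t) for t in range(1, 51)]
acf_values = [autocorr(y, k) for k in range(10)]
print([round(r, 4) for r in acf_values])
```

As in the R version, the sign of the coefficient flips roughly every half period of the sine wave (the period is about 6.28 time steps).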

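(2) Computing the 95% confidence interval of the autocorrelation coefficients

The 95% confidence interval of the autocorrelation coefficients can be computed simply with the formula below.

95% confidence interval of the autocorrelation coefficient = $\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}$

Here N is the sample size, z is the quantile function (inverse cumulative distribution function) of the standard normal distribution, and $\alpha$ is the 5% significance level.

As is well known, z = 1.96 at significance level $\alpha$ = 0.05. The number of observations in this example is N = 50, so the 95% confidence interval of the autocorrelation coefficients is $\pm 1.96 / \sqrt{50} \approx \pm 0.277$.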

> # z quantile of 95% confidence level

> qnorm(0.975, mean=0, sd=1, lower.tail=TRUE)

[1] 1.959964

> qnorm(0.025, mean=0, sd=1, lower.tail=TRUE)

[1] -1.959964

>

> # sample size

> N <- length(y)

> print(N)

[1] 50

>

> # 95% confidence interval of autocorrelation

> qnorm(0.975, mean=0, sd=1, lower.tail=TRUE)/ sqrt(N)

[1] 0.2771808

> qnorm(0.025, mean=0, sd=1, lower.tail=TRUE)/ sqrt(N)

[1] -0.2771808
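The same bound can be reproduced with Python's standard library as a cross-check. This is only a sketch: `NormalDist().inv_cdf` plays the role of R's qnorm here, and the variable names are assumptions made for illustration. The result should agree with the R output above to the printed precision.

```python
import math
from statistics import NormalDist

n_obs = 50                        # sample size N from the example above
z = NormalDist().inv_cdf(0.975)   # two-sided 95% quantile, about 1.96
bound = z / math.sqrt(n_obs)      # half-width of the band around zero
print(round(z, 6), round(bound, 7))
```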

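In the autocorrelation plot drawn by acf() above, the horizontal dashed lines at +0.277 and -0.277 on the ACF axis are the upper and lower limits of this 95% confidence interval.

The autocorrelation coefficients in that plot periodically fall outside the upper and lower dashed lines, so we can conclude that this Y dataset is not random and that autocorrelation is present.

If you computed the autocorrelation coefficients of a random dataset and drew the autocorrelation plot, the coefficients at lags of 1 or more would almost all lie inside the dashed limits; with a 95% band, roughly 1 in 20 can still fall outside by chance.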

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


[Tensorflow] Public Datasets for Deep Learning: TensorFlow Datasets

Deep Learning (TF, Keras, PyTorch), 2020. 3. 19. 00:26

When studying deep learning or machine learning, it can be hard to prepare or find data to use as examples. Labeled image, video, and audio data in particular are not easy to build on your own.

In this post I will show how to download and fetch public datasets from TensorFlow Datasets so they can be used for deep learning and machine learning.

TensorFlow Datasets is available at the two locations below.

(Many many thanks to TensorFlow team!!! ^__^)

TensorFlow Datasets : https://www.tensorflow.org/datasets
TensorFlow Datasets on GitHub : https://github.com/tensorflow/datasets

(1) Installing TensorFlow 2.0 and the TensorFlow Datasets library

In a command prompt (cmd) window, install TensorFlow 2.0 and the TensorFlow Datasets library with pip install.

(* When using a CPU)

$ pip install --upgrade pip

$ pip install tensorflow

$ pip install tensorflow_datasets

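(* When using a GPU with TensorFlow 2.0; from TensorFlow 2.1 on, the plain tensorflow package also includes GPU support)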

$ pip install tensorflow-gpu

(2) Importing tensorflow and tensorflow_datasets and loading a dataset

First, import TensorFlow v2 and the tensorflow_datasets library.

import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds

From TensorFlow v2 on, eager execution is the default mode, as in PyTorch. The call below enables TensorFlow v2 behavior; tfds works in both Eager and Graph modes.

# tfds works in both Eager and Graph modes

tf.enable_v2_behavior()

Let's list all the public datasets registered in TensorFlow Datasets. There are too many to show, so the middle of the output is omitted; the list is organized again by category further below.

# Tensorflow Datasets Lists

tfds.list_builders()

['abstract_reasoning',
 'aeslc',
 'aflw2k3d',
 'amazon_us_reviews',
 'arc',
 'bair_robot_pushing_small',
   :  
  -- (middle of the list omitted) --

 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'xnli',
 'xsum',
 'yelp_polarity_reviews']

The datasets are organized into 8 categories: Audio, Image, Object Detection, Structured, Summarization, Text, Translate, and Video. (* link: https://www.tensorflow.org/datasets/catalog/overview)

Audio
groove librispeech libritts ljspeech nsynth savee speech_commands
Image
abstract_reasoning aflw2k3d arc beans bigearthnet binarized_mnist binary_alpha_digits caltech101 caltech_birds2010 caltech_birds2011 cars196 cassava cats_vs_dogs celeb_a celeb_a_hq cifar10 cifar100 cifar10_1 cifar10_corrupted citrus_leaves cityscapes clevr cmaterdb coil100 colorectal_histology colorectal_histology_large curated_breast_imaging_ddsm cycle_gan deep_weeds diabetic_retinopathy_detection div2k dmlab downsampled_imagenet dsprites dtd duke_ultrasound emnist eurosat fashion_mnist flic food101 geirhos_conflict_stimuli horses_or_humans i_naturalist2017 image_label_folder imagenet2012 imagenet2012_corrupted imagenet_resized imagenette imagewang kmnist lfw lost_and_found lsun malaria mnist mnist_corrupted omniglot oxford_flowers102 oxford_iiit_pet patch_camelyon pet_finder places365_small plant_leaves plant_village plantae_k quickdraw_bitmap resisc45 rock_paper_scissors scene_parse150 shapes3d smallnorb so2sat stanford_dogs stanford_online_products sun397 svhn_cropped tf_flowers the300w_lp uc_merced vgg_face2 visual_domain_decathlon
Object Detection
coco kitti open_images_v4 voc wider_face
Structured
amazon_us_reviews forest_fires german_credit_numeric higgs iris rock_you titanic
Summarization
aeslc big_patent billsum cnn_dailymail gigaword multi_news newsroom opinosis reddit_tifu scientific_papers wikihow xsum
Text
blimp c4 cfq civil_comments cos_e definite_pronoun_resolution eraser_multi_rc esnli gap glue imdb_reviews librispeech_lm lm1b math_dataset movie_rationales multi_nli multi_nli_mismatch natural_questions qa4mre scan scicite snli squad super_glue tiny_shakespeare trivia_qa wikipedia xnli yelp_polarity_reviews
Translate
flores para_crawl ted_hrlr_translate ted_multi_translate wmt14_translate wmt15_translate wmt16_translate wmt17_translate wmt18_translate wmt19_translate wmt_t2t_translate
Video
bair_robot_pushing_small moving_mnist robonet starcraft_video ucf101

(3) Downloading the CIFAR 100 dataset to local disk and loading it

(download & load CIFAR 100 dataset)

Let's download the CIFAR 100 dataset, an example dataset widely used for image classification with deep learning, to local disk.

[ CIFAR 10 image visualization (example) ]

cifar_builder = tfds.builder("cifar100")

cifar_builder.download_and_prepare()

The CIFAR 100 dataset is downloaded under the ./tensorflow_datasets/cifar100/3.0.0. folder. The train and test dataset records, the label data, and JSON files describing the dataset and the images have been downloaded to local disk.

(4) Viewing dataset attributes (Datasets Attributes)

cifar_builder.info shows the attributes of the dataset, as below. You can use this attribute information to look up whatever you need.

print(cifar_builder.info)

[Out]: 
tfds.core.DatasetInfo(
    name='cifar100',
    version=3.0.0,
    description='This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).',
    homepage='https://www.cs.toronto.edu/~kriz/cifar.html',
    features=FeaturesDict({
        'coarse_label': ClassLabel(shape=(), dtype=tf.int64, num_classes=20),
        'image': Image(shape=(32, 32, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=100),
    }),
    total_num_examples=60000,
    splits={
        'test': 10000,
        'train': 50000,
    },
    supervised_keys=('image', 'label'),
    citation="""@TECHREPORT{Krizhevsky09learningmultiple,
        author = {Alex Krizhevsky},
        title = {Learning multiple layers of features from tiny images},
        institution = {},
        year = {2009}
    }""",
    redistribution_info=,
)

Let's print the 100 labels of the CIFAR 100 dataset. Based on the attribute information above, the label names can be read from features["label"].names.

# label names

print(cifar_builder.info.features["label"].names)

[Out]:

['apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee', 'beetle', 'bicycle', 'bottle', 'bowl', 'boy', 'bridge', 'bus', 'butterfly', 'camel', 'can', 'castle', 'caterpillar', 'cattle', 'chair', 'chimpanzee', 'clock', 'cloud', 'cockroach', 'couch', 'crab', 'crocodile', 'cup', 'dinosaur', 'dolphin', 'elephant', 'flatfish', 'forest', 'fox', 'girl', 'hamster', 'house', 'kangaroo', 'keyboard', 'lamp', 'lawn_mower', 'leopard', 'lion', 'lizard', 'lobster', 'man', 'maple_tree', 'motorcycle', 'mountain', 'mouse', 'mushroom', 'oak_tree', 'orange', 'orchid', 'otter', 'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'plate', 'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'road', 'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk', 'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar', 'sunflower', 'sweet_pepper', 'table', 'tank', 'telephone', 'television', 'tiger', 'tractor', 'train', 'trout', 'tulip', 'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'woman', 'worm']

print(cifar_builder.info.features["coarse_label"].names)

[Out]:

['aquatic_mammals', 'fish', 'flowers', 'food_containers', 'fruit_and_vegetables', 'household_electrical_devices', 'household_furniture', 'insects', 'large_carnivores', 'large_man-made_outdoor_things', 'large_natural_outdoor_scenes', 'large_omnivores_and_herbivores', 'medium_mammals', 'non-insect_invertebrates', 'people', 'reptiles', 'small_mammals', 'trees', 'vehicles_1', 'vehicles_2']

(5) Loading the train and validation sets separately

# Train/ Validation Datasets

train_cifar_dataset = cifar_builder.as_dataset(split=tfds.Split.TRAIN)

val_cifar_dataset = cifar_builder.as_dataset(split=tfds.Split.TEST)

print(train_cifar_dataset)

[Out]:

<DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>

print(val_cifar_dataset)

[Out]:

<DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>

# Number of classes: 100

num_classes = cifar_builder.info.features['label'].num_classes

# Number of images: train 50,000 . validation 10,000

num_train_imgs = cifar_builder.info.splits['train'].num_examples

num_val_imgs = cifar_builder.info.splits['test'].num_examples

print("Training dataset instance:", train_cifar_dataset)

Training dataset instance: <DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>

(6) Preprocessing the dataset (resizing, augmentation, batching, building the validation set)

* code reference: Hands-on Computer Vision with TensorFlow 2 by Eliot Andres & Benjamin Planche

(https://www.amazon.com/Hands-Computer-Vision-TensorFlow-processing-ebook/dp/B07SMQGX48)

import math

input_shape = [224, 224, 3]

batch_size = 32

num_epochs = 30

train_cifar_dataset = train_cifar_dataset.repeat(num_epochs).shuffle(10000)

def _prepare_data_fn(features, input_shape, augment=False):
    """
    Resize image to expected dimensions, and opt. apply some random transformations.
    - param features: Data
    - param input_shape: Shape expected by the models (images will be resized accordingly)
    - param augment: Flag to apply some random augmentations to the images
    - return: Augmented Images, Labels
    """
    input_shape = tf.convert_to_tensor(input_shape)
    # Tensorflow-dataset returns batches as feature dictionaries, expected by Estimators.
    # To train Keras models, it is more straightforward to return the batch content as tuples.
    image = features['image']
    label = features['label']
    # Convert the images to float type, also scaling their values from [0, 255] to [0., 1.]
    image = tf.image.convert_image_dtype(image, tf.float32)
    if augment:
        # Randomly applied horizontal flip
        image = tf.image.random_flip_left_right(image)
        # Random brightness/saturation changes
        image = tf.image.random_brightness(image, max_delta=0.1)
        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
        image = tf.clip_by_value(image, 0.0, 1.0) # keeping pixel values in check
        # random resize and random crop back to expected size
        random_scale_factor = tf.random.uniform([1], minval=1., maxval=1.4, dtype=tf.float32)
        scaled_height = tf.cast(tf.cast(input_shape[0], tf.float32) * random_scale_factor, tf.int32)
        scaled_width = tf.cast(tf.cast(input_shape[1], tf.float32) * random_scale_factor, tf.int32)
        scaled_shape = tf.squeeze(tf.stack([scaled_height, scaled_width]))
        image = tf.image.resize(image, scaled_shape)
        image = tf.image.random_crop(image, input_shape)
    else:
        image = tf.image.resize(image, input_shape[:2])
    return image, label

import functools

prepare_data_fn_for_train = functools.partial(_prepare_data_fn,

input_shape=input_shape,

augment=True)

train_cifar_dataset = train_cifar_dataset.map(prepare_data_fn_for_train, num_parallel_calls=4)

# batch the samples

train_cifar_dataset = train_cifar_dataset.batch(batch_size)

train_cifar_dataset = train_cifar_dataset.prefetch(1)

# validation dataset (not shuffling or augmenting it)

prepare_data_fn_for_val = functools.partial(_prepare_data_fn,

input_shape=input_shape,

augment=False)

val_cifar_dataset = (val_cifar_dataset

.repeat()

.map(prepare_data_fn_for_val, num_parallel_calls=4)

.batch(batch_size)

.prefetch(1))

train_steps_per_epoch = math.ceil(num_train_imgs / batch_size)

val_steps_per_epoch = math.ceil(num_val_imgs / batch_size)
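The two steps-per-epoch values can be checked by hand. The sketch below hard-codes the CIFAR-100 split sizes shown in the dataset info above (50,000 train, 10,000 test) instead of reading them from the builder, so it runs without TensorFlow; that substitution is an assumption made only for illustration.

```python
import math

num_train_imgs = 50000  # size of the 'train' split reported in the dataset info
num_val_imgs = 10000    # size of the 'test' split reported in the dataset info
batch_size = 32

# 50000 / 32 = 1562.5 and 10000 / 32 = 312.5, so ceil() rounds both up
train_steps_per_epoch = math.ceil(num_train_imgs / batch_size)
val_steps_per_epoch = math.ceil(num_val_imgs / batch_size)
print(train_steps_per_epoch, val_steps_per_epoch)
```

Rounding up rather than down means the final, partially filled batch of each pass is still counted as a step.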


[Python] Creating a New Function by Reusing an Existing One with Fixed Argument Values: functools partial()

Python Analysis and Programming / Python Installation and Basic Usage, 2020. 3. 18. 12:09

Python is a high-level, object-oriented programming language. The ability to write reusable code with the tools Python provides is one of its major strengths.

Python's functools is a library designed for higher-order functions: it lets you take existing callable objects and functions and return or build other functions from them. In short, functools is a module that collects functions that operate on existing functions. It contains partial(), total_ordering(), reduce(), cached_property(), and more; this post covers partial().

(* For functools.reduce(), see: https://rfriend.tistory.com/366)

functools.partial() reuses an existing Python function and returns a new partial object that behaves like the original function with some positional or keyword arguments frozen (fixed). The syntax of functools.partial() is shown below: pass the existing function, followed by the argument values to freeze.

from functools import partial

functools.partial(func, /, *args, **keywords)
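To make the signature concrete, here is a rough pure-Python sketch of what a partial object does when it is called. The names `my_partial` and `power` are hypothetical and exist only for this illustration; the real functools.partial is implemented differently but merges arguments in the same order.

```python
def my_partial(func, *args, **keywords):
    """Hypothetical re-implementation of partial(), for illustration only."""
    def newfunc(*fargs, **fkeywords):
        # Frozen keywords first, so call-time keywords can override them
        newkeywords = {**keywords, **fkeywords}
        # Frozen positional arguments go before the call-time ones
        return func(*args, *fargs, **newkeywords)
    newfunc.func = func
    newfunc.args = args
    newfunc.keywords = keywords
    return newfunc

def power(base, exponent):
    return base ** exponent

square = my_partial(power, exponent=2)   # freeze a keyword argument
two_to_the = my_partial(power, 2)        # freeze the first positional argument
print(square(7), two_to_the(10))
```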

As a simple example, let's reuse Python's built-in int(x, base=10), which converts a number or string to an integer, and use functools.partial() to create a new function with base=2 frozen.

# python built-in function int(x, base=10)

int('101', base=2)

[Out]: 5

int('101', base=5)

[Out]: 26

# create a basetwo() function using functools.partial() and int() function

from functools import partial

basetwo = partial(int, base=2) # freeze base=2

basetwo('101')

[Out]: 5

Let's add a description through basetwo.__doc__ to the new basetwo() function, which was created by reusing int() with base=2 frozen, and then read it back.

# add __doc__ attribute

basetwo.__doc__ = 'Convert base 2 string to an int.'

basetwo.__doc__

[Out]: 'Convert base 2 string to an int.'

The new object created by functools.partial() has func, args, and keywords attributes.

In this example the reused function (func) is int() and the keyword arguments (keywords) are base=2 ({'base': 2}), which can be checked as follows.

basetwo.func

[Out]: int

basetwo.args

[Out]: ()

basetwo.keywords

[Out]: {'base': 2}
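Two behaviors follow from these attributes and are worth trying: a keyword passed at call time overrides the frozen one, and positional arguments are frozen from the left. The `greet` helper below is a made-up example, not something from the post above.

```python
from functools import partial

basetwo = partial(int, base=2)

# A keyword passed at call time overrides the frozen base=2
overridden = basetwo('17', base=8)

# Positional arguments are frozen from the left and stored in .args
greet = partial('{}, {}!'.format, 'Hello')
greeting = greet('Neo')

print(overridden, greeting, greet.args, greet.keywords)
```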

* functools library reference: https://docs.python.org/3/library/functools.html#module-functools


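[Python] Serializing Python Objects to Save Them and Deserializing to Load Them: pickle.dump(), pickle.load()

Python Analysis and Programming / Python Installation and Basic Usage, 2020. 3. 16. 23:56

Serialization means (a) converting a Python object's data into a sequence of bytes, following a fixed set of rules (a protocol), so that it can be stored efficiently or sent over a stream. Deserialization means (b) restoring such a serialized file or byte sequence back into the original object.

There are several ways to serialize and deserialize, including Pickle, JSON, and YAML. Pickle output is not human-readable but is efficient to store and transmit. JSON and YAML are less efficient than Pickle for storage and transmission, but have the advantage of being human-readable. (* Reading and writing JSON data in Python: https://rfriend.tistory.com/474)

This post shows how to serialize to a binary file or a bytes object with Pickle and how to deserialize it again. With Pickle, serialization is also called pickling and deserialization unpickling.

In Python 2 there were two modules for this: the pickle module, written in Python and part of the standard library, and the cPickle module, written in C and much faster but without support for subclassing. In Python 3 there is only the pickle module, which uses its C implementation automatically when it is available. (* source : https://pymotw.com/2/pickle/)

Pickle can serialize most built-in Python objects: integers, floats, complex numbers, strings, booleans, bytes objects, byte arrays, None, lists, tuples, dictionaries, and sets, as well as functions and classes defined at the top level of a module and instances of such classes. Some objects, such as lambdas, generators, and open file objects, cannot be pickled.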
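A quick way to see where the limits are is to try pickling a few objects and catch the failure. The helper name `can_pickle` is an assumption made for this sketch; it only reports whether pickle.dumps() succeeds.

```python
import pickle

def can_pickle(obj):
    """Return True if pickle.dumps() succeeds for obj (illustrative helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

plain_data_ok = can_pickle({'id': [1, 2], 'ok': True, 'z': 1 + 2j, 'none': None})
builtin_ok = can_pickle(len)                       # built-in function, stored by reference
lambda_ok = can_pickle(lambda x: x)                # anonymous function has no importable name
generator_ok = can_pickle(n for n in range(3))     # generator state cannot be saved
print(plain_data_ok, builtin_ok, lambda_ok, generator_ok)
```

Functions and classes are pickled by reference to their module and name, not by value, which is why an anonymous lambda fails while a built-in such as len succeeds.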

In this post, using the pickle module on Python 3.7 (protocol = 3), I will cover

(1-1) Serializing a Python object and saving it to a binary file: pickle.dump()

(1-2) Deserializing a serialized binary file back into a Python object: pickle.load()

(2-1) Serializing a Python object and keeping it in memory: pickle.dumps()

(2-2) Deserializing a serialized bytes object back into a Python object: pickle.loads()

in turn.

Pay close attention to the trailing 's': pickle.dump() versus pickle.dumps(), and pickle.load() versus pickle.loads().

Category	Binary file on disk	Bytes object in memory
Serialization	with open("file.txt", "wb") as MyFile: pickle.dump(MyObject, MyFile)	MyBytes = pickle.dumps(MyObject)
Deserialization	with open("file.txt", "rb") as MyFile: MyObj2 = pickle.load(MyFile)	MyObj2 = pickle.loads(MyBytes)

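Serialization and deserialization must follow the same rules, so the pickle protocol version matters. Protocol 3 was added in Python 3.0, protocol 4 in Python 3.4, and protocol 5 in Python 3.8. Note that data saved with a newer protocol version cannot be deserialized by a Python version that does not support that protocol. (For example, a file serialized with protocol = 3 on Python 3.x cannot be deserialized on Python 2.x, which only supports protocols up to 2.)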
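The protocol in use can be inspected directly. The sketch below prints the default and highest protocol of the running interpreter (both depend on the Python version) and shows that data written with protocol 2 or later begins with a marker byte followed by the protocol number. The `obj` dictionary is a made-up example.

```python
import pickle

obj = {'id': [1, 2, 3], 'name': 'KIM'}

# These two values depend on the Python version that runs the code
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Protocol 2 and later start with the PROTO opcode (0x80) and the version byte
headers = {}
for proto in (2, 3):
    data = pickle.dumps(obj, protocol=proto)
    headers[proto] = data[:2]
    assert pickle.loads(data) == obj   # older protocols still load on newer Pythons
print(headers)
```

Choosing a lower protocol explicitly is one way to keep files readable by older interpreters, at the cost of a less compact encoding.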

First, let's create a Python dictionary object containing integer, text, float, and boolean values to use as the example.

MyObject = {'id': [1, 2, 3, 4],

'name': ['KIM', 'CHOI', 'LEE', 'PARK'],

'score': [90.5, 85.7, 98.9, 62.4],

'pass_yn': [True, True, True, False]}

(1-1) Serializing a Python object and saving it to a binary file: pickle.dump()

import pickle

with open("serialized_file.txt", "wb") as MyFile:
    pickle.dump(MyObject, MyFile, protocol=3)

with open("serialized_file.txt", "wb") as MyFile opens a file named "serialized_file.txt" in binary write mode ("wb"). (With the with statement there is no need to call close().)

pickle.dump(MyObject, MyFile) serializes the MyObject dictionary created above and saves it to MyFile ("serialized_file.txt").

Because I am working on Python 3.7, I specified protocol=3; on Python 3.0 to 3.7 the default is protocol=3, so it can be omitted.

A "serialized_file.txt" file will have been created in the current working directory. If you open it, you will see that the content is stored in a binary form that is not human-readable. (If you want human-readable storage or transmission, serialize with JSON, YAML, or similar instead.)

(1-2) Deserializing a serialized binary file back into a Python object: pickle.load()

import pickle

with open("serialized_file.txt", "rb") as MyFile:
    UnpickledObject = pickle.load(MyFile)

UnpickledObject

[Out]:

{'id': [1, 2, 3, 4],
 'name': ['KIM', 'CHOI', 'LEE', 'PARK'],
 'score': [90.5, 85.7, 98.9, 62.4],
 'pass_yn': [True, True, True, False]}

with open("serialized_file.txt", "rb") as MyFile opens the "serialized_file.txt" file, to which the Python dictionary was serialized in (1-1), in binary read mode ("rb") under the name MyFile.

UnpickledObject = pickle.load(MyFile) deserializes (de-serialization, unpickling, decoding) the opened MyFile and creates a Python object named UnpickledObject.

Evaluating UnpickledObject shows that the human-readable dictionary has been restored correctly.

(2-1) Serializing a Python object and keeping it in memory as a bytes object: pickle.dumps()

import pickle

MyBytes = pickle.dumps(MyObject, protocol=3)

# unreadable bytes object

MyBytes

[Out]:
b'\x80\x03}q\x00(X\x02\x00\x00\x00idq\x01]q\x02(K\x01K\x02K\x03K\x04eX\x04\x00\x00\x00nameq\x03]q\x04(X\x03\x00\x00\x00KIMq\x05X\x04\x00\x00\x00CHOIq\x06X\x03\x00\x00\x00LEEq\x07X\x04\x00\x00\x00PARKq\x08eX\x05\x00\x00\x00scoreq\t]q\n(G@V\xa0\x00\x00\x00\x00\x00G@Ul\xcc\xcc\xcc\xcc\xcdG@X\xb9\x99\x99\x99\x99\x9aG@O333333eX\x07\x00\x00\x00pass_ynq\x0b]q\x0c(\x88\x88\x88\x89eu.'

Whereas (1-1) saved the Python object to local disk as a binary file, (2-1) uses pickle.dumps(object_name, protocol=3) to serialize it into a bytes object in memory. Note the extra 's' at the end of pickle.dumps().

A bytes object serialized this way is not human-readable. (For the computer, on the other hand, it is a more efficient form for storing data.)

(2-2) Deserializing a serialized bytes object back into a Python object: pickle.loads()

import pickle

MyObj2 = pickle.loads(MyBytes)

MyObj2

[Out]:

{'id': [1, 2, 3, 4],
 'name': ['KIM', 'CHOI', 'LEE', 'PARK'],
 'score': [90.5, 85.7, 98.9, 62.4],
 'pass_yn': [True, True, True, False]}

This example deserializes the bytes object MyBytes, serialized in (2-1), with pickle.loads() and creates a Python object named MyObj2.
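Both round trips can be put side by side in one small script. This is a sketch under a few assumptions: it writes to a temporary directory rather than the working folder, and the file name and lower-case variable names are hypothetical.

```python
import os
import pickle
import tempfile

my_object = {'id': [1, 2, 3, 4], 'pass_yn': [True, True, True, False]}

# In-memory round trip: dumps() returns bytes, loads() rebuilds the object
my_bytes = pickle.dumps(my_object)
from_bytes = pickle.loads(my_bytes)

# On-disk round trip: dump() and load() take an open binary file object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'serialized_file.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(my_object, f)
    with open(path, 'rb') as f:
        from_file = pickle.load(f)

# Equal in value, but a new object rather than the original one
print(from_bytes == my_object, from_file == my_object, from_file is my_object)
```

One caution from the pickle documentation applies to both paths: unpickling can execute arbitrary code, so only load pickle data that comes from a source you trust.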

* reference: https://docs.python.org/3.7/library/pickle.html

* pickle and cPickle: https://pymotw.com/2/pickle/


[Python] Writing and Reading Text Files (writelines() and readlines() methods)

Python Analysis and Programming / Python Installation and Basic Usage, 2020. 3. 8. 22:36

The previous post showed how to open a file, read it, write to it, and close it.

In this post I will cover, in Python,

(1) Writing a list of strings to a text file

- (1-1) Writing one line at a time: write() method (used with a for loop)

- (1-2) Writing all lines at once: writelines() method

(2) Reading a text file

- (2-1) Reading the text one line at a time: readline() method (used with a while loop)

- (2-2) Reading all the text at once: readlines() method

(1-1) Writing a list of strings to a text file one line at a time: write() method

The example text is a Python list of a few lines that Morpheus says to Neo in the movie The Matrix (Morpheus Quotes to Neo). The lines are separated with '\n'.

matrix_quotes = ["The Matrix is the world that has been pulled over your eyes to blind you from the truth.\n",

"You have to let it all go, Neo. Fear, doubt, and disblief. Free your mind.\n",

"There is a difference between knowing the path and walking the path.\n",

"Welcom to the desert of the real!\n"]

* image source: https://www.elitecolumn.com/morpheus-the-matrix-quotes/morpheus-the-matrix-quotes-4/

Let's write the 'matrix_quotes' list of strings above to a file named 'matrix_quotes.txt', one line at a time. First, with open('matrix_quotes.txt', 'w') opens 'matrix_quotes.txt' in write mode ('w'). Since the list has 4 lines, a for loop takes each item as line and f.write(line) writes it to the 'matrix_quotes.txt' text file line by line.

with open('matrix_quotes.txt', 'w') as f:
    for line in matrix_quotes:
        f.write(line)

* Note: when you use with open(), the file is closed automatically, so you do not have to call close() yourself.

* In with open('matrix_quotes.txt', 'w'), 'w' means write mode.

You can confirm that the 'matrix_quotes.txt' text file has been created in the folder where the Jupyter notebook is running.

Opening the 'matrix_quotes.txt' file written with the for loop and write() shows that its content matches the Python list and that the line breaks follow the '\n' characters.

(1-2) Writing all the lines of a list of strings to a text file: writelines() method

The write() method above has the inconvenience of needing a for loop when there are several lines. The writelines() method writes all the lines in the list to the text file in one call. The same task as (1-1), done with writelines(), looks like this.

matrix_quotes = ["The Matrix is the world that has been pulled over your eyes to blind you from the truth.\n",

"You have to let it all go, Neo. Fear, doubt, and disblief. Free your mind.\n",

"There is a difference between knowing the path and walking the path.\n",

"Welcom to the desert of the real!\n"]

with open('matrix_quotes_2.txt', 'w') as f:
    f.writelines(matrix_quotes)

Looking in the folder where the Jupyter notebook is running, the 'matrix_quotes_2.txt' file has been created.
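One detail is easy to miss: writelines() does not add line separators, so the strings must already end in '\n', as they do in the matrix_quotes list. The sketch below demonstrates this with a temporary file; the file name and the `lines` list are made up for the illustration.

```python
import os
import tempfile

lines = ['first', 'second', 'third']  # note: no trailing '\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    with open(path, 'w') as f:
        f.writelines(lines)                          # items are written back to back
    with open(path) as f:
        joined = f.read()

    with open(path, 'w') as f:
        f.writelines(line + '\n' for line in lines)  # add the separators yourself
    with open(path) as f:
        separated = f.read()

print(repr(joined), repr(separated))
```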

(2-1) Reading a text file one line at a time: readline() method

Now let's read the text one line at a time from the 'matrix_quotes.txt' file created in (1-1) and build a Python list named matrix_quotes_list.

readline() reads the text file one line at a time, up to and including a newline ('\n', '\r', '\r\n'), whereas readlines() reads all the lines in the file at once. So to read every line in the file with readline(), which only reads a single line, use it together with a while loop.

with open('matrix_quotes.txt', 'r') as text_file:
    matrix_quotes_list = []
    line = text_file.readline()
    while line != '':
        matrix_quotes_list.append(line)
        line = text_file.readline()

matrix_quotes_list

[Out]:

['The Matrix is the world that has been pulled over your eyes to blind you from the truth.\n',  'You have to let it all go, Neo. Fear, doubt, and disblief. Free your mind.\n',  'There is a difference between knowing the path and walking the path.\n',  'Welcom to the desert of the real!\n']

* Note: in with open('matrix_quotes.txt', 'r'), 'r' means read mode.

If you want to read a file one line at a time and write every 10th line to a new file, see the code below.

with open(in_name, 'r') as in_file:
    with open(out_name, 'w') as out_file:
        count = 0
        for line in in_file:
            if count % 10 == 0:
                out_file.write(line)
            count += 1

(2-2) Reading a whole text file at once: readlines() method

This time let's use readlines() to read all the lines in the 'matrix_quotes.txt' text file at once and build a Python list named matrix_quotes_list_2. Because readlines() reads everything in one call, the code is shorter than the readline() plus while loop version in (2-1).

with open('matrix_quotes.txt', 'r') as text_file:
    matrix_quotes_list_2 = text_file.readlines()

matrix_quotes_list_2

[Out]:

['The Matrix is the world that has been pulled over your eyes to blind you from the truth.\n',  'You have to let it all go, Neo. Fear, doubt, and disblief. Free your mind.\n',  'There is a difference between knowing the path and walking the path.\n',  'Welcom to the desert of the real!\n']
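A file object is also iterable, which gives a third way to read lines and makes it easy to strip the trailing '\n' on the way. The sketch below uses a temporary file with made-up content and a hypothetical file name; it also shows why `readline() != ''` works as a loop condition.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'quotes.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('Free your mind.\nThere is no spoon.\n')

    # Iterating over the file object yields one line at a time
    with open(path) as f:
        stripped = [line.rstrip('\n') for line in f]

    # readline() returns '' only at end of file, so it can drive a while loop
    with open(path) as f:
        count = 0
        while f.readline() != '':
            count += 1

print(stripped, count)
```

For large files, iterating line by line avoids holding the whole file in memory, which readlines() does.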


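[Python] Opening a File, Reading and Writing Data, and Closing the File

Python Analysis and Programming / Python Installation and Basic Usage, 2020. 3. 1. 22:49

In this post I will show how to use Python to

(1) open a file, write data, and close the file

(2) open a file, read data, and close the file

(3) open a file with with open(file_name) as file_object:, read data, and have the file closed automatically

(4) set the error handling option of open()

Files are opened in Python with the open() function, which has 1 required parameter and 7 optional parameters and returns a file object. The mode argument of open() is built from 7 characters: 'r' (read), 'w' (write, truncating the file first), 'x' (exclusive creation), 'a' (append), '+' (update, that is, read and write), 'b' (binary), and 't' (text). (Opening for reading, 'r', in text mode, 't', is the default.)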

open(
    file,              # required, path of the file
    mode = 'r',        # optional, 'rt' (read, text mode) is the default
    buffering = -1,    # optional, buffering policy (0: no buffering, binary mode only;
                       # 1: line buffering, text mode only)
    encoding = None,   # optional, text encoding (e.g. 'utf-8', 'cp949', 'latin')
    errors = None,     # optional, error handling in text mode (e.g. 'ignore' skips errors)
    newline = None,    # optional, newline handling (None, '\n', '\r', '\r\n')
    closefd = True,    # optional, if False the file descriptor stays open after the file is closed
    opener = None)     # optional, used to supply your own function for opening the file

(1) Open a file --> write data to the file --> close the file

(1-1) MyFile = open('myfile.txt', 'w') : use open() to open 'myfile.txt' in 'w' mode (open for writing; the file is created if it does not exist and truncated if it does) and assign it to the MyFile object,

(1-2) MyFile.write('data') : write 'data' to the MyFile object, and then

(1-3) MyFile.close() : close the file. To avoid resource leaks, always close the file at the end.

Note that on Korean-language Windows 10 the default encoding is 'cp949'.

# Open and Write

MyFile = open('myfile.txt', 'w')

MyFile.write('Open, read, write and close a file using Python')

MyFile.close()

# encoding = 'cp949' for Windows10 OS

MyFile

[Out] <_io.TextIOWrapper name='myfile.txt' mode='w' encoding='cp949'>

Let's check that the 'myfile.txt' file created above exists in the current working directory.

import os

os.listdir(os.getcwd())

[Out]:
['myfile.txt',
 'Python_read_write_file.ipynb']

(2) Open a file --> read data from the file --> close the file

(2-1) MyFile = open('myfile.txt', 'r') : this time open the 'myfile.txt' file written in (1) in 'r' mode (open for reading, the default) and assign it to the MyFile object,

(2-2) MyString = MyFile.read() : read 'myfile.txt', store the data in MyString, print it with print(MyString), and then

(2-3) MyFile.close() : close the file. To avoid resource leaks, always close the file at the end.

# Open and Read

MyFile = open('myfile.txt', 'r')

# equivalent to the above

#MyFile = open('myfile.txt', 'rt')

#MyFile = open('myfile.txt')

MyString = MyFile.read()

print(MyString)

[Out]: Open, read, write and close a file using Python

# close the 'MyFile' file

MyFile.close()

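Of the open() mode characters, 'r' (open for reading) and 't' (text mode) are the defaults, so open('myfile.txt', 'r') above is equivalent to open('myfile.txt', 'rt') and open('myfile.txt'). Passing 't' on its own, as in open('myfile.txt', 't'), does not work: it raises ValueError because the mode must contain exactly one of 'r', 'w', 'x', or 'a'.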
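The mode rules can be checked directly. This is a small sketch with a hypothetical file name in a temporary directory: it opens the same file with 'r', 'rt', and no mode at all, and then shows that 't' alone is rejected.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('hello')

# 'r', 'rt', and the default mode all open the file for reading in text mode
contents = []
for kwargs in ({'mode': 'r'}, {'mode': 'rt'}, {}):
    with open(path, **kwargs) as f:
        contents.append(f.read())
print(contents)

# 't' alone is rejected: exactly one of 'r', 'w', 'x', 'a' is required
try:
    open(path, 't')
    t_alone_error = None
except ValueError as e:
    t_alone_error = e
print(type(t_alone_error).__name__)
```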

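(3) Opening a file with with open(file_name) as file_object:

When a file is opened with the with open() as statement, it is closed automatically at the end of the block, even without an explicit call to close(). Rewriting the code from (2) with with open(file_name) as file_object: gives the following.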

# Using the with statement the file is automatically closed
with open('myfile.txt', 'r') as MyFile:
    MyString = MyFile.read()
    print(MyString)
    # no need for MyFile.close()


[Out]: Open, read, write and close a file using Python
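One way to confirm the automatic close is to inspect the closed attribute of the file object after the block. The sketch below uses a hypothetical file name and a simulated exception to show that the file is closed on a normal exit and also when the block is left through an error.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('data')

# Normal exit: the file is closed when the block ends
with open(path, 'r') as my_file:
    my_string = my_file.read()
closed_after_block = my_file.closed

# Exit through an exception: the file is still closed
try:
    with open(path, 'r') as my_file2:
        raise RuntimeError('simulated failure while processing the file')
except RuntimeError:
    pass
closed_after_error = my_file2.closed

print(closed_after_block, closed_after_error)
```

With plain open() and close(), the second case would need a try/finally block to guarantee the same cleanup.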

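(4) Handling encoding and decoding errors in text mode: errors

For example, to ignore encoding or decoding errors that occur while reading a file, pass errors='ignore', as in open('file.txt', 'r', errors='ignore'). (A mode such as 'rw' is not valid; use 'r+' to open a file for both reading and writing.)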

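errors value	How errors are handled
'strict'	Raise a ValueError exception on an encoding error
'ignore'	Ignore errors
'replace'	Insert a replacement marker (e.g. "?")
'surrogateescape'	Represent each invalid byte as a surrogate code point in the range U+DC80 to U+DCFF
'xmlcharrefreplace'	When writing a file, replace characters in the text that the specified encoding does not support with the XML character reference &#NNN;
'backslashreplace'	When writing a file, replace characters that the current encoding does not support with escape sequences that start with a backslash (\)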
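The effect of a few of these handlers can be seen by reading a file that contains a byte that is not valid UTF-8. This is a minimal sketch; the file name and the sample bytes are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'bad_bytes.txt')  # hypothetical file name

# 0xFF can never appear in valid UTF-8, so decoding it is an error
with open(path, 'wb') as f:
    f.write(b'abc\xffdef')

results = {}
for policy in ('ignore', 'replace', 'backslashreplace'):
    with open(path, 'r', encoding='utf-8', errors=policy) as f:
        results[policy] = f.read()

# 'strict' raises UnicodeDecodeError, which is a subclass of ValueError
try:
    with open(path, 'r', encoding='utf-8', errors='strict') as f:
        f.read()
    strict_error = None
except UnicodeDecodeError as e:
    strict_error = e

for policy, text in results.items():
    print(policy, repr(text))
print(type(strict_error).__name__)
```

Note that 'ignore' silently drops data, so it is best kept for cases where losing a few characters is acceptable.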

The next post shows how to read and write text files line by line.

I hope this was helpful.


Posted by Rfriend


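[R dplyr] case_when(): a vectorized way to handle multiple if / else if conditions

R Analysis and Programming / R Data Preprocessing 2020. 2. 23. 20:01

This post uses the case_when() function of the R dplyr package to split a continuous variable into several categories and create a categorical variable. With case_when() you can handle many conditions quickly and in a vectorized way, without chaining if and else if statements. It is similar to the CASE WHEN clause in SQL.

As a simple example, let's classify the positive integers 1 to 10 into four categories: "2 or less", "3~5", "6~8", and "9 or more".

(The dplyr:: prefix in dplyr::case_when() can be omitted; it only makes explicit that the function comes from the dplyr package.)

The general form is:

case_when(
  condition ~ value,
  TRUE ~ value)

The example below uses four conditions. Instead of writing if, else if, else if, else statements, the conditions are listed directly inside the parentheses of case_when(), and the last line, TRUE ~ "9~", covers everything that matches none of the earlier conditions (the else case).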

library(dplyr)

x <- 1:10

x

[1] 1 2 3 4 5 6 7 8 9 10

dplyr::case_when(

x <= 2 ~ "~2",

x <= 5 ~ "3~5",

x <= 8 ~ "6~8",

TRUE ~ "9~"

)

[1] "~2" "~2" "3~5" "3~5" "3~5" "6~8" "6~8" "6~8" "9~" "9~"

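The order of the conditions matters. When several conditions are listed, they are evaluated in order, and each observation takes the value of the first condition it satisfies. If TRUE ~ "9~" is placed first in case_when(), every value from 1 to 10 is therefore assigned "9~". Always write the conditions with the evaluation order in mind.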

# order matters!!!

case_when(

TRUE ~ "9~",

x <= 2 ~ "~2",

x <= 5 ~ "3~5",

x <= 8 ~ "6~8",

)

[1] "9~" "9~" "9~" "9~" "9~" "9~" "9~" "9~" "9~" "9~"
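For readers who work in Python, the same first-match-wins behavior can be sketched with numpy.select(). This is only an analogy and not part of the original R example; the variable names are made up for illustration.

```python
import numpy as np

x = np.arange(1, 11)

conditions = [x <= 2, x <= 5, x <= 8]
labels = ['~2', '3~5', '6~8']

# The first condition that is True wins, as in case_when()
in_order = np.select(conditions, labels, default='9~')

# Reversing the order lets the broad condition (x <= 8) shadow the narrower ones
reversed_order = np.select(conditions[::-1], labels[::-1], default='9~')

print(in_order)
print(reversed_order)
```

In the reversed call every value up to 8 falls into "6~8", which mirrors what happens when a broad condition is listed first in case_when().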

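All right-hand sides of the case_when() conditions must have the same data type; if they differ, an error is raised. In the example below the right-hand sides return character values, so including NA, which is logical, causes an error. Use NA_character_ so that the NA is returned as a character value. (dplyr 1.1.0 and later relax this rule and accept a plain NA.)

When the right-hand side returns character values, use NA_character_ for NA

Incorrect usage (right-hand side types differ)

Correct usage (right-hand side types match)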

# error as NA is logical not character

case_when(

x <= 2 ~ "~2",

x <= 5 ~ "3~5",

x <= 8 ~ "6~8",

TRUE ~ NA

)

Error: must be a character vector, not a logical vector

Call `rlang::last_error()` to see a backtrace

# use NA_character_

case_when(

x <= 2 ~ "~2",

x <= 5 ~ "3~5",

x <= 8 ~ "6~8",

TRUE ~ NA_character_

)

[1] "~2" "~2" "3~5" "3~5" "3~5" "6~8" "6~8" "6~8" NA NA

When the right-hand side returns numeric values, use NA_real_ for NA

Incorrect usage (right-hand side types differ)

Correct usage (right-hand side types match)

# error as NA is logical not numeric

case_when(

x <= 2 ~ 2,

x <= 5 ~ 5,

x <= 8 ~ 8,

TRUE ~ NA

)

Error: must be a double vector, not a logical vector

Call `rlang::last_error()` to see a backtrace

# use NA_real_

case_when(

x <= 2 ~ 2,

x <= 5 ~ 5,

x <= 8 ~ 8,

TRUE ~ NA_real_

)

[1] 2 2 5 5 5 8 8 8 NA NA

Combined with mutate(), case_when() is a powerful and convenient way to create a new variable from several conditions. The example below uses two variables from the mtcars dataset, cyl (number of cylinders) and hp (horsepower): the first condition uses "or" to label cars "big", the second uses "and" to label cars "medium", and all remaining cars are labeled "small".

mtcars$name <- row.names(mtcars)

mtcars %>%
  select(name, mpg, cyl, hp) %>%
  mutate(
    type = case_when(
      cyl >= 8 | hp >= 180 ~ "big", # or
      cyl >= 4 & hp >= 120 ~ "medium", # and
      TRUE ~ "small"
    )
  )

name mpg cyl hp type

1 Mazda RX4 21.0 6 110 small

2 Mazda RX4 Wag 21.0 6 110 small

3 Datsun 710 22.8 4 93 small

4 Hornet 4 Drive 21.4 6 110 small

5 Hornet Sportabout 18.7 8 175 big

6 Valiant 18.1 6 105 small

7 Duster 360 14.3 8 245 big

8 Merc 240D 24.4 4 62 small

9 Merc 230 22.8 4 95 small

10 Merc 280 19.2 6 123 medium

---- (remaining rows omitted) ----

To do in PostgreSQL or Greenplum DB what was done above with case_when(), use the SQL CASE WHEN expression as shown below.

-- PostgreSQL CASE WHEN
-- string literals use single quotes; double quotes denote identifiers
SELECT
    name,
    mpg,
    cyl,
    hp,
    CASE
        WHEN (cyl >= 8) OR (hp >= 180) THEN 'big'
        WHEN (cyl >= 4) AND (hp >= 120) THEN 'medium'
        ELSE 'small'
    END AS type
FROM mtcars;

I hope this was helpful.


Posted by Rfriend


[Python pandas] Converting continuous to categorical: np.digitize() vs. pd.cut() (comparison of categorization using np.digitize(), pd.cut())

Python Analysis and Programming / Python Data Preprocessing 2020. 2. 18. 17:19

This post compares two ways to convert a continuous variable into a categorical variable by dividing it into several bins (categorization of a continuous variable by multiple bins).

(1) Binning a continuous variable with np.digitize(X, bins)

(2) Binning a continuous variable with pd.cut(X, bins, labels)

np.digitize(X, bins) and pd.cut(X, bins, labels) look similar, but they differ in which bin edge is included, how out-of-range values are treated, how labels are assigned, and what is returned. Check each function's syntax carefully before using it.

[ np.digitize() vs. pd.cut() ]

Item	np.digitize(X, bins)	pd.cut(X, bins, labels)
Interval for bins=[start, end]	[included, excluded)	(excluded, included]
Values below the first bin edge	[-inf, start) --> digitized as 0	NaN (the first edge itself is also NaN unless include_lowest=True)
Values beyond the last bin edge	[end, inf) --> digitized as len(bins), following the bin order	NaN for values greater than the last edge
Label	integers 0, 1, 2, ... assigned automatically	user-definable (labels option)
Return	numpy array	categorical values with labels

(1) Binning a continuous variable with np.digitize(X, bins)

First, create a simple pandas DataFrame to use as an example.

import pandas as pd

import numpy as np

df = pd.DataFrame({'col': np.arange(10)})

df

	col
0	0
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9

Now use np.digitize(X, bins=[0, 5, 8]) to assign the positive integers {1, 2, 3}, in order, to the bins {[0, 5), [5, 8), [8, inf)} and store them in a new column of the df DataFrame named 'grp_digitize'.

Note that the parentheses '(' and ')' mean the endpoint is not included, and the brackets '[' and ']' mean it is included.

bins=[0, 5, 8]

# returns numpy array

np.digitize(df['col'], bins)

[Out]: array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3])

df['grp_digitize'] = np.digitize(df['col'], bins)

df

[Out]:

	col	grp_digitize
0	0	1
1	1	1
2	2	1
3	3	1
4	4	1
5	5	2
6	6	2
7	7	2
8	8	3
9	9	3
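The example data never falls below the first bin edge, so the following sketch adds a negative value to show the edge cases, and also shows the right=True option of np.digitize(), which closes the intervals on the right instead. The input values are made up for illustration.

```python
import numpy as np

bins = [0, 5, 8]
x = np.array([-1, 0, 4, 5, 8, 9])

# Default right=False: bins[i-1] <= x < bins[i]
left_closed = np.digitize(x, bins)

# right=True: bins[i-1] < x <= bins[i]
right_closed = np.digitize(x, bins, right=True)

# Values below bins[0] map to 0; values beyond the last edge map to len(bins)
print(left_closed)
print(right_closed)
```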

(2) Binning a continuous variable with pd.cut(X, bins, labels)

This time use pd.cut(X, bins=[0, 5, 8]) to bin the values into the two intervals {(0, 5], (5, 8]}. It returns a categorical Series that tells which bin each element of array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) falls into.

import pandas as pd

import numpy as np

df = pd.DataFrame({'col': np.arange(10)})

# pd.cut intervals: (excluded, included]

bins=[0, 5, 8]

# returns a list of categories with labels

pd.cut(df["col"], bins=bins)

[Out]:
0           NaN
1    (0.0, 5.0]
2    (0.0, 5.0]
3    (0.0, 5.0]
4    (0.0, 5.0]
5    (0.0, 5.0]
6    (5.0, 8.0]
7    (5.0, 8.0]
8    (5.0, 8.0]
9           NaN
Name: col, dtype: category
Categories (2, interval[int64]): [(0, 5] < (5, 8]]

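np.digitize() in (1) uses [included, excluded) intervals, whereas pd.cut() uses (excluded, included] intervals by default, which is the opposite.

np.digitize() in (1) assigns an integer to every value: 1 to values in the first bin, then integers in bin order, including values at or above the last bin edge (values below the first bin edge would get 0). pd.cut(), in contrast, returns NaN for values that fall outside the bins and assigns the labels you pass, such as labels=['a', 'b'], to values inside the bins.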

df['grp_cut'] = pd.cut(df["col"], bins=bins, labels=['a', 'b'])

df

[Out]:

	col	grp_digitize	grp_cut
0	0	1	NaN
1	1	1	a
2	2	1	a
3	3	1	a
4	4	1	a
5	5	2	a
6	6	2	b
7	7	2	b
8	8	3	b
9	9	3	NaN
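If you want pd.cut() to reproduce the np.digitize() grouping, one option is to close the intervals on the left with right=False and append an open-ended last edge. The sketch below assumes np.inf is an acceptable upper edge for your data; it also shows include_lowest=True, which keeps the right-closed intervals but admits the first edge.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.arange(10)})
bins = [0, 5, 8]

# Append +inf as an open-ended last edge and close intervals on the left,
# giving [0, 5), [5, 8), [8, inf) like np.digitize(df['col'], bins)
cut_like_digitize = pd.cut(df['col'],
                           bins=bins + [np.inf],
                           right=False,
                           labels=[1, 2, 3])

# include_lowest=True keeps right-closed intervals but also admits the first edge (0)
cut_lowest = pd.cut(df['col'], bins=bins, include_lowest=True, labels=['a', 'b'])

print(cut_like_digitize.tolist())
print(cut_lowest.tolist())
```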

Having converted the continuous variable into a categorical one, let's aggregate the 'col' variable with groupby('grp_cut') to get the sum by group.

df.groupby('grp_cut')['col'].sum()

[Out]:

grp_cut
a    15
b    21
Name: col, dtype: int64

To compute several statistics at once for each 'grp_cut' group ('a', 'b'), such as the sum, count, mean, and variance, define a user-defined function and then apply it with apply(), as in df.groupby('grp_cut')['col'].apply(my_summary). unstack() is used to pivot the long result into a wide table so that the statistics for each group are easy to read.

# UDF of summary statistics
def my_summary(x):
    result = {
        'sum': x.sum(),
        'count': x.count(),
        'mean': x.mean(),
        'variance': x.var()
    }
    return result

df.groupby('grp_cut')['col'].apply(my_summary).unstack()

[Out]:

	sum	count	mean	variance
grp_cut
a	15.0	5.0	3.0	2.5
b	21.0	3.0	7.0	1.0
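When the statistics are all built-in aggregations, the same table can probably be produced without a user-defined function by passing a list of names to agg(). This is an alternative sketch rather than the method used above; note that the variance column is named 'var' here, and observed=True is passed to keep only the categories that actually occur.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.arange(10)})
df['grp_cut'] = pd.cut(df['col'], bins=[0, 5, 8], labels=['a', 'b'])

# observed=True restricts the result to categories that actually occur
summary = (df.groupby('grp_cut', observed=True)['col']
             .agg(['sum', 'count', 'mean', 'var']))
print(summary)
```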

I hope this was helpful.


Posted by Rfriend


[Python] Splitting train and test sets by stratified random sampling (Train, Test set Split by Stratified Random Sampling in Python)

Python Analysis and Programming / Python Data Preprocessing 2020. 2. 15. 23:29

The previous post showed how to split a dataset into a train set and a test set by random sampling (Train set, Test set split by Random Sampling).

This post shows how to split the train and test sets at random within each stratum, so that the proportion of each stratum in the dataset is preserved (Train, Test set Split by Stratified Random Sampling).

(1) Train/test split with sklearn.model_selection.train_test_split

(returns stratified X_train, X_test, y_train, y_test)

(2) Train/test split with sklearn.model_selection.StratifiedShuffleSplit

(returns stratified train/test indices --> index the train and test sets)

For an introduction to simple random sampling, systematic sampling, stratified random sampling, cluster sampling, and multi-stage sampling, see https://rfriend.tistory.com/58.

(1) Train/test split with sklearn.model_selection.train_test_split

(returns stratified X_train, X_test, y_train, y_test)

First, as a simple example, use numpy to create an X array with 15 rows and 2 columns and a y array with 15 elements. Treat the first 5 observations as belonging to group (stratum) '0' and observations 6 through 15 as belonging to group '1', and store this information in a list named 'grp'.

import numpy as np

X = np.arange(30).reshape(15, 2)

X

[Out]:
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19],
       [20, 21],
       [22, 23],
       [24, 25],
       [26, 27],
       [28, 29]])

y = np.arange(15)

y

[Out]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

# stratum (group)

grp = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

grp

[Out]:

[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

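Now import train_test_split from the scikit-learn model_selection module and split the data into X_train, X_test, y_train, and y_test.

- X and y are kept as separate arrays and passed as the first and second arguments.

- test_size is the proportion of the test set, and stratify is the variable that identifies the strata. The test_size proportion is applied within each stratum (group).

- shuffle=True (the default) samples at random. shuffle=False splits the data sequentially without shuffling, but it cannot be combined with stratify: a stratified split requires shuffle=True, and scikit-learn raises an error otherwise.

- random_state seeds the random number generator for reproducibility; any integer will do.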

# returns X_train, X_test, y_train, y_test dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,

y,

test_size=0.2,

shuffle=True,

stratify=grp,

random_state=1004)

print('X_train shape:', X_train.shape)

print('X_test shape:', X_test.shape)

print('y_train shape:', y_train.shape)

print('y_test shape:', y_test.shape)

[Out]:

X_train shape: (12, 2)
X_test shape: (3, 2)
y_train shape: (12,)
y_test shape: (3,)

The resulting X_train, y_train, X_test, and y_test are shown below.

X_train

[Out]:

array([[12, 13],
       [ 8,  9],
       [28, 29],
       [ 0,  1],
       [10, 11],
       [ 6,  7],
       [ 2,  3],
       [18, 19],
       [20, 21],
       [22, 23],
       [26, 27],
       [14, 15]])

y_train

[Out]:

array([ 6,  4, 14,  0,  5,  3,  1,  9, 10, 11, 13,  7])

X_test

[Out]:

array([[16, 17],
       [ 4,  5],
       [24, 25]])

y_test

[Out]: array([ 8,  2, 12])
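To make the idea behind stratify concrete, the sketch below hand-rolls a stratified split with numpy only: it draws the test rows separately inside each stratum and uses the rest for training. It is an illustration under simplifying assumptions (the per-stratum count is simply rounded), not how scikit-learn implements it, and the function name stratified_split_indices is hypothetical. The selected rows will differ from the output above.

```python
import numpy as np

def stratified_split_indices(groups, test_size=0.2, seed=1004):
    """Return (train_idx, test_idx) with test_size of every stratum in the test set."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    test_parts = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)        # row positions of this stratum
        n_test = int(round(len(members) * test_size))
        test_parts.append(rng.choice(members, size=n_test, replace=False))
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(groups)), test_idx)
    return train_idx, test_idx

X = np.arange(30).reshape(15, 2)
y = np.arange(15)
grp = np.array([0] * 5 + [1] * 10)

train_idx, test_idx = stratified_split_indices(grp, test_size=0.2)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print('test strata :', grp[test_idx])
print('train strata:', grp[train_idx])
```

With 5 rows in stratum 0 and 10 rows in stratum 1, a test_size of 0.2 takes one row from the first and two from the second, which keeps the 1:2 ratio in both sets.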

(2) Train/test split with sklearn.model_selection.StratifiedShuffleSplit

(returns stratified train/test indices --> index the train and test sets)

(2-1) numpy array example

The train_test_split() function above takes X and y as input and returns datasets that are already split into X_train, X_test, y_train, and y_test while preserving the stratum proportions. StratifiedShuffleSplit, introduced here, instead returns the indices of a stratified random train/test split, so an extra step is needed to index the train and test sets with those indices. That is a little less convenient than (1), isn't it? (In return, it is more convenient than (1) when you need several stratified random splits for cross-validation, which you request with n_splits.)

Only one train/test split is needed here, so set n_splits=1. Set test_size to the proportion of the test set and random_state to any integer seed for reproducibility.

Because train_idx and test_idx are returned, a for loop is used to build X_train, X_test, y_train, and y_test by indexing X and y.

# Stratified ShuffleSplit cross-validator
# provides train/test indices to split data in train/test sets.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1004)

for train_idx, test_idx in split.split(X, grp):
    X_train = X[train_idx]
    X_test = X[test_idx]
    y_train = y[train_idx]
    y_test = y[test_idx]

Checking X_train, y_train, X_test, and y_test gives the results below. Because random_state=1004 is the same as in (1), the stratified random split yields the same train and test sets as train_test_split() did in (1).

X_train

[Out]:
array([[12, 13],
       [ 8,  9],
       [28, 29],
       [ 0,  1],
       [10, 11],
       [ 6,  7],
       [ 2,  3],
       [18, 19],
       [20, 21],
       [22, 23],
       [26, 27],
       [14, 15]])

y_train

[Out]: array([ 6,  4, 14,  0,  5,  3,  1,  9, 10, 11, 13,  7])

X_test

[Out]:

array([[16, 17],
       [ 4,  5],
       [24, 25]])

y_test

[Out]: array([ 8,  2, 12])

(2-2) pandas DataFrame example

(2-1) used numpy arrays. This time, let's apply StratifiedShuffleSplit to a pandas DataFrame to split train and test sets by stratified random sampling.

First, build a DataFrame with the columns X1, X2, y, and grp that holds the same values as the dataset used above.

import pandas as pd

import numpy as np

X = np.arange(30).reshape(15, 2)

y = np.arange(15)

df = pd.DataFrame(np.column_stack((X, y)), columns=['X1','X2', 'y'])

df

	X1	X2	y
0	0	1	0
1	2	3	1
2	4	5	2
3	6	7	3
4	8	9	4
5	10	11	5
6	12	13	6
7	14	15	7
8	16	17	8
9	18	19	9
10	20	21	10
11	22	23	11
12	24	25	12
13	26	27	13
14	28	29	14

df['grp'] = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

df

	X1	X2	y	grp
0	0	1	0	0
1	2	3	1	0
2	4	5	2	0
3	6	7	3	0
4	8	9	4	0
5	10	11	5	1
6	12	13	6	1
7	14	15	7	1
8	16	17	8	1
9	18	19	9	1
10	20	21	10	1
11	22	23	11	1
12	24	25	12	1
13	26	27	13	1
14	28	29	14	1

Now use StratifiedShuffleSplit to create train and test DataFrames at random while preserving the stratum proportions.

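from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1004)

# split() yields positional indices; .loc works here because df has the
# default 0..n-1 index (use .iloc if the index is anything else)
for train_idx, test_idx in split.split(df, df["grp"]):
    df_strat_train = df.loc[train_idx]
    df_strat_test = df.loc[test_idx]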

The DataFrames produced by stratified random sampling, which preserves the class proportions of the strata, are shown below.

df_strat_train

	X1	X2	y	grp
6	12	13	6	1
4	8	9	4	0
14	28	29	14	1
0	0	1	0	0
5	10	11	5	1
3	6	7	3	0
1	2	3	1	0
9	18	19	9	1
10	20	21	10	1
11	22	23	11	1
13	26	27	13	1
7	14	15	7	1

df_strat_test

	X1	X2	y	grp
8	16	17	8	1
2	4	5	2	0
12	24	25	12	1

Let's verify that the proportion of each class (percentage of samples for each class) is really preserved in the train set and the test set.

df["grp"].value_counts() / len(df)

[Out]:
1    0.666667
0    0.333333
Name: grp, dtype: float64

df_strat_train["grp"].value_counts() / len(df_strat_train)

[Out]:

1    0.666667
0    0.333333
Name: grp, dtype: float64

df_strat_test["grp"].value_counts() / len(df_strat_test)

[Out]:

1    0.666667
0    0.333333
Name: grp, dtype: float64
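For a DataFrame, a stratified split can also be sketched with pandas alone by sampling a fixed fraction inside each group. This is an alternative illustration, not the scikit-learn approach used above; it assumes pandas 1.1 or later (for groupby(...).sample()) and will not select the same rows as StratifiedShuffleSplit.

```python
import numpy as np
import pandas as pd

X = np.arange(30).reshape(15, 2)
y = np.arange(15)
df = pd.DataFrame(np.column_stack((X, y)), columns=['X1', 'X2', 'y'])
df['grp'] = [0] * 5 + [1] * 10

# Draw 20% of the rows from every stratum for the test set
df_test = df.groupby('grp').sample(frac=0.2, random_state=1004)

# Everything that was not drawn becomes the train set
df_train = df.drop(df_test.index)

print(df_test['grp'].value_counts(normalize=True).sort_index())
print(df_train['grp'].value_counts(normalize=True).sort_index())
```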

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.

I hope this was helpful.


Posted by Rfriend


[Python numpy] Splitting train and test datasets (split train and test set)

Python Analysis and Programming / Python Data Preprocessing 2020. 2. 11. 21:48

In machine learning, the data is first divided into a train set used to fit the model and a test set used to evaluate the performance of the fitted model.

This post shows how to split a dataset in 2-D matrix form into a train set and a test set by random sampling.

(1) Splitting train and test sets with the train_test_split function of the scikit-learn model_selection module

(2) Splitting train and test sets with the permutation() function of the numpy random module

(3) Splitting train and test sets with the choice() function of the numpy random module

(4) Splitting train and test sets with the shuffle() function of the numpy random module

(1) Splitting train and test sets with train_test_split from sklearn.model_selection

(split train and test set using sklearn.model_selection train_test_split())

The most convenient, and therefore (probably) the most widely used, approach is the train_test_split() function of the scikit-learn model_selection module. It provides a shuffle option to choose whether to sample at random and a stratify option for stratified sampling, so a few arguments are enough to get a clean split.

As an example, create a simple 2-D numpy array X and a 1-D numpy array y.

import numpy as np

X = np.arange(20).reshape(10, 2)

X

[Out]:

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])

y = np.arange(10)

y

[Out]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(1-1) Splitting train and test sets sequentially

Now use train_test_split() from sklearn.model_selection to split the data into a 60% train set and a 40% test set sequentially, without shuffling (shuffle=False). Use this when the order must be preserved, as with time series data. The default for the shuffle option is True, so for a sequential split rather than a random one you must set shuffle=False explicitly.

from sklearn.model_selection import train_test_split

# shuffle = False

X_train, X_test, y_train, y_test = train_test_split(X,

y,

test_size=0.4,

shuffle=False,

random_state=1004)

print('X_train shape:', X_train.shape)

print('X_test shape:', X_test.shape)

print('y_train shape:', y_train.shape)

print('y_test shape:', y_test.shape)

[Out]:

X_train shape: (6, 2)
X_test shape: (4, 2)
y_train shape: (6,)
y_test shape: (4,)

X_train

[Out]:

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

y_train

[Out]: array([0, 1, 2, 3, 4, 5])

(1-2) Splitting train and test sets by random sampling

This time split the data into a 60% train set and a 40% test set by random sampling (shuffle=True). random_state seeds the random number generator for reproducibility; any integer will do. shuffle=True is the default and can be omitted.

# shuffle = True

X_train, X_test, y_train, y_test = train_test_split(X,

y,

test_size=0.4,

shuffle=True,

random_state=1004)

X_train

array([[ 2,  3],
       [ 8,  9],
       [ 6,  7],
       [14, 15],
       [10, 11],
       [ 4,  5]])

y_train

array([1, 4, 3, 7, 5, 2])

(2) Splitting train and test sets with the permutation function of the numpy random module

This time, let's write a user-defined function that splits train and test sets using only numpy. The idea is simple: np.random.permutation() shuffles the integers 0 to X.shape[0] - 1 (one per observation in X), --> the first train_num shuffled rows are sliced off as the train set, and the remaining test_num rows as the test set.

np.random.seed(seed_number) sets the random seed for reproducibility.

# UDF of split train, test set using np.random.permutation()
# X: 2D array, y:1D array
def permutation_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    import numpy as np

    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state)
        shuffled = np.random.permutation(X.shape[0])
        X = X[shuffled,:]
        y = y[shuffled]
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

# create 2D X and 1D y array

X = np.arange(20).reshape(10, 2)

y = np.arange(10)

# split train and test set using by random sampling

X_train, X_test, y_train, y_test = permutation_train_test_split(X,

y,

test_size=0.4,

shuffle=True,

random_state=1004)

X_train

[Out]:

array([[ 0,  1],
       [12, 13],
       [16, 17],
       [18, 19],
       [ 2,  3],
       [ 8,  9]])

y_train

[Out]: array([0, 6, 8, 9, 1, 4])
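np.random.seed() changes numpy's global random state, which can affect other code in the same session. As a hedged variation, the sketch below uses a local generator created with np.random.default_rng() instead; the function name rng_train_test_split is hypothetical, and the rows it selects will differ from the output above because the generator is different.

```python
import numpy as np

def rng_train_test_split(X, y, test_size=0.2, seed=1004):
    """Shuffle row positions with a local Generator, then slice train and test."""
    rng = np.random.default_rng(seed)      # local generator; global state is untouched
    order = rng.permutation(X.shape[0])
    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num
    train_idx, test_idx = order[:train_num], order[train_num:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = rng_train_test_split(X, y, test_size=0.4)

print(X_train.shape, X_test.shape)
# Each row of X is [2*y, 2*y + 1], so alignment between X and y is easy to verify
print(np.array_equal(X_train[:, 0], 2 * y_train))
```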

(3) Splitting train and test sets with the choice function of the numpy random module

(3-1) numpy.random.choice(int_A, int_B, replace=False) draws int_B integers at random without replacement (replace=False) from the int_A integers 0 to int_A - 1. These are used as the train set indices, and np.setdiff1d() returns the remaining indices, which are used as the test set indices.

def choice_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state)
        train_idx = np.random.choice(X.shape[0], train_num, replace=False)

        #-- way 1: using np.setdiff1d()
        test_idx = np.setdiff1d(range(X.shape[0]), train_idx)

        # train rows come from train_idx, test rows from test_idx
        X_train = X[train_idx, :]
        X_test = X[test_idx, :]
        y_train = y[train_idx]
        y_test = y[test_idx]
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

(3-2) The version below draws the train set indices at random with np.random.choice() exactly as in (3-1). It differs from (3-1) only in how the test set indices are built: a list comprehension with a for loop and an if ... not in condition.

def choice_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state)
        train_idx = np.random.choice(X.shape[0], train_num, replace=False)

        #-- way 2: using list comprehension with for loop
        test_idx = [idx for idx in range(X.shape[0]) if idx not in train_idx]

        # train rows come from train_idx, test rows from test_idx
        X_train = X[train_idx, :]
        X_test = X[test_idx, :]
        y_train = y[train_idx]
        y_test = y[test_idx]
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test
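A third way to obtain the complement of the sampled indices, shown here only as a sketch, is a boolean mask. The function name mask_train_test_split is hypothetical. One difference to keep in mind is that mask indexing returns rows in their original order rather than in the sampled order.

```python
import numpy as np

def mask_train_test_split(X, y, test_size=0.2, random_state=1004):
    """Pick train rows with np.random.choice and take the complement via a boolean mask."""
    n = X.shape[0]
    test_num = int(n * test_size)
    train_num = n - test_num

    np.random.seed(random_state)
    train_idx = np.random.choice(n, train_num, replace=False)

    is_train = np.zeros(n, dtype=bool)
    is_train[train_idx] = True             # mark the sampled rows

    # Mask indexing keeps the original row order within each set
    return X[is_train], X[~is_train], y[is_train], y[~is_train]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = mask_train_test_split(X, y, test_size=0.4)

print(X_train.shape, X_test.shape)
print(y_train, y_test)
```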

(4) Splitting train and test sets with the numpy random shuffle() function

(split train, test set using np.random.shuffle() function)

np.random.shuffle() shuffles an array in place along its first axis; it returns None rather than a new array. To make sure X and y end up shuffled in the same order, (4-1) first join X and y side by side with np.column_stack((X, y)), --> (4-2) shuffle the rows of the combined Xy array with np.random.shuffle(Xy), --> (4-3) slice the first train_num rows and all columns except the last to get X_train, and (4-4) slice the same first train_num rows but take only the last column, Xy[:train_num, -1], to get y_train, since y was attached as the rightmost column.

# UDF of split train, test set using np.random.shuffle()
# X: 2D array, y:1D array
def shuffle_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    import numpy as np

    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state) # for reproducibility
        Xy = np.column_stack((X, y)) # concatenate first
        np.random.shuffle(Xy) # random shuffling second (in place)
        X_train = Xy[:train_num, :-1] # first train_num rows, X columns
        X_test = Xy[train_num:, :-1] # remaining rows, X columns
        y_train = Xy[:train_num, -1] # first train_num rows, y column
        y_test = Xy[train_num:, -1] # remaining rows, y column
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

# shuffle = True

X = np.arange(20).reshape(10, 2)

y = np.arange(10)

X_train, X_test, y_train, y_test = shuffle_train_test_split(X,

y,

test_size=0.4,

shuffle=True)

X_train

[Out]:

array([[ 0,  1],
       [12, 13],
       [16, 17],
       [18, 19],
       [ 2,  3],
       [ 8,  9]])

y_train

[Out]: array([0, 6, 8, 9, 1, 4])
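Two details of this approach are easy to miss: np.random.shuffle() works in place and returns None, and np.column_stack() produces a single array with one common dtype. The sketch below uses made-up float features to show that integer labels come back as floats after stacking and may need to be cast back.

```python
import numpy as np

X = np.arange(20).reshape(10, 2) / 10.0    # float features
y = np.arange(10)                           # integer labels

Xy = np.column_stack((X, y))                # result has one common dtype (float64 here)

np.random.seed(1004)
returned = np.random.shuffle(Xy)            # shuffles rows in place and returns None

X_shuffled = Xy[:, :-1]
y_shuffled = Xy[:, -1].astype(y.dtype)      # restore the original integer dtype

print(returned, Xy.dtype, y_shuffled.dtype)
# Rows stay intact: each shuffled X row still belongs to its y value
print(np.allclose(X_shuffled[:, 0], 2 * y_shuffled / 10.0))
```

If X and y have incompatible types, for example string labels, shuffling a shared index as in (2) is likely the safer choice.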

For splitting train and test sets by stratified random sampling (train and test set split by stratified random sampling in python), see https://rfriend.tistory.com/520.

For drawing a random sample from a 1-D array with np.random.choice() (Generate a random sample from a given 1-D array), see https://rfriend.tistory.com/548.

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.

I hope this was helpful.


Posted by Rfriend
