'공개 텍스트 데이터셋'에 해당되는 글 1건

  1. 2020.03.19 [Tensorflow] 딥러닝을 위한 공개 데이터셋 Tensorflow Datasets 3

딥러닝, 머신러닝을 공부하다 보면 예제로 활용할 수 있는 데이터를 마련하거나 찾기가 어려워서 곤란할 때가 있습니다. 특히 라벨링이 된 이미지, 영상, 음성 등의 데이터의 경우 자체적으로 마련하기가 쉽지 않습니다. 


이번 포스팅에서는 딥러닝, 기계학습을 하는데 활용할 수 있도록 공개된 데이터셋을 TensorFlow Datasets 에서 다운로드하고 fetch 하는 방법을 소개하겠습니다. 


TensorFlow 데이터셋은 아래의 두 곳에서 다운로드 할 수 있습니다. 

(Many many thanks to TensorFlow team!!! ^__^)

  • TensorFlow Datasets : https://www.tensorflow.org/datasets
  • TensorFlow Datasets on GitHub : https://github.com/tensorflow/datasets



  (1) TensorFlow 2.0 과 TensorFlow Datasets 라이브러리 설치


cmd 명령 프롬프트 창에서 pip install 로 TensorFlow 2.0 과 TensorFlow DataSets 라이브러리를 설치합니다. 


(* CPU 를 사용할 경우)


$ pip install --upgrade pip

$ pip install tensorflow

$ pip install tensorflow_datasets

 



(* GPU를 사용할 경우)


$ pip install tensorflow-gpu

 




  (2) tensorflow 와 tensorflow_datasets 라이브러리 import 후 Dataset 로딩하기


TensorFlow v2와 tensorflow_datasets 라이브르러를 import 하겠습니다. 



import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds

 



TensorFlow v2 부터는 PyTorch처럼 Eager 모드를 지원합니다. Eager모드와 Graph 모드를 활성화시키겠습니다. 



# tfds works in both Eager and Graph modes

tf.enable_v2_behavior()



TensorFlow Datasets에 등록된 모든 공개 데이터셋 리스트를 조회해보겠습니다. 양이 너무 많아서 중간은 생략했는데요, 아래에 카테고리별로 리스트를 다시 정리해보았습니다. 



# Tensorflow Datasets Lists

tfds.list_builders()

['abstract_reasoning', 'aeslc', 'aflw2k3d', 'amazon_us_reviews', 'arc', 'bair_robot_pushing_small', :

-- 너무 많아서 중간 생략 ^^; --

   :
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'xnli',
 'xsum',
 'yelp_polarity_reviews']





Audio, Image, Object Detection, Structured, Summarization, Text, Translate, Video 의 8개 범주로 데이터셋이 구분되어 정리되어 있습니다. (* link: https://www.tensorflow.org/datasets/catalog/overview)


 Audio

 Image

 groove

librispeech

libritts

ljspeech

nsynth

savee

speech_commands

abstract_reasoning

aflw2k3d

arc

beans

bigearthnet

binarized_mnist

binary_alpha_digits

caltech101

caltech_birds2010

caltech_birds2011

cars196

cassava

cats_vs_dogs

celeb_a

celeb_a_hq

cifar10

cifar100

cifar10_1

cifar10_corrupted

citrus_leaves

cityscapes

clevr

cmaterdb

coil100

colorectal_histology

colorectal_histology_large

curated_breast_imaging_ddsm

cycle_gan

deep_weeds

diabetic_retinopathy_detection

div2k

dmlab

downsampled_imagenet

dsprites

dtd

duke_ultrasound

emnist

eurosat

fashion_mnist

flic

food101

geirhos_conflict_stimuli

horses_or_humans

i_naturalist2017

image_label_folder

imagenet2012

imagenet2012_corrupted

imagenet_resized

imagenette

imagewang

kmnist

lfw

lost_and_found

lsun

malaria

mnist

mnist_corrupted

omniglot

oxford_flowers102

oxford_iiit_pet

patch_camelyon

pet_finder

places365_small

plant_leaves

plant_village

plantae_k

quickdraw_bitmap

resisc45

rock_paper_scissors

scene_parse150

shapes3d

smallnorb

so2sat

stanford_dogs

stanford_online_products

sun397

svhn_cropped

tf_flowers

the300w_lp

uc_merced

vgg_face2

visual_domain_decathlon 

 Object Detection

coco

kitti

open_images_v4

voc

wider_face 

 Structured

 amazon_us_reviews

forest_fires

german_credit_numeric

higgs

iris

rock_you

titanic

 Summarization

aeslc

big_patent

billsum

cnn_dailymail

gigaword

multi_news

newsroom

opinosis

reddit_tifu

scientific_papers

wikihow

xsum 

 Text

 blimp

c4

cfq

civil_comments

cos_e

definite_pronoun_resolution

eraser_multi_rc

esnli

gap

glue

imdb_reviews

librispeech_lm

lm1b

math_dataset

movie_rationales

multi_nli

multi_nli_mismatch

natural_questions

qa4mre

scan

scicite

snli

squad

super_glue

tiny_shakespeare

trivia_qa

wikipedia

xnli

yelp_polarity_reviews

 Translate

flores

para_crawl

ted_hrlr_translate

ted_multi_translate

wmt14_translate

wmt15_translate

wmt16_translate

wmt17_translate

wmt18_translate

wmt19_translate

wmt_t2t_translate

 Video

bair_robot_pushing_small

moving_mnist

robonet

starcraft_video

ucf101




  (3) CIFAR 100 데이터셋을 로컬 디스크에 다운로드 하고 로딩하기 

       (download & load CIFAR 100 dataset)


딥러닝을 활용한 이미지 분류 학습에 많이 사용되는 예제 데이터셋인 CIFAR 100 Dataset 을 로컬 디스크에 다운로드해보겠습니다. 


[ CIFAR 10 이미지 시각화 (예시) ]




cifar_builder = tfds.builder("cifar100")

cifar_builder.download_and_prepare()





CIFAR 100 데이터셋은 ./tensorflow_datasets/cifar100/3.0.0. 폴더 밑에 다운로드되어 있습니다. train, test 데이터셋 레코드와 label 데이터, dataset과 image에 대한 JSON 데이터셋이 로컬 디스크에 다운로드 되었습니다. 





  (4) 데이터셋의 속성 정보 조회하기 (Datasets Attributes)


cifiar_builder.info 로 데이터셋의 속성을 조회해보면 아래와 같습니다. 아래의 JSON 속성 정보를 참고해서 필요한 정보를 참조할 수 있습니다. 



print(cifar_builder.info)

[Out]:

tfds.core.DatasetInfo( name='cifar100', version=3.0.0, description='This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).', homepage='https://www.cs.toronto.edu/~kriz/cifar.html', features=FeaturesDict({ 'coarse_label': ClassLabel(shape=(), dtype=tf.int64, num_classes=20), 'image': Image(shape=(32, 32, 3), dtype=tf.uint8), 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=100), }), total_num_examples=60000, splits={ 'test': 10000, 'train': 50000, }, supervised_keys=('image', 'label'), citation="""@TECHREPORT{Krizhevsky09learningmultiple, author = {Alex Krizhevsky}, title = {Learning multiple layers of features from tiny images}, institution = {}, year = {2009} }""", redistribution_info=, )




CIFAR 100 데이터셋의 100개 라벨을 인쇄해보겠습니다. 위의 JSON 정보를 참고해서 features["label"].names 로 속성값을 조회할 수 있습니다. 



# label naems

print(cifar_builder.info.features["label"].names)


[Out]: 
['apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee', 'beetle', 'bicycle', 'bottle', 'bowl', 'boy', 'bridge', 'bus', 'butterfly', 'camel', 'can', 'castle', 'caterpillar', 'cattle', 'chair', 'chimpanzee', 'clock', 'cloud', 'cockroach', 'couch', 'crab', 'crocodile', 'cup', 'dinosaur', 'dolphin', 'elephant', 'flatfish', 'forest', 'fox', 'girl', 'hamster', 'house', 'kangaroo', 'keyboard', 'lamp', 'lawn_mower', 'leopard', 'lion', 'lizard', 'lobster', 'man', 'maple_tree', 'motorcycle', 'mountain', 'mouse', 'mushroom', 'oak_tree', 'orange', 'orchid', 'otter', 'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'plate', 'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'road', 'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk', 'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar', 'sunflower', 'sweet_pepper', 'table', 'tank', 'telephone', 'television', 'tiger', 'tractor', 'train', 'trout', 'tulip', 'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'woman', 'worm']



print(cifar_builder.info.features["coarse_label"].names)

[Out]:
['aquatic_mammals', 'fish', 'flowers', 'food_containers', 'fruit_and_vegetables', 'household_electrical_devices', 'household_furniture', 'insects', 'large_carnivores', 'large_man-made_outdoor_things', 'large_natural_outdoor_scenes', 'large_omnivores_and_herbivores', 'medium_mammals', 'non-insect_invertebrates', 'people', 'reptiles', 'small_mammals', 'trees', 'vehicles_1', 'vehicles_2']





  (5) Train, Validation Set 분할하여 불러오기



# Train/ Validation Datasets

train_cifar_dataset = cifar_builder.as_dataset(split=tfds.Split.TRAIN)

val_cifar_dataset = cifar_builder.as_dataset(split=tfds.Split.TEST)


print(train_cifar_dataset)

[Out]: 
<DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>


print(val_cifar_dataset)

[Out]: 
<DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>



# Number of classes: 100

num_classes = cifar_builder.info.features['label'].num_classes


# Number of images: train 50,000 .  validation 10,000

num_train_imgs = cifar_builder.info.splits['train'].num_examples

num_val_imgs = cifar_builder.info.splits['test'].num_examples


print("Training dataset instance:", train_cifar_dataset)

Training dataset instance: <DatasetV1Adapter shapes: {coarse_label: (), image: (32, 32, 3), label: ()}, types: {coarse_label: tf.int64, image: tf.uint8, label: tf.int64}>





  (6) 데이터셋 전처리 (크기 조정, 증식, 배치 샘플링, 검증 데이터셋 생성)


* code reference: Hands-on Computer Vision with TensorFlow 2 by Eliot Andres & Benjamin Planche

(https://www.amazon.com/Hands-Computer-Vision-TensorFlow-processing-ebook/dp/B07SMQGX48)



import math


input_shape = [224, 224, 3]

batch_size = 32

num_epochs = 30


train_cifar_dataset = train_cifar_dataset.repeat(num_epochs).shuffle(10000)

 

def _prepare_data_fn(features, input_shape, augment=False):

    """

    Resize image to expected dimensions, and opt. apply some random transformations.

    - param features: Data

    - param input_shape: Shape expected by the models (images will be resized accordingly)

    - param augment: Flag to apply some random augmentations to the images

    - return: Augmented Images, Labels

    """

    input_shape = tf.convert_to_tensor(input_shape)

    

    # Tensorflow-dataset returns batches as feature dictionaries, expected by Estimators. 

    # To train Keras models, it is mor straightforward to return the batch content as tuples. 

    image = features['image']

    label = features['label']

    

    # Convert the images to float type, also scaling their values from [0, 255] to [0., 1.]

    image = tf.image.convert_image_dtype(image, tf.float32)

    

    if augment:

        # Randomly applied horizontal flip

        image = tf.image.random_flip_left_right(image)

        

        # Random B/S changes

        image = tf.image.random_brightness(image, max_delta=0.1)

        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)

        image = tf.clip_by_value(image, 0.0, 1.0) # keeping pixel values in check

        

        # random resize and random crop back to expected size

        random_scale_factor = tf.random.uniform([1], minval=1., maxval=1.4, dtype=tf.float32)

        scaled_height = tf.cast(tf.cast(input_shape[0], tf.float32) * random_scale_factor, tf.int32)

        scaled_width = tf.cast(tf.cast(input_shape[1], tf.float32) * random_scale_factor, tf.int32)

        

        scaled_shape = tf.squeeze(tf.stack([scaled_height, scaled_width]))

        image = tf.image.resize(image, scaled_shape)

        image = tf.image.random_crop(image, input_shape)

    else:

        image = tf.image.resize(image, input_shape[:2])

        

    return image, label



import functools


prepare_data_fn_for_train = functools.partial(_prepare_data_fn, 

                                              input_shape=input_shape,

                                              augment=True)


train_cifar_dataset = train_cifar_dataset.map(prepare_data_fn_for_train, num_parallel_calls=4)



# batch the samples

train_cifar_dataset = train_cifar_dataset.batch(batch_size)

train_cifar_dataset = train_cifar_dataset.prefetch(1)



# validation dataset (not shuffling or augmenting it)

prepare_data_fn_for_val = functools.partial(_prepare_data_fn,

                                           input_shape=input_shape,

                                           augment=False)


val_cifar_dataset = (val_cifar_dataset

                     .repeat()

                    .map(prepare_data_fn_for_val, num_parallel_calls=4)

                    .batch(batch_size)

                    .prefetch(1))



train_steps_per_epoch = math.ceil(num_train_imgs / batch_size)

val_steps_per_epoch = math.ceil(num_val_imgs / batch_size)




많은 도움이 되었기를 바랍니다. 

이번 포스팅이 도움이 되었다면 아래의 '공감~'를 꾹 눌러주세요. :-)



728x90
반응형
Posted by Rfriend
,