
[Python] Parsing text at the word level and one-hot encoding it

Deep Learning (TF, Keras, PyTorch)/Natural Language Processing 2019. 5. 22. 00:32

The first step in text analysis is preprocessing documents and text into a form suitable for analysis.

In this post, I will (1) parse text data at the word level using Python's string methods and build an index for each word, and then (2) one-hot encode the text at the word level.

one-hot encoding of text at a word-level

1. Parsing text at the word level with Python string methods and building a token index for each word

The example text is the English introduction to Python taken from Wikipedia.

python_wikipedia.txt


# import modules
import numpy as np
import os

# set directory
base_dir = '/Users/ihongdon/Documents/Python/dataset'
file_name = 'python_wikipedia.txt'
path = os.path.join(base_dir, file_name)

# open file and print it as an example
# (the with statement closes the file automatically)
with open(path) as file_opened:
    for line in file_opened.readlines():
        print(line)

Python programming language, from wikipedia

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.[26]

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[27]

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3. Due to concern about the amount of code written for Python 2, support for Python 2.7 (the last release in the 2.x series) was extended to 2020. Language developer Guido van Rossum shouldered sole responsibility for the project until July 2018 but now shares his leadership as a member of a five-person steering council.[28][29][30]

Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source[31] reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.

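Below is an example user-defined function that parses words from text and preprocesses them using Python string methods. It is a basic example that applies, in order, lowercasing, stop-word removal, symbol removal, and number removal. (Of course, you could also use the well-defined functions in Python modules for text analysis. ^^;)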

# UDF of word preprocessing
def word_preprocess(word):
    # lower case
    word = word.lower()

    # remove stop-words
    # (checked before symbols are stripped, so a token like '(the' is not caught here)
    stop_words = ['a', 'an', 'the', 'in', 'with', 'to', 'for', 'from', 'of', 'at', 'on',
                  'until', 'by', 'and', 'but', 'is', 'are', 'was', 'were', 'it', 'that', 'this',
                  'my', 'his', 'her', 'our', 'as', 'not'] # make your own list
    if word in stop_words:
        word = ''

    # remove symbols such as comma, period, etc.
    symbols = [',', '.', ':', '-', '+', '/', '*', '&', '%', '[', ']', '(', ')'] # make your own list
    for symbol in symbols:
        word = word.replace(symbol, '')

    # remove numbers
    if word.isnumeric():
        word = ''

    return word
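Note that word_preprocess() above compares the word against the stop-word list before stripping symbols, so a token such as '(the' passes the stop-word check and ends up in the index as 'the'. A minimal sketch of one possible variant that strips punctuation first is shown below. The names preprocess_token, STOP_WORDS, and _PUNCT_TABLE are hypothetical, the stop-word set is shortened for illustration, and using string.punctuation (which also removes quotes and apostrophes) is an assumption of this sketch, not the method used in this post.

```python
import string

# Hypothetical, shortened stop-word set for illustration; extend it for real use.
STOP_WORDS = {'a', 'an', 'the', 'in', 'with', 'to', 'for', 'from', 'of', 'and', 'is', 'as'}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_token(word):
    """Lowercase, strip punctuation, then drop stop words and pure numbers."""
    word = word.lower().translate(_PUNCT_TABLE)
    if word in STOP_WORDS or word.isnumeric():
        return ''
    return word

print([preprocess_token(w) for w in '(the last release in the 2.x series)'.split()])
# prints ['', 'last', 'release', '', '', '2x', 'series']
```

With this variant the resulting token index would differ from the outputs shown in this post, because tokens such as 'the' and '"batteries' are handled differently.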

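Next, we open the python_wikipedia.txt file (open), read it line by line (readlines), strip leading and trailing whitespace (strip), split each line into words (split), preprocess each word with the word_preprocess() user-defined function defined above, and store the word as the key and its index as the value in the token_idx dictionary.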

# blank dictionary to store
token_idx = {}

# catching words and storing the index at token_idx dictionary
# (the with statement closes the file automatically)
with open(path) as file_opened:
    for line in file_opened.readlines():
        # strip leading and trailing whitespace
        line = line.strip()

        # split the line into words on whitespace
        for word in line.split():

            word = word_preprocess(word) # UDF defined above

            # put word into token_idx
            if word != '' and word not in token_idx:
                token_idx[word] = len(token_idx) + 1

The resulting token_idx dictionary, with words as keys and indices as values, is shown below.

token_idx

{'"batteries': 48,
'1980s': 56,
'2x': 87,
'abc': 58,
'about': 80,
'aims': 28,
'amount': 81,
'approach': 27,
'available': 104,
'backwardcompatible': 74,
'capable': 67,
'clear': 32,
'code': 18,

.... (middle omitted) ....

'successor': 57,
'support': 83,
'supports': 40,
'system': 66,
'systems': 107,
'the': 84,
'typed': 38,
'unmodified': 78,
'use': 22,
'van': 10,
'whitespace': 24,
'wikipedia': 4,
'write': 31,
'written': 82}

token_idx.values()

dict_values([104, 96, 102, 112, 68, 111, 21, 18, 8, 15, 20, 47, 37, 16, 74, 89, 57, 117, 19, 93, 83, 76, 91, 43, 30, 32, 54, 33, 35, 98, 64, 80, 17, 34, 10, 61, 50, 46, 49, 23, 72, 67, 119, 95, 14, 3, 116, 81, 85, 1, 99, 51, 77, 38, 90, 118, 120, 100, 101, 9, 39, 12, 123, 84, 122, 69, 26, 115, 88, 13, 36, 60, 5, 6, 75, 103, 66, 94, 78, 97, 121, 55, 108, 109, 58, 4, 82, 41, 79, 87, 29, 106, 114, 113, 105, 73, 45, 71, 24, 2, 53, 31, 86, 11, 22, 42, 59, 7, 110, 40, 56, 70, 92, 28, 27, 48, 62, 44, 107, 65, 25, 52, 63])

There are 123 distinct tokens in total, and the word 'python' is registered in token_idx with index 1.

max(token_idx.values())

123

token_idx.get('python')

1
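The index-building loop from step 1 can also be wrapped in a small function that works on any list of lines. The following is a minimal sketch under stated assumptions: build_token_index and its preprocess parameter are hypothetical names, and str.lower stands in for the word_preprocess() function so that the block runs on its own. It also builds the reverse mapping from index to token, which is handy when inspecting encoded data.

```python
def build_token_index(lines, preprocess=str.lower):
    """Map each distinct token to a 1-based index in order of first appearance."""
    token_index = {}
    for line in lines:
        for word in line.strip().split():
            word = preprocess(word)
            if word:
                # setdefault only assigns when the token has not been seen yet
                token_index.setdefault(word, len(token_index) + 1)
    return token_index

sample_lines = ['Python is dynamically typed', 'Python is fun']   # toy input
token_index_demo = build_token_index(sample_lines)
index_to_token = {idx: tok for tok, idx in token_index_demo.items()}

print(token_index_demo)
# prints {'python': 1, 'is': 2, 'dynamically': 3, 'typed': 4, 'fun': 5}
```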

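2. One-hot encoding text at the word level

I set max_len = 40 as the maximum number of words to consider in each line of text (words from the 41st position onward in a line are ignored). I also created an empty multidimensional array, one_hot_encoded, with np.zeros() to store the one-hot encoding result.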

# consider only the first max_length words in texts            
max_len = 40

# array to store the one_hot_encoded results
with open(path) as file_opened:
    num_lines = len(file_opened.readlines())

one_hot_encoded = np.zeros(shape=(num_lines,
                                  max_len,
                                  max(token_idx.values())+1))

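one_hot_encoded is a multidimensional array of shape (5, 40, 124). This means that there are 5 lines of text, only the first 40 words (max_len) of each line are considered, and for each of the 124 token indices the value is '1' if the word at that position has that index and '0' otherwise. The last dimension is 124 rather than 123 because the token indices start at 1, so index 0 is never used.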

one_hot_encoded.shape

(5, 40, 124)

Below, we open the file, read the text line by line, and loop over each line, splitting it into words and preprocessing them. We look up each word's token index with token_idx.get(word) and then update the one_hot_encoded array by setting the entry at text (i), word position (j), and token index (idx) to '1'.

with open(path) as file_opened:
    for i, line in enumerate(file_opened.readlines()):
        # strip leading and trailing whitespace
        line = line.strip()

        # keep only the first max_len words of the line
        for j, word in enumerate(line.split()[:max_len]):

            # preprocess the word
            word = word_preprocess(word)

            # set the one-hot entry for this word's token index
            if word != '':
                idx = token_idx.get(word)
                one_hot_encoded[i, j, idx] = 1.
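To make the (text, position, token index) layout easier to follow, here is a minimal, self-contained sketch of the same technique on two toy sentences instead of the Wikipedia file. The names texts, vocab, and one_hot are hypothetical, and no preprocessing is applied.

```python
import numpy as np

texts = ['python is fun', 'python is a language']   # toy input, not the Wikipedia file
max_len = 3                                          # keep only the first 3 words per text

# 1-based token index in order of first appearance
vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab) + 1)

# shape: (number of texts, max_len, vocabulary size + 1 for the unused index 0)
one_hot = np.zeros((len(texts), max_len, max(vocab.values()) + 1))
for i, text in enumerate(texts):
    for j, word in enumerate(text.split()[:max_len]):
        one_hot[i, j, vocab[word]] = 1.0

print(one_hot.shape)
# prints (2, 3, 6)
```

The word 'language' is the fourth word of the second text, so it is cut off by max_len and its column (index 5) stays all zeros, just as words beyond position 40 are ignored in the example above.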

The resulting one_hot_encoded multidimensional array is shown below.

one_hot_encoded

array([[[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

type(one_hot_encoded)

numpy.ndarray

To aid understanding, I printed how the first 40 words of the first line of python_wikipedia.txt were one-hot encoded for token indices 1 through 10, together with each word and its token index. (It is hard to explain in words. ㅜ_ㅜ)

# sort token_idx dictionary by value
import operator
sorted_token_idx = sorted(token_idx.items(), key=operator.itemgetter(1))

# print out 10 words & token_idx of 1st text's 40 words as an example
for i in range(10):
    print('word & token_idx:', sorted_token_idx[i])
    print(one_hot_encoded[0, :, i+1])

word & token_idx: ('python', 1)
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('programming', 2)
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('language', 3)
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('wikipedia', 4)
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('interpreted', 5)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('highlevel', 6)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('generalpurpose', 7)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('created', 8)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('guido', 9)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
word & token_idx: ('van', 10)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
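Another way to check an encoding is to decode it back into tokens. The sketch below is a minimal illustration on toy data: vocab, encoded, and decode are hypothetical names, and rows that are all zeros (padding, or words removed by preprocessing) are decoded as empty strings.

```python
import numpy as np

vocab = {'python': 1, 'is': 2, 'fun': 3}            # hypothetical 1-based token index
idx_to_token = {idx: tok for tok, idx in vocab.items()}

# One text with 4 positions; the last position is left as all zeros.
encoded = np.zeros((4, len(vocab) + 1))
for pos, tok in enumerate(['python', 'is', 'fun']):
    encoded[pos, vocab[tok]] = 1.0

def decode(one_hot_text):
    """Return the token at each position, or '' where the row is all zeros."""
    tokens = []
    for row in one_hot_text:
        tokens.append(idx_to_token[int(row.argmax())] if row.any() else '')
    return tokens

print(decode(encoded))
# prints ['python', 'is', 'fun', '']
```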

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


[Python] Reading a text file and standardizing numeric data (reading csv or text file, standardizing or normalizing of numeric data)

Python Analysis and Programming/Python Data Preprocessing 2019. 5. 21. 22:29

In this post, I will (1) open and parse a text file saved in text or csv format using Python's string methods and store it as a matrix, and (2) write user-defined functions that standardize or normalize numeric data.

The example text file is abalone.txt, which records the sex of abalones together with the measurements length, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, and rings.

abalone.txt


1. Reading a text file into a matrix of numeric values and a vector of labels

Of course, you can conveniently read text and csv files with the read_csv() function of the Pandas module.

# importing modules
import numpy as np
import pandas as pd
import os

# setting directory
base_dir = '/Users/ihongdon/Documents/Python'
work_dir = 'dataset'
path = os.path.join(base_dir, work_dir)

# reading text file using pandas read_csv() function
df = pd.read_csv(os.path.join(path, 'abalone.txt'), 
                 sep=',', 
                 names=['sex', 'length', 'diameter', 'height', 'whole_weight', 
                        'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'], 
                 header=None)
                 
# check first 5 lines
df.head()
   sex  length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

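Instead of the Pandas function above, below I wrote a simple user-defined function that opens and parses the file using Python's string methods.

Based on the layout of the abalone.txt file above, the function takes as arguments the file name, the number of numeric columns, the start position of the numeric columns, the end position of the numeric columns, and the position of the label column. By adjusting the arguments and the function's code block for each dataset you analyze, you can build a more flexible user-defined function that loads files in a way suited to the data.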

def file2matrix(filename, val_col_num, val_col_st_idx, val_col_end_idx, label_idx):
    """
    - filename: directory and file name
    - val_col_num: the number of columns which contains numeric values
    - val_col_st_idx: the index of starting column which contains numeric values
    - val_col_end_idx: the index of ending column which contains numeric values
    - label_idx: the index of label column
    """
    # open file and read all lines once (the with statement closes the file)
    with open(filename) as file_opened:
        lines = file_opened.readlines()

    # blank matrix and vector to store
    matrix_value = np.zeros((len(lines), val_col_num))
    vector_label = []

    # split each line and store its values and label
    for idx, line in enumerate(lines):
        # remove leading and trailing whitespace (including the newline)
        line = line.strip()

        # split string on the comma delimiter
        list_from_line = line.split(sep=',')

        # store values in the matrix and append label to the vector
        matrix_value[idx, :] = list_from_line[val_col_st_idx : (val_col_end_idx+1)]
        vector_label.append(list_from_line[label_idx])

    return matrix_value, vector_label
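For comparison, NumPy's own np.loadtxt() can produce the same kind of value matrix and label list. The sketch below is only an illustration, not part of the original workflow: it reads three rows in the abalone.txt layout from an in-memory io.StringIO object (an assumption made so the block runs without the file), and the names raw, values, and labels are hypothetical.

```python
import io
import numpy as np

# Three rows in the same layout as abalone.txt, held in memory for illustration.
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# Columns 1..8 are numeric; column 0 holds the label.
values = np.loadtxt(raw, delimiter=',', usecols=tuple(range(1, 9)))
raw.seek(0)  # rewind the in-memory file before reading it a second time
labels = np.loadtxt(raw, delimiter=',', usecols=0, dtype=str).tolist()

print(values.shape, labels)
# prints (3, 8) ['M', 'M', 'F']
```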

For Python's string methods, see https://rfriend.tistory.com/327.

Let's use the file2matrix() user-defined function above to read the abalone.txt file and return (a) matrix_value and (b) vector_label.

# run file2matrix() UDF
matrix_value, vector_label = file2matrix(os.path.join(path, 'abalone.txt'), 8, 1, 8, 0)

#--- matrix_value
# type
type(matrix_value)
numpy.ndarray

# shape
matrix_value.shape
(4177, 8)

# samples
matrix_value[:3]
array([[ 0.455 ,  0.365 ,  0.095 ,  0.514 ,  0.2245,  0.101 ,  0.15  , 15.    ],
       [ 0.35  ,  0.265 ,  0.09  ,  0.2255,  0.0995,  0.0485,  0.07  ,  7.    ],
       [ 0.53  ,  0.42  ,  0.135 ,  0.677 ,  0.2565,  0.1415,  0.21  ,  9.    ]])
       
#--- vector_label
# type
type(vector_label)
list

# number of labels
len(vector_label)
4177

# samples
vector_label[:3]
['M', 'M', 'F']

2-1. Standardizing numeric data

Let's write user-defined functions that standardize and normalize matrix_value, the matrix of numeric data above, using numpy. (Of course, you can also use zscore() from scipy.stats or StandardScaler() from sklearn.preprocessing.)

The user-defined function below takes a dataset of numeric data as its argument, computes the mean and standard deviation of each column, and standardizes with standardized_value = (x - mean) / standard_deviation. It then returns the standardized matrix together with the per-column means and standard deviations.

def standardize(numeric_dataset):

    # standardized_value = (x - mean)/ standard_deviation
    
    # calculate mean and standard deviation per numeric columns
    mean_val = numeric_dataset.mean(axis=0)
    std_dev_val = numeric_dataset.std(axis=0)
    
    # standardization
    matrix_standardized = (numeric_dataset - mean_val)/ std_dev_val
    
    return matrix_standardized, mean_val, std_dev_val

Let's standardize the matrix_value array with the standardize() function above.

# run standardize() UDF
matrix_standardized, mean_val, std_dev_val = standardize(matrix_value)

# matrix after standardization
matrix_standardized
array([[-0.57455813, -0.43214879, -1.06442415, ..., -0.72621157,
        -0.63821689,  1.57154357],
       [-1.44898585, -1.439929  , -1.18397831, ..., -1.20522124,
        -1.21298732, -0.91001299],
       [ 0.05003309,  0.12213032, -0.10799087, ..., -0.35668983,
        -0.20713907, -0.28962385],
       ...,
       [ 0.6329849 ,  0.67640943,  1.56576738, ...,  0.97541324,
         0.49695471, -0.28962385],
       [ 0.84118198,  0.77718745,  0.25067161, ...,  0.73362741,
         0.41073914,  0.02057072],
       [ 1.54905203,  1.48263359,  1.32665906, ...,  1.78744868,
         1.84048058,  0.64095986]])
 
# mean per columns
mean_val
array([0.5239921 , 0.40788125, 0.1395164 , 0.82874216, 0.35936749,
       0.18059361, 0.23883086, 9.93368446])

# standard deviation per columns
std_dev_val
array([0.12007854, 0.09922799, 0.04182205, 0.49033031, 0.22193638,
       0.10960113, 0.13918601, 3.22378307])
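The returned means and standard deviations are useful beyond inspection: they let you apply the same scaling to new observations. The following minimal sketch repeats the standardize() logic on toy data (the arrays train and new_rows are hypothetical) and checks that each standardized column has mean 0 and standard deviation 1.

```python
import numpy as np

def standardize(numeric_dataset):
    mean_val = numeric_dataset.mean(axis=0)
    std_dev_val = numeric_dataset.std(axis=0)   # population std (ddof=0), NumPy's default
    return (numeric_dataset - mean_val) / std_dev_val, mean_val, std_dev_val

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])   # toy data
z, mean_val, std_dev_val = standardize(train)

# Each column now has mean 0 and standard deviation 1.
print(np.allclose(z.mean(axis=0), 0.0), np.allclose(z.std(axis=0), 1.0))
# prints True True

# New observations are scaled with the statistics learned from the training data.
new_rows = np.array([[2.0, 30.0]])
z_new = (new_rows - mean_val) / std_dev_val
```

Here new_rows equals the column means of train, so z_new is a row of zeros.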

2-2. Normalizing numeric data

Next, let's normalize numeric data that have different scales and ranges, converting them to values in [0, 1]. The calculation is normalized_value = (x - minimum_value) / (maximum_value - minimum_value).

def normalize(numeric_dataset):
    
    # normalized_value = (x - minimum_value) / (maximum_value - minimum_value)
    
    # calculate minimum, maximum, and range per numeric column
    min_val = numeric_dataset.min(axis=0)
    max_val = numeric_dataset.max(axis=0)
    ranges = max_val - min_val
    
    # normalization, min_max_scaling
    matrix_normalized = (numeric_dataset - min_val)/ ranges
    
    return matrix_normalized, ranges, min_val

Let's apply the normalize() user-defined function above to the matrix_value array. It returns the normalized array, the ranges (range = max_val - min_val), and the minimum values together.

# run normalize() UDF
matrix_normalized, ranges, min_val = normalize(matrix_value)

# normalized matrix
matrix_normalized
array([[0.51351351, 0.5210084 , 0.0840708 , ..., 0.1323239 , 0.14798206,
        0.5       ],
       [0.37162162, 0.35294118, 0.07964602, ..., 0.06319947, 0.06826109,
        0.21428571],
       [0.61486486, 0.61344538, 0.11946903, ..., 0.18564845, 0.2077728 ,
        0.28571429],
       ...,
       [0.70945946, 0.70588235, 0.18141593, ..., 0.37788018, 0.30543099,
        0.28571429],
       [0.74324324, 0.72268908, 0.13274336, ..., 0.34298881, 0.29347285,
        0.32142857],
       [0.85810811, 0.84033613, 0.17256637, ..., 0.49506254, 0.49177877,
        0.39285714]])
        
# ranges
ranges
array([ 0.74  ,  0.595 ,  1.13  ,  2.8235,  1.487 ,  0.7595,  1.0035,  28.    ])

# minimum value
min_val
array([7.5e-02, 5.5e-02, 0.0e+00, 2.0e-03, 1.0e-03, 5.0e-04, 1.5e-03, 1.0e+00])
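Because the ranges and minimum values are returned, the scaling can be undone with x = normalized_value * range + minimum_value. The sketch below illustrates this on toy data and also guards against a constant column, whose range is 0 and would otherwise cause a division by zero. The name normalize_safe and the choice to map constant columns to 0 are assumptions of this sketch, not part of the normalize() function above.

```python
import numpy as np

def normalize_safe(numeric_dataset):
    """Min-max scale each column to [0, 1]; constant columns map to 0."""
    min_val = numeric_dataset.min(axis=0)
    ranges = numeric_dataset.max(axis=0) - min_val
    # Avoid 0/0 for constant columns by dividing by 1 instead (an assumption of this sketch).
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (numeric_dataset - min_val) / safe_ranges, ranges, min_val

data = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])   # second column is constant
scaled, ranges, min_val = normalize_safe(data)

# Undo the scaling: x = scaled * range + minimum
restored = scaled * ranges + min_val

print(scaled[:, 0])
# prints [0.  0.5 1. ]
```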

In the next post, I will introduce how to parse a text file and one-hot encode it.

I hope this was helpful.


[Greenplum DB] How to manually install the PL/R Language Extension and R packages on GPDB

Greenplum and PostgreSQL Database 2019. 5. 16. 15:28

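You can install Procedural Language Extensions such as R, Python, Perl, or Java on Greenplum DB to process and analyze large volumes of data in-database, in a distributed and parallel manner.

In this post, I will show how to install R packages that were downloaded in an environment with internet access onto a Greenplum DB that runs in a closed network because of company or institutional policy.

1. How to install the Greenplum PL/R Extension (Procedural Language R)

2. How to install R packages on Greenplum DB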

1. How to install the Greenplum PL/R Extension

PL/R is a procedural language. Once the PL/R Extension is installed, you can use the R programming language and the functions and datasets of R packages in Greenplum DB.

For installing the PL/R language extension on Greenplum DB, I referred to https://gpdb.docs.pivotal.io/5180/ref_guide/extensions/pl_r.html. Select the Greenplum DB version you are using at the top of the web page. (The example below is the screen when GPDB v5.18 is selected.)

PL/R is distributed as a package that can be downloaded from Pivotal Network (https://network.pivotal.io/products/pivotal-gpdb) and installed easily with the Greenplum Package Manager (gppkg).

The Greenplum Package Manager (gppkg) utility installs PL/R and its dependent packages on the host and across the cluster in one go. gppkg also installs the PL/R extension automatically during system expansion or segment recovery.

The installation steps for the Greenplum PL/R Extension are as follows.

(0) First, Greenplum DB must be running, greenplum_path.sh must be sourced, and the $MASTER_DATA_DIRECTORY and $GPHOME variables must be set.

Check the Greenplum DB version with psql.

$ psql -c "select version();"

On the master host, create a working directory as the gpadmin account.

(e.g. /home/gpadmin/packages)

(1) Download from Pivotal Network the PL/R Extension that matches the Greenplum DB version you are using.

(e.g. plr-2.3.3-gp5-rhel7-x86_64.gppkg)

(2) Copy the downloaded PL/R Extension package to the Greenplum DB master host using scp or sftp. (In most companies, policy allows only the DBA to access root privileges, in which case you need to ask the DBA to copy and install it.)

$ scp plr-2.3.3-gp5-rhel7-x86_64.gppkg root@mdw:~/packages

(3) Install the PL/R Extension package by running the gppkg command. (The example below was run on Linux.)

$ gppkg -i plr-2.3.3-gp5-rhel7-x86_64.gppkg

(4) Restart Greenplum DB.

(This stops and restarts GPDB, so be sure to notify the DBA and get approval beforehand!)

$ gpstop -r

(5) Source the file $GPHOME/greenplum_path.sh

# source /usr/local/greenplum-db/greenplum_path.sh

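The R extension and the R environment are installed in the following path.

$GPHOME/ext/R-2.3.3/

(6) For each database to use the PL/R language, PL/R must be registered with the SQL statement CREATE LANGUAGE or with the createlang utility. (The example below registers it in the testdb database.)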

$ createlang plr -d testdb

PL/R is now registered as an untrusted language.

For reference, you can list the databases with \l in psql.

psql # \l

2. How to install R packages on Greenplum DB (Installing external R packages)

(0) Download the R packages you need together with the R packages they depend on. (See https://rfriend.tistory.com/441.)

(1) Archive the downloaded R packages and copy them to the Greenplum DB server.

Let's list the downloaded R packages.

[root@mdw /]# find . | grep sp_1.3-1.tar.gz
./home/gpadmin/r-pkg/sp_1.3-1.tar.gz
[root@mdw /]# exit
logout
[gpadmin@mdw tmp]$ cd ~
[gpadmin@mdw ~]$ cd r-pkg
[gpadmin@mdw r-pkg]$ ls -la
total 47032
drwxrwxr-x 2 gpadmin gpadmin    4096 Apr 23 13:17 .
drwx------ 1 gpadmin gpadmin    4096 Apr 23 13:14 ..
-rw-rw-r-- 1 gpadmin gpadmin  931812 Apr 23 12:55 DBI_1.0.0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  794622 Apr 23 12:55 LearnBayes_2.15.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  487225 Apr 23 12:55 MASS_7.3-51.3.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1860456 Apr 23 12:55 Matrix_1.2-17.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   31545 Apr 23 12:55 R6_2.4.0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 3661123 Apr 23 12:55 Rcpp_1.0.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   21810 Apr 23 12:55 abind_1.4-5.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  231855 Apr 23 12:55 boot_1.3-20.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   17320 Apr 23 12:55 classInt_0.3-1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   19757 Apr 23 12:55 class_7.3-15.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   73530 Apr 23 12:55 coda_0.19-2.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  658694 Apr 23 12:55 crayon_1.3.4.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   80772 Apr 23 12:55 deldir_0.1-16.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  128553 Apr 23 12:55 digest_0.6.18.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  582415 Apr 23 12:55 e1071_1.7-1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  137075 Apr 23 12:55 expm_0.999-4.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  347295 Apr 23 12:55 foreign_0.8-71.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1058430 Apr 23 12:55 gdata_2.18.0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  758133 Apr 23 12:55 geosphere_1.5-7.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   33783 Apr 23 12:55 gmodels_2.18.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   12577 Apr 23 12:55 goftest_1.1-1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  187516 Apr 23 12:55 gtools_3.8.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   45408 Apr 23 12:55 htmltools_0.3.6.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1758514 Apr 23 12:55 httpuv_1.5.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1052728 Apr 23 12:55 jsonlite_1.6.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   40293 Apr 23 12:55 later_0.8.0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  359031 Apr 23 12:55 lattice_0.20-38.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  200504 Apr 23 12:55 magrittr_1.5.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1581592 Apr 23 12:55 maptools_0.9-5.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  915991 Apr 23 12:55 mgcv_1.8-28.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   12960 Apr 23 12:55 mime_0.6.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   79619 Apr 23 12:55 polyclip_1.10-0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  106866 Apr 23 12:55 promises_1.0.1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  255244 Apr 23 12:55 rgeos_0.4-2.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  858992 Apr 23 12:55 rlang_0.3.4.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  639286 Apr 23 12:55 rpart_4.1-15.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 8166770 Apr 23 12:55 sf_0.7-3.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 2991469 Apr 23 12:55 shiny_1.3.2.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   24155 Apr 23 12:55 sourcetools_0.1.7.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 3485268 Apr 23 12:55 spData_0.3.0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1133621 Apr 23 12:55 sp_1.3-1.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 2861828 Apr 23 12:55 spatstat.data_1.4-0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin   65106 Apr 23 12:55 spatstat.utils_1.13-0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 6598638 Apr 23 12:55 spatstat_1.59-0.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin 1227625 Apr 23 12:55 spdep_1.1-2.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin    2518 Apr 23 12:55 tensor_1.5.tar.gz
-rwxr-xr-x 1 gpadmin gpadmin    2326 Apr 23 13:17 test.sh
-rw-rw-r-- 1 gpadmin gpadmin  917316 Apr 23 12:55 units_0.6-2.tar.gz
-rw-rw-r-- 1 gpadmin gpadmin  564589 Apr 23 12:55 xtable_1.8-4.tar.gz

Let's bundle the folder containing the R packages into an archive named r-pkg.tar.

[gpadmin@mdw r-pkg]$ pwd
/home/gpadmin/r-pkg
[gpadmin@mdw r-pkg]$ cd ..
[gpadmin@mdw ~]$ tar cf r-pkg.tar r-pkg
[gpadmin@mdw ~]$ ls -lrt
total 47000
drwxr-xr-x 2 gpadmin gpadmin     4096 Aug 13  2018 gpconfigs
drwxr-xr-x 2 root    root        4096 Mar 22 07:02 gppkgs
drwxrwxr-x 1 gpadmin gpadmin     4096 Apr 23 12:48 gpAdminLogs
-rw-rw-r-- 1 gpadmin gpadmin      983 Apr 23 13:14 pkg.r
drwxrwxr-x 2 gpadmin gpadmin     4096 Apr 23 13:17 r-pkg
-rw-rw-r-- 1 gpadmin gpadmin 48107520 Apr 25 01:52 r-pkg.tar

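From a command prompt, copy the archive created in the GPDB Docker container to the local machine, then copy it to the other GPDB server and extract it. (It got a bit complicated because I was working in a Docker environment. If you downloaded the R packages locally, you can copy them straight from the local machine to the GPDB server. You can copy the archived R package file with scp or upload it with sftp; if you lack the permissions, ask your DBA.) The script below copies the archived R package file with scp to the /root/packages path on mdw, using the root account.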

-- copy the archive from the GPDB Docker container to the local machine
-- (run the copy and check it in a separate command prompt window)

ihongdon-ui-MacBook-Pro:Downloads ihongdon$ docker cp gpdb-ds:/home/gpadmin/r-pkg.tar /Users/ihongdon/Downloads/r-pkg.tar
ihongdon-ui-MacBook-Pro:Downloads ihongdon$ ls -lrt
-rw-rw-r--   1 ihongdon  staff  48107520  4 25 10:52 r-pkg.tar

-- copy to the other GPDB server
ihongdon-ui-MacBook-Pro:Downloads ihongdon$ scp r-pkg.tar root@mdw:~/packages

-- extract the archive
$ tar -xvf r-pkg.tar

To install R packages on Greenplum DB, R must already be installed on every Greenplum server.

To install the R packages on multiple segments at once, create a list of the hosts you want to deploy to.

# source /usr/local/greenplum-db/greenplum_path.sh
# vi hostfile_packages

When the vi editor opens, register the names of the hosts on which you want to install the R packages, as shown below. (Example with 1 master and 3 segments.)

-- in the vi editor --
smdw
sdw1
sdw2
sdw3
~
~
~
press Esc, then type :wq!

From a command prompt, copy the packages directory from mdw to each node as the root account.

# gpscp -f hostfile_packages -r packages =:/root

Copy hostfile_packages to create hostfile_all, and add mdw to it.

-- copy
$ cp hostfile_packages  hostfile_all

-- insert mdw
$ vi hostfile_all

-- in the vi editor --
mdw
smdw
sdw1
sdw2
sdw3
~
~
~
press Esc, then type :wq!

Run the 'R CMD INSTALL r_package_name' command from mdw to install the R packages on all servers, including mdw. (Because mdw, smdw, sdw1, sdw2, and sdw3 are registered in hostfile_all, the R packages are installed on all hosts.)

$ gpssh -f hostfile_all -v -e 'R CMD INSTALL ./DBI_1.0.0.tar.gz 
LearnBayes_2.15.1.tar.gz MASS_7.3-51.3.tar.gz Matrix_1.2-17.tar.gz 
R6_2.4.0.tar.gz Rcpp_1.0.1.tar.gz 
abind_1.4-5.tar.gz boot_1.3-20.tar.gz classInt_0.3-1.tar.gz
class_7.3-15.tar.gz coda_0.19-2.tar.gz crayon_1.3.4.tar.gz
deldir_0.1-16.tar.gz digest_0.6.18.tar.gz e1071_1.7-1.tar.gz
expm_0.999-4.tar.gz foreign_0.8-71.tar.gz gdata_2.18.0.tar.gz
geosphere_1.5-7.tar.gz gmodels_2.18.1.tar.gz goftest_1.1-1.tar.gz
gtools_3.8.1.tar.gz htmltools_0.3.6.tar.gz httpuv_1.5.1.tar.gz
jsonlite_1.6.tar.gz later_0.8.0.tar.gz lattice_0.20-38.tar.gz
magrittr_1.5.tar.gz maptools_0.9-5.tar.gz mgcv_1.8-28.tar.gz
mime_0.6.tar.gz polyclip_1.10-0.tar.gz promises_1.0.1.tar.gz
rgeos_0.4-2.tar.gz rlang_0.3.4.tar.gz rpart_4.1-15.tar.gz
sf_0.7-3.tar.gz shiny_1.3.2.tar.gz sourcetools_0.1.7.tar.gz
spData_0.3.0.tar.gz sp_1.3-1.tar.gz spatstat.data_1.4-0.tar.gz
spatstat.utils_1.13-0.tar.gz spatstat_1.59-0.tar.gz spdep_1.1-2.tar.gz
tensor_1.5.tar.gz units_0.6-2.tar.gz xtable_1.8-4.tar.gz'

When you try to install a particular R package, it will not install if the packages it depends on are not already installed. As a result, running the 'R CMD INSTALL r-package-names' command above installs some packages and fails on others (those can only be installed after their dependencies are in place). You therefore have to rerun this installation command manually several times. On each pass, more dependencies get installed, so packages that failed earlier now succeed. Repeat a few more times until everything is installed.
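An alternative to rerunning the command until everything succeeds is to compute an installation order in which every package comes after its dependencies. The sketch below shows the idea with Python's standard graphlib module (Python 3.9+). The dependency map is hypothetical and heavily simplified for illustration; real dependencies must be read from each package's DESCRIPTION file.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: package -> packages that must be installed first.
deps = {
    'sp': [],
    'Rcpp': [],
    'units': ['Rcpp'],
    'sf': ['Rcpp', 'units'],
    'spdep': ['sp', 'sf'],
}

# static_order() yields every package after all of its dependencies.
install_order = list(TopologicalSorter(deps).static_order())
print(install_order)
```

Passing the tarballs to R CMD INSTALL in such an order should, under these assumptions, let the installation finish in a single pass.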

I hope this was helpful.


[R] Downloading an R package together with all of its dependencies at once (tools::package_dependencies(), download.packages())

R Analysis and Programming/R Data Preprocessing 2019. 4. 27. 22:32

When a company runs a closed internal network for security reasons, R packages sometimes cannot be downloaded directly from the internet. Installing R packages in such a closed network can be 'hell itself!' To install a given R package you must first install the packages it depends on, and those packages depend on yet other packages... Working through this chain of dependencies, you end up grumbling to yourself. -_-;;;

Here is a convenient way to download an R package together with all of its dependencies at once.

First, let's create an 'r-pkg' folder to hold the downloaded R packages.

> mainDir <- "/Users/ihongdon/Downloads"

> subDir <- "r-pkg"

> dir.create(file.path(mainDir, subDir), showWarnings = FALSE)

Next, let's write a user-defined function, getPackages(), that uses the package_dependencies() function of the tools package to collect the R packages to download along with their dependencies (dependencies of package).

# UDF for get packages with dependencies
getPackages <- function(packs){
  packages <- unlist(
    # Find (recursively) dependencies or reverse dependencies of packages.
    tools::package_dependencies(packs, available.packages(),
                                which=c("Depends", "Imports"), recursive=TRUE)
  )
  packages <- union(packs, packages)
  return(packages)
}
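To illustrate what "recursive" dependency resolution means here, the following Python sketch walks a dependency map and collects every package reachable from the requested ones. It is only a conceptual illustration of the idea behind getPackages(): the function name get_packages and the dependency map are hypothetical and simplified, whereas the R function reads the real dependency data from available.packages().

```python
def get_packages(packs, dependency_map):
    """Return the requested packages followed by all of their recursive dependencies."""
    result = list(packs)
    seen = set(packs)
    stack = list(packs)
    while stack:
        current = stack.pop()
        for dep in dependency_map.get(current, []):
            if dep not in seen:
                # record each dependency once, then explore its own dependencies
                seen.add(dep)
                result.append(dep)
                stack.append(dep)
    return result

# Hypothetical, heavily simplified dependency map for illustration only.
dependency_map = {
    'dplyr': ['glue', 'tibble'],
    'tibble': ['cli', 'crayon'],
    'cli': ['crayon'],
}
packages = get_packages(['dplyr'], dependency_map)
print(packages)
# prints ['dplyr', 'glue', 'tibble', 'cli', 'crayon']
```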

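Now that everything is ready, as an example let's use the getPackages() user-defined function to collect the two packages "dplyr" and "ggplot2" along with their dependencies, and then download them into the '/Users/ihongdon/Downloads/r-pkg' folder with the download.packages() function.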

> packages <- getPackages(c("dplyr", "ggplot2"))

> download.packages(packages, destdir=file.path(mainDir, subDir))

trying URL 'https://cran.rstudio.com/src/contrib/dplyr_0.8.0.1.tar.gz'

Content type 'application/x-gzip' length 1075146 bytes (1.0 MB)

==================================================

downloaded 1.0 MB

trying URL 'https://cran.rstudio.com/src/contrib/ggplot2_3.1.1.tar.gz'

Content type 'application/x-gzip' length 2862022 bytes (2.7 MB)

==================================================

downloaded 2.7 MB

trying URL 'https://cran.rstudio.com/src/contrib/assertthat_0.2.1.tar.gz'

Content type 'application/x-gzip' length 12742 bytes (12 KB)

==================================================

downloaded 12 KB

trying URL 'https://cran.rstudio.com/src/contrib/glue_1.3.1.tar.gz'

Content type 'application/x-gzip' length 122950 bytes (120 KB)

==================================================

downloaded 120 KB

trying URL 'https://cran.rstudio.com/src/contrib/magrittr_1.5.tar.gz'

Content type 'application/x-gzip' length 200504 bytes (195 KB)

==================================================

downloaded 195 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'methods' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/pkgconfig_2.0.2.tar.gz'

Content type 'application/x-gzip' length 6024 bytes

==================================================

downloaded 6024 bytes

trying URL 'https://cran.rstudio.com/src/contrib/R6_2.4.0.tar.gz'

Content type 'application/x-gzip' length 31545 bytes (30 KB)

==================================================

downloaded 30 KB

trying URL 'https://cran.rstudio.com/src/contrib/Rcpp_1.0.1.tar.gz'

Content type 'application/x-gzip' length 3661123 bytes (3.5 MB)

==================================================

downloaded 3.5 MB

trying URL 'https://cran.rstudio.com/src/contrib/rlang_0.3.4.tar.gz'

Content type 'application/x-gzip' length 858992 bytes (838 KB)

==================================================

downloaded 838 KB

trying URL 'https://cran.rstudio.com/src/contrib/tibble_2.1.1.tar.gz'

Content type 'application/x-gzip' length 311836 bytes (304 KB)

==================================================

downloaded 304 KB

trying URL 'https://cran.rstudio.com/src/contrib/tidyselect_0.2.5.tar.gz'

Content type 'application/x-gzip' length 21883 bytes (21 KB)

==================================================

downloaded 21 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'utils' at the repositories

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'tools' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/cli_1.1.0.tar.gz'

Content type 'application/x-gzip' length 40232 bytes (39 KB)

==================================================

downloaded 39 KB

trying URL 'https://cran.rstudio.com/src/contrib/crayon_1.3.4.tar.gz'

Content type 'application/x-gzip' length 658694 bytes (643 KB)

==================================================

downloaded 643 KB

trying URL 'https://cran.rstudio.com/src/contrib/fansi_0.4.0.tar.gz'

Content type 'application/x-gzip' length 266123 bytes (259 KB)

==================================================

downloaded 259 KB

trying URL 'https://cran.rstudio.com/src/contrib/pillar_1.3.1.tar.gz'

Content type 'application/x-gzip' length 103972 bytes (101 KB)

==================================================

downloaded 101 KB

trying URL 'https://cran.rstudio.com/src/contrib/purrr_0.3.2.tar.gz'

Content type 'application/x-gzip' length 373701 bytes (364 KB)

==================================================

downloaded 364 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'grDevices' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/utf8_1.1.4.tar.gz'

Content type 'application/x-gzip' length 218882 bytes (213 KB)

==================================================

downloaded 213 KB

trying URL 'https://cran.rstudio.com/src/contrib/digest_0.6.18.tar.gz'

Content type 'application/x-gzip' length 128553 bytes (125 KB)

==================================================

downloaded 125 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'grid' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/gtable_0.3.0.tar.gz'

Content type 'application/x-gzip' length 368081 bytes (359 KB)

==================================================

downloaded 359 KB

trying URL 'https://cran.rstudio.com/src/contrib/lazyeval_0.2.2.tar.gz'

Content type 'application/x-gzip' length 83482 bytes (81 KB)

==================================================

downloaded 81 KB

trying URL 'https://cran.rstudio.com/src/contrib/MASS_7.3-51.4.tar.gz'

Content type 'application/x-gzip' length 487233 bytes (475 KB)

==================================================

downloaded 475 KB

trying URL 'https://cran.rstudio.com/src/contrib/mgcv_1.8-28.tar.gz'

Content type 'application/x-gzip' length 915991 bytes (894 KB)

==================================================

downloaded 894 KB

trying URL 'https://cran.rstudio.com/src/contrib/plyr_1.8.4.tar.gz'

Content type 'application/x-gzip' length 392451 bytes (383 KB)

==================================================

downloaded 383 KB

trying URL 'https://cran.rstudio.com/src/contrib/reshape2_1.4.3.tar.gz'

Content type 'application/x-gzip' length 36405 bytes (35 KB)

==================================================

downloaded 35 KB

trying URL 'https://cran.rstudio.com/src/contrib/scales_1.0.0.tar.gz'

Content type 'application/x-gzip' length 299262 bytes (292 KB)

==================================================

downloaded 292 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'stats' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/viridisLite_0.3.0.tar.gz'

Content type 'application/x-gzip' length 44019 bytes (42 KB)

==================================================

downloaded 42 KB

trying URL 'https://cran.rstudio.com/src/contrib/withr_2.1.2.tar.gz'

Content type 'application/x-gzip' length 53578 bytes (52 KB)

==================================================

downloaded 52 KB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'graphics' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/nlme_3.1-139.tar.gz'

Content type 'application/x-gzip' length 793473 bytes (774 KB)

==================================================

downloaded 774 KB

trying URL 'https://cran.rstudio.com/src/contrib/Matrix_1.2-17.tar.gz'

Content type 'application/x-gzip' length 1860456 bytes (1.8 MB)

==================================================

downloaded 1.8 MB

Warning in download.packages(packages, destdir = file.path(mainDir, subDir)) :

no package 'splines' at the repositories

trying URL 'https://cran.rstudio.com/src/contrib/stringr_1.4.0.tar.gz'

Content type 'application/x-gzip' length 135777 bytes (132 KB)

==================================================

downloaded 132 KB

trying URL 'https://cran.rstudio.com/src/contrib/labeling_0.3.tar.gz'

Content type 'application/x-gzip' length 10722 bytes (10 KB)

==================================================

downloaded 10 KB

trying URL 'https://cran.rstudio.com/src/contrib/munsell_0.5.0.tar.gz'

Content type 'application/x-gzip' length 182653 bytes (178 KB)

==================================================

downloaded 178 KB

trying URL 'https://cran.rstudio.com/src/contrib/RColorBrewer_1.1-2.tar.gz'

Content type 'application/x-gzip' length 11532 bytes (11 KB)

==================================================

downloaded 11 KB

trying URL 'https://cran.rstudio.com/src/contrib/lattice_0.20-38.tar.gz'

Content type 'application/x-gzip' length 359031 bytes (350 KB)

==================================================

downloaded 350 KB

trying URL 'https://cran.rstudio.com/src/contrib/colorspace_1.4-1.tar.gz'

Content type 'application/x-gzip' length 2152594 bytes (2.1 MB)

==================================================

downloaded 2.1 MB

trying URL 'https://cran.rstudio.com/src/contrib/stringi_1.4.3.tar.gz'

Content type 'application/x-gzip' length 7290890 bytes (7.0 MB)

==================================================

downloaded 7.0 MB

[,1] [,2]

[1,] "dplyr" "/Users/ihongdon/Downloads/r-pkg/dplyr_0.8.0.1.tar.gz"

[2,] "ggplot2" "/Users/ihongdon/Downloads/r-pkg/ggplot2_3.1.1.tar.gz"

[3,] "assertthat" "/Users/ihongdon/Downloads/r-pkg/assertthat_0.2.1.tar.gz"

[4,] "glue" "/Users/ihongdon/Downloads/r-pkg/glue_1.3.1.tar.gz"

[5,] "magrittr" "/Users/ihongdon/Downloads/r-pkg/magrittr_1.5.tar.gz"

[6,] "pkgconfig" "/Users/ihongdon/Downloads/r-pkg/pkgconfig_2.0.2.tar.gz"

[7,] "R6" "/Users/ihongdon/Downloads/r-pkg/R6_2.4.0.tar.gz"

[8,] "Rcpp" "/Users/ihongdon/Downloads/r-pkg/Rcpp_1.0.1.tar.gz"

[9,] "rlang" "/Users/ihongdon/Downloads/r-pkg/rlang_0.3.4.tar.gz"

[10,] "tibble" "/Users/ihongdon/Downloads/r-pkg/tibble_2.1.1.tar.gz"

[11,] "tidyselect" "/Users/ihongdon/Downloads/r-pkg/tidyselect_0.2.5.tar.gz"

[12,] "cli" "/Users/ihongdon/Downloads/r-pkg/cli_1.1.0.tar.gz"

[13,] "crayon" "/Users/ihongdon/Downloads/r-pkg/crayon_1.3.4.tar.gz"

[14,] "fansi" "/Users/ihongdon/Downloads/r-pkg/fansi_0.4.0.tar.gz"

[15,] "pillar" "/Users/ihongdon/Downloads/r-pkg/pillar_1.3.1.tar.gz"

[16,] "purrr" "/Users/ihongdon/Downloads/r-pkg/purrr_0.3.2.tar.gz"

[17,] "utf8" "/Users/ihongdon/Downloads/r-pkg/utf8_1.1.4.tar.gz"

[18,] "digest" "/Users/ihongdon/Downloads/r-pkg/digest_0.6.18.tar.gz"

[19,] "gtable" "/Users/ihongdon/Downloads/r-pkg/gtable_0.3.0.tar.gz"

[20,] "lazyeval" "/Users/ihongdon/Downloads/r-pkg/lazyeval_0.2.2.tar.gz"

[21,] "MASS" "/Users/ihongdon/Downloads/r-pkg/MASS_7.3-51.4.tar.gz"

[22,] "mgcv" "/Users/ihongdon/Downloads/r-pkg/mgcv_1.8-28.tar.gz"

[23,] "plyr" "/Users/ihongdon/Downloads/r-pkg/plyr_1.8.4.tar.gz"

[24,] "reshape2" "/Users/ihongdon/Downloads/r-pkg/reshape2_1.4.3.tar.gz"

[25,] "scales" "/Users/ihongdon/Downloads/r-pkg/scales_1.0.0.tar.gz"

[26,] "viridisLite" "/Users/ihongdon/Downloads/r-pkg/viridisLite_0.3.0.tar.gz"

[27,] "withr" "/Users/ihongdon/Downloads/r-pkg/withr_2.1.2.tar.gz"

[28,] "nlme" "/Users/ihongdon/Downloads/r-pkg/nlme_3.1-139.tar.gz"

[29,] "Matrix" "/Users/ihongdon/Downloads/r-pkg/Matrix_1.2-17.tar.gz"

[30,] "stringr" "/Users/ihongdon/Downloads/r-pkg/stringr_1.4.0.tar.gz"

[31,] "labeling" "/Users/ihongdon/Downloads/r-pkg/labeling_0.3.tar.gz"

[32,] "munsell" "/Users/ihongdon/Downloads/r-pkg/munsell_0.5.0.tar.gz"

[33,] "RColorBrewer" "/Users/ihongdon/Downloads/r-pkg/RColorBrewer_1.1-2.tar.gz"

[34,] "lattice" "/Users/ihongdon/Downloads/r-pkg/lattice_0.20-38.tar.gz"

[35,] "colorspace" "/Users/ihongdon/Downloads/r-pkg/colorspace_1.4-1.tar.gz"

[36,] "stringi" "/Users/ihongdon/Downloads/r-pkg/stringi_1.4.3.tar.gz"

As the list of downloaded packages above shows, asking for just the two packages 'dplyr' and 'ggplot2' pulled in 34 additional packages they depend on, from [3] 'assertthat' through [36] 'stringi'.
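The "no package '...' at the repositories" warnings in the log above all name packages that ship with R itself (methods, utils, tools, grDevices, grid, stats, graphics, splines), so there is nothing to download for them. Reading the warnings this way is an interpretation, not something the original post states. The Python sketch below shows how such names could be dropped from a resolved list before downloading; drop_base_packages and the sample list are hypothetical.

```python
# Packages named in the warnings above; they are distributed with R itself.
BASE_PACKAGES = {
    "methods", "utils", "tools", "grDevices",
    "grid", "stats", "graphics", "splines",
}


def drop_base_packages(packages):
    """Keep only the packages that have to be fetched from a repository."""
    return [p for p in packages if p not in BASE_PACKAGES]


# Hypothetical excerpt of a resolved dependency list.
resolved = ["dplyr", "ggplot2", "assertthat", "methods", "utils", "rlang", "stats"]
downloadable = drop_base_packages(resolved)
print(downloadable)
```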

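Save the R package files downloaded in an environment with internet access to a storage device, copy them onto the server in the closed network, and then install the R packages manually. (Repeat the installation several times if needed, until all the dependent packages have been installed as well.)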

The code below reads the list of all the downloaded packages and installs the R packages manually.

## set working directory to the folder holding the downloaded packages
setwd("C:/Users/hdlee/Documents/R") # set with yours

## get the list of downloaded R package source files (*.tar.gz)
src_pkgs <- list.files("C:/Users/hdlee/Documents/R", pattern = "\\.tar\\.gz$") # use yours

## install R packages manually from the source files
install.packages(src_pkgs, repos = NULL, type = "source")
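Repeating the installation is needed when a package is attempted before the packages it depends on. One way to avoid the repetition is to work out an order in which every package comes after its dependencies. The Python sketch below illustrates that ordering with the standard-library graphlib module (Python 3.9 or later); the deps table is a hypothetical excerpt, and this is an illustration of the idea rather than something the R code above does.

```python
from graphlib import TopologicalSorter

# Hypothetical excerpt: package -> packages that must be installed before it.
deps = {
    "tibble": {"cli", "crayon", "pillar", "rlang"},
    "pillar": {"cli", "crayon", "rlang"},
    "cli": {"crayon"},
    "crayon": set(),
    "rlang": set(),
}

# static_order() yields each package only after all of its dependencies.
install_order = list(TopologicalSorter(deps).static_order())
print(install_order)
```

Installing the tarballs in this order would let each package find its dependencies already in place, so a single pass should be enough.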

##==============================================##

For how to install R packages on Greenplum DB in a closed-network environment, see the following posts:

==> http://gpdbkr.blogspot.com/2019/12/greenplum-6-plr-r.html

==> https://rfriend.tistory.com/442

##==============================================##

I hope this was helpful.


Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


[Greenplum DB] PostGIS - Backing up and restoring a geospatial table (Backup and Restore geospatial table using pg_dump, pg_restore)

Greenplum and PostgreSQL Database 2019. 4. 27. 18:59

The previous posts introduced ways to import geospatial datasets.

In this post, for geospatial data stored as a table in PostGIS on PostgreSQL or Greenplum DB, we will (1) back up the geospatial table with pg_dump (Backup) and (2) load the backed-up table again with pg_restore (Restore).

(* Reference: https://github.com/PacktPublishing/Mastering-PostGIS)

(1) Backing up a geospatial table with pg_dump (Create a Backup table)

After starting Greenplum DB with docker from a command prompt and logging in with the gpadmin account, I backed up the data_import.earthquakes_subset_with_geom table, which had already been created in geometry format, using pg_dump. (Replace host, port, and user with your own database settings.)

[gpadmin@mdw tmp]$ pg_dump -h localhost -p 5432 -U gpadmin -t data_import.earthquakes_subset_with_geom -c -F c -v -b -f earthquakes_subset_with_geom.backup gpadmin

pg_dump: reading extensions

pg_dump: identifying extension members

20190417:04:24:25|pg_dump-[INFO]:-reading schemas

pg_dump: reading user-defined tables

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined functions

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined types

20190417:04:24:25|pg_dump-[INFO]:-reading type storage options

20190417:04:24:25|pg_dump-[INFO]:-reading procedural languages

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined aggregate functions

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined operators

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined external protocols

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined operator classes

20190417:04:24:25|pg_dump-[INFO]:-reading user-defined operator families

pg_dump: reading user-defined text search parsers

pg_dump: reading user-defined text search templates

pg_dump: reading user-defined text search dictionaries

pg_dump: reading user-defined text search configurations

20190417:04:24:26|pg_dump-[INFO]:-reading user-defined conversions

20190417:04:24:26|pg_dump-[INFO]:-reading type casts

20190417:04:24:26|pg_dump-[INFO]:-reading table inheritance information

pg_dump: finding extension tables

20190417:04:24:26|pg_dump-[INFO]:-reading rewrite rules

20190417:04:24:26|pg_dump-[INFO]:-finding inheritance relationships

20190417:04:24:26|pg_dump-[INFO]:-reading column info for interesting tables

pg_dump: finding the columns and types of table "earthquakes_subset_with_geom"

20190417:04:24:26|pg_dump-[INFO]:-flagging inherited columns in subtables

20190417:04:24:26|pg_dump-[INFO]:-reading indexes

20190417:04:24:26|pg_dump-[INFO]:-reading constraints

20190417:04:24:26|pg_dump-[INFO]:-reading triggers

pg_dump: reading dependency data

pg_dump: saving encoding = UTF8

pg_dump: saving standard_conforming_strings = on

pg_dump: dumping contents of table earthquakes_subset_with_geom

[gpadmin@mdw tmp]$
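Since the host, port, user, and table have to be adapted to each environment, it can help to see the pg_dump call split into its parts. The Python sketch below only assembles and prints the command line; it does not run pg_dump. build_pg_dump_command is a hypothetical helper, and the flag comments reflect the standard pg_dump options as I understand them.

```python
import shlex


def build_pg_dump_command(host, port, user, table, outfile, dbname):
    """Assemble the pg_dump argument list used in this post."""
    return [
        "pg_dump",
        "-h", host,        # database server host
        "-p", str(port),   # database server port
        "-U", user,        # user to connect as
        "-t", table,       # dump only this table
        "-c",              # emit commands to drop the objects before recreating them
        "-F", "c",         # custom archive format, suitable for pg_restore
        "-v",              # verbose progress messages
        "-b",              # include large objects
        "-f", outfile,     # output file
        dbname,            # database to dump from
    ]


cmd = build_pg_dump_command(
    "localhost", 5432, "gpadmin",
    "data_import.earthquakes_subset_with_geom",
    "earthquakes_subset_with_geom.backup",
    "gpadmin",
)
print(shlex.join(cmd))
```

The printed line reproduces the command shown above; the list form could be passed to subprocess.run() on a machine where pg_dump is installed.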

(2) Restoring the backup table with pg_restore

Let's drop the existing data_import.earthquakes_subset_with_geom table and then load the data backed up in step (1).

First, drop the table.

-- (2) (in the DBeaver DB tool) drop table

DROP TABLE data_import.earthquakes_subset_with_geom;

Now that the table has been dropped, let's load the data backed up in step (1) and recreate the table (Restore a Backup table).

-- (3) (at the command prompt) Restore using pg_restore

[gpadmin@mdw tmp]$ pg_restore -h localhost -p 5432 -U gpadmin -v -d gpadmin earthquakes_subset_with_geom.backup

pg_restore: connecting to database for restore

pg_restore: creating TABLE earthquakes_subset_with_geom

pg_restore: restoring data for table "earthquakes_subset_with_geom"

pg_restore: setting owner and privileges for TABLE earthquakes_subset_with_geom

[gpadmin@mdw tmp]$

Let's check in the DBeaver DB tool that the backup was loaded and the table was created.

-- (4) (in the DBeaver DB tool) query the table to confirm the restore

SELECT * FROM data_import.earthquakes_subset_with_geom LIMIT 10;

The backup table was restored successfully.

I hope this was helpful.


[Greenplum DB] Importing raster data into PostGIS with raster2pgsql

Greenplum and PostgreSQL Database 2019. 4. 17. 20:44

This post shows how to import raster data into PostGIS on PostgreSQL or Greenplum DB using the raster2pgsql utility.

The example data is 'GRAY_50M_SR_OB.tif', a raster-format TIFF (Tagged Image File Format) file introduced in 'Mastering PostGIS' (by Dominik Mikiewicz et al.).

The raw data consists of combinations of four-character groups of digits and letters.

Previewing this tif file in a file explorer shows a gray image of the world map.

First, at a command prompt, copy the downloaded 'GRAY_50M_SR_OB.tif' file into the tmp folder of the Greenplum docker container with the docker cp command.

-- (at the command prompt) copy 'GRAY_50M_SR_OB.tif' file to GPDB docker

$ docker cp /Users/ihongdon/Documents/PostGIS/data/GRAY_50M_SR_OB/GRAY_50M_SR_OB.tif gpdb-ds:/tmp

In another command prompt window, log in to the Docker GPDB with the gpadmin account and check that the file was copied.

-- (at the docker gpdb command prompt) check that the file was copied

[gpadmin@mdw tmp]$ ls -la

total 123532

drwxrwxrwt 1 root root 4096 Apr 10 13:13 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

-rw-r--r-- 1 501 games 58405694 Apr 8 06:30 GRAY_50M_SR_OB.tif

[gpadmin@mdw tmp]$

At the GPDB gpadmin command prompt, use the gdalinfo command to view the metadata of the TIFF raster file.

-- (at the command prompt) view raster file metadata: gdalinfo

[gpadmin@mdw tmp]$ gdalinfo GRAY_50M_SR_OB.tif

Driver: GTiff/GeoTIFF

Files: GRAY_50M_SR_OB.tif

Size is 10800, 5400

Coordinate System is:

GEOGCS["WGS 84",

DATUM["WGS_1984",

SPHEROID["WGS 84",6378137,298.257223563,

AUTHORITY["EPSG","7030"]],

AUTHORITY["EPSG","6326"]],

PRIMEM["Greenwich",0],

UNIT["degree",0.0174532925199433],

AUTHORITY["EPSG","4326"]]

Origin = (-179.999999999999972,90.000000000000000)

Pixel Size = (0.033333333333330,-0.033333333333330)

Metadata:

AREA_OR_POINT=Area

TIFFTAG_DATETIME=2014:10:18 09:28:20

TIFFTAG_RESOLUTIONUNIT=2 (pixels/inch)

TIFFTAG_SOFTWARE=Adobe Photoshop CC 2014 (Macintosh)

TIFFTAG_XRESOLUTION=342.85699

TIFFTAG_YRESOLUTION=342.85699

Image Structure Metadata:

INTERLEAVE=BAND

Corner Coordinates:

Upper Left (-180.0000000, 90.0000000) (180d 0' 0.00"W, 90d 0' 0.00"N)

Lower Left (-180.0000000, -90.0000000) (180d 0' 0.00"W, 90d 0' 0.00"S)

Upper Right ( 180.0000000, 90.0000000) (180d 0' 0.00"E, 90d 0' 0.00"N)

Lower Right ( 180.0000000, -90.0000000) (180d 0' 0.00"E, 90d 0' 0.00"S)

Center ( -0.0000000, 0.0000000) ( 0d 0' 0.00"W, 0d 0' 0.00"N)

Band 1 Block=10800x1 Type=Byte, ColorInterp=Gray

[gpadmin@mdw tmp]$
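The numbers in the gdalinfo output are tied together: the origin is the upper-left corner, and adding the raster size multiplied by the pixel size gives the lower-right corner. The short Python check below uses only values copied from the output above; the variable names are my own.

```python
# Values copied from the gdalinfo output above.
width, height = 10800, 5400                      # "Size is 10800, 5400"
origin_x, origin_y = -179.999999999999972, 90.0  # "Origin" (upper-left corner)
pixel_w, pixel_h = 0.033333333333330, -0.033333333333330  # "Pixel Size"

# Lower-right corner = origin + (number of pixels * pixel size).
lower_right_x = origin_x + width * pixel_w
lower_right_y = origin_y + height * pixel_h
print(round(lower_right_x, 6), round(lower_right_y, 6))
```

The result agrees, up to rounding, with the "Lower Right" corner (180, -90) reported by gdalinfo. The negative pixel height reflects that row numbers grow downward while latitude decreases.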

We will cover the raster2pgsql utility in two parts: (1) importing a single raster dataset and (2) importing several raster datasets at once.

(1) Importing a single raster dataset with raster2pgsql

Importing the 'GRAY_50M_SR_OB.tif' file with raster2pgsql at the command prompt, as below, creates the 'gray_50m_sr_ob' table. Alongside it, overview tables named 'o_2_gray_50m_sr_ob' and 'o_4_gray_50m_sr_ob' are created. (The SQL statements shown below are generated and executed.)

----------------------------------------------------------------------------------------

[ raster2pgsql options ]
-G: print the list of GDAL formats supported by the utility
-s: set the SRID of the imported raster data
-t: tile size, given as width x height
-P: pad the right-most and bottom-most tiles so that all tiles have the same dimensions
-d: drop and recreate the table (Drops and creates a table)
-a: append data to an existing table (Appends data to an existing table)
-c: create a new table (Creates a new table)
-p: turn on prepare mode (only the table is created; no data is imported)
-F: add a column with the name of the raster file
-l: create overview tables for the comma-separated factors (named o_<overview factor>_<table name>)
-I: create a GiST spatial index on the raster column
-C: set the standard constraints on the raster column after the raster is imported

* reference: https://postgis.net/docs/using_raster_dataman.html
----------------------------------------------------------------------------------------
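The overview table names mentioned above follow a simple pattern: the prefix 'o_', the overview factor, and the name of the main table. A small Python sketch of that naming rule, where overview_table_names is a hypothetical helper and not part of raster2pgsql:

```python
def overview_table_names(table, factors):
    """Build overview table names in the o_<factor>_<table> pattern."""
    return [f"o_{factor}_{table}" for factor in factors]


# Factors 2 and 4 correspond to the '-l 2,4' option used in this post.
names = overview_table_names("gray_50m_sr_ob", [2, 4])
print(names)
```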

(After '| psql', enter the host, port, user, and database name for your own DB environment.)

-- (at the command prompt) Import a single raster dataset using raster2pgsql

[gpadmin@mdw tmp]$ raster2pgsql -s 4326 -C -l 2,4 -F -t 2700x2700 GRAY_50M_SR_OB.tif data_import.gray_50m_sr_ob | psql -h localhost -p 5432 -U gpadmin -d gpadmin

Processing 1/1: GRAY_50M_SR_OB.tif

BEGIN

NOTICE: CREATE TABLE will create implicit sequence "gray_50m_sr_ob_rid_seq" for serial column "gray_50m_sr_ob.rid"

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "gray_50m_sr_ob_pkey" for table "gray_50m_sr_ob"

CREATE TABLE

NOTICE: CREATE TABLE will create implicit sequence "o_2_gray_50m_sr_ob_rid_seq" for serial column "o_2_gray_50m_sr_ob.rid"

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "o_2_gray_50m_sr_ob_pkey" for table "o_2_gray_50m_sr_ob"

CREATE TABLE

NOTICE: CREATE TABLE will create implicit sequence "o_4_gray_50m_sr_ob_rid_seq" for serial column "o_4_gray_50m_sr_ob.rid"

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "o_4_gray_50m_sr_ob_pkey" for table "o_4_gray_50m_sr_ob"

CREATE TABLE

INSERT 0 1

NOTICE: Adding SRID constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding alignment constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding number of bands constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding pixel type constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding nodata value constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding out-of-database constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding maximum extent constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

addrasterconstraints

----------------------

t

(1 row)

NOTICE: Adding SRID constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding alignment constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding number of bands constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding pixel type constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding nodata value constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding out-of-database constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding maximum extent constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

addrasterconstraints

----------------------

t

(1 row)

NOTICE: Adding SRID constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding alignment constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding number of bands constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding pixel type constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding nodata value constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding out-of-database constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding maximum extent constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

addrasterconstraints

----------------------

t

(1 row)

addoverviewconstraints

------------------------

t

(1 row)

addoverviewconstraints

------------------------

t

(1 row)

COMMIT

[gpadmin@mdw tmp]$

Let's query the data_import.gray_50m_sr_ob table in the DBeaver DB tool.

SELECT * FROM data_import.gray_50m_sr_ob LIMIT 10;

(2) Importing several raster datasets at once with raster2pgsql

To build an example, let's use gdalwarp to split the original 'GRAY_50M_SR_OB.tif' raster dataset into four raster datasets.

-- split into four parts using gdalwarp utility

[gpadmin@mdw tmp]$ gdalwarp -s_srs EPSG:4326 -t_srs EPSG:4326 -te -180 -90 0 0 GRAY_50M_SR_OB.tif gray_50m_partial_bl.tif

Creating output file that is 5400P x 2700L.

Processing GRAY_50M_SR_OB.tif [1/1] : 0...10...20...30...40...50...60...70...80...90...100 - done.

[gpadmin@mdw tmp]$ gdalwarp -s_srs EPSG:4326 -t_srs EPSG:4326 -te -180 0 0 90 GRAY_50M_SR_OB.tif gray_50m_partial_tl.tif

Creating output file that is 5400P x 2700L.

Processing GRAY_50M_SR_OB.tif [1/1] : 0...10...20...30...40...50...60...70...80...90...100 - done.

[gpadmin@mdw tmp]$

[gpadmin@mdw tmp]$ gdalwarp -s_srs EPSG:4326 -t_srs EPSG:4326 -te 0 -90 180 0 GRAY_50M_SR_OB.tif gray_50m_partial_br.tif

Creating output file that is 5400P x 2700L.

Processing GRAY_50M_SR_OB.tif [1/1] : 0...10...20...30...40...50...60...70...80...90...100 - done.

[gpadmin@mdw tmp]$

[gpadmin@mdw tmp]$ gdalwarp -s_srs EPSG:4326 -t_srs EPSG:4326 -te 0 0 180 90 GRAY_50M_SR_OB.tif gray_50m_partial_tr.tif

Creating output file that is 5400P x 2700L.

Processing GRAY_50M_SR_OB.tif [1/1] : 0...10...20...30...40...50...60...70...80...90...100 - done.

[gpadmin@mdw tmp]$

[gpadmin@mdw tmp]$ ls -la

total 180572

drwxrwxrwt 1 root root 4096 Apr 16 10:35 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

-rw-r--r-- 1 501 games 58405694 Apr 8 06:30 GRAY_50M_SR_OB.tif

-rw-rw-r-- 1 gpadmin gpadmin 14602098 Apr 16 10:34 gray_50m_partial_bl.tif

-rw-rw-r-- 1 gpadmin gpadmin 14602098 Apr 16 10:35 gray_50m_partial_br.tif

-rw-rw-r-- 1 gpadmin gpadmin 14602098 Apr 16 10:34 gray_50m_partial_tl.tif

-rw-rw-r-- 1 gpadmin gpadmin 14602098 Apr 16 10:35 gray_50m_partial_tr.tif

[gpadmin@mdw tmp]$
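Each gdalwarp call above uses '-te xmin ymin xmax ymax' to cut out one quadrant of the world map. As a rough check, dividing the extent of a quadrant by the pixel size reported by gdalinfo gives the output size that gdalwarp printed. The Python sketch below is only this arithmetic, not a description of how gdalwarp works internally; QUADRANTS and output_size are illustrative names.

```python
PIXEL_SIZE = 0.033333333333330  # degrees per pixel, from the gdalinfo output

# '-te xmin ymin xmax ymax' values used in the four gdalwarp calls above.
QUADRANTS = {
    "bl": (-180, -90, 0, 0),   # bottom-left
    "tl": (-180, 0, 0, 90),    # top-left
    "br": (0, -90, 180, 0),    # bottom-right
    "tr": (0, 0, 180, 90),     # top-right
}


def output_size(extent, pixel_size=PIXEL_SIZE):
    """Return (pixels, lines) covered by an extent at the given pixel size."""
    xmin, ymin, xmax, ymax = extent
    return round((xmax - xmin) / pixel_size), round((ymax - ymin) / pixel_size)


sizes = {name: output_size(extent) for name, extent in QUADRANTS.items()}
print(sizes)
```

Every quadrant comes out as 5400 pixels by 2700 lines, matching the "Creating output file that is 5400P x 2700L." messages.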

Now let's use the '*' wildcard in the file name, as in 'gray_50m_partial*.tif', so that raster2pgsql imports, all at once, every file whose name matches outside the '*' part, whatever the '*' part contains. (After '| psql', enter the host, port, user, and database name for your own DB environment.)

-- (at the command prompt) Importing multiple rasters at once

[gpadmin@mdw tmp]$ raster2pgsql -s 4326 -C -F -t 2700x2700 gray_50m_partial*.tif data_import.gray_50m_partial | psql -h localhost -p 5432 -U gpadmin -d gpadmin

Processing 1/4: gray_50m_partial_bl.tif

BEGIN

NOTICE: CREATE TABLE will create implicit sequence "gray_50m_partial_rid_seq" for serial column "gray_50m_partial.rid"

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "gray_50m_partial_pkey" for table "gray_50m_partial"

CREATE TABLE

INSERT 0 1

Processing 2/4: gray_50m_partial_br.tif

INSERT 0 1

Processing 3/4: gray_50m_partial_tl.tif

INSERT 0 1

Processing 4/4: gray_50m_partial_tr.tif

INSERT 0 1

NOTICE: Adding SRID constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding scale-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-X constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding blocksize-Y constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding alignment constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding number of bands constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding pixel type constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding nodata value constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding out-of-database constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

NOTICE: Adding maximum extent constraint

CONTEXT: SQL statement "SELECT AddRasterConstraints( $1 , $2 , $3 , VARIADIC $4 )"

PL/pgSQL function "addrasterconstraints" line 52 at RETURN

addrasterconstraints

----------------------

t

(1 row)

COMMIT

[gpadmin@mdw tmp]$
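The shell expands the 'gray_50m_partial*.tif' pattern into the matching file names before raster2pgsql sees them. The Python sketch below mimics that matching with the standard-library fnmatch module on the file names from the directory listing above; it is an illustration of the wildcard, not of raster2pgsql itself.

```python
from fnmatch import fnmatchcase

# File names taken from the 'ls -la' listing above.
files = [
    "GRAY_50M_SR_OB.tif",
    "gray_50m_partial_tr.tif",
    "gray_50m_partial_bl.tif",
    "gray_50m_partial_tl.tif",
    "gray_50m_partial_br.tif",
]

# Case-sensitive match, sorted alphabetically as a shell glob would be.
matched = sorted(f for f in files if fnmatchcase(f, "gray_50m_partial*.tif"))
print(matched)
```

The four partial files match in the order bl, br, tl, tr, which is the same order as the "Processing 1/4" to "Processing 4/4" lines in the log, while the original 'GRAY_50M_SR_OB.tif' is left out.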

Let's query the 'data_import.gray_50m_partial' table in the DBeaver DB tool. The last column, 'filename', shows that the four partial files 'gray_50m_partial_bl.tif', 'gray_50m_partial_br.tif', 'gray_50m_partial_tl.tif', and 'gray_50m_partial_tr.tif' were loaded.

-- (in the DBeaver DB tool) query the raster table
SELECT * FROM data_import.gray_50m_partial LIMIT 10;

많은 도움이 되었기를 바랍니다.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend


[Greenplum DB] PostGIS: Importing geospatial vector data with ogr2ogr (GML, MIF, KML formats)

Greenplum and PostgreSQL Database 2019. 4. 11. 13:22

This post shows how to import vector data in several geospatial formats (GML, MapInfo MIF & TAB, and KML) into PostgreSQL and Greenplum DB with GDAL's ogr2ogr tool.

ogr2ogr is the vector conversion utility of GDAL (Geospatial Data Abstraction Library). For the source download and installation, see the links below.

GDAL download: https://trac.osgeo.org/gdal/wiki/DownloadSource
GDAL installation guide: https://trac.osgeo.org/gdal/wiki/BuildingOnUnix

For reference, I first installed GDAL 1.x and got the error below, saying that the driver needed for the import could not be found. Reinstalling with GDAL 2.4.1, the latest version at the time, solved the problem. (Thanks Jack!)

ERROR 1: Unable to find driver `PostgreSQL'.
  The following drivers are available:
    -> `PCIDSK' -> `JP2OpenJPEG'
    -> `PDF' -> `ESRI Shapefile'
    -> `MapInfo File' -> `UK .NTF'
    -> `OGR_SDTS' -> `S57'
    -> `DGN' -> `OGR_VRT'
    -> `REC' -> `Memory'
    -> `BNA' -> `CSV'
    -> `GML' -> `GPX'
    -> `KML' -> `GeoJSON'

:

(1) Importing geospatial vector data in GML format

The sample data used in this post (the sx9090.gml address data) and the example code follow 'Mastering PostGIS' (by Dominik Mikiewicz et al.).

sx9090.gml download: https://github.com/PacktPublishing/Mastering-PostGIS/tree/master/Chapter01/data/os-addressbase-gml-sample-data

GML is a file format that stores geometry data as XML (eXtensible Markup Language), as you can see by opening sx9090.gml.

To install Greenplum DB with docker and then install and start PostGIS, see https://rfriend.tistory.com/435.

Once you have downloaded the sample data, let's get started.

First, from a command prompt, copy sx9090.gml into the /tmp directory of the Greenplum DB container (with PostGIS) using docker cp.

-- (at the command prompt) copy sx9090.gml into the docker gpdb container

ihongdon-ui-MacBook-Pro:data ihongdon$ docker cp /Users/ihongdon/Documents/PostGIS/data/os-addressbase-gml-sample-data/sx9090.gml gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:data ihongdon$

From another command prompt, logged in to the Greenplum DB container as gpadmin, list the tmp directory to confirm that sx9090.gml was copied.

-- (at the docker gpdb command prompt) importing GML data

[gpadmin@mdw gdal-2.4.1]$ cd /tmp

[gpadmin@mdw tmp]$ ls

2.5_day_age.kml a.sql gdal-2.4.1 hsperfdata_root ne_110m_coastline.dbf ne_110m_coastline.shx sx9090.gml

[gpadmin@mdw tmp]$

Use the ogrinfo utility at the command prompt to look at the metadata of the sx9090.gml dataset. It is an Ordnance Survey address dataset produced by GeoPlace in 2015.

[gpadmin@mdw tmp]$ ogrinfo sx9090.gml

INFO: Open of `sx9090.gml'

using driver `GML' successful.

Metadata:

1: Address (Point)

[gpadmin@mdw tmp]$

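Now use ogrinfo to look at the layer '1: Address (Point)' in more detail. (The -so option asks ogrinfo for summary information only.)

The dataset is keyed by gml_id (the only NOT NULL field) and, per the output below, has 24 attribute fields.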

[gpadmin@mdw tmp]$ ogrinfo sx9090.gml Address -so

INFO: Open of `sx9090.gml'

using driver `GML' successful.

Metadata:

Layer name: Address

Geometry: Point

Feature Count: 42861

Extent: (-3.560100, 50.699470) - (-3.488340, 50.744770)

Layer SRS WKT:

GEOGCS["ETRS89",

DATUM["European_Terrestrial_Reference_System_1989",

SPHEROID["GRS 1980",6378137,298.257222101,

AUTHORITY["EPSG","7019"]],

TOWGS84[0,0,0,0,0,0,0],

AUTHORITY["EPSG","6258"]],

PRIMEM["Greenwich",0,

AUTHORITY["EPSG","8901"]],

UNIT["degree",0.0174532925199433,

AUTHORITY["EPSG","9122"]],

AUTHORITY["EPSG","4258"]]

gml_id: String (0.0) NOT NULL

uprn: Real (0.0)

osAddressTOID: String (20.0)

udprn: Integer (0.0)

subBuildingName: String (25.0)

buildingName: String (36.0)

thoroughfare: String (27.0)

postTown: String (6.0)

postcode: String (7.0)

postcodeType: String (1.0)

rpc: Integer (0.0)

country: String (1.0)

changeType: String (1.0)

laStartDate: String (10.0)

rmStartDate: String (10.0)

lastUpdateDate: String (10.0)

class: String (1.0)

buildingNumber: Integer (0.0)

dependentLocality: String (27.0)

organisationName: String (55.0)

dependentThoroughfare: String (27.0)

poBoxNumber: Integer (0.0)

doubleDependentLocality: String (21.0)

departmentName: String (37.0)

[gpadmin@mdw tmp]$

Now import the sx9090.gml dataset into PostgreSQL, Greenplum DB with ogr2ogr. The options used in the command are explained below; for the other ogr2ogr options, see https://www.gdal.org/ogr2ogr.html. Set host, port, user, and dbname in the connection string to match your own database.

-f : output format; use -f "PostgreSQL" when importing into PostGIS.
-nln : schema and table name to import into (e.g., the osgb_address_base_gml table in the data_import schema)
-geomfield : name of the geometry field on which the spatial filter operates

[gpadmin@mdw tmp]$ ogr2ogr -f "PostgreSQL" \
PG:"host=localhost port=5432 user=gpadmin dbname=gpadmin" \
sx9090.gml \
-nln data_import.osgb_address_base_gml -geomfield geom

[gpadmin@mdw tmp]$
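The ogr2ogr command above can also be assembled programmatically, which helps when the same import is repeated for several files. The following is a minimal Python sketch under the connection values used in this post. build_ogr2ogr_command is a hypothetical helper (not part of GDAL), it only uses options that appear in this post, and it only builds and prints the command without running ogr2ogr.

```python
import shlex


def build_ogr2ogr_command(source, table, *, host="localhost", port=5432,
                          user="gpadmin", dbname="gpadmin", extra=()):
    """Return the ogr2ogr argument list for a PostgreSQL/Greenplum import.

    Hypothetical helper: the defaults mirror the connection values in this
    post and should be replaced with your own.
    """
    # As a single argv entry this equals PG:"host=... dbname=..." in the shell.
    conn = f"PG:host={host} port={port} user={user} dbname={dbname}"
    return ["ogr2ogr", "-f", "PostgreSQL", conn, source, "-nln", table, *extra]


cmd = build_ogr2ogr_command(
    "sx9090.gml",
    "data_import.osgb_address_base_gml",
    extra=("-geomfield", "geom"),
)

# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it, assuming ogr2ogr is on the PATH of the machine running the script.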

Now query the data_import.osgb_address_base_gml table in the DBeaver query tool.

SELECT * FROM data_import.osgb_address_base_gml ORDER BY gml_id LIMIT 10;

SELECT gml_id, uprn, osaddresstoid, wkb_geometry

FROM data_import.osgb_address_base_gml

ORDER BY gml_id

LIMIT 10;

(2) Importing a MIF (MapInfo format) dataset into PostGIS with the ogr2ogr utility

Next is a MIF (MapInfo format) dataset. The procedure is the same as for the GML data above. First, from a command prompt, copy the MIF file named EX_sample.mif into the docker GPDB container with docker cp. (If you run GPDB in a VM, copy the file with scp.)

-- (at the command prompt) copy the MIF file into docker gpdb

ihongdon-ui-MacBook-Pro:data ihongdon$ docker cp /Users/ihongdon/Documents/PostGIS/data/os-code-point-polygons-mif-sample-data/EX_sample.mif gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:data ihongdon$

From another command prompt, log in to docker GPDB as gpadmin and list the files in /tmp to confirm that EX_sample.mif was copied.

--

[gpadmin@mdw]$ cd /tmp

[gpadmin@mdw tmp]$ ls -la

total 50844

drwxrwxrwt 1 root root 4096 Apr 10 01:42 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

drwxrwxrwt 2 root root 4096 Sep 11 2017 .ICE-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .Test-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .X11-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .XIM-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .font-unix

srwxrwxr-x 1 gpadmin gpadmin 0 Mar 22 07:19 .s.GPMC.sock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 01:20 .s.PGSQL.40000

-rw------- 1 gpadmin gpadmin 25 Apr 10 01:20 .s.PGSQL.40000.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 01:20 .s.PGSQL.40001

-rw------- 1 gpadmin gpadmin 25 Apr 10 01:20 .s.PGSQL.40001.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 01:20 .s.PGSQL.5432

-rw------- 1 gpadmin gpadmin 25 Apr 10 01:20 .s.PGSQL.5432.lock

-rw-r--r-- 1 501 games 3624013 Apr 8 06:12 EX_sample.mif

[gpadmin@mdw tmp]$

Use the ogrinfo utility at the command prompt to look at the metadata and summary of EX_sample.mif.

[gpadmin@mdw tmp]$ ogrinfo ./EX_sample.mif

INFO: Open of `./EX_sample.mif'

using driver `MapInfo File' successful.

1: EX_sample

[gpadmin@mdw tmp]$

[gpadmin@mdw tmp]$ ogrinfo ./EX_sample.mif EX_Sample -so

INFO: Open of `./EX_sample.mif'

using driver `MapInfo File' successful.

Layer name: EX_sample

Geometry: Unknown (any)

Feature Count: 4142

Extent: (281282.800000, 85614.570000) - (300012.000000, 100272.000000)

Layer SRS WKT:

PROJCS["unnamed",

GEOGCS["unnamed",

DATUM["OSGB_1936",

SPHEROID["Airy 1930",6377563.396,299.3249646],

TOWGS84[375,-111,431,0,0,0,0]],

PRIMEM["Greenwich",0],

UNIT["degree",0.0174532925199433]],

PROJECTION["Transverse_Mercator"],

PARAMETER["latitude_of_origin",49],

PARAMETER["central_meridian",-2],

PARAMETER["scale_factor",0.9996012717],

PARAMETER["false_easting",400000],

PARAMETER["false_northing",-100000],

UNIT["Meter",1.0]]

POSTCODE: String (8.0)

UPP: String (20.0)

PC_AREA: String (2.0)

[gpadmin@mdw tmp]$

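Everything is ready, so import the EX_sample.mif dataset into Greenplum DB with ogr2ogr under the name data_import.osgb_code_point_polygons_mif. (Replace the DB settings inside PG:"xxxx" with your own.)

-lco GEOMETRY_NAME : layer creation option that sets the name of the geometry column (default: wkb_geometry)
-s_srs : source SRS (overrides the SRS of the input)
-a_srs : SRS assigned to the output (no reprojection is performed)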

[gpadmin@mdw tmp]$ ogr2ogr -f "PostgreSQL" PG:"host=localhost port=5432 user=gpadmin dbname=gpadmin" EX_sample.mif -nln data_import.osgb_code_point_polygons_mif -lco GEOMETRY_NAME=geom -a_srs EPSG:27700

[gpadmin@mdw tmp]$

Querying the data_import.osgb_code_point_polygons_mif table in a DB query tool shows that the table was created and contains POLYGON geometries.

-- query in DBeaver

SELECT * FROM data_import.osgb_code_point_polygons_mif ORDER BY ogc_fid LIMIT 10;

(3) Importing a KML (Keyhole Markup Language) dataset into PostgreSQL, Greenplum DB with the ogr2ogr utility

KML (Keyhole Markup Language) is an XML-based geospatial data format that Google Earth can visualize in 2D or 3D in a web browser.

KML data is also imported into PostgreSQL, Greenplum DB with GDAL's ogr2ogr utility.

First, from a command prompt, copy the '2.5_day_age.kml' dataset into the Greenplum DB docker container with docker cp.

-- (1) Copy '2.5_day_age.kml' file to GPDB

ihongdon-ui-MacBook-Pro:~ ihongdon$ docker cp /Users/ihongdon/Documents/PostGIS/data/usgs-earthquakes/2.5_day_age.kml gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:~ ihongdon$

Next, from another command prompt, log in as the Greenplum gpadmin user and check that the file was copied.

-- (2) (at the GPDB command prompt) ogrinfo => there are 4 layers

[gpadmin@mdw tmp]$ ls -la

total 123532

drwxrwxrwt 1 root root 4096 Apr 10 13:13 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

drwxrwxrwt 2 root root 4096 Sep 11 2017 .ICE-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .Test-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .X11-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .XIM-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .font-unix

srwxrwxr-x 1 gpadmin gpadmin 0 Mar 22 07:19 .s.GPMC.sock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 16 05:36 .s.PGSQL.40000

-rw------- 1 gpadmin gpadmin 27 Apr 16 05:36 .s.PGSQL.40000.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 16 05:36 .s.PGSQL.40001

-rw------- 1 gpadmin gpadmin 27 Apr 16 05:36 .s.PGSQL.40001.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 16 05:36 .s.PGSQL.5432

-rw------- 1 gpadmin gpadmin 27 Apr 16 05:36 .s.PGSQL.5432.lock

-rw-r--r-- 1 gpadmin gpadmin 4787 Apr 8 06:21 2.5_day.csv

-rw-r--r-- 1 501 games 30548 Apr 8 06:21 2.5_day_age.kml

[gpadmin@mdw tmp]$

Check the metadata of '2.5_day_age.kml' with ogrinfo. It is a KML geospatial dataset with four layers, each holding 3D Point geometries.

-- (3) metadata info.

[gpadmin@mdw tmp]$ ogrinfo 2.5_day_age.kml

INFO: Open of `2.5_day_age.kml'

using driver `KML' successful.

1: Magnitude 5 (3D Point)

2: Magnitude 4 (3D Point)

3: Magnitude 3 (3D Point)

4: Magnitude 2 (3D Point)

[gpadmin@mdw tmp]$

Run ogrinfo 2.5_day_age.kml -al -so to see the details of all four layers at once.

-- (4) review metadata for each layer at once in depth

[gpadmin@mdw tmp]$ ogrinfo 2.5_day_age.kml -al -so

INFO: Open of `2.5_day_age.kml'

using driver `KML' successful.

Layer name: Magnitude 5

Geometry: 3D Point

Feature Count: 2

Extent: (-101.000100, -36.056300) - (120.706400, 13.588200)

Layer SRS WKT:

GEOGCS["WGS 84",

DATUM["WGS_1984",

SPHEROID["WGS 84",6378137,298.257223563,

AUTHORITY["EPSG","7030"]],

AUTHORITY["EPSG","6326"]],

PRIMEM["Greenwich",0,

AUTHORITY["EPSG","8901"]],

UNIT["degree",0.0174532925199433,

AUTHORITY["EPSG","9122"]],

AUTHORITY["EPSG","4326"]]

Name: String (0.0)

Description: String (0.0)

Layer name: Magnitude 4

Geometry: 3D Point

Feature Count: 8

Extent: (-93.869400, -30.966800) - (127.154100, 41.012000)

Layer SRS WKT:

GEOGCS["WGS 84",

DATUM["WGS_1984",

SPHEROID["WGS 84",6378137,298.257223563,

AUTHORITY["EPSG","7030"]],

AUTHORITY["EPSG","6326"]],

PRIMEM["Greenwich",0,

AUTHORITY["EPSG","8901"]],

UNIT["degree",0.0174532925199433,

AUTHORITY["EPSG","9122"]],

AUTHORITY["EPSG","4326"]]

Name: String (0.0)

Description: String (0.0)

Layer name: Magnitude 3

Geometry: 3D Point

Feature Count: 6

Extent: (-155.372167, 18.242700) - (-64.691100, 36.431400)

Layer SRS WKT:

GEOGCS["WGS 84",

DATUM["WGS_1984",

SPHEROID["WGS 84",6378137,298.257223563,

AUTHORITY["EPSG","7030"]],

AUTHORITY["EPSG","6326"]],

PRIMEM["Greenwich",0,

AUTHORITY["EPSG","8901"]],

UNIT["degree",0.0174532925199433,

AUTHORITY["EPSG","9122"]],

AUTHORITY["EPSG","4326"]]

Name: String (0.0)

Description: String (0.0)

Layer name: Magnitude 2

Geometry: 3D Point

Feature Count: 9

Extent: (-154.990005, 17.871900) - (-65.022300, 63.207400)

Layer SRS WKT:

GEOGCS["WGS 84",

DATUM["WGS_1984",

SPHEROID["WGS 84",6378137,298.257223563,

AUTHORITY["EPSG","7030"]],

AUTHORITY["EPSG","6326"]],

PRIMEM["Greenwich",0,

AUTHORITY["EPSG","8901"]],

UNIT["degree",0.0174532925199433,

AUTHORITY["EPSG","9122"]],

AUTHORITY["EPSG","4326"]]

Name: String (0.0)

Description: String (0.0)

[gpadmin@mdw tmp]$

Finally, import the KML file into PostgreSQL, Greenplum DB with the ogr2ogr utility at the command prompt. (Change host, port, user, and dbname to match your own DB.)

The trailing '-append' option tells ogr2ogr to read the four layers of '2.5_day_age.kml' one at a time and append each one to the data already imported. (Without '-append', you get an error saying that the table already exists.) The warnings shown below are expected; if you see them, the import went through.

-- (5) Import KML dataset to GPDB

[gpadmin@mdw tmp]$ ogr2ogr -f "PostgreSQL" PG:"host=localhost port=5432 user=gpadmin dbname=gpadmin" 2.5_day_age.kml -nln data_import.usgs_earthquakes_kml -lco GEOMETRY_NAME=geom -append

Warning 1: Layer creation options ignored since an existing layer is

being appended to.

Warning 1: Layer creation options ignored since an existing layer is

being appended to.

Warning 1: Layer creation options ignored since an existing layer is

being appended to.

[gpadmin@mdw tmp]$

Now that the data is imported, query it with SQL in the DBeaver DB tool.

-- (in the DBeaver tool) Select KML dataset
SELECT * FROM data_import.usgs_earthquakes_kml LIMIT 10;
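One quick sanity check on the '-append' import is to add up the Feature Count values that ogrinfo reported for the four layers and compare the total with the row count of data_import.usgs_earthquakes_kml. Here is a small Python sketch of that arithmetic. The text is an excerpt of the ogrinfo output above, feature_counts is a hypothetical helper, and the comparison assumes every feature of every layer was appended to the table.

```python
# Excerpt of the `ogrinfo 2.5_day_age.kml -al -so` output shown earlier.
OGRINFO_SUMMARY = """\
Layer name: Magnitude 5
Geometry: 3D Point
Feature Count: 2
Layer name: Magnitude 4
Geometry: 3D Point
Feature Count: 8
Layer name: Magnitude 3
Geometry: 3D Point
Feature Count: 6
Layer name: Magnitude 2
Geometry: 3D Point
Feature Count: 9
"""


def feature_counts(summary_text):
    """Map each layer name to its feature count (hypothetical helper)."""
    counts = {}
    layer = None
    for line in summary_text.splitlines():
        line = line.strip()
        if line.startswith("Layer name:"):
            layer = line.split(":", 1)[1].strip()
        elif line.startswith("Feature Count:") and layer is not None:
            counts[layer] = int(line.split(":", 1)[1])
    return counts


counts = feature_counts(OGRINFO_SUMMARY)
total = sum(counts.values())
print(counts)
print("expected rows after import:", total)
```

The total (2 + 8 + 6 + 9 = 25) is the number you would expect from SELECT count(*) FROM data_import.usgs_earthquakes_kml; on a fresh table.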

As mentioned at the start of this section, KML data can be visualized in 2D and 3D in Google Earth. You can open '2.5_day_age.kml' in the Google Earth application to see it for yourself.

I hope this was helpful.


[Greenplum DB] Importing geospatial shape files (SHP, SHX, DBF, PRJ) into PostgreSQL, Greenplum DB

Greenplum and PostgreSQL Database 2019. 4. 11. 00:41

This post shows how to import shape files, which consist of SHP, SHX, DBF, and PRJ files, into PostgreSQL and Greenplum DB with the shp2pgsql tool.

shp2pgsql ships with PostGIS, so it is available as soon as PostGIS is installed.

The example dataset consists of the four files 'ne_110m_coastline.shp', 'ne_110m_coastline.shx', 'ne_110m_coastline.dbf', and 'ne_110m_coastline.prj', downloaded from https://github.com/PacktPublishing/Mastering-PostGIS/tree/master/Chapter01/data/ne_110m_coastline, and the code follows 'Mastering PostGIS' (by Dominik Mikiewicz et al.).

shp2pgsql is a command-line utility that (1) generates the SQL statements that import shapefile data, and (2) can send those SQL statements straight to psql to run the import. That sounds abstract, so the examples below should make it clearer.

To install PostGIS on the Greenplum DB Docker image and start Greenplum, see https://rfriend.tistory.com/435.

(1) Generating the SQL that imports shapefile data with shp2pgsql

First, from a command prompt, copy the four shape files 'ne_110m_coastline.shp', 'ne_110m_coastline.shx', 'ne_110m_coastline.dbf', and 'ne_110m_coastline.prj' into the gpdb-ds:/tmp directory of the Greenplum docker container with docker cp. (If you installed Greenplum DB with VMware, copy the files into GPDB with scp instead.)

-- (1) at the command prompt, copy the shape files to docker gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:~ ihongdon$

ihongdon-ui-MacBook-Pro:~ ihongdon$ cd /Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ ls

ne_110m_coastline.dbf ne_110m_coastline.prj ne_110m_coastline.shp ne_110m_coastline.shx

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ docker cp \
/Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/ne_110m_coastline.shp \
gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ docker cp \
/Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/ne_110m_coastline.dbf \
gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ docker cp \
/Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/ne_110m_coastline.shx \
gpdb-ds:/tmp

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ docker cp \
/Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/ne_110m_coastline.prj \
gpdb-ds:/tmp

From another command prompt, check that the four shape files arrived in GPDB.

[gpadmin@mdw /]$

[gpadmin@mdw /]$ cd tmp

[gpadmin@mdw tmp]$ ls -la

total 123532

drwxrwxrwt 1 root root 4096 Apr 10 13:13 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

drwxrwxrwt 2 root root 4096 Sep 11 2017 .ICE-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .Test-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .X11-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .XIM-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .font-unix

-rw-r--r-- 1 501 games 3179 Apr 8 06:02 ne_110m_coastline.dbf

-rw-r--r-- 1 501 games 147 Apr 8 06:03 ne_110m_coastline.prj

-rw-r--r-- 1 501 games 89652 Apr 8 06:03 ne_110m_coastline.shp

-rw-r--r-- 1 501 games 1172 Apr 8 06:04 ne_110m_coastline.shx

[gpadmin@mdw tmp]$

Now use the shp2pgsql utility to generate data_import.ne_coastline.sql, a SQL file that imports the ne_110m_coastline shape files.

-- (2) at the docker gpdb-ds prompt (gpadmin@mdw), run shp2pgsql to generate the SQL file

[gpadmin@mdw tmp]$ shp2pgsql -s 4326 ne_110m_coastline data_import.ne_coastline > data_import.ne_coastline.sql

Shapefile type: Arc

Postgis type: MULTILINESTRING[2]

[gpadmin@mdw tmp]$

[gpadmin@mdw tmp]$ ls -la
total 123532
drwxrwxrwt  1 root    root        4096 Apr 10 13:13 .
drwxr-xr-x  1 root    root        4096 Apr  9 07:11 ..
drwxrwxrwt  2 root    root        4096 Sep 11  2017 .ICE-unix
drwxrwxrwt  2 root    root        4096 Sep 11  2017 .Test-unix
drwxrwxrwt  2 root    root        4096 Sep 11  2017 .X11-unix
drwxrwxrwt  2 root    root        4096 Sep 11  2017 .XIM-unix
drwxrwxrwt  2 root    root        4096 Sep 11  2017 .font-unix
-rw-rw-r--  1 gpadmin gpadmin   184004 Apr  8 08:03 data_import.ne_coastline.sql
-rw-r--r--  1     501 games       3179 Apr  8 06:02 ne_110m_coastline.dbf
-rw-r--r--  1     501 games        147 Apr  8 06:03 ne_110m_coastline.prj
-rw-r--r--  1     501 games      89652 Apr  8 06:03 ne_110m_coastline.shp
-rw-r--r--  1     501 games       1172 Apr  8 06:04 ne_110m_coastline.shx
[gpadmin@mdw tmp]$

From another command prompt, copy the generated data_import.ne_coastline.sql file from docker GPDB to the local machine to see what the SQL looks like. As shown below, shp2pgsql generated the familiar table-creation and data-insertion statements: a CREATE TABLE "data_import"."ne_coastline" (...) statement, ALTER TABLE ... ADD PRIMARY KEY (gid), a SELECT AddGeometryColumn(...) call that adds the geom column (SRID 4326, MULTILINESTRING, 2 dimensions), and one INSERT INTO ... VALUES (...) per feature. Running this SQL file creates the table and loads the data in one go.

* Attached SQL file:

data_import.ne_coastline.sql.txt

0.18MB

-- (3) (optional) from another command prompt, copy gpdb-ds:/tmp/data_import.ne_coastline.sql out of docker gpdb to inspect the SQL

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$ docker cp gpdb-ds:/tmp/data_import.ne_coastline.sql /Users/ihongdon/Documents/PostGIS/data/ne_110m_coastline/

ihongdon-ui-MacBook-Pro:ne_110m_coastline ihongdon$

-- the SQL file looks like this when opened

SET CLIENT_ENCODING TO UTF8;

SET STANDARD_CONFORMING_STRINGS TO ON;

BEGIN;

CREATE TABLE "data_import"."ne_coastline" (gid serial,

"scalerank" numeric(10,0),

"featurecla" varchar(12));

ALTER TABLE "data_import"."ne_coastline" ADD PRIMARY KEY (gid);

SELECT AddGeometryColumn('data_import','ne_coastline','geom','4326','MULTILINESTRING',2);

INSERT INTO "data_import"."ne_coastline" ("scalerank","featurecla",geom) VALUES ('1','Coastline','0105000020E61000000100000001020000000B000000BFA9980AD07664C095CA366A1FA653C04B24ADB8626364C0700E7B2E4B8E53C03EE93FF8D72764C0BF309DD0549853C02960B7EFE00764C02CDFE4B164AC53C07CB8A9DB6FEF63C08FDFE431F7C253C0A4D39170A9E663C089F492F9CEDF53C09E264A4F152464C0165BF9DF96E853C0FB978739134E64C026353A8703D253C01A655486E06064C0C3343A0771BB53C0B2C4809F216264C0974C8585ADB753C0BFA9980AD07664C095CA366A1FA653C0');

INSERT INTO "data_import"."ne_coastline" ("scalerank","featurecla",geom) VALUES ('0','Coastline','0105000020E61000000100000001020000000C000000D0347456A2CA18C05BC1C65E0CEF4A403FFBA3ECC62118C08AB5EFBA9B934A4010BA8804CA271BC09650268B4B214A40D09F77358C1F21C036A1DEA9ABD5494040720A9544F423C08AA1DEA904E9494088D170FB225522C0BCBB2928AC6E4A40088BF249866023C04F137F7DD0F04A40F0F906F8EDA720C07E8CF6F40E554B40D08D1B64E6491EC028FE33FFD8904B4040B24E9775EF1AC0BA12E24620964B40B0E392DBD5A516C0BAA0A43CFD464B40D0347456A2CA18C05BC1C65E0CEF4A40');

.... (middle part omitted because it is too long) .....

INSERT INTO "data_import"."ne_coastline" ("scalerank","featurecla",geom) VALUES ('0','Coastline','0105000020E6100000010000000102000000840000008881E7DEC36147C090CC237F30A85440A4E4D53906B445C084177D0569CE5440E0965643E2F243C02AD9B11188CB544044C49448A24F43C09E779CA223E354405026FC523F8B41C0C8EA56CF49E95440608E1EBFB7193BC08842041C42E15440905DA27A6BD834C02615C61682AE5440F8B7921D1BB136C06C2BF697DD955440E0B298D87C843AC0E2E995B20C9354405866666666E63FC0CCCCCCCCCC8C5440102C0E677E653FC094E34EE96081544098CADB114EDB3BC0149161156F88544050205ED72FD838C0B83B6BB75D7254408821AB5B3DE736C0F8BD4D7FF6855440E87C3F355E1236C04C4F58E2016F544020139B8F6B2B37C06CF12900C6495440F8CD3637A69F34C0FAB4C35F9361544000E292E34E892FC00C4FAF94657A544010117008558A29C0B22E6EA3016E54404029CB10C76A28C0765E6397A852544048910A630B4930C0CA9717601F2554408899999999D930C06A6666666616544058C47762D60B34C0AA315A47550B544040A2B437F8BA31C09E508880430854406066666666E632C09C99999999D95340D00182397AB433C0E288B5F814B05340F8DF4A766CAC33C05E6397A8DE685340E0E995B20C7932C05CD3BCE3143F5340F860E0B9F70834C07E130A11703C53402098A3C7EFAD35C07E6132553028534018938C9C85D533C0141956F146065340A89C4B71559933C0C03E3A75E5CF524058D72FD80DAB34C03C454772F9C95240309FE579705F33C0BA973446EB92524018A14ACD1E9835C0340C1F11538E5240C8CD70033E6F34C028A5A0DB4B745240B86ED8B628C334C084640113B85D5240480B5EF4152C36C0849ECDAACF53524040D3D9C9E09037C0427E6FD39F535240E00C1AFA275036C0BA019F1F4628524008D847A7AE4C36C052616C21C80B5240302B4D4A414738C0ACBB79AA432652403809336DFFCA38C0EA482EFF21155240A86F99D3657137C0A49C685721055240E8940ED6FF2136C09A9EB0C403DE5140E892E34EE9C035C0AC2B9FE579AA5140B8AC1743398937C0A21A2FDD249E5140C037DBDC984E38C07CD66EBBD0B65140C851F2EA1C8B39C074A25D8594DB5140B00C71AC8B3339C0B685200725B05140084CE0D6DD5C3AC08C7615527E8E5140C874763238BA37C0C408E1D1C68B514068A6ED5F595936C0FC88981249885140F8C01C3D7E0739C01E0DE02D90505140E802ECA353BF3BC0DE8442041C1E5140F8FC304278AC3EC03641D47D000851403870CE88D2C63FC0B28009DCBA075140DCB5847CD06740C01C4CC3F011EF5040D4
8C45D3D91941C01C7233DC80AB5040385C72DC292D42C07A832F4CA67E5040041C42959A8542C04287F9F2027C5040B42E6EA3013043C028A5A0DB4B6C50401C9430D3F6E743C0B0777FBC575D5040E8ACDD76A15544C0C4E78711C2355040DCB06D51665744C05A5F24B4E5085040A8605452279844C064D0D03FC1BD4F40887F9F71E16845C03C8AE59656574F4094C6681D553545C01CFE9AAC51F34E40A00F5D50DF6E45C0444C89247A894E40386744696FB045C02015C616820C4E40C4CCCCCCCC6446C088E63A8DB4044E406CF59CF4BE2147C0642A6F47386D4E40B8679604A82148C0E4F3C308E16D4E4004ADC090D59D48C0D0D79E5912B44E4098D2C1FA3FF348C01C4CC3F011314F408C1804560ED149C08CC43D963ED04F40446E861BF0114AC05AA31EA2D11150404C33164D67234AC0024D840D4F4B504024895E46B1D44AC0686AD95A5F865040C0120F289BA64AC0AAC64B3789B55040F81CE6CB0BFC4AC05851836918CC5040CC7F48BF7D7D4AC01AD82AC1E2165140C4E9B298D8BC49C034F44F70B12E5140C022F8DF4A8A49C0660113B87549514078CF0F23846F49C0AEADD85F767B5140A85B3D27BD014AC0CA293A92CB64514064E42CEC69474AC046B79734465B5140F0BEF1B567BA4AC0FCD478E92652514080B2295778574BC00EE544BB0A67514028D6E25300604BC00A850838849251408C3F1878EE2D4BC0BA2DCA6C90B4514018A3755435B74AC0FE1D8A027DB55140406E861BF0B149C08E7A884677A45140F0940ED6FF8D4AC078A52C431CCD51409CFEEC478A004BC0B4AC342905E35140F8FFFFFFFF7F4BC040C3A0B304DA5140045053CBD6EA4BC06E4C4F58E2E95140342861A6ED5B4BC0F051B81E852552402CB05582C5A94BC09AE7C1DD593D5240FC449E245D0F4CC000FBE8D495695240940035B56CA94CC010EA5BE674AD5240344A5E9D634C4DC0C24351A04FC65240AC76DB85E64A4DC0EC4CA1F31AE1524074B0FECF61A24EC0EEF0D7648D06534034BD529621B24FC098900F7A360B53404C07EBFF1C8450C07C62D68BA108534068300DC3472051C072693524EE035340B8AF03E78C6A51C0FAA9F1D24D1853407416F6B4C3D951C0826E2F698C4053407A1EDC9DB53151C0C22B82FFAD54534004486DE2E4B050C09CBB96900F5853401C81785DBFC251C00EF9A067B36853405CBA490C025352C0DE334B02D4825340B8E82B48334A52C0C0A94885B19B53403E2CD49AE65751C0FA1D8A027DBA53401851DA1B7C6D50C072B6B9313DD9534028CB10C7BA5450C08872A25D85F05340921D1B81780151C0EA305F5E800754408C8D40BCAEC950C0DAA7E331032154404C62105839D84FC0BCA94885B14D5440E86F4221021E4FC0B6
AF03E78C545440543BFC3559534FC0CE0BB08F4E71544034B9DFA128244EC0868A71FE26825440E8F0D7648D9A4CC082828B15358C5440D0F6AFAC34114BC028E8F692C68C54401849F4328A854AC092831266DA785440C8BE2B82FF3149C03AA06CCA159C54409CE1067C7E0048C00E9DD7D825845440906A9F8EC74C47C0E6A90EB9197F544070E7FBA9F14246C0E81DA7E8486A5440ECC039234A7347C05A04FF5BC98C54408881E7DEC36147C090CC237F30A85440');

INSERT INTO "data_import"."ne_coastline" ("scalerank","featurecla",geom) VALUES ('3','Country','0105000020E6100000010000000102000000060000006666666666A65AC06766666666665240713D0AD7A3505AC0295C8FC2F56852400000000000205AC07B14AE47E15A5240B91E85EB51585AC0713D0AD7A33052405C8FC2F528BC5AC03E0AD7A3705D52406666666666A65AC06766666666665240');

COMMIT;
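The long hex strings in the INSERT statements are the geometries in hex-encoded EWKB, the extended well-known binary form that PostGIS accepts. To get a feel for what is inside, here is a Python sketch that decodes the last INSERT above (the 'Country' row). decode_multilinestring is a hypothetical helper that only handles 2D MultiLineString input; it is an illustration, not how shp2pgsql or PostGIS work internally.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # set in the type word when an SRID follows

# Geometry of the last INSERT above, split into header and one point per line.
HEX_EWKB = (
    "0105000020E6100000" "01000000"   # MultiLineString + SRID, 1 linestring
    "0102000000" "06000000"           # LineString with 6 points
    "6666666666A65AC0" "6766666666665240"
    "713D0AD7A3505AC0" "295C8FC2F5685240"
    "0000000000205AC0" "7B14AE47E15A5240"
    "B91E85EB51585AC0" "713D0AD7A3305240"
    "5C8FC2F528BC5AC0" "3E0AD7A3705D5240"
    "6666666666A65AC0" "6766666666665240"
)


def decode_multilinestring(hex_text):
    """Decode a 2D MultiLineString from hex EWKB (hypothetical helper)."""
    data = bytes.fromhex(hex_text)
    order = "<" if data[0] == 1 else ">"  # 1 = little endian
    offset = 1
    (raw_type,) = struct.unpack_from(order + "I", data, offset)
    offset += 4
    srid = None
    if raw_type & EWKB_SRID_FLAG:
        (srid,) = struct.unpack_from(order + "I", data, offset)
        offset += 4
    geom_type = raw_type & 0xFF  # low byte is the base type, 5 = MultiLineString
    (n_lines,) = struct.unpack_from(order + "I", data, offset)
    offset += 4
    lines = []
    for _ in range(n_lines):
        line_order = "<" if data[offset] == 1 else ">"
        offset += 1
        _line_type, n_points = struct.unpack_from(line_order + "II", data, offset)
        offset += 8
        points = []
        for _ in range(n_points):
            points.append(struct.unpack_from(line_order + "dd", data, offset))
            offset += 16
        lines.append(points)
    return geom_type, srid, lines


geom_type, srid, lines = decode_multilinestring(HEX_EWKB)
print(geom_type, srid, len(lines[0]))
for x, y in lines[0]:
    print(x, y)
```

The header confirms what the shp2pgsql messages said: the type is MULTILINESTRING and the SRID is the 4326 passed with -s. In the database, ST_AsText(geom) gives the same coordinates in readable form.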

(2) Piping the SQL generated by shp2pgsql into psql to run the import

At the Greenplum docker command prompt, run shp2pgsql again and pipe the SQL it generates (the same statements as in (1)) into psql to execute it. (Set the host (-h), port (-p), user (-U), and database (-d) of your own Greenplum DB or PostgreSQL DB below.)

[gpadmin@mdw tmp]$ shp2pgsql -s 4326 ne_110m_coastline data_import.ne_coastline \
| psql -h localhost -p 5432 -U gpadmin -d gpadmin
Shapefile type: Arc
Postgis type: MULTILINESTRING[2]
SET
SET
BEGIN
NOTICE:  CREATE TABLE will create implicit sequence "ne_coastline_gid_seq" for serial column "ne_coastline.gid"
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'gid' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "ne_coastline_pkey" for table "ne_coastline"
ALTER TABLE
                          addgeometrycolumn
----------------------------------------------------------------------
data_import.ne_coastline.geom SRID:4326 TYPE:MULTILINESTRING DIMS:2
(1 row)

INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1

.... (middle part omitted) ....

INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
COMMIT
[gpadmin@mdw tmp]$

Finally, go to the DBeaver DB tool and check with a SELECT statement that the data was loaded. The dataset stores coastlines as MULTILINESTRING geometries.

SELECT * FROM data_import.ne_coastline ORDER BY gid LIMIT 10;

Looking only at the geom of the first gid, it consists of 22 values, that is, 11 longitude/latitude pairs, as shown below.

MULTILINESTRING ((-163.7128956777287 -78.59566741324154, -163.1058009511638 -78.22333871857859, -161.24511349184644 -78.38017669058443, -160.24620805564453 -78.69364592886694, -159.48240454815448 -79.04633757925897, -159.20818356019765 -79.4970077452764, -161.12760128481472 -79.63420867301133, -162.43984676821842 -79.28146534618699, -163.027407803377 -78.92877369579496, -163.06660437727038 -78.8699659158468, -163.7128956777287 -78.59566741324154))
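To double-check that count, the WKT text can be split into coordinate pairs. The following is a minimal Python sketch, assuming a 2D MULTILINESTRING with no nested or empty parts; parse_multilinestring is a hypothetical helper written for this example, not a general WKT parser.

```python
import re

# WKT of the first geometry, copied from the query result above.
WKT = (
    "MULTILINESTRING ((-163.7128956777287 -78.59566741324154, "
    "-163.1058009511638 -78.22333871857859, "
    "-161.24511349184644 -78.38017669058443, "
    "-160.24620805564453 -78.69364592886694, "
    "-159.48240454815448 -79.04633757925897, "
    "-159.20818356019765 -79.4970077452764, "
    "-161.12760128481472 -79.63420867301133, "
    "-162.43984676821842 -79.28146534618699, "
    "-163.027407803377 -78.92877369579496, "
    "-163.06660437727038 -78.8699659158468, "
    "-163.7128956777287 -78.59566741324154))"
)


def parse_multilinestring(wkt):
    """Return a list of linestrings, each a list of (x, y) float tuples."""
    lines = []
    # Each innermost "(...)" group holds the points of one linestring.
    for group in re.findall(r"\(([^()]+)\)", wkt):
        points = []
        for pair in group.split(","):
            x, y = pair.split()
            points.append((float(x), float(y)))
        lines.append(points)
    return lines


lines = parse_multilinestring(WKT)
n_points = len(lines[0])
n_values = sum(len(point) for point in lines[0])
print(n_points, n_values, lines[0][0] == lines[0][-1])
```

The first and last points are identical, so this particular coastline piece is a closed loop.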

I hope this was helpful.


[Greenplum DB] PostGIS: Importing a csv file with latitude and longitude and extracting spatial information

Greenplum and PostgreSQL Database 2019. 4. 10. 21:50

This post shows how to (1) import a csv file containing latitude and longitude into a Greenplum database using psql and the DBeaver tool, and (2) extract spatial information with PostGIS SQL queries.

For reference, many different tools can import geospatial data into PostGIS on PostgreSQL, Greenplum DB.

(1) Importing a csv file with latitude and longitude into Greenplum DB with psql

The example dataset is the '2.5_day.csv' file downloaded from https://github.com/PacktPublishing/Mastering-PostGIS/tree/master/Chapter01/data/usgs-earthquakes, and the example code follows the book 'Mastering PostGIS'.

Start Greenplum from the Greenplum docker image. (See the link below for details.)

==> https://rfriend.tistory.com/435

First, create the data_import schema and the earthquakes_csv table in the DBeaver SQL editor as follows.

-- create schema
CREATE SCHEMA data_import;

----------
-- (1) Importing CSV data format
----------

-- create table
DROP TABLE IF EXISTS data_import.earthquakes_csv;
CREATE TABLE data_import.earthquakes_csv (
"time" timestamp with time zone,
latitude numeric,
longitude numeric,
depth numeric,
mag numeric,
magType varchar,
nst numeric,
gap numeric,
dmin numeric,
rms numeric,
net varchar,
id varchar,
updated timestamp with time zone,
place varchar,
type varchar,
horizontalError numeric,
depthError numeric,
magError numeric,
magNst numeric,
status varchar,
locationSource varchar,
magSource varchar
);

Next, from a command prompt, copy the '2.5_day.csv' file into the Greenplum Docker container (with PostGIS installed) using docker cp.

MacBook-Pro:~ ihongdon$ docker cp /Users/ihongdon/Documents/PostGIS/data/usgs-earthquakes/2.5_day.csv gpdb-ds:/tmp

From a command prompt logged in as the Greenplum gpadmin user, check that '2.5_day.csv' was copied. Then log in as root and change the file's owner to gpadmin with chown.

[gpadmin@mdw tmp]$ ls -la

total 123532

drwxrwxrwt 1 root root 4096 Apr 10 13:13 .

drwxr-xr-x 1 root root 4096 Apr 9 07:11 ..

drwxrwxrwt 2 root root 4096 Sep 11 2017 .ICE-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .Test-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .X11-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .XIM-unix

drwxrwxrwt 2 root root 4096 Sep 11 2017 .font-unix

srwxrwxr-x 1 gpadmin gpadmin 0 Mar 22 07:19 .s.GPMC.sock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 12:30 .s.PGSQL.40000

-rw------- 1 gpadmin gpadmin 27 Apr 10 12:30 .s.PGSQL.40000.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 12:30 .s.PGSQL.40001

-rw------- 1 gpadmin gpadmin 27 Apr 10 12:30 .s.PGSQL.40001.lock

srwxrwxrwx 1 gpadmin gpadmin 0 Apr 10 12:30 .s.PGSQL.5432

-rw------- 1 gpadmin gpadmin 27 Apr 10 12:30 .s.PGSQL.5432.lock

-rw-r--r-- 1 501 games 4787 Apr 8 06:21 2.5_day.csv

[gpadmin@mdw tmp]$

[root@mdw tmp]# exit

logout

Next, start psql and import '2.5_day.csv' into the data_import.earthquakes_csv table with the copy command.

[gpadmin@mdw tmp]$ pwd

/tmp

[gpadmin@mdw tmp]$ psql

psql (8.3.23)

Type "help" for help.

gpadmin=# copy data_import.earthquakes_csv from '/tmp/2.5_day.csv' with DELIMITER ',' CSV HEADER;

COPY 25

gpadmin=# \q

[gpadmin@mdw tmp]$
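COPY with CSV HEADER only skips the first line of the file and then maps fields to the table columns by position, so the column count and order of the csv must match the table. A quick header check before loading can save a confusing error. Below is a minimal Python sketch that is not part of the original workflow: check_csv_header is a hypothetical helper, and the sample header is built from the CREATE TABLE statement above rather than read from the real 2.5_day.csv.

```python
import csv
import io

# Column names copied from the CREATE TABLE statement above, in order.
TABLE_COLUMNS = [
    "time", "latitude", "longitude", "depth", "mag", "magType", "nst", "gap",
    "dmin", "rms", "net", "id", "updated", "place", "type", "horizontalError",
    "depthError", "magError", "magNst", "status", "locationSource", "magSource",
]


def check_csv_header(csv_file, expected=TABLE_COLUMNS):
    """Compare the first CSV line with the expected columns.

    Names are compared case-insensitively because unquoted identifiers are
    folded to lower case by the database. Returns (missing, unexpected,
    same_order).
    """
    header = [name.strip().lower() for name in next(csv.reader(csv_file))]
    wanted = [name.lower() for name in expected]
    missing = [name for name in wanted if name not in header]
    unexpected = [name for name in header if name not in wanted]
    return missing, unexpected, header == wanted


# Stand-in for the real file; replace with open("2.5_day.csv", newline="").
sample = io.StringIO(",".join(TABLE_COLUMNS) + "\n")
print(check_csv_header(sample))
```

If same_order comes back False even though nothing is missing, the columns are all there but in a different order, and the COPY would load values into the wrong columns or fail on a type mismatch.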

Back in the DBeaver query tool, check that the data was loaded into the data_import.earthquakes_csv table. It went in fine. ^^

SELECT * FROM data_import.earthquakes_csv LIMIT 10;

(2) Extracting spatial information with PostGIS functions

Now that the source data is ready, build a point geometry with PostGIS's ST_Point(longitude, latitude) and assign it a spatial reference ID (SRID) with ST_SetSRID().

-- (2) Extracting spatial information from flat data

DROP TABLE IF EXISTS data_import.earthquakes_subset_with_geom;

CREATE TABLE data_import.earthquakes_subset_with_geom AS (

SELECT

id,

"time",

depth,

mag,

magtype,

place,

ST_SetSRID(ST_Point(longitude, latitude), 4326) AS geom

FROM data_import.earthquakes_csv

);

SELECT * FROM data_import.earthquakes_subset_with_geom LIMIT 10;
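ST_Point takes x first and y second, which means longitude comes before latitude. Swapping the two is an easy mistake to make. The Python sketch below builds the EWKT text form of such a point and rejects out-of-range values. The helper name point_ewkt and the sample coordinates are illustrative assumptions, not part of the original post.

```python
def point_ewkt(longitude, latitude, srid=4326):
    """Build the EWKT text for a point, with longitude (x) before latitude (y)."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    return f"SRID={srid};POINT({longitude} {latitude})"


# Illustrative coordinates (roughly Seoul): longitude 126.978, latitude 37.5665.
print(point_ewkt(126.978, 37.5665))
```

Passing the values in the wrong order, for example point_ewkt(37.5665, 126.978), raises a ValueError here because 126.978 is not a valid latitude. The range check only catches swaps where one value falls outside the latitude range.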

I hope this was helpful.

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)

Posted by Rfriend

[Greenplum DB] Installing PostGIS on a GPDB Docker container

Greenplum and PostgreSQL Database 2019. 3. 27. 07:45

This post shows how to install PostGIS, an open source extension that enables geospatial data analysis in PostgreSQL and Greenplum Database, on a Greenplum Docker container.

The installation steps follow the Greenplum PostGIS guide at https://gpdb.docs.pivotal.io/5100/ref_guide/extensions/postGIS.html.

0. (Prerequisite) Install Greenplum DB + MADlib + PL/x using Docker

For details on setting up an analytics environment from a Docker image that has CentOS + Greenplum + MADlib + PL/R + PL/Python installed, see the post at https://rfriend.tistory.com/379.

Open a terminal and run the docker commands below to set up Greenplum.

---------------------------------

-- GPDB w/MADlib, PL/x on Docker : https://hub.docker.com/r/hdlee2u/gpdb-analytics

---------------------------------

-- (1) Docker Image Pull

$ docker pull hdlee2u/gpdb-analytics

$ docker images

-- (2) Docker Image Run(port 5432) -> Docker Container Creation

$ docker run -i -d -p 5432:5432 -p 28080:28080 --name gpdb-ds --hostname mdw hdlee2u/gpdb-analytics /usr/sbin/sshd -D

$ docker ps -a

-- (3) To Start Greenplum Database and Use psql

$ docker exec -it gpdb-ds /bin/bash

[root@mdw /]# su - gpadmin

[gpadmin@mdw ~]$ gpstart -a

.... GPDB start

....
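The docker run line above packs several options into one command: -i keeps STDIN open, -d runs the container in the background, and each -p publishes a host port to a container port. As a rough illustration, the Python sketch below assembles the same command from named parts. The function docker_run_args is a hypothetical helper written for this example; it only builds the argument list and does not start a container.

```python
import shlex


def docker_run_args(image, name, hostname, ports, command):
    """Assemble the argument list for a detached `docker run` invocation."""
    args = ["docker", "run", "-i", "-d"]
    for host_port, container_port in ports:
        # -p HOST:CONTAINER publishes a container port on the host.
        args += ["-p", f"{host_port}:{container_port}"]
    args += ["--name", name, "--hostname", hostname, image]
    args += command
    return args


args = docker_run_args(
    image="hdlee2u/gpdb-analytics",
    name="gpdb-ds",
    hostname="mdw",
    ports=[(5432, 5432), (28080, 28080)],
    command=["/usr/sbin/sshd", "-D"],
)
print(shlex.join(args))
```

The list could be passed to subprocess.run(args, check=True) if you prefer to launch the container from a script.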

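You need to download and install the PostGIS version that matches your CentOS and GPDB versions. The commands below show how to check the IP address, the CentOS version, the MADlib and PL/R versions, and the R and Python data science package versions. This Docker image contains the following versions: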

- CentOS : release 7.4

- Greenplum Database : ver 5.10.2

- MADlib : ver 1.15

- PL/R : 2.3.2

- DataScienceR : 1.0.1

- DataSciencePython : 1.1.1

-------------------------------------

-- IP check

[gpadmin@mdw ~]$

[root@mdw ~]# cd /home/gpadmin

[root@mdw gpadmin]# ifconfig -a

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500

inet 172.17.0.2 netmask 255.255.0.0 broadcast 172.17.255.255

ether 02:42:ac:11:00:02 txqueuelen 0 (Ethernet)

RX packets 25395 bytes 10372326 (9.8 MiB)

RX errors 0 dropped 0 overruns 0 frame 0

TX packets 25074 bytes 79368842 (75.6 MiB)

TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

... (remaining output omitted)

--------------------------------------

-- MADlib, PL/R, Python Data Science Package, GP Command Center version check

--------------------------------------

[root@mdw gpadmin]# cd /setup

[root@mdw setup]# ls -al

total 810088

drwxr-xr-x 4 gpadmin gpadmin 4096 Aug 13 2018 .

drwxr-xr-x 1 root root 4096 Mar 11 05:08 ..

-rw-r--r-- 1 gpadmin gpadmin 218258940 Aug 13 2018 DataSciencePython-1.1.1-gp5-rhel7-x86_64.gppkg

-rw-r--r-- 1 gpadmin gpadmin 146189713 Aug 13 2018 DataScienceR-1.0.1-gp5-rhel7-x86_64.gppkg

drwxr-xr-x 2 gpadmin gpadmin 4096 Jul 23 2018 greenplum-cc-web-4.3.0-LINUX-x86_64

-rw-r--r-- 1 gpadmin gpadmin 29040039 Aug 13 2018 greenplum-cc-web-4.3.0-LINUX-x86_64.zip

-rwxr-xr-x 1 gpadmin gpadmin 197905185 Aug 10 2018 greenplum-db-5.10.2-rhel7-x86_64.bin

-rw-r--r-- 1 gpadmin gpadmin 195802895 Aug 13 2018 greenplum-db-5.10.2-rhel7-x86_64.zip

-rw-r--r-- 1 gpadmin gpadmin 4 Aug 13 2018 hostfile

drwxr-xr-x 2 gpadmin gpadmin 4096 Aug 11 2018 madlib-1.15-gp5-rhel7-x86_64

-rw-r--r-- 1 gpadmin gpadmin 3023537 Aug 13 2018 madlib-1.15-gp5-rhel7-x86_64.tar.gz

-rw-r--r-- 1 gpadmin gpadmin 39279994 Aug 13 2018 plr-2.3.2-gp5-rhel7-x86_64.gppkg

--------------------------------------

-- CentOS version check

[gpadmin@mdw setup]$ cat /etc/os-release

NAME="CentOS Linux"

VERSION="7 (Core)"

ID="centos"

ID_LIKE="rhel fedora"

VERSION_ID="7"

PRETTY_NAME="CentOS Linux 7 (Core)"

ANSI_COLOR="0;31"

CPE_NAME="cpe:/o:centos:centos:7"

HOME_URL="https://www.centos.org/"

BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"

CENTOS_MANTISBT_PROJECT_VERSION="7"

REDHAT_SUPPORT_PRODUCT="centos"

REDHAT_SUPPORT_PRODUCT_VERSION="7"
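The /etc/os-release output above is a list of KEY="value" pairs, so it is easy to read programmatically. The Python sketch below parses it and derives a platform tag such as rhel7. Treating a CentOS system as RHEL-compatible based on ID_LIKE is an assumption of this sketch, and both function names are hypothetical.

```python
def parse_os_release(text):
    """Parse KEY="value" lines from /etc/os-release into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def platform_tag(info):
    """Return a tag like 'rhel7' for RHEL-compatible systems (assumed convention)."""
    like = info.get("ID_LIKE", "").split()
    if info.get("ID") == "rhel" or "rhel" in like:
        major = info["VERSION_ID"].split(".")[0]
        return f"rhel{major}"
    raise ValueError(f"not a RHEL-compatible system: {info.get('ID')}")


# A few lines copied from the output shown in the post.
SAMPLE = '''NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
'''

info = parse_os_release(SAMPLE)
print(info["NAME"], platform_tag(info))
```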

1. Download PostGIS from Pivotal Network

(1) Go to https://network.pivotal.io/ (registration is required to download)

> (2) 'Pivotal Greenplum Releases: 5.10.2' : https://network.pivotal.io/products/pivotal-gpdb#/releases/158026

> (3) 'Greenplum Advanced Analytics' : https://network.pivotal.io/products/pivotal-gpdb#/releases/158026/file_groups/1084

> (4) 'PostGIS 2.1.5+pivotal.1 for RHEL 7' file download

Follow the path above in order and download the PostGIS 2.1.5+pivotal.1 for RHEL 7 file.
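The package file names used in this post (for example postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg) appear to encode the Greenplum major version and the OS platform. Assuming that naming pattern, which is inferred only from the file names shown here and not from official documentation, a small Python sketch can check a downloaded file against the target environment. GPPKG_PATTERN and both function names are hypothetical.

```python
import re

# Assumed pattern: <name>-<version>-gp<major>-rhel<major>-<arch>.gppkg
GPPKG_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z]+)-(?P<version>.+)-(?P<gp>gp\d+)-(?P<os>rhel\d+)-(?P<arch>\w+)\.gppkg$"
)


def parse_gppkg_name(filename):
    """Split a .gppkg file name into its parts, or raise if it does not match."""
    match = GPPKG_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized package name: {filename}")
    return match.groupdict()


def matches_environment(filename, gp_tag, os_tag):
    """True if the package targets the given Greenplum and OS tags."""
    parts = parse_gppkg_name(filename)
    return parts["gp"] == gp_tag and parts["os"] == os_tag


parts = parse_gppkg_name("postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg")
print(parts)
print(matches_environment("postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg", "gp5", "rhel7"))
```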

2. Copy the downloaded PostGIS package into the Greenplum Docker container

Open another terminal window, change to the Downloads folder as shown below, and use docker cp to copy the PostGIS 2.1.5 package file downloaded in step 1 to the 'gpdb-ds:/setup' path inside the Greenplum Docker container.

-- [At another terminal window] Copy PostGIS 2.1.5 to GPDB-DS Docker Container

ihongdon-ui-MacBook-Pro:~ ihongdon$ pwd

/Users/ihongdon

ihongdon-ui-MacBook-Pro:~ ihongdon$ cd Downloads/

ihongdon-ui-MacBook-Pro:Downloads ihongdon$ ls -al

-rw-r--r--@ 1 ihongdon staff 19839907 3 22 16:28 postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

ihongdon-ui-MacBook-Pro:Downloads ihongdon$ docker cp postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg gpdb-ds:/setup

ihongdon-ui-MacBook-Pro:Downloads ihongdon$

3. Give the gpadmin account ownership of the postgis-2.1.5 file (chown)

(1) Return to the terminal where you are logged in as gpadmin and switch to the root account, then (2) use chown to make gpadmin the owner of the PostGIS file.

-- Change the file owner or owning group: chown

[gpadmin@mdw setup]$ su -

Password:

Last login: Fri Mar 22 07:01:35 UTC 2019 on pts/0

[root@mdw ~]# cd /setup

[root@mdw setup]# ls -al

total 829464

drwxr-xr-x 1 gpadmin gpadmin 4096 Mar 22 07:33 .

drwxr-xr-x 1 root root 4096 Mar 11 05:08 ..

-rw-r--r-- 1 gpadmin gpadmin 218258940 Aug 13 2018 DataSciencePython-1.1.1-gp5-rhel7-x86_64.gppkg

-rw-r--r-- 1 gpadmin gpadmin 146189713 Aug 13 2018 DataScienceR-1.0.1-gp5-rhel7-x86_64.gppkg

drwxr-xr-x 2 gpadmin gpadmin 4096 Jul 23 2018 greenplum-cc-web-4.3.0-LINUX-x86_64

-rw-r--r-- 1 gpadmin gpadmin 29040039 Aug 13 2018 greenplum-cc-web-4.3.0-LINUX-x86_64.zip

-rwxr-xr-x 1 gpadmin gpadmin 197905185 Aug 10 2018 greenplum-db-5.10.2-rhel7-x86_64.bin

-rw-r--r-- 1 gpadmin gpadmin 195802895 Aug 13 2018 greenplum-db-5.10.2-rhel7-x86_64.zip

-rw-r--r-- 1 gpadmin gpadmin 4 Aug 13 2018 hostfile

drwxr-xr-x 2 gpadmin gpadmin 4096 Aug 11 2018 madlib-1.15-gp5-rhel7-x86_64

-rw-r--r-- 1 gpadmin gpadmin 3023537 Aug 13 2018 madlib-1.15-gp5-rhel7-x86_64.tar.gz

-rw-r--r-- 1 gpadmin gpadmin 39279994 Aug 13 2018 plr-2.3.2-gp5-rhel7-x86_64.gppkg

-rw-r--r-- 1 501 games 19839907 Mar 22 07:28 postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

[root@mdw setup]# chown gpadmin:gpadmin postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg
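In the listing above, the copied file shows 501 as its owner instead of a user name, because ls prints the bare numeric ID when no account maps to it. The Python sketch below reports a file's owner the same way, which can be handy for confirming that chown took effect. The helper name owner_name is hypothetical, and the temporary file only stands in for the real package file.

```python
import os
import pwd
import tempfile


def owner_name(path):
    """Return the owning user's name, or the numeric uid if no account matches."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this uid, so fall back to the number (as ls does).
        return str(uid)


# Demonstration on a temporary file, which is owned by the current user.
with tempfile.NamedTemporaryFile() as tmp:
    print(owner_name(tmp.name))
```

Inside the container, owner_name("/setup/postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg") would be expected to return gpadmin after the chown step.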

4. Install PostGIS on each segment node with gppkg

(1) In the terminal, exit from the root account and, as gpadmin, install PostGIS 2.1.5 with gppkg -i.

(2) The log then prints the message 'gppkg:mdw:gpadmin-[INFO]:-Please run the following commands to enable the PostGIS package: $GPHOME/share/postgresql/contrib/postgis-2.1/postgis_manager.sh mydatabase install'. Run that command as well, replacing mydatabase with your database name (gpadmin in this example).

-- PostGIS 2.1.5 install

[root@mdw setup]# exit

logout

[gpadmin@mdw setup]$ ls -al

total 829464

drwxr-xr-x 1 gpadmin gpadmin 4096 Mar 22 07:33 .

drwxr-xr-x 1 root root 4096 Mar 11 05:08 ..

-rw-r--r-- 1 gpadmin gpadmin 218258940 Aug 13 2018 DataSciencePython-1.1.1-gp5-rhel7-x86_64.gppkg

-rw-r--r-- 1 gpadmin gpadmin 146189713 Aug 13 2018 DataScienceR-1.0.1-gp5-rhel7-x86_64.gppkg

drwxr-xr-x 2 gpadmin gpadmin 4096 Jul 23 2018 greenplum-cc-web-4.3.0-LINUX-x86_64

-rw-r--r-- 1 gpadmin gpadmin 29040039 Aug 13 2018 greenplum-cc-web-4.3.0-LINUX-x86_64.zip

-rwxr-xr-x 1 gpadmin gpadmin 197905185 Aug 10 2018 greenplum-db-5.10.2-rhel7-x86_64.bin

-rw-r--r-- 1 gpadmin gpadmin 195802895 Aug 13 2018 greenplum-db-5.10.2-rhel7-x86_64.zip

-rw-r--r-- 1 gpadmin gpadmin 4 Aug 13 2018 hostfile

drwxr-xr-x 2 gpadmin gpadmin 4096 Aug 11 2018 madlib-1.15-gp5-rhel7-x86_64

-rw-r--r-- 1 gpadmin gpadmin 3023537 Aug 13 2018 madlib-1.15-gp5-rhel7-x86_64.tar.gz

-rw-r--r-- 1 gpadmin gpadmin 39279994 Aug 13 2018 plr-2.3.2-gp5-rhel7-x86_64.gppkg

-rw-r--r-- 1 gpadmin gpadmin 19839907 Mar 22 07:28 postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

[gpadmin@mdw setup]$ gppkg -i postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

20190322:07:36:54:011243 gppkg:mdw:gpadmin-[INFO]:-Starting gppkg with args: -i postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

20190322:07:36:55:011243 gppkg:mdw:gpadmin-[INFO]:-Installing package postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg

20190322:07:36:55:011243 gppkg:mdw:gpadmin-[INFO]:-Validating rpm installation cmdStr='rpm --test -i /usr/local/greenplum-db-5.10.2/.tmp/libexpat-2.1.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/gdal-1.11.1-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/proj-4.8.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/postgis-2.1.5-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/json-c-0.12-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/geos-3.4.2-1.x86_64.rpm --dbpath /usr/local/greenplum-db-5.10.2/share/packages/database --prefix /usr/local/greenplum-db-5.10.2'

20190322:07:36:55:011243 gppkg:mdw:gpadmin-[INFO]:-Installing postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg locally

20190322:07:36:56:011243 gppkg:mdw:gpadmin-[INFO]:-Validating rpm installation cmdStr='rpm --test -i /usr/local/greenplum-db-5.10.2/.tmp/libexpat-2.1.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/gdal-1.11.1-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/proj-4.8.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/postgis-2.1.5-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/json-c-0.12-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/geos-3.4.2-1.x86_64.rpm --dbpath /usr/local/greenplum-db-5.10.2/share/packages/database --prefix /usr/local/greenplum-db-5.10.2'

20190322:07:36:56:011243 gppkg:mdw:gpadmin-[INFO]:-Installing rpms cmdStr='rpm -i /usr/local/greenplum-db-5.10.2/.tmp/libexpat-2.1.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/gdal-1.11.1-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/proj-4.8.0-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/postgis-2.1.5-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/json-c-0.12-1.x86_64.rpm /usr/local/greenplum-db-5.10.2/.tmp/geos-3.4.2-1.x86_64.rpm --dbpath /usr/local/greenplum-db-5.10.2/share/packages/database --prefix=/usr/local/greenplum-db-5.10.2'

20190322:07:37:01:011243 gppkg:mdw:gpadmin-[INFO]:-Completed local installation of postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg.

20190322:07:37:01:011243 gppkg:mdw:gpadmin-[INFO]:-Please run the following commands to enable the PostGIS package: $GPHOME/share/postgresql/contrib/postgis-2.1/postgis_manager.sh mydatabase install

20190322:07:37:01:011243 gppkg:mdw:gpadmin-[INFO]:-postgis-2.1.5+pivotal.1-gp5-rhel7-x86_64.gppkg successfully installed.

[gpadmin@mdw setup]$ cd $GPHOME

[gpadmin@mdw greenplum-db]$ cd share

[gpadmin@mdw share]$ ls

gdal greenplum packages postgresql proj

[gpadmin@mdw share]$ cd postgresql/

[gpadmin@mdw postgresql]$ cd contrib/

[gpadmin@mdw contrib]$ ls

citext.sql gp_distribution_policy.sql gp_svec_test.sql oid2name.txt postgis-2.1 uninstall_fuzzystrmatch.sql uninstall_hstore.sql

dblink.sql gp_session_state.sql hstore.sql orafunc.sql uninstall_citext.sql uninstall_gp_distribution_policy.sql uninstall_orafunc.sql

fuzzystrmatch.sql gp_sfv_test.sql indexscan.sql pgcrypto.sql uninstall_dblink.sql uninstall_gp_session_state.sql uninstall_pgcrypto.sql

[gpadmin@mdw contrib]$ cd postgis-2.1/

[gpadmin@mdw postgis-2.1]$ ls

install postgis_manager.sh uninstall upgrade

[gpadmin@mdw postgis-2.1]$ $GPHOME/share/postgresql/contrib/postgis-2.1/postgis_manager.sh gpadmin install

SET

BEGIN

DO

CREATE FUNCTION

CREATE TYPE

CREATE FUNCTION

:

INSERT 0 1

COMMIT

ANALYZE

[gpadmin@mdw postgis-2.1]$

:

PostGIS is now installed in the Greenplum Docker container.
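The gppkg log includes the follow-up command with a placeholder database name (mydatabase), which has to be replaced before running it. As a small illustration, the Python sketch below pulls that command out of a log line and substitutes the database name. The function enable_command is a hypothetical helper; the log line is copied from the output in this post.

```python
import re

HINT = re.compile(
    r"Please run the following commands to enable the PostGIS package:\s*(?P<cmd>.+)$"
)


def enable_command(log_text, database):
    """Return the enable command from a gppkg log with the database name filled in."""
    for line in log_text.splitlines():
        match = HINT.search(line)
        if match:
            return match.group("cmd").strip().replace("mydatabase", database)
    return None


LOG = (
    "20190322:07:37:01:011243 gppkg:mdw:gpadmin-[INFO]:-Please run the following "
    "commands to enable the PostGIS package: "
    "$GPHOME/share/postgresql/contrib/postgis-2.1/postgis_manager.sh mydatabase install"
)

print(enable_command(LOG, "gpadmin"))
```

The result matches the command that was run by hand in the session above.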

5. Test with a sample PostGIS query

Using the DBeaver DB tool, create the PostGIS test table below and run a SELECT query against it.

-- Create PostGIS extension

CREATE EXTENSION postgis;

-- PostGIS version check

SELECT PostGIS_Version();

-- PostGIS sample query

CREATE TABLE geom_test ( gid int4, geom geometry,

name varchar(25) );

INSERT INTO geom_test ( gid, geom, name )

VALUES ( 1, 'POLYGON((0 0 0,0 5 0,5 5 0,5 0 0,0 0 0))', '3D Square');

INSERT INTO geom_test ( gid, geom, name )

VALUES ( 2, 'LINESTRING(1 1 1,5 5 5,7 7 5)', '3D Line' );

INSERT INTO geom_test ( gid, geom, name )

VALUES ( 3, 'MULTIPOINT(3 4,8 9)', '2D Aggregate Point' );

SELECT * from geom_test WHERE geom &&

Box3D(ST_GeomFromEWKT('LINESTRING(2 2 0, 3 3 0)'));

It works as expected.
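The && operator in the sample query tests whether the bounding boxes of two geometries overlap. As far as I understand it, the comparison is made in 2D. The Python sketch below reproduces that test in plain Python for the three rows inserted above, dropping the z coordinate. It is a simplified model for intuition, not PostGIS itself, and the function names are made up.

```python
def bbox(points):
    """Return the 2D bounding box (xmin, ymin, xmax, ymax) of a list of (x, y)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    """True if two boxes share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# x, y coordinates of the three geometries from the INSERT statements (z dropped).
GEOMS = {
    1: [(0, 0), (0, 5), (5, 5), (5, 0), (0, 0)],  # 3D Square
    2: [(1, 1), (5, 5), (7, 7)],                  # 3D Line
    3: [(3, 4), (8, 9)],                          # 2D Aggregate Point
}

# Bounding box of LINESTRING(2 2 0, 3 3 0) from the WHERE clause.
query_box = bbox([(2, 2), (3, 3)])

hits = [gid for gid, pts in GEOMS.items() if boxes_overlap(bbox(pts), query_box)]
print(hits)  # [1, 2]
```

In this model the multipoint (gid 3) is excluded because its box starts at y = 4, above the query box's upper edge at y = 3.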

I hope this was helpful.
