首页 > 代码库 > GreenPlum 使用gpload通过gpfdist文件实现数据高速加载

GreenPlum 使用gpload通过gpfdist文件实现数据高速加载

1gpload环境准备

环境准备请参考博主以前的文章gpfdist部署实战:http://blog.csdn.net/mchdba/article/details/72540806  ,安装好gpfdist后,gpload也自动有了,可以自动使用。

 

安装完后,可以启动gpfdist服务:nohup /data/greenplum/bin/gpfdist -d /data/greenplum/ -p 8090> /home/gpadmin/gpfdist.log  &

 

 

 

 

2gpload简介

Greenplum的gpload工具通过可读外部表和并行化文件服务器gpfdist(或gpfdists)来加载数据。gpload处理并行化的基于文件的外部表设置,以及允许我们使用单个YAML文件来配置数据格式,外部表定义,以及gpfdist或gpfdists。


要使用gpload工具有几个前提条件必须满足:
1.    使用gpload的服务器必须安装Python 2.6.2或者以上版本,pygresql工具和pyyaml工具(数据库服务端已经安装了python和需要的python库文件)
2.    必须装gpfdist程序,并把它设置到环境变量PATH中(可以从数据库服务器端的安装目录的子目录bin中拷贝该工具)
3.    确保使用gpload工具的ETL服务器与Greenplum所有服务器的联通性,以及机器名解析正确

4.    确保greenplum集群的所有服务器之间的网络是联通的。

gpload通过它的控制文件(YAML格式控制文件)指定装载的细节信息。所以编写控制文件是用好gpload的关键所在。gpload的控制文件采用YAML1.0.0.1文档格式编写,因此它必须得是有效的YAML格式。通过编写的yaml文件的配置,gpload服务与greenplum集群打通,一个yaml配置对应一个greenplum里面的表。

 

 

 

3,查看gpload参数

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[gpadmin@g01~]$ gpload -?

gpload [options] -f configuration file

 

Options:

    -h hostname: host to connect to

    -p port: port to connect to

    -U username: user to connect as

    -d database: database to connect to

    -W: force password authentication

    -q: quiet mode

    -D: do not actually load data

    -v: verbose

    -V: very verbose

    -l logfile: log output to logfile

    --no_auto_trans: do not wrap gpload in transaction

    --gpfdist_timeout timeout: gpfdist timeout value

    --version: print version number and exit

    -?: help

 

[gpadmin@g01~]$                                                        

 

 

blog源地址:http://blog.csdn.net/mchdba/article/details/72681969,博主黄杉(mchdba),谢绝转载

 

4,开始gpload实战

创建全局序列,用来获取汇总执行gpload的次数。

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

create SEQUENCE gpload_audit_seq INCREMENT BY 1 MINVALUE 1 NO MAXVALUE START WITH 1;
select * from gpload_audit_seq ;

 

 

创建审计表,记录每一次gpload执行的开始和结束事件信息

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

create table gpload_audit(                                                                
    id bigint,
    state text,
    mode text,
    tablename text,
    updatetime timestamp
) distributed by (id);

 

 

创建gpload加载的表

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

CREATE TABLE expenses
(
    name text,
    amount numeric,
    category text,
    des text,
    update_date date
);

 

 

在gpload服务器上建立临时测试数据文件天

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[gpadmin@g01gpdextdata]$ more t21.txt

1|aaa

2|zhangsan

 

[gpadmin@g01gpdextdata]$

[gpadmin@g01gpdextdata]$ more t22.txt

3|wanger

4|mazi

 

[gpadmin@g01gpdextdata]$

 

 

在greenplum集群的master库上,建立外部表:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

create table t02 (

id integer,

name varchar(128)

)

location (

‘gpfdist://192.168.2.72:8090/gpextdata/t21.txt‘,

‘gpfdist://192.168.2.72:8090/gpextdata/t22.txt‘

)

Format ‘TEXT‘ (delimiter as E‘|‘ null as ‘‘ escape ‘OFF‘)

;

 

 

创建gpload.yml配置文件

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 [gpadmin@g01~]$ more g2.yml

 

--- 

VERSION: 1.0.0.1 

DATABASE: yueworld_db 

USER: mch

HOST: 10.254.2.111 

PORT: 5432 

GPLOAD: 

  INPUT: 

    - SOURCE: 

        LOCAL_HOSTNAME: 

          - 192.168.2.72 

        PORT: 8090 

        FILE: 

          - /data/greenplum/gpextdata/t21.txt

          - /data/greenplum/gpextdata/t22.txt

              

    - COLUMNS: 

        - id: int 

        - name: text

    - FORMAT: text 

    - DELIMITER: ‘|‘ 

    - ERROR_LIMIT: 25 

  OUTPUT: 

    - TABLE: dw.t02

    - MODE: INSERT 

  SQL:

    - BEFORE: "INSERT INTO gpload_audit VALUES(nextval(‘gpload_audit_seq‘), ‘start‘, ‘insert‘, ‘zhangyun_schema.expenses‘, current_timestamp)"

- AFTER: "INSERT INTO gpload_audit VALUES(nextval(‘gpload_audit_seq‘), ‘end‘, ‘insert‘, ‘zhangyun_schema.expenses‘, current_timestamp)"

[gpadmin@g01~]$

 

 

在greenplum db库上赋予mch账号权限:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

yueworld_db=# grant select,insert on gpload_audit to mch;

GRANT

yueworld_db=# grant select,insert,update,delete on t01 to mch;

GRANT

yueworld_db=#

yueworld_db=# grant select,usage,update on gpload_audit_seq to mch;

GRANT

yueworld_db=# grant all on table t02 to mch;

GRANT

yueworld_db=#

 

 

 

去gpload服务器上,执行gpload:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[gpadmin@g01~]$ gpload -f g2.yml

2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30

2017-05-15 17:52:30|INFO|setting schema ‘public‘ for table ‘t02‘

2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30

2017-05-15 17:52:30|INFO|running time: 0.29 seconds

2017-05-15 17:52:30|INFO|rows Inserted          = 4

2017-05-15 17:52:30|INFO|rows Updated           = 0

2017-05-15 17:52:30|INFO|data formatting errors = 0

2017-05-15 17:52:30|INFO|gpload succeeded

[gpadmin@g01~]$

 

 

在greenplum服务器上查询

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

yueworld_db=# select * from public.t01_ext_1;;

 id | name

----+------

  1 | aaa

  2 | zhangsan

  3 | wanger

  4 | mazi

(8 rows)

 

yueworld_db=#

 

 

 

奇怪的是,当你再执行一次gpload -f y2.yml的时候,会持续往外部表里面写数据,而且是将原有的数据重新写入到t02里面去,执行第二次,rows inserted数量是8,等执行第三次的时候,这个rows Inserted会变成16,如下看执行过程:

[gpadmin@g01~]$ gpload -f g2.yml

2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30

2017-05-15 17:52:30|INFO|setting schema ‘public‘ for table ‘t02‘

2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30

2017-05-15 17:52:30|INFO|running time: 0.29 seconds

2017-05-15 17:52:30|INFO|rows Inserted          = 8

2017-05-15 17:52:30|INFO|rows Updated           = 0

2017-05-15 17:52:30|INFO|data formatting errors = 0

2017-05-15 17:52:30|INFO|gpload succeeded

[gpadmin@g01~]$

 

 

Bty:这里只测试了一个表t02,如果实际生产环境中有多个大表的话,可以编写多个yml文件,用gpload启动多进程同时进行数据高速加载。这里曾经做过测试,500w的原始数据,fil://xxx.com方式需要2个小时,直接insert into select …需要4个小时,而gpload结合gpfdist只需要20分钟。

 

参考文档:https://gpdb.docs.pivotal.io/4330/client_tool_guides/load/unix/gpload.html

 

GreenPlum 使用gpload通过gpfdist文件实现数据高速加载