首页 > 代码库 > hbase0.96数据导入以及Kettle操作hbase问题

hbase0.96数据导入以及Kettle操作hbase问题

版本:

cdh5.0.0+hadoop2.3.0+hbase0.96.1.1+Spoon5.0.1

一、HBase数据导入

HBase数据导入使用org.apache.hadoop.hbase.mapreduce.ImportTsv 的两种方式,一种是直接导入,一种是转换为HFile,然后再次导入。

1. HDFS数据为(部分):

[root@node33 data]# hadoop fs -ls /input
Found 1 items
-rwxrwxrwx   1 hdfs supergroup   13245467 2014-05-01 17:09 /input/hbase-data.csv
[root@node33 data]# hadoop fs -cat /input/* | head -n 3
1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0,0,1
2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0,0,1
3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0,0,1

2. 使用直接导入的方式

a. 建立hbase-employees-1表,使用hbase shell,进入shell模式,使用命令:create ‘hbase-employees-1‘,‘col‘ ,建立表;

b. 进入hbase安装目录,如果使用cdh默认安装,一般在目录/usr/lib/hbase/bin中,运行:

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,col:x1,col:x2,col:x3,col:x4,col:x5,col:x6,col:x7,col:x8,col:x9,col:y hbase-employees-1 hdfs://node33:8020/input/hbase-data.csv

日志如下:

2014-05-02 13:15:07,716 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1398958404577_0018
2014-05-02 13:15:08,674 INFO  [main] impl.YarnClientImpl: Submitted application application_1398958404577_0018
2014-05-02 13:15:09,101 INFO  [main] mapreduce.Job: The url to track the job: http://node33:8088/proxy/application_1398958404577_0018/
2014-05-02 13:15:09,103 INFO  [main] mapreduce.Job: Running job: job_1398958404577_0018
2014-05-02 13:15:34,169 INFO  [main] mapreduce.Job: Job job_1398958404577_0018 running in uber mode : false
2014-05-02 13:15:34,207 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2014-05-02 13:16:32,789 INFO  [main] mapreduce.Job:  map 1% reduce 0%
2014-05-02 13:16:53,477 INFO  [main] mapreduce.Job:  map 5% reduce 0%
2014-05-02 13:16:56,701 INFO  [main] mapreduce.Job:  map 9% reduce 0%
2014-05-02 13:16:59,928 INFO  [main] mapreduce.Job:  map 13% reduce 0%
2014-05-02 13:17:02,970 INFO  [main] mapreduce.Job:  map 16% reduce 0%
2014-05-02 13:17:07,260 INFO  [main] mapreduce.Job:  map 22% reduce 0%
2014-05-02 13:17:10,472 INFO  [main] mapreduce.Job:  map 29% reduce 0%
2014-05-02 13:17:12,879 INFO  [main] mapreduce.Job:  map 36% reduce 0%
2014-05-02 13:17:16,555 INFO  [main] mapreduce.Job:  map 45% reduce 0%
2014-05-02 13:17:43,452 INFO  [main] mapreduce.Job:  map 48% reduce 0%
2014-05-02 13:17:45,629 INFO  [main] mapreduce.Job:  map 63% reduce 0%
2014-05-02 13:17:52,845 INFO  [main] mapreduce.Job:  map 79% reduce 0%
2014-05-02 13:17:55,862 INFO  [main] mapreduce.Job:  map 91% reduce 0%
2014-05-02 13:18:01,000 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2014-05-02 13:18:14,644 INFO  [main] mapreduce.Job: Job job_1398958404577_0018 completed successfully
2014-05-02 13:18:17,123 INFO  [main] mapreduce.Job: Counters: 31
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=115531
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=13245571
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=2
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=146367
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=146367
		Total vcore-seconds taken by all map tasks=146367
		Total megabyte-seconds taken by all map tasks=149879808
	Map-Reduce Framework
		Map input records=203500
		Map output records=203500
		Input split bytes=104
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=8595
		CPU time spent (ms)=11290
		Physical memory (bytes) snapshot=99639296
		Virtual memory (bytes) snapshot=647782400
		Total committed heap usage (bytes)=26517504
	ImportTsv
		Bad Lines=0
	File Input Format Counters 
		Bytes Read=13245467
	File Output Format Counters 
		Bytes Written=0


查看hbase表数据大小:

bash-4.1$ hadoop fs -count /hbase/data/default/hbase-employees-1
           6            4           60100796 /hbase/data/default/hbase-employees-1


大约57.3M数据,原始数据是12.3M的。耗时方面:从任务提交到任务结束耗时3分6秒。

3. 使用HFile方式:

a. 转换为HFile并新建hbase-employee表:

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=hdfs://node33:8020/output/hbase_bulk -Dimporttsv.columns=HBASE_ROW_KEY,col:x1,col:x2,col:x3,col:x4,col:x5,col:x6,col:x7,col:x8,col:x9,col:y hbase-employees hdfs://node33:8020/input/hbase-data.csv

这样运行后,可以查看HFile的大小:

hadoop fs -count /output/hbase_bulk

同时数据库中已经新建了hbase-employees表,但是表中没有数据。

b. 插入数据到表中

这里也有两种方式,但是这两种方式操作前都需要把相应目录的权限放开。简单来说可以这样:

bash-4.1$ hadoop fs -chmod -R 777 /output
bash-4.1$ hadoop fs -chmod -R 777 /hbase

(1)使用hbase的命令:

/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://node33:8020/output/hbase_bulk hbase-employees

ps:这个我没有试,官网说可以这样。
(2)使用hadoop jar方式:

hadoop jar /usr/lib/hbase/lib/hbase-server-0.96.1.1-cdh5.0.0.jar completebulkload hdfs://node33:8020/output/hbase_bulk hbase-employees


不过做这个前需要把hbase的相关包拷贝到hadoop的lib目录下,可以拷贝到/usr/lib/hadoop-mapreduce目录下,jar包如下:

-rw-r--r--. 1 root root  818519 May  2 10:35 hbase-client.jar
-rw-r--r--. 1 root root  360422 May  2 10:36 hbase-common.jar
-rw-r--r--. 1 root root   66898 May  2 10:36 hbase-hadoop2-compat.jar
-rw-r--r--. 1 root root    4071 May  2 10:36 hbase-it.jar
-rw-r--r--. 1 root root   87533 May  2 10:36 hbase-prefix-tree.jar
-rw-r--r--. 1 root root 3142247 May  2 10:36 hbase-protocol.jar
-rw-r--r--. 1 root root 3110607 May  2 10:36 hbase-server.jar
-rw-r--r--. 1 root root    4499 May  2 10:37 hbase-shell-0.96.1.1-cdh5.0.0.jar
-rw-r--r--. 1 root root 2270053 May  2 10:37 hbase-thrift-0.96.1.1-cdh5.0.0.jar


这种方式的log,如下:

2014-05-02 13:39:27,916 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1398958404577_0020
2014-05-02 13:39:28,408 INFO  [main] impl.YarnClientImpl: Submitted application application_1398958404577_0020
2014-05-02 13:39:28,552 INFO  [main] mapreduce.Job: The url to track the job: http://node33:8088/proxy/application_1398958404577_0020/
2014-05-02 13:39:28,553 INFO  [main] mapreduce.Job: Running job: job_1398958404577_0020
2014-05-02 13:39:44,376 INFO  [main] mapreduce.Job: Job job_1398958404577_0020 running in uber mode : false
2014-05-02 13:39:44,402 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2014-05-02 13:40:24,580 INFO  [main] mapreduce.Job:  map 67% reduce 0%
2014-05-02 13:40:38,125 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2014-05-02 13:41:09,295 INFO  [main] mapreduce.Job:  map 100% reduce 67%
2014-05-02 13:41:17,905 INFO  [main] mapreduce.Job:  map 100% reduce 77%
2014-05-02 13:41:20,928 INFO  [main] mapreduce.Job:  map 100% reduce 90%
2014-05-02 13:41:23,948 INFO  [main] mapreduce.Job:  map 100% reduce 100%
2014-05-02 13:41:29,514 INFO  [main] mapreduce.Job: Job job_1398958404577_0020 completed successfully
2014-05-02 13:41:31,530 INFO  [main] mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=48950370
		FILE: Number of bytes written=98519119
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=13245571
		HDFS: Number of bytes written=73861410
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=51857
		Total time spent by all reduces in occupied slots (ms)=42035
		Total time spent by all map tasks (ms)=51857
		Total time spent by all reduce tasks (ms)=42035
		Total vcore-seconds taken by all map tasks=51857
		Total vcore-seconds taken by all reduce tasks=42035
		Total megabyte-seconds taken by all map tasks=53101568
		Total megabyte-seconds taken by all reduce tasks=43043840
	Map-Reduce Framework
		Map input records=203500
		Map output records=203500
		Map output bytes=48339864
		Map output materialized bytes=48950370
		Input split bytes=104
		Combine input records=203500
		Combine output records=203500
		Reduce input groups=203500
		Reduce shuffle bytes=48950370
		Reduce input records=203500
		Reduce output records=2035000
		Spilled Records=407000
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=1573
		CPU time spent (ms)=26250
		Physical memory (bytes) snapshot=341180416
		Virtual memory (bytes) snapshot=1296404480
		Total committed heap usage (bytes)=152965120
	ImportTsv
		Bad Lines=0
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=13245467
	File Output Format Counters 
		Bytes Written=73861410


这个是转为HFile的job,插入数据到HBase表中2秒不到就完成了。

查看HBase中数据大小:

bash-4.1$ hadoop fs -count /hbase/data/default/hbase-employees
           5            3           73861752 /hbase/data/default/hbase-employees


一共大概70.4M数据,原始数据是12.3M,耗时为:2分2秒。

总结:HBase的数据导入都会把数据扩大,一般大概为5倍左右,同时使用HFile的方式,在时间上会有优势,但是数据会比较大。ps:这个仅仅是一次测试总结而已,实际情况可能不同。(为什么HBase的数据被扩大了?参考前篇,Hive的数据如果是直接导入,一般是不会扩大数据占用空间的,而且使用orc方式的时候还可以达到很好的压缩比,但是HBase却是扩大的,这是为什么?)

二、Kettle操作Hbase0.96

1. HBase Output组件,可以参考http://wiki.pentaho.com/display/BAD/Loading+Data+into+HBase 。

2. HBase Input组件,可以参考http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HBase+to+Load+an+RDBMS。

参考这两个操作步骤,都是可以得到正确结果的。这里只是说说遇到的问题。

(1)zookeeper和hbase-site.xml文件两者选择一个配置即可,配置两个可能有问题。

(2)mapping配置的是HBase表中Family的名称。

(3)如果遇到下面的问题:

java.lang.IllegalArgumentException: offset (0) + length (8) exceed the capacity of the array: 1
	at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:602)
	at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:580)
	at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:553)
	at org.pentaho.hbase.shim.common.CommonHBaseBytesUtil.toLong(CommonHBaseBytesUtil.java:87)
	at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeKeyValue(HBaseValueMeta.java:567)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:464)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281)
	at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60)
	at java.lang.Thread.run(Thread.java:724)


或者

Unknown type for column

	at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeColumnValue(HBaseValueMeta.java:769)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:495)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281)
	at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60)
	at java.lang.Thread.run(Thread.java:724)

或者

org.pentaho.di.core.exception.KettleException: 
Length of integer column value is not equal to the defined length of a short, int or long

	at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeColumnValue(HBaseValueMeta.java:711)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:495)
	at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281)
	at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60)
	at java.lang.Thread.run(Thread.java:724)


则把Mapping的时候的数据格式全部设置为String就可以了,这个是因为,如果HBase表中的数据是以其他方式导入的话,那么其存储是以String格式存储的(但是如果使用kettle的HBase output组件导入数据,那么数据格式可以调整),所以使用其他的数据格式就会报上面的错误。

 

分享,成长,快乐

转载请注明blog地址:http://blog.csdn.net/fansy1990