Taking CDH5 Beta 1 for a Quick Spin!

CDH5 Beta 1 has been released!

CDH5 is one half of Cloudera 5 (CDH5 plus Cloudera Manager 5), which you could call the core of Cloudera's Enterprise Data Hub. CDH5 Beta 1 was announced today and has already appeared in Cloudera's repository, so I installed it right away.
Just a general impression, but Apache Hadoop 2.2 went GA only two weeks ago, in the middle of a rush of major releases (Hive 0.12, Pig 0.12, HBase 0.96, and so on), so it makes sense that CDH5 is being published as a beta at this point. (Shipping a GA release on that kind of timeline would mean lower quality, insufficient testing, and an unsupportable product.)

Note that Beta 1 does include Apache Hadoop 2.2, but Hive and Pig are still at 0.11 and HBase is at 0.95.2.

A Cloudera Demo VM will probably be published soon, but a festival is no fun unless you show up on the day itself, so I went ahead and downloaded and installed it. :)
I don't have time to evaluate Cloudera Manager 5 as well, so please give that one a try yourselves.

Installing CDH5 Beta 1

Adding the yum repository

Add the CDH5 repository as shown below. This time I'm installing on a CentOS 6.3 machine.

[root@localhost yum.repos.d]# cat /etc/yum.repos.d/Cloudera-cdh5.repo
[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
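
Before installing anything, it is worth a quick check that yum actually sees the new repository. A minimal sanity check, assuming the repo file above is in place, looks like this:

# refresh yum metadata and confirm the cloudera-cdh5 repo is visible
[root@localhost yum.repos.d]# yum clean all
[root@localhost yum.repos.d]# yum repolist enabled | grep -i cloudera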

CDH5 Beta 1 package listing

First, the obligatory step: I checked which packages are available with yum search.
The Cloudera Impala packages are included in CDH5. Incidentally, the Impala version is 1.2, which means UDFs (user-defined functions) and UDAFs should be usable starting with this release (a rough sketch follows the package info below). It's packed with features.

Cloudera Impala


[root@localhost yum.repos.d]# yum info impala
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Available Packages
Name : impala
Arch : x86_64
Version : 1.2.0+cdh5.0.0+0
Release : 0.cdh5b1.p0.81.el6
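
Since UDF/UDAF support is the headline feature of Impala 1.2, here is a rough sketch of what registering a Java UDF should look like once an Impala service is actually up and running (which this quick install does not cover). The JAR path, class name, and function below are hypothetical examples, not something shipped with the package:

# hypothetical example: register and call a Java UDF from a jar already uploaded to HDFS
$ impala-shell -q "CREATE FUNCTION my_upper(STRING) RETURNS STRING LOCATION '/user/kawasaki/udfs/my-udfs.jar' SYMBOL='com.example.MyUpperUdf'"
$ impala-shell -q "SELECT my_upper(name) FROM some_table LIMIT 10"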

 

HBase

HBase is 0.95.2; perhaps 0.96 didn't make it in time.

[root@localhost ~]# yum info hbase
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Available Packages
Name : hbase
Arch : x86_64
Version : 0.95.2+cdh5.0.0+272
Release : 0.cdh5b1.p0.37.el6

Hive

Hive is 0.11. (0.12 presumably didn't make Beta 1 either; it was released only two weeks ago, and there would be little point in force-packaging it.)

[root@localhost ~]# yum info hive
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Available Packages
Name : hive
Arch : noarch
Version : 0.11.0+cdh5.0.0+483
Release : 0.cdh5b1.p0.47.el6

Pig

Pig is also 0.11, presumably for the same reason. It may well be updated in CDH5 Beta 2 or later.

[root@localhost ~]# yum info pig
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Available Packages
Name : pig
Arch : noarch
Version : 0.11.0+cdh5.0.0+46
Release : 0.cdh5b1.p0.32.el6

 

Anything else noteworthy? MRv1?

Another package worth noting is the NFSv3 support.

HDFS NFS Gateway

hadoop-hdfs-nfs3.x86_64 : Hadoop HDFS NFS v3 gateway service
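
Both the gateway and the portmap service it needs are packaged (hadoop-hdfs-nfs3 and hadoop-hdfs-portmap, visible in the list below). I did not try it this time, but based on the upstream NFS gateway documentation, mounting HDFS over NFSv3 on the same host should go roughly like this (the mount point is arbitrary):

# untested sketch: start the NFSv3 gateway and mount HDFS locally
[root@localhost ~]# yum install hadoop-hdfs-nfs3 hadoop-hdfs-portmap
[root@localhost ~]# service hadoop-hdfs-portmap start
[root@localhost ~]# service hadoop-hdfs-nfs3 start
[root@localhost ~]# mkdir -p /mnt/hdfs
[root@localhost ~]# mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /mnt/hdfs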

Package list (notable entries)


[root@localhost yum.repos.d]# yum search hadoop
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
==================================================== N/S Matched: hadoop ====================================================
hadoop.x86_64 : Hadoop is a software platform for processing vast amounts of data
hadoop-0.20-conf-pseudo.x86_64 : Hadoop installation in pseudo-distributed mode with MRv1
hadoop-0.20-mapreduce.x86_64 : Hadoop is a software platform for processing vast amounts of data
hadoop-0.20-mapreduce-jobtracker.x86_64 : Hadoop JobTracker
hadoop-0.20-mapreduce-jobtrackerha.x86_64 : Hadoop JobTracker High Availability
hadoop-0.20-mapreduce-tasktracker.x86_64 : Hadoop Task Tracker
hadoop-0.20-mapreduce-zkfc.x86_64 : Hadoop MapReduce failover controller
hadoop-client.x86_64 : Hadoop client side dependencies
hadoop-conf-pseudo.x86_64 : Hadoop installation in pseudo-distributed mode
hadoop-debuginfo.x86_64 : Debug information for package hadoop
hadoop-doc.x86_64 : Hadoop Documentation
hadoop-hdfs.x86_64 : The Hadoop Distributed File System
hadoop-hdfs-datanode.x86_64 : Hadoop Data Node
hadoop-hdfs-journalnode.x86_64 : Hadoop HDFS JournalNode
hadoop-hdfs-namenode.x86_64 : The Hadoop namenode manages the block locations of HDFS files
hadoop-hdfs-nfs3.x86_64 : Hadoop HDFS NFS v3 gateway service
hadoop-hdfs-portmap.x86_64 : Hadoop HDFS Portmap service
hadoop-hdfs-secondarynamenode.x86_64 : Hadoop Secondary namenode
hadoop-hdfs-zkfc.x86_64 : Hadoop HDFS failover controller
hadoop-httpfs.x86_64 : HTTPFS for Hadoop
hadoop-libhdfs.x86_64 : Hadoop Filesystem Library
hadoop-mapreduce.x86_64 : The Hadoop MapReduce (MRv2)
hadoop-yarn.x86_64 : The Hadoop NextGen MapReduce (YARN)
flume-ng.noarch : Flume is a reliable, scalable, and manageable distributed log collection application for collecting data
: such as logs and delivering it to data stores such as Hadoop's HDFS.
hadoop-hdfs-fuse.x86_64 : Mountable HDFS
hadoop-mapreduce-historyserver.x86_64 : MapReduce History Server
hadoop-yarn-nodemanager.x86_64 : Yarn Node Manager
hadoop-yarn-proxyserver.x86_64 : Yarn Web Proxy
hadoop-yarn-resourcemanager.x86_64 : Yarn Resource Manager
hbase.x86_64 : HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This
: project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters
: of commodity hardware.
hbase-master.x86_64 : The Hadoop HBase master Server.
hbase-regionserver.x86_64 : The Hadoop HBase RegionServer server.
hbase-thrift.x86_64 : The Hadoop HBase Thrift Interface
hive.noarch : Hive is a data warehouse infrastructure built on top of Hadoop
hive-hcatalog.noarch : Apache Hcatalog is a data warehouse infrastructure built on top of Hadoop
hive-webhcat.noarch : WebHcat provides a REST-like web API for HCatalog and related Hadoop components.
hue-common.x86_64 : A browser-based desktop interface for Hadoop
hue-plugins.x86_64 : Hadoop plugins for Hue
impala.x86_64 : Application for executing real-time queries on top of Hadoop
oozie.noarch : Oozie is a system that runs workflows of Hadoop jobs.
parquet.noarch : A columnar storage format for Hadoop.
pig-udf-datafu.noarch : A collection of user-defined functions for Hadoop and Pig.
sqoop.noarch : Sqoop allows easy imports and exports of data sets between databases and the Hadoop Distributed File System
: (HDFS).
sqoop2.noarch : Tool for easy imports and exports of data sets between databases and the Hadoop ecosystem
zookeeper-server.noarch : The Hadoop Zookeeper server

Name and summary matches only, use "search all" for everything.

Cloudera Search (Solr)

Next, let's look at Cloudera Search (Solr).
As planned, this is also included in CDH.

[root@localhost yum.repos.d]# yum search solr
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
===================================================== N/S Matched: solr =====================================================
hbase-solr.noarch : Apache Solr is the popular, blazing fast open source enterprise search platform
hbase-solr-doc.noarch : Documentation for Apache Solr
hbase-solr-indexer.noarch : The Solr server
solr.noarch : Apache Solr is the popular, blazing fast open source enterprise search platform
solr-doc.noarch : Documentation for Apache Solr
solr-mapreduce.noarch : Solr mapreduce indexer
solr-server.noarch : The Solr server

Name and summary matches only, use "search all" for everything.
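
I did not set up Cloudera Search this time, but as an untested sketch, getting a Solr server going would presumably start along these lines; a full Cloudera Search setup also involves pointing it at a ZooKeeper ensemble (SOLR_ZK_ENSEMBLE in /etc/default/solr) and initializing the znode with solrctl, which is beyond this quick look:

# untested sketch: install the Solr server package and point it at ZooKeeper before starting
[root@localhost ~]# yum install solr-server
[root@localhost ~]# vi /etc/default/solr    # set SOLR_ZK_ENSEMBLE, e.g. localhost:2181/solr
[root@localhost ~]# service solr-server start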

Apache Spark

Getting carried away, I wondered whether Spark might be hiding in there somewhere... it was not.

[root@localhost yum.repos.d]# yum search spark
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Warning: No matches found for: spark
No Matches found

Installing in pseudo-distributed mode

Now for the obligatory pseudo-distributed install. Using the pseudo-distributed configuration package for MRv1 is the quickest way to stand up an environment. You can see that Hadoop is at 2.2.0. Since MRv1 is the 0.20 line, this setup can basically be regarded as the successor to CDH3 and CDH4.


[root@localhost yum.repos.d]# yum install hadoop-0.20-conf-pseudo
Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
* base: www.ftp.ne.jp
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package hadoop-0.20-conf-pseudo.x86_64 0:2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6 will be installed

<snip>

--> Finished Dependency Resolution


Dependencies Resolved


=============================================================================================================================
Package Arch Version Repository Size
=============================================================================================================================
Installing:
hadoop-0.20-conf-pseudo x86_64 2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6 cloudera-cdh5 8.0 k
Installing for dependencies:

<snip>

Verifying : hadoop-0.20-conf-pseudo-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 1/12
Verifying : bigtop-utils-0.6.0+cdh5.0.0+266-0.cdh5b1.p0.44.el6.noarch 2/12
Verifying : hadoop-hdfs-datanode-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 3/12
Verifying : hadoop-0.20-mapreduce-jobtracker-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 4/12
Verifying : hadoop-hdfs-secondarynamenode-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 5/12
Verifying : hadoop-hdfs-namenode-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 6/12
Verifying : hadoop-0.20-mapreduce-tasktracker-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 7/12
Verifying : hadoop-0.20-mapreduce-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 8/12
Verifying : bigtop-jsvc-0.6.0+cdh5.0.0+266-0.cdh5b1.p0.41.el6.x86_64 9/12
Verifying : hadoop-hdfs-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 10/12
Verifying : hadoop-2.2.0+cdh5.0.0+353-0.cdh5b1.p0.79.el6.x86_64 11/12
Verifying : zookeeper-3.4.5+cdh5.0.0+25-0.cdh5b1.p0.41.el6.noarch 12/12

<snip>

 

Formatting HDFS

[root@localhost yum.repos.d]# sudo -u hdfs hdfs namenode -format
13/10/29 05:57:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.2.0-cdh5.0.0-beta-1

<snip>

13/10/29 05:57:54 INFO common.Storage: Storage directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name has been successfully formatted.
13/10/29 05:57:54 INFO namenode.FSImage: Saving image file /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
13/10/29 05:57:54 INFO namenode.FSImage: Image file /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
13/10/29 05:57:54 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/10/29 05:57:54 INFO util.ExitUtil: Exiting with status 0
13/10/29 05:57:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

Starting HDFS

Start the NameNode, DataNode, and Secondary NameNode. jps confirms that they are all running.

[root@localhost yum.repos.d]# service hadoop-hdfs-namenode start
Starting Hadoop namenode: [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-localhost.localdomain.out
[root@localhost yum.repos.d]# service hadoop-hdfs-datanode start
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-localhost.localdomain.out
[root@localhost yum.repos.d]# service hadoop-hdfs-secondarynamenode start
Starting Hadoop secondarynamenode: [ OK ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-localhost.localdomain.out
[root@localhost yum.repos.d]#


[root@localhost yum.repos.d]# sudo jps
3255 NameNode
2942 DataNode
3029 SecondaryNameNode
3333 Jps
[root@localhost yum.repos.d]#
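
Besides jps, you can also ask the NameNode directly whether the DataNode has registered; the following should report one live DataNode:

# cluster summary from the NameNode (should list one live datanode)
[root@localhost yum.repos.d]# sudo -u hdfs hdfs dfsadmin -report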

Checking the NameNode Web UI

Access http://localhost:50070. It looks a little different from CDH4.

Figure 1: CDH5 Beta 1 NameNode Web UI ("Hadoop NameNode localhost-8020")

Figure 2: CDH4 NameNode Web UI ("NameNode (Active)")

Creating the required directories

Next, create the HDFS directories and permissions that MRv1 needs.

[root@localhost ~]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@localhost ~]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
[root@localhost ~]# sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@localhost ~]# sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@localhost ~]# sudo -u hdfs hadoop fs -chmod -R 1777 /var/lib/hadoop-hdfs/cache/mapred
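
To double-check the permissions before starting the MRv1 daemons, a recursive listing of the staging area is enough:

# verify the 1777 (sticky bit) permissions created above
[root@localhost ~]# sudo -u hdfs hadoop fs -ls -R /var/lib/hadoop-hdfs/cache/mapred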

Starting the JobTracker and TaskTracker

All of the daemons are up and running happily.

[root@localhost ~]# service hadoop-0.20-mapreduce-jobtracker start
Starting Hadoop jobtracker: [ OK ]
starting jobtracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-localhost.localdomain.out
[root@localhost ~]# service hadoop-0.20-mapreduce-tasktracker start
Starting Hadoop tasktracker: [ OK ]
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-localhost.localdomain.out
[root@localhost ~]#


[root@localhost ~]# sudo jps
32366 JobTracker
32437 Jps
3255 NameNode
32154 TaskTracker
2942 DataNode
3029 SecondaryNameNode

Checking the JobTracker Web UI

This Web UI appears to carry over from 0.20, so it hasn't changed.

Figure 3: JobTracker Web UI ("localhost Hadoop Map-Reduce Administration")

Running a job

Finally, run the obligatory sample job to verify that everything works.

[root@localhost ~]# sudo -u hdfs hadoop fs -mkdir -p /user/kawasaki
[root@localhost ~]# sudo -u hdfs hadoop fs -chown kawasaki /user/kawasaki

Switching users, I uploaded a file to HDFS and ran the sample job. It worked without any fuss whatsoever.

[kawasaki@localhost ~]$ hadoop fs -put /usr/share/dict/words words
[kawasaki@localhost ~]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount words output
13/10/29 06:20:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/29 06:20:52 INFO input.FileInputFormat: Total input paths to process : 1
13/10/29 06:20:53 INFO mapred.JobClient: Running job: job_201310290614_0001
13/10/29 06:20:54 INFO mapred.JobClient: map 0% reduce 0%
13/10/29 06:21:06 INFO mapred.JobClient: map 100% reduce 0%
13/10/29 06:21:13 INFO mapred.JobClient: map 100% reduce 100%
13/10/29 06:21:16 INFO mapred.JobClient: Job complete: job_201310290614_0001
13/10/29 06:21:16 INFO mapred.JobClient: Counters: 32
13/10/29 06:21:16 INFO mapred.JobClient: File System Counters
13/10/29 06:21:16 INFO mapred.JobClient: FILE: Number of bytes read=15665364
13/10/29 06:21:16 INFO mapred.JobClient: FILE: Number of bytes written=23905251
13/10/29 06:21:16 INFO mapred.JobClient: FILE: Number of read operations=0
13/10/29 06:21:16 INFO mapred.JobClient: FILE: Number of large read operations=0
13/10/29 06:21:16 INFO mapred.JobClient: FILE: Number of write operations=0
13/10/29 06:21:16 INFO mapred.JobClient: HDFS: Number of bytes read=4953805
13/10/29 06:21:16 INFO mapred.JobClient: HDFS: Number of bytes written=5913357
13/10/29 06:21:16 INFO mapred.JobClient: HDFS: Number of read operations=2
13/10/29 06:21:16 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/10/29 06:21:16 INFO mapred.JobClient: HDFS: Number of write operations=1
13/10/29 06:21:16 INFO mapred.JobClient: Job Counters
13/10/29 06:21:16 INFO mapred.JobClient: Launched map tasks=1
13/10/29 06:21:16 INFO mapred.JobClient: Launched reduce tasks=1
13/10/29 06:21:16 INFO mapred.JobClient: Data-local map tasks=1
13/10/29 06:21:16 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=13588
13/10/29 06:21:16 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=6560
13/10/29 06:21:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/29 06:21:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/29 06:21:16 INFO mapred.JobClient: Map-Reduce Framework
13/10/29 06:21:16 INFO mapred.JobClient: Map input records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Map output records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Map output bytes=6873015
13/10/29 06:21:16 INFO mapred.JobClient: Input split bytes=106
13/10/29 06:21:16 INFO mapred.JobClient: Combine input records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Combine output records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Reduce input groups=479829
13/10/29 06:21:16 INFO mapred.JobClient: Reduce shuffle bytes=7832679
13/10/29 06:21:16 INFO mapred.JobClient: Reduce input records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Reduce output records=479829
13/10/29 06:21:16 INFO mapred.JobClient: Spilled Records=1439487
13/10/29 06:21:16 INFO mapred.JobClient: CPU time spent (ms)=4540
13/10/29 06:21:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=314220544
13/10/29 06:21:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1243762688
13/10/29 06:21:16 INFO mapred.JobClient: Total committed heap usage (bytes)=147197952
[kawasaki@localhost ~]$ hadoop fs -tail output/part-r-00000
on 1
zygopterous 1
zygose 1
zygoses 1
zygosis 1
zygosities 1
zygosity 1
zygosperm 1
zygosphenal 1
zygosphene 1
zygosphere 1
zygosporange 1
zygosporangium 1
zygospore 1
zygosporic 1
zygosporophore 1
zygostyle 1
zygotactic 1
zygotaxis 1
zygote 1
zygotene 1
zygotenes 1
zygotes 1
zygotic 1
zygotically 1
zygotoblast 1
zygotoid 1
zygotomere 1
zygous 1
zygozoospore 1
zym- 1
zymase 1
zymases 1
zyme 1
zymes 1
zymic 1
zymin 1
zymite 1
zymo- 1
zymochemistry 1
zymogen 1
zymogene 1
zymogenes 1
zymogenesis 1
zymogenic 1
zymogenous 1
zymogens 1
zymogram 1
zymograms 1
zymoid 1
zymologic 1
zymological 1
zymologies 1
zymologist 1
zymology 1
zymolyis 1
zymolysis 1
zymolytic 1
zymome 1
zymometer 1
zymomin 1
zymophore 1
zymophoric 1
zymophosphate 1
zymophyte 1
zymoplastic 1
zymosan 1
zymosans 1
zymoscope 1
zymoses 1
zymosimeter 1
zymosis 1
zymosterol 1
zymosthenic 1
zymotechnic 1
zymotechnical 1
zymotechnics 1
zymotechny 1
zymotic 1
zymotically 1
zymotize 1
zymotoxic 1
zymurgies 1
zymurgy 1
zythem 1
zythum 1
zyzzyva 1
zyzzyvas 1
[kawasaki@localhost ~]$
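
If you want the whole result locally rather than just the tail, -getmerge concatenates the reducer output into a single local file (the local file name is arbitrary):

# pull the full wordcount result down to the local filesystem
[kawasaki@localhost ~]$ hadoop fs -getmerge output wordcount-result.txt
[kawasaki@localhost ~]$ wc -l wordcount-result.txt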

Summary

CDH5 Beta 1 is based on Hadoop 2.2.0 and includes Cloudera Impala 1.2, HBase 0.96 (hopefully), Hive 0.12 (hopefully), Pig 0.12 (hopefully), Hue 3.0, and more; I'd call it a natural evolution of CDH. The GA schedule for CDH5 will presumably be announced later, but if you want to try it early, I hope this post serves as a reference!

If I find the time, I'll try Hue 3.0 as well.
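
For reference, the Hue packages are in the same repository, so an (untested) quick start would presumably look something like the following; Hue listens on port 8888 by default:

# untested sketch: install and start Hue, then browse to http://localhost:8888/
[root@localhost ~]# yum install hue
[root@localhost ~]# service hue start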

 

 
