
Trying out the Cloudera Quickstart VM 5.7

Last time I finished upgrading the Cloudera Quickstart VM. Then I happened to notice that version 5.7 was already out on the official site.. orz…
http://www.cloudera.com/downloads/quickstart_vms/5-7.html
Either the VM upgraded last time or this 5.7 VM should give the same results, but since 5.7 is available I downloaded it and set up a fresh virtual machine.
Note: Japanese localization, starting Cloudera Express, and switching to parcels were done the same way as last time.

Starting the HDFS, YARN, and Spark services

From the Cloudera Manager menu, start the HDFS, YARN, and Spark services: simply click the ▼ to the right of each service and select Start. You could start every service, but this time I started only these three as the minimum needed.

Word count with MapReduce v2

To verify the setup, run a word count on YARN (MRv2). Copy some convenient test data (/etc/services) to HDFS and run the word count example that ships with CDH.
[code]
[cloudera@quickstart ~]$ hdfs dfs -put /etc/services
[cloudera@quickstart ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount services mr.out
16/05/02 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/02 18:41:05 INFO input.FileInputFormat: Total input paths to process : 1
16/05/02 18:41:05 INFO mapreduce.JobSubmitter: number of splits:1
16/05/02 18:41:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462180462476_0001
16/05/02 18:41:06 INFO impl.YarnClientImpl: Submitted application application_1462180462476_0001
16/05/02 18:41:06 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1462180462476_0001/
16/05/02 18:41:06 INFO mapreduce.Job: Running job: job_1462180462476_0001
16/05/02 18:41:12 INFO mapreduce.Job: Job job_1462180462476_0001 running in uber mode : false
16/05/02 18:41:12 INFO mapreduce.Job:  map 0% reduce 0%
16/05/02 18:41:17 INFO mapreduce.Job:  map 100% reduce 0%
16/05/02 18:41:23 INFO mapreduce.Job:  map 100% reduce 100%
16/05/02 18:41:24 INFO mapreduce.Job: Job job_1462180462476_0001 completed successfully
16/05/02 18:41:24 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=143653
FILE: Number of bytes written=522127
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=641139
HDFS: Number of bytes written=236073
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=388608
Total time spent by all reduces in occupied slots (ms)=437376
Total time spent by all map tasks (ms)=3036
Total time spent by all reduce tasks (ms)=3417
Total vcore-seconds taken by all map tasks=3036
Total vcore-seconds taken by all reduce tasks=3417
Total megabyte-seconds taken by all map tasks=388608
Total megabyte-seconds taken by all reduce tasks=437376
Map-Reduce Framework
Map input records=10774
Map output records=58108
Map output bytes=645717
Map output materialized bytes=143647
Input split bytes=119
Combine input records=58108
Combine output records=21848
Reduce input groups=21848
Reduce shuffle bytes=143647
Reduce input records=21848
Reduce output records=21848
Spilled Records=43696
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=99
CPU time spent (ms)=3000
Physical memory (bytes) snapshot=274423808
Virtual memory (bytes) snapshot=1484644352
Total committed heap usage (bytes)=92798976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=641020
File Output Format Counters
Bytes Written=236073
[cloudera@quickstart ~]$
[/code]
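The counters above trace the classic map → combine → reduce flow: the mapper emits one (word, 1) pair per token (Map output records=58108), the combiner pre-aggregates within each map task (Combine output records=21848), and the reducer sums whatever remains. A rough local sketch of that flow in Python (purely illustrative, not the actual Hadoop example code):

```python
from collections import Counter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def combine_or_reduce(pairs):
    """Combiner/reducer: sum the counts for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Tiny stand-in for /etc/services lines
lines = ["http 80/tcp", "http 80/udp", "ssh 22/tcp"]
pairs = list(map_phase(lines))      # 6 (word, 1) pairs, like "Map output records"
result = combine_or_reduce(pairs)   # distinct words, like "Reduce output records"
print(result["http"])               # 2
```

In the real job the combiner runs the same summing logic as the reducer, which is why Combine output records, Reduce input records, and Reduce output records are all 21848.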
The results can also be checked from Cloudera Manager.
Select YARN Applications to view the results.
Clicking the application ID opens the logs in the History Server.

Word count with Spark

Next, run the word count with Spark (Spark on YARN).
Running it as-is produces an error, so first fix the permissions as described in the documentation below:
http://www.cloudera.com/documentation/enterprise/latest/topics/admin_spark_history_server.html
[code]
$ sudo -u hdfs hadoop fs -chown -R spark:spark /user/spark
$ sudo -u hdfs hadoop fs -chmod 1777 /user/spark/applicationHistory
[/code]
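The 1777 mode here is the same scheme as /tmp: world-writable (777) plus the sticky bit (the leading 1), so every application user can write history files into the directory, but only a file's owner can delete or rename it. The meaning of those bits can be confirmed with Python's stat module (this just decodes the mode; it is not an HDFS call):

```python
import stat

mode = 0o1777
# The trailing 't' in the rendered mode string is the sticky bit
print(stat.filemode(stat.S_IFDIR | mode))   # drwxrwxrwt
assert mode & stat.S_ISVTX                  # sticky bit is set
assert mode & 0o777 == 0o777                # rwx for owner, group, and others
```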
[code]
[cloudera@quickstart ~]$ spark-submit --master yarn-client --class org.apache.spark.examples.JavaWordCount /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar services
16/05/02 20:01:05 INFO spark.SparkContext: Running Spark version 1.6.0
16/05/02 20:01:05 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.2.131 instead (on interface eth1)
16/05/02 20:01:05 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/05/02 20:01:05 INFO spark.SecurityManager: Changing view acls to: cloudera
16/05/02 20:01:05 INFO spark.SecurityManager: Changing modify acls to: cloudera
16/05/02 20:01:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
16/05/02 20:01:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 49380.
16/05/02 20:01:06 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/05/02 20:01:06 INFO Remoting: Starting remoting
16/05/02 20:01:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.2.131:56388]
16/05/02 20:01:06 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem@192.168.2.131:56388]
16/05/02 20:01:06 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 56388.
16/05/02 20:01:06 INFO spark.SparkEnv: Registering MapOutputTracker
16/05/02 20:01:06 INFO spark.SparkEnv: Registering BlockManagerMaster
16/05/02 20:01:06 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-5e96d027-5a84-461f-a27e-a81e5c09cc64
16/05/02 20:01:06 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
16/05/02 20:01:06 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/05/02 20:01:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/05/02 20:01:06 INFO ui.SparkUI: Started SparkUI at http://192.168.2.131:4040
16/05/02 20:01:06 INFO spark.SparkContext: Added JAR file:/opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar at spark://192.168.2.131:49380/jars/spark-examples-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar with timestamp 1462186866770
16/05/02 20:01:06 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/02 20:01:07 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/05/02 20:01:07 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2816 MB per container)
16/05/02 20:01:07 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/05/02 20:01:07 INFO yarn.Client: Setting up container launch context for our AM
16/05/02 20:01:07 INFO yarn.Client: Setting up the launch environment for our AM container
16/05/02 20:01:07 INFO yarn.Client: Preparing resources for our AM container
16/05/02 20:01:07 INFO yarn.Client: Uploading resource file:/tmp/spark-ac48b21e-750d-4f85-b85e-c10bec879bd0/__spark_conf__1667711018625762819.zip -> hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1462180462476_0006/__spark_conf__1667711018625762819.zip
16/05/02 20:01:07 INFO spark.SecurityManager: Changing view acls to: cloudera
16/05/02 20:01:07 INFO spark.SecurityManager: Changing modify acls to: cloudera
16/05/02 20:01:07 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
16/05/02 20:01:07 INFO yarn.Client: Submitting application 6 to ResourceManager
16/05/02 20:01:07 INFO impl.YarnClientImpl: Submitted application application_1462180462476_0006
16/05/02 20:01:08 INFO yarn.Client: Application report for application_1462180462476_0006 (state: ACCEPTED)
16/05/02 20:01:08 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.cloudera
start time: 1462186867889
final status: UNDEFINED
tracking URL: http://quickstart.cloudera:8088/proxy/application_1462180462476_0006/
user: cloudera
16/05/02 20:01:09 INFO yarn.Client: Application report for application_1462180462476_0006 (state: ACCEPTED)
16/05/02 20:01:10 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/05/02 20:01:10 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> quickstart.cloudera, PROXY_URI_BASES -> http://quickstart.cloudera:8088/proxy/application_1462180462476_0006), /proxy/application_1462180462476_0006
16/05/02 20:01:10 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/05/02 20:01:10 INFO yarn.Client: Application report for application_1462180462476_0006 (state: RUNNING)
16/05/02 20:01:10 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.2.131
ApplicationMaster RPC port: 0
queue: root.cloudera
start time: 1462186867889
final status: UNDEFINED
tracking URL: http://quickstart.cloudera:8088/proxy/application_1462180462476_0006/
user: cloudera
16/05/02 20:01:10 INFO cluster.YarnClientSchedulerBackend: Application application_1462180462476_0006 has started running.
16/05/02 20:01:10 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46785.
16/05/02 20:01:10 INFO netty.NettyBlockTransferService: Server created on 46785
16/05/02 20:01:10 INFO storage.BlockManager: external shuffle service port = 7337
16/05/02 20:01:10 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/05/02 20:01:10 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.2.131:46785 with 530.3 MB RAM, BlockManagerId(driver, 192.168.2.131, 46785)
16/05/02 20:01:10 INFO storage.BlockManagerMaster: Registered BlockManager
16/05/02 20:01:11 INFO scheduler.EventLoggingListener: Logging events to hdfs://quickstart.cloudera:8020/user/spark/applicationHistory/application_1462180462476_0006
16/05/02 20:01:11 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
16/05/02 20:01:11 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 192.6 KB, free 192.6 KB)
16/05/02 20:01:11 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.6 KB, free 215.2 KB)
16/05/02 20:01:11 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.2.131:46785 (size: 22.6 KB, free: 530.3 MB)
16/05/02 20:01:11 INFO spark.SparkContext: Created broadcast 0 from textFile at JavaWordCount.java:45
16/05/02 20:01:11 INFO mapred.FileInputFormat: Total input paths to process : 1
16/05/02 20:01:11 INFO spark.SparkContext: Starting job: collect at JavaWordCount.java:68
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Registering RDD 3 (mapToPair at JavaWordCount.java:54)
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Got job 0 (collect at JavaWordCount.java:68) with 1 output partitions
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at JavaWordCount.java:68)
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at JavaWordCount.java:54), which has no missing parents
16/05/02 20:01:12 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 220.0 KB)
16/05/02 20:01:12 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 222.6 KB)
16/05/02 20:01:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.131:46785 (size: 2.6 KB, free: 530.3 MB)
16/05/02 20:01:12 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/05/02 20:01:12 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at JavaWordCount.java:54)
16/05/02 20:01:12 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
16/05/02 20:01:13 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/05/02 20:01:15 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.2.131:34111) with ID 1
16/05/02 20:01:15 INFO spark.ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16/05/02 20:01:15 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.2.131:53134 with 530.3 MB RAM, BlockManagerId(1, 192.168.2.131, 53134)
16/05/02 20:01:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.2.131, partition 0,RACK_LOCAL, 2242 bytes)
16/05/02 20:01:16 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.131:53134 (size: 2.6 KB, free: 530.3 MB)
16/05/02 20:01:16 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.2.131:53134 (size: 22.6 KB, free: 530.3 MB)
16/05/02 20:01:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2268 ms on 192.168.2.131 (1/1)
16/05/02 20:01:18 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (mapToPair at JavaWordCount.java:54) finished in 6.109 s
16/05/02 20:01:18 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/05/02 20:01:18 INFO scheduler.DAGScheduler: running: Set()
16/05/02 20:01:18 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/05/02 20:01:18 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/05/02 20:01:18 INFO scheduler.DAGScheduler: failed: Set()
16/05/02 20:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:61), which has no missing parents
16/05/02 20:01:18 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 225.5 KB)
16/05/02 20:01:18 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1734.0 B, free 227.2 KB)
16/05/02 20:01:18 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.2.131:46785 (size: 1734.0 B, free: 530.3 MB)
16/05/02 20:01:18 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/05/02 20:01:18 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at JavaWordCount.java:61)
16/05/02 20:01:18 INFO cluster.YarnScheduler: Adding task set 1.0 with 1 tasks
16/05/02 20:01:18 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.2.131, partition 0,NODE_LOCAL, 1991 bytes)
16/05/02 20:01:18 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.2.131:53134 (size: 1734.0 B, free: 530.3 MB)
16/05/02 20:01:18 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.2.131:34111
16/05/02 20:01:18 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 140 bytes
16/05/02 20:01:19 INFO scheduler.DAGScheduler: ResultStage 1 (collect at JavaWordCount.java:68) finished in 0.833 s
16/05/02 20:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 832 ms on 192.168.2.131 (1/1)
16/05/02 20:01:19 INFO cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/05/02 20:01:19 INFO scheduler.DAGScheduler: Job 0 finished: collect at JavaWordCount.java:68, took 7.062183 s
1691/udp: 1
bcs-lmserver: 4
3876/udp: 1
JDataStore: 2
GCM: 2
Physical: 2
10008/udp: 1
3393/udp: 1
3505/udp: 1
RDMA: 3
dpsi: 2
secure: 15
Multiplex: 4
Bitforest: 1
1113/udp: 1
NETX: 4
3667/tcp: 1
(output truncated)
[/code]
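As the log shows, JavaWordCount builds the RDD pipeline flatMap (split each line on spaces, JavaWordCount.java:45–54) → mapToPair to (word, 1) pairs → reduceByKey to sum the counts (line 61), then collects and prints "word: count" lines like those above. The same pipeline sketched in plain Python, with no Spark required (illustrative only):

```python
def word_count(lines):
    # flatMap: split each line into words
    words = [w for line in lines for w in line.split(" ") if w]
    # mapToPair: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = {}
    for w, n in pairs:
        counts[w] = counts.get(w, 0) + n
    return counts

# Same "word: count" output format as the Spark example
for word, n in word_count(["secure shell", "secure web"]).items():
    print(f"{word}: {n}")
```

In Spark the reduceByKey step triggers the shuffle, which is why the log shows a ShuffleMapStage (the flatMap/mapToPair work) followed by a ResultStage for the final collect.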
As with MapReduce, the results can be checked from Cloudera Manager.
Click the application ID,
then open the Spark History Server logs from YARN's ResourceManager UI.
HDFS, YARN, MapReduce, and Spark all appear to be working, at least at a basic level.
