Cogroup in Hive

I am reviewing Programming Hive (2)

The cogroup sample appears to be a partial copy of https://cwiki.apache.org/Hive/tutorial.html#Tutorial-CoGroups with some modifications, but several spots were only half edited, so it cannot run as printed.
Chapter 14 Calculating Cogroups (verified on CDH4.1.2)
Sample data (cog.txt.1)

1 100 "2013-03-29"
2 102 "2013-03-15"
3 200 "2013-03-11"
4 213 "2013-02-29"
5 134 "2013-01-29"
6 189 "2013-03-30"

Sample data (cog.txt.2)

1 3 "2013-03-19"
2 102 "2013-03-15"
3 200 "2013-03-11"
4 213 "2013-02-29"
5 8 "2013-01-29"
6 21 "2013-03-30"
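The tables used in the run are declared with FIELDS TERMINATED BY '\t', so the two sample files need tabs between columns even though the listings here show spaces. A minimal sketch for writing them follows; the helper name write_tsv is hypothetical, and the rows are copied from the listings above.

```python
# Rows copied from the sample listings; the quotes around the dates are
# part of the data and are kept as-is.
ORDER_ROWS = [
    (1, 100, '"2013-03-29"'),
    (2, 102, '"2013-03-15"'),
    (3, 200, '"2013-03-11"'),
    (4, 213, '"2013-02-29"'),
    (5, 134, '"2013-01-29"'),
    (6, 189, '"2013-03-30"'),
]
CLICK_ROWS = [
    (1, 3, '"2013-03-19"'),
    (2, 102, '"2013-03-15"'),
    (3, 200, '"2013-03-11"'),
    (4, 213, '"2013-02-29"'),
    (5, 8, '"2013-01-29"'),
    (6, 21, '"2013-03-30"'),
]


def write_tsv(path, rows):
    """Write rows to path, one per line, with tab-separated columns."""
    with open(path, "w") as f:
        for row in rows:
            f.write("\t".join(str(v) for v in row) + "\n")


if __name__ == "__main__":
    write_tsv("cog.txt.1", ORDER_ROWS)
    write_tsv("cog.txt.2", CLICK_ROWS)
```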

Example run (the book gives no concrete reduce_script, so /bin/cat should do as a stand-in for now; otherwise the script would have to be shipped through the distributed cache)

hive> CREATE TABLE order_log (userid INT, orderid int, ts STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> CREATE TABLE clicks_log (userid INT, id INT, ts STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> CREATE TABLE log_analysis (uid INT, id int, reduced_val STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> load data local inpath 'cog.txt.1' INTO TABLE order_log;
hive> load data local inpath 'cog.txt.2' INTO TABLE clicks_log;
hive> FROM (
    >   FROM (
    >     FROM order_log ol
    >     -- User Id, order Id, and timestamp:
    >     SELECT ol.userid AS uid, ol.orderid AS id, ol.ts AS ts
    >
    >     UNION ALL
    >
    >     FROM clicks_log cl
    >     SELECT cl.userid AS uid, cl.id AS id, cl.ts AS ts
    >   ) union_msgs
    >   SELECT union_msgs.uid, union_msgs.id, union_msgs.ts
    >   CLUSTER BY union_msgs.uid, union_msgs.ts) mapout
    > INSERT OVERWRITE TABLE log_analysis
    > SELECT TRANSFORM(mapout.uid, mapout.id, mapout.ts) USING 'reduce_script' AS (uid, id, reduced_val);

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-03-30 01:00:49,616 Stage-1 map = 0%, reduce = 0%
2013-03-30 01:00:53,626 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 0.74 sec
2013-03-30 01:00:54,631 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-03-30 01:00:55,635 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-03-30 01:00:56,638 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.36 sec
2013-03-30 01:00:57,644 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.36 sec
MapReduce Total cumulative CPU time: 2 seconds 360 msec
Ended Job = job_201303152339_0086
Loading data to table default.log_analysis
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/log_analysis
12 Rows loaded to log_analysis
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 2.36 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 360 msec
OK
Time taken: 10.534 seconds
hive> select * from log_analysis;
OK
1 3 "2013-03-19"
1 100 "2013-03-29"
2 102 "2013-03-15"
2 102 "2013-03-15"
3 200 "2013-03-11"
3 200 "2013-03-11"
4 213 "2013-02-29"
4 213 "2013-02-29"
5 8 "2013-01-29"
5 134 "2013-01-29"
6 189 "2013-03-30"
6 21 "2013-03-30"
Time taken: 0.065 seconds
hive>
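
Since the book gives no concrete reduce_script, here is a minimal sketch of what one could look like. It is not the book's script. It assumes tab-separated (uid, id, ts) rows on stdin and that all rows for a given uid arrive consecutively, as they do in the single-reducer run above. The output columns (uid, row count, comma-joined ids) are a hypothetical choice made only to fit the three-column AS (uid, id, reduced_val) clause.

```python
import sys
from itertools import groupby


def reduce_rows(lines):
    """Group consecutive tab-separated (uid, id, ts) rows by uid.

    Yields one tab-separated output row per uid: the uid, the number of
    rows seen for it, and the ids joined with commas (a hypothetical
    definition of reduced_val).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for uid, group in groupby(rows, key=lambda r: r[0]):
        ids = [r[1] for r in group]
        yield "\t".join([uid, str(len(ids)), ",".join(ids)])


if __name__ == "__main__":
    # Hive streams the TRANSFORM input on stdin and reads rows from stdout.
    for out in reduce_rows(sys.stdin):
        print(out)
```

To try it in place of /bin/cat, one would presumably register the file with ADD FILE (the file name reduce_script.py is an assumption) and call it as USING 'python reduce_script.py', which is the distributed-cache step mentioned earlier.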