Preserving MapReduce intermediate data

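Intermediate data generated while a MapReduce job runs is deleted when the job finishes. This removes data that is no longer needed and keeps it from eating up disk space.
For debugging, however, you may sometimes want to keep the intermediate data. This post introduces two ways to do that on CDH3.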
1. Keep files only on failure

keep.failed.task.files

Setting this parameter to true keeps the intermediate data of failed tasks from being deleted.
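
As a rough sketch of how this parameter might be passed for a single job, the shell function below wraps the same "hadoop jar ... -D" form used in the run example in this post. The function name run_keep_failed and the HADOOP_BIN override are hypothetical names introduced for illustration, and the sketch assumes the job's driver parses generic options such as -D (for example through ToolRunner).

```shell
# Sketch only: run a job with keep.failed.task.files=true.
# run_keep_failed and HADOOP_BIN are hypothetical names for this example;
# HADOOP_BIN defaults to the real hadoop command.
# Assumes the driver class parses generic options such as -D.
run_keep_failed() {
  jar=$1; main_class=$2; input=$3; output=$4
  "${HADOOP_BIN:-hadoop}" jar "$jar" "$main_class" \
    -D keep.failed.task.files=true "$input" "$output"
}

# Usage (not executed here):
#   run_keep_failed wc.jar WordCount shakespeare output
```

Setting HADOOP_BIN=echo prints the arguments instead of submitting the job, which is a cheap way to check the command line first.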
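2. Keep files based on a pattern

keep.task.files.pattern

Set this parameter to a regular expression such as ".*_m_0000.*". Files belonging to tasks whose names match the pattern are not deleted.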
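
To get a feel for which tasks a pattern selects before running a job, the check can be approximated locally. The sketch below is not Hadoop's own code: matches_keep_pattern is a hypothetical helper, it uses grep's POSIX extended regular expressions rather than Java's regex engine, and it assumes the pattern has to match the whole task attempt name. The two attempt names are taken from the listing in this post.

```shell
# Sketch only: approximate the keep.task.files.pattern check with grep.
# matches_keep_pattern is a hypothetical helper; grep -E uses POSIX extended
# regular expressions, which agree with Java regex for simple patterns like this.
matches_keep_pattern() {
  # -x requires the whole task name to match, -q suppresses output
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

pattern='.*_m_0000.*'
for task in attempt_201208081946_0009_m_000000_0 \
            attempt_201208081946_0009_r_000000_0; do
  if matches_keep_pattern "$pattern" "$task"; then
    echo "keep:   $task"
  else
    echo "delete: $task"
  fi
done
```

With this pattern the map attempt is reported as kept and the reduce attempt as deleted, because only the map attempt name contains "_m_0000".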
Note that from CDH4 onward the parameter names change as follows:
keep.failed.task.files -> mapreduce.task.files.preserve.failedtasks
keep.task.files.pattern -> mapreduce.task.files.preserve.filepattern
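
If the same scripts have to work against both releases, the two renames listed above could be captured in a small lookup. This is only a sketch: cdh4_param_name is a hypothetical helper name, and it knows nothing beyond the two parameters named in this post.

```shell
# Sketch only: translate a CDH3 parameter name to the CDH4 name listed above.
# cdh4_param_name is a hypothetical helper; names it does not know are
# printed unchanged.
cdh4_param_name() {
  case "$1" in
    keep.failed.task.files) echo mapreduce.task.files.preserve.failedtasks ;;
    keep.task.files.pattern) echo mapreduce.task.files.preserve.filepattern ;;
    *) echo "$1" ;;
  esac
}

cdh4_param_name keep.task.files.pattern
```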


*** Example run ***
Example: set keep.task.files.pattern=".*_m_0000.*".
$ hadoop jar wc.jar WordCount -D keep.task.files.pattern=".*_m_0000.*" shakespeare output
12/08/15 05:25:11 WARN snappy.LoadSnappy: Snappy native library is available
12/08/15 05:25:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/15 05:25:11 INFO snappy.LoadSnappy: Snappy native library loaded
12/08/15 05:25:11 INFO mapred.FileInputFormat: Total input paths to process : 8
12/08/15 05:25:12 INFO mapred.JobClient: Running job: job_201208081946_0008
12/08/15 05:25:13 INFO mapred.JobClient: map 0% reduce 0%
12/08/15 05:25:22 INFO mapred.JobClient: map 21% reduce 0%
12/08/15 05:25:23 INFO mapred.JobClient: map 22% reduce 0%
12/08/15 05:25:24 INFO mapred.JobClient: map 25% reduce 0%
12/08/15 05:25:28 INFO mapred.JobClient: map 37% reduce 0%
12/08/15 05:25:32 INFO mapred.JobClient: map 50% reduce 0%
:
(omitted)
[root@localhost mapreduce]# find . |grep job_201208081946_0009|more
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_r_000000_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_r_000000_0/taskjvm.sh
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000003_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000003_0/taskjvm.sh
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000002_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000002_0/taskjvm.sh
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/job.xml
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/.job.xml.crc
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000004_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000004_0/taskjvm.sh
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000005_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000005_0/taskjvm.sh
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000000_0
./local1/ttprivate/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000000_0/taskjvm.sh
./local1/taskTracker/training/jobcache/job_201208081946_0009
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/split.info
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/.split.info.crc
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training/jobcache
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training/jobcache/job_201208081946_0009
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/split.info
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000007_0/.split.info.crc
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/job.xml
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/.job.xml.crc
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/output
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/output/file.out.index
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/output/file.out
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training/jobcache
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training/jobcache/job_201208081946_0009
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/split.info
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000006_0/.split.info.crc
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000001_0
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000001_0/split.info
./local1/taskTracker/training/jobcache/job_201208081946_0009/attempt_201208081946_0009_m_000001_0/job.xml
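
The find/grep step above prints every path under the job directory, which gets long quickly. As a narrower sketch, the function below lists only the attempt directories for one job. list_kept_attempts is a hypothetical helper name, and it assumes attempt directories follow the attempt_<job number>_... naming seen in the listing.

```shell
# Sketch only: list the attempt directories left under a TaskTracker local dir.
# list_kept_attempts is a hypothetical helper. Its arguments are the directory
# to search and a job ID such as job_201208081946_0009.
list_kept_attempts() {
  root=$1
  job_id=$2
  # Attempt directories are named attempt_<job number>_..., so drop the "job_" prefix.
  find "$root" -type d -name "attempt_${job_id#job_}_*" | sort -u
}

# Usage (not executed here):
#   list_kept_attempts . job_201208081946_0009
```

Because attempt directories can appear nested inside one another, as in the listing, the same attempt name may show up under more than one path.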

Comments

  1. kernel023 kawasaki says:

    If you need to keep intermediate data on CDH4, see the following page.
    Preserving MapReduce intermediate data (2)
    https://linux.wwing.net/WordPress/?p=222