HDFS fsck

Bad blocks in HDFS

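The other day, a virtual machine went down partway through an upgrade to CDH 5.4, and a large number of bad blocks appeared.
The Cloudera Manager health test reports 52 missing blocks.
(Screenshot: Cloudera QuickStart - hdfs - Cloudera Manager)
The NameNode web UI shows the following.
(Screenshot: Namenode information)
The missing blocks all belong to the Oozie sharelib, which was being uploaded at the moment of the crash, so the cause is clear. Since this is a good opportunity, let's also check the situation from the command line.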

hdfs fsck

First, a quick review of the available options.
[code]
[cloudera@quickstart ~]$ hdfs fsck
Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
<path>    start checking from this path
-move    move corrupted files to /lost+found
-delete    delete corrupted files
-files    print out files being checked
-openforwrite    print out files opened for write
-includeSnapshots    include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks    print out list of missing blocks and files they belong to
-blocks    print out block report
-locations    print out locations for every block
-racks    print out network topology for data-node locations
Please Note:
1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Generic options supported are
[/code]
Let's start by listing the corrupt blocks. The output shows that the bad files all sit under the oozie directory.
[code]
[cloudera@quickstart ~]$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
Connecting to namenode via http://quickstart.cloudera:50070
The list of corrupt files under path '/' are:
blk_1073750597    /user/oozie/share/lib/lib_20150426015844/sqoop/findbugs-annotations-1.3.9-1.jar
blk_1073750598    /user/oozie/share/lib/lib_20150426015844/sqoop/commons-io-2.4.jar
blk_1073750599    /user/oozie/share/lib/lib_20150426015844/sqoop/activation-1.1.jar
blk_1073750600    /user/oozie/share/lib/lib_20150426015844/sqoop/commons-compiler-2.7.6.jar
blk_1073750601    /user/oozie/share/lib/lib_20150426015844/sqoop/libthrift-0.9.2.jar
blk_1073750602    /user/oozie/share/lib/lib_20150426015844/sqoop/curator-client-2.7.1.jar
blk_1073750603    /user/oozie/share/lib/lib_20150426015844/sqoop/calcite-avatica-1.0.0-incubating.jar
blk_1073750604    /user/oozie/share/lib/lib_20150426015844/sqoop/curator-framework-2.6.0.jar
blk_1073750605    /user/oozie/share/lib/lib_20150426015844/sqoop/httpcore-4.2.5.jar
blk_1073750606    /user/oozie/share/lib/lib_20150426015844/sqoop/libfb303-0.9.2.jar
blk_1073750607    /user/oozie/share/lib/lib_20150426015844/sqoop/calcite-core-1.0.0-incubating.jar
blk_1073750608    /user/oozie/share/lib/lib_20150426015844/sqoop/parquet-avro.jar
blk_1073750609    /user/oozie/share/lib/lib_20150426015844/sqoop/avro-ipc-tests.jar
blk_1073750610    /user/oozie/share/lib/lib_20150426015844/sqoop/hive-shims-scheduler.jar
blk_1073750611    /user/oozie/share/lib/lib_20150426015844/sqoop/datanucleus-api-jdo-3.2.1.jar
blk_1073750612    /user/oozie/share/lib/lib_20150426015844/sqoop/parquet-encoding.jar
blk_1073750613    /user/oozie/share/lib/lib_20150426015844/sqoop/parquet-column.jar
blk_1073750614    /user/oozie/share/lib/lib_20150426015844/sqoop/apache-curator-2.6.0.pom
blk_1073750615    /user/oozie/share/lib/lib_20150426015844/sqoop/htrace-core-3.1.0-incubating.jar
blk_1073750616    /user/oozie/share/lib/lib_20150426015844/sqoop/geronimo-jta_1.1_spec-1.1.1.jar
blk_1073750617    /user/oozie/share/lib/lib_20150426015844/sqoop/hive-service.jar
blk_1073750618    /user/oozie/share/lib/lib_20150426015844/sqoop/asm-3.2.jar
blk_1073750619    /user/oozie/share/lib/lib_20150426015844/sqoop/commons-dbcp-1.4.jar
blk_1073750620    /user/oozie/share/lib/lib_20150426015844/sqoop/antlr-runtime-3.4.jar
blk_1073750621    /user/oozie/share/lib/lib_20150426015844/sqoop/sqoop.jar
blk_1073750622    /user/oozie/share/lib/lib_20150426015844/sqoop/calcite-linq4j-1.0.0-incubating.jar
blk_1073750623    /user/oozie/share/lib/lib_20150426015844/sqoop/janino-2.7.6.jar
blk_1073750624    /user/oozie/share/lib/lib_20150426015844/sqoop/hbase-annotations.jar
blk_1073750625    /user/oozie/share/lib/lib_20150426015844/sqoop/ST4-4.0.4.jar
blk_1073750626    /user/oozie/share/lib/lib_20150426015844/sqoop/snappy-java-1.0.4.1.jar
blk_1073750627    /user/oozie/share/lib/lib_20150426015844/sqoop/commons-compress-1.4.1.jar
blk_1073750628    /user/oozie/share/lib/lib_20150426015844/sqoop/hive-shims.jar
blk_1073750629    /user/oozie/share/lib/lib_20150426015844/sqoop/avro-mapred-hadoop2.jar
blk_1073750630    /user/oozie/share/lib/lib_20150426015844/sqoop/xz-1.0.jar
blk_1073750631    /user/oozie/share/lib/lib_20150426015844/sqoop/logredactor-1.0.2.jar
blk_1073750632    /user/oozie/share/lib/lib_20150426015844/sqoop/ant-1.8.1.jar
blk_1073750633    /user/oozie/share/lib/lib_20150426015844/sqoop/hive-exec.jar
blk_1073750634    /user/oozie/share/lib/lib_20150426015844/sqoop/hbase-common.jar
blk_1073750635    /user/oozie/share/lib/lib_20150426015844/hive/hive-metastore.jar
blk_1073750636    /user/oozie/share/lib/lib_20150426015844/hive/jta-1.1.jar
blk_1073750637    /user/oozie/share/lib/lib_20150426015844/hive/jpam-1.1.jar
blk_1073750638    /user/oozie/share/lib/lib_20150426015844/hive/hive-common.jar
blk_1073750639    /user/oozie/share/lib/lib_20150426015844/hive/hive-serde.jar
blk_1073750640    /user/oozie/share/lib/lib_20150426015844/hive/jersey-servlet-1.14.jar
blk_1073750641    /user/oozie/share/lib/lib_20150426015844/hive/curator-client-2.6.0.jar
blk_1073750642    /user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar
blk_1073750643    /user/oozie/share/lib/lib_20150426015844/hive/mail-1.4.jar
blk_1073750644    /user/oozie/share/lib/lib_20150426015844/hive/opencsv-2.3.jar
blk_1073750645    /user/oozie/share/lib/lib_20150426015844/hive/hive-shims-0.23.jar
blk_1073750646    /user/oozie/share/lib/lib_20150426015844/hive/geronimo-jaspic_1.0_spec-1.0.jar
blk_1073750595    /user/oozie/share/lib/lib_20150426015844/sqoop/geronimo-annotation_1.0_spec-1.1.1.jar
blk_1073750596    /user/oozie/share/lib/lib_20150426015844/sqoop/avro-ipc.jar
The filesystem under path '/' has 52 CORRUPT files
[cloudera@quickstart ~]$
[/code]
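With 52 entries, the list is easier to read once it is grouped by directory. The following is a minimal sketch and not part of the original session: `corrupt_by_dir` is a hypothetical helper that reads the `-list-corruptfileblocks` output on standard input, keeps only the lines that start with a block ID (`blk_`), and counts the files per parent directory. It assumes the two-column `blk_<id> <path>` layout shown above and paths that contain no whitespace.

```shell
# corrupt_by_dir: hypothetical helper (not part of Hadoop).
# Reads `hdfs fsck <path> -list-corruptfileblocks` output on stdin and
# prints "<count> <parent directory>", largest count first.
corrupt_by_dir() {
  awk '
    /^blk_/ {
      path = $2                      # second column is the HDFS file path
      sub("/[^/]*$", "", path)       # strip the file name, keep the parent dir
      if (path == "") path = "/"     # file sat directly under the root
      count[path]++
    }
    END {
      for (dir in count) print count[dir], dir
    }
  ' | sort -k1,1nr -k2
}

# Usage (assumes the same sudo setup as in the session above):
#   sudo -u hdfs hdfs fsck / -list-corruptfileblocks | corrupt_by_dir
```

Header and footer lines such as "Connecting to namenode" are ignored because they do not start with `blk_`.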
Next, let's run a plain hdfs fsck.
[code]
[cloudera@quickstart ~]$ sudo -u hdfs hdfs fsck /
Connecting to namenode via http://quickstart.cloudera:50070
FSCK started by hdfs (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 27 10:01:31 PDT 2015
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………
/user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar: CORRUPT blockpool BP-150411824-127.0.0.1-1418915217884 block blk_1073750642
/user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar: MISSING 1 blocks of total size 12302 B..
/user/oozie/share/lib/lib_20150426015844/hive/curator-client-2.6.0.jar: CORRUPT blockpool BP-150411824-127.0.0.1-1418915217884 block blk_1073750641
(略)
/user/oozie/share/lib/lib_20150426015844/sqoop/xz-1.0.jar: CORRUPT blockpool BP-150411824-127.0.0.1-1418915217884 block blk_1073750630
/user/oozie/share/lib/lib_20150426015844/sqoop/xz-1.0.jar: MISSING 1 blocks of total size 94672 B..Status: CORRUPT
Total size:    564578917 B
Total dirs:    1687
Total files:    620
Total symlinks:        0
Total blocks (validated):    603 (avg. block size 936283 B)
********************************
CORRUPT FILES:    52
MISSING BLOCKS:    52
MISSING SIZE:        41940094 B
CORRUPT BLOCKS:     52
********************************
Minimally replicated blocks:    551 (91.37645 %)
Over-replicated blocks:    0 (0.0 %)
Under-replicated blocks:    0 (0.0 %)
Mis-replicated blocks:        0 (0.0 %)
Default replication factor:    1
Average block replication:    0.91376454
Corrupt blocks:        52
Missing replicas:        0 (0.0 %)
Number of data-nodes:        1
Number of racks:        1
FSCK ended at Mon Apr 27 10:01:32 PDT 2015 in 367 milliseconds
The filesystem under path '/' is CORRUPT
[cloudera@quickstart ~]$
[/code]
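The summary counters can also be pulled out of a saved report for use in scripts. This is a rough sketch under stated assumptions: `fsck_field` is a hypothetical helper, not a Hadoop command, and it assumes the `Label:  value` layout shown in the summary above. It only handles integer counters, so a value such as the average block replication would be cut at the decimal point.

```shell
# fsck_field: hypothetical helper (not part of Hadoop).
# Usage: fsck_field "<label>" < saved-fsck-output
# Prints the first integer that follows "<label>:" in the fsck summary.
fsck_field() {
  awk -v label="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)              # tolerate leading indentation
      if (index(line, label ":") == 1) {    # literal, case-sensitive match
        rest = substr(line, length(label) + 2)
        if (match(rest, /[0-9]+/)) {
          print substr(rest, RSTART, RLENGTH)
          exit
        }
      }
    }
  '
}

# Usage (assumes the same sudo setup as in the session above):
#   sudo -u hdfs hdfs fsck / > fsck.out
#   fsck_field "MISSING BLOCKS" < fsck.out
#   fsck_field "Total blocks (validated)" < fsck.out
```

The numbers in the report above are consistent with each other: 603 validated blocks minus 52 missing blocks leaves the 551 minimally replicated blocks, which is the 91.37645 % that fsck prints.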

Recovery

There is no point in leaving things as they are, so let's move the files with bad blocks to /lost+found with hdfs fsck -move.
[code]
[cloudera@quickstart ~]$ sudo -u hdfs hdfs fsck / -move
Connecting to namenode via http://quickstart.cloudera:50070
FSCK started by hdfs (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 27 10:24:19 PDT 2015
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………..
/user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar: CORRUPT blockpool BP-150411824-127.0.0.1-1418915217884 block blk_1073750642
/user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar: MISSING 1 blocks of total size 12302 B..
/user/oozie/share/lib/lib_20150426015844/hive/curator-client-2.6.0.jar: CORRUPT blockpool BP-150411824-127.0.0.1-1418915217884 block blk_1073750641
/user/oozie/share/lib/lib_20150426015844/hive/curator-client-2.6.0.jar: MISSING 1 blocks of total size 67585 B..
(略)
[cloudera@quickstart ~]$ sudo -u hdfs hdfs dfs -ls -R /lost+found|head
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share/lib
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share/lib/lib_20150426015844
drwxr--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share/lib/lib_20150426015844/hive
drw-r--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share/lib/lib_20150426015844/hive/ant-launcher-1.8.1.jar
drw-r--r--   - hdfs supergroup          0 2015-04-27 10:31 /lost+found/user/oozie/share/lib/lib_20150426015844/hive/curator-client-2.6.0.jar
[/code]
Only .meta files were left behind as debris, so let's delete the corrupt files with hdfs fsck / -delete.
[code]
[cloudera@quickstart ~]$ sudo -u hdfs hdfs fsck / -delete
Connecting to namenode via http://quickstart.cloudera:50070
FSCK started by hdfs (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 27 10:38:57 PDT 2015
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
……………………………………………………………………………………….
………………………………………………………………Status: HEALTHY
Total size:    522638823 B
Total dirs:    1807
Total files:    572
Total symlinks:        0
Total blocks (validated):    551 (avg. block size 948527 B)
Minimally replicated blocks:    551 (100.0 %)
Over-replicated blocks:    0 (0.0 %)
Under-replicated blocks:    0 (0.0 %)
Mis-replicated blocks:        0 (0.0 %)
Default replication factor:    1
Average block replication:    1.0
Corrupt blocks:        0
Missing replicas:        0 (0.0 %)
Number of data-nodes:        1
Number of racks:        1
FSCK ended at Mon Apr 27 10:38:57 PDT 2015 in 102 milliseconds
The filesystem under path '/' is HEALTHY
[cloudera@quickstart ~]$
[/code]
And with that, HDFS is back to a healthy state.
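For a scripted check after a recovery like this one, the verdict line at the end of the fsck output can be tested directly. A minimal sketch, assuming the final line has the `The filesystem under path ... is HEALTHY` form seen in the session; `fsck_is_healthy` is a hypothetical helper name, not part of Hadoop.

```shell
# fsck_is_healthy: hypothetical helper (not part of Hadoop).
# Reads fsck output on stdin and succeeds (exit status 0) only if the
# verdict line reports HEALTHY; any other verdict yields a non-zero status.
fsck_is_healthy() {
  grep -q 'The filesystem under path .* is HEALTHY'
}

# Usage (assumes the same sudo setup as in the session above):
#   if sudo -u hdfs hdfs fsck / | fsck_is_healthy; then
#     echo "HDFS is healthy"
#   else
#     echo "HDFS still reports problems" >&2
#   fi
```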