Hadoop Management: How to Recover From Hadoop Cluster Failures
There are three major reasons Hadoop is so popular in commercial IT: It can handle big data, it's open source, and it's very robust. But anything built by man is destined to fail eventually, however infrequently, and that includes Hadoop clusters. Now what?
To understand how to recover a failed cluster, it's important to understand why it fails. Hadoop may be very robust, but it has one big, well-documented single point of failure: the NameNode. The NameNode stores the metadata, the data about the data, including file names, owners, permissions and the like. If the NameNode crashes, the data still exists, but without that map there is no way to use it. Hadoop 1.0 doesn't support automatic recovery if the NameNode fails. The Hadoop management staff at Yahoo!, which runs one of the largest Hadoop implementations, has learned that NameNodes typically fail for three reasons: misconfiguration, network issues and misbehaving clients. Hardware failure sits lower on the list.
But in the event of failure, all is not lost. The NameNode's data is persisted in two files, the fsimage and the edit log, and the Hadoop management staff keeps two copies of each on separate hard disks. So when the worst happens, the IT staff has several options for recovering a bad cluster.
Better End-of-File Validation
The big problem with edit log files is that they are padded with extra space at the end. This padding makes it hard to tell where the valid transactions stop, because a reader runs into the padding before any true end-of-file condition. One solution is to locate the OP_INVALID opcode that marks the end of the written edits; since it is a single byte value, it's highly unlikely that random corruption would reproduce it exactly where the valid data ends. Another, more reliable way to find the end of the file is to locate the last transaction ID written to it. Finding that last transaction means the end of the file can be verified.
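To make that concrete, here is a minimal sketch in Java of the first approach: scanning backwards from the end of an edit log for the point where the padding stops. It assumes the padding is a run of OP_INVALID bytes (0xFF in classic HDFS); the file name is whatever log you point it at, and a real recovery tool would go on to verify the last transaction ID rather than trust the padding scan alone.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: find where valid data ends in an edit log by skipping the
    // OP_INVALID padding at the tail. Assumes the padding is a run of 0xFF
    // bytes; a production tool would also verify the last transaction ID.
    public class EditLogTail {
        private static final int OP_INVALID = 0xFF; // assumed padding byte

        // Returns the offset just past the last non-padding byte.
        static long findEndOfValidData(RandomAccessFile log) throws IOException {
            long pos = log.length();
            while (pos > 0) {
                log.seek(pos - 1);
                if (log.read() != OP_INVALID) {
                    return pos; // the last valid byte sits at pos - 1
                }
                pos--;
            }
            return 0; // the file is nothing but padding
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile log = new RandomAccessFile(args[0], "r")) {
                System.out.println("Valid data ends at byte " + findEndOfValidData(log));
            }
        }
    }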
Fsck for Corrupt Blocks
Old, reliable fsck examines offline disks, checks the file structure and repairs what it can. HDFS ships with its own fsck, which finds corrupt blocks and can move or delete the files that contain them. The problem with this method is that it works only on data, not metadata, and the traditional tool works only on offline disks.
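For reference, a typical session looks something like this; the paths are illustrative, and the switches shown (-files, -blocks, -locations, -move, -delete) are the standard HDFS fsck options:

    hadoop fsck /                                      # overall health report
    hadoop fsck /user/data -files -blocks -locations   # per-file block detail
    hadoop fsck /user/data -move                       # move corrupt files to /lost+found
    hadoop fsck /user/data -delete                     # or delete them outright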
Manual NameNode Metadata Recovery
If the Hadoop management staff configures HDFS correctly, it protects the metadata better than a local file system would, because it stores multiple copies of the metadata on separate disks and machines across the storage architecture.
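In practice, that protection comes from pointing the NameNode at more than one metadata directory, so the fsimage and edit log are written to each of them. A minimal hdfs-site.xml sketch, with illustrative paths (the last entry would typically be an NFS mount on a separate machine):

    <property>
      <!-- dfs.namenode.name.dir in Hadoop 2.x and later -->
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>

If every copy is damaged, later Hadoop releases also ship a manual recovery mode, started with hadoop namenode -recover, that prompts the operator through salvaging whatever it can from a corrupt edit log.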
Another point of Hadoop cluster failure is MapReduce, whose re-execution mechanism is supposed to improve reliability. For each map task it allocates a few backup attempts, which run only when that task fails. This sounds good, but the hitch is that the number of backups must be configured by hand based on the reliability of the system. IT has to make a bet that straddles the risk of not allocating enough attempts against the cost of allocating too many.
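As a sketch of what that hand configuration looks like with the classic mapred API (the attempt counts here are illustrative, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    // Sketch: tuning MapReduce's re-execution by hand. Raising the attempt
    // counts buys reliability at the cost of cluster time on flaky jobs.
    public class RetryTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setMaxMapAttempts(4);    // retries per map task before the job fails
            conf.setMaxReduceAttempts(4); // retries per reduce task
            System.out.println("max map attempts = " + conf.getMaxMapAttempts());
        }
    }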
All the methods described above require time and skilled staff. Regarding time, financial applications can't afford to go offline; no processing means no business. Regarding Hadoop management staff, the hours spent hand-tooling the fixes described above are hours that could have been spent on application development, for example. The other alternative is to use a commercial system. Whatever such a package costs, the overall Hadoop implementation will still be dramatically less expensive than an end-to-end commercial system, in both dollars and staff.