Scaling Hadoop for Big Data One Bite at a Time -- How to Eat an Elephant
For decades, enterprises have been collecting data on customer preferences, product sales, and scores of other metrics. Making sense of all that data hasn't been easy, though. Only those with substantial software development capabilities have been able to truly benefit from collecting it. Think Google, Yahoo, Amazon, eBay, and Facebook. Fortunately, Hadoop has emerged as an accessible approach to data-intensive distributed applications, and many companies are having success with it. Companies of all sizes have started implementing Hadoop to extract value from their vast data stores. However, as with most leading-edge technologies, analyzing big data with Hadoop is fraught with challenges.
Challenges to Scaling
One of the greatest challenges a business faces is the constraint on human resources: data architects, engineers, and data scientists are in short supply. Companies like Facebook, which run the biggest Hadoop installations, have hundreds if not thousands of people they can assign to the project. Furthermore, because Hadoop is still a new and relatively immature platform, there is a dearth of tools designed for scaling projects. Right now, most traditional IT staff can't handle this type of work. But if Hadoop can't be incorporated into normal IT workflows, it will never become a mainstream platform. Still, the financial incentives for implementing and scaling Hadoop are too compelling to ignore.
Foremost among the financial considerations is the cost of licensing. As long as commercial software ties licensing fees to the amount of data it manages, and as long as the quantity of data continues to grow beyond the terabyte barrier, companies will be increasingly willing to embrace open-source software on commodity hardware as their fundamental data management model.
Hardware & Software Considerations
Regarding the hardware requirements for scaling, Hadoop is a clear winner. It scales predictably and dependably across ever-cheaper commodity servers and direct-attached disks. Google, for example, famously uses racks of commodity blade servers held together with Velcro straps to make service a snap.
And there are several emerging tool sets that help IT staff scale out Hadoop installations, typically at a fraction of the cost of a commercial database platform. Splunk, NetApp, and StackIQ, for example, are bringing tools to market that improve ease of use and performance.
Scaling problems are difficult and there isn't a “one size fits all” solution. The large web companies that have successfully scaled their systems are showing us the way. Sometimes, difficult scaling issues can be partially resolved by pushing some processing into the background. For example, if you run a matchmaking site, finding new, compatible matches for customers can be processed in the background instead of immediately serving the results. Facebook used massive background processing to solve its scalability issues. Background processing allocates machine power to the problem, and this is where Hadoop shines.
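The background-processing idea above can be sketched as a tiny map/reduce job. This is a hedged illustration only, not Facebook's pipeline or any production system: the record format, the function names, and the "shared interest" matching rule are assumptions made for the example. The same mapper/reducer shape could in principle run across a cluster via Hadoop Streaming, with results written out for the site to serve later.

```python
# A minimal local sketch of offline match-finding in Hadoop Streaming style:
# the mapper and reducer operate on tab-separated text records, the format
# a streaming job would pass on stdin/stdout. All names and the matching
# rule here are hypothetical, purely for illustration.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit one (interest, user) pair for each interest a user lists."""
    for line in lines:
        user, interests = line.strip().split("\t")
        for interest in interests.split(","):
            yield (interest, user)

def reducer(pairs):
    """Group users by interest; users sharing an interest are candidate matches."""
    for interest, group in groupby(sorted(pairs), key=itemgetter(0)):
        users = [user for _, user in group]
        if len(users) > 1:
            yield interest, users

# Hypothetical input: one user per line, tab-separated from a comma-delimited
# list of that user's interests.
records = [
    "alice\thiking,jazz",
    "bob\tjazz,chess",
    "carol\thiking,chess",
]
matches = dict(reducer(mapper(records)))
print(matches)
# {'chess': ['bob', 'carol'], 'hiking': ['alice', 'carol'], 'jazz': ['alice', 'bob']}
```

Because the job runs in the background, it can afford to scan the full user base on each pass; Hadoop's role is simply to spread that scan across many commodity machines.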
One thing is clear – Hadoop is here to stay. So now is a good time to figure out how to solve the scaling challenges that are sure to come. If you'd like to read more on the topic of scaling Hadoop, be sure to check out the latest research paper from GigaOM, "Scaling Hadoop Clusters: the role of cluster management," and learn how to:
- Install, patch and monitor a Hadoop cluster
- Handle application dependencies across the cluster
- Automate a wide range of tasks associated with cluster creation, management and maintenance
photo credit: Kyle McDonald via photo pin cc