The word “automatic” has been around since the 1500s, but it really came to the fore in 1939. That’s when the New York World’s Fair sparked everyone’s imagination with visions of technology that promised to solve all of our problems through automation. Recently, while working with one of our customers, I was reminded how automation can still surprise people. Let me tell you what I mean.
A large credit card company recently asked us to participate in a “proof-of-concept” for their big data project. As a startup, we are always thrilled when one of the big boys wants to try out our wares, so we jumped at the opportunity.
When we arrived on site in their data center, they assigned a half-dozen machines for us to use. One would become the StackIQ Cluster Manager, and the other five would become cluster nodes running Hadoop. We are used to building clusters of all sizes using our software, and we knew that a small, straightforward installation like this one would be a cakewalk. We set about our task.
We set up a few parameters for the cluster, and launched the StackIQ Cluster Manager. It was soon up and running without a hitch, as expected.
Next, we used the Cluster Manager to install the cluster machines. Twenty minutes later, all five backend machines were up and running Hadoop services. Smooth. No problem. Expected.
It’s A Trap!
That’s when my colleague and I noticed that the customer’s IT people were whispering to each other, and we started to wonder if we’d done something wrong. We checked our screens and found that the cluster was indeed up and running, ready to accept Map/Reduce jobs.
So we took a deep breath and walked over to the gathered whisperers and asked if there was a problem. One of them asked in a hushed voice, “Um, how’d you guys do that?”
“Do what?” we answered.
“Bring up that one machine?” he said, pointing at one of the cluster servers.
After we explained that we hadn’t done anything special (we’d just let our Cluster Manager do its thing), the customer confessed, “We’ve been struggling to configure that machine for over two weeks now and haven’t been able to get it to install. There seemed to be something wrong with the configuration of the disk controller, but we haven’t been able to fix it.”
That’s the power of true automation. That’s what we designed our software to do. That’s what makes us very proud of the software we build. It takes the headaches out of setting up clustered infrastructure of any size by automating nearly everything — including configuring those pesky disk controllers.
What was a major problem for our customer — one they hadn’t been able to solve in weeks — wasn’t even a bump in the road for our cluster manager. It found the controller, configured it, and moved on to its next task. Smooth. No problem. Expected.
It can take as many as 80 manual steps to correctly configure a disk controller for use in a Big Data cluster, and clusters have a lot of disks — and controllers. We knew that we had to automate the configuration of all those disks to help cluster operators build their clusters efficiently. Automating the procedure dramatically reduces the time it takes to put a cluster into production.
Here’s how we do it. On first installation of a server, our software interacts with the disk controller to configure it optimally based on the node’s intended role. For example, if the machine is a Hadoop data node, the disk controller will be set to “JBOD mode,” with each disk presented as a single-disk RAID 0. If the machine is going to be a Cassandra data node, however, the data disks will be automatically configured as a RAID 10. This all happens automatically, with no manual steps, ensuring that all cluster nodes are optimally configured from the start.
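To make the role-to-RAID mapping concrete, here’s a minimal sketch of that kind of policy logic. This is purely illustrative, not our actual product code: the role names, the `RAID_POLICY` table, and the `plan_controller_config` helper are all invented for the example, and a real implementation would hand the resulting plan to a controller-specific tool.

```python
# Illustrative sketch only: map a node's intended role to a disk-controller
# layout, mirroring the policy described above (one single-disk RAID 0 per
# data disk for Hadoop data nodes, one RAID 10 for Cassandra data nodes).
RAID_POLICY = {
    "hadoop-data": {"mode": "jbod", "per_disk": "raid0"},
    "cassandra-data": {"mode": "array", "level": "raid10"},
}

def plan_controller_config(role, data_disks):
    """Return a list of (raid_level, disks) arrays to create on the controller."""
    policy = RAID_POLICY.get(role)
    if policy is None:
        raise ValueError(f"no disk policy defined for role {role!r}")
    if policy["mode"] == "jbod":
        # One single-disk RAID 0 per data disk.
        return [("raid0", [disk]) for disk in data_disks]
    # One array spanning all data disks.
    return [(policy["level"], list(data_disks))]

# A Hadoop data node with four data disks gets four single-disk RAID 0
# volumes; a Cassandra data node gets one RAID 10 across all four.
print(plan_controller_config("hadoop-data", ["sdb", "sdc", "sdd", "sde"]))
print(plan_controller_config("cassandra-data", ["sdb", "sdc", "sdd", "sde"]))
```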
The goal is a smooth configuration process. It’s just a bonus when we get to surprise and delight a customer who sees their cluster up and running after struggling for weeks on their own trying to solve a stubborn configuration problem.
Smooth. No problem. Expected.
Lately, it has become all the rage to declare Big Data dead, but much like reports of Mark Twain’s death in 1897, it’s an exaggeration. In Twain’s case, it turns out that it was his cousin, James Ross Clemens, who had taken ill and died. But word got around that it was Mark, and he was forced to clear things up in that famous yet often misquoted note. Since Big Data is in no position to defend itself, I will make an attempt on its behalf.
There’s no denying that Big Data has been the cause of much frenzy in the industry.
Vendors ascribe near-miraculous capabilities to their Big Data offerings, arguing that businesses need to buy their products to get an edge on their competitors, or just to keep from being buried under a mountain of incoming Big Data.
Analysts make spectacular predictions, with some forecasting a $47 billion market in 2017. While this may seem like hype, the fact is that the Big Data market has been exceeding earlier predictions. Wikibon says that the total Big Data market reached $11.4 billion in 2012, which was ahead of their own forecast made the previous year. So it seems the Big Data market isn’t all hype, after all.
Investors are heaping money upon anything remotely connected to Big Data, and that trend shows no signs of letting up. For instance, Accel Partners has set up a very large Big Data fund, saying, “We believe the future multi-billion dollar software companies will emerge from the Big Data ecosystem.”
Business owners scramble to acquire Big Data analysis capabilities in the hopes it will magically transform their business. Some are still reeling from the race to embrace the last wave of enterprise tech-mania — cloud computing — but they believe there’s opportunity in using Big Data for a competitive advantage, and they don’t want to miss out.
Often reality is much less exciting than the hype, but dismissing it as dead is as much of an exaggeration as the hyperbolic claims made by its proponents. So what is real?
It’s true that the term ‘Big Data’ is being diluted through overuse. Some might even call it abuse. But we shouldn’t let the terminology get in the way of seeing the underlying reality. For example, mobile computing is growing exponentially with wireless data networks expanding and getting faster all the time. That’s likely to continue. One result is that people are generating more data every day — both at work and at play. That data can be used to provide users with better service. The more you know about the likes, dislikes, habits and foibles of your users, the better you can serve them.
So while Big Data may not live up to all the hype, we can hardly fault it for that. Business owners who dismiss Big Data as mere hype do so at the risk of their company’s well-being. Data center administrators who ignore it may well be putting their jobs at risk.
Whether we like it or not, Big Data is here to stay. It may come under some other name. It may not live up to the more hyperbolic claims made about it. But it’s here nonetheless, and it has value. Rather than fall behind the curve, it’s better to embrace the opportunity. I like the way Steve Sarsfield, author of “Data Governance Imperative” put it:
The opportunity is for data management pros to think about their Big Data management strategy holistically and solve some of their old and tired issues around data management. It's pretty easy to draw a picture for management that Big Data needs to take a Total Data Management approach.
I say it’s best to accept the reality of it, and plan accordingly.
What do you think?
Have you ever gotten so immersed in a topic that you forgot that others might not be? For instance, you may have lapsed into jargon from your workplace while at a party and been met with that look that says, “Umm, I think I’ll go refresh my cocktail now.” I know I have. It turns out people have better things to do with their time than study whatever particular topic you think about all day long.
The same thing can happen when your company communicates with people. I don’t know what business you’re in, but I’ll bet the way you and your colleagues talk about it would baffle the uninitiated. I was recently reminded of the problems insider-speak can create as we were gearing up to start a new proof-of-concept project with a prospective customer.
Here’s what happened.
At StackIQ, we make software that builds clusters for big data from bare metal. By “bare metal” we mean machines that have no software on them at all. We use that term in our presentations, sales pitches, web site, and marketing collateral.
The reason our software provisions systems from bare metal stems from the philosophy our founders developed during their years building and maintaining clusters. They discovered that if you allow operators to apply patches and change configuration settings incrementally to various machines in the cluster, you eventually wind up with a system in an unknown state. That makes it very difficult to troubleshoot problems. Which machine is running which version of the OS? Which ones are at the current patch level? Which have yet to be updated? Were all the change logs updated — every time? Who knows?
The only way to know for sure what is running on all of the machines in your cluster is to install each of them from scratch (aka bare metal) using a known-good source. So we developed a system that does just that, and does it fast.
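To see why drift is such a headache, here’s the kind of ad-hoc audit an operator ends up writing by hand once a cluster has wandered into an unknown state. This is a sketch, not StackIQ tooling; the hostnames are placeholders and it assumes passwordless SSH to each node.

```python
#!/usr/bin/env python3
# Sketch: survey OS versions across cluster nodes to spot configuration
# drift. Assumes passwordless SSH; hostnames below are placeholders.
import subprocess

NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def os_release(host):
    """Return the PRETTY_NAME from /etc/os-release on a remote host."""
    result = subprocess.run(
        ["ssh", host, "grep ^PRETTY_NAME= /etc/os-release"],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        return "UNREACHABLE"
    # PRETTY_NAME="CentOS Linux 7 (Core)" -> CentOS Linux 7 (Core)
    return result.stdout.strip().split("=", 1)[-1].strip('"')

versions = {host: os_release(host) for host in NODES}
for host, version in versions.items():
    print(f"{host:10s} {version}")
if len(set(versions.values())) > 1:
    print("WARNING: nodes are not in a uniform, known state")
```

And that only covers the OS release; package levels, patches, and config files all drift the same way, which is exactly why we install from a known-good source instead of auditing after the fact.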
OK, back to our confused customer. We had given them our sales pitch, and they agreed to try out our software in their labs. When it came time to allocate some servers to the test, they asked us which operating system we wanted them to install. We explained (again) that it didn’t matter, since our software would install everything “from bare metal.” To which they responded, “Oh, OK. So we’ll leave the cluster nodes empty, and just install Linux on the management node.” “No need,” we explained, “we will install the entire cluster from bare metal, including the management node. There’s no need for you to install any software at all.”
Anyway, we got it all straightened out, and the customer gave us a set of bare machines to run our tests on.
Why was our customer confused? It wasn’t their fault. What we do is decidedly different from what others in our space do. Our competitors require that an OS and other software be in place before they begin their installation. They don’t operate from our “clean slate” philosophy. What’s more, the term “bare metal” is often used to mean something different in the IT community. For example, in the cloud computing space, “bare metal” describes a software stack that runs directly on the hardware rather than in a virtual machine. Even Wikipedia redirects a search for “bare metal” to an article on “bare machine.”
I took this incident as a reminder that we should never assume what others know. Everyone’s experience is different, and that experience gives them a unique perspective. So whether you’re a marketing professional, a sales professional, or a technologist, it’s always a good idea to check that people have understood your message, and adjust your language to make yourself clear.
Hmmmm, maybe I should go run a find/replace operation on our product information to replace “bare metal” with “bare machine”…
GigaOM caught up with StackIQ executive Tom Melzl during the Structure Data conference to get an update on the company. In the interview, Tom explains why cluster management is crucial to any successful big data project, and what differentiates StackIQ from its competitors. He also gives us a peek at the technology areas the company is focused on as they develop innovations for the future.
GigaOM talks to StackIQ's Tom Melzl (3:50)
Have you heard this story? A couple of MBA students were scoping out the local 24-hour convenience store and noticed an end cap that featured an odd pairing of products: diapers and beer. Huh? Turns out that someone crunched their customer behavior data deeply enough to figure out that when a bleary-eyed new father stumbles into the store late at night, diapers or beer were probably what he was after. By displaying these prominently on the front end of the aisles, the store was able to make the late-night shopper’s quarry easy to find.
If beer goes with diapers the way cookies go with milk, imagine what insights big data could bring to your business. Retail is right in the sweet spot to benefit most from big data projects. Some large retail organizations generate terabytes of data every minute. Inventory systems, loyalty cards, and sales transactions reveal exactly what was sold, when it was sold, and what other items were rung up in the same purchase.
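As a back-of-the-envelope illustration of how an analyst might surface a pairing like beer and diapers from transaction logs, here’s a tiny sketch that counts item co-occurrence and computes “lift,” i.e., how much more often two items appear together than chance alone would predict. The transactions below are made-up toy data.

```python
# Toy market-basket sketch: count how often two items are bought together
# and compare against what independent purchases would predict (the lift).
from collections import Counter
from itertools import combinations

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "cookies"},
    {"diapers", "wipes", "beer"},
    {"milk", "bread"},
]

n = len(transactions)
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(
    frozenset(pair)
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
)

for pair, count in pair_counts.most_common(3):
    a, b = tuple(pair)
    support = count / n                                   # P(a and b)
    lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
    print(f"{a} + {b}: support={support:.2f}, lift={lift:.2f}")
```

A lift well above 1.0 (as diapers + beer gets in this toy data) is the signal that tells a merchandiser the pairing is worth an end cap.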
So Much Data, So Many Ways to Use it
What’s happening to that data now? Much of it gets stored, and later used for financial analyses of various sorts. Increasingly, other departments are starting to dip into the data for their own purposes.
Human Resources departments are using big data to determine how many sales associates and other personnel to have on hand, and when. Hiring and staffing patterns will become more precise, contributing to the bottom line.
Buyers are leveraging the data in their negotiations with suppliers. The result? Fewer returns, fewer overstocks, fewer costly mistakes like all those leftover candy canes in the back of the shop months after the holiday season has come and gone.
Shelf placement is usually done by suppliers, but retailers can use the results of their big data analysis to help optimize that placement. Maybe those oversized boxes of laundry detergent ought to be on the middle shelf instead of the bottom. Better data means better sales, and both buyers and suppliers will like that.
Big data tools can help your marketing staff do better, faster research. In one store, they’ll put batteries on the end cap closest to the door. At another store, that’s where the bathroom tissue goes. Who’s right? Who’s wrong? What about the beer and diapers? With the right data, you can take the guesswork out of it. And while we’re at it, take a look at which coupons are working best and which ones never move anything. The possibilities for tweaking are nearly endless.
So, What Do You Need to Make it Happen?
Most retailers are choosing open source Apache Hadoop software running on low cost, commodity hardware for their big data projects.
Setting up and operating a big data cluster can be an intimidating proposition for IT departments used to working with more traditional enterprise data center resources such as email, web, and database servers. Big data clusters are different animals. Fortunately, the market has responded by providing good deployment and management tools. With the right tools, any IT department can deploy and manage big data clusters with confidence, even if they’ve never done it before.
Another benefit of working with a good vendor is that they are experts in the art of cluster management. You can draw on their years of experience building and running clusters of all sizes. Chances are pretty good they’ve already seen and solved any problem you run across.
So, are you ready to take the big data plunge? Start out on the right foot, and pretty soon you’ll move from big data beginner to petabyte-crunching pioneer.
Last week, Pat Gelsinger, CEO of VMware, opened a can of worms with his comments at his company’s partner confab in Las Vegas. Gelsinger is clearly concerned about enterprise computing workloads migrating to Amazon’s public cloud (AWS). Further, he states that those lost workloads are gone forever ("a workload goes to Amazon, you lose, and we have lost forever") and that "we want to own corporate workload." Gelsinger’s comments gave rise to several posts, tweets, and articles in the IT blogosphere, but what I found more interesting was the statement from VMware President and Chief Operating Officer Carl Eschenbach: "I look at this audience, and I look at VMware and the brand reputation we have in the enterprise, and I find it really hard to believe that we cannot collectively beat a company that sells books." Well, Carl, you should be concerned, because that measly bookseller is creating competitive advantage in IT faster than VMware and most every other IT vendor, and your predicament is exacerbated by the ossification of enterprise IT organizations, which cannot adequately react to the needs of the business.
Amazon became the dominant bookseller by driving its costs down rapidly while providing a very convenient, automated book-buying experience. Guess what? At AWS they're doing the same thing for computing: making it cheap and easy to consume. The fanatical AWS team is singularly focused on delivering needed solutions, at the lowest possible cost, that can be easily provisioned and managed by the user. Does this sound like the way IT vendors and enterprise IT organizations create and deliver new solutions that support the needs of their business users? Hardly. Vendors instead behave according to corporate edict, selling products and pushing services that don't create the best solution for the customer, while the enterprise IT organization remains comfortable in its cocoon of processes and standards. Is it any wonder that workloads migrate to AWS with or without IT approval?
So, will Amazon and its ilk win the enterprise workload war? No doubt some percentage of corporate computing is appropriate for the public cloud, and the mix will be determined over time by competitive markets; public cloud and enterprise IT are both viable. However, down the road, should enterprise IT be concerned that public clouds will completely dominate computing, with traditional solutions shrinking into oblivion, leaving CIOs with no more to do than cost accounting? They probably feel safe for now, but it's also clear that IT vendors and IT departments need to take heed of the cost, responsiveness (read: automation), and maniacal focus at AWS, lest that steamroller flatten them. Gelsinger and Eschenbach are half right: it’s not time to throw out the enterprise data center, but it is time to throw out the traditional enterprise IT playbook. StackIQ can help.
Joe Markee, CEO, StackIQ
Maybe you made a pitch to bring Hadoop into your organization. Or maybe your boss did. Either way, you’re on the hook to make the business case for it. We can’t say exactly what you’ll need, but at the very least you will want to find some compelling evidence that includes predictions, ROI estimates, market trend analysis, and a few successful case studies to back you up.
It wouldn’t hurt to have some technology intel about current and upcoming developments and innovations that people believe will provide the most positive impact for businesses adding Hadoop to their data analysis toolkit. Here are some things that may help you:
Business and Innovation Trend Predictions for Hadoop Solutions
The information technology experts at Hadoopsphere have a number of predictions about the future of Hadoop solutions in the enterprise.
Herb Cunitz, President of Hortonworks, makes a number of predictions there. Here are some you may find helpful:
- The term "Big Data" will diminish to simply "Data"
- Apache Hadoop solutions will emerge as a vertically-aligned solution
- The "right time" query of Apache Hadoop will become reality
- Big data ecosystems will expand
- More Hadoop startups will emerge
Forrester Research’s Mike Gualtieri focuses on software technology, platforms, and practices of interest to enterprise developers. He adds these predictions into the mix:
- Real-time architectures will become prominent
- Companies will focus on creating real-time analytics and predictive models
- Mobile capacity needs will become big drivers of innovation
- New "intelligent" applications will emerge for increased engagement and interactive touch-points for users
- Data analytics and business intelligence as-a-service will also expand
- The emergence of elastic big-cloud ecosystems behind the firewall with trusted third-party data center providers
Determining Return on Investments and Marketing Trends
Determining what your company’s ROI for its Hadoop investments will be can be tricky. It depends on the kind of data you will be operating on and the business you are in. For example, high-volume online retailers can probably expect higher returns from analyzing their customers’ shopping patterns than a meteorological company that integrates Hadoop into its weather pattern forecasts. But we’ve seen some companies make good ROI forecasts by comparing the capital outlay and operating costs of a high-capacity Hadoop cluster (built from commodity servers) to the cost of deploying and managing a purpose-built, dedicated data analysis system from one of the many traditional enterprise data equipment purveyors.
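For a rough feel of how that comparison works, here’s a sketch of the arithmetic. Every number in it is hypothetical and exists only to show the shape of the calculation; you would substitute your own vendor quotes and operating estimates.

```python
# Back-of-the-envelope TCO comparison: commodity Hadoop cluster vs. a
# purpose-built appliance. EVERY figure here is a made-up placeholder.

def three_year_tco(capex, annual_opex):
    """Total cost of ownership over a three-year horizon."""
    return capex + 3 * annual_opex

commodity = three_year_tco(capex=20 * 6_000, annual_opex=45_000)  # 20 commodity nodes
appliance = three_year_tco(capex=750_000, annual_opex=90_000)     # dedicated system

capacity_tb = 400  # usable capacity, assumed equal for both options
print(f"commodity: ${commodity / capacity_tb:,.0f} per usable TB over 3 years")
print(f"appliance: ${appliance / capacity_tb:,.0f} per usable TB over 3 years")
```

The useful output isn’t the dollar figures themselves but the per-terabyte comparison, which is the number most finance teams will actually ask for.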
Extracting critical information from your big data clusters, such as exactly what your infrastructure investment will deliver, means putting specialists on the job: the data scientists who make data transformation and advanced analytics their total focus.
A less tangible, but potentially more impactful, area to consider when estimating ROI is the potential for increased new business from better predicting buying trends. Or, conversely, the savings afforded by not introducing a new product line on a hunch when the data predicts the market isn’t ready for it yet. Using big data technology to develop your company’s next business strategy may prove very profitable.
The trend is to use big data analytics to get important information to the right place at exactly the right time. Companies are turning to models that provide complete transformations within the big data, cloud, and analytics enablement ecosystem. Getting the most from these new technologies might require that you engage an advisory team made up of seasoned big data executives or highly-skilled consultants.
Business Successes and New Developments
Businesses are being inundated with increasing amounts of data of every kind. Some have amassed terabytes—even petabytes—of new information.
- 12 terabytes of Tweets can be analyzed daily to determine product sentiment
- 350 billion annual meter readings can be put to use to better predict power consumption and distribution requirements
In some cases, real-time (or near real-time) analysis makes a difference. For example, a few minutes’ difference in the time it takes to detect fraudulent transactions in certain high-volume markets could be measured in millions of dollars (a toy sketch of such a latency-sensitive check follows the figures below). While not designed explicitly for real-time analysis, Hadoop has made strides toward something close to real time in recent releases. As both Cunitz and Gualtieri said, you can expect that trend to continue. Already, some financial organizations need to handle a lot of data in little time. For example:
- 5 million trade events can be generated daily for examination to identify possible fraud
- 500 million daily call detail records can be examined in real time to predict customer churn more quickly
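Here’s that toy sketch: it flags a card the moment its spend within a short sliding window crosses a threshold, which is exactly the kind of check where shaving minutes off detection latency saves money. The window size, threshold, and events are all invented for illustration.

```python
# Toy sliding-window fraud check: flag a card when its spend within a
# 10-minute window exceeds a threshold. All values are invented.
from collections import defaultdict, deque

WINDOW_SECONDS = 600
THRESHOLD = 5_000.00

windows = defaultdict(deque)  # card_id -> deque of (timestamp, amount)

def observe(card_id, timestamp, amount):
    """Record a transaction; return True if the card should be flagged."""
    q = windows[card_id]
    q.append((timestamp, amount))
    # Drop transactions that have fallen out of the window.
    while q and q[0][0] < timestamp - WINDOW_SECONDS:
        q.popleft()
    return sum(amt for _, amt in q) > THRESHOLD

# Three rapid charges on the same card trip the flag on the third one.
events = [("card-42", 0, 1_900.0), ("card-42", 120, 2_000.0), ("card-42", 300, 1_500.0)]
for card, ts, amt in events:
    print(card, ts, "FLAG" if observe(card, ts, amt) else "ok")
```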
Make it Manageable
One factor often overlooked when making the case for Hadoop in the enterprise is the cost – both in money and in personnel – of designing, deploying, and operating the IT infrastructure it runs on. Taking a proactive approach to the nuts and bolts of running that Hadoop cluster you’re about to deploy can contribute a lot to your overall ROI. Save yourself time, money, and effort by choosing a Hadoop cluster management system that will meet your needs as they grow over time.
The rising tide of big data has become nearly overwhelming for some retailers as they search for new ways to analyze all that data. They’re drowning in signals from social media, review sites, competitors, market trends, and very detailed customer behavior data. Big data is a seminal component of omnichannel marketing efforts, but one-third of retailers are in the dark about their data.
And there’s plenty of it. Ninety percent of the world’s data has been created in the past two years, thanks in part to mobile subscriptions, Facebook devotees, and Twitter enthusiasts. With responsive web design becoming the norm, prospective shoppers are finding it easier than ever to look at a store’s site, either before shopping or while they’re in the store.
What’s All That Data Good For?
So the challenge isn’t so much getting the data as how best to use it. According to one study, 51 percent of retailers surveyed say that a lack of data sharing is an obstacle to measuring marketing return on investment. But many are putting systems in place to surface the data from its various hiding places across the organization and put it to good use. There are several use cases for big data analytics in retail, such as:
- predicting customer purchases
- customer micro-segmentation
- cross selling & upselling
- location-based marketing
- supply chain & logistics optimization
- retail fraud detection
Controlling retail fraud is a big goal for any retailer, since fraud has a negative impact on net revenue. Analyzing customer behavior data can help spot fraud early and stem the loss. Cross-selling, which uses aggregate customer behavior analysis to suggest additional products that match a customer’s needs and budget, is also proving to be a productive use of big data in retail. The more you know about your customers, the better you can serve them, which leads to better returns – and happier customers.
Retailers Blazing a Trail into Big Data Territory
One retailer of note that is taking full advantage of the benefits of big data is Walmart. They use big data to optimize their inventory based on the regional and seasonal preferences of customers in each geographic area they serve. Additionally, they use big data to implement an in-store mobile navigation system that alerts customers to sales based on their preferences and location in the store. They loop in social media information too, using data from a customer’s Facebook profile to recommend products of potential interest.
Williams-Sonoma has also embraced the power of big data. They collect repositories of customer purchase data, clicks, demographics, and web browsing history, and use the information to create predictive models for each product and customer. This allows for targeted email offers, which drive higher conversion rates.
There’s Gold in All That Data
Expect forward-thinking retailers to focus their big data initiatives on improving store operations, supply chains, ecommerce, marketing, and merchandising, with the most value expected from investments in merchandising. A 2011 McKinsey & Company study estimates that retailers using big data to its optimal potential could increase their operating margins by more than 60 percent. Could big data be the next gold rush?
Recent developments in the design, deployment, and management of Big Data applications are making it easier for insurance companies to harvest the benefits. These improved big data implementation tools provide integrated Hadoop and cluster management solutions.
Fully Configured, Full Stack Clusters
Rather than cobble a solution together from a hodge-podge of open source and commercial components, data center managers are turning to fully configured Hadoop clusters that are vendor supported. This makes it easier to deploy new analytics applications across the enterprise, and do it in far less time than using manual methods.
The insurance industry is particularly well positioned to capitalize on Big Data applications, due to the large volume of data it has access to. Insurers that can efficiently analyze demographics, trends, competitor details, and social and values-driven information can gain an advantage over competitors with less effective analytics. Hadoop is proving to be a valuable tool in the efficient analysis of this data. It makes it easy for corporate data scientists to crunch huge volumes of information and extract insights that can enhance marketing, sales, operations, and underwriting.
Simpler installation and operation
The “engine” that drives that analysis is an Apache Hadoop cluster, which can be complex and time-consuming to operate. Fortunately, clusters no longer require teams of administrators to install, configure, and maintain.
Until recently, data center managers had to rely on a variety of tools to handle deployment, scaling, configuration changes, and other operational concerns. Often, this required them to write, test, and debug scores of new software scripts. Employing modern, automated management solutions can eliminate this time-consuming and error-prone procedure.
Are you staying ahead of the curve?
With only 20% of today's insurance companies making use of these new applications, there is plenty of room for early adopters to leverage the technology to their advantage.
We’ve seen a number of situations where an integrated, full-stack Hadoop solution is helping insurers make the most of their large data sets. Are you using Hadoop for Big Data analytics in your business? What are the areas with the most pressing needs? How are you using Hadoop?
Our markets are changing quickly due to technological evolution and huge economic shifts. Changes on this scale transform the way markets operate. For the insurance industry, this affects the type of products offered, how they are marketed and advertised, and how new risks such as fraud are assessed, determined, and discovered.
New analytical models have been developed to keep pace with the scope of these changes. This has required actuaries to readdress pricing policies and underwriting, as well as how these analytical models are designed.
In today's insurance industry, companies have an abundance of new varieties of both data and opportunity. It is easier than ever before to control costs, counter risks and threats, increase revenue, and more.
Recent news reports confirm the opportunities that come from the ability to manage huge volumes of demographic data, psychographic data, claims trends, and other product- and risk-related data. These capabilities can change management approaches, marketing strategies, product designs, and other processes, resulting in far more efficient and effective claims processing and profit-margin expansion.
What does this mean for IT?
Changes in infrastructure are required when applying new analytical tools. Traditional databases have distinctly different architectures and query tools. The technology for massive data clusters used to be confined to the fields of scientific research, national security, and big oil. But today these applications run on thousands of clustered servers instead of the supercomputers of days gone by.
Technology research council studies indicate that, at best, only 20% of insurance industry companies have made the infrastructure investment required to support these massive new application abilities.
There are likely many reasons for this, but one is likely a general underestimation of what large scale deployment of analytics entails. As insurance industry leaders embrace the power of high-quality, Big Data analytics, these insights will change the industry. The advantage will go to the early adopters thanks to the insights they gain from improved analytics. Of course, acting on this information is key.
Some Insurers have already jumped into Big Data
Leveraging of this information is already taking place. For example, MetLife uses big data cluster applications to analyze hundreds of terabytes in order to discover data patterns that help reduce risk. They search for trends and product performance patterns as well. Travelers is using big data cluster applications to rationalize product lines from new acquisitions, and to understand geopolitical influences and developments globally.
Both Progressive and Capital One have been conducting experiments in customer segmentation. Big data clusters are needed to properly tailor products once the information has been assessed. Special offers are also crafted based on this new information, drawn from deeper and more detailed customer profiles.
In past generations, personal relationships with clients and their communities gave companies access to much of this information. With decentralized relationships and virtual access points, insurers now have access to layers of new data sources, giving them the ability to build statistical models that are unprecedented in delivering key information.
What might the future hold?
Satellite-powered property assessments, weather pattern information, and access to regional employment statistics are only the tip of this massively advantageous analytical iceberg.