Big Data 101: Some Big Data technologies you should know about
Big Data is the new buzz phrase in database computing. It refers to very large data sets, often containing tens of millions of records or more and measured in terabytes or petabytes. However, Big Data is not defined solely by the size of the data set. The term is usually applied to data that is unstructured, or to sets that mix structured and unstructured data. The rise of Big Data has spurred advances in analytics, cluster management, and methods of storage and access. Listed here are a few of the key Big Data technologies. While this list is by no means exhaustive, it may help you get started if you are new to Big Data.
Hadoop
This open source framework from the Apache Software Foundation is based on ideas from Google’s MapReduce and Google File System papers. Hadoop distributes massive amounts of data across large clusters of commodity servers that both store and process it, and it scales out by adding more servers. With Hadoop, you can run applications on thousands of nodes, processing petabytes of data. Hadoop uses a distributed file system, HDFS, which lets it read data from across the cluster at high throughput.
MapReduce
This software framework enables developers to write programs that process and generate enormous amounts of data in parallel across a large cluster of computers. MapReduce was developed by Google to increase the efficiency of indexing web pages. It excels at performing calculations on very large data sets: a job is split into pieces, the pieces are distributed across a number of computers (or nodes) for processing in a “map” step, and the partial results are then combined in a “reduce” step.
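To make the map and reduce steps concrete, here is a minimal single-machine sketch of the classic word count example. It is an illustration only and does not use Hadoop or any real MapReduce API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. In a real cluster, each input split would be mapped on a different node and the framework would handle the grouping step.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for one input split handled by one node.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data is big", "data is everywhere"]
    print(word_count(splits))
```

Because each call to map_phase looks at only one split, and each call to reduce_phase looks at only one key, both steps could run on many machines at once. That independence is what lets the approach spread work across a cluster.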
HDFS
The Hadoop Distributed File System (HDFS) is designed to run on low-cost, commodity hardware. It splits files into large blocks and keeps copies of each block on several machines, so that the failure of a single machine does not cause data loss. It was designed for, and is particularly suited to, providing high-throughput access to very large data sets.
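The idea of blocks and replicas can be sketched in a few lines. The example below is a toy model, not HDFS code: the block size is a few bytes rather than the many megabytes HDFS uses, the node names are hypothetical, and the round-robin placement is an assumption made for simplicity (real HDFS uses a rack-aware placement policy). It keeps three copies of each block, which matches the HDFS default replication factor.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    return {
        block_id
        for block_id, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical names
    blocks = split_into_blocks(b"0123456789", block_size=4)
    placement = place_replicas(len(blocks), nodes)
    print(placement)
    # Even with two nodes down, every block still has a surviving copy.
    print(readable_blocks(placement, failed_nodes={"node-a", "node-b"}))
```

With three replicas on distinct nodes, any two nodes can fail and every block remains readable, which is the kind of fault tolerance that makes inexpensive hardware practical.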
Hive
Originally developed at Facebook, Apache Hive is built on top of Hadoop and works in conjunction with it. Hive allows business intelligence (BI) applications to run queries against Hadoop clusters through a “SQL-like” query language called HiveQL. Hive is now open source, so anyone familiar with SQL can query data sets stored in Hadoop clusters. Hive lets users access the Hadoop cluster as if it were a traditional data store, which makes it easier to use.
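To give a feel for what “SQL-like” means, the sketch below runs a simple aggregate query of the kind an analyst might send to Hive. Note the substitution: it uses Python’s built-in sqlite3 module as a stand-in, not Hive itself, so that the example runs anywhere. The page_views table, its columns, and the sample rows are invented for illustration. The query text uses only basic SELECT, GROUP BY, and ORDER BY, which HiveQL also supports.

```python
import sqlite3

# A plain aggregate query; table and column names are hypothetical.
QUERY = """
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC, country
"""


def top_countries(rows):
    """Load rows into an in-memory table and count views per country."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Hive connection
    try:
        conn.execute("CREATE TABLE page_views (user_id TEXT, country TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("u1", "US"), ("u2", "DE"), ("u3", "US"),
        ("u4", "IN"), ("u5", "US"), ("u6", "DE"),
    ]
    print(top_countries(sample))
```

The point is that the person writing the query does not need to think about map and reduce steps. With Hive, a query like this is translated into jobs that run across the Hadoop cluster.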
NoSQL
Non-relational, or “NoSQL,” databases are better suited to processing unstructured data than the decades-old relational database management system model. A NoSQL database is typically not built from fixed tables and does not rely on SQL as its primary query language. NoSQL databases can be heavily optimized for quick retrieval of data. They are highly useful in Big Data applications where large amounts of data need to be stored and retrieved but the relationships among the data are not well defined. This database design also allows for elastic scaling, which takes advantage of advances in cloud-based storage systems: as new nodes are added, data is spread across them.
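One common way to spread keys across nodes so that new nodes can join cheaply is consistent hashing. The sketch below is a simplified illustration of that general technique, not the implementation of any particular NoSQL product; the HashRing class, the node names, and the key format are all assumptions made for the example. Real systems usually add refinements such as multiple virtual positions per node and replication.

```python
import bisect
import hashlib


def _position(value):
    """Map a string to a point on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with one position per node."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        bisect.insort(self._ring, (_position(node), node))

    def node_for(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        positions = [position for position, _ in self._ring]
        index = bisect.bisect(positions, _position(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b"])  # hypothetical node names
    keys = ["user:%d" % i for i in range(1000)]
    before = {key: ring.node_for(key) for key in keys}

    ring.add_node("node-c")  # elastic scaling: a new node joins
    after = {key: ring.node_for(key) for key in keys}

    moved = [key for key in keys if before[key] != after[key]]
    print("keys moved:", len(moved), "of", len(keys))
```

When the new node joins, only the keys that now fall to it are moved; every other key stays where it was. That property is what makes adding capacity inexpensive compared with rehashing the whole data set.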
Cloud-based Storage
Cloud-based solutions have driven rapid advances in storage technology across the computing world. These advances are most evident in the ability to handle Big Data. One example is supply chain data gathering and sharing, which cloud-based solutions have revolutionized. Before cloud solutions, linking thousands of suppliers together was laborious and consumed substantial IT resources, much of them spent resolving compatibility issues. With a cloud solution, the cloud provider maintains the shared data pool for anyone with authorized access to the network. This has greatly simplified supply chain information management systems and increased their efficiency.