In the past few years, we have witnessed a rapid increase in all kinds of data: social network data continuously created by millions of users, large corporate transactional data, and real-time streams from the ubiquitous sensors in our surrounding environment. So what is Big Data? What does it look like? Is it really so critical and valuable that business players should make it the new frontier for innovation? In this post, I will define the big data problem, identify its features, discuss its value for organizations, and finally introduce the challenges of managing and using big data.
What is Big Data?
Today, every organization across the globe is faced with an unprecedented growth of data. The most general definition of big data is data whose size goes beyond the ability of commonly used software tools to collect, manage, and process it within a tolerable elapsed time.
What does big data look like?
The most important and obvious property of big data is that it is orders of magnitude larger than the data managed in traditional storage and analytical systems. More formally, data can be described along three dimensions: volume, velocity, and variety. For Big Data, the data is so big, moves so fast, or comes from so many different sources that traditional data management and analytical systems cannot handle it.
The volume of data generated has skyrocketed in the past decade, and the unit of measurement has risen from megabytes to gigabytes, terabytes, petabytes, and now exabytes. Today, it is common for production Big Data deployments to process petabytes or even exabytes of data on a daily basis. Data sizes are expected to be measured in zettabytes within the next few years and are predicted to double every two years.
Some examples of big data:
- The Human Genome Project has generated more than 200 terabytes of data, which is equivalent to more than 30,000 standard DVDs.
- Microsoft’s search engine hosts over 100 petabytes of data to deliver high quality search results.
- Billions of pieces of information are created every day by more than 600 million Facebook users.
- The online game company, Zynga, processes 1 petabyte of content for players every day.
- With the rapid growth of network devices and internet users, the internet will soon carry a daily traffic of more than one exabyte.
The second important feature of Big Data is that it is generated at a much faster rate than ever before. In fact, in many use cases data is generated in real time, and many Big Data applications require processing a real-time data stream to make time-sensitive decisions.
An example of high-velocity data is click streams from web sites: data systems process the streams and update content in real time to serve users in a timely manner.
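To make the click-stream example concrete, here is a minimal sketch of a sliding-window counter in Python. The class name, the window length, and the sample pages are all made up for illustration; a production system would run this logic distributed over a stream-processing cluster rather than in one process.

```python
from collections import Counter, deque
import time

class ClickStreamCounter:
    """Counts page clicks over a sliding time window (toy example)."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # (timestamp, page) pairs, oldest first

    def record(self, page, now=None):
        now = time.time() if now is None else now
        self.events.append((now, page))
        self._evict(now)

    def top_pages(self, n=3, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        counts = Counter(page for _, page in self.events)
        return counts.most_common(n)

    def _evict(self, now):
        # Drop events that have fallen out of the time window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

counter = ClickStreamCounter(window_seconds=60)
counter.record("/home", now=0)
counter.record("/home", now=10)
counter.record("/news", now=50)
print(counter.top_pages(n=2, now=55))  # "/home" seen twice, "/news" once
```

The key design point is that old events are continuously evicted, so the "current top pages" answer reflects only recent activity, which is exactly the kind of time-sensitive decision high-velocity data calls for.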
The third feature of Big Data is the variety of data, which is largely caused by the variety of data sources. For example, the widespread use of digital cameras and smartphones makes it much easier to generate high-definition (HD) images and videos, while ubiquitous sensors, such as utility meters, traffic and security cameras, and medical devices, are becoming important sources of big data.
The increased variety of big data has changed the format of data. Traditionally, the majority of data was well formatted according to some well-designed schema. In the big data world, unstructured data has become dominant, comprising more than 80% of all data sets. Unstructured data comes in many formats, such as text, images, and video. In addition, companies often need to integrate information from multiple data sources, for example from third parties.
Does Big Data Mean Big Value for Business Players?
Today, more and more companies are joining the Big Data battlefield. Amazon, Microsoft, Google, Facebook, IBM, and others are releasing big data systems and publishing their own perspectives on Big Data. Why has the Big Data landscape become so interesting in the past few years that it attracts so many IT giants? From their perspective, it is the value of Big Data. They expect Big Data to play an important role in business operations, from optimizing corporate management to identifying and extracting new value from customers for better customer service. In other words, Big Data can give them richer, deeper, and more accurate insights that help them differentiate themselves from their competitors and gain a competitive advantage.
Big data is also an important source of innovation for scientific research; for example, the data from the Human Genome Project has the potential to reveal the mechanisms of many complicated diseases. Because of this potential, US federal agencies have committed more than $200 million to a collaborative effort to develop the core technologies and other resources researchers need to manage and analyze enormous data sets.
In summary, big data can unlock significant value, not only helping companies do better business but also helping scientists tackle hard scientific problems.
The Value of Big Data from a Technical Perspective

For the past decades, Business Intelligence (BI) has been helping companies reveal value in their operational data. As data changes from "small" and well-structured data to Big Data, will it change, or more radically, revolutionize the way we do Business Intelligence? The answer is "yes". Then how can we use Big Data? Can we simply feed Big Data into our legacy Business Intelligence systems? Although the answer depends on the scale and power of your system and the properties of your data, generally you cannot. If you follow this path, you will suffer, because most legacy systems are not designed to handle such big data in a graceful manner. We will discuss the challenges of handling Big Data later.
Diving into Learning Theory
According to the "Big Data Theorem", the more data we have, the more precise and stable the patterns we can learn from the data will be. In learning theory, if we have a small data set, we cannot use very complicated models; otherwise we will suffer from overfitting (the model fits our training set almost perfectly, giving a very small training error, but generalizes poorly to unseen data). Big Data helps alleviate this problem: as the data grows, we can use more complicated models (for example, models of much higher order), which in the end capture more details of the patterns in the data. Figure 1 compares learning errors for models of different complexity as the data set size increases.
Figure 1. Learning Error Comparison for Different Model Complexity with Increasing Data Size.
From Figure 1, we can see that when the data set is small, the complex model (here H10, with order 10) suffers from serious overfitting (the testing error is too large while the training error is too small), whereas the simple model (H2, with highest order 2) does not. But as the data set grows, the complex model converges to lower errors than the simple model, meaning that complex models incur lower bias than simple ones. This is the beauty of Big Data: it enables us to use more complex models, leading to a more faithful fit of the patterns in the data.
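The effect in Figure 1 can be reproduced with a small simulation. The sketch below uses NumPy with made-up data: noisy samples of a sine curve stand in for the "patterns in the data", and polynomials of degree 2 and 10 stand in for H2 and H10.

```python
import numpy as np

rng = np.random.default_rng(0)

def experiment(n_points, degree):
    """Fit a polynomial of the given degree to noisy sine data;
    return (training MSE, testing MSE)."""
    x_train = rng.uniform(-1, 1, n_points)
    y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.3, n_points)
    x_test = rng.uniform(-1, 1, 1000)
    y_test = np.sin(np.pi * x_test) + rng.normal(0, 0.3, 1000)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for n in (15, 2000):
    _, te2 = experiment(n, degree=2)
    _, te10 = experiment(n, degree=10)
    print(f"n={n:5d}  H2 test error={te2:.3f}  H10 test error={te10:.3f}")
```

With only 15 points, the degree-10 model memorizes the noise (tiny training error, large testing error). With 2,000 points, the same complex model ends up with a lower testing error than the degree-2 model, because it can capture the curvature of the sine that a quadratic cannot.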
Big Data Processing Challenges
The tremendous opportunities to gain new and exciting value from big data are compelling for most organizations, but the challenge of managing and transforming it into insights requires new approaches to deal with the dynamic data sources and multiple contexts for big data.
The fact that big data contains mostly unstructured data makes traditional database systems no longer a good fit for big data operations. We need a new system that can deal with the massive unstructured data. On the other hand, structured data will continue to be critical for organizations. Thus the integration of data stored in traditional systems with the big data system becomes a concern when designing big data systems.
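The integration concern can be illustrated with a toy sketch in Python: structured customer records live in a relational store (SQLite here), while unstructured free-text tickets arrive as JSON, as they might from a NoSQL store or log files. All names and data are made up for illustration.

```python
import json
import sqlite3

# Structured side: customer records in a relational store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "Alice"), (2, "Bob")])

# Unstructured side: free-text support tickets as JSON documents.
tickets = [
    json.loads('{"customer_id": 1, "text": "app crashes on login"}'),
    json.loads('{"customer_id": 2, "text": "billing page is slow"}'),
    json.loads('{"customer_id": 1, "text": "crash again after update"}'),
]

# Integration: count tickets mentioning "crash" per customer name,
# joining the unstructured text back to the structured records.
crash_counts = {}
for t in tickets:
    if "crash" in t["text"]:
        name = db.execute("SELECT name FROM customers WHERE id = ?",
                          (t["customer_id"],)).fetchone()[0]
        crash_counts[name] = crash_counts.get(name, 0) + 1

print(crash_counts)  # {'Alice': 2}
```

Even in this tiny example, the insight (which customers are hitting crashes) only emerges when the two kinds of data are joined, which is why big data system designs must account for the structured systems that organizations already run.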
Table 1 compares traditional data systems and big data systems:
| Traditional Data System | Big Data System |
| --- | --- |
| Centralized relational database (SQL) storage and management | Distributed, non-relational (NoSQL) storage |
| Centralized processing on a single computer | Batch and real-time distributed, parallel processing on large clusters |

Table 1. Comparison of Traditional Data System and Big Data System
Hadoop was designed as a big data system; it is distributed and can handle massive data sets by coordinating the computation power of a large number of commodity machines. In the next post, I will introduce Hadoop and explain the features that enable its success in the Big Data community.
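As a preview, Hadoop's map/reduce style of computation can be sketched in a few lines of plain Python. This is a single-process toy with made-up input, not Hadoop's actual API: a real cluster runs the map, shuffle, and reduce phases distributed across many commodity machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data is big", "data moves fast"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Because each map call touches only its own split and each reduce call touches only one key's values, both phases parallelize naturally, which is what lets Hadoop scale the same pattern to massive data sets.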
References

- Learning from Data, slides on overfitting. http://bit.ly/13nSL5F
- The Human Genome Project data available on AWS. http://1.usa.gov/Xqqs3j
- NSF Big Data initiative. http://1.usa.gov/ZCFZN3