Let’s talk business
Do you know what the most valuable resource in the world is? A few years back, oil was called the most valuable resource in the world, but its spot has now been taken by data. In today’s information era, data has become the single most valuable resource in the world, with its value increasing every passing year. For more, read THE ECONOMIST.
BIG DATA plays a huge role in uncovering valuable insights about target demographics and customer preferences, and it provides a new view into traditional metrics like sales and marketing information.
If you look around today, the biggest IT giants, be it Google, Facebook, IBM, or Microsoft, are all thriving on data. And with the telecom revolution, millions of new users connect to the internet every year. Together we create such a humongous volume of data that handling it has become a major problem.
What exactly is BIG DATA?
In simple terms, it’s a massive amount of data being generated at a rapid speed. Many people confuse it with some kind of technology, but in fact BIG DATA is the name given to our data problem. The go-to definition: big data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity. It’s too large and complex to process with traditional database management tools.
The Three Vs of BIG DATA
These were introduced by Doug Laney (an analyst at Gartner) in 2001:
Volume refers to the amount of data. It’s generated from multiple sources like social media platforms (Facebook, Twitter, Instagram), e-mail services (Gmail, Outlook.com, Yahoo Mail) and so on. For some organizations this might be tens of terabytes of data; for others, hundreds of petabytes. The problem is that it’s so huge it dwarfs even the largest hardware available for storage (the 100TB Nimbus Data ExaDrive DC100).
Velocity refers to the speed at which the data is generated. Normally, the highest velocity of data streams directly into memory versus being written to disk.
Variety refers to the many types of data that are available i.e. structured, unstructured and semi-structured. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.
Apart from these, there are a few other important characteristics of BIG DATA, two of which are:
How truthful is your data — and how much can you rely on it? Veracity is all about making sure the data you acquire is accurate, which requires processes to keep the bad data from accumulating in your systems.
In the end, everything comes down to how valuable your data is to your organization’s business. These massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before. Your investment in big data pays off when you analyze and act on your data.
Types of BIG DATA (Variety)
- Structured Data: As the name suggests the data is structured or formatted in a consistent order that’s easily accessible for analysis.
- Unstructured Data: This type of data has no defined structure and hence cannot be stored in a traditional relational database (RDBMS). It mainly comes from documents, social media feeds, pictures, videos and sensors.
To read about the importance of unstructured data visit forbes.com
- Semi-structured Data: It’s got flavors of both structured and unstructured data, i.e. it has some consistency but doesn’t stick to a rigid structure. It’s easier to analyze than unstructured data.
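A toy sketch of the distinction, using made-up records: structured data fits a fixed schema like a table row, while semi-structured data (JSON here) shares some keys but allows nesting and missing fields.

```python
import json

# Structured: every record follows the same fixed schema (like a table row).
structured = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
]

# Semi-structured: JSON with some shared keys but no rigid schema --
# records may nest data or omit fields entirely.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Alice", "tags": ["admin", "editor"]},
  {"id": 2, "name": "Bob", "contact": {"email": "bob@example.com"}}
]
""")

# The shared keys still allow partial analysis, which is why
# semi-structured data is easier to work with than fully unstructured data.
names = [record["name"] for record in semi_structured]
print(names)  # ['Alice', 'Bob']
```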
Hadoop: A BIG DATA Tool
Somewhere around 2005, people began to notice the sheer amount of data users generated through social media, emails, blogs, websites and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year.
The development of open-source frameworks, such as Hadoop was essential for the growth of big data because they make big data easier to work with and cheaper to store. Since then, the volume of big data has only skyrocketed.
The massive data collected and generated through the IoT and Machine Learning has added even more to this. Cloud computing has opened up many further possibilities for BIG DATA analytics.
Distributed Storage: A solution to BIG DATA volume and velocity problem
- Every person created an estimated 1.7MB of data every second during 2020. (Source: Domo)
- In the last two years alone, an astonishing 90% of the world’s data has been created. (Source: IORG)
- 2.5 quintillion bytes of data are produced by humans every day. (Source: Social Media Today)
- 463 exabytes of data will be generated each day by humans as of 2025. (Source: Raconteur)
- 95 million photos and videos are shared every day on Instagram.
- By the end of 2020, 44 zettabytes will make up the entire digital universe. (Source: Raconteur)
- Every day, 306.4 billion emails are sent, and 5 million Tweets are made. (Source: Internet Live Stats)
And the list goes on…
That’s not even all of the data we create every day. All this data is the lifeline of the IT giants. But storing such a humongous volume of data is extremely challenging, and equally challenging is the velocity problem: data is generated much faster than it can be stored. This is where Distributed Storage comes into play. It’s basically:
“Storing data on a multitude of standard servers, which behave as one storage system although data is distributed between these servers.”
Whatever data comes to the frontend server (Master Node) is sliced equally and sent to the backend servers (Slave Nodes) for storing them. This not only solves the volume problem but also the velocity problem.
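The master node’s slicing step above can be sketched in a few lines. This is a simplified illustration, not Hadoop’s actual implementation; the function name and the byte payload are made up for the example.

```python
def slice_data(data: bytes, num_nodes: int) -> list[bytes]:
    """Split incoming data into near-equal chunks, one per slave node."""
    chunk = -(-len(data) // num_nodes)  # ceiling division
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]

# Pretend 500 units of data arrive at the master node of a 5-node cluster.
payload = b"x" * 500
chunks = slice_data(payload, 5)
print([len(c) for c in chunks])  # [100, 100, 100, 100, 100]
```

Each slave node now stores only a fifth of the payload, and because all five writes happen in parallel, ingestion keeps up with a much higher incoming data rate than a single disk could.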
Let’s take an example to understand this better. Facebook receives nearly 500TB of data each day, and the largest available storage is only 100TB in size. So with the distributed storage model, we club together 5 systems, each with a 100TB hard disk, under a master node. The master node splits the data into 5 equal parts and sends one part to each node. This solves the volume problem.
Now, say each system takes 5 hours to store 1TB of data. So 100TB of data will be stored in 500 hours, and since all 5 systems run in parallel, it takes the same 500 hours to store the full 500TB. But if we double the number of slave nodes, each hard disk only has to store 50TB, and it takes half the time.
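The arithmetic above can be checked with a tiny helper. The function name and the 5-hours-per-TB figure are just the example’s assumptions:

```python
def storage_hours(total_tb: float, num_nodes: int, hours_per_tb: float = 5) -> float:
    """Wall-clock time to store the data when nodes write in parallel.

    Each node stores an equal share, so total time is driven by the
    per-node share, not the total volume.
    """
    per_node_tb = total_tb / num_nodes
    return per_node_tb * hours_per_tb

print(storage_hours(500, 5))   # 5 nodes, 100TB each -> 500.0 hours
print(storage_hours(500, 10))  # 10 nodes, 50TB each -> 250.0 hours
```

Doubling the node count halves the per-node share and therefore halves the wall-clock time, which is exactly the scalability argument made above.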
This method is not only efficient but also scalable, and the best part is that the more slave nodes you add, the faster the aggregate I/O transfer rate you get.
Let’s see through a few examples how companies leverage the benefits of BIG DATA ANALYTICS:
Without a doubt, Google is an expert in BIG DATA. Have you ever wondered how Google search gives you results so fast and so accurately, considering there are billions of webpages to scan? The answer is BIG DATA ANALYTICS. Google uses big data tools and techniques to understand our requirements based on several parameters like search history, location, trends etc. All of this is processed through an algorithm and displayed to the user sorted by relevance.
Here’s an infographic by Vertical Measures.
Amazon houses the widest variety of goods and services and has thrived on it. That’s what the arrow in its logo represents: “everything from A to Z”. But when customers are exposed to such a huge range of options, they often feel overwhelmed. They get access to more choices but have poor insight and become confused about what to buy and what not to. To tackle this problem, Amazon uses the BIG DATA collected from customers’ browsing history to fine-tune its recommendation engine, which provides personalized product recommendations to each customer.
Hope you enjoyed the blog!
Bye bye… See you soon!