Every minute, 50,000 photos are posted on Instagram, 500,000 photos are shared on Snapchat, 1,000,000 swipes are made on Tinder and 4,000,000 videos are watched on YouTube.
Simply put, this exponential increase in the amount and variety of available data is an example of big data. This blog post explains Big Data in simple words, with relatable examples and analogies. Check out this video on Big Data if you prefer watching or listening over reading.
Big data refers to datasets whose volume, variety and velocity are beyond the ability of traditional data solutions to ingest, process and store the data with low latency. Let us look at each of these characteristics using a train analogy.
Volume refers to the quantity of generated and stored data. The size of the data has been growing exponentially.
Assume we have a train that can carry 1000 passengers at a time. On a particular day, we have 10000 people waiting to travel on our train. This poses a challenge as our train can only hold 1000 passengers at a time but we need to transport passengers 10 times that number. With the current scenario, we have two options.
The first one is to load 1000 passengers at a time. With this approach, we need to make 10 trips to transport 10000 passengers. The last batch of passengers, travelling on the 10th trip, would reach the destination with a significant delay. Moreover, some people may not want to wait that long and may leave without boarding the train.
The second one is to overload the train with 10000 passengers. This poses safety concerns. The train may break down as it is not robust enough to hold 10000 passengers at a time. Also, squeezing 10000 passengers into the train can lead to congestion and suffocation of the passengers.
Similarly, big data cannot be captured, stored and analyzed efficiently on traditional data storage such as relational databases. Let’s say the traditional data solution can process X amount of data. If we try to process 10X amount of data with the same data solution, it would result in either slow and delayed delivery or a breakdown of the process.
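The arithmetic behind the first option can be sketched in a few lines of Python. The capacity and passenger count come from the analogy above; the 60 minutes per trip and the function names are made-up assumptions for illustration only.

```python
import math


def trips_needed(total_items: int, capacity: int) -> int:
    """Number of full or partial trips needed to move every item."""
    return math.ceil(total_items / capacity)


def last_batch_arrival(total_items: int, capacity: int, minutes_per_trip: int) -> int:
    """Minutes until the final batch is delivered when trips run back to back."""
    return trips_needed(total_items, capacity) * minutes_per_trip


# Numbers from the analogy: 1000 seats, 10000 passengers.
# The 60-minute trip time is an assumed value.
trips = trips_needed(10_000, 1_000)
delay = last_batch_arrival(10_000, 1_000, 60)
print(f"trips: {trips}, last batch arrives after {delay} minutes")
```

The same reasoning applies to a data solution sized for X that is handed 10X: the work either takes roughly ten passes, or it has to fit into a system that was never sized for it.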
Variety is the type and nature of the data. The data can be text, image, audio, video or other custom formats.
Continuing with the train analogy, let’s say our train is designed only to transport human beings. But there’s a special request from the local zoo to transport the zoo animals such as elephants, giraffes and dolphins on the train. With the current design, the train cannot transport these animals. We need to either modify the train or get a specially designed train to transport the animals.
Analogously, traditional data solutions may be designed for text or one specific kind of data. With Big Data, there are various other formats such as image, audio, video, encoded data, compressed data or encrypted data that require different data solutions.
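As a rough sketch of why variety matters, imagine a tiny system that only knows how to handle the formats it was designed for. The handler names and the format labels below are hypothetical and chosen just for this example.

```python
def describe_text(payload: bytes) -> str:
    """Text can be decoded and split into words."""
    words = payload.decode("utf-8").split()
    return f"text with {len(words)} words"


def describe_binary(payload: bytes) -> str:
    """Binary formats are treated as opaque bytes in this sketch."""
    return f"binary blob of {len(payload)} bytes"


# Each supported format needs its own handler, just as each kind of
# passenger needs a suitable carriage.
HANDLERS = {
    "text": describe_text,
    "image": describe_binary,
    "audio": describe_binary,
}


def handle(kind: str, payload: bytes) -> str:
    handler = HANDLERS.get(kind)
    if handler is None:
        # The "train" was not designed for this kind of cargo.
        raise ValueError(f"no handler for format: {kind}")
    return handler(payload)


print(handle("text", b"hello big data"))
print(handle("image", b"\x89PNG"))
```

Asking this sketch to handle a format it has no handler for, such as "video", raises an error until a suitable handler is added.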
Velocity refers to the speed at which the data is generated and processed.
Let’s assume our train is scheduled to make trips from point A to point B every hour. With this schedule, people arrive at the train station and wait for the next trip. We now have a new requirement to transport people as soon as they arrive. An intuitive solution would be to add more trains. But this is inefficient and expensive. An efficient solution would be to add smaller vehicles such as cars and motorbikes, or to arrange cabs, so that trips can start as soon as passengers arrive.
Analogously, traditional data solutions are usually batch processes that run at regular intervals, such as once an hour or once every 15 minutes. In contrast, Big Data is often produced and consumed in real time. We need a different kind of solution to ingest, process and store the data as soon as it arrives.
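To make the difference concrete, here is a small sketch that computes how long a record waits before a scheduled batch picks it up. The arrival times are invented, and batches are assumed to run at minute 0 and then once every interval.

```python
def batch_wait_minutes(arrival_minute: int, interval: int) -> int:
    """Minutes a record waits for the next scheduled batch run.

    A record that arrives exactly at a run time is picked up immediately.
    """
    return (-arrival_minute) % interval


def average_wait(arrivals, interval):
    waits = [batch_wait_minutes(minute, interval) for minute in arrivals]
    return sum(waits) / len(waits)


# Invented arrival times, in minutes after midnight.
arrivals = [5, 20, 59, 61, 90]

hourly = average_wait(arrivals, 60)
every_15 = average_wait(arrivals, 15)
print(f"hourly batch: {hourly} min, 15-minute batch: {every_15} min")
```

A solution that handles each record on arrival brings this wait close to zero, which is what the smaller vehicles do in the analogy.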
Sources of Big Data
There are several sources, invented or introduced in recent times, that produce big data in real time and at a very large scale. Some of them include Devices, Social Media and Apps.
Devices & Sensors
We now have various devices and sensors that are connected to the Internet and produce data in real time. Some of these devices include laptops, smartphones, smart watches, fitness trackers, gaming consoles, cameras, virtual assistant devices, virtual reality headsets and drones.
Social Media
There are billions of accounts on social media platforms such as Facebook, YouTube and Instagram. Users regularly share stories, posts, photos and videos, which results in the constant generation of data in huge volumes.
Transactional & Web Applications
Applications such as e-commerce websites, online banking and food ordering continuously generate transactional and log data. For instance, when we place an order on an e-commerce website, the data about the order is sent to the server. Along with this, many of the user's behaviors are also captured and transferred for analytical purposes. Each activity, such as clicking a button, browsing the menu or navigating to a page, results in one or more logs. These logs are generated and processed continuously in real time to serve the user with customized content and targeted advertisements.
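As an illustration, a single user activity might be turned into a log record like the ones below. The field names, user ID and targets are hypothetical; real applications define their own log formats.

```python
import json
from datetime import datetime, timezone


def make_log_event(user_id: str, action: str, target: str) -> str:
    """Build one JSON log line for a single user activity."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "action": action,
        "target": target,
    }
    return json.dumps(event)


# One short shopping session already produces several log lines.
events = [
    make_log_event("user-42", "click", "add-to-cart-button"),
    make_log_event("user-42", "navigate", "/checkout"),
    make_log_event("user-42", "order", "order-1001"),
]

for line in events:
    print(line)
```

Multiply a handful of lines per session by millions of users and it becomes clear how quickly web applications turn into a source of big data.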
There are several challenges when working with Big Data such as capturing, storing, processing, transferring, visualizing, querying, updating and securing the data. We need to use the right tool for the job. Instead of using the traditional tools, we should leverage the modern tools that are specialized in solving each of these big data challenges.
Let’s now take a look at a simple big data architecture at a high level. Assume we need to store the web server logs in real time in an analytical data store. We can accomplish this with the following big data pipeline.
First, we publish the logs from the web server into a real-time ingestion tool such as Apache Kafka. Then, we can leverage a stream processing framework such as Apache Flink to read from Apache Kafka, process the data and write it into an analytical data store such as Apache HBase. The data in the analytical data store can then be used for reporting and analytics.
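The shape of this pipeline can be sketched with nothing but the Python standard library. This is only a toy model: an in-memory queue stands in for Apache Kafka, a plain function stands in for the Apache Flink job, and a counter stands in for the analytical data store. None of the names below belong to the real tools' APIs, and the log fields are assumptions.

```python
import json
import queue
from collections import Counter

# Stand-in for a Kafka topic: an in-memory first-in, first-out queue.
log_topic = queue.Queue()

# Stand-in for the analytical data store: successful hits per URL path.
hits_per_path = Counter()


def publish(raw_log: dict) -> None:
    """Web server side: serialize a log record and publish it."""
    log_topic.put(json.dumps(raw_log))


def process_available() -> int:
    """Stream processor side: read, parse and store every waiting record."""
    processed = 0
    while True:
        try:
            message = log_topic.get_nowait()
        except queue.Empty:
            return processed
        record = json.loads(message)
        # Example processing step: count only successful requests.
        if record["status"] < 400:
            hits_per_path[record["path"]] += 1
        processed += 1


publish({"path": "/home", "status": 200})
publish({"path": "/cart", "status": 200})
publish({"path": "/home", "status": 200})
publish({"path": "/missing", "status": 404})

handled = process_available()
print(handled, dict(hits_per_path))
```

In a real deployment each stand-in would be replaced by the corresponding tool, and the processing step would run continuously instead of draining the queue once.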
Big Data allows businesses to make better and faster decisions using data that was previously inaccessible or unusable. For instance, with big data, businesses can use machine learning and predictive analytics to gain new insights from the new influx of data from several sources.
In A Nutshell
Big data refers to datasets whose volume, variety and velocity are beyond the ability of traditional data solutions to ingest, process and store the data with low latency. There are several challenges when working with Big Data such as capturing, storing, processing, transferring, visualizing, querying, updating and securing the data. We need to architect modern solutions to deal with these challenges.