Big Data in a Nutshell


How many times a day do you hear the words “Big Data”? You probably hear it from the full gamut of people: math people, empirically-challenged people, technical people, marketing folks, engineers, your clients, and just about everyone you overhear in a coffee shop in a business district.

Speaking over people’s heads is a faux pas here at Cloud Natvz. We not only try to distill things down so that folks from any business unit can understand the basics, but we also try to create metaphors that build a bridge to concepts you already understand.

Now let’s see if we can make this happen with Big Data!

What is Big Data?

Big Data means different things to different people.  Think about the 3 use cases below:

1. My toddler – a book with more than 20 words on a page

2. Company server – terabytes of system logs and monitoring information

3. Laptop user – a 2 GB spreadsheet with 200,000 rows

Why is this data “BIG” to these users?  It’s big because it causes problems!

Instead of focusing on the definition of Big Data, which you can see is fluid depending on your perspective, let’s think about the challenges that crop up when data becomes “BIG”.

“Instead of focusing on the definition of Big Data, let’s think about the challenges that crop up when data becomes “BIG”.”

What are the Challenges?

Following the use cases above, the main challenges for each are:

1. Ingestion: My toddler gets overwhelmed by too much “data” and cannot consume it all.

2. Storage: The company server may run out of room when storing terabytes of log data.

3. Processing: Also known as “Excel hell” – the laptop starts to run very slowly when working with large spreadsheets in Excel.

Big data presents novel challenges for every company, but the three above are ones that almost all companies will have to consider when building optimal big data pipelines.

Challenge #1: Ingestion

The first challenge that occurs when we have Big Data is how to ingest it. Do you enjoy drinking right out of a fire hydrant? It’s not easy, and a lot is wasted. In fact, without something to harness the deluge of H2O, the hydrant is no more useful than the 2 oz glass meant to shame those of us who choose water with our fast food. So let’s think of ingestion as hooking a hose up to the source of your data.

[Image: big data ingestion. Photo courtesy of https://leadinginlimbo.weebly.com/]

Consider, for example, that Twitter users produce 250 million tweets a day that have to be captured or ingested before they can be processed or stored. If your ingestion system goes down, you could get backed up very quickly.

Solution: Messaging Service

Usually some type of streaming messaging system or broker is used to help ingest the data. Think of a messaging system as a buffer between your data source and your data processing. It queues up your data before it enters your pipeline, allowing your system to be “decoupled” from the data source. You can read my post “Apache Kafka in a Nutshell” for more on messaging systems, but here is a quick primer.

“Think of a messaging system as a buffer between your data source and your data processing.”

A student is taking notes from a teacher (no message broker)

The student falls asleep, so he misses some notes (no message broker)

If he had a messaging system in the middle, when he came back online he would start taking notes again from the messaging system and miss nothing.

Messaging systems can be very simple or they can be quite complex, able to handle real-time streaming solutions.
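To make the buffer idea concrete, here is a minimal sketch of a producer and a consumer talking through a broker. It assumes the kafka-python client, a broker at localhost:9092, and a made-up topic called “tweets”; all of these are stand-ins for illustration, not part of any particular pipeline.

```python
# A minimal sketch, assuming the kafka-python client and a broker at localhost:9092.
# The "tweets" topic and message fields are made up for illustration.
import json
from kafka import KafkaProducer, KafkaConsumer

# The data source pushes events into the broker...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("tweets", {"user": "shaun", "text": "Big Data in a Nutshell"})
producer.flush()

# ...and the processing side reads them whenever it is ready ("decoupled").
consumer = KafkaConsumer(
    "tweets",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # catch up on anything queued while we were "asleep"
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```

If the consumer crashes, the messages simply wait in the topic until it comes back, just like the notes waiting for the sleepy student.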

[Image: Messaging Systems]

Challenge #2: Storage

Once you have hooked the hose up to the hydrant, where does all this data get stored? Save it to a disk drive? It may not be as simple as it seems; this could be a lot of data. There are a few things to consider:


1. Type and size of data is important

The type of data matters. Are we storing media, or are we storing log messages? Our data pipeline should store them in different places. You would not necessarily want to store images in a database, but you may want to store your log files there (eventually).

2. Moving data is very slow (Distributed Storage)

In fact, data movement is often the slowest step in a computing pipeline. With Big Data we often try to store our data as close to the source as possible. This is called “Distributed Storage”. Below is an extremely simplified primer on distributed storage.

[Image: distributed vs. traditional storage]

3. Initially capturing data is the priority (Data Lakes)

At some point we have to consider how we plan to use our data, because that will inform the way we store it. But first, we don’t want to lose any valuable data. We really don’t want the fire hose flooding the street, so let’s drop our data into a big, empty lake for initial storage. Once the data is stored, we can figure out what to do with it later. At this stage, our main concern is how quickly we can move the data, so depending on the source, the data may lack the structure we will eventually need for future queries and searches.
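As a rough illustration of “capture first, organize later”, here is a sketch that simply lands each raw event in a date-partitioned folder. Local disk stands in for object storage, and the lake layout and event fields are made up:

```python
# A minimal sketch, using local disk to stand in for a data lake / object store.
# The folder layout and event fields are illustrative only.
import datetime
import json
import pathlib
import uuid

def land_raw_event(event: dict, lake_root: str = "datalake/raw/tweets") -> pathlib.Path:
    """Write the event exactly as received, partitioned by ingestion date."""
    today = datetime.date.today().isoformat()
    path = pathlib.Path(lake_root) / today / f"{uuid.uuid4()}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(event))  # no schema, no cleanup, just capture it
    return path

land_raw_event({"user": "shaun", "text": "drink from the hose, store it in the lake"})
```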

4. How we store data is important for processing (Databases)

As mentioned above, at some point our data will need structure and a clearly defined “data model”. The type of processes or queries we perform on our data directly influences how we store it.

While we might initially store our data in a lake, we will ultimately want to add some structure and store it in a database that is efficient at performing fast queries on big data.  Some big data pipelines may move their data to different databases for different use cases.
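For example, once we know we will be querying tweets by user, we could sweep the lake and load the events into a table with a defined schema. In the sketch below, SQLite stands in for a distributed database, and the table and column names are illustrative:

```python
# A minimal sketch of adding structure for fast queries. SQLite stands in for a
# distributed database; the table and column names are made up.
import json
import pathlib
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (user TEXT, text TEXT, ingested TEXT)")

# Sweep the raw lake files and load them into the structured store.
for raw_file in pathlib.Path("datalake/raw/tweets").rglob("*.json"):
    event = json.loads(raw_file.read_text())
    conn.execute(
        "INSERT INTO tweets (user, text, ingested) VALUES (?, ?, ?)",
        (event.get("user"), event.get("text"), raw_file.parent.name),
    )
conn.commit()

# Now the data model supports the queries we actually care about.
print(conn.execute("SELECT user, COUNT(*) FROM tweets GROUP BY user").fetchall())
```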

[Image: Distributed File Storage Systems]

[Image: Distributed Databases]

Challenge #3: Processing

Now our data is being captured and saved. What do we do with it?  I’m hearing data is the new gold – should we swim in it?

As a company we need the ability to analyze our data so that we can make solid, data-driven decisions. Processing the data is the first step in making business sense from what we have collected. When building processing into our data pipeline, we must consider the type of processing we are doing.

“Processing the data is the first step in making business sense from what we have collected.”


1. No processing, just let everyone know that we are doing big data

This step should be reserved for the marketing department 🙂 , preferably after we have gained some business intelligence from our data.

2. Batch Processing

This is post-hoc processing, usually done when we have time for robust, in-depth processing of our data. This type of processing may simply take too long to do in real time, or it is appropriate when businesses want to look back on data sets and run new queries that were not initially performed.
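A batch job might look something like the sketch below: read everything that has landed in the lake and answer a question nobody thought to ask at ingestion time. It assumes PySpark is installed and reuses the illustrative lake layout from the storage section:

```python
# A minimal batch-processing sketch, assuming PySpark and the illustrative
# "datalake/raw/tweets" layout from earlier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-tweet-report").getOrCreate()

# Read every raw file in one pass; a batch job can afford to chew through it all.
tweets = spark.read.json("datalake/raw/tweets/*/*.json")

# A question we did not anticipate at ingestion time: who tweets the most?
tweets.groupBy("user").count().orderBy("count", ascending=False).show()

spark.stop()
```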

3. Real Time Stream Processing

More and more businesses require, and expect, real-time information in order to make mission-critical decisions for their systems and their clients. Some of the stream processing may actually be adding structure to our data so we can immediately store it in the databases mentioned above.

As stream processing technology has improved, we have been able to move more of our data processing from batch to stream processing.
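A stream-processing sketch, by contrast, handles each event the moment it arrives. This one reuses the illustrative kafka-python consumer and “tweets” topic from the ingestion section and keeps a made-up running word count:

```python
# A minimal stream-processing sketch, reusing the illustrative Kafka setup from
# the ingestion section. The running word count is made up for demonstration.
import json
from collections import Counter
from kafka import KafkaConsumer

word_counts = Counter()
consumer = KafkaConsumer(
    "tweets",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:                      # processes each tweet as it arrives
    word_counts.update(message.value["text"].split())
    print(word_counts.most_common(5))         # an always-fresh view of the stream
```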

4. Parallel Processing

Most modern big data processing is done in some sort of parallel way. This is outside the scope of this blog, but we will be publishing a “Parallelization in a Nutshell” blog soon – so stay tuned! In the meantime, here is a very simplified primer on parallel processing. It is worth noting that parallel processing usually requires processing tasks to be written in a special way to run on parallel processing systems.
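The sketch below shows the general shape of it: split the work, process the pieces independently, and combine the results. Python’s multiprocessing stands in for a real cluster, and the word-count task is made up for illustration:

```python
# A minimal parallel-processing sketch. Python's multiprocessing Pool stands in
# for a real cluster; the word-count task is made up for illustration.
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    """Each worker handles its own slice of the data independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    lines = ["big data is big", "data data data", "small data is still data"]
    chunks = [lines[0::2], lines[1::2]]            # 1. split the work
    with Pool(processes=2) as pool:
        partials = pool.map(count_words, chunks)   # 2. process the pieces in parallel
    total = sum(partials, Counter())               # 3. combine the partial results
    print(total.most_common(3))
```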

[Image: traditional processing vs. distributed processing]

[Image: Stream and Batch Processing]

Challenge #4: Visualize and Analyze

The job of a Data Scientist is to distill down all that data, sprinkle in a magic potion made from statistics, machine learning, and industry knowledge, and provide some useful business intelligence for the decision makers in your, or your client’s, organization.

Unless you are selling your data, this is the most important step of the process. Descriptive analytics is not enough. The target outcome should be providing your decision makers with analysis that allows them to make data-driven business decisions that provide value. This is called “Prescriptive Analytics”.

“The target outcome should always be providing your decision makers with analysis that allows them to make data-driven business decisions that provide value. This is called Prescriptive Analytics.”
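As a tiny example of the descriptive starting point, the sketch below pulls the structured tweets from the illustrative SQLite store built earlier and summarizes activity by user and day with pandas. Prescriptive analytics would build on this kind of summary to recommend an action, not just describe the past:

```python
# A minimal descriptive-analytics sketch, assuming pandas and the illustrative
# "warehouse.db" / tweets table from the storage section.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
df = pd.read_sql_query("SELECT user, ingested FROM tweets", conn)

# Descriptive: how active was each user on each ingestion day?
activity = df.groupby(["ingested", "user"]).size().unstack(fill_value=0)
print(activity)
```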

[Image: Analytics and Visualization Software]
