How Dense is Your Information?

Big vs small data

Critical understanding to get the most out of Big Data

To appreciate what it takes to get the most out of Big Data, let’s look at what Big Data is and at “information density”: the amount of valuable information per byte of data.
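As a rough, hedged illustration (the byte counts below are entirely made up), information density can be treated as a simple value-per-byte ratio:

```python
# Toy illustration of "information density" as valuable information per byte.
# The datasets and numbers below are hypothetical, purely for comparison.

def information_density(valuable_bytes: int, total_bytes: int) -> float:
    """Fraction of a data set that carries direct business value."""
    return valuable_bytes / total_bytes

# A curated transactions table: most fields feed a business process.
small_data = information_density(valuable_bytes=80_000, total_bytes=100_000)

# A raw social-media feed: value is buried in free text and noise.
big_data = information_density(valuable_bytes=2_000_000, total_bytes=1_000_000_000)

print(f"small data density: {small_data:.2%}")   # 80.00%
print(f"big data density:   {big_data:.4%}")     # 0.2000%
```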

What is Big Data?

Big Data is typically data from sources that collect the interactions of people or things.

  • Social data – This data can deliver business information such as reviews/sentiment, reputational risks, or personalization. Social data usually comes in high volume and has informal structure that requires text analysis and/or natural language processing (a toy sketch follows this list).
  • Sensor data (Internet of Things) – This data can deliver business information such as alerts for complex automated systems/networks, new services based on personal sensors, or controls for automated factories. Sensor data tends to be very high volume and highly structured. To obtain the business value, large sets or combinations of sets need to be analyzed.
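To make the text-analysis point concrete, here is a minimal, hypothetical keyword-based sentiment scorer. The word lists and posts are made up, and a real pipeline would use proper NLP tooling; the sketch only shows the kind of processing informal text requires before it yields business information:

```python
# Toy sentiment scoring for social data. The keyword lists and posts are
# hypothetical; real systems would use NLP libraries, but the shape of the
# work (turning informal text into a business signal) is the same.

POSITIVE = {"love", "great", "excellent", "recommend"}
NEGATIVE = {"broken", "terrible", "refund", "disappointed"}

def sentiment(post: str) -> int:
    """Positive-minus-negative keyword count for one post."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "Love this product, would recommend!",
    "Arrived broken. Terrible experience, want a refund.",
]
for p in posts:
    print(sentiment(p), "|", p)
```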

What is “small data”, i.e., not Big Data?
Small data is the data used in typical business processes.

  • Transactions – Purchases, orders, registrations, etc. Transactions tend to be highly structured and of medium to high volume.
  • Master Data (key entities) – Customers, Employees, Vendors, Products, Assets, Locations, etc. Master data tends to be structured and of low to medium volume. This data also provides the connection between many data sets.
  • Relationships – The relationships between business entities, for example, the customers of a given product or the subsidiary companies of a given business partner.

Relative information density
Because small data is built for specific business processes, byte for byte “small data” has more direct value to the business. This is why business applications have focused on this data. Governing this data may take less work, but it is still not done well in many businesses.

Big Data is less dense, so more work is needed to obtain value, e.g., processing text to derive business context. Because Big Data is not typically focused on business processes, it also has a higher noise-to-information ratio and needs more analysis/filtering to yield business information.
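As a hedged sketch of that filtering step, suppose (hypothetically) that only posts mentioning one of our product names carry business information for the question at hand:

```python
# Hypothetical filter: most big-data records are noise for a given question;
# only posts that mention one of our products carry business information here.

OUR_PRODUCTS = {"acme-widget", "acme-gadget"}   # assumed product names

posts = [
    "just had lunch, it was fine",
    "my acme-widget died after two days",
    "weather is great today",
]

relevant = [p for p in posts if any(name in p for name in OUR_PRODUCTS)]
print(relevant)   # only the acme-widget post survives the filter
```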

But … Pound for pound, there is more Big Data available
So working on Big Data can add tremendous value, even if it is more work. This is why businesses are so interested in Big Data.

The other But … The value of Big Data is strongest when tied to small data
To really understand which profiles of customers have which preferences requires tying together all of the master data and transactions about those customers. Knowing the sentiment around specific products/vendors requires knowing the relationships between your customers and products.

So to obtain the biggest gains from Big Data, it is important to realize that more work and filtering need to be done on the Big Data, and that your Big Data needs to be integrated with well-governed small data.
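A minimal sketch of that integration, with made-up records and IDs: social sentiment only becomes actionable once it can be joined to well-governed master data for the customer and product it refers to:

```python
# Hypothetical join of big data (sentiment per post) to small data
# (customer and product master data). All names and IDs are made up.

master_customers = {101: {"name": "Ada", "segment": "enterprise"}}
master_products = {"acme-widget": {"category": "hardware"}}

social_posts = [
    {"customer_id": 101, "product": "acme-widget", "sentiment": -2},
]

for post in social_posts:
    customer = master_customers.get(post["customer_id"])
    product = master_products.get(post["product"])
    if customer and product:   # only well-governed keys can be joined
        print(customer["segment"], product["category"], post["sentiment"])
```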

Zero Wait Information ≠ Real time

How to Choose the Right Data Movement: Real-time or Batch?

We all want a “zero wait” infrastructure. This has spurred many organizations to push all data through a real-time infrastructure. It’s important to recognize that “zero wait” means that the information is in ready form when a user needs it, so if the user needs information that includes averages, sums, and/or comparisons, there is a natural need for a data set that has been fully processed (e.g., cleaned, combined, augmented). Building the data infrastructure with this in mind is very important.
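As a hedged sketch of “zero wait” in this sense (all data and names are made up), the aggregates a user will ask for are computed ahead of time by a stand-in batch step, so the read path is just a lookup:

```python
# Sketch of "zero wait": aggregates are precomputed (here, a stand-in for a
# batch job), so a user request is answered from ready-made results.

raw_orders = [
    {"region": "east", "amount": 120.0},
    {"region": "east", "amount": 80.0},
    {"region": "west", "amount": 200.0},
]

# Batch step: clean/combine/summarize before anyone asks.
totals_by_region: dict[str, float] = {}
for order in raw_orders:
    totals_by_region[order["region"]] = (
        totals_by_region.get(order["region"], 0.0) + order["amount"]
    )

# Request time: the answer is already in ready form.
def total_sales(region: str) -> float:
    return totals_by_region.get(region, 0.0)

print(total_sales("east"))   # 200.0, no scanning at request time
```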

The popular point of view is that real-time processing is the “modern” solution and that batch processing is the “archaic” way. However, real-time processing has also been around for a long time, and each mode of processing exists for different purposes.

One trade-off between real-time and batch processing is high throughput versus low latency. The right choice can be somewhat counterintuitive for the broader team, so it is important to determine the throughput and latency requirements independently of each other. A great example of throughput versus latency is the question, “What is the best way to get from Boston to San Francisco?” You might answer, “By plane.” That would be true for transporting a small group of people at a time, as that would result in the lowest latency, but would a plane be the best way to move a million people at once? How would you get the highest throughput?
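Putting rough, made-up numbers on the plane example shows how the two metrics pull apart:

```python
# Rough, made-up numbers for the Boston-to-San-Francisco example:
# a plane minimizes latency per trip, but its throughput is capped.

plane_capacity = 200          # people per flight (assumed)
flight_hours = 6.5            # latency for one trip (assumed)
people_to_move = 1_000_000

flights_needed = people_to_move / plane_capacity       # 5,000 flights
throughput_per_hour = plane_capacity / flight_hours    # ~31 people/hour/plane

print(f"flights needed: {flights_needed:,.0f}")
print(f"throughput of one plane: {throughput_per_hour:.0f} people/hour")
# Low latency per trip, but moving everyone this way takes a huge number of
# trips; a higher-capacity (higher-latency) mode wins on total throughput.
```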

Real-time processing is very good for collecting input continuously and responding immediately to a user, but it is not the solution for all data movement.  It’s not even necessarily the fastest mode of processing.  When deciding whether data should be moved in real time or in batch, it is important to define the nature of the business need and the method of data acquisition.
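To close, here is a toy contrast of the two movement styles on the same records (no real streaming framework is used; this only illustrates the per-event versus per-set shapes):

```python
# Same records, two movement styles (toy sketch, hypothetical data):
# real-time handles each event as it arrives; batch processes an accumulated set.

events = [{"user": "u1", "click": 1}, {"user": "u2", "click": 1}]

# Real-time style: respond per event, lowest latency per record.
def on_event(event: dict) -> None:
    print("acknowledge", event["user"])    # immediate response to the user

for e in events:
    on_event(e)

# Batch style: process the whole set at once, highest throughput per run.
total_clicks = sum(e["click"] for e in events)
print("nightly total:", total_clicks)
```

The per-event path answers each user immediately; the batch path touches every record once and wins on throughput per run.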