Truth from Data? : Would we choose it if we knew...

Avoiding Lies and Clicking on Truth

Have you heard the news about Fake News?…  Of course you have.  And despite all the fervor around it, it not only continues to exist but thrives.  Because of one simple fact: it quite literally pays.

You see, the news items that get clicked on the most end up generating revenue.  And the more attention a story gets, the more it drives those clicks.  Fake news is simply created to capture those clicks: stories crafted with words, memes, images and people that cry out "click me!!"

Though it seems like an art form, there is also a lot of science behind it.  For example, marketing research companies study which subject lines make an email more likely to be read, analyzing the click behavior of many different populations, all aimed at finding the most effective way to reach an audience.  Unfortunately, Fake News uses this science to make falsehoods pay off without caring about truth.  Can data science make truth more valuable?

I’m not sure we can get there today but it is conceivable, and here’s how.

  1. Build a “Truth Grader” that reads ahead to the “news articles” that appear in your browser
  2. Then the grader breaks each article into opinion vs. fact, capturing the fact / opinion ratio
  3. Next the grader tests the facts against trusted sources and determines an overall fact score
  4. Finally the Truth Grader assigns a code based on the strength of the facts, with roll-over details for the reader

Technically we’re actually not as far off as it may sound.  Steps 1 and 4 are relatively straightforward, and Step 3 may be possible with a Watson-like interface, though determining “trusted sources” would likely be an area of much “discussion”.  It’s Step 2 that would currently be the only real blocker.  And of course this all gets harder with video and pictures.
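To make Step 2 concrete, here is a deliberately naive sketch of splitting an article into fact-like and opinion-like sentences with keyword heuristics. The marker words and sample text are my own illustration; a real Truth Grader would need proper NLP or a trained classifier.

```python
# Naive sketch of Step 2: classify each sentence as "opinion" if it contains
# a subjective marker word, otherwise count it as "fact". Heuristics only.
import re

OPINION_MARKERS = {"believe", "think", "feel", "should", "best", "worst",
                   "amazing", "terrible", "probably", "clearly"}

def fact_opinion_ratio(article_text):
    """Return (facts, opinions, fact_ratio) for a block of text."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", article_text) if s.strip()]
    opinions = sum(
        1 for s in sentences
        if OPINION_MARKERS & {w.lower() for w in s.split()}
    )
    facts = len(sentences) - opinions
    ratio = facts / len(sentences) if sentences else 0.0
    return facts, opinions, ratio

facts, opinions, ratio = fact_opinion_ratio(
    "The vote was held on Tuesday. I think the result is terrible. "
    "Turnout was 64 percent."
)
print(facts, opinions, round(ratio, 2))  # 2 facts, 1 opinion
```

Even this toy version shows why Step 2 is the blocker: hedged factual claims and assertive opinions easily fool surface-level word matching.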

Of course the real question is whether knowing that some juicy-sounding story was probably fake would stop us from clicking on it…  I am hopeful we’d teach the web we do value truth.  But at least we’d have the data to know.

How much does your data weigh?

Business Improvement via data metrics

Measurement can be key to improving data. But there are too many potential measures when it comes to data: every column, every row, every table, and every relationship can be measured. And that does not even get into the possibilities of metadata or data quality. With all these possibilities, coming up with a measurement scheme can seem too costly. And without proper focus, it will be.

So what to focus on?

The four areas that really need the most focus:

  1. Check if objectives are being met
  2. See how the expected “control points” are changing
  3. Make sure the processes put in place work as intended
  4. Watch to see when sizing and other assumptions will be violated

1. Objective Metrics: Check if objectives are being met

As part of Data Governance, it is important that the business visit this topic on a regular basis. Here are some examples of objectives I have discussed with clients recently:

  • Reducing the time it takes to onboard a new customer / product / location
  • Reduce bounced communications (e.g. mailings, emails, phone calls, …)
  • Improve Customer response (e.g. conversion rate, click throughs, ..)
  • Improve Compliance (e.g. Know The Customer, Physician Spend, Conflict Minerals, …)

There are many other examples I could give, but in all cases this is one of the key areas to measure. As much as possible, these items should be measured based on historic data so that a baseline can be created to give a before-and-after view.

Any new data governance initiative (e.g. an MDM or data cleansing implementation) needs to have identified requirements it is expected to meet. As these requirements are developed, the corresponding metrics to measure success should also be created. The data governance team should then review these metrics going forward, compared against historic data.
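As a minimal sketch of this baseline comparison, take the onboarding-time objective from the list above. The records, field names, and values here are hypothetical; the point is simply computing a historic baseline and measuring the current state against it.

```python
# Illustrative objective metric: average customer onboarding time, compared
# against a historic baseline. All data and field names are made up.
from statistics import mean

baseline_records = [{"customer": "A", "onboard_days": 12},
                    {"customer": "B", "onboard_days": 15},
                    {"customer": "C", "onboard_days": 9}]
current_records  = [{"customer": "D", "onboard_days": 7},
                    {"customer": "E", "onboard_days": 8}]

baseline = mean(r["onboard_days"] for r in baseline_records)  # 12.0
current  = mean(r["onboard_days"] for r in current_records)   # 7.5
improvement_pct = (baseline - current) / baseline * 100

print(f"Onboarding time improved {improvement_pct:.1f}% vs. baseline")
```

In practice the baseline would come from historic data captured before the initiative went live, which is why capturing it early matters.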

2. “Control Point” Metrics: See how the expected “control points” are changing

“Control points” just refers to the data elements that are expected to actually affect the objectives. For example, in the case of onboarding a customer, what are the data elements that would slow down the process? These could be invalid addresses, duplicate entries in the SFA tool, missing phone numbers, etc. Each of these potential causes would be a “control point” and should be measured.

Each new data project should include a design showing what data changes need to occur to meet the requirements / goals. As these designs are created, metrics should be identified to measure. Note these may be direct data, i.e. counting the customer records with and without a home phone. Others may be metadata, i.e. counting missing field descriptions for customer data sources. The data governance team should review control point metrics along with the business objective measurements.
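Two of the control points mentioned above, missing phone numbers and duplicate entries, can be counted directly. This is an illustrative sketch over a made-up customer table; real systems would run equivalent checks in SQL or a data quality tool.

```python
# Hypothetical control-point metrics over a customer table: records missing
# a phone number, and duplicate entries (same name + email).
from collections import Counter

customers = [
    {"name": "Acme Co", "email": "info@acme.com", "phone": "555-0100"},
    {"name": "Acme Co", "email": "info@acme.com", "phone": None},   # duplicate
    {"name": "Globex",  "email": "hi@globex.com", "phone": None},   # missing phone
]

missing_phone = sum(1 for c in customers if not c["phone"])
keys = Counter((c["name"], c["email"]) for c in customers)
duplicates = sum(n - 1 for n in keys.values() if n > 1)

print(missing_phone, duplicates)  # 2 missing phones, 1 duplicate
```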

3. Process Metrics: Make sure the processes put in place are working as intended

As new processes and systems are put in place, it is important to measure the activity of these systems. Like the control point metrics, the process metrics need to be based on the design work for data projects. These metrics will ensure the design is meeting functional and non-functional requirements. They are a key way of ensuring SLAs are met.

Process metrics are also likely to be specific to underlying technology choices. For example, users of the Informatica MDM Hub can use a product like the Hub Analyzer by GlobalSoft. Tools like this can be vital in tracking day-to-day operations and can help in tuning system configuration. Process metrics should be collected and reviewed as early in the development cycle as possible to create baselines. They should be reviewed by the operations team on a regular basis, and the data governance team should track whether process metrics are varying unexpectedly.
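A simple example of checking a process metric against an SLA: take the latency of a recurring batch load and compare a high percentile against the agreed threshold. The nearest-rank percentile helper, the load times, and the 60-minute SLA are all illustrative assumptions.

```python
# Sketch of a process-metric SLA check: is the 95th-percentile nightly
# batch-load time within the (hypothetical) 60-minute SLA?
def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    idx = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]

load_minutes = [42, 38, 51, 45, 40, 39, 48, 44, 41, 46]  # last 10 nightly loads
SLA_MINUTES = 60

p95 = percentile(load_minutes, 95)
print(f"95th percentile load time: {p95} min "
      f"({'within' if p95 <= SLA_MINUTES else 'violates'} SLA)")
```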

4. Assumption Metrics: Watch to see when sizing and other assumptions will be violated

As part of the design process, key assumptions should be collected. These should also be turned into metrics to ensure that the assumptions continue to hold. Collecting and reviewing these metrics allows more proactive planning if trends show they will be violated at some point. A common example of this is sizing assumptions. These metrics should be reviewed by the operations team, and by the data governance team whenever projections show limits beginning to be exceeded.
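For a sizing assumption, the metric is not just the current count but the projected date the limit will be crossed. Here is a minimal sketch fitting a least-squares linear trend to monthly record counts and projecting when a hypothetical one-million-record sizing assumption would be exceeded.

```python
# Hypothetical assumption metric: fit a linear trend to monthly record
# counts and project how many months remain before a sizing limit is hit.
def months_until_limit(counts, limit):
    """Least-squares slope over monthly counts; returns months until the
    projected count reaches `limit`, or None if growth is flat/negative."""
    n = len(counts)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(counts) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, counts)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None
    return (limit - counts[-1]) / slope

monthly_counts = [400_000, 450_000, 500_000, 550_000]  # +50k records/month
print(months_until_limit(monthly_counts, 1_000_000))   # 9.0 months
```

A real projection would want seasonality and confidence intervals, but even this crude trend line turns a static assumption into a forward-looking metric.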

Focusing on a few metrics in each of these four areas will allow a data governance team to make sure data initiatives are on track and to identify new opportunities.

I am not suggesting these are the only metrics. Someone should always be looking at new potential metrics that are not part of the initial design. For these it is key to take a good “data science” approach and understand what actions the potential metrics suggest. If an action can’t be determined, more work needs to be done.

To help discover new metrics, it is best that key data assets be organized in such a way that metadata, data changes, and other operations can be measured at points in time in the future. In other words, design data repositories, both “Big Data” and “small data”, to be measurable as potential “control points” in the future.


How Dense is Your Information?

Big vs small data

Critical understanding to get the most out of Big Data

To appreciate what it takes to get the most out of Big Data, let’s look at what Big Data is and at “information density”: the amount of valuable information per byte of data.

What is Big Data?

Big data is typically data from sources that collect the interactions of people or things.

  • Social data – This data can deliver business information such as: reviews / sentiment, reputational risks, or personalization. Social data usually comes in high volume and has an informal structure that requires text analysis and/or natural language processing.
  • Sensor data (Internet of Things) – This data can deliver business information such as: alerts for complex automated systems/networks, new services based on personal sensors, or controls for automated factories. Sensor data tends to be very high volume and highly structured. To obtain the business value, large sets or combinations of sets need to be analyzed.

What is “small data”, i.e. not Big Data?
Small data is the data used in typical business processes:

  • Transactions – Purchases, orders, registrations, etc. Transactions tend to be highly structured and of medium to high volume.
  • Master Data (key entities) – Customers, employees, vendors, products, assets, locations, etc. Master data tends to be structured and of low to medium volume. This data also provides the connection between many data sets.
  • Relationships – The relationships between business entities, for example the customers of a given product or the subsidiary companies of a given business partner.

Relative information density
Because small data is built for specific business processes, byte for byte “small data” has more direct value to the business. This is why business applications have focused on this data. It may take less work, but governing this data is still not done well in many businesses.

Big data is less dense, so more work is needed to obtain value, e.g. processing text to derive business context. Because big data is not typically focused on business processes, it also has a higher noise-to-information ratio and needs more analysis/filtering to yield business information.

But … Pound for pound there is more big data available
So working on Big Data can add tremendous value, even if it is more work. This is why businesses are so interested in Big Data.

The other but … The value of Big Data is strongest when tied to small data
To really understand which profiles of customers have which preferences requires tying together all the master data and transactions about the customer. Knowing the sentiment around specific products / vendors requires knowing the relationship between your customers and products.
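The tie between big and small data is, concretely, a join through governed master data. This sketch aggregates hypothetical scored social mentions (big data) by product line via a hypothetical product master (small data); all names and values are illustrative.

```python
# Illustrative join of "big data" (scored social mentions) to "small data"
# (a governed product master) to report average sentiment per product line.
product_master = {  # hypothetical master data
    "SKU-1": {"name": "Widget",  "line": "Hardware"},
    "SKU-2": {"name": "Gadget",  "line": "Hardware"},
    "SKU-3": {"name": "Service", "line": "Support"},
}
mentions = [  # hypothetical social mentions, already sentiment-scored
    {"sku": "SKU-1", "sentiment": 0.8},
    {"sku": "SKU-1", "sentiment": -0.2},
    {"sku": "SKU-3", "sentiment": 0.5},
]

by_line = {}
for m in mentions:
    line = product_master[m["sku"]]["line"]  # the big-to-small join
    by_line.setdefault(line, []).append(m["sentiment"])

avg_by_line = {line: sum(v) / len(v) for line, v in by_line.items()}
print(avg_by_line)
```

Without the well-governed master data keys, the mentions could not be rolled up to anything the business recognizes, which is exactly the point.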

So to obtain the biggest gains from Big Data, it is important to realize that more work and filtering needs to be done on the Big Data, and that your Big Data needs to be integrated with well-governed small data.