The world is one big data problem

“The world is one big data problem.” Andrew McAfee, co-director of the MIT Initiative on the Digital Economy.

When we talk about data access and distribution, everyone is in the same boat: no one has all the data. It is clear, however, that there are large discrepancies in access to it. Companies that have designed their businesses around data are some of the largest in the world, but that data is proprietary to them. We now see incredible demand for many more companies to become data driven, but this isn’t a simple transition.

It is clear we are in the midst of a huge digital evolution, which has accelerated since the start of the Covid pandemic in 2020. Data is seen as a key facilitator of this new digitally enabled world and is also critically linked to the sustainable growth agenda across industries. A key part of the digital change is not just the data in its raw form; it is the path data opens to enable AI and drive outcomes such as:

1. What better decisions can it bring?
2. What efficiencies can it create?
3. What processes can it automate?
4. How can it drive better experiences?

We now see data challenging the very fabric of business, disrupting traditional company operating models and the way business decisions are made. It isn’t just in businesses that this demand is surging. As individuals we are also seeing a change in consumption, as data becomes more and more important in our own day-to-day decision making. With the expected explosion in IoT and 5G technologies, the amount of “real data” we consume will continue to disrupt our day-to-day lives, whether at home or at work.

At its very core, data can be seen in two forms. The first is “real data”, which we are very familiar with: it is typically captured through systems and processes and linked to activity tracked in real life. The second, less well known form is synthetic data. It isn’t captured or tracked from something that happened in real life; it is programmatically simulated to look and behave like real data.

The expectation is that real data will continue to grow, given the huge focus being placed on its collection: new data companies, the explosion of IoT, businesses widening their data capture funnels and the improving quality of existing data, to name just a few. However, its growth still faces many potential constraints, which means it may not be able to support some of the evolution towards greater AI adoption.

Real data challenges:

1. In many cases data isn’t a commodity accessible to all.
2. Data privacy constraints are set against an ever changing regulatory framework.
3. Some data types have a time-sensitive shelf life.
4. Real data can have inherent biases which skew outcomes.
5. Businesses aren’t always designed like data companies. Their systems may not be optimised for data capture, and fixing this is a longer term requirement for the data design of future systems.

Step up, synthetic data, the potential saviour of these real data challenges.

Credit: Gartner

Gartner states: Synthetic data is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world.
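To make the two routes in that definition a little more concrete, here is a minimal Python sketch. It is only an illustration under simple assumptions, not how any particular vendor or tool generates synthetic data. The function names, the sales figures and the arrival rate are all hypothetical. The first function fits a basic normal distribution to a handful of real values and samples new ones from it; the second simulates data from an assumed process without touching real data at all.

```python
import random
import statistics


def synthesize_by_sampling(real_values, n, seed=0):
    """Route 1: fit a simple normal distribution to real values and sample from it."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)  # needs at least two real values
    return [rng.gauss(mu, sigma) for _ in range(n)]


def synthesize_by_simulation(n_days, visit_probability, population=100, seed=0):
    """Route 2: simulate daily visitor counts from an assumed process, using no real data."""
    rng = random.Random(seed)
    # Assumption: each person in the population visits independently each day.
    return [
        sum(rng.random() < visit_probability for _ in range(population))
        for _ in range(n_days)
    ]


# Hypothetical "real" observations, invented for this example.
real_daily_sales = [102.0, 98.5, 110.2, 95.1, 101.7, 99.3]

sampled = synthesize_by_sampling(real_daily_sales, n=1000)
simulated = synthesize_by_simulation(n_days=30, visit_probability=0.2)

print(round(statistics.mean(sampled), 1), simulated[:5])
```

In practice the sampling route would use far richer models than a single normal distribution, and the simulation route would encode real business rules, but the split between the two is the same.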

For many years synthetic data has taken a back seat while real data stacks were exhausted first. There has also been some lack of trust in synthetic data because it isn’t generated by real-life situations or systems. Yet in a world where decisions are predominantly made by humans, real data often already carries a lot of bias and quality issues, and there are many scenarios that real data sets simply don’t represent.

Moving AI forward through real data alone is therefore fraught with challenges, which raise questions such as:

1. Is enough data available for each individual scenario that could exist?
2. Has the right subset of data points, relevant to the outcomes being modelled, been designed and captured?
3. Is the data high quality and free of bias, and are you hampered by data privacy regulation which could affect learning models further down the line?
4. Are there data sets you can’t access which would have helped the models?

Access to all the fragments of data that lead you from insight to outcome is an important and typical gap in real data. What happens when a dog crosses a motorway and that event isn’t part of the training data used for autonomous vehicle testing? As humans, our brains fire the right neurons in our decision layer to answer that. But once you try to create a dataset that models every circumstance and the behaviour that goes with it, you realise that creating or sourcing training data that diverse is commercially impossible and time prohibitive. In fact, when you actually start modelling some basic scenarios in your data problems, don’t be surprised when the permutations end up in the quintillions, and it is highly unlikely you are going to have that much real data.
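The arithmetic behind that growth is simple multiplication, and a rough Python sketch shows how fast it runs away. The factor names and the number of values per factor below are hypothetical, chosen only for illustration; a real scenario model would define its own.

```python
import math


def count_scenarios(levels_per_factor):
    """Total combinations is the product of the number of values each factor can take."""
    return math.prod(levels_per_factor.values())


def factors_needed(levels, target):
    """Smallest number of factors, each with `levels` values, whose combinations reach `target`."""
    count, total = 0, 1
    while total < target:
        total *= levels
        count += 1
    return count


QUINTILLION = 10**18

# Hypothetical driving scenario factors and how many values each might take.
driving_factors = {
    "weather": 10,
    "time_of_day": 24,
    "road_type": 12,
    "obstacle_type": 50,
    "obstacle_speed": 20,
    "vehicle_speed": 30,
}

print(count_scenarios(driving_factors))
print(factors_needed(10, QUINTILLION))
```

With just these six made-up factors the count is already in the tens of millions, and it only takes 18 factors with ten possible values each to reach a quintillion combinations.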

So in reality we have only scratched the surface of data design and collection. As synthetic data models start taking shape we should see new opportunities with AI, and it will be interesting to follow this evolution and its impact on use cases across businesses and industries.

Some things you should be thinking about as you expand your real data and grow your synthetic data capability:

1. As part of your digital mandate and future tech changes, is there an evaluation of all your relevant systems to optimise the capture of data?
2. Within your internal data design plans are there clear definitions of types of data which need to be captured and linked?
3. Are you looking at your data management and architecture to ensure security, ease of access, flexibility and speed?
4. Do you have the right talent to support your overall data strategy, from business data experts and data modellers to data scientists?

Taking the example of autonomous vehicles, the irony of synthetic data in the future could be a comparison between the safety decisions made through our own human-led driving and those made by a car’s AI-led computer trained on synthetic data. The question is: who wins on safety? I think the phrase “it is inevitable” springs to mind.

So why is this such an important topic? It is about levelling the playing field a little across the world. It is important that the unequal distribution of data doesn’t stop governments and companies from driving their digital strategies, and synthetic data is a potential avenue to help close some of those gaps.