What does “life” mean for data, and why should anyone care?
I’ve long insisted that neither the technology function nor the analytics function manages the entire lifecycle of data, and that this creates a critical gap, as in this blog here. This raises the question of what we mean by a “data lifecycle” and, subsequently, whether data has “life” at all.
The data lifecycle
Clearly, we have applied the idea of life to many things that are not alive in the biological sense. I think we all know data is not biologically alive. Nor do we usually think of it as something that is born or something that dies. Especially the latter: once generated, data can exist forever, right? I once mentioned the idea of the data lifecycle to a client. She looked at me funny and said: “I don’t think of data as something that dies.”
Every piece of data has a beginning and an end, whether or not we explicitly acknowledge them. What do we mean by “the life of data”? Better yet, what is the meaning of life for data? For those of us old enough to be familiar with Monty Python’s The Meaning of Life: “makes you think, doesn’t it?”
The tendency today is still to let data accumulate and deal with it only when we need to. We zoom in on how to get and use information. As a result, data practices are still very consumption-centric. The message is still largely about how we need to make better use of all the data we have.
By now, we’ve all come across discussions centered around either the value or the risk (or both!) of data. A lot of those center around data acquisition and use: sourcing, access, insight generation, tools, transmission, and storage. But the value and the risks are present throughout the life of information. By not considering the entire lifecycle, we not only miss out on the full value of data but also create issues that turn data into a net loss or even put the entire organization at risk.
The birth, the death, and the afterlife
Then, what does a data lifecycle look like? There are four basic stages:
- Data acquisition. We get data somehow, be it by birth or by adoption. It may be that some technology application automatically generates data in real time. We may gain custody of data from a vendor or a public source. We may collect data intentionally, as we do when we conduct surveys or scientific experiments.
- Data processing. What we generate, collect, or acquire needs caring, nurturing, and even disciplining (!). This includes the usual extract-transform-load (ETL) concepts, but it’s not just a matter of putting the data somewhere and storing it. The data needs cleaning, structuring, business rules developed and applied, and documentation, among other things. Data needs to “mature” into adolescence and then adulthood so that users can use it correctly and efficiently.
- Data use. This is the reason data has a life at all. We also need to maintain data on an ongoing basis so that we don’t keep making decisions based on bad input.
- Data sunset, a.k.a. data death. At some point, pieces of data become ineffective or can no longer be used for one reason or another. So, we withdraw them from service or even eliminate them altogether.
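For those who work with Python the scripting language, the four stages above can be sketched as a tiny state model. This is only an illustration under my own assumptions: the names `Stage`, `ALLOWED`, and `advance` are hypothetical, and so are the transition rules (for instance, that data in use can loop back to processing for maintenance). Treating sunset as a terminal state is also a simplification, since it says nothing about the afterlife.

```python
from enum import Enum


class Stage(Enum):
    """The four basic lifecycle stages listed above."""
    ACQUISITION = "acquisition"
    PROCESSING = "processing"
    USE = "use"
    SUNSET = "sunset"


# Hypothetical transition rules, assumed for illustration only.
ALLOWED = {
    Stage.ACQUISITION: {Stage.PROCESSING, Stage.SUNSET},
    Stage.PROCESSING: {Stage.USE, Stage.SUNSET},
    Stage.USE: {Stage.PROCESSING, Stage.SUNSET},  # maintenance loops back
    Stage.SUNSET: set(),  # terminal in this sketch
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.ACQUISITION
stage = advance(stage, Stage.PROCESSING)
stage = advance(stage, Stage.USE)
print(stage.value)
```

The point of writing the rules down, even this crudely, is that every stage becomes something we name and decide on rather than something that just happens.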
Generally, we think about the birth of data only when we have to. In the back of our minds, we know it happens because the very existence of data tells us so. But we spend relatively little mindshare on how it happens and, more importantly, how that impacts us. Aspects of the sources frame the validity of the conclusions drawn from the data, so this is not trivial. That is, you could inadvertently make a grossly wrong decision if you choose to ignore the details about the birth of the data. Even the most advanced analysis techniques and algorithms do not spare you from this.
What about death? Without mandates, rarely does anyone think of sunsetting data, much less how to best do so. Regulations often require us to do some of this, but we do only the bare minimum, and only because we have to.
Then, to quote The Meaning of Life again, “is death really the end? … Is there an afterlife?” You may find it absurd, but data does have an afterlife. There are risks associated with data that remain even after you take it out of service. They can come back and haunt us like zombies. Ponder that for a minute.
Analytics practitioners (statisticians, data scientists, machine learning engineers, etc.) take a look at data quality on a very limited basis. They do so just to be able to do their real job: to make something out of data. In the vast majority of cases, the data used for any given analysis is only the tip of the data iceberg. The other 90% (or more!) sits idle and uncleansed until someone takes a look at it.
It’s not unlike weeds in a garden. You can clean out the weeds from what you see upfront, but you could have a jungle back there. Things could be festering out of sight and slowly killing your prized plants. You don’t see it happen, and when you do, it’s too late.
It also creates a vicious cycle of “get data, clean only the immediately necessary data, use only the immediately necessary data.” The garbage-in-garbage-out principle applies not only when we develop an analytic but also every time we use data for decision making. This is independent of the frequency of those decisions: once, occasionally, or on an ongoing basis. In the meantime, non-technical data users run into issues and lose trust in the data. Lost trust in data is very difficult to regain, and this is not a data issue but rather a human issue.
The life experiences
It’s not just that the entirety of data is far bigger than any single analysis effort, or even several. Data changes over time. It accumulates life experiences in the form of data lineage. Life experiences! Who does what to it along the way? How? And what does it become at what point?
Too commonly, data gets cloned by various actors to suit their own needs. This is one of the tell-tale signs of inadequate data strategy and management. Then, each clone starts to take on a life of its own, each with its own life experiences. They may separately generate some locally optimized value. But collectively, they invariably detract from the global (i.e., enterprise) value optimum and generate troubling global risks.
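To make “life experiences” concrete, here is a minimal Python sketch of a lineage log that answers those questions: who, what and how, what the data became, and when. The class names, field names, and the example dataset are all hypothetical, chosen only for illustration, and this is not how any particular lineage tool works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LineageEvent:
    actor: str     # who touched the data
    action: str    # what they did to it, and how
    result: str    # what the data became
    at: datetime   # at what point


@dataclass
class Dataset:
    name: str
    lineage: List[LineageEvent] = field(default_factory=list)

    def record(self, actor: str, action: str, result: str,
               at: Optional[datetime] = None) -> LineageEvent:
        """Append one life experience to this dataset's history."""
        event = LineageEvent(actor, action, result,
                             at or datetime.now(timezone.utc))
        self.lineage.append(event)
        return event


# Hypothetical example: a survey dataset picking up two experiences.
survey = Dataset("customer_survey")
survey.record("ingest job", "loaded vendor file", "raw table")
survey.record("analyst", "removed duplicate responses", "deduplicated table")
for event in survey.lineage:
    print(event.at.isoformat(), event.actor, event.action, "->", event.result)
```

A log like this is only as good as the discipline behind it: it helps only if every actor who touches the data records what they did.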
What about third-party data? We expect most tactical issues to be resolved already when you acquire data from elsewhere, especially from reputable sources. But there is a whole set of responsibilities that come along with getting that piece of data. Placing it in a new environment does not take care of itself. Data needs to adapt to the new environment and vice versa. You need to care for it, whether you adopt it or raise it from its birth.
These concerns are all not only valid but in fact critical.
Managing data from birth to death and beyond
Neither the technology function nor the analytics function is properly empowered to manage data throughout its lifecycle, nor is either focused on doing so. There is a reason data management exists as an entire discipline! Managing the entire lifecycle of data means having intentional support for data all the time. This starts at the raw data stages at the latest, continues through the use or consumption of insights in decision making, and extends to sunsetting the information altogether and beyond.
Even if the going is good now, at some point the cost and especially the risk will outweigh the benefits. Only those specifically focused on data management have a full picture of the value and the risks of the data. And in practice, data management professionals don’t get involved unless someone in technology, analytics, or business/research acknowledges they exist.
Regulations are far from comprehensive, and regulatory compliance is reactive by nature. Reactive data management is far more expensive than proactive and intentional data management in the long run. Ethics also dictate that we should be intentional about when and how to manage and sunset data regardless of domain.
As for the meaning of life itself? According to Monty Python: “try and be nice to people, avoid eating fat, read a good book every now and then, get some walking in, and try to live together in peace and harmony with people of all creeds and nations.” The underlying idea applies to data, too! The events of the last few years, and especially today, make this statement truer than ever. Data or otherwise.
P.S. Not everyone finds Monty Python and its sense of humor palatable. But if you work with Python the scripting language, or work with someone who does, you should at least be familiar with its namesake. That said, I’m more of a “Holy Grail” person than a “Meaning of Life” person…