The world of data and analytics is full of misunderstanding and misinformation. Unfortunately, much of it is perpetuated not just by data laypersons but by professionals of all kinds, from data scientists to vendors, executives, and advisors. That is a little frightening when you consider that these are supposed to be the experts guiding everyone else. Since I sit at both ends of this professional spectrum (I am both a data scientist and an advisor), I feel compelled to address perhaps the most persistent myth about data: “let the data speak.”
Data are objective, they say; trust the data, not the humans. Unfortunately, there are at least two fundamental problems with this idea. Here are a couple of truths about data, perhaps unexpected and somewhat painful.
Truth #1: Data are stupid and lazy.
Data are not intelligent. Even artificial intelligence must be taught before it learns to learn on its own (and even that is debatable). Data have no abilities of their own.
Data are inherently lazy. They do not say anything on their own. I had little problem with the expression “data-driven” when I was more naïve, because in all honesty, I didn’t think it mattered. After having explained so many times that data do not drive humans or anything for that matter, I moved on to “data-informed” (i.e., “informed by data”). However, data still do not do anything, much less inform anyone. “Data-oriented” or “informed with data” is probably closer but still seems inadequate. Data simply do not do anything. They just exist.
It is often said that insights must be teased out of data. You need to know how to engage in a meaningful conversation with data, which requires more than a simple pick-up line (“hey, do you want to tell me what your average is?”). It is the responsibility of analytics professionals to produce business value from the data. However, it is not because data are shy or uninterested in getting into a relationship; rather, it is because they are simply inanimate.
I strongly disagree with the notion that data scientists spend too much time wrangling with data and not enough time building the analytic—beyond the typical data management issues like lack of documentation and technical errors, of course. The story of how the data came to be is embedded in all data, and that is golden in understanding the whole of the problem to be solved. I learn more about the business problem by wrangling with things that are often perceived to be data issues but really are a product of the business environment (including processes, standards, policies, etc.).
The more you understand and make sense of the data, the better you understand the bigger picture, and the better the analytic. This process cannot be fully automated; it depends on humans being directly involved.
Truth #2: Data are rarely an objective representation of reality (on their own).
I want to clarify this statement: it does not say that data are rarely accurate or error-free. Accuracy and correctness are quality dimensions of the data themselves. The issue here is, rather, the incorrect use or interpretation of data. This is an important distinction, even though people tend to throw all data problems into one giant bucket.
We have all been implored to trust data as “hard facts,” so the notion that there is any room for interpretation may seem odd. However, this is precisely the problem: what one believes to be fact is often not the objective truth but rather an interpretation of the objective truth. This is a major challenge that leads to misuse and underperformance of analytics.
Unless you are heavily involved in experimental design or primary research, the data we come across as ordinary people are rarely collected for the specific purpose at hand. Statisticians make a pretty strong distinction between data whose collection (not just sampling) is designed for the specific purpose at hand and all other data. The latter is often called “observational” or “secondary use” and can be severely biased with respect to the question of interest, or at best confusing and/or inconsistent.
Those with a good foundation in probability and statistics understand how critical this is, and this is where the so-called “citizen data scientists” (or even trained data scientists) can get dangerous. The path the data travel, from their very origin until they land in your hands, determines what conclusions you can legitimately draw, and you need a solid foundation in probability (among other things) to draw them properly.
While it makes perfect sense to leverage the data we have anyway, doing so requires knowing how to put them into the correct context, which differs, quite literally, from one use to another. Since the vast majority of the data we see day to day are observational, your data will always embody some degree of inappropriateness for the problem at hand. Knowing the consequences, how to make the data less inappropriate, and how precise the resulting conclusions can be, becomes critical in making decisions with data. It does not mean you cannot use the data, but it does mean you could really screw yourself without knowing what you are doing.
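A concrete way to see how observational data can mislead is Simpson’s paradox: one treatment can look worse in the pooled numbers even though it is better within every subgroup, simply because of how cases happened to be assigned. Here is a minimal sketch; the treatment names, severity labels, and counts are all made up for illustration:

```python
# Hypothetical observational counts (illustrative, not real data).
# Harder (severe) cases were assigned to treatment A more often,
# which drags down A's pooled success rate.
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},   # (successes, total)
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within every severity stratum, A outperforms B...
for severity, arms in groups.items():
    ra, rb = rate(*arms["A"]), rate(*arms["B"])
    print(f"{severity}: A={ra:.0%}  B={rb:.0%}  (A better: {ra > rb})")

# ...yet pooled over strata, B appears to win.
totals = {arm: tuple(map(sum, zip(*(groups[g][arm] for g in groups))))
          for arm in ("A", "B")}
ra_all, rb_all = rate(*totals["A"]), rate(*totals["B"])
print(f"overall: A={ra_all:.0%}  B={rb_all:.0%}  (B 'better': {rb_all > ra_all})")
```

Nothing in the numbers themselves is wrong; it is the pooled, context-free reading of them that produces the wrong conclusion. That is exactly the kind of interpretation error observational data invites.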
In short, data do not know what you think they know. They reveal what they contain without any of the sensemaking humans can do about the discrepancy from true reality; data do not do any sensemaking (again, or anything else). It would be unwise to discount how the provenance of the data does or does not relate to the whole of your problem (provided you understand the whole of the problem). This is not unlike drinking milk without checking that it is still good (or checking the type of milk if you are allergic) just because it is in your refrigerator; the whole of the problem is not just to consume milk but to do so in a way that does not make you ill.
We are seeing this firsthand right now with the COVID-19 crisis. There is a reason that the data—again, the data we have all been led to believe are objective, hard facts—seem inconsistent to a rational person. The numbers alone are often out of context, measured differently, etc.—this is all expected because the COVID-19 crisis is not a designed experiment.
The idea of a “rational person” is a discussion for another day, but when we throw in confirmation bias (humans are more likely to agree with something that supports their existing views) and social media platforms, everyone is now an expert in the name of “hard data.” What many do not realize is that they are arguing their point armed with “facts” that do not represent reality. For the record, I am not saying that social media is bad or that one should never use data, but one should use data in a way that is less likely to have a negative impact.
(PSA: Wash your hands and practice social distancing, which, in reality, is physical distancing!)
True inaccuracy and errors in data are at least relatively straightforward to address, because they are generally all logical in nature. Bias, on the other hand, involves changing how humans look at data, and we all know how hard it is to change human behavior.
Don’t (just) let the data speak
So, please do not just let the data speak, and please do not expect the data to just speak to you. If anyone, including and especially anyone of the data science persuasion, tells you to “just let the data speak,” it should raise a red flag. The discussion around how blind faith in data leads to a lot of bias in analytics is growing, but real solutions beyond the intangibles like diversifying the team are still hard to come by. Those who really understand how to solve problems with data know why letting the data just speak is a bad idea; it not only grossly simplifies the business problem but can create other “unintended consequences.” This is an entire discussion on its own, which we will save for another day….