Someone asked this very question recently on the Predictive Analytics Network (PAN) forum on LinkedIn. Below is my answer, somewhat expanded relative to what I posted in that venue.
In my opinion, caveats and potential downsides in dealing with predictive analytics and ‘big data’ include:
1. Naively believing one can number-crunch one’s way to spectacular results simply because reams of data are available, coupled with a misplaced confidence in being able to interpret those results appropriately without a thorough understanding of the domain (ties to #3).
2. Disregarding the ‘fitness for purpose’ of any and all tools used.
3. Not understanding the limitations and ranges of applicability of a tool or a ‘solution’ (ties to #2).
4. Ignoring ongoing structural change, which means existing data may have low predictive value. This happens more often than one may suspect. Indeed, one of the three Vs of ‘big data’ (Volume) has a time component implicitly associated with it, and, as we know, things change over time. This does not seem to give people pause or stop them from trying to predict something, the specifics of their context notwithstanding.
5. Not understanding the present well. If one doesn’t, it is futile to try to predict the future. Good forecasters have a clear grasp of what is going on now. Moreover, past and present data are needed to make an educated guess at what lies beyond the horizon. These data must be time-stamped and kept, not simply overwritten as they change.
6. Paying insufficient attention to data quality and to the processes associated with data stewardship, from data entry onward. This should not be confused with the wholesale elimination of outliers, which are themselves quite useful in detecting deviations from common patterns.
7. Erroneous or inadequate data conditioning, including weighing the use of derived vs. raw data. For example, sometimes a ratio (BMI) is a better indicator than the individual variables (weight, height). Understanding feasible data ranges also helps to scope matters down.
8. Not thinking hard enough about the right data, itself perhaps a subset of ‘big data.’ Specifically, not focusing on the data that help answer the right, correctly posed question. It all starts with not framing the problem properly.
9. Succumbing to managerial pressure to deliver insights in the short term and at a certain rate, a pressure that tends to be directly proportional to management’s investment in platforms and tool-sets. This is one instance where productivity quotas are misplaced.
10. Ego and lack of humility, always unhealthy in any research endeavor.
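The data-conditioning point above (weighing derived against raw variables, and scoping by feasible ranges) can be sketched in a few lines. The field names and range limits below are illustrative assumptions, not prescriptions:

```python
# Illustrative sketch: derive BMI from raw weight/height and flag records
# whose raw values fall outside feasible ranges. The ranges below are
# assumptions chosen for illustration, not clinical standards.

FEASIBLE = {"weight_kg": (2.0, 300.0), "height_m": (0.4, 2.5)}

def condition(record):
    """Validate raw variables against feasible ranges, then derive BMI."""
    for field, (lo, hi) in FEASIBLE.items():
        value = record.get(field)
        if value is None or not (lo <= value <= hi):
            return None  # flag for review rather than silently discarding
    record["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    return record

# A ratio such as BMI can carry more signal than either raw variable alone.
print(condition({"weight_kg": 70.0, "height_m": 1.75}))  # BMI ~ 22.86
print(condition({"weight_kg": 70.0, "height_m": 17.5}))  # None: infeasible height
```

Returning `None` rather than deleting the record preserves the outlier for later inspection, in line with the point above about not eliminating outliers wholesale.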
In the end, of course, these are in no way the ‘fault’ of either predictive analytics or ‘big data.’ Rather, they are indicative of all-too-human foibles that we come across in many domains. They tell us we need to guard against our innate tendency to skip the basics and let unbridled enthusiasm get the better of us as we rush toward some imagined, wished-for result: generalizing when it is not justified, seeing trends that are not there, jumping to conclusions from unverified assumptions, and relying excessively on tools so complex under the hood that, unless we helped develop them, we may have to accept their output almost on faith.
In conclusion, coaxing information and developing ground-breaking insights from data may require more finesse and patience than many appear to be willing to invest and less brute force than we can all too easily wield with the tools at our disposal today.