Probably the oldest refrain in computing in general, and data management in particular, is that garbage in equals garbage out. But there's a corollary to good old GIGO, and it comes in the form of a proverb: horses for courses.
What you are about to read may shock you, but only slightly. Using reasonably good data to get a reasonably good outcome can be just fine. It can even be a better approach for some use cases than using perfectly refined, one hundred percent accurate data.
Let's take a quick step backwards. As data continues to expand at enormous rates, traditional methods for building ETL and data integration are simply too slow. By the time the data is sorted out, it is no longer fit for purpose: the question has passed, the business case has changed, or a bit of guesswork has provided the insight sought (whether successfully or not is beyond the remit of this blog).
Use cases are key
So, what's the answer here? Quite simply, the use case should dictate the extent to which the data is refined. If the goal is to test a hypothesis for a marketing campaign, for example, then, to borrow the analogy that data is the new oil, refining it into diesel fuel is good enough. There's no need to turn it into jet fuel before you know whether the plane can even fly.
The resulting 'crude but fast' analysis can provide indicative answers (and direction) on what to do next.
If it looks promising and worthwhile, the case could be made to continue down that path and further refine the results by getting the experts involved.
What's important in establishing a set of standards for any data analysis is to include caveats. Perhaps break the data stores down into bronze (for the marketing example), silver and gold tiers. When a 'bronze' analysis is done, and the SLAs and quality aren't as high but the data is fit for purpose, just go ahead and say so. But don't build something on bronze and present it to the chief exec without letting him or her know it.
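To make the idea concrete, here is a minimal sketch of what tier labelling with attached caveats could look like in code. This is purely illustrative: the `Tier` and `Dataset` names are hypothetical, not part of any particular product or standard.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Refinement levels: bronze is 'crude but fast', gold is fully refined."""
    BRONZE = 1
    SILVER = 2
    GOLD = 3


@dataclass
class Dataset:
    name: str
    tier: Tier

    def caveat(self) -> str:
        """The disclaimer that should accompany any analysis built on this data."""
        if self.tier is Tier.GOLD:
            return f"{self.name}: fully refined; fit for production reporting."
        if self.tier is Tier.SILVER:
            return f"{self.name}: partially refined; validate before wider use."
        return f"{self.name}: bronze tier; indicative results only, not for executive reporting."


# A quick bronze-tier exploration carries its warning label with it.
campaign = Dataset("campaign_clicks", Tier.BRONZE)
print(campaign.caveat())
```

The point of the design is that the caveat travels with the dataset itself, so nobody downstream can present bronze-tier results without the warning coming along for the ride.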
By pre-establishing the chances of success with a rough early run, it also becomes possible to give the business some sense of control. It breaks down barriers by equipping line-of-business people with the tools to experiment, without consuming the time and effort of dedicated specialists until accurate, gold-standard jet fuel is required.
And use the right tools
A data warehouse automation tool like WhereScape RED is valuable in the process of collecting and refining data to the right level for the job at hand. If it is to be bronze, to explore initial queries, no problem. Because the data is in a central repository, if it is then necessary to move to silver, and potentially on to gold (the jet fuel standard), it is all in that one place rather than distributed across disparate systems. You can model it, apply rules or clean the data inside RED. The whole process for all levels is managed from one tool, and it uses the power of automation, data lineage and auto-documentation to make things easier for you. What's more, you always know what you have got, and to which level it is refined.
A word of caution, however: governance and clear communication remain necessary across the process. That's because you don't want data analysts becoming pseudo-developers who won't apply the same level of rigour when creating production systems. Once the use case is demonstrated and handed over to the specialists for refinement and automation, the analysts should return to their core duty of looking out for the next way in which data can deliver an advantage.