But this leads to our fifth problem, which is similar-yet-different data sets. Why are there multiple copies? Which one should I use? Is this data set still maintained, or is it a zombie data set that’s still regularly updated but without anyone overseeing it? The problem comes to a head when important computations disagree with each other because they rely on data sets that should be identical but are not. Providing conflicting reports, dashboards, or metrics to customers will result in a loss of trust and, in a worst-case scenario, loss of business and even legal action.
Even if you sort out all of these problems—reducing latency, reducing costs, removing duplicate pipelines and data sets, and eliminating break-fix work—you still haven’t provided anything that operations can use. They’re still on their own, upstream of your ETLs, because all of the cleaning, structuring, remodeling, and distribution work is only really useful for those in the data analytics space.
Shift left for a headless data architecture
Building a headless data architecture requires a rethink of how we circulate, share, and manage data in our organizations: a shift left. We take the ETL->bronze->silver work out of the downstream pipelines and move it upstream into our data products, much closer to the source.
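To make the shift concrete, here is a minimal Python sketch of the idea, not a definitive implementation. It assumes a hypothetical "orders" data product; the names OrderRecord, to_data_product, operational_lookup, and analytical_total, as well as the raw field names id, amount, and ts, are invented for illustration. The cleaning and standardizing that a downstream ETL->bronze->silver chain would normally perform happens once, next to the source, and both operational and analytical consumers read the same result.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class OrderRecord:
    """Cleaned record published by the (hypothetical) orders data product."""
    order_id: str
    amount_cents: int
    created_at: datetime


def to_data_product(raw_rows: Iterable[dict]) -> Iterator[OrderRecord]:
    """Clean and standardize raw source rows once, upstream.

    Assumed raw shape: {"id": str, "amount": str, "ts": int epoch seconds}.
    """
    seen = set()
    for row in raw_rows:
        order_id = str(row.get("id", "")).strip()
        if not order_id or order_id in seen:
            continue  # drop rows with no id, and duplicates of an id already seen
        seen.add(order_id)
        yield OrderRecord(
            order_id=order_id,
            amount_cents=round(float(row["amount"]) * 100),
            created_at=datetime.fromtimestamp(int(row["ts"]), tz=timezone.utc),
        )


def operational_lookup(records: Iterable[OrderRecord], order_id: str) -> Optional[OrderRecord]:
    """Operational consumer: fetch a single order by id."""
    return next((r for r in records if r.order_id == order_id), None)


def analytical_total(records: Iterable[OrderRecord]) -> int:
    """Analytical consumer: total order value in cents."""
    return sum(r.amount_cents for r in records)


raw = [
    {"id": " A1 ", "amount": "19.99", "ts": 0},
    {"id": "A1", "amount": "19.99", "ts": 0},   # duplicate of A1
    {"id": "", "amount": "1", "ts": 0},         # malformed: no id
    {"id": "B2", "amount": "5", "ts": 60},
]
records = list(to_data_product(raw))
```

Because both consumers read the same cleaned records, there is only one definition of an order to maintain, and a report built on analytical_total cannot drift away from what operational_lookup returns.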