A 2021 study (based on 300 “data leaders”) says the average data engineer spends 44% of their time maintaining data pipelines.
That’s anywhere from $75,000 to $110,000 of a single engineer’s salary going to maintenance.
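As a back-of-the-envelope check (a sketch, not something the study itself spells out), here is the arithmetic implied by those two numbers, assuming the range is simply 44% of a data engineer’s total compensation:

```python
# Back-of-the-envelope sketch: what total compensation do those figures imply?
# Assumption (not stated in the study): $75k-$110k is simply 44% of total comp.
maintenance_share = 0.44

for maintenance_cost in (75_000, 110_000):
    implied_total_comp = maintenance_cost / maintenance_share
    print(f"${maintenance_cost:,} on maintenance -> ~${implied_total_comp:,.0f} total compensation")
```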
On top of this come cloud costs, which at this point are many multiples of the cost of those person-hours.
A 2023 G2 article says 80% of companies spend over $1.2 million on cloud services, while more than a third of enterprises spend over $12 million on cloud technology.
That spend can grow by as much as 20% a year on average.
Adding to this is the cost of siloed data products. Large companies have numerous databases and data warehouses that no one fully understands. And because no one understands them, data teams keep maintaining them without being sure whether they can be reused for new data projects.
In short, IT budgets are under pressure, and the perceived value of tech spending is always judged relative to its cost.
Why does this matter for data teams?
The answer has been the same for over a decade now — they’re too far away from the business use case.
Analytics products are nice to have, but they’re often not compelling enough to override a business expert’s gut instinct (and their Excel sheet).
Is the data valuable if the final products are rarely used?
The gap between business reality and data analytics becomes even more evident as enterprises push for AI-enabled predictive data products.
However, most data warehouses are not designed to support the latest in AI. Predictive analytics is out of the question for all but the most digitally savvy companies (the FAANGs of the world), and even for them it’s not easy.
Right now, the average data leader at an enterprise has to bring their team closer to revenue and turn them into a valuable investment, one that results in better decision-making every day.
How can any of this be done if you’re stuck in a constant data engineering loop?
We’ve been working with large enterprises for years, and there’s always a “new” data project at hand because the current setup just isn’t delivering value.
The existing data warehouses are messy because no one knows exactly how the pipelines were built.
That could be because source systems have changed their architecture, such as API endpoints or the availability of specific data points, or because teams have turned over and the people who built the warehouses (often consultancies) are no longer around to troubleshoot.
Documentation is almost never enough to decide if a data warehouse is still as useful as when it was built.
Data governance initiatives were supposed to help in these situations, but in our experience, large enterprises have realized that existing governance programs haven’t produced scalable solutions.
The end result?
For instance, a data warehouse was supposed to make it faster to handle new business requests.
In reality, a new data warehouse project kicks off every time a significant request comes in from the business. Why?
For most teams, it's faster to build from scratch than to re-engineer what exists.
As mentioned before, existing data warehouses are a mess, and there’s no practical way to decode what’s in them or how it could serve new business requirements.
So building new wins over reusing existing databases.
And because there’s an expectation that ad-hoc requests will be handled fast, the new data warehouse is often built in a way that can’t be reused either.
You now have hundreds of databases to maintain, with duplicated, siloed pipelines feeding data products that aren’t delivering real business value.
How can data teams deliver “value” if they're constantly (re)engineering data warehouses & data marts?
The solution is to model a “semantic layer” that ties together data warehouses. There are many ways to interpret a semantic layer, so we’ll try to be as precise as possible.
A semantic layer — because it’s easily understood by both data and business teams — makes it possible to reuse what’s already been built.
A VP of Sales or a CFO can easily tell a data team what data they need to solve their problem, and both sides share a common reference in the model, one that truly represents real-world scenarios.
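For concreteness, here’s a minimal sketch (in Python, with hypothetical table, column, and metric names) of what that common reference can look like: business terms on one side, the physical warehouse tables that already exist on the other.

```python
from dataclasses import dataclass, field

# A minimal sketch of a semantic model. Every table, column, and metric name
# below is hypothetical -- the point is the mapping, not the specifics.

@dataclass
class Metric:
    name: str          # the term a CFO or VP of Sales actually uses
    source_table: str  # the existing warehouse table that backs it
    expression: str    # how the metric is computed from that table
    grain: str         # the level at which the metric is defined

@dataclass
class SemanticModel:
    metrics: dict = field(default_factory=dict)

    def register(self, metric: Metric) -> None:
        self.metrics[metric.name] = metric

    def lookup(self, business_term: str) -> Metric:
        # Business and data teams resolve the same term to the same definition:
        # the "common reference" described above.
        return self.metrics[business_term]

model = SemanticModel()
model.register(Metric(
    name="net_revenue",
    source_table="finance_dwh.fct_invoices",
    expression="SUM(invoice_amount) - SUM(credit_amount)",
    grain="per month, per business unit",
))

# A business request ("show me net revenue") now points at an existing table
# instead of kicking off a new warehouse build.
print(model.lookup("net_revenue").source_table)
```

The exact format matters less than the fact that both sides agree on it; the same mapping could just as well live in YAML or a metrics store.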
You’ll soon have a company culture where the default way to solve a problem isn’t to pile new technology on top of existing technology.
This brings down the budget considerably.
Plus, when a model is no longer fit for purpose, you can deprecate the corresponding databases instead of continuing to maintain them just because they’re black boxes and switching them off might break something.
A semantic layer is easy to construct when you’re starting from scratch. But what do you do when you already have hundreds of databases?
You reverse engineer. How?
A data team that has been working on the same business domains for some time already has data warehouses that are good enough.
Reusing these to deliver new requirements is cheaper and faster. Plus, you avoid the cost of maintaining stale databases.
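As a hedged sketch of where that reverse engineering can start: walk the catalogs of the warehouses you already have and turn every table into a candidate entry for the semantic model, which business owners then help label or mark for deprecation. The connection string and schema name below are placeholders, and SQLAlchemy is just one way to do the introspection.

```python
from sqlalchemy import create_engine, inspect

# Placeholder connection string -- swap in your actual warehouse.
engine = create_engine("postgresql://user:password@warehouse-host/analytics")
inspector = inspect(engine)

# Walk the existing catalog and produce candidate semantic-model entries.
candidates = []
for table in inspector.get_table_names(schema="public"):
    columns = inspector.get_columns(table, schema="public")
    candidates.append({
        "source_table": table,
        "columns": [col["name"] for col in columns],
        # To be filled in with the business owner: which real-world concept
        # does this table represent, and is it still worth keeping?
        "business_meaning": None,
        "keep_or_deprecate": None,
    })

for candidate in candidates:
    print(candidate["source_table"], "-", len(candidate["columns"]), "columns")
```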
Why wouldn't enterprises choose to do this?