A 2021 study (based on 300 “data leaders”) says the average data engineer spends 44% of their time maintaining data pipelines.
That’s anywhere from $75,000 to $110,000 of a single engineer’s salary going to maintenance.
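As a back-of-the-envelope check (a sketch, not something the study itself spells out), here is the arithmetic implied by those two numbers, assuming the range is simply 44% of a data engineer’s total compensation:

```python
# Back-of-the-envelope sketch: what total compensation do those figures imply?
# Assumption (not stated in the study): $75k-$110k is simply 44% of total comp.
maintenance_share = 0.44

for maintenance_cost in (75_000, 110_000):
    implied_total_comp = maintenance_cost / maintenance_share
    print(f"${maintenance_cost:,} on maintenance -> ~${implied_total_comp:,.0f} total compensation")
```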
On top of this come cloud costs, which at this point are many multiples of the cost of those person-hours.
A 2023 G2 article says 80% of companies spend over $1.2 million on cloud services, while more than a third of enterprises spend over $12 million on cloud technology.
That spend can grow by as much as 20% a year on average.
Adding to this is the cost of siloed data products. Large companies have numerous databases and data warehouses that no one fully understands. And because no one understands them, data teams keep maintaining them without being sure whether they can be reused for new data projects.
In short, IT budgets are under pressure, and the perceived value of tech spending is always judged relative to its cost.
Why does this matter for data teams?
The answer has been the same for over a decade now — they’re too far away from the business use case.
Analytics products are nice to have, but they’re often not compelling enough to override a business expert’s gut instinct (and their Excel sheet).
Is the data valuable if the final products are rarely used?
The gap between business reality and data analytics becomes even more evident as enterprises push for AI-enabled predictive data products.
However, most data warehouses are not designed to support the latest in AI. Predictive analytics is out of the question for all but the most digitally savvy companies (the FAANGs of the world), and even for them it’s not easy.
Right now, the average data leader at an enterprise has to bring their team closer to revenue and turn them into a valuable investment, one that results in better decision-making every day.
How can any of this be done if you’re stuck in a constant data engineering loop?
We’ve been working with large enterprises for years, and there’s always a “new” data project at hand because the current setup just isn’t delivering value.
The existing data warehouses are messy because no one knows exactly how the pipelines were built.
That could be because source systems have changed their architecture, such as API endpoints or the availability of specific data points, or because teams have turned over and the people who built the warehouses (often consultancies) are no longer around to troubleshoot.
Documentation is almost never enough to decide if a data warehouse is still as useful as when it was built.
Data governance initiatives were supposed to help in these situations, but in our experience, large enterprises have realized that existing governance programs haven’t produced scalable solutions.
The end result?
For instance, a data warehouse was supposed to make it faster to handle new business requests.
In reality, a new data warehouse project kicks off every time a significant request comes in from the business. Why?
For most teams, it's faster to build from scratch than to re-engineer what exists.
As mentioned before, existing data warehouses are a mess, and there’s no practical way to decode what’s in them or how it could serve new business requirements.
So building new wins over reusing existing databases.
And because there’s an expectation that ad-hoc requests will be handled fast, the new data warehouse is often built in a way that can’t be reused either.
You now have hundreds of databases to maintain, with duplicated, siloed pipelines feeding data products that aren’t delivering real business value.
How can data teams deliver “value” if they're constantly (re)engineering data warehouses & data marts?
The solution is to model a “semantic layer” that ties together data warehouses. There are many ways to interpret a semantic layer, so we’ll try to be as precise as possible.
A semantic layer — because it’s easily understood by both data and business teams — makes it possible to reuse what’s already been built.
A VP of Sales or a CFO can easily tell a data team what data they need to solve their problem, and both sides share a common reference in the model, one that truly represents real-world scenarios.
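For concreteness, here’s a minimal sketch (in Python, with hypothetical table, column, and metric names) of what that common reference can look like: business terms on one side, the physical warehouse tables that already exist on the other.

```python
from dataclasses import dataclass, field

# A minimal sketch of a semantic model. Every table, column, and metric name
# below is hypothetical -- the point is the mapping, not the specifics.

@dataclass
class Metric:
    name: str          # the term a CFO or VP of Sales actually uses
    source_table: str  # the existing warehouse table that backs it
    expression: str    # how the metric is computed from that table
    grain: str         # the level at which the metric is defined

@dataclass
class SemanticModel:
    metrics: dict = field(default_factory=dict)

    def register(self, metric: Metric) -> None:
        self.metrics[metric.name] = metric

    def lookup(self, business_term: str) -> Metric:
        # Business and data teams resolve the same term to the same definition:
        # the "common reference" described above.
        return self.metrics[business_term]

model = SemanticModel()
model.register(Metric(
    name="net_revenue",
    source_table="finance_dwh.fct_invoices",
    expression="SUM(invoice_amount) - SUM(credit_amount)",
    grain="per month, per business unit",
))

# A business request ("show me net revenue") now points at an existing table
# instead of kicking off a new warehouse build.
print(model.lookup("net_revenue").source_table)
```

The exact format matters less than the fact that both sides agree on it; the same mapping could just as well live in YAML or a metrics store.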
You’ll soon have a company culture where the default way to solve a problem isn’t to pile new technology on top of existing technology.
This brings down the budget considerably.
Plus, when a model is no longer fit for purpose, you can deprecate the corresponding databases instead of continuing to maintain them just because they’re black boxes and switching them off might break something.
A semantic layer is easy to construct when you’re starting from scratch. But what do you do when you already have hundreds of databases?
You reverse engineer. How?
A data team that has been working on the same business domains for some time already has data warehouses that are good enough.
Reusing these to deliver new requirements is cheaper and faster. Plus, you avoid the cost of maintaining stale databases.
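As a hedged sketch of where that reverse engineering can start: walk the catalogs of the warehouses you already have and turn every table into a candidate entry for the semantic model, which business owners then help label or mark for deprecation. The connection string and schema name below are placeholders, and SQLAlchemy is just one way to do the introspection.

```python
from sqlalchemy import create_engine, inspect

# Placeholder connection string -- swap in your actual warehouse.
engine = create_engine("postgresql://user:password@warehouse-host/analytics")
inspector = inspect(engine)

# Walk the existing catalog and produce candidate semantic-model entries.
candidates = []
for table in inspector.get_table_names(schema="public"):
    columns = inspector.get_columns(table, schema="public")
    candidates.append({
        "source_table": table,
        "columns": [col["name"] for col in columns],
        # To be filled in with the business owner: which real-world concept
        # does this table represent, and is it still worth keeping?
        "business_meaning": None,
        "keep_or_deprecate": None,
    })

for candidate in candidates:
    print(candidate["source_table"], "-", len(candidate["columns"]), "columns")
```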
Why wouldn't enterprises choose to do this?