Updated
December 24, 2024

Emerging trends in data engineering

Explore the top emerging data engineering trends shaping 2025, from DataOps and MLOps to LLM copilots and vector databases. Learn how innovations in data quality, governance, and observability are transforming the data landscape.

Etai Mizrahi
Co-founder

Table of contents

  • Current State of Data Engineering
  • Top 8 Data Engineering Trends for 2024
    • DataOps and MLOps
    • Data Mesh and Data Fabric
    • More Power to Data Quality and Data Governance
    • Increased Adoption of Data Orchestration and Observability
    • Emergence of Data Vaults
    • LLM Copilots
      • RAG to the Rescue
    • Vector Databases
    • The Rise of GitOps

Current state of data engineering

402.74 million terabytes.

That’s how much data is being created every day.

With AI taking the wheel, that number is bound to go up considerably. This large data volume makes the data quality, security and governance conversation more pertinent because bad data and data breach costs are at an all-time high. 

$3.1 trillion.

According to IBM, that’s how much businesses lose every year due to poor data quality.

They also found that, on average, a data breach in 2024 costs an organization $4.88 million.

Over the last decade, tech-VC funding has ballooned due to an explosion of data tooling and belief in data-driven organizations. Although the explosion has been largely beneficial, it also created new challenges for data teams.

Challenges like growing dependencies and complexities, overlapping tool capabilities, and perhaps the most critical issue plaguing the modern data landscape: the rapidly growing cost of running infrastructure that often exceeds the value delivered.

The ‘growth-at-all-costs’ approach has come home to roost. Many companies have had to pivot their business models significantly to focus on profitability and sustainable growth. This shift has resulted in increasing pressure on data teams to demonstrate a return on investment.

That’s what the data engineering picture looks like at the moment. Let’s discuss the key data engineering trends shaping the market in 2024 and how this space might evolve.

Top 8 data engineering trends for 2024

According to Gartner, this is what the hype cycle for key data management trends looks like:

Let’s talk about the 8 trends that’ll influence the data engineering landscape:

DataOps and MLOps

Organizations are replacing old data management setups with new practices focused on automation for effective analytics. This approach uses a continuous feedback loop to ensure data quality. Continuously monitoring data health allows DataOps to resolve issues proactively, leading to faster deployment of data pipelines.

DataOps, along with data observability tools, provides comprehensive visibility into data access, flows, and usage. This transparency will allow organizations to tackle technical, ethical, and legal challenges as data becomes more accessible.

With increased adoption of monitoring tools, continuous integration, and automation, DataOps will become a key practice for companies looking to streamline their data management process.
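
As a rough illustration of the continuous health checks described above, the sketch below runs a few rule-based checks over a batch of records. The field names, the 5% null threshold, and the `check_batch` helper are assumptions made for this example and do not come from any specific DataOps tool.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_batch(rows, required_fields, max_null_rate=0.05):
    """Run simple health checks on a batch of dict records (hypothetical rules)."""
    results = []
    # Check 1: the batch should not be empty.
    results.append(CheckResult("non_empty", len(rows) > 0, f"{len(rows)} rows"))
    # Check 2: each required field should be mostly populated.
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        rate = nulls / len(rows) if rows else 1.0
        results.append(
            CheckResult(f"null_rate:{field}", rate <= max_null_rate, f"{rate:.0%} null")
        )
    return results


# Hypothetical batch: one of four records is missing its amount.
batch = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 12.5},
    {"order_id": 4, "amount": 8.0},
]
results = check_batch(batch, required_fields=["order_id", "amount"])
failed = [r.name for r in results if not r.passed]
print(failed)
```

In a pipeline, a non-empty `failed` list would typically block the deployment or raise an alert instead of just being printed.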

MLOps is used to maintain machine learning models. This year, we have seen, and will keep seeing, growing democratization of MLOps. This should further drive the rise of no-code/low-code MLOps platforms.

This would make MLOps more accessible to a broader range of users, irrespective of their technical proficiency. MLOps CI/CD is a sub-trend on the rise. It brings continuous integration and continuous delivery principles into the machine learning lifecycle. By automating and standardizing workflows from data ingestion to model deployment, this approach helps ensure reliability via testing, quality checks, and monitoring. Key players in the space, like Argo CD, Jenkins, and GitHub Actions, are spearheading this transformative trend.

The lack of proficient folks in the space seems to be one of the obstacles to the more widespread adoption of MLOps.

Data Mesh and Data Fabric

Zhamak Dehghani first introduced the concept of Data Mesh. It picked up steam in the community because it provided greater flexibility for data owners.

Lately, the promise of data mesh seems to be wavering. The Gartner Hype Cycle estimates that data mesh will become obsolete before it reaches the plateau. Hannes Rollin listed a few reasons why that might be the case:

  • Most of the data is not useful: Most of the data collected isn’t top-notch and shouldn’t be made top-notch. Finding out which data is worth using is a rather painstaking process.
  • Lack of data competence: There are way too many developments and not enough competent folks in data teams to keep up with it.
  • Maintenance Burden: Maintenance burden is probably the biggest issue plaguing the data mesh users. The more different things you try, the more stress in the system, the more your maintenance cost mounts.

Another approach that has risen to tackle data silos is data fabric. It provides an integrated architecture that connects data across various platforms, irrespective of whether the data is on-premises, in the cloud, or on the edge.

However, according to SeattleDataGuy’s State of Data Survey, not a lot of people had heard about Data Fabric:

According to Gartner’s Hype Cycle graph, data fabric still has 2-5 years to go before it hits the plateau, so it’s still early for the technology. Siloed data stacks and the lack of preparedness for genAI are driving organizations to search for new approaches, which will allow this technology to grab the spotlight in the coming days. You’ll start seeing more vendors creating unified data fabric platforms for automating tasks like schema alignment for profiling data and new data sources.

This prediction is further supported by the fact that G2 reported that it takes a mere 3 months on average for companies using data fabric software to see positive ROI:

More power to Data Quality and Data Governance

As discussed in the previous section, the vast majority of data tends to be useless to data teams. To extract meaningful insights from the remaining data, you need the data quality to be top-notch.

All the other trends mentioned here are directly or indirectly trying to improve the quality of data used for analysis. 

With the AI takeover entering hyperspeed, focusing on data quality will become even more pivotal because the cost of ingesting bad-quality data has always been pretty high. Case in point: In 2022, Unity Software lost $110 million by ingesting poor-quality data!

Salma Bakouk described the 1x10x100 rule, which captures how the cost of bad data quality grows the later an issue is caught:

  • Addressing a data quality issue at the point of entry is approximately 1x the original cost.
  • If the issue spreads undetected through the system, the cost increases to about 10x, involving remediation and correction efforts.
  • If it reaches the end-user or decision-making stage, the cost can go up to 100x the initial expense due to consequences like lost opportunities, operational disruptions, and customer dissatisfaction. 
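
To make the arithmetic behind the rule concrete, here is a small sketch that applies the three multipliers to the same 100 issues caught at different stages. The base cost of 1 unit per issue and the split of issues across stages are assumptions chosen only for illustration.

```python
# Cost multipliers per stage, taken from the rule above.
STAGE_MULTIPLIER = {"entry": 1, "in_system": 10, "end_user": 100}


def remediation_cost(base_cost, issues_by_stage):
    """Total cost given a base cost per issue and issue counts per stage."""
    return sum(
        base_cost * STAGE_MULTIPLIER[stage] * count
        for stage, count in issues_by_stage.items()
    )


# Scenario A: all 100 issues are caught at the point of entry.
caught_early = remediation_cost(1, {"entry": 100})
# Scenario B (hypothetical split): most issues slip through to later stages.
caught_late = remediation_cost(1, {"entry": 10, "in_system": 30, "end_user": 60})
print(caught_early, caught_late)
```

Under these assumed numbers, the late scenario costs 6,310 units against 100 units for the early one, which is why checks at the point of entry pay off.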

It’s only going to skyrocket from here.

This makes it all the more important for organizations to invest in data governance, data literacy, and a data culture.

Kris Peeters, Founder and CEO at Dataminded, talked about the importance of data governance:

“The big risk today is that we embrace data product thinking without investing significant efforts in governance and automation. This will lead to several isolated data platforms in the organization, thus reinforcing silos instead of breaking them down.” 

Implementing a robust data governance setup would mean regular audits of data quality, deduplicating and cleaning datasets, and maintaining data currency — all of which are crucial for maintaining high-quality data. Organizations can use data quality validation and monitoring tools to enhance the processes by reducing manual work.
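
Two of the audit steps mentioned above, deduplication and a data currency check, can be sketched in a few lines. The record layout, the `updated_at` field, and the one-year staleness window are assumptions for this example, not a prescribed governance policy.

```python
from datetime import datetime, timedelta


def deduplicate(rows, key):
    """Keep the most recently updated row for each key value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row[key]] = row
    return list(latest.values())


def stale_rows(rows, now, max_age):
    """Return rows whose last update is older than max_age (a currency check)."""
    return [r for r in rows if now - r["updated_at"] > max_age]


# Hypothetical customer records with a duplicate and an outdated entry.
now = datetime(2024, 12, 24)
customers = [
    {"email": "a@example.com", "updated_at": datetime(2024, 12, 20)},
    {"email": "a@example.com", "updated_at": datetime(2024, 6, 1)},
    {"email": "b@example.com", "updated_at": datetime(2023, 1, 15)},
]
clean = deduplicate(customers, key="email")
stale = stale_rows(clean, now, max_age=timedelta(days=365))
print(len(clean), [r["email"] for r in stale])
```

Data quality tools usually run checks like these on a schedule, which is where the reduction in manual work comes from.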

However, we are still a little ways away from more widespread adoption of said tools. According to the State of Data Quality Report, only 14% of surveyed participants had implemented tools to automate data quality management, while 13% had no plans for implementation.

In the coming days, we’ll see more organizations investing in training and culture to promote data literacy across organizations and focus on continuous improvement and process automation platforms to bridge the gap.

Increased adoption of Data Orchestration and Observability

Data orchestration allows companies to streamline their data operations, making it easier for employees to use data. With key data orchestration trends like data democratization, real-time data processing, and low-code data integration on the rise, we’ll witness more widespread adoption. With ELT getting more popular, the focus of data orchestration would shift from data integration to data wrangling while ensuring data usefulness and quality.

Even though the primary objective of data orchestration tools is to reduce complexity, introducing more tools to your data stack will inevitably make it more complex. Navigating this complexity while trying to find a way to integrate with all the data repositories within an organization would allow data orchestration as a technology to expand its reach even further.
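
At its core, an orchestrator resolves task dependencies and runs tasks in a valid order. The sketch below shows that idea with Python's standard `graphlib` module. The task names describe a hypothetical ELT pipeline, and real orchestration tools add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical ELT pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_raw": {"extract_orders", "extract_customers"},
    "transform_orders": {"load_raw"},
    "quality_checks": {"transform_orders"},
    "publish_dashboard": {"quality_checks"},
}


def run_pipeline(dag, run_task):
    """Execute tasks in dependency order and return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
order = run_pipeline(pipeline, executed.append)
print(order)
```

Note that the load step runs before the transform step here, matching the ELT pattern where transformation happens after the data lands.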

As cloud-native environments become more mainstream, pinpointing the root cause of system failures has become more difficult. Reducing incidents of downtime is absolutely crucial at this point when every hour of downtime could cost organizations north of $150,000. Enter data observability.

Thanks to observability trends like observability pipelines, data-driven FinOps, and platform engineering, data observability is here to stay. 

Data orchestration doesn’t exist in a vacuum. Investing in data observability will allow organizations investing in data orchestration to improve the success of their data pipelines, which means tools offering end-to-end data observability will be in pretty high demand.
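
One common observability signal is table volume: a sudden drop in row count often points to a broken upstream job. The sketch below flags such a drop with a simple z-score test. The row counts and the threshold of 3 standard deviations are assumptions, and production tools generally use more robust models.

```python
from statistics import mean, stdev


def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it is more than `threshold` standard
    deviations away from the historical mean (a simple z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # With no historical variation, any change counts as an anomaly.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for one table.
daily_rows = [10_200, 9_950, 10_100, 10_050, 9_900, 10_150, 10_000]
print(volume_anomaly(daily_rows, today=10_080))
print(volume_anomaly(daily_rows, today=4_300))
```

Wiring a check like this into the orchestrator lets a pipeline stop before bad data reaches downstream consumers.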

Emergence of Data Vaults

Data lakes have changed the data engineering landscape, but managing the vast amount of structured data can be a hassle. This is where data vaults can come in handy. Since data vaults are based on agile methodologies and techniques, they can quickly adapt to changing business conditions. A key advantage of data vaults for data engineers is that ETL jobs need less refactoring during model changes.

According to a survey by BARC in 2023, the primary technical reasons businesses cited for adopting a data vault are as follows:

  • Extensibility
  • Scalability
  • Flexible Architecture
  • Simpler Data Management
  • Unified Data Model
  • Superior Data Quality

The data vault modeling technique is gaining increased acceptance from the community due to its prioritization of data integrity. Its auditability features monitor every data entry modification, ensuring transparency and trust in the data ecosystem.
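
The auditability described above comes from keeping history append-only. The sketch below shows a simplified hub and satellite, two of the building blocks of data vault modeling: the hub holds the business key, and the satellite records every attribute change as a new row. The table names, the attribute layout, and the choice of SHA-256 for hash keys are assumptions for this example.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*parts):
    """Deterministic hash of business key parts (SHA-256 is an assumed choice)."""
    joined = "||".join(str(p).strip().upper() for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


hub_customer = {}   # hash key -> business key (one row per customer)
sat_customer = []   # append-only history of descriptive attributes


def load_customer(customer_id, attributes, source, load_ts=None):
    """Insert into the hub once, and append a satellite row only on change."""
    hk = hash_key(customer_id)
    hub_customer.setdefault(hk, {"customer_id": customer_id, "source": source})
    hash_diff = hash_key(*[attributes[k] for k in sorted(attributes)])
    history = [row for row in sat_customer if row["hub_key"] == hk]
    if history and history[-1]["hash_diff"] == hash_diff:
        return False  # nothing changed, so no new row
    sat_customer.append({
        "hub_key": hk,
        "hash_diff": hash_diff,
        "load_ts": load_ts or datetime.now(timezone.utc),
        "source": source,
        **attributes,
    })
    return True


load_customer("C-1", {"tier": "free"}, source="crm")
load_customer("C-1", {"tier": "free"}, source="crm")   # unchanged, skipped
load_customer("C-1", {"tier": "pro"}, source="crm")    # change, new row appended
print(len(hub_customer), [row["tier"] for row in sat_customer])
```

Because existing rows are never updated, adding a new attribute or source means adding a new satellite instead of reworking existing loads.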

LLM copilots

AI is all the rage now. Every company wants to integrate AI into their workflow in some way or another.

LLMs will start acting as co-pilots for existing data engineers, analysts, and scientists, improving productivity. GitHub Copilot will be the ‘stepping stone’ for companies trying to get more value out of AI.

Beyond the productivity gains, LLMs are also opening the gate for new responsibilities. As more organizations try to do more with less, the demand for generalists proficient in AI, platform engineering, and data has gone up. 

LLMs are also creating new opportunities for data teams to shift from being ‘cost centers’ to becoming ‘profit centers’ for organizations. It’s not enough for data pipelines to simply move data around today. With LLMs, data teams can focus on building pipelines that people use to drive business value.

GenAI will likely move data engineers in two key directions:

  • GenAI and data mesh architecture will allow online system owners to also own the system data. The ability to own the system “end-to-end,” from online systems to logging to pipelines to datasets to metrics, will be huge in the coming days.
  • A subset of data engineers will start picking up more product manager-like behaviors. Pipeline work won’t consume as much of a data engineer’s time in the future, so they’ll be able to pick up work in visualization and predictive modeling.

RAG to the Rescue

For generative AI to succeed at an enterprise level, it needs to be:

  • Scalable
  • Private and secure
  • Trusted

A common drawback users saw in GPT-4 was its tendency to hallucinate responses. To make responses useful for business purposes, organizations have to augment LLMs with their own proprietary data, which carries the necessary business context.

Retrieval-augmented generation (RAG) is a technique that can improve the accuracy and reliability of genAI models by incorporating proprietary data.

Since RAG applications are designed to pull information from sources before providing an output, they are well suited for querying data from structured and unstructured data sources like feature stores and vector databases. Also, given that these applications can be implemented with minimal code, they are more cost-effective and faster than retraining models.
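
The retrieve-then-generate flow can be sketched without any external service. In the example below, a toy bag-of-words function stands in for a real embedding model, and the final LLM call is left out. The document snippets and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question, documents, k=1):
    """Augment the question with retrieved context before it is sent to an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical internal documentation snippets.
docs = [
    "The orders table is refreshed every hour from the billing system.",
    "Customer churn is defined as no purchase in the last 90 days.",
    "The marketing dashboard is owned by the growth team.",
]
prompt = build_prompt("How is customer churn defined?", docs)
print(prompt)
```

The model itself is never retrained here. Only the prompt changes, which is the reason RAG tends to be cheaper and faster than retraining.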

An example of a handy tool in this space is Secoda AI. Secoda AI is a secure AI-powered assistant that can not only answer questions about your data but also write and execute complex queries without the need for a semantic layer. By understanding the context of your data, it provides instant responses to complex questions, allowing you to chat with Secoda AI as if you were talking to a data analyst who knows the ins and outs of your data.

Vector databases

Vector databases store and manage vector embeddings, which makes them well suited to unstructured data like videos, images, and natural language.

Vector databases have primarily gained popularity as the go-to solution for building RAG systems.

Another reason for choosing vector databases is the inevitable increase in data volume over time. As the data keeps expanding, cost optimization becomes a major concern. These databases deliver superior performance while needing fewer resources, making them highly cost-effective and a prime choice at a time when data teams are looking for ways to do more with less.

In the future, we’ll see vector database capabilities expanding to include exact matching or search, according to Charles Xie. By combining similarity-based searching and exact matching, vector database users will be able to fine-tune the balance between obtaining a high-level overview and extracting specific details. 
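
Combining similarity search with exact matching can look roughly like the sketch below: candidates are first filtered by exact metadata values and then ranked by cosine similarity. The 3-dimensional vectors, the metadata fields, and the `hybrid_search` helper are assumptions for illustration and do not reflect any particular vector database API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, items, must_match=None, k=2):
    """Rank items by cosine similarity, optionally keeping only exact metadata matches."""
    must_match = must_match or {}
    candidates = [
        item for item in items
        if all(item["meta"].get(key) == value for key, value in must_match.items())
    ]
    ranked = sorted(candidates, key=lambda i: cosine(query_vec, i["vector"]), reverse=True)
    return [item["id"] for item in ranked[:k]]


# Hypothetical 3-dimensional embeddings with metadata.
items = [
    {"id": "doc-a", "vector": [0.9, 0.1, 0.0], "meta": {"team": "finance"}},
    {"id": "doc-b", "vector": [0.8, 0.2, 0.1], "meta": {"team": "growth"}},
    {"id": "doc-c", "vector": [0.0, 0.1, 0.9], "meta": {"team": "finance"}},
]
query = [1.0, 0.0, 0.0]
print(hybrid_search(query, items))
print(hybrid_search(query, items, must_match={"team": "finance"}))
```

The first call returns the two nearest items overall, while the second restricts results to the finance team and so trades some similarity for specificity.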

According to the Vector Databases Landscape Report by Forrester, you’ll be witnessing market bleed-over in the future. You’ll see more cloud data platforms, including lakehouses, add vector capabilities to their offering.  

Key challenges that stand in the way of more widespread adoption of vector databases are as follows:

  • Massive Data Scale Issues: Clustering nearest vectors becomes more complex in high-dimensional vector spaces, making it hard to extract the most relevant results at scale. 
  • Computational Costs: Embedding billions of vectors can be very costly, needing high GPU use at scale. For organizations that would need constant embedding of streaming data, the ongoing cost and time required are considerably high.
  • Maintenance: Ensuring proper maintenance and reliability is necessary to accommodate the increasing data demands. However, indexing, removing, and updating operations in a billion-order vector database is very challenging since they haven’t matured to a point where users can efficiently execute these operations at that scale.

The Rise of GitOps

With the convergence of software, data, and platform engineering, along with the rise of multi-cloud and hybrid environments this year, GitOps has become popular in the data engineering community. It’s an operational framework that manages software deployment and automates infrastructure.

According to a Gartner report, 99% of cloud security failures through 2025 will be the customer’s fault, caused by either inadequate controls or misconfigurations. By adopting GitOps, a company will be able to tackle the following challenges:

  • Failed deployments reliant on disaster recovery: Git allows developers to revert, fork, and roll back, providing a good contingency plan in case something goes wrong in a production environment. Since Git is the single source of truth, recovery time gets reduced from minutes to seconds.
  • No idea where or how an application runs: GitOps provides a consistent model for application and infrastructure changes. Development processes can be reproduced through Git.
  • Missing documentation: If you use Git to manage Kubernetes clusters, you get a complete audit trail of every change made to the cluster, kept outside of Kubernetes.
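
The mechanism behind these benefits is a reconciliation loop: the desired state lives in Git, and an agent keeps the live environment in line with it. The sketch below models that loop with plain dictionaries. The job names, image tags, and the in-memory `history` list standing in for Git commits are all hypothetical.

```python
def diff_state(desired, live):
    """Compare the desired state (from Git) with the live state and list the changes needed."""
    changes = []
    for name, spec in desired.items():
        if name not in live:
            changes.append(("create", name))
        elif live[name] != spec:
            changes.append(("update", name))
    for name in live:
        if name not in desired:
            changes.append(("delete", name))
    return changes


def reconcile(desired, live):
    """Apply the changes so that the live state matches the desired state."""
    changes = diff_state(desired, live)
    live.clear()
    live.update({name: dict(spec) for name, spec in desired.items()})
    return changes


# Hypothetical commit history: each entry is the full desired state at that commit.
history = [
    {"ingest-job": {"image": "ingest:1.0", "replicas": 2}},
    {"ingest-job": {"image": "ingest:1.1", "replicas": 2}, "dbt-runner": {"image": "dbt:1.7"}},
]
live = {}
print(reconcile(history[-1], live))   # deploy the latest commit
print(reconcile(history[0], live))    # roll back by reconciling to the previous commit
```

A rollback is just another reconciliation against an earlier commit, and the list of changes returned at each step doubles as an audit record.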

While GitOps will allow data teams to ship faster with fewer resources more securely in the coming days, its widespread adoption would need to overcome the following challenges:

  • Limited Visibility: GitOps provides visibility to everything within an environment, with all state data stored as plain text. However, this only works for setups with fewer data stores. On the other hand, enterprise environments tend to have various config files and repositories, making it extremely cumbersome to sift through all the files. 
  • Unsuitable for Storing Secrets: A complex enterprise environment needs a solution to store secrets outside the standard CI/CD process. Auditing secrets like passwords and keys is crucial to the operation; therefore, they must be stored in a centralized data store. A Git repository isn’t fit for storing secrets that need to be encrypted and decrypted because Git history will always have a record of those secrets.
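
One common way around the secrets problem is to commit only references and resolve them at deploy time from a central store. The sketch below illustrates the idea. The `${secret:NAME}` placeholder syntax is invented for this example, and the dictionary stands in for whatever secret store an organization actually uses.

```python
import re

PLACEHOLDER = re.compile(r"\$\{secret:([A-Za-z0-9_]+)\}")  # hypothetical syntax


def resolve_secrets(config, secret_store):
    """Replace ${secret:NAME} placeholders with values fetched at deploy time."""
    def lookup(match):
        name = match.group(1)
        if name not in secret_store:
            raise KeyError(f"secret {name!r} is not available")
        return secret_store[name]

    return {
        key: PLACEHOLDER.sub(lookup, value) if isinstance(value, str) else value
        for key, value in config.items()
    }


# What lives in Git: only a reference, never the secret itself.
committed_config = {
    "db_host": "warehouse.internal",
    "db_password": "${secret:DB_PASSWORD}",
    "port": 5432,
}
# Stand-in for a central secret store (an assumption for this example).
secret_store = {"DB_PASSWORD": "s3cr3t"}
runtime_config = resolve_secrets(committed_config, secret_store)
print(runtime_config["db_password"] != committed_config["db_password"])
```

With this split, Git history records which secret a service needs without ever containing its value.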

A common thread in all these trends is the focus on doing more with less and making sure that the data getting used is secure and private — more so with the AI train picking up steam.

Data observability will take center stage to make sure AI is fit for enterprise needs. 

With time, we’ll likely see more consolidated tools use more data and AI to create more impact while keeping data quality at the center and lowering cloud costs. 

Take Secoda, for instance. Secoda consolidates your data monitoring and observability, lineage, catalog, and governance in one central platform so you can reduce complexity and save budget. You also get visibility into the health of your entire stack and prevent data asset sprawl. Add to that the capabilities of Secoda AI to profile data, tag PII, analyze trends, generate documentation, and more, and you have a strong contender for disruption in this space.

If you want to implement a comprehensive data observability framework, book a demo with Secoda today and start extracting the full value of your data.
