A data pipeline consists of a series of processes that move data from one system to another while transforming and organizing it for different uses. A proper data pipeline fully automates the flow of data from a specific source (or sources) to a destination (like a data warehouse or SIEM) while ensuring that the data is cleaned, formatted, and enriched correctly.
With the world's reliance on data and the need to make data-driven decisions, data pipelines have become a crucial part of the decision-making process. However, as pipelines handle increasingly large data sets, their complexity grows, leading to errors, data quality issues, and bottlenecks. This makes maintaining and debugging pipelines increasingly challenging. Establishing best practices for designing, building, and maintaining data pipelines is essential for managing complexity. It also helps ensure error handling, data validation, and monitoring are in place, ultimately delivering timely, accurate results.
This article explores the fundamental parts of a data pipeline as well as different data pipeline architectures and their use cases.
Understanding data pipelines
A data pipeline can consist of many different components, but generally speaking, the components can be sorted into one of five categories.
Data sources
As you build a pipeline to answer a specific business query, you have to start at the data source. In other words, which external API or internal database will contain the data that you need to transform or process?
Examples of common data sources include:
- Your company's customer relationship management (CRM) database, which contains customer information
- Web server access logs that help identify issues with your web server performance
- External financial data APIs that provide stock market prices
Data sources are the first components you need to identify when designing or building your data pipelines. But identifying a data source is only the first step. The attributes of the data source (such as whether it’s a push or pull source or whether it’s streaming or batch) can heavily influence the design of your data pipeline. For example, streaming sources often require real-time processing, which demands different tools and architectures compared to batch sources, which can be processed in intervals. These distinctions impact everything from performance optimization to resource allocation.
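One lightweight way to keep these attributes front and center during design is to record them alongside each source. The following is a minimal Python sketch; the field names and example sources are illustrative assumptions rather than part of any particular framework:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """Describes the attributes of a source that shape the pipeline design."""
    name: str
    delivery: str   # "push" (source sends data) or "pull" (pipeline fetches it)
    cadence: str    # "streaming" (continuous) or "batch" (fixed intervals)

# Illustrative examples of the sources mentioned above
sources = [
    DataSource(name="crm_customers", delivery="pull", cadence="batch"),
    DataSource(name="web_access_logs", delivery="push", cadence="streaming"),
    DataSource(name="stock_price_api", delivery="pull", cadence="streaming"),
]

# Streaming sources typically need an always-on consumer; batch sources can be scheduled.
for source in sources:
    mode = "real-time processing" if source.cadence == "streaming" else "scheduled batch job"
    print(f"{source.name}: {source.delivery} source -> {mode}")
```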
Data ingestion
Data ingestion is the process of collecting data from different sources. Some data sources only supply a single method for importing data from the source, while others might make multiple methods available. You need to evaluate which methods make the most sense for your data pipelines in terms of efficiency and urgency requirements.
Data sources generally fall into one of two categories: push sources and pull sources.
Push sources
Push sources actively send data to a pipeline as the data becomes available. This can occur in real time or in batches at certain intervals, depending on the data source's configuration and/or default settings. Push sources include, but aren't limited to:
- CRM systems. These can generate an end-of-day report that contains information on new customers and send it as a batched report.
Pull sources
Pull sources require the data pipeline to actively query or retrieve data from the data source. This method can also make use of real-time streaming methods or periodic queries for batch processing. Pull sources include:
- Data pulled from a social media platform's API that can be used to analyze trending topics in real time.
- A threat intelligence feed that is periodically queried to enrich your security logs with up-to-date threat information.
Your data pipeline needs to be flexible enough to incorporate both push and pull sources or even combine multiple data sources to facilitate answering specific business questions. Achieving this flexibility requires a modular and scalable pipeline design. For example, you could use tools like Apache Kafka to handle streaming push data, while Apache Airflow or Apache NiFi can schedule and manage data pulls from APIs or databases. By combining these approaches, your pipeline can seamlessly integrate multiple sources while maintaining performance and efficiency.
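As a rough illustration of how push and pull ingestion can sit side by side, the sketch below consumes a push-based Kafka topic and periodically polls an HTTP API. The topic name, broker address, and API endpoint are hypothetical, and the kafka-python and requests libraries are assumed dependencies; in practice, an orchestrator like Airflow would typically own the scheduling of the pull side.

```python
import time

import requests  # assumed dependency for the pull source
from kafka import KafkaConsumer  # assumed dependency: kafka-python


def handle_event(payload):
    """Placeholder for the downstream transformation and loading steps."""
    print(payload)


def consume_push_source():
    """Continuously read events that the source pushes onto a Kafka topic."""
    consumer = KafkaConsumer("crm-events", bootstrap_servers="localhost:9092")  # hypothetical topic/broker
    for message in consumer:
        handle_event(message.value)


def poll_pull_source(interval_seconds=300):
    """Periodically pull data from an external API (hypothetical endpoint)."""
    while True:
        response = requests.get("https://api.example.com/threat-intel")
        response.raise_for_status()
        handle_event(response.json())
        time.sleep(interval_seconds)
```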
Data transformation
Data transformation is where most of the processing in your data pipeline occurs. It's a broad category of different types of processing that you apply to the source data and can involve many different processes, either individually applied or as a sequence of steps, to achieve the result you want.
Generally speaking, data transformation processes fall into one of three categories: data cleansing, data enrichment, or data conversion.
Data cleansing
Data cleansing is the process of identifying and correcting errors or inconsistencies in the source data. By eliminating these errors, you improve your data quality and end up with more accurate results.
Examples of data cleansing include the following (see the sketch after this list):
- Removing duplicate entries from a customer database
- Standardizing date formats from different sources to ensure consistency across systems
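A minimal pandas sketch of both steps might look like the following; the column names and mixed date formats are made up for illustration, and pandas 2.x is assumed for the `format="mixed"` option:

```python
import pandas as pd

# Hypothetical customer records collected from two different systems
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "signup_date": ["2024-01-15", "Jan 15, 2024", "Jan 15, 2024", "2024-02-01"],
})

# Remove duplicate entries for the same customer
customers = customers.drop_duplicates(subset="customer_id")

# Standardize the mixed date formats to a single ISO representation
customers["signup_date"] = pd.to_datetime(
    customers["signup_date"], format="mixed"
).dt.strftime("%Y-%m-%d")

print(customers)
```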
Data enrichment
Data enrichment involves adding information to a data record or event to provide more context. This step usually involves merging data from different sources to create a more complete data set. Data enrichment can include the following (see the sketch after this list):
- Adding geographic information to IP addresses in security logs to facilitate analysis of attack patterns coming from specific countries
- Combining new customer sign-ups with marketing campaign data to measure the effectiveness of the marketing campaign
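The sketch below illustrates the second example with a simple pandas join; the column names and campaign codes are assumptions made for the example:

```python
import pandas as pd

# Hypothetical new sign-ups, each tagged with the campaign code that referred them
signups = pd.DataFrame({
    "customer_id": [201, 202, 203],
    "campaign_code": ["SPRING24", "SPRING24", "EMAIL01"],
})

# Hypothetical marketing campaign reference data from a separate source
campaigns = pd.DataFrame({
    "campaign_code": ["SPRING24", "EMAIL01"],
    "channel": ["social", "email"],
    "budget_usd": [5000, 1200],
})

# Enrich each sign-up with campaign context by joining the two data sets
enriched = signups.merge(campaigns, on="campaign_code", how="left")
print(enriched)
```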
Data conversion
Data conversion is a broad category of processes that could be used to facilitate any of the following examples:
- Aggregating individual events into a summarized format, like total sales per day per product category
- Converting data from one format to another, especially if the data destination requires the data to be in a specific format, like JSON or Parquet
Generally speaking, data conversion prepares data for storage and consumption in a format that is either required by the storage layer or convenient for a user to consume.
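Both examples can be sketched in a few lines of pandas; the column names are illustrative, and writing Parquet assumes the pyarrow or fastparquet package is installed:

```python
import pandas as pd

# Hypothetical raw sales events
sales = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01", "2024-03-02"],
    "category": ["books", "books", "toys"],
    "amount": [12.50, 8.00, 30.00],
})

# Aggregate individual events into total sales per day per product category
daily_totals = sales.groupby(["date", "category"], as_index=False)["amount"].sum()

# Convert to the format the storage layer expects
daily_totals.to_parquet("daily_sales.parquet", index=False)   # columnar format
daily_totals.to_json("daily_sales.json", orient="records")    # JSON, if required
```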
Data storage
Data storage refers to the storage medium where your transformed data will live, ready for any combination of retrieval, querying, and/or analysis. Some examples of common data storage destinations might include, but are in no way limited to:
- A cloud-based data warehouse, like Amazon Redshift
- An on-premises data lake, like Delta Lake
- A relational database from which users and other systems can use standard SQL to query the data
When choosing a storage medium, several factors should be considered to ensure optimal performance and cost-efficiency. Key considerations include access latency, which affects how quickly data can be retrieved, and read/write speeds, which determine how efficiently data can be processed and stored. The scale of your data can also determine the cost of your data storage, as some storage mechanisms can be considerably more expensive per GB than others. All of these considerations need to be balanced to help you design a cost-effective and responsive data pipeline.
Data consumption
While data consumption might seem slightly disconnected from your data pipeline as a whole, your data pipeline enables the consumption of the data in the first place. Data consumption can also occur directly from the data pipeline, depending on your use case and the storage requirements of your data. The following are some examples of data consumption:
- Business intelligence dashboards that show sales trends using data stored in a transactional database
- A machine learning model that predicts customer movement using historical data stored in the data warehouse
In some use cases, like fraud detection, data can be analyzed as it flows through the pipeline, even before storing it.
Unless you're only storing data to meet a regulatory requirement, consumption should be the ultimate goal of any data pipeline: data that is consumed is data that is actually informing decisions.
ETL vs. ELT
Extract, transform, load (ETL) and extract, load, transform (ELT) are two different data processing approaches for analyzing your data. While these approaches are similar in that they share the same components, the main difference lies in where data transformation takes place.
ETL
In a typical ETL approach, the data is:
1. Extracted from the source
2. Transformed into a desired format or structure
3. Loaded into a destination system, like a data warehouse
Typically, processing occurs on a single, powerful system or some sort of distributed processing system like Apache Hadoop. Once the data has been processed, you send it off to the appropriate destination.
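A minimal, end-to-end ETL sketch in Python might look like the following. The CSV source, its `email` and `country` columns, and the SQLite file standing in for a data warehouse are all assumptions for illustration:

```python
import csv
import sqlite3


def extract(path):
    """Extract raw rows from a CSV source (hypothetical file and columns)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Clean and reshape the data before it reaches the destination."""
    return [
        {"email": row["email"].strip().lower(), "country": row["country"].upper()}
        for row in rows
        if row.get("email")  # drop rows with a missing email
    ]


def load(rows, db_path="warehouse.db"):
    """Load the already-transformed rows into the destination."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS customers (email TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO customers (email, country) VALUES (:email, :country)", rows
    )
    conn.commit()
    conn.close()


# The transformation happens before loading -- the defining trait of ETL
load(transform(extract("customers.csv")))
```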
ETL does have some drawbacks, so depending on your use case, you should consider the following:
- Because data has to be transformed first, you might have to wait for the transformation process to finish before you can use the finished results for decision-making.
- Since processing can involve various data sources and destinations, managing the process can become complex, as the pipeline needs to accommodate each destination's particular requirements.
ETL is more commonly used when data consistency and data quality are prioritized over real-time access to the data.
ELT
ELT relies on the data destination to handle the processing of the data. Data is extracted from the source, loaded into a destination system, and then transformed or processed within that destination.
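The sketch below uses SQLite as a stand-in for the destination system (in practice, a warehouse like Redshift or BigQuery) to show the order of operations: the raw data is landed first, and the transformation is expressed as SQL that runs inside the destination. Table and column names are illustrative:

```python
import sqlite3

# A hypothetical warehouse (SQLite stands in for Redshift, BigQuery, etc.)
conn = sqlite3.connect("warehouse.db")

# Load step: land the raw, untransformed events first
conn.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "us"), (2, 800, "DE"), (3, 3000, "us")],
)

# Transform step: done inside the destination with SQL, only when the answer is needed
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_country AS
    SELECT UPPER(country) AS country,
           SUM(amount_cents) / 100.0 AS total_usd
    FROM raw_orders
    GROUP BY UPPER(country)
""")
conn.commit()
print(conn.execute("SELECT * FROM orders_by_country").fetchall())
conn.close()
```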
ELT is mostly used when you want quicker access to data or when the destination system does a very specific type of transform that you cannot necessarily do with a more generic ETL platform. However, this specificity can also be considered a drawback. In some cases, your destination system might not be able to handle more generic transforms for implementing complex business rules.
In a nutshell, ETL is transformation-heavy before sending the data to a destination. It's often used when data needs to be formatted, cleansed, or normalized before storing it. ELT delays the transformation step and relies on the transformational features of different systems to get the answers to your specific questions.
Different types of data pipeline architectures
When designing data pipelines, it's important to recognize that different types of data pipeline architectures are tailored to meet specific needs and use cases. The different pipelines can be categorized into batch and real-time processing pipelines, each with pros and cons. Understanding the differences between these types is crucial for selecting the right approach to manage and process your data efficiently.
Batch processing
A batch processing pipeline should be used when you have large volumes of data that do not require immediate processing.
One of the key benefits of a batch processing pipeline is that it's scalable. You can simply add more computational power in parallel to handle larger data volumes as your organization grows. Another benefit is that batch processing is reasonably efficient: you can schedule processing to run during off-peak periods, which lightens the load on other resources in your organization, like your network infrastructure. Batch processing can also perform complex calculations (like aggregations and indexing) on large data sets, which isn't always possible with a real-time data stream.
Some common use cases for batch processing pipelines include data warehousing, where specific events are sent to their respective destinations to be cleaned, aggregated, or stored for future analysis. You could also do trend analysis or forecasting by processing large amounts of historical data spanning longer periods of time.
End-of-day sales is a common batch processing use case. If you have multiple regional stores that calculate their end-of-day sales figures, you could upload those figures to the head office for collating, as in the sketch below.
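A minimal pandas version of that collation step might look like this; the file names and columns are assumptions standing in for whatever the regional stores actually upload, and writing Parquet again assumes pyarrow or fastparquet is available:

```python
import pandas as pd

# Hypothetical end-of-day files uploaded by each regional store
regional_files = ["sales_north.csv", "sales_south.csv", "sales_west.csv"]

# Collate all regions into one data set, then summarize per region and product category
all_sales = pd.concat([pd.read_csv(path) for path in regional_files])
daily_summary = all_sales.groupby(["region", "category"], as_index=False)["amount"].sum()

# The summary is what gets loaded into the data warehouse for reporting
daily_summary.to_parquet("daily_sales_summary.parquet", index=False)
```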
After the data is sent to the data warehouse platform, further analysis and reporting can be done, like pivoting to region-based sales over quarters or doing trend analysis for seasonal sales.
Real-time streaming
Real-time processing pipelines should be used in scenarios where low latency and time-sensitive answers are required. With a real-time pipeline, you can analyze the data as it arrives to obtain immediate results. These immediate results can lead to immediate action, which enables time-sensitive use cases like fraud detection.
In general, real-time pipelines are more resource-intensive than batch processing pipelines because they need to apply transformations while receiving a continuous stream of data. Depending on the size of the data stream, this can limit the types of transformations you can apply. Aggregating over a specific time window can also require large amounts of memory or temporary storage to hold intermediate results until the window closes.
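The sketch below shows why that state adds up: a simple tumbling-window aggregator has to keep a running total per open window until the window closes. The window length and event shape are arbitrary choices for the example:

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 60

# Running totals per window start time -- this is the state that consumes memory
# until each window closes and its result can be emitted downstream.
window_totals = defaultdict(float)


def process_event(timestamp: datetime, amount: float):
    """Assign each event to a tumbling one-minute window and update its running total."""
    window_start = int(timestamp.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
    window_totals[window_start] += amount


def emit_closed_windows(now: datetime):
    """Emit and free any window that has fully elapsed."""
    cutoff = int(now.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
    for window_start in sorted(w for w in window_totals if w < cutoff):
        print(f"window {window_start}: total {window_totals.pop(window_start):.2f}")
```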
As mentioned, fraud detection is a common use case enabled by real-time processing. For example, take a credit card transaction being processed in one country and then being processed in another country shortly after. This could indicate a cloned or compromised card. In this scenario, you could take steps like freezing the compromised user's account or blocking transactions from that card until you get more information about the situation.
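A stripped-down version of that rule can be expressed as a stateful check on each incoming transaction. The two-hour window and the in-memory state store are simplifying assumptions; a production system would use a durable store and far richer signals:

```python
from datetime import datetime, timedelta

# Last seen (timestamp, country) per card -- a simplified in-memory state store
last_seen = {}
SUSPICIOUS_WINDOW = timedelta(hours=2)


def check_transaction(card_id: str, country: str, timestamp: datetime) -> bool:
    """Flag a card that is used in two different countries within a short window."""
    suspicious = False
    if card_id in last_seen:
        prev_time, prev_country = last_seen[card_id]
        if country != prev_country and timestamp - prev_time < SUSPICIOUS_WINDOW:
            suspicious = True  # here a real pipeline might freeze the card or raise an alert
    last_seen[card_id] = (timestamp, country)
    return suspicious
```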
Another use case is sensor monitoring. IoT sensors can generate a lot of data, and much of it needs to be acted on in a time-sensitive manner. For example, a company that specializes in cold storage can monitor the temperatures of its storage freezers in real time and predict the failure of a storage unit by measuring the delta between readings over a given time window. This allows the company to move goods to another unit in time and minimize losses of goods that must be kept cold.
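A rough sketch of that delta check follows; the window size and alert threshold are arbitrary values chosen for illustration:

```python
from collections import deque

WINDOW_SIZE = 10          # number of recent readings to keep per freezer
DELTA_THRESHOLD = 2.0     # degrees of drift within the window that triggers an alert

recent_readings = {}      # freezer_id -> deque of recent temperatures


def process_reading(freezer_id: str, temperature: float) -> bool:
    """Track recent temperatures and alert when the drift within the window is too large."""
    window = recent_readings.setdefault(freezer_id, deque(maxlen=WINDOW_SIZE))
    window.append(temperature)
    delta = max(window) - min(window)
    if delta > DELTA_THRESHOLD:
        print(f"ALERT: freezer {freezer_id} drifted {delta:.1f} degrees -- possible failure")
        return True
    return False
```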
Unless you want to do trend analysis on your IoT readings, it's not necessary to store them for long periods, so they don't need to be stored in a data warehouse.
Hybrid pipelines
Hybrid pipelines combine both types of processing (batch and real-time) to meet diverse data needs. This type of pipeline becomes more common as an organization grows and different kinds of answers are required from its data sets. For instance, real-time streaming tools like Apache Kafka or AWS Kinesis are ideal for processing low-latency data, while batch processing frameworks like Apache Spark can handle large volumes of historical data for end-of-day analysis. As a company matures, there will be a good mix of pipelines that use real-time processing for low-latency answers and end-of-day batch processing for historical data.
The fraud detection example mentioned previously is a good use case for implementing a hybrid pipeline. You need to act immediately if you pick up a fraudulent transaction, but you also need to report on these incidents and keep trending data to see whether fraud incidents are increasing or decreasing.
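As a rough sketch of how one detection can feed both paths, the snippet below raises an immediate alert on the streaming side and also persists the incident (here into SQLite, standing in for a warehouse) so that batch jobs can report on trends later:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for the historical store that batch reporting jobs would read from
conn = sqlite3.connect("fraud_history.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS fraud_events (card_id TEXT, country TEXT, detected_at TEXT)"
)


def handle_suspicious_transaction(card_id: str, country: str):
    # Real-time path: act immediately (hypothetical alerting hook)
    print(f"ALERT: freezing card {card_id} after a transaction in {country}")

    # Batch path: persist the incident so end-of-day or quarterly jobs can analyze trends
    conn.execute(
        "INSERT INTO fraud_events VALUES (?, ?, ?)",
        (card_id, country, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```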
This hybrid approach makes a lot of sense when you have low latency requirements and need to report on historical events.
Mastering data pipelines and driving results
Building effective data pipelines is an essential skill in today's data-driven world. In this article, you learned about the different components of a data pipeline as well as different processing approaches, like ETL and ELT. You also learned the difference between batch and real-time processing and took a look at different use cases for different types of data pipelines.
Companies like Secoda can help you search, monitor, and manage all of your organization's data in one place. Book a demo and see why top data teams choose Secoda to get more done.