Data Engineering Lifecycle
Understanding the main stages in the data engineering lifecycle and how they interact with each other
I recently started reading the “Fundamentals of Data Engineering” book and decided to share my learnings as I work through it. In this article, we will discuss the main stages in the Data Engineering lifecycle as described in the book and some key concepts related to each of them. Before diving into that, let’s first look at what Data Engineering actually is.
Simply put, the goal of Data Engineering is to take in raw data and convert it into information that can be used by downstream use cases in a consistent manner. To achieve this goal, we need to develop, implement, and maintain a wide range of systems and processes, which is what Data Engineering is all about.
Data Engineering Lifecycle
The data engineering lifecycle consists of the following:
Generation
Storage
Ingestion
Transformation
Serving
There are other crucial elements of the lifecycle, which the book calls undercurrents, but we will discuss them in a later article. Let’s jump into the individual stages:
Generation
This part of the lifecycle refers to the origins of the data, called source systems. These could be transactional databases, IoT devices, message queues, etc. The main things a data engineer needs to know about these systems include a high-level understanding of how they work, how they generate data, and the frequency, velocity, and variety of that data.
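To make this concrete, here is a minimal sketch of a source system: a hypothetical IoT sensor emitting readings at a fixed frequency. The field names and the one-second interval are illustrative assumptions, not something prescribed by the book:

```python
import json
import random
import time
from datetime import datetime, timezone

def generate_sensor_event(sensor_id: str) -> dict:
    """Produce one event the way a hypothetical IoT source system might."""
    return {
        "sensor_id": sensor_id,  # illustrative field names
        "temperature_c": round(random.uniform(18.0, 28.0), 2),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Emit a few events at a fixed interval to mimic a streaming source.
    for _ in range(3):
        print(json.dumps(generate_sensor_event("sensor-42")))
        time.sleep(1)  # assumed one-second emission frequency
```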
One of the most crucial and challenging aspects of the data generated by these systems is the schema, which defines the hierarchical organization of the data. A source can be schema-less or fixed-schema. Schema-less means there is no fixed schema that the data is fed into; instead, data is written with varying schemas to flexible storage systems such as message queues, blobs, or document databases. Fixed-schema means the database enforces a schema that all writes must conform to.
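Here is a small sketch of the contrast, with SQLite standing in for a fixed-schema database and free-form JSON documents standing in for a schema-less store; the table definition and document fields are made up for the example:

```python
import json
import sqlite3

# Schema-less: each document can carry a different shape; the store
# (e.g., a document database or blob storage) does not enforce one.
doc_a = {"user_id": 1, "name": "Ada"}
doc_b = {"user_id": 2, "name": "Grace", "tags": ["admin"]}  # extra field is fine
print(json.dumps(doc_a), json.dumps(doc_b))

# Fixed-schema: the database enforces the shape; writes must conform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER NOT NULL, name TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "Ada"))
# A write with an extra column is rejected by the schema:
try:
    conn.execute("INSERT INTO users VALUES (?, ?, ?)", (2, "Grace", "admin"))
except sqlite3.Error as exc:
    print("rejected by fixed schema:", exc)
```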
Storage
Next comes storage, which is one of the most crucial and complicated stages. Choosing a storage solution is complicated by a number of factors, such as the variety of storage solutions available on the cloud, the interaction of the storage system with other stages in the data lifecycle, and the ability of the various storage systems to handle complex queries.
Some key considerations here are the storage solution’s compatibility with the source systems, especially in terms of write and read speeds, whether the storage system can scale, and its ability to meet service-level agreements (SLAs). Other considerations include schema-less vs fixed-schema storage, how data quality is tracked, how the system handles data compliance needs, and so on.
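As a toy illustration of one of these considerations, write and read speeds, the sketch below times bulk writes and a read against a candidate store. SQLite is just a stand-in for whatever system is being evaluated, and the row count is arbitrary:

```python
import sqlite3
import time

# Toy benchmark: time bulk writes and a full read against a candidate store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"payload-{i}") for i in range(100_000)]  # arbitrary size

start = time.perf_counter()
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()
write_s = time.perf_counter() - start

start = time.perf_counter()
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
read_s = time.perf_counter() - start

print(f"wrote {len(rows)} rows in {write_s:.3f}s, read {count} in {read_s:.3f}s")
```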
Ingestion
The next stage involves ingesting data from the source systems into the storage solutions. This stage can become a bottleneck in the data engineering lifecycle: any sudden stoppage in ingestion can halt the data flow or deliver insufficient data, causing a ripple effect across all the stages in the lifecycle.
Key considerations here include whether the data is being ingested reliably and whether it is readily available after being ingested. Others concern the data itself, i.e., the frequency, volume, and format of the data, or its usability, which refers to the question of whether the data can be used directly for downstream tasks.
One important point of discussion here is batch vs streaming ingestion. Data is almost always generated in a streaming fashion at the source, but it can be ingested in batches, a convenient way of processing the data in large chunks, or via streaming ingestion, meaning the data is ingested in a continuous, real-time fashion. Key considerations here include whether the downstream storage systems can handle data in real time, millisecond real-time ingestion vs micro-batch ingestion, the benefits and reliability of streaming ingestion, cost analysis, etc.
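Below is a minimal sketch contrasting the two modes, with an in-memory queue standing in for a real message broker; the batch size and event counts are arbitrary assumptions:

```python
import queue

# An in-memory queue stands in for a real message broker.
def fill(q: queue.Queue) -> None:
    for i in range(10):
        q.put({"event_id": i})

def ingest_batch(q: queue.Queue, batch_size: int = 4) -> None:
    """Batch ingestion: drain the source in fixed-size chunks."""
    batch = []
    while not q.empty():
        batch.append(q.get())
        if len(batch) == batch_size or q.empty():
            print("loading batch:", [e["event_id"] for e in batch])
            batch = []

def ingest_streaming(q: queue.Queue) -> None:
    """Streaming ingestion: hand each event downstream as it arrives."""
    while not q.empty():
        print("loading event:", q.get()["event_id"])

source = queue.Queue()
fill(source)
ingest_batch(source)      # chunks of 4, 4, 2
fill(source)
ingest_streaming(source)  # one event at a time
```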
Transformation
In this stage, data is changed from its raw, original form into something useful for the relevant downstream use cases. This includes changing the data into the format required for generating reports, creating dashboards, or training ML models. The process can include mapping fields to their correct data types, standardizing formats, handling missing values, feature engineering for ML, etc.
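As a small illustration, the sketch below applies the kinds of transformations just mentioned, type casting, date-format standardization, and a simple default for missing values, to a couple of made-up raw records; the field names, formats, and the zero default are assumptions for the example:

```python
from datetime import datetime

# Raw records as they might arrive from a source system (illustrative).
raw = [
    {"order_id": "17", "amount": "19.99", "date": "03/01/2024"},
    {"order_id": "18", "amount": None, "date": "2024-03-02"},
]

def transform(record: dict) -> dict:
    """Cast fields to correct types, standardize formats, handle missing values."""
    # A zero default for missing amounts; real handling depends on business rules.
    amount = float(record["amount"]) if record["amount"] is not None else 0.0
    # Standardize two possible source date formats to ISO 8601;
    # fall back to the raw value if no known format matches.
    date = record["date"]
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            date = datetime.strptime(record["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    return {"order_id": int(record["order_id"]), "amount": amount, "date": date}

print([transform(r) for r in raw])
```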
Some things to consider here are the cost of the transformations (both computational and time-wise) compared to the business value they generate, how the transformations support the underlying business rules, and how simple and isolated they are.
Serving Data
Now comes the final stage of the data engineering lifecycle, which is basically how we “get value” from the data that has now been transformed into useful structures. This is where everything we have done before translates into practical value. If the data just sits there, all the above work will have been in vain. Data only has value when it is used for practical, downstream purposes, although this value can differ from user to user.
Some of the main use cases of data are:
Analytics: This is basically using the data to generate reports, visualize it in dashboards, and perform ad hoc analysis (a tiny example follows this list).
Machine Learning: This is where the data scientists or ML engineers can apply sophisticated feature engineering (which can then be automated by Data Engineers in the transformation stage) and ML algorithms to train models that provide value to the customers and the business.
Reverse ETL: This basically refers to taking processed data and feeding it back into the source systems. This could be in the form of feeding back scored outputs from ML models, or pushing metrics to a customer data platform.
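As a tiny illustration of the analytics use case, the sketch below turns transformed records into a simple ad hoc aggregate of the kind that would back a report or a dashboard tile; the records and the grouping key are invented for the example:

```python
from collections import defaultdict

# Transformed records ready for serving; values are illustrative.
orders = [
    {"region": "EU", "amount": 19.99},
    {"region": "US", "amount": 5.00},
    {"region": "EU", "amount": 7.50},
]

# Ad hoc analysis: total revenue per region.
revenue = defaultdict(float)
for order in orders:
    revenue[order["region"]] += order["amount"]

for region, total in sorted(revenue.items()):
    print(f"{region}: {total:.2f}")
```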
Conclusion
In this article, we looked at the major stages of the data engineering lifecycle, from data generation in source systems, through ingestion into storage solutions, to finally serving the data to consumers after performing any required transformations. I hope this gives you an overview of the lifecycle, and I would highly recommend reading the book for in-depth explanations. In the next article, we will see how the undercurrents such as security, data management, and orchestration play a role in the data engineering lifecycle. Feel free to ask any questions or share any insights, and follow me on LinkedIn for more.