Data Lineage

Improve your Data Lineage to Unlock the Possibilities Hidden within your Data

As data grows in size and complexity, so do data teams. New hires, employee turnover and new tools and processes (not to mention new data sources) mean it's incredibly difficult for data teams to have clarity and visibility into all the transformations and journeys of all their data sources.

As a result, data lineage is an increasingly important concern for any data leader. Irina Steenbek from DMU highlights a growing interest in data lineage across all areas of the enterprise data management community, especially as business metadata becomes more necessary to non-IT professionals.

Without visibility into the life cycle or journey of your data, it is virtually impossible to guarantee your insights will be accurate. Beyond eroding trust in your data, this lack of visibility can escalate into compliance issues that severely damage a business's reputation.

This is before even mentioning the damage that an opaque, disjointed data operation can do, wasting millions in inefficient processes that are often handled in silos. What is more, when an issue arises, reverse engineering data processes to get to the source of the problem can become a headache of monumental size and cost.

A Plethora Of Data Visualization Tools For Everyone

Companies like Tableau, Qlik and Looker revolutionized the way data is visualized, in a market that is still growing at a tremendous pace. In fact, the global data visualization market is projected to more than double in size, from $8.8 billion in 2019 to $19.20 billion by 2027.

An explosion in data visualization tools means real-time or near real-time business dashboards are easy to implement and can be accessed and explored by any business user across an organization. These tools have played a huge role in making organizations more data literate and have helped democratize data management far beyond data teams.

But what happens with the data behind the dashboards? How can data teams visualize hundreds or even thousands of data pipelines, processes and models? While visualization tools have simplified everything on the surface, there is often an increasingly complex and sophisticated data ecosystem beneath. This is where data lineage comes into play.

Data Observability, Traceability And Compliance

Beyond visibility and transparency, data lineage is essential for traceability and compliance. Mapping the journey of your data, including its origin, each stop, and the modifications it undergoes along the way, is the essence of data lineage. This traceability is essential to run a well-oiled data operation and to spotlight any issues or gaps in your current DataOps ecosystem.

In addition, with compliance and data governance taking center stage, data lineage will soon be seen not as a luxury but as a requirement for data teams. With businesses embracing the cloud to store and manage all their information, data lineage provides a clear window into the past, present and future of an organization's data.

While it may sound like an easy fix, achieving flawless data lineage also comes with its challenges, especially for enterprise businesses. In a recent CDOTrends virtual roundtable, participants highlighted that although data lineage is becoming a critical factor, it can be a difficult undertaking, particularly in large organizations. An extraordinary amount of data enters businesses each day, and this data is stored, sliced and diced in different ways by different teams. As a result, data teams are challenged to provide access to data tools and systems while remaining in control of an organization's data processes, which must also adhere to standards of quality, security and compliance.

Essential Questions To Ask The Business Before Getting Started

There are two things I recommend any data team do before creating or improving their data lineage systems. Firstly, make a clear distinction between what data lineage entails for technical versus business teams. Developers and data engineers will need technically oriented data lineage, as they will want visibility into the inner workings of their code and how it influences data flows. Business-oriented data lineage is more about visibility that creates transparency and trust in data processes; business users will often be more concerned with data governance and business compliance than with visibility into code or transformation syntax.

Secondly, I recommend teams set clear goals and expectations as to what data lineage will answer for their business. What data is being stored? How is this data being stored and used? Is there any crucial business data we're ignoring? Are there silos across the business causing unnecessary duplication or expenses? It is key to understand which business questions data lineage will answer and the ROI those answers are expected to deliver.

Agility And Collaboration For Modern Data Teams

Data lineage is what gives a team the ability to access data and leverage it to its full potential with the freedom to explore. Since data flows, changes and processes are carefully mapped out and easy to trace, fixing any issues or reversing after a wrong turn is a lot more straightforward.

As your data travels through different pipelines (e.g., ETL, databases, files), it interacts with other data and is transformed along the way. Getting the full story of all your data sources' journeys through data lineage will soon be the priority of teams, boards and data leaders that are under pressure to orchestrate data transparently and efficiently.

The Basics of Data Lineage

Data Lineage is an essential component of business metadata management. Although it is often overlooked, its value can be seen in many areas.

There is a growing interest in data lineage for many reasons, across all areas of the enterprise data management community, especially as business metadata becomes more necessary to non-IT professionals.

There are several groups of stakeholders within any company that might be interested in data lineage. Formerly, only the Information Technology (IT) department understood the concept of data lineage and its value. As the explosion of data has affected every business area, business stakeholders have embraced the need for improved metadata management and a deeper approach to data lineage. Stakeholders in finance and risk have become the biggest data lineage enthusiasts.

This interest in data lineage has several drivers:

  • appearance of new legislation requirements
  • business changes
  • an increase in data quality initiatives
  • supervisor and audit requirements.

Defining “Data Lineage”

Industry reference guides provide some definitions of data lineage.

However, the definition of data lineage is ambiguous and overlaps with other data-related terms, such as “data flow”, “integration architecture”, and “data and information (value) chain.”

These definitions can serve as a basis for understanding:

Data flow is “the transfer of data between systems, applications, or data sets. Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.”

“[…] data […] has lineage (i.e., a pathway along which it moves from its point of origin to its point of usage, sometimes called the data chain).”

“Data flows are a type of data lineage / metadata documentation that depicts how data moves through business processes and systems. End-to-end data flows illustrate where the data originated, where it is stored and used, and how it is transformed as it moves inside and between diverse processes and systems.”

None of these definitions is clear, and all intersect each other to some extent.

“Data flow”, “data lineage” and “data chain” are terms that describe similar concepts of data movement and transformation. Therefore, these terms are often used interchangeably. Data lineage is a description of the path along which data flows from the point of its origin to the point of its use.

Still, the definitions say nothing about documenting data lineage. To understand the way to document this movement, it is important to know the components that constitute data lineage.

Data Lineage components

The same guides give clarification on data lineage components.

“Data flows map and document relationships between data and:

  • Application within a business process
  • Data stores or databases in an environment
  • Network segments (useful for security mapping)
  • Business roles, depicting which roles have responsibility for creating, updating, using and deleting data
  • Location where local differences occur.”

Therefore, the key components of data flow / lineage are IT system components (applications, databases, network segments) and business processes.

TOGAF 9.1 by The Open Group, the leading guide in enterprise architecture, stipulates, “The Data Flow view is concerned with storage, retrieval, processing, archiving, and security of data.”

The TOGAF 9.1 definition seems to have nothing in common with the definitions from other reference guides. Rather, it refers to the concept of a data lifecycle.

Legislation requirements

Several regulations have driven interest in data lineage: the Basel Committee on Banking Supervision’s standard number 239, “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR); the EU General Data Protection Regulation (GDPR); IFRS 9; TRIM (Targeted Review of Internal Models); and others. Many specialists consider data lineage the ultimate remedy for meeting these requirements. Yet the term “data lineage” is never mentioned directly in these regulatory documents.

All conclusions about the necessity of data lineage are based on careful investigation of the legislation and subsequent matching of its requirements to data management methods and techniques, of which data lineage forms a part.

Business changes

Often, a company deals with different types of business changes, such as changes in information needs and requirements, changes in the application landscape, and organizational changes. As an example, consider a change in the database of a business application. Usually, data is transformed and processed through a chain of applications, as shown in Figure 1:

Figure 1: A chain of applications

For convenience, the chain consists of just a few applications, but in reality, especially in large companies, such chains consist of dozens of applications.

If, for example, the database of one of the applications is changed, such as “Company web-page” (the starting point of the chain on the left side of Figure 1), professionals will need to estimate all required changes in the downstream applications, including the impact on the end reports and/or dashboards. In this case, recorded data lineage eases the impact analysis of the change.
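Impact analysis of this kind can be sketched as a walk over the chain of applications. The snippet below is a minimal illustration, assuming the lineage is already recorded as a simple mapping from each application to the applications it feeds. Apart from “Company web-page”, every application name is hypothetical, as is the function `impacted_by`.

```python
from collections import deque

# Hypothetical application chain; only "Company web-page" comes from the text.
DOWNSTREAM = {
    "Company web-page": ["CRM"],
    "CRM": ["Data warehouse"],
    "Data warehouse": ["Risk report", "Sales dashboard"],
    "Risk report": [],
    "Sales dashboard": [],
}


def impacted_by(changed, downstream):
    """Return every application reachable from `changed`, breadth-first."""
    seen = set()
    order = []
    queue = deque(downstream.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already counted via another path
        seen.add(node)
        order.append(node)
        queue.extend(downstream.get(node, []))
    return order


if __name__ == "__main__":
    for application in impacted_by("Company web-page", DOWNSTREAM):
        print(application)
```

A breadth-first walk lists the nearest consumers first, which roughly matches the order in which teams would need to be contacted about the change.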

Conversely, if changes touch information and reporting requirements (the end point of the chain in Figure 1), professionals will need a root-cause analysis to assess which data is required to produce the new information, where that data should come from and how it should be transformed. Such an analysis is much easier to do if the data lineage is already recorded.

Usually, knowledge about data processing is kept in the minds of professionals or, in the best-case scenario, on local computers in the form of Word or Excel documents.

Data Quality

In many organizations, there are a variety of initiatives around the quality of data. In large international companies, a major data quality program may require several years for development and implementation, and longer for the user community to judge it successful. Unfortunately, many business stakeholders and IT staff do not understand the essential part that accurate data lineage plays in the resolution of data quality issues. For example, data lineage plays the key role in performing root-cause analysis while investigating data quality issues.

Supervisor and audit requirements

Supervisory and audit requirements affect the use of data lineage in every organization. Increasingly, supervisors require companies to provide granular data in support of reported results, in addition to aggregated reports. Finance and risk functions in particular often develop requirements to explain how critical metrics and figures in their reports have been derived. For that, professionals must be able to trace back the full chain of data transformations and explain each transformation along the path. This requires knowledge of end-to-end data lineage.
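Tracing a reported figure back through its transformations can be sketched as a walk over recorded lineage steps. The following is a minimal illustration, not a reference implementation: the `Step` record, the `trace_back` function and all dataset names are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    source: str
    target: str
    transformation: str  # plain-language description of what was done


# Hypothetical lineage records for one reported figure.
STEPS = [
    Step("loans.balance", "staging.exposure", "convert balance to EUR"),
    Step("rates.fx", "staging.exposure", "look up daily FX rate"),
    Step("staging.exposure", "warehouse.exposure_by_country", "sum by country"),
    Step("warehouse.exposure_by_country", "report.total_exposure", "sum over countries"),
]


def trace_back(figure, steps):
    """Return the steps feeding `figure`, nearest transformation first."""
    by_target = {}
    for step in steps:
        by_target.setdefault(step.target, []).append(step)

    trail = []
    visited = set()
    pending = [figure]
    while pending:
        current = pending.pop(0)
        if current in visited:
            continue  # guard against revisiting shared inputs
        visited.add(current)
        for step in by_target.get(current, []):
            trail.append(step)
            pending.append(step.source)
    return trail


if __name__ == "__main__":
    for step in trace_back("report.total_exposure", STEPS):
        print(f"{step.target} <- {step.source}: {step.transformation}")
```

Storing a description with each step is what allows every transformation to be explained, rather than only listing where the data came from.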

Conclusion

  1. There is no agreed list of components that constitute data lineage.
  2. These are the essential components of data lineage:
     • IT systems (applications, databases, network segments)
     • Data elements
     • Business processes, including different functional roles (data- and non-data related)
     • Data controls.

Figure 2: Key components of Data Lineage

There are several key points concerning data lineage:

  • Data lineage is a representation of the path along which data flows from the point of its origin to the point of its usage.
  • Data lineage is used to design and describe processes of data transformation and processing.
  • Data lineage is recorded by representing a set of linked components such as data elements, business processes, IT systems and applications, and data controls. These components can be presented at different levels of abstraction and detail. Usually, such a lineage is called a ‘horizontal’ data lineage.
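
One way to picture a set of linked components is as typed records joined by links. The sketch below is only an illustration under stated assumptions: the class names `Kind`, `Component` and `HorizontalLineage`, and every example component, are hypothetical and not taken from any reference guide.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    DATA_ELEMENT = "data element"
    BUSINESS_PROCESS = "business process"
    IT_SYSTEM = "IT system"
    DATA_CONTROL = "data control"


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind


@dataclass
class HorizontalLineage:
    # Ordered (upstream, downstream) pairs of linked components.
    links: list = field(default_factory=list)

    def link(self, upstream, downstream):
        self.links.append((upstream, downstream))

    def of_kind(self, kind):
        """List each linked component of the given kind once, in first-seen order."""
        found = []
        for pair in self.links:
            for component in pair:
                if component.kind is kind and component not in found:
                    found.append(component)
        return found


# All names below are made up for illustration.
onboarding = Component("Customer onboarding", Kind.BUSINESS_PROCESS)
web_form = Component("Customer web form", Kind.IT_SYSTEM)
birth_date = Component("customer.birth_date", Kind.DATA_ELEMENT)
date_check = Component("Birth date not in the future", Kind.DATA_CONTROL)
warehouse = Component("Data warehouse", Kind.IT_SYSTEM)

lineage = HorizontalLineage()
lineage.link(onboarding, web_form)
lineage.link(web_form, birth_date)
lineage.link(birth_date, date_check)
lineage.link(date_check, warehouse)

if __name__ == "__main__":
    print([c.name for c in lineage.of_kind(Kind.IT_SYSTEM)])
```

Filtering by kind gives a coarser or finer view of the same lineage, for example systems only for an architecture overview or data elements only for a detailed trace.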