Data Lineage: Make room to grow

 

Whenever you are triaging a problem, you have to focus on what is in front of you, but you cannot lose sight of how your actions will affect the outcome you ultimately want. For example, imagine you are managing a thriving new software company. Your software has gone viral, and you have to grow ASAP to handle support and new development. What is your knee-jerk reaction? Get as many bodies as you can. You bring in a wave of contractors, from project managers to developers. Your short-term needs are met, but what happens in three months? You grew too fast, and customer complaints are starting to expose the weakness of not having a fully functional organization.

A data governance practice usually has the same problem. One pillar tends to need more immediate attention at the beginning, and it is often data quality. Even when that is the case, you cannot spend all of your effort on that one fire; you still have to plan for the rest of your data governance framework as you build the data strategy. When you make decisions about the data platform, how you address data quality, or where business logic is implemented, you have to make room for other data governance practices like data lineage.

Data Lineage: Documenting the Data's Journey

So, what is data lineage?  Simply put, data lineage tracks the complete data lifecycle, from origin through transformations to its final destination or consumption point. It maps the flow, documenting data movement, transformations, and end points.
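
To make that definition concrete, here is a minimal sketch of how a lineage record could be modeled: each hop stores where the data came from, where it went, and what happened in between. The class names (`LineageEdge`, `LineageGraph`) and the dataset names are hypothetical, chosen only for illustration; a real lineage platform will have its own model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineageEdge:
    """One hop in the data's journey (all names here are hypothetical)."""
    source: str          # dataset the data was read from
    target: str          # dataset the data was written to
    transformation: str  # plain-language description of what changed


@dataclass
class LineageGraph:
    edges: list = field(default_factory=list)

    def record(self, source: str, target: str, transformation: str) -> None:
        """Document a single movement of data from source to target."""
        self.edges.append(LineageEdge(source, target, transformation))

    def origins(self) -> set:
        """Datasets that feed other datasets but are never a target themselves."""
        return {e.source for e in self.edges} - {e.target for e in self.edges}

    def end_points(self) -> set:
        """Datasets that are written to but never read by a recorded hop."""
        return {e.target for e in self.edges} - {e.source for e in self.edges}


# Illustrative journey: source system -> staging -> warehouse -> report
graph = LineageGraph()
graph.record("crm.customers", "staging.customers", "nightly extract")
graph.record("staging.customers", "warehouse.dim_customer", "deduplicate on email")
graph.record("warehouse.dim_customer", "reports.customer_dashboard", "aggregate by region")
```

Even a structure this small captures the three things the definition calls for: the origin, the transformations, and the end point.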

The Long Game: Laying the Groundwork

Even if an organization is not actively using data lineage tools, the decisions made in data architecture are critical for future data lineage initiatives. You need to "play the long game" with this component.

  • Data movement: Consider the ETL/ELT tools and methods. Are processes robust and well-documented? Did you choose technologies that automatically capture metadata or easily integrate with metadata management platforms? 
  • Business logic location: Is transformation logic buried in scripts, or is it stored in a way that is easily understood? Centralizing and documenting business rules within data pipelines simplifies tracing the impact of the rules on the data. If you lock that business logic away in APIs that cannot be easily inspected, simply because APIs are easier to build in your organization, you will limit how impactful your data lineage can be.
  • Tool selection: Even if a full lineage tool isn't on the near-term road map, did you select "lineage-friendly" technologies? When you choose tools that readily share metadata, have open APIs, or integrate well with existing or future data governance solutions like data catalogs and lineage platforms, you are making a space for data lineage even if you cannot work on it right away.

Your choices in data architecture and processes will help build a foundation that is easier to trace and understand when comprehensive data lineage is needed. When you take time to stub in access points for data lineage now, it will be faster to implement down the road.
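
One way to picture "stubbing in access points" is a thin wrapper that writes down what each pipeline step reads, writes, and why, even before any lineage tool exists. The sketch below is an assumption-laden illustration, not a recommended product design: `traced_step`, `LINEAGE_LOG`, and the dataset names are all hypothetical, and the in-memory list stands in for whatever metadata store you eventually adopt.

```python
import functools
from datetime import datetime, timezone

# Hypothetical stand-in for a metadata store or data catalog.
LINEAGE_LOG = []


def traced_step(inputs, output, rule):
    """Decorator that records lineage metadata each time a step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Log only after the step succeeds, so failed runs leave no entry.
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "rule": rule,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@traced_step(
    inputs=["staging.orders"],
    output="warehouse.fact_orders",
    rule="Exclude cancelled orders",
)
def build_fact_orders(rows):
    """Business rule kept in the pipeline, next to its documentation."""
    return [row for row in rows if row["status"] != "cancelled"]
```

The business rule lives in one visible place, and the metadata is produced as a side effect of running the step, so a future lineage tool has something to ingest rather than something to reverse-engineer.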

The Inevitable Need for Lineage

The fact is, you may be able to push data lineage off a bit while you triage other parts of your data platform and build out more urgent areas of your data strategy, but it will become a priority.  What are some reasons why its importance could increase quickly?

  • Troubleshooting and Impact Analysis: Data lineage becomes invaluable when data quality issues arise or when understanding the impact of change is needed. It allows tracing data back to its source, identifying errors, and understanding the effects of changes. This is critical for maintaining data quality and preventing widespread issues.
  • Compliance and Auditing: Many industries require a clear audit trail of data processing. Data lineage provides that documented proof, showing how data was sourced, transformed, and used. This reduces compliance risks and simplifies audits.
  • Trust and Confidence: Business stakeholders need to trust the data. Data lineage fosters trust by providing transparency into the data's journey, transformations, and dependencies, leading to more informed decision-making.
  • Collaboration and Efficiency: Lineage enhances collaboration between data engineers, analysts, and business users by providing a common understanding of data flows. This can prevent misunderstandings and streamline data-related projects.
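
Troubleshooting and impact analysis, in particular, come down to two questions once lineage is recorded: what feeds this dataset, and what depends on it? The sketch below answers both with a breadth-first walk over source-to-target pairs. The `EDGES` list and the function names are hypothetical examples, not the API of any lineage product.

```python
from collections import defaultdict, deque

# Hypothetical lineage, stored as (source, target) pairs.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.dim_customer"),
    ("staging.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "reports.revenue_by_customer"),
    ("warehouse.fact_orders", "reports.revenue_by_customer"),
]


def _reachable(start, edges, reverse=False):
    """Breadth-first walk over the lineage edges, forward or backward."""
    neighbors = defaultdict(set)
    for source, target in edges:
        if reverse:
            neighbors[target].add(source)
        else:
            neighbors[source].add(target)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(dataset, edges=EDGES):
    """Troubleshooting: everything that feeds into this dataset."""
    return _reachable(dataset, edges, reverse=True)


def downstream(dataset, edges=EDGES):
    """Impact analysis: everything affected if this dataset changes."""
    return _reachable(dataset, edges)
```

If a number looks wrong in `reports.revenue_by_customer`, `upstream` lists every dataset worth checking; before changing `staging.orders`, `downstream` lists what could break.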

It is just not possible to address every component of data governance at the same time. Priorities have to be set based on the resources you have and the business demand you are facing. That is not an excuse to skip planning. We know that data lineage needs as much metadata as it can get, so resist the impulse to “just get it done” and put in processes and expectations that foster the creation of metadata and easy access to it. When you do that, data lineage will be much easier to accomplish.

