Data Lineage: Make room to grow
Whenever you are triaging a problem, you have to focus on what
is in front of you, but you cannot forget about how your actions will impact
your overall desired outcome. For
example, let’s imagine you are managing a thriving new software company. Your software has gone viral, and you have to
grow ASAP to handle support and new development. What is your knee-jerk reaction? Get as many bodies as you can. You bring in a wave of contractors, from
project managers to developers. Your short-term
needs are met, but in 3 months, what happens?
You grew too fast, and customer complaints are starting to expose the
weakness of not having a fully functional organization.
A data governance practice usually has the same
problem. There is always one pillar that
needs more immediate attention at the beginning: data quality. Even if that is the case, you cannot pour all
of your energy into fighting that one fire; you still have to plan for the rest of your data
governance framework as you build the data strategy. When you make decisions about the data platform,
how you address data quality, or where business logic is implemented, you have
to make room for other data governance practices like data lineage.
Data Lineage: Documenting the Data's Journey
So, what is data lineage?
Simply put, data lineage tracks the complete data lifecycle, from origin
through transformations to its final destination or consumption point. It maps
the flow, documenting each movement, transformation, and endpoint along the way.
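One way to picture this definition is as a directed graph in which datasets are nodes and transformations are the edges between them. The sketch below is only an illustration under that assumption: the dataset names, the transformation names, and the `trace_to_origins` helper are all hypothetical and not tied to any particular lineage tool.

```python
from collections import defaultdict

# Hypothetical lineage records: (source dataset, transformation, target dataset).
EDGES = [
    ("crm.customers", "dedupe_customers", "staging.customers"),
    ("erp.orders", "load_orders", "staging.orders"),
    ("staging.customers", "join_customer_orders", "mart.customer_orders"),
    ("staging.orders", "join_customer_orders", "mart.customer_orders"),
    ("mart.customer_orders", "build_revenue_report", "report.revenue"),
]


def build_upstream_index(edges):
    """Map each target dataset to the (source, transformation) pairs feeding it."""
    index = defaultdict(list)
    for source, transformation, target in edges:
        index[target].append((source, transformation))
    return index


def trace_to_origins(dataset, index):
    """Walk upstream from a dataset and return the datasets with no recorded parent."""
    origins = set()
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared or cyclic paths
        seen.add(current)
        parents = index.get(current, [])
        if not parents:
            origins.add(current)  # nothing upstream, so treat it as an origin
        for source, _transformation in parents:
            stack.append(source)
    return origins


index = build_upstream_index(EDGES)
print(sorted(trace_to_origins("report.revenue", index)))
```

With these sample records, the final report traces back to two origin systems, which is the "origin through transformations to final destination" path described above.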
The Long Game: Laying the Groundwork
Even if an organization is not actively using data lineage
tools, the decisions made in data architecture are critical for future data
lineage initiatives. You need to "play the long game" with this
component.
- Data movement: Consider the ETL/ELT tools and methods. Are processes
robust and well-documented? Did you choose technologies that automatically
capture metadata or easily integrate with metadata management
platforms?
- Business logic location: Is transformation logic embedded within scripts
or stored in a way that is easily understood? Centralizing and documenting
business rules within data pipelines simplifies tracing the impact of the
rules on the data. If you lock that business logic away inside APIs that
cannot be easily inspected, simply because APIs are easier to build in your
organization, you will limit how useful your data lineage can be.
- Tool selection: Even if a full lineage tool isn't on the near-term
roadmap, did you select "lineage-friendly" technologies? When
you choose tools that readily share metadata, have open APIs, or integrate
well with existing or future data governance solutions like data catalogs
and lineage platforms, you are making a space for data lineage even if you
cannot work on it right away.
Your choices in data architecture and processes will help build
a foundation that is easier to trace and understand when comprehensive data
lineage is needed. When you take time to stub in access points for data lineage
now, it will be faster to implement down the road.
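As a rough illustration of what "stubbing in an access point" could look like, the sketch below wraps a pipeline step so that each run records what it read, what it wrote, and a plain-language description of the business rule. Everything here is an assumption made for the example: the `records_lineage` decorator, the in-memory `LINEAGE_LOG`, and the dataset names are hypothetical, and a real platform would more likely send this metadata to a catalog or lineage service.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store standing in for a metadata catalog.
LINEAGE_LOG = []


def records_lineage(inputs, outputs):
    """Decorator that logs lineage metadata every time the wrapped step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "outputs": list(outputs),
                # The docstring doubles as the documented business rule.
                "description": (func.__doc__ or "").strip(),
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@records_lineage(inputs=["staging.orders"], outputs=["mart.net_revenue"])
def compute_net_revenue(orders):
    """Net revenue is the gross amount minus refunds, per order."""
    return [
        {"order_id": o["order_id"], "net": o["gross"] - o["refund"]}
        for o in orders
    ]


rows = compute_net_revenue([{"order_id": 1, "gross": 100, "refund": 20}])
print(rows)
print(LINEAGE_LOG[-1]["step"], LINEAGE_LOG[-1]["inputs"], LINEAGE_LOG[-1]["outputs"])
```

The design choice worth noting is that the business rule stays in one readable place and the metadata is captured as a side effect of running the step, so nothing extra has to be remembered when a lineage tool is eventually introduced.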
The Inevitable Need for Lineage
The fact is, you may be able to push data lineage off a bit
while you triage other parts of your data platform and build out more urgent
areas of your data strategy, but it will become a priority. What are some reasons why its importance
could increase quickly?
- Troubleshooting and Impact Analysis: Data lineage becomes invaluable when data
quality issues arise or when understanding the impact of change is needed.
It allows tracing data back to its source, identifying errors, and
understanding the effects of changes. This is critical for maintaining
data quality and preventing widespread issues.
- Compliance and Auditing: Many industries require a clear audit trail of data
processing. Data lineage provides that documented proof, showing how data
was sourced, transformed, and used. This reduces compliance risks and
simplifies audits.
- Trust and Confidence: Business stakeholders need to trust the data.
Data lineage fosters trust by providing transparency into the data's
journey, transformations, and dependencies, leading to more informed
decision-making.
- Collaboration and Efficiency: Lineage enhances collaboration between data
engineers, analysts, and business users by providing a common
understanding of data flows. This can prevent misunderstandings and
streamline data-related projects.
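To make the impact-analysis case concrete, here is a minimal sketch that answers "if this dataset changes, what is affected downstream?" It assumes lineage has already been captured as simple source-to-target pairs; the `FLOWS` data and the `downstream_impact` function are hypothetical and exist only for this illustration.

```python
from collections import defaultdict, deque

# Hypothetical recorded data flows: (source dataset, target dataset).
FLOWS = [
    ("erp.orders", "staging.orders"),
    ("staging.orders", "mart.customer_orders"),
    ("staging.orders", "mart.daily_sales"),
    ("mart.customer_orders", "report.revenue"),
    ("crm.customers", "mart.customer_orders"),
]


def downstream_impact(changed, flows):
    """Return every dataset reachable downstream of the changed dataset."""
    children = defaultdict(list)
    for source, target in flows:
        children[source].append(target)

    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in impacted:  # visit each dataset only once
                impacted.add(child)
                queue.append(child)
    return sorted(impacted)


print(downstream_impact("staging.orders", FLOWS))
```

With this sample data, a change to `staging.orders` touches two marts and the revenue report, which is the kind of answer a team wants before it ships a schema change.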
It is just not possible to address every component of data
governance at the same time. Priorities
have to be set based on the resources you have and the business demand that you
are facing. That is not an excuse not to
plan. We know that data lineage needs as
much metadata as it can get, so resist the impulse to “just get it done” and
put in processes and expectations that foster the creation and easy access of
metadata. When you do that, data lineage
will be much easier to accomplish.