Data Catalog and Lineage
data engineering
query
Data Catalog
Definition
A data catalog is an organized inventory of data assets across your organization.
It provides context, meaning, and trust so people can easily find the right data and use it with confidence.
It acts as a single source of truth about the data.
Advantages
- Centralize metadata
- Increase trust
- Save time
- Improve collaboration
- Drive value
Components
- Business context
  - Descriptions
  - Terms
  - Classifications
- Ownership and stewardship
  - Ensure accountability and quality over time
- Trust and quality
  - Quality scores
  - Certification
  - Policies
- Lineage
  - Where data comes from
  - How it changes
  - Where it is used
- Usage and popularity
  - How data is used, and by whom, to make better decisions
- Related assets
  - Links to dashboards, reports, notebooks, APIs, and documents
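To make the components above concrete, here is a minimal sketch of a single catalog entry as a Python dataclass. The class name, field names, and the `is_trusted` rule are all assumptions made for illustration; they do not come from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogEntry:
    """One data asset in a catalog (hypothetical schema, for illustration)."""

    name: str
    # Business context
    description: str = ""
    terms: list[str] = field(default_factory=list)
    classifications: list[str] = field(default_factory=list)
    # Ownership and stewardship
    owner: str = ""
    steward: str = ""
    # Trust and quality
    quality_score: Optional[float] = None  # assumed 0.0-1.0 scale; None = not measured
    certified: bool = False
    policies: list[str] = field(default_factory=list)
    # Lineage
    upstream: list[str] = field(default_factory=list)    # where data comes from
    downstream: list[str] = field(default_factory=list)  # where it is used
    # Usage and popularity
    query_count: int = 0
    top_users: list[str] = field(default_factory=list)
    # Related assets (dashboards, reports, notebooks, APIs, documents)
    related_assets: list[str] = field(default_factory=list)

    def is_trusted(self, min_quality: float = 0.9) -> bool:
        """Assumed rule: certified, owned, and measured above a quality threshold."""
        return (
            self.certified
            and bool(self.owner)
            and self.quality_score is not None
            and self.quality_score >= min_quality
        )


orders = CatalogEntry(
    name="orders_clean",
    description="Confirmed customer orders, one row per order",
    classifications=["PII"],
    owner="sales-data-team",
    quality_score=0.97,
    certified=True,
    upstream=["raw_orders"],
    related_assets=["revenue_dashboard"],
)
```

Grouping the fields by component keeps the entry readable and makes gaps obvious, for example an asset with no owner or no measured quality score.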
Workflow
- Connect to the data sources
- Discover and collect metadata and lineage
- Enrich with business context, classifications, and quality rules
- Govern by defining ownership, policies, and trust levels
- Share and use: make data easy to find and understand
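The workflow steps above can be sketched as a few small functions over an in-memory dictionary. This is only a toy model under stated assumptions: the `source` dictionary stands in for a real connection, and the function names, the "email means PII" rule, and the trust-level rule are all hypothetical.

```python
def discover(source: dict[str, list[str]]) -> dict[str, dict]:
    """Connect and discover: collect technical metadata (tables and columns)."""
    return {table: {"columns": list(cols)} for table, cols in source.items()}


def enrich(catalog: dict[str, dict], descriptions: dict[str, str]) -> dict[str, dict]:
    """Enrich: add business descriptions and a simple classification."""
    for table, meta in catalog.items():
        meta["description"] = descriptions.get(table, "")
        # Assumed rule: any column whose name contains "email" is treated as PII.
        has_pii = any("email" in col for col in meta["columns"])
        meta["classifications"] = ["PII"] if has_pii else []
    return catalog


def govern(catalog: dict[str, dict], owners: dict[str, str]) -> dict[str, dict]:
    """Govern: assign ownership and derive a trust level."""
    for table, meta in catalog.items():
        meta["owner"] = owners.get(table)
        # Assumed rule: an asset is "certified" only if it is owned and described.
        documented = bool(meta["owner"]) and bool(meta["description"])
        meta["trust_level"] = "certified" if documented else "draft"
    return catalog


def search(catalog: dict[str, dict], keyword: str) -> list[str]:
    """Share and use: find assets by name or description."""
    kw = keyword.lower()
    return sorted(
        table
        for table, meta in catalog.items()
        if kw in table.lower() or kw in meta["description"].lower()
    )


# Hypothetical source: table name -> column names.
source = {
    "orders": ["order_id", "customer_email", "amount"],
    "web_events": ["event_id", "ts"],
}
catalog = discover(source)
catalog = enrich(catalog, {"orders": "Confirmed customer orders"})
catalog = govern(catalog, {"orders": "sales-data-team"})
```

Here `search(catalog, "customer")` finds `orders` through its description, while `web_events` stays at trust level `draft` until someone owns and describes it.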
Tips
- Start Small, Think Big: begin with high-value domains and expand gradually.
- Define ownership early
- Standardize business terms
- Automate whenever possible
- Measure and improve: track usage, quality, and business impact continuously.
Data Lineage
- It shows where a number started, what happened to it, and where it ended up.
- It helps you answer questions without guessing, so you can:
  - Debug issues faster
  - Change things more safely
  - Build trust in the numbers
- Define the zoom level: from what people see, to where the data is stored, down to the source data.
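One way to picture lineage is as a directed graph that can be walked in both directions. The sketch below assumes a hypothetical set of assets (`revenue_dashboard`, `daily_revenue`, and so on) and uses a plain breadth-first search; real lineage tools typically track much richer metadata per edge.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],           # what people see
    "daily_revenue": ["orders_clean"],                # where it is stored
    "orders_clean": ["raw_orders", "raw_refunds"],    # source data
    "churn_report": ["orders_clean"],
}


def trace(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk along `edges`, returning reachable assets, nearest first."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def trace_upstream(asset: str) -> list[str]:
    """Where did this number come from? Useful when debugging an issue."""
    return trace(asset, UPSTREAM)


def trace_downstream(asset: str) -> list[str]:
    """What breaks if this asset changes? Useful before making a change."""
    reverse: dict[str, list[str]] = {}
    for child, parents in UPSTREAM.items():
        for parent in parents:
            reverse.setdefault(parent, []).append(child)
    return trace(asset, reverse)


print(trace_upstream("revenue_dashboard"))
# ['daily_revenue', 'orders_clean', 'raw_orders', 'raw_refunds']
print(trace_downstream("raw_orders"))
# ['orders_clean', 'daily_revenue', 'churn_report', 'revenue_dashboard']
```

Walking upstream supports debugging, and walking downstream is a simple impact analysis before a change. The three layers in the example mirror the zoom levels: what people see, where the data is stored, and the source data.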