DataLineage — Column-Level Data Lineage
28 columns mapped source → transformations → destinations. Each column carries its data classification (direct PII / sensitive / cardholder / aggregate / indirect), every transformation that touches it (PII-scrub, hash, anonymize, redact, generalize), and every downstream consumer (warehouse, ML training, partner shares). The demo surfaces 4 columns flowing raw to ML training without redaction and 2 lineage gaps where the source was lost during a platform migration.
What it is
DataLineage follows the shape common to modern data-lineage tools such as Atlan, Collibra, OpenLineage, and dbt-style lineage graphs. PIIScout (batch 9) maps each column at the schema level; DataLineage maps where each column flows, from source through every transformation to every destination.
What’s in it
- 28 columns across a realistic SaaS surface: customers (email, phone, DOB, SSN), payments (card_number dropped, last4 retained), addresses (street, geo_lat), orders (IP), support_tickets, employees (payroll, performance), analytics_events (user_id_hash, session_id, fingerprint), search_history, and partner data shares.
- Per-column lineage flow rendered as source → [transformations] → destinations:
  - Source identifies origin table + column
  - Transformations include the operation (sha256, generalize, truncate, tokenize, drop) and whether it produced a redacted output
  - Destinations are downstream services (analytics warehouse, ML training, partner share, audit log) with per-destination ok/not-ok flag
- Worst-offender findings:
  - DL-002 customers.phone: flows raw to events_raw + ML training without redaction (PIIScout C002)
  - DL-008 addresses.geo_lat: 7-decimal precision flows raw (roughly centimeter-level resolution, more than enough to pinpoint a single building; compare Sweeney 2000 on re-identification from quasi-identifiers)
  - DL-024 search_history.query: raw queries flow to ML training (users type email addresses into search)
  - DL-026 customers.notes: LINEAGE GAP, a pre-2022 column with unknown downstream paths after the platform migration
- Cross-tool callbacks — every column references its PIIScout entry, RtbfFlow path, and IncidentLog incidents.
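The per-column flow described in the list above can be sketched as a plain record plus a one-line text renderer. This is a minimal sketch, not DataLineage's actual code: the field names (`source`, `transformations`, `destinations`, `redacts`, `ok`), the `renderFlow` function, the `*` marker for redacting steps, and the `customers.email` sample values are all assumptions made for illustration.

```javascript
// Hypothetical record shape; field names are illustrative, not taken from DataLineage.
const phone = {
  id: "DL-002",
  source: { table: "customers", column: "phone" },
  transformations: [], // none applied, so the value travels raw
  destinations: [
    { name: "events_raw", ok: false },
    { name: "ML training", ok: false },
  ],
};

// Illustrative sample of a column that is hashed before it leaves the source.
const email = {
  source: { table: "customers", column: "email" },
  transformations: [{ op: "sha256", redacts: true }],
  destinations: [{ name: "analytics warehouse", ok: true }],
};

// Render one column as "source → [transformations] → destinations".
// A trailing "*" marks a transformation that produced a redacted output.
function renderFlow(col) {
  const src = `${col.source.table}.${col.source.column}`;
  const steps = col.transformations.length
    ? col.transformations.map((t) => (t.redacts ? `${t.op}*` : t.op)).join(", ")
    : "raw";
  const dests = col.destinations
    .map((d) => `${d.name} (${d.ok ? "ok" : "not ok"})`)
    .join(", ");
  return `${src} → [${steps}] → ${dests}`;
}

console.log(renderFlow(phone));
console.log(renderFlow(email));
```

Keeping the ok/not-ok flag on each destination rather than on the column lets one column be acceptable in the audit log and still be flagged in ML training.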
Why this shape
GDPR Art. 25 (data protection by design and by default), Art. 30 (records of processing activities, RoPA), and CCPA §1798.140 demand the visibility that DataLineage prototypes: not just “does this column exist” but “where does this column travel, and is it transformed appropriately at each hop.” The most damaging audit finding is a sensitive column flowing untransformed into ML training data.
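One way to express that audit check in code is a small status classifier run over every column record. This is a hedged sketch under stated assumptions, not the classifier DataLineage ships: the record shape, the `classifyColumn` name, the status strings, the use of `destinations: null` to model a lineage gap, and the set of classifications that require redaction are all hypothetical, as are the classifications assigned to the sample columns.

```javascript
// Assumed set of classifications that must be redacted before ML training.
const NEEDS_REDACTION = new Set(["direct PII", "sensitive", "cardholder"]);

// Returns "gap", "raw-to-ml", or "ok" for one column record (hypothetical shape).
function classifyColumn(col) {
  // destinations === null models a lineage gap: downstream paths are unknown.
  if (col.destinations === null) return "gap";
  const redacted = col.transformations.some((t) => t.redacts);
  const reachesMl = col.destinations.some((d) => d.name === "ML training");
  if (NEEDS_REDACTION.has(col.classification) && reachesMl && !redacted) {
    return "raw-to-ml";
  }
  return "ok";
}

// Sample records; classifications and transformations are illustrative.
const columns = [
  {
    column: "customers.phone",
    classification: "direct PII",
    transformations: [],
    destinations: [{ name: "events_raw" }, { name: "ML training" }],
  },
  {
    column: "customers.email",
    classification: "direct PII",
    transformations: [{ op: "sha256", redacts: true }],
    destinations: [{ name: "analytics warehouse" }, { name: "ML training" }],
  },
  {
    column: "customers.notes",
    classification: "sensitive",
    transformations: [],
    destinations: null, // lost during platform migration
  },
];

// Keep only the columns an auditor would need to look at.
const findings = columns
  .map((col) => ({ column: col.column, status: classifyColumn(col) }))
  .filter((f) => f.status !== "ok");

console.log(findings);
```

The gap check runs first on purpose: when downstream paths are unknown, the column cannot be shown to be safe, so it is reported rather than silently passed as ok.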
How it ships
Single HTML file, ~22 KB. Zero dependencies. The 28 column records, the per-column flow renderer, and the status classifier fit in 240 lines of vanilla JavaScript.