Big Data Quality, Data Management & Governance – Comprehensive Study Notes
Big Data Quality & Data Management Overview
- Big data presents challenges related to its intrinsic characteristics: volume, variety, velocity and quality.
- Data‐quality management and data governance are foundational to turning data into a strategic asset, preventing breaches and ensuring ROI.
- Key organisational structures: Data Governance Council (DGC), data owners, data stewards, data custodians, technology stewards.
Data Quality Assessment – Core Ideas
- A systematic activity spanning data understanding ➔ measurement ➔ assessment ➔ cleansing / improvement ➔ monitoring.
- Relies on a Data Quality Assessment Plan, Data Maturity Model and supporting processes (business case, funding, comms, risk & configuration mgmt.).
- DAMA-DMBOK functions mapped in the plan: data profiling, metadata mgmt., architecture, platform, operations, quality rules & criteria.
Six Core Data Quality Dimensions (DAMA)
- Completeness – proportion of stored data vs. the potential of “100 %” complete.
- Uniqueness – no entity recorded more than once.
- Timeliness – data represent reality at the required point in time.
- Validity – data conform to syntax (format, type, range) of its definition.
- Accuracy – degree to which data mirror real-world objects/events.
- Consistency – absence of difference between two or more representations of the same thing.
How to Use the Dimensions
- Practitioners should adopt these as a standard vocabulary; omit a dimension only if truly irrelevant.
- Typical assessment questions:
- Completeness: “Are all data items recorded?”
- Consistency: “Can we match the data set across stores?”
- Uniqueness: “Is there a single, authoritative view?”
- Validity: “Do the values obey rules?”
- Accuracy: “Do they reflect reality?”
- Timeliness: “Are they captured within allowed latency?”
Dimension-Specific Specifications (PETRONAS Examples)
Completeness
- Unit: % of non-blank values.
- Scope: 0–100% of critical data per item/record/set.
- Example: 300294×100=98% completeness for “First Emergency Contact Phone No.”
- Related: Validity, Accuracy.
Consistency
- Measure: pattern / value frequency across data stores.
- Possible to be consistent yet invalid or inaccurate.
- Example pseudo-code:
SELECT COUNT(DISTINCT "DateOfBirth") …. - Unit: %.
Uniqueness
- Measure: Uniqueness%=Recorded countReal-world count×100
Example: 520500×100=96.2%. - Footnote: inverse of duplication level.
Validity
- Check against metadata rules for format, type, range.
- Example rule: class identifier $\text{AAA99}$ pattern.
- Pseudo-code sample: “age ≥ 4 ∧ ≤ 11”.
Accuracy
- Requires validity first.
- Formula: Accuracy%=Count accurate+Count inaccurateCount accurate×100.
- Example involves US vs EU date format causing mis-age calculation.
Timeliness
- Measure: raw time difference; breaches occur beyond SLA.
- Example: Delay=4 Jun 2013−1 Jun 2013=3 days (> 2-day SLA).
Glossary Highlights
- Dimension, Measure, Scope, Unit of Measure, Assessment, Continuous measurement, Discrete measurement, Record, Dataset.
Case Study – XYZ Project (IMGESA / xyz.csv)
- File merged from monthly deliveries.
- Stats: 29 497 rows × 43 cols = 1 268 371 cells.
- Cells with data: 1 235 248; missing: 33 123 ➔ overall completeness 97.39%.
Completeness Findings
- Selected variables with low completeness:
- SLOPEGEOLEN 88.48 %
- LIDAR_PR 66.24 %
- LIDARRISKRANK 66.24 %
- Geopig_Strain 66.76 %
Uniqueness Findings
- Duplicated coordinate rows detected (≈ 50 % duplication; uniqueness ≈ 49.92 %).
- Example duplicate tuple: STARTLAT 363235, STARTLONG 379985, ENDLAT 363257, ENDLONG 379929.
Timeliness, Accuracy, Consistency Findings
- Timeliness not applicable (no sampling dates).
- No accuracy or sampling-consistency issues detected within value thresholds.
- Numeric/band rules e.g.
DiscontinuitiesScore ∈ {0,2,4,8}, TypeMaterialScore 0–11, Rain_Jan 320–637 mm. - Categorical rules:
LIDAR_RISK_RANK ∈ {V.LOW, LOW, MED, HIGH, V.HIGH, NULL}.
Data Governance Fundamentals
- Definition: authority & control (planning, monitoring, enforcement) over data assets.
- Goals
- Define & communicate strategies, policies, standards, procedures.
- Enforce compliance & resolve data issues.
- Sponsor & track DM projects.
- Promote data asset value.
- Data is strategic; must have stewards, security, privacy compliance.
Data Management vs Data Governance
- Data Management: umbrella capability safeguarding data & optimising value chains.
- Data Governance: sub-capability establishing framework & coordinating other DM sub-capabilities.
Business Benefits of Governance
- Quality decision-making, insight-to-action, security, ROI, reduced IT burden, clear ownership rights.
- Demonstrated by high‐profile breaches (Malaysia Airlines 2021; Microsoft 2020 – 250 M records; IBM 2019 breach cost study).
Challenges in Governing Big Data
- 4 Vs stress governance:
Volume – exploding size needs programmatic processes.
Variety – mixed raw/structured sources, sentiment data.
Velocity – real-time ingestion vs monitoring lag.
Veracity – sparse, volatile, manufactured data; defining “complete”. - Organisational hurdles:
- No enterprise DM strategy.
- Disparate sources.
- Legacy storage lacking central repository.
- Data formats proliferation.
Governing Data – Ownership & Roles
- Clarify: who owns, who can access, change of ownership during flows, security measures, regulatory compliance.
- Roles
- Data Owner (executive accountability).
- Data Steward (content, context, rules).
- Data Custodian (transport, storage, rule enforcement).
- Technology Steward.
- Governance Council (exec steering; DM leader facilitator).
Data Maturity Model & Gap Analysis Approach
- Assess current maturity (Level 1 ad-hoc ➔ Level 5 optimised).
- Gap analysis worksheet (existing vs needed capabilities).
- Implementation plan (roadmap aligned to DGC).
- Example gaps: missing Technical Data Management Framework docs; need standalone Data Quality guideline.
Implementation Plan (Illustrative Timeline)
- Stage 1: Preliminary study (collect existing policies, SLA, SOP per BU).
- Stage 2: Establish governance framework (Technical Data Framework, Standard, Guideline).
- Milestones sample: start 1 Sep 2017 ➔ week 7, 10, 16 checkpoints; deliver full findings & bi-weekly status.
Big Data Governance Framework Components
- Architecture domains: Business, Data, Application, Technology.
- Core functions mapped to DAMA: Data Security, DW/BI, Master Data, Metadata, Quality, Integration, etc.
- Four pillars of trustworthy AI analytics: governance, quality, security, integration.
Cloud Data Governance Framework (high-level)
- Extends on-prem model to cover cloud entitlements, shared-responsibility, multi-tenant security controls.
Conclusion & Key Takeaways
- High data quality requires explicit measurement across six DAMA dimensions, each with clear formulas and thresholds.
- Governance provides policies, owners, roles, frameworks and is vital for security, compliance and analytics value.
- Big-data contexts magnify traditional challenges; start small (ownership policy), iterate via maturity assessments, and institutionalise frameworks.
- Completeness %: Total expected valuesNon-null values×100
- Uniqueness %: Recorded entity countReal-world entity count×100
- Accuracy %: Accurate+Inaccurate entriesAccurate entries×100
- Timeliness (lag): t<em>recorded−t</em>event
References & Further Reading
- DAMA‐DMBoK 1st & 2nd Edition; DAMA Dictionary.
- TIQM, Loshin (Practitioner’s Guide), Olson (Accuracy Dimension), English (Information Quality), IAIDQ Glossary.
- Academic papers: Cheong & Chang 2007; Saed et al. 2018; Mukhrizal et al. 2019.
- IBM Security “Cost of Data Breach 2019”; NST 2021 Malaysia Airlines breach; CNBC 2021 breach costs.