Big Data Quality, Data Management & Governance – Comprehensive Study Notes

Big Data Quality & Data Management Overview

  • Big data presents challenges related to its intrinsic characteristics: volume, variety, velocity and veracity (data quality).
  • Data‐quality management and data governance are foundational to turning data into a strategic asset, preventing breaches and ensuring ROI.
  • Key organisational structures: Data Governance Council (DGC), data owners, data stewards, data custodians, technology stewards.

Data Quality Assessment – Core Ideas

  • A systematic activity spanning data understanding ➔ measurement ➔ assessment ➔ cleansing / improvement ➔ monitoring.
  • Relies on a Data Quality Assessment Plan, Data Maturity Model and supporting processes (business case, funding, comms, risk & configuration mgmt.).
  • DAMA-DMBOK functions mapped in the plan: data profiling, metadata mgmt., architecture, platform, operations, quality rules & criteria.

Six Core Data Quality Dimensions (DAMA)

  • Completeness – proportion of stored data measured against the potential of “100 % complete”.
  • Uniqueness – no entity recorded more than once.
  • Timeliness – data represent reality at the required point in time.
  • Validity – data conform to the syntax (format, type, range) of their definition.
  • Accuracy – degree to which data mirror real-world objects/events.
  • Consistency – absence of difference between two or more representations of the same thing.
How to Use the Dimensions
  • Practitioners should adopt these as a standard vocabulary; omit a dimension only if truly irrelevant.
  • Typical assessment questions:
    • Completeness: “Are all data items recorded?”
    • Consistency: “Can we match the data set across stores?”
    • Uniqueness: “Is there a single, authoritative view?”
    • Validity: “Do the values obey rules?”
    • Accuracy: “Do they reflect reality?”
    • Timeliness: “Are they captured within allowed latency?”

Dimension-Specific Specifications (PETRONAS Examples)

Completeness
  • Unit: % of non-blank values.
  • Scope: $0\text{–}100\%$ of critical data per item/record/set.
  • Example: $\frac{294}{300} \times 100 = 98\%$ completeness for “First Emergency Contact Phone No.”
  • Related: Validity, Accuracy.
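
The completeness measure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `completeness_pct`, the sample column `phones`, and the choice to treat `None` and whitespace-only strings as blank are all assumptions.

```python
def completeness_pct(values):
    """Percentage of non-blank entries in a column.

    Assumption: None and empty or whitespace-only strings count as blank.
    """
    if not values:
        raise ValueError("column is empty")
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return 100.0 * filled / len(values)


# Hypothetical column mirroring the 294-of-300 example above.
phones = ["555-0100"] * 294 + [None, "", "  "] * 2
print(completeness_pct(phones))  # 98.0
```

What counts as "blank" should be agreed with the data steward, since placeholder values such as "N/A" may also need to be excluded.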
Consistency
  • Measure: pattern / value frequency across data stores.
  • Possible to be consistent yet invalid or inaccurate.
  • Example pseudo-code: SELECT COUNT(DISTINCT "DateOfBirth") ….
  • Unit: %.
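
One possible reading of "pattern frequency across data stores" is to reduce each value to its shape and count how often each shape occurs. The sketch below assumes that reading; the helper names, the store names and the sample dates are hypothetical.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    shape = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", shape)


def pattern_profile(values):
    """Frequency of each shape within one data store."""
    return Counter(pattern_of(v) for v in values)


def consistency_pct(*stores):
    """Share of all values, across stores, that follow the most common shape."""
    combined = Counter()
    for values in stores:
        combined.update(pattern_profile(values))
    total = sum(combined.values())
    if total == 0:
        raise ValueError("no values to profile")
    return 100.0 * combined.most_common(1)[0][1] / total


# Hypothetical DateOfBirth values held in two stores.
hr_store = ["1990-01-31", "1985-07-04"]
payroll_store = ["31/01/1990", "1985-07-04"]
print(consistency_pct(hr_store, payroll_store))  # 75.0
```

A high score here only says the representations agree; as noted above, the values may still be invalid or inaccurate.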
Uniqueness
  • Measure: $\text{Uniqueness\%} = \frac{\text{Real-world count}}{\text{Recorded count}} \times 100$
    Example: $\frac{500}{520} \times 100 \approx 96.2\%$.
  • Footnote: inverse of duplication level.
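
The uniqueness formula can be checked with a short Python sketch. The function names are illustrative, and treating the duplication level as the complement of uniqueness (100 minus the score) is an assumption about what "inverse" means here.

```python
def uniqueness_pct(real_world_count, recorded_count):
    """Real-world entity count as a percentage of recorded entries."""
    if recorded_count <= 0:
        raise ValueError("recorded_count must be positive")
    return 100.0 * real_world_count / recorded_count


def duplication_pct(real_world_count, recorded_count):
    """Assumed complement of uniqueness: the share of surplus records."""
    return 100.0 - uniqueness_pct(real_world_count, recorded_count)


# The 500 real-world entities vs 520 records example above.
print(round(uniqueness_pct(500, 520), 1))   # 96.2
print(round(duplication_pct(500, 520), 1))  # 3.8
```

In practice the real-world count is rarely known exactly; the number of distinct business keys is a common stand-in.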
Validity
  • Check against metadata rules for format, type, range.
  • Example rule: class identifier $\text{AAA99}$ pattern.
  • Pseudo-code sample: “age ≥ 4 ∧ age ≤ 11”.
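
The two validity rules above might be expressed as follows. This is a sketch under assumptions: the pattern AAA99 is read as three uppercase letters followed by two digits, and the age bounds are taken as inclusive. The function names are hypothetical.

```python
import re

# Assumed reading of the AAA99 pattern: three uppercase letters, two digits.
CLASS_ID = re.compile(r"[A-Z]{3}[0-9]{2}")


def is_valid_class_id(value):
    """True when the whole string matches the assumed AAA99 pattern."""
    return CLASS_ID.fullmatch(value) is not None


def is_valid_age(age):
    """True when age is an integer in the inclusive range 4 to 11."""
    return isinstance(age, int) and 4 <= age <= 11


print(is_valid_class_id("ABC12"))  # True
print(is_valid_class_id("AB123"))  # False
print(is_valid_age(12))            # False
```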
Accuracy
  • Requires validity first.
  • Formula: $\text{Accuracy\%} = \frac{\text{Count accurate}}{\text{Count accurate} + \text{Count inaccurate}} \times 100$.
  • Example involves US vs EU date format causing mis-age calculation.
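
To make the date-format problem concrete, the sketch below parses one string under both conventions and shows how the derived age differs. The raw value, the reference date and the function names are hypothetical, chosen only to illustrate a value that is valid in both formats yet inaccurate in one.

```python
from datetime import date, datetime


def age_on(dob, today):
    """Completed years between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def accuracy_pct(accurate, inaccurate):
    """Accurate entries as a percentage of all assessed entries."""
    total = accurate + inaccurate
    if total == 0:
        raise ValueError("nothing was assessed")
    return 100.0 * accurate / total


raw = "03/09/2007"                               # hypothetical stored value
eu = datetime.strptime(raw, "%d/%m/%Y").date()   # 3 September 2007
us = datetime.strptime(raw, "%m/%d/%Y").date()   # 9 March 2007
ref = date(2013, 6, 1)                           # hypothetical assessment date

print(age_on(eu, ref))  # 5
print(age_on(us, ref))  # 6
print(accuracy_pct(95, 5))  # 95.0
```

Both parses succeed, so a validity check alone would not catch the error; only comparison with the real-world date of birth would.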
Timeliness
  • Measure: raw time difference; breaches occur beyond SLA.
  • Example: $\text{Delay} = 4\text{ Jun 2013} - 1\text{ Jun 2013} = 3\text{ days}$ (> 2-day SLA).
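
The timeliness check can be sketched with the standard `datetime` module. The function names are illustrative, and the sketch assumes 1 Jun 2013 is the event date and 4 Jun 2013 the date the record was captured.

```python
from datetime import date, timedelta


def delay(event_date, recorded_date):
    """Raw time difference between the event and its recording."""
    return recorded_date - event_date


def breaches_sla(event_date, recorded_date, sla):
    """True when the recording lag exceeds the agreed SLA."""
    return delay(event_date, recorded_date) > sla


sla = timedelta(days=2)
lag = delay(date(2013, 6, 1), date(2013, 6, 4))
print(lag.days)                                               # 3
print(breaches_sla(date(2013, 6, 1), date(2013, 6, 4), sla))  # True
```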

Glossary Highlights

  • Dimension, Measure, Scope, Unit of Measure, Assessment, Continuous measurement, Discrete measurement, Record, Dataset.

Case Study – XYZ Project (IMGESA / xyz.csv)

  • File merged from monthly deliveries.
  • Stats: 29 497 rows × 43 cols = 1 268 371 cells.
  • Cells with data: 1 235 248; missing: 33 123 ➔ overall completeness $97.39\%$.
Completeness Findings
  • Selected variables with low completeness:
    • SLOPEGEOLEN 88.48 %
    • LIDAR_PR 66.24 %
    • LIDAR_RISK_RANK 66.24 %
    • Geopig_Strain 66.76 %
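
Per-variable completeness figures like those above could be produced by a small profiling routine. The sketch below uses only the standard library and a tiny made-up sample; the rows are not taken from xyz.csv, and the function name is hypothetical.

```python
import csv
import io

# Made-up sample rows; only the column names come from the case study.
SAMPLE = """SLOPEGEOLEN,LIDAR_PR,Geopig_Strain
12.5,0.3,
,0.1,0.02
8.0,,
9.1,,0.05
"""


def column_completeness(csv_text):
    """Map each column name to its percentage of non-blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    filled = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in filled:
            if (row[name] or "").strip():
                filled[name] += 1
    if rows == 0:
        raise ValueError("no data rows")
    return {name: 100.0 * n / rows for name, n in filled.items()}


print(column_completeness(SAMPLE))
# {'SLOPEGEOLEN': 75.0, 'LIDAR_PR': 50.0, 'Geopig_Strain': 50.0}
```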
Uniqueness Findings
  • Duplicated coordinate rows detected (≈ 50 % duplication; uniqueness ≈ 49.92 %).
  • Example duplicate tuple: STARTLAT 363235, STARTLONG 379985, ENDLAT 363257, ENDLONG 379929.
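
Duplicate coordinate rows of this kind can be found by counting rows per coordinate tuple. The sketch below assumes the four coordinate columns together identify a segment. The first tuple is the example above; the second is hypothetical.

```python
from collections import Counter

KEY = ("STARTLAT", "STARTLONG", "ENDLAT", "ENDLONG")


def uniqueness_by_key(rows, key=KEY):
    """Return (uniqueness %, {key tuple: count} for tuples seen more than once)."""
    if not rows:
        raise ValueError("no rows")
    counts = Counter(tuple(r[k] for k in key) for r in rows)
    duplicates = {k: n for k, n in counts.items() if n > 1}
    return 100.0 * len(counts) / len(rows), duplicates


seg_a = {"STARTLAT": 363235, "STARTLONG": 379985, "ENDLAT": 363257, "ENDLONG": 379929}
seg_b = {"STARTLAT": 363300, "STARTLONG": 380010, "ENDLAT": 363320, "ENDLONG": 379990}
rows = [seg_a, dict(seg_a), seg_b, dict(seg_b)]

pct, dupes = uniqueness_by_key(rows)
print(pct)  # 50.0
```

When every segment appears exactly twice, uniqueness is 50 %, close to the figure reported for the merged file. Monthly deliveries that repeat earlier rows would produce this pattern, though the notes do not confirm the cause.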
Timeliness, Accuracy, Consistency Findings
  • Timeliness not applicable (no sampling dates).
  • No accuracy or sampling-consistency issues detected within value thresholds.
Validity Checks (Range Rules Extract)
  • Numeric/band rules e.g. DiscontinuitiesScore ∈ {0,2,4,8}, TypeMaterialScore 0–11, Rain_Jan 320–637 mm.
  • Categorical rules: LIDAR_RISK_RANK ∈ {V.LOW, LOW, MED, HIGH, V.HIGH, NULL}.
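
Range rules like these lend themselves to a small rule table. The sketch below is one way to encode the extract, assuming numeric inputs, inclusive bounds, and NULL modelled as Python `None`; the `RULES` table and `violations` helper are hypothetical names.

```python
RISK_RANKS = {"V.LOW", "LOW", "MED", "HIGH", "V.HIGH", None}  # None stands in for NULL

# Each rule returns True when the value is valid (bounds assumed inclusive).
RULES = {
    "DiscontinuitiesScore": lambda v: v in {0, 2, 4, 8},
    "TypeMaterialScore": lambda v: v is not None and 0 <= v <= 11,
    "Rain_Jan": lambda v: v is not None and 320 <= v <= 637,
    "LIDAR_RISK_RANK": lambda v: v in RISK_RANKS,
}


def violations(record):
    """Sorted names of the fields in record that break their rule."""
    return sorted(
        name for name, is_valid in RULES.items()
        if name in record and not is_valid(record[name])
    )


# Hypothetical records.
good = {"DiscontinuitiesScore": 4, "TypeMaterialScore": 7,
        "Rain_Jan": 450, "LIDAR_RISK_RANK": "MED"}
bad = {"DiscontinuitiesScore": 3, "TypeMaterialScore": 12,
       "Rain_Jan": 450, "LIDAR_RISK_RANK": None}

print(violations(good))  # []
print(violations(bad))   # ['DiscontinuitiesScore', 'TypeMaterialScore']
```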

Data Governance Fundamentals

  • Definition: authority & control (planning, monitoring, enforcement) over data assets.
  • Goals
    • Define & communicate strategies, policies, standards, procedures.
    • Enforce compliance & resolve data issues.
    • Sponsor & track DM projects.
    • Promote data asset value.
  • Data is strategic; must have stewards, security, privacy compliance.
Data Management vs Data Governance
  • Data Management: umbrella capability safeguarding data & optimising value chains.
  • Data Governance: sub-capability establishing framework & coordinating other DM sub-capabilities.
Business Benefits of Governance
  • Quality decision-making, insight-to-action, security, ROI, reduced IT burden, clear ownership rights.
  • The cost of weak governance is illustrated by high‐profile breaches (Malaysia Airlines 2021; Microsoft 2020 – 250 M records) and by IBM's 2019 breach cost study.

Challenges in Governing Big Data

  • 4 Vs stress governance:
    Volume – exploding size needs programmatic processes.
    Variety – mixed raw/structured sources, sentiment data.
    Velocity – real-time ingestion vs monitoring lag.
    Veracity – sparse, volatile, manufactured data; defining “complete”.
  • Organisational hurdles:
    1. No enterprise DM strategy.
    2. Disparate sources.
    3. Legacy storage lacking central repository.
    4. Data formats proliferation.

Governing Data – Ownership & Roles

  • Clarify: who owns, who can access, change of ownership during flows, security measures, regulatory compliance.
  • Roles
    • Data Owner (executive accountability).
    • Data Steward (content, context, rules).
    • Data Custodian (transport, storage, rule enforcement).
    • Technology Steward.
    • Governance Council (exec steering; DM leader facilitator).

Data Maturity Model & Gap Analysis Approach

  1. Assess current maturity (Level 1 ad-hoc ➔ Level 5 optimised).
  2. Gap analysis worksheet (existing vs needed capabilities).
  3. Implementation plan (roadmap aligned to DGC).
  • Example gaps: missing Technical Data Management Framework docs; need standalone Data Quality guideline.
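
At its simplest, the gap analysis worksheet is a set difference between needed and existing capabilities. The sketch below assumes that simplification; the existing capability listed is hypothetical, while the two gaps are the examples given above.

```python
def gap_analysis(existing, needed):
    """Capabilities that are needed but not yet in place, sorted by name."""
    return sorted(set(needed) - set(existing))


existing = {"Data ownership policy"}  # hypothetical current capability
needed = {
    "Data ownership policy",
    "Technical Data Management Framework",
    "Data Quality guideline",
}

print(gap_analysis(existing, needed))
# ['Data Quality guideline', 'Technical Data Management Framework']
```

Each gap found this way would then become an item on the implementation roadmap.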

Implementation Plan (Illustrative Timeline)

  • Stage 1: Preliminary study (collect existing policies, SLA, SOP per BU).
  • Stage 2: Establish governance framework (Technical Data Framework, Standard, Guideline).
  • Milestones sample: start 1 Sep 2017 ➔ week 7, 10, 16 checkpoints; deliver full findings & bi-weekly status.

Big Data Governance Framework Components

  • Architecture domains: Business, Data, Application, Technology.
  • Core functions mapped to DAMA: Data Security, DW/BI, Master Data, Metadata, Quality, Integration, etc.
  • Four pillars of trustworthy AI analytics: governance, quality, security, integration.

Cloud Data Governance Framework (high-level)

  • Extends on-prem model to cover cloud entitlements, shared-responsibility, multi-tenant security controls.

Conclusion & Key Takeaways

  • High data quality requires explicit measurement across six DAMA dimensions, each with clear formulas and thresholds.
  • Governance provides policies, owners, roles, frameworks and is vital for security, compliance and analytics value.
  • Big-data contexts magnify traditional challenges; start small (ownership policy), iterate via maturity assessments, and institutionalise frameworks.

Selected Equations / Formulae (Quick Reference)

  • Completeness %: $\frac{\text{Non-null values}}{\text{Total expected values}} \times 100$
  • Uniqueness %: $\frac{\text{Real-world entity count}}{\text{Recorded entity count}} \times 100$
  • Accuracy %: $\frac{\text{Accurate entries}}{\text{Accurate} + \text{Inaccurate entries}} \times 100$
  • Timeliness (lag): $t_{\text{recorded}} - t_{\text{event}}$

References & Further Reading

  • DAMA‐DMBoK 1st & 2nd Edition; DAMA Dictionary.
  • TIQM, Loshin (Practitioner’s Guide), Olson (Accuracy Dimension), English (Information Quality), IAIDQ Glossary.
  • Academic papers: Cheong & Chang 2007; Saed et al. 2018; Mukhrizal et al. 2019.
  • IBM Security “Cost of Data Breach 2019”; NST 2021 Malaysia Airlines breach; CNBC 2021 breach costs.