Chapter 3 – Data and Knowledge Management

Learning Objectives

  • Address common data‐management challenges through strong Data Governance practices.
  • Evaluate Relational Databases (RDBMS) – advantages, disadvantages, querying, modeling, normalization, joins.
  • Define Big Data; recognize its core characteristics and managerial implications.
  • Detail the components and life-cycle of Data Warehouses / Data Marts; explain Extract-Transform-Load (ETL) and governance requirements.
  • Explain Knowledge Management Systems (KMS), the KM Cycle, and organizational benefits / barriers.

Chapter Outline

  • Managing Data
  • The Database Approach
  • Big Data
  • Data Warehouses & Data Marts
  • Knowledge Management
  • Appendix: Fundamentals of Relational Database Operations (SQL, QBE, ER Modeling, Normalization, Joins)

3.1 Managing Data

Difficulties in Managing Data
  • Data volumes grow exponentially over time.
  • Numerous, heterogeneous sources: internal operational systems, documents; personal files; external commercial databases, websites, government reports; click-stream logs.
  • Emergence of new data types (sensor, social, multimedia, etc.).
  • Data rot/degradation (out-of-date, corrupted, inaccessible storage media).
  • Maintaining security, quality, integrity while satisfying legal & regulatory constraints (e.g., GDPR, HIPAA, SOX).
Data Governance (DG)
  • Enterprise-wide framework of formal, unambiguous rules governing creation, collection, handling, protection, and usage of information.
  • Aligns data policies with corporate strategy, risk tolerance, compliance.
  • Core tactic: Master Data Management (MDM)
    • Cross-functional process spanning all applications/business units.
    • Ensures a single, consistent, accurate, timely view of key entities.
  • Master Data examples: customer, product, employee, vendor, geographic location, chart of accounts.

3.2 The Database Approach

Historical Context
  • 1950s–1970s: standalone file-processing systems → data silos, redundancy, inconsistency.
  • Modern solution: Database Management System (DBMS).
Data Hierarchy
  • BitByte (8 bits)FieldRecordFile/TableDatabase\text{Bit} \rightarrow \text{Byte (8 bits)} \rightarrow \text{Field} \rightarrow \text{Record} \rightarrow \text{File/Table} \rightarrow \text{Database}
  • Example (University DB): Student table contains fields such as Name, Major, Gender; Faculty table contains Name, Position, etc.
DBMS Functions & Architectures
  • Creates, stores, accesses, and manages databases.
  • Supplies query/reporting tools, enforces integrity & security, enables backup/recovery.
  • Primary logical models:
    • Relational (dominant; e.g., Oracle, MySQL, SQL Server, MS Access)
    • Network
    • Hierarchical
Relational Model Essentials
  • Organizes data in two-dimensional tables (relations).
    Rows = records/tuples/instances; Columns = attributes/fields.
  • Design begins with a Data Model (ER diagram).
  • Problems minimized:
    • Redundancy (duplicate data)
    • Isolation (inaccessible across apps)
    • Inconsistency (copies disagree)
  • Objectives maximized:
    • Security (centralized protection)
    • Integrity (enforced constraints)
    • Independence (loose coupling between programs & data)

3.3 Big Data

Definition
  • Gartner: “Diverse, high-volume, high-velocity information assets requiring new processing to enable enhanced decision making, insight discovery, and process optimization.”
  • Institute perspective: data exhibiting variety, velocity, volume, often non-relational and semi/un-structured.
  • Typical sources:
    • Traditional enterprise transactions
    • Machine/sensor/IoT streams
    • Social media interactions
    • Images/video/audio from cameras, scanners, satellites.
3 V’s Characteristics
  • Volume – petabytes, exabytes.
  • Velocity – real-time or near-real-time arrival; shrinking feedback loops.
  • Variety – text, click-streams, log files, audio, video, XML/JSON, etc.
Additional Issues
  • Trustworthiness of sources; dirty data (inaccurate, duplicate, incomplete); instability (stream conditions change).
Managing Big Data
  • Combine traditional RDBMS with NoSQL systems (key-value, document, column-family, graph).
  • Popular open-source stack: Hadoop, Spark, MapReduce, HDFS.
Leveraging Big Data
  • Availability: democratize access for all stakeholders.
  • Experimentation: A/B testing (Amazon, Google, LinkedIn).
  • Micro-segmentation: fine-grained customer clusters.
  • New business models: e.g., truck-fleet telematics for preventive maintenance.
  • Comprehensive analytics: full-population processing vs. sampling.
Functional-Area Applications
  • HR (talent analytics), Product Development (feature usage), Operations (predictive maintenance), Marketing (real-time offers), Government (fraud detection, smart cities).

3.4 Data Warehouses & Data Marts

Definitions
  • Data Warehouse (DW) – centralized repository of historical, subject-oriented, integrated, time-variant, non-volatile, multidimensional data to support decision making.
  • Data Mart (DM) – low-cost, department-level subset of DW tailored to a Strategic Business Unit (SBU).
Core Characteristics
  1. Subject-oriented: organized by business dimension (customer, vendor, product).
  2. Integrated: data cleansed, standardized, de-duplicated across systems.
  3. Time-variant: multiple snapshots; supports trend analysis.
  4. Non-volatile: read-only to end users; updates through controlled ETL.
  5. Multidimensional: supports Online Analytical Processing (OLAP) cubes.
Generic DW Environment
  • Source Systems: operational databases, ERP, Web logs, 3rd-party feeds.
  • Data Integration (ETL): Extract → Transform → Load; may include real-time streaming.
  • Storage Architectures: enterprise DW with/without dependent data marts, metadata repositories.
  • Data Quality & Governance: cleansing tools, stewardship committees, refresh schedules.
  • Users: IT developers, analysts, executives, suppliers, customers, regulators.
Example Architecture Components (Slide 35)
  • POS, ERP → ETL → Enterprise DW → Data Marts (Finance, Marketing) → OLAP/ROLAP, Data Mining, DSS, Web browser access.

3.5 Knowledge Management

Core Concepts
  • Knowledge (K): contextual, relevant, actionable information – aka Intellectual Capital.
  • KM Process: systematic manipulation of organizational memory (often unstructured).
  • Explicit K: documented (policies, manuals, reports, databases).
  • Tacit K: experiential (insights, know-how, skills); harder to codify/transfer.
Knowledge Management Systems (KMS)
  • Leverage intranets, extranets, groupware, database, AI/ML to capture, store, share expertise.
  • Mitigate turnover, support innovation, accelerate problem solving via access to best practices.
KMS Cycle (6 Steps)
  1. Create – innovate, import external ideas.
  2. Capture – recognize value, represent suitably.
  3. Refine – contextualize; merge tacit insights w/ explicit facts.
  4. Store – repository w/ metadata & retrieval mechanisms.
  5. Manage – curate, update, retire obsolete content.
  6. Disseminate – deliver right knowledge to right person at right time (anywhere/anytime).

Appendix – Fundamentals of Relational Database Operations

Key Terminology
  • Relational Database: collection of related 2-D2\text{-D} tables.
  • Record (Row/Tuple) – single instance.
  • Field (Column/Attribute) – characteristic of record.
Keys
  • Primary Key (PK) – uniquely identifies a record (e.g., StudentID).
  • Secondary Key – aids retrieval but not unique (e.g., LastName).
  • Foreign Key (FK) – PK from another table used to maintain referential integrity.
Querying
  • SQL – declarative language (SELECT, FROM, WHERE…).
    Example:
    \text{SELECT Name, Price FROM Pizza WHERE Price} > 10;
  • QBE – drag-and-drop grid interface building SQL under the hood.
Entity-Relationship (ER) Modeling
  • Entities: objects/events (STUDENT, CLASS).
  • Attributes: properties (StudentName, ClassTime).
  • Relationships (deg. & cardinality):
    • Unary (1-entity) – recursive (EMPLOYEE supervises EMPLOYEE).
    • Binary (2-entity) – most common (PROFESSOR teaches CLASS).
    • Ternary (3-entity) – e.g., SUPPLIER ships PRODUCT to WAREHOUSE.
  • Cardinality/Connectivity: mandatory vs. optional; single vs. many (see slide symbols).
    • One-to-One, One-to-Many, Many-to-Many (resolved by associative table).
  • Business Rules & Data Dictionary document semantics, constraints, update frequency.
Normalization
  • Goal: minimal redundancy, maximal integrity, optimal performance.
  • Functional Dependency: attribute AA determines attribute BB (ABA \rightarrow B).
  • 1st NF – eliminate repeating groups; atomic attributes.
  • 2nd NF – eliminate partial dependencies on composite PK.
  • 3rd NF – eliminate transitive dependencies (non-key → non-key).
Pizza Shop Example
  1. Raw Data (slide 71) → redundancy & anomalies.
  2. 1NF (slide 73) – single table; still duplicative.
  3. 2NF (slide 77) – split into Orders, Order-Pizza, Pizza tables.
  4. 3NF (slide 79) – further split Customer; all redundancy removed; only FKs link tables.
Join Operation
  • Combines rows from related tables using PK–FK relations.
  • In example, ORDER + ORDER_PIZZA + PIZZA + CUSTOMER joined to produce complete invoice.

Ethical, Philosophical & Practical Considerations

  • Centralizing data amplifies both value and risk (privacy breaches, single-point failure).
  • Data Quality and Governance are continuous—not one-time—activities; require culture of stewardship.
  • Big Data ethics: bias, surveillance, informed consent.
  • Knowledge-sharing culture vs. knowledge hoarding; need incentives & trust.

Quick Reference Equations & Figures

  • 8 bits=1 byte8\ \text{bits} = 1\ \text{byte}
  • Data Quality Score=Valid RecordsTotal Records\text{Data Quality Score} = \frac{\text{Valid Records}}{\text{Total Records}} (illustrative).
  • Figure cross-walks (refer to slides):
    • 3.1 (Data Hierarchy), 3.5 (Relational vs. Multidimensional), 3.10–3.12 (Relationship types), 5.13–5.18 (Normalization & Join).