Chapter 3 – Data and Knowledge Management
Learning Objectives
- Address common data‐management challenges through strong Data Governance practices.
- Evaluate Relational Databases (RDBMS) – advantages, disadvantages, querying, modeling, normalization, joins.
- Define Big Data; recognize its core characteristics and managerial implications.
- Detail the components and life-cycle of Data Warehouses / Data Marts; explain Extract-Transform-Load (ETL) and governance requirements.
- Explain Knowledge Management Systems (KMS), the KM Cycle, and organizational benefits / barriers.
Chapter Outline
- Managing Data
- The Database Approach
- Big Data
- Data Warehouses & Data Marts
- Knowledge Management
- Appendix: Fundamentals of Relational Database Operations (SQL, QBE, ER Modeling, Normalization, Joins)
3.1 Managing Data
Difficulties in Managing Data
- Data volumes grow exponentially over time.
- Numerous, heterogeneous sources: internal operational systems, documents; personal files; external commercial databases, websites, government reports; click-stream logs.
- Emergence of new data types (sensor, social, multimedia, etc.).
- Data rot/degradation (out-of-date, corrupted, inaccessible storage media).
- Maintaining security, quality, integrity while satisfying legal & regulatory constraints (e.g., GDPR, HIPAA, SOX).
Data Governance (DG)
- Enterprise-wide framework of formal, unambiguous rules governing creation, collection, handling, protection, and usage of information.
- Aligns data policies with corporate strategy, risk tolerance, compliance.
- Core tactic: Master Data Management (MDM)
- Cross-functional process spanning all applications/business units.
- Ensures a single, consistent, accurate, timely view of key entities.
- Master Data examples: customer, product, employee, vendor, geographic location, chart of accounts.
3.2 The Database Approach
Historical Context
- 1950s–1970s: standalone file-processing systems → data silos, redundancy, inconsistency.
- Modern solution: Database Management System (DBMS).
Data Hierarchy
- Bit→Byte (8 bits)→Field→Record→File/Table→Database
- Example (University DB): Student table contains fields such as Name, Major, Gender; Faculty table contains Name, Position, etc.
DBMS Functions & Architectures
- Creates, stores, accesses, and manages databases.
- Supplies query/reporting tools, enforces integrity & security, enables backup/recovery.
- Primary logical models:
- Relational (dominant; e.g., Oracle, MySQL, SQL Server, MS Access)
- Network
- Hierarchical
Relational Model Essentials
- Organizes data in two-dimensional tables (relations).
Rows = records/tuples/instances; Columns = attributes/fields. - Design begins with a Data Model (ER diagram).
- Problems minimized:
- Redundancy (duplicate data)
- Isolation (inaccessible across apps)
- Inconsistency (copies disagree)
- Objectives maximized:
- Security (centralized protection)
- Integrity (enforced constraints)
- Independence (loose coupling between programs & data)
3.3 Big Data
Definition
- Gartner: “Diverse, high-volume, high-velocity information assets requiring new processing to enable enhanced decision making, insight discovery, and process optimization.”
- Institute perspective: data exhibiting variety, velocity, volume, often non-relational and semi/un-structured.
- Typical sources:
- Traditional enterprise transactions
- Machine/sensor/IoT streams
- Social media interactions
- Images/video/audio from cameras, scanners, satellites.
3 V’s Characteristics
- Volume – petabytes, exabytes.
- Velocity – real-time or near-real-time arrival; shrinking feedback loops.
- Variety – text, click-streams, log files, audio, video, XML/JSON, etc.
Additional Issues
- Trustworthiness of sources; dirty data (inaccurate, duplicate, incomplete); instability (stream conditions change).
Managing Big Data
- Combine traditional RDBMS with NoSQL systems (key-value, document, column-family, graph).
- Popular open-source stack: Hadoop, Spark, MapReduce, HDFS.
Leveraging Big Data
- Availability: democratize access for all stakeholders.
- Experimentation: A/B testing (Amazon, Google, LinkedIn).
- Micro-segmentation: fine-grained customer clusters.
- New business models: e.g., truck-fleet telematics for preventive maintenance.
- Comprehensive analytics: full-population processing vs. sampling.
Functional-Area Applications
- HR (talent analytics), Product Development (feature usage), Operations (predictive maintenance), Marketing (real-time offers), Government (fraud detection, smart cities).
3.4 Data Warehouses & Data Marts
Definitions
- Data Warehouse (DW) – centralized repository of historical, subject-oriented, integrated, time-variant, non-volatile, multidimensional data to support decision making.
- Data Mart (DM) – low-cost, department-level subset of DW tailored to a Strategic Business Unit (SBU).
Core Characteristics
- Subject-oriented: organized by business dimension (customer, vendor, product).
- Integrated: data cleansed, standardized, de-duplicated across systems.
- Time-variant: multiple snapshots; supports trend analysis.
- Non-volatile: read-only to end users; updates through controlled ETL.
- Multidimensional: supports Online Analytical Processing (OLAP) cubes.
Generic DW Environment
- Source Systems: operational databases, ERP, Web logs, 3rd-party feeds.
- Data Integration (ETL): Extract → Transform → Load; may include real-time streaming.
- Storage Architectures: enterprise DW with/without dependent data marts, metadata repositories.
- Data Quality & Governance: cleansing tools, stewardship committees, refresh schedules.
- Users: IT developers, analysts, executives, suppliers, customers, regulators.
Example Architecture Components (Slide 35)
- POS, ERP → ETL → Enterprise DW → Data Marts (Finance, Marketing) → OLAP/ROLAP, Data Mining, DSS, Web browser access.
3.5 Knowledge Management
Core Concepts
- Knowledge (K): contextual, relevant, actionable information – aka Intellectual Capital.
- KM Process: systematic manipulation of organizational memory (often unstructured).
- Explicit K: documented (policies, manuals, reports, databases).
- Tacit K: experiential (insights, know-how, skills); harder to codify/transfer.
Knowledge Management Systems (KMS)
- Leverage intranets, extranets, groupware, database, AI/ML to capture, store, share expertise.
- Mitigate turnover, support innovation, accelerate problem solving via access to best practices.
KMS Cycle (6 Steps)
- Create – innovate, import external ideas.
- Capture – recognize value, represent suitably.
- Refine – contextualize; merge tacit insights w/ explicit facts.
- Store – repository w/ metadata & retrieval mechanisms.
- Manage – curate, update, retire obsolete content.
- Disseminate – deliver right knowledge to right person at right time (anywhere/anytime).
Appendix – Fundamentals of Relational Database Operations
Key Terminology
- Relational Database: collection of related 2-D tables.
- Record (Row/Tuple) – single instance.
- Field (Column/Attribute) – characteristic of record.
Keys
- Primary Key (PK) – uniquely identifies a record (e.g., StudentID).
- Secondary Key – aids retrieval but not unique (e.g., LastName).
- Foreign Key (FK) – PK from another table used to maintain referential integrity.
Querying
- SQL – declarative language (SELECT, FROM, WHERE…).
Example:
\text{SELECT Name, Price FROM Pizza WHERE Price} > 10; - QBE – drag-and-drop grid interface building SQL under the hood.
Entity-Relationship (ER) Modeling
- Entities: objects/events (STUDENT, CLASS).
- Attributes: properties (StudentName, ClassTime).
- Relationships (deg. & cardinality):
- Unary (1-entity) – recursive (EMPLOYEE supervises EMPLOYEE).
- Binary (2-entity) – most common (PROFESSOR teaches CLASS).
- Ternary (3-entity) – e.g., SUPPLIER ships PRODUCT to WAREHOUSE.
- Cardinality/Connectivity: mandatory vs. optional; single vs. many (see slide symbols).
- One-to-One, One-to-Many, Many-to-Many (resolved by associative table).
- Business Rules & Data Dictionary document semantics, constraints, update frequency.
Normalization
- Goal: minimal redundancy, maximal integrity, optimal performance.
- Functional Dependency: attribute A determines attribute B (A→B).
- 1st NF – eliminate repeating groups; atomic attributes.
- 2nd NF – eliminate partial dependencies on composite PK.
- 3rd NF – eliminate transitive dependencies (non-key → non-key).
Pizza Shop Example
- Raw Data (slide 71) → redundancy & anomalies.
- 1NF (slide 73) – single table; still duplicative.
- 2NF (slide 77) – split into Orders, Order-Pizza, Pizza tables.
- 3NF (slide 79) – further split Customer; all redundancy removed; only FKs link tables.
Join Operation
- Combines rows from related tables using PK–FK relations.
- In example, ORDER + ORDER_PIZZA + PIZZA + CUSTOMER joined to produce complete invoice.
Ethical, Philosophical & Practical Considerations
- Centralizing data amplifies both value and risk (privacy breaches, single-point failure).
- Data Quality and Governance are continuous—not one-time—activities; require culture of stewardship.
- Big Data ethics: bias, surveillance, informed consent.
- Knowledge-sharing culture vs. knowledge hoarding; need incentives & trust.
- 8 bits=1 byte
- Data Quality Score=Total RecordsValid Records (illustrative).
- Figure cross-walks (refer to slides):
- 3.1 (Data Hierarchy), 3.5 (Relational vs. Multidimensional), 3.10–3.12 (Relationship types), 5.13–5.18 (Normalization & Join).