Business Intelligence (BI)
BI = integrated tools and processes to capture, integrate, store and analyze data so managers can make better decisions. It turns raw data → information → knowledge → (ideally) wiser decisions.
BI benefits & framework
BI gives better decisions, a shared architecture, one common data repository (“single version of truth”), and common reporting/analysis tools. Typical flow: operational + external data → ETL → data warehouse + data marts → reporting, analytics, monitoring, dashboards; all used by people within defined processes and governance.
Operational vs decision-support data
Operational data lives in highly normalized OLTP DBs, focused on fast, accurate transaction processing (current, detailed, many updates, simple/narrow queries).
Decision-support data is historical, summarized, subject-oriented and multi-dimensional, focused on fast retrieval and complex, broad queries with drill-down (more detail) and roll-up (more aggregation).
Data warehouse vs operational DB
Operational DBs are siloed by function (orders, payments, inventory), use local formats and semantics, and constantly change.
A data warehouse is integrated (common codes/definitions), subject-oriented (product, region, customer, etc.), time-variant (keeps history with explicit time dimension), and non-volatile (loaded periodically, rarely updated in place).
ETL & schema-on-write
ETL extracts data from operational sources, transforms it (clean, standardize, integrate,
aggregate), then loads it into the warehouse. Because schema is defined before loading, this is
“schema-on-write”.
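The extract → transform → load flow, with the schema declared before any data lands, can be sketched like this (the source rows, table, and column names are made up for illustration):

```python
import sqlite3

# Extract: rows pulled from a hypothetical operational source.
raw_orders = [
    {"id": 1, "amount": "19.99", "region": " East "},
    {"id": 2, "amount": "5.00",  "region": "WEST"},
]

# Transform: clean and standardize (types, whitespace, codes).
clean = [(r["id"], float(r["amount"]), r["region"].strip().lower())
         for r in raw_orders]

# Load: the warehouse table's schema is declared up front (schema-on-write),
# so any row that doesn't fit is rejected at load time, not at query time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, region TEXT)")
db.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(round(total, 2))  # 24.99
```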
Star schema – core warehouse design
Star schema maps multi-dimensional data into a relational model.
Fact table: numeric business measures (e.g., sales_amount), with foreign keys to dimensions.
Dimension tables: descriptive attributes (product, customer, time, location, etc.) used to filter,
group, and “slice and dice”.
Attribute hierarchies: ordered levels (e.g., Year → Quarter → Month → Day) used for drill-
down/roll-up.
Cube view: each fact row represents one point in the multi-dimensional cube (e.g., product ×
time × location).
Star schema representation
Fact table in the center, one row per event (sale, shipment, etc.), with a composite primary key of all dimension FKs plus numeric measures.
Dimension tables surround it, each with a PK referenced by the fact table (1:M from dimension
to fact).
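A minimal star schema can be sketched in SQLite (all table, column, and product names here are invented for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: FKs to every dimension plus the numeric measure;
-- the composite PK is the set of dimension FKs.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product,
    time_id    INTEGER REFERENCES dim_time,
    sales_amount REAL,
    PRIMARY KEY (product_id, time_id)
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_time    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0);
""")

# Slice and dice: join the fact to a dimension and group by its attribute.
rows = db.execute("""
    SELECT p.name, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 80.0), ('widget', 250.0)]
```

Each `fact_sales` row is one cell of the product × time cube; the single join per dimension is what makes star queries simple.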
Snowflake schema – super important
In a snowflake schema, dimension tables are normalized into multiple related tables along their
hierarchies (e.g., Store → City → State → Region instead of one wide Location table).
Pros: cleaner dimension design, avoids repeated region/state values.
Cons: more joins to answer queries (to roll up from store to state or region).
Snowflake is really about how we implement attribute hierarchies: each level gets its own table, so drill-down/roll-up is following these 1:M chains.
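The "more joins" cost shows up directly in the SQL: rolling up from store to state must walk every link of the normalized chain. A sketch with an invented Store → City → State hierarchy:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Each hierarchy level gets its own table (snowflaked Location dimension).
CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE dim_city  (city_id  INTEGER PRIMARY KEY, city TEXT,
                        state_id INTEGER REFERENCES dim_state);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store TEXT,
                        city_id  INTEGER REFERENCES dim_city);
CREATE TABLE fact_sales (store_id INTEGER REFERENCES dim_store, amount REAL);
INSERT INTO dim_state VALUES (1, 'CA');
INSERT INTO dim_city  VALUES (1, 'LA', 1), (2, 'SF', 1);
INSERT INTO dim_store VALUES (1, 'LA-1', 1), (2, 'SF-1', 2);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 30.0);
""")

# Roll-up to state = following the 1:M chain: three joins instead of one.
rows = db.execute("""
    SELECT st.state, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store s  USING (store_id)
    JOIN dim_city  c  USING (city_id)
    JOIN dim_state st USING (state_id)
    GROUP BY st.state
""").fetchall()
print(rows)  # [('CA', 40.0)]
```

In a star schema the same query would need one join to a wide `dim_location` table that repeats the city/state values on every row.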
Ways to structure a warehouse & multiple facts
Star: fact table → single-level (denormalized) dimension tables.
Snowflake: fact table → lowest-level dimension; that dimension links up a normalized chain.
All-in-one variants: put higher-level attributes directly into the fact table, or keep separate fact tables at region/state/city levels to avoid long joins at the cost of more space.
Multiple fact tables (e.g., SALES_REGION, SALES_STATE, SALES_CITY, SALES_LOCATION) can speed queries at different aggregation levels.
Star vs snowflake – trade-offs
Star: fewer joins, simpler queries, more redundancy in dimensions; good for performance and user-friendliness.
Snowflake: more normalized dimensions, less redundancy, but more joins and sometimes more complex queries.
BI design often mixes them and may pre-aggregate into extra fact tables to avoid expensive GROUP BY operations on huge base facts.
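Pre-aggregation is just materializing the `GROUP BY` once. A sketch (invented table names) of building a region-level fact table from a detailed one:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales_location (region TEXT, state TEXT, amount REAL);
INSERT INTO sales_location VALUES
    ('east', 'NY', 100.0), ('east', 'MA', 50.0), ('west', 'CA', 200.0);
-- Pre-aggregate once; later region-level queries skip the big GROUP BY.
CREATE TABLE sales_region AS
    SELECT region, SUM(amount) AS amount
    FROM sales_location
    GROUP BY region;
""")

rows = db.execute(
    "SELECT region, amount FROM sales_region ORDER BY region").fetchall()
print(rows)  # [('east', 150.0), ('west', 200.0)]
```

The trade-off is the usual one: extra storage and a refresh step in exchange for cheap reads at that aggregation level.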
Data analytics (inside BI)
Data analytics = mathematical/statistical/modeling methods used on warehouse data to understand what happened (explanatory analytics) and predict what will happen (predictive analytics). This overlaps heavily with “data mining”.
OLAP (online analytical processing)
OLAP = multi-dimensional analysis over a data warehouse to support decision-making and
business modeling.
Architecture: sources → ETL → DW → OLAP engine/server → client tools (Excel, dashboards, GUIs) → reports, cubes, visualizations.
Departments may have local data marts (small, subject-specific subsets of the DW) for fast, focused analysis.
Data marts
Data mart = smaller, single-subject slice of the warehouse (e.g., Sales, Marketing). Cheaper and faster to build than a full DW; tailored to one group’s decision-support needs.
ROLAP vs MOLAP
ROLAP (Relational OLAP): stores star/snowflake schemas in a relational DBMS; uses SQL +
extensions; easy to add dimensions; good for ad hoc queries and very large relational datasets.
MOLAP (Multidimensional OLAP): stores data in multidimensional “cubes” (n-D arrays); often cached in memory; very fast for predefined dimensions and common cube operations; more rigid (dimensions fixed when cube built).
Roughly: ROLAP = flexible schemas on relational DB; MOLAP = pre-built cubes with high performance.
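One way to picture the MOLAP side: the cube is an n-D array of pre-stored measures addressed by dimension coordinates, so a slice is a direct lookup rather than a join. A toy sketch (the dimensions and values are invented):

```python
# A tiny MOLAP-style cube: measures keyed by (product, region) coordinates.
cube = {
    ("widget", "east"): 100.0, ("widget", "west"): 150.0,
    ("gadget", "east"): 80.0,  ("gadget", "west"): 60.0,
}

# Slice: fix one dimension (region = 'east') and read cells directly;
# no relational joins are involved, which is why MOLAP is fast --
# but adding a new dimension means rebuilding the cube.
east = {p: v for (p, r), v in cube.items() if r == "east"}
print(east)  # {'widget': 100.0, 'gadget': 80.0}
```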
SQL OLAP extensions – ROLLUP & CUBE
ROLLUP: GROUP BY with automatic subtotals along the listed dimensions in order, plus a grand total.
CUBE: GROUP BY with all combinations of the listed dimensions, generating subtotals for
every subset plus grand total. These are used to precompute aggregates for OLAP-style analysis.
Modern architectures: warehouse, lake, lakehouse
Traditional warehouse: ETL, historical data, schema-on-write, mostly structured tables.
Data lake: raw, semi-structured, and streaming data stored first (often in cheap cloud storage);
schema applied later at read time (“schema-on-read”, ELT).
Lakehouse: merges lake + warehouse ideas; uses the lake as storage but layers
warehouse/BI/ML capabilities on top. Concepts like data mesh (domain-owned, federated data)
extend how organizations own and publish data products.
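The schema-on-read contrast with the ETL sketch earlier can be shown in a few lines: raw records land as-is (here, JSON lines with differing shapes, all invented), and a schema is imposed only when a query reads them:

```python
import json

# Data-lake style: store raw records first; shapes may differ row to row.
raw = ['{"id": 1, "amt": "19.99"}',
       '{"id": 2, "amt": "5.00", "note": "rush"}']

# Schema-on-read: pick and coerce only the fields this query needs,
# at read time (ELT: transform after loading, not before).
amounts = [float(json.loads(line)["amt"]) for line in raw]
print(round(sum(amounts), 2))  # 24.99
```

Nothing rejected the extra `note` field at write time; a different query tomorrow can apply a different read-time schema to the same raw records.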