Big Data Overview, Data Structures & Modern Analytics

Module Objectives

  • By the end of this lesson a learner should be able to:
    • Discuss what Big Data is and enumerate its defining characteristics.
    • Differentiate the spectrum of data structures (structured, semi-structured, quasi-structured, unstructured).
    • Identify and compare the principal data-storage repositories used by analysts (spreadmarts, data warehouses, analytic sandboxes).
    • Examine the current state-of-practice in analytics and where opportunities for improvement lie.
    • Distinguish Business Intelligence (BI) from Data Science along several analytical dimensions.
    • Diagnose the weaknesses of today’s analytical architectures and explain why new platforms are demanded by modern data-science work.

Big Data: Core Ideas & Significance

  • Continuous, exponential data creation
    • Mobile phones, social media, medical imaging, and countless IoT devices generate data around the clock.
    • Merely storing the influx is hard; extracting signal from it is harder—but doing so transforms business, government, science, and day-to-day life.
  • Industry trailblazers
    • Credit-card networks process billions of transactions to detect fraud via rule-based & machine-learning models.
    • Telcos mine call-detail records to spot churn risk (e.g., customer’s friends switch carriers ⇒ offer a retention promo proactively).
    • Data-native firms (LinkedIn, Facebook) treat their data as the product—company valuation grows with dataset size & richness.
  • The three canonical “V’s”
    • Volume – scale moves from 10^3–10^6 rows to 10^9 rows × 10^6 columns (see the back-of-envelope sizing after this list).
    • Variety – heterogeneous formats: text, images, logs, genomic sequences, sensor streams, etc.
    • Velocity – rapid ingestion + near real-time analytics; batch windows shrink from hours to seconds.
  • Which “V” matters most?
    • Although media fixates on size, variety and velocity usually drive the need for novel tools.
    • Traditional RDBMSs choke either on schema-less data or on sub-second latency demands.
  • Formal definition (McKinsey 2011)
    • “Big Data is data whose scale, distribution, diversity, and/or timeliness require new technical architectures and analytics to enable insights that unlock new sources of business value.”
    • Implication: success blends architecture + tooling + cross-disciplinary skill → the modern Data Scientist.
  • Drivers of the data deluge: mobile, sensors, social media, video surveillance & rendering, smart grids, seismic exploration, medical imaging, gene sequencing.
    • Facebook 2012 statistic: 700 status updates/second—each is fodder for ad targeting.
    • Genomics: plummeting sequencing costs yield petabytes of nucleotide data; personalized medicine emerges (predictive prescriptions & early risk mitigation).
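
A quick back-of-envelope sizing grounds the Volume figure above. This is a minimal sketch; the 8-byte cell size is an illustrative assumption, not a property of any particular system.

```python
# Why 10^9 rows x 10^6 columns breaks single-node tools.
rows, cols, bytes_per_cell = 10**9, 10**6, 8        # 8 bytes per numeric cell is an assumption
total_bytes = rows * cols * bytes_per_cell
print(f"{total_bytes / 10**15:.0f} PB")             # -> 8 PB, far beyond one machine's RAM or disk
```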

Data Structures: Taxonomy & Processing Needs

  • 80–90 % of future growth is not in classic rows × columns → novel processing paradigms required.
  • Distributed compute & MPP architectures dominate because they
    • Parallelize ingest/ETL of messy inputs.
    • Enable scale-out analytics across many nodes.
  • Four structural classes
    1. Structured Data
      • Rigid schema: relational DB tables, OLAP cubes, CSV, spreadsheets.
      • Optimized for SQL, indexing, ACID transactions.
    2. Semi-structured Data
      • Self-describing markup (XML, JSON) – recognizable tags facilitate parsing, yet may nest arbitrarily.
    3. Quasi-structured Data
      • Erratic but parseable with effort (web clickstreams). Example: three consecutive URLs tied to one keyword reconstruct a user interest trail (see the parsing sketch after this list).
    4. Unstructured Data
      • No inherent schema: free-text docs, PDFs, images, audio, video.
      • NLP, computer vision, or deep learning required to extract features.
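
The difference in parsing effort across the machine-readable classes is easy to see in code. Below is a minimal sketch with made-up sample records; the field names and the clickstream log format are hypothetical.

```python
import csv, json, re
from io import StringIO

# Structured: rigid schema, trivially parsed by a CSV/SQL reader.
structured = "id,amount\n1,9.99\n2,14.50\n"
rows = list(csv.DictReader(StringIO(structured)))

# Semi-structured: self-describing tags (JSON); keys are discoverable, nesting is arbitrary.
semi = '{"user": {"id": 1, "tags": ["vip", "churn-risk"]}}'
doc = json.loads(semi)

# Quasi-structured: a clickstream log line needs a handwritten pattern to recover fields.
quasi = "2024-01-15T10:02:11 GET /search?q=big+data 200"
m = re.match(r"(?P<ts>\S+) (?P<method>\S+) (?P<url>\S+) (?P<status>\d+)", quasi)

print(rows[0]["amount"], doc["user"]["tags"], m.group("url"))
```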

Data Repositories: Evolution & Characteristics

  • Spreadmarts (spreadsheets + small data marts)
    • Ad hoc, desktop-centric, low-volume record keeping; analysts pull manual extracts → version-control chaos & poor governance.
  • Data Warehouses (DW/EDW)
    • Central, purpose-built, normalized storage; supports BI & standardized reporting.
    • Strengths: security, backup, fail-over, “single source of truth.”
    • Limitation: rigid schema restrains exploratory or compute-heavy analytics.
  • Analytic Sandboxes / Workspaces
    • Collate assets from many sources; live outside production constraints.
    • Enable flexible, high-performance, in-database or cluster-based experimentation.
    • Blueprint for modern data-science notebooks & feature-engineering pipelines.

State of Practice in Analytics: Opportunities & Challenges

  • Traditional business problems—customer churn, up-selling, cross-selling—are now attacked with advanced analytics + Big Data, yielding bigger lifts than legacy methods.
  • Four broad problem categories where analytics confers competitive edge:
    1. Revenue growth & sales optimization.
    2. Operational efficiency & process optimization.
    3. Risk mitigation, fraud detection, AML compliance.
    4. Regulatory & audit adherence (ever-expanding data & complexity).
  • Compliance insight: Laws (e.g., AML) introduce high-dimensional patterns that only advanced statistical or machine-learning models can unravel.

Business Intelligence (BI) vs Data Science

  • Time horizon & Question framing
    • BI = hindsight & mild insight → “What happened? When? Where?”
    • Data Science = insight & foresight → “How/Why did it happen? What will happen next?”
  • Data requirements
    • BI demands highly structured, aggregated tables.
    • Data Science absorbs disaggregated, multi-modal, large/odd datasets.
  • Analytical tooling
    • BI delivers static or interactive dashboards, KPIs, OLAP cubes.
    • Data Science employs statistics, machine learning, optimization, simulation.
  • Example contrast
    • BI: $\text{QTD Revenue} = \sum_{i=1}^{n_{\text{sales}}} \text{Amount}_i$ – report vs target.
    • Data Science: time-series model $y_t = \alpha + \beta_1 y_{t-1} + \beta_2 x_t + \varepsilon_t$ to forecast next-quarter sales, incorporate exogenous variables, & quantify confidence intervals (a minimal forecasting sketch follows this list).
  • Workspace implication
    • Data-science projects need agile, experiment-friendly platforms (not monolithic EDWs) to iterate quickly.
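
A minimal sketch of this contrast, using made-up quarterly sales figures: the BI view sums what has already happened, while the data-science view fits a simple lag-1 regression (a least-squares stand-in for the AR-style model above) to project the next quarter.

```python
import numpy as np

# Hypothetical quarterly sales (illustrative numbers only)
sales = np.array([120.0, 135.0, 150.0, 161.0, 175.0, 190.0])

# BI view: aggregate what already happened (quarter-to-date style sum).
qtd_revenue = sales.sum()

# Data-science view: fit y_t = alpha + beta * y_{t-1} by least squares, then forecast one step ahead.
y_prev, y_curr = sales[:-1], sales[1:]
beta, alpha = np.polyfit(y_prev, y_curr, deg=1)   # slope, intercept
next_quarter = alpha + beta * sales[-1]

print(f"QTD revenue: {qtd_revenue:.1f}")
print(f"Forecast for next quarter: {next_quarter:.1f}")
```

A production forecast would add exogenous variables and confidence intervals, but the point of the sketch is the shift from summarizing the past to predicting the future.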

Current Analytical Architecture & Its Pain Points

  • Canonical workflow
    1. Ingest: Raw sources → EDW; data must be normalized, typed, governed.
    2. Departmental Marts: Business units erect mini-warehouses for flexible queries.
    3. Operational BI: Enterprise apps & dashboards consume warehouse feeds.
    4. Downstream Analytics: Only after the above do analysts receive extracts.
  • Resulting constraints
    • High-value data becomes last in line for predictive modeling.
    • Batch ETL ⇒ latency; data scientists work from stale snapshots.
    • In-memory analysis limits sample size → may induce sampling bias, hurting model accuracy.
    • Projects remain siloed, ad hoc, misaligned with enterprise strategy.
  • Strategic implication
    • To scale advanced analytics, firms must modernize architecture (e.g., data lakes, stream processing, cloud object stores + distributed compute) and integrate Data Science outputs back into operational loops.

Ethical, Philosophical & Practical Considerations

  • Privacy vs Personalization
    • Leveraging social posts or genomic data for targeting/prediction raises consent & confidentiality issues.
  • Fairness & Bias
    • Sampling limits in legacy architectures can skew models → unfair or erroneous decisions.
  • Governance
    • Spreadmarts undermine data lineage; centralized yet rigid EDWs protect data integrity but stifle innovation—need balanced policies.
  • Skills convergence
    • The “Data Scientist” role merges statistics, computer science, domain knowledge → cross-functional teams become mandatory.

Numeric & Statistical References

  • Facebook 2012: 700 posts per second.
  • Big-Data row/column magnitudes: >10^9 rows, >10^6 columns.
  • Future data growth: 80–90% from semi-/quasi-/unstructured sources.

Key Take-away Formulas (for recall)

  • Simple churn rate:
    $\text{Churn Rate} = \dfrac{\text{Customers lost in period}}{\text{Customers at start of period}}$
  • Lift in fraud detection (conceptual):
    $\text{Lift} = \dfrac{\text{True positives per \$}}{\text{Expected true positives per \$ under random}}$
  • Velocity metric example for streaming:
    $\text{Events per second} = \dfrac{\text{Total events}}{\text{Time window (s)}}$
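
These formulas translate directly into code. A small sketch with hypothetical inputs:

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Fraction of the starting customer base lost during the period."""
    return customers_lost / customers_at_start

def lift(true_positives_per_dollar: float, random_true_positives_per_dollar: float) -> float:
    """How many times better targeting is than random, per dollar spent."""
    return true_positives_per_dollar / random_true_positives_per_dollar

def events_per_second(total_events: int, window_seconds: float) -> float:
    """Simple streaming velocity metric."""
    return total_events / window_seconds

# Hypothetical numbers, purely for illustration
print(churn_rate(250, 10_000))            # 0.025 -> 2.5% churn
print(lift(0.9, 0.3))                     # 3.0x better than random
print(events_per_second(4_200_000, 60))   # 70,000 events/s
```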

Connections to Prior Knowledge

  • Relational algebra (SQL) handles structured data; Data Science introduces linear algebra, calculus & optimization to model complex relationships.
  • Traditional OLAP cubes apply aggregation functions (SUM, COUNT). Machine learning uses gradient-based methods to minimize a loss $L(\theta)$ and generalize beyond historical aggregates.
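
To make the contrast concrete, here is a minimal gradient-descent sketch that minimizes a toy squared-error loss $L(\theta)$ on synthetic data; the data, learning rate, and iteration count are illustrative choices, not course specifics.

```python
import numpy as np

# Synthetic data: y is roughly 3*x plus noise (for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)

theta = 0.0   # single parameter to learn
lr = 0.1      # learning rate

for _ in range(200):
    # L(theta) = mean squared error; gradient dL/dtheta = 2 * mean((theta*x - y) * x)
    grad = 2.0 * np.mean((theta * x - y) * x)
    theta -= lr * grad

print(f"learned theta ~ {theta:.2f}")   # close to 3.0
```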

Practical Recommendations

  • Invest early in scalable, schema-flexible storage (Hadoop/HDFS, cloud object storage) + distributed engines (Spark, Dask) to sidestep EDW bottlenecks.
  • Build analytic sandboxes colocated with raw & curated data to accelerate experimentation.
  • Foster data-governance frameworks that preserve lineage while permitting iterative science.
  • Combine BI & Data Science: use dashboards for monitoring + ML models for decision automation (e.g., real-time churn predictions feeding CRM actions).
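
As one illustration of the first two recommendations, a minimal PySpark sketch that reads schema-flexible JSON from object storage into a sandbox-style workspace and aggregates it across the cluster. The bucket path and column names (user_id, event_count) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spin up (or attach to) a Spark session for sandbox-style exploration.
spark = SparkSession.builder.appName("sandbox-exploration").getOrCreate()

# Schema-on-read: Spark infers structure from semi-structured JSON in object storage.
# The path and field names below are hypothetical placeholders.
events = spark.read.json("s3a://my-bucket/clickstream/")

# Distributed aggregation: events per user, computed across the cluster rather than in one process.
per_user = events.groupBy("user_id").agg(F.count("*").alias("event_count"))

per_user.orderBy(F.desc("event_count")).show(10)
```

The design point is schema-on-read plus scale-out compute: raw data lands without an upfront EDW schema, and exploration happens next to the data instead of on delayed extracts.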