Big Data Overview, Data Structures & Modern Analytics

Module Objectives

  • By the end of this lesson a learner should be able to:
    • Discuss what Big Data is and enumerate its defining characteristics.
    • Differentiate the spectrum of data structures (structured, semi-structured, quasi-structured, unstructured).
    • Identify and compare the principal data-storage repositories used by analysts (spreadmarts, data warehouses, analytic sandboxes).
    • Examine the current state-of-practice in analytics and where opportunities for improvement lie.
    • Distinguish Business Intelligence (BI) from Data Science along several analytical dimensions.
    • Diagnose the weaknesses of today’s analytical architectures and explain why new platforms are demanded by modern data-science work.

Big Data: Core Ideas & Significance

  • Continuous, exponential data creation
    • Mobile phones, social media, medical imaging, and countless IoT devices generate data around the clock.
    • Merely storing the influx is hard; extracting signal from it is harder—but doing so transforms business, government, science, and day-to-day life.
  • Industry trailblazers
    • Credit-card networks process billions of transactions to detect fraud via rule-based & machine-learning models.
    • Telcos mine call-detail records to spot churn risk (e.g., customer’s friends switch carriers ⇒ offer a retention promo proactively).
    • Data-native firms (LinkedIn, Facebook) treat their data as the product—company valuation grows with dataset size & richness.
  • The three canonical “V’s”
    • Volume – scale moves from 10^3–10^6 rows to 10^9 rows × 10^6 columns (see the back-of-envelope sizing after this list).
    • Variety – heterogeneous formats: text, images, logs, genomic sequences, sensor streams, etc.
    • Velocity – rapid ingestion + near real-time analytics; batch windows shrink from hours to seconds.
  • Which “V” matters most?
    • Although media fixates on size, variety and velocity usually drive the need for novel tools.
    • Traditional RDBMSs choke either on schema-less data or on sub-second latency demands.
  • Formal definition (McKinsey 2011)
    • “Big Data is data whose scale, distribution, diversity, and/or timeliness require new technical architectures and analytics to enable insights that unlock new sources of business value.”
    • Implication: success blends architecture + tooling + cross-disciplinary skill → the modern Data Scientist.
  • Drivers of the data deluge: mobile, sensors, social media, video surveillance & rendering, smart grids, seismic exploration, medical imaging, gene sequencing.
    • Facebook 2012 statistic: 700 status updates/second—each is fodder for ad targeting.
    • Genomics: plummeting sequencing costs yield petabytes of nucleotide data; personalized medicine emerges (predictive prescriptions & early risk mitigation).
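
A quick back-of-envelope sizing grounds the Volume figure above. This is a minimal sketch; the 8-byte cell size is an illustrative assumption, not a property of any particular system.

```python
# Why 10^9 rows x 10^6 columns breaks single-node tools.
rows, cols, bytes_per_cell = 10**9, 10**6, 8        # 8 bytes per numeric cell is an assumption
total_bytes = rows * cols * bytes_per_cell
print(f"{total_bytes / 10**15:.0f} PB")             # -> 8 PB, far beyond one machine's RAM or disk
```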

Data Structures: Taxonomy & Processing Needs

  • 80–90 % of future growth is not in classic rows × columns → novel processing paradigms required.
  • Distributed compute & MPP architectures dominate because they
    • Parallelize ingest/ETL of messy inputs.
    • Enable scale-out analytics across many nodes.
  • Four structural classes
    1. Structured Data
      • Rigid schema: relational DB tables, OLAP cubes, CSV, spreadsheets.
      • Optimized for SQL, indexing, ACID transactions.
    2. Semi-structured Data
      • Self-describing markup (XML, JSON) – recognizable tags facilitate parsing, yet may nest arbitrarily.
    3. Quasi-structured Data
      • Erratic but parseable with effort (web clickstreams). Example: three consecutive URLs tied to one keyword reconstruct a user interest trail (see the parsing sketch after this list).
    4. Unstructured Data
      • No inherent schema: free-text docs, PDFs, images, audio, video.
      • NLP, computer vision, or deep learning required to extract features.
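
The difference in parsing effort across the machine-readable classes is easy to see in code. Below is a minimal sketch with made-up sample records; the field names and the clickstream log format are hypothetical.

```python
import csv, json, re
from io import StringIO

# Structured: rigid schema, trivially parsed by a CSV/SQL reader.
structured = "id,amount\n1,9.99\n2,14.50\n"
rows = list(csv.DictReader(StringIO(structured)))

# Semi-structured: self-describing tags (JSON); keys are discoverable, nesting is arbitrary.
semi = '{"user": {"id": 1, "tags": ["vip", "churn-risk"]}}'
doc = json.loads(semi)

# Quasi-structured: a clickstream log line needs a handwritten pattern to recover fields.
quasi = "2024-01-15T10:02:11 GET /search?q=big+data 200"
m = re.match(r"(?P<ts>\S+) (?P<method>\S+) (?P<url>\S+) (?P<status>\d+)", quasi)

print(rows[0]["amount"], doc["user"]["tags"], m.group("url"))
```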

Data Repositories: Evolution & Characteristics

  • Spreadmarts (spreadsheets + small data marts)
    • Ad hoc, desktop-centric, low-volume record keeping; analysts pull manual extracts → version-control chaos & poor governance.
  • Data Warehouses (DW/EDW)
    • Central, purpose-built, normalized storage; supports BI & standardized reporting.
    • Strengths: security, backup, fail-over, “single source of truth.”
    • Limitation: rigid schema restrains exploratory or compute-heavy analytics.
  • Analytic Sandboxes / Workspaces
    • Collate assets from many sources; live outside production constraints.
    • Enable flexible, high-performance, in-database or cluster-based experimentation.
    • Blueprint for modern data-science notebooks & feature-engineering pipelines.

State of Practice in Analytics: Opportunities & Challenges

  • Traditional business problems—customer churn, up-selling, cross-selling—are now attacked with advanced analytics + Big Data, yielding bigger lifts than legacy methods.
  • Four broad problem categories where analytics confers competitive edge:
    1. Revenue growth & sales optimization.
    2. Operational efficiency & process optimization.
    3. Risk mitigation, fraud detection, AML compliance.
    4. Regulatory & audit adherence (ever-expanding data & complexity).
  • Compliance insight: Laws (e.g., AML) introduce high-dimensional patterns that only advanced statistical or machine-learning models can unravel.

Business Intelligence (BI) vs Data Science

  • Time horizon & Question framing
    • BI = hindsight & mild insight → “What happened? When? Where?”
    • Data Science = insight & foresight → “How/Why did it happen? What will happen next?”
  • Data requirements
    • BI demands highly structured, aggregated tables.
    • Data Science absorbs disaggregated, multi-modal, large/odd datasets.
  • Analytical tooling
    • BI delivers static or interactive dashboards, KPIs, OLAP cubes.
    • Data Science employs statistics, machine learning, optimization, simulation.
  • Example contrast
    • BI: $\text{QTD Revenue} = \sum_{i=1}^{n_{\text{sales}}} \text{Amount}_i$ – report vs target.
    • Data Science: time-series model $y_t = \alpha + \beta_1 y_{t-1} + \beta_2 x_t + \varepsilon_t$ to forecast next-quarter sales, incorporate exogenous variables, & quantify confidence intervals (a minimal forecasting sketch follows this list).
  • Workspace implication
    • Data-science projects need agile, experiment-friendly platforms (not monolithic EDWs) to iterate quickly.
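
A minimal sketch of this contrast, using made-up quarterly sales figures: the BI view sums what has already happened, while the data-science view fits a simple lag-1 regression (a least-squares stand-in for the AR-style model above) to project the next quarter.

```python
import numpy as np

# Hypothetical quarterly sales (illustrative numbers only)
sales = np.array([120.0, 135.0, 150.0, 161.0, 175.0, 190.0])

# BI view: aggregate what already happened (quarter-to-date style sum).
qtd_revenue = sales.sum()

# Data-science view: fit y_t = alpha + beta * y_{t-1} by least squares, then forecast one step ahead.
y_prev, y_curr = sales[:-1], sales[1:]
beta, alpha = np.polyfit(y_prev, y_curr, deg=1)   # slope, intercept
next_quarter = alpha + beta * sales[-1]

print(f"QTD revenue: {qtd_revenue:.1f}")
print(f"Forecast for next quarter: {next_quarter:.1f}")
```

A production forecast would add exogenous variables and confidence intervals, but the point of the sketch is the shift from summarizing the past to predicting the future.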

Current Analytical Architecture & Its Pain Points

  • Canonical workflow
    1. Ingest: Raw sources → EDW; data must be normalized, typed, governed.
    2. Departmental Marts: Business units erect mini-warehouses for flexible queries.
    3. Operational BI: Enterprise apps & dashboards consume warehouse feeds.
    4. Downstream Analytics: Only after the above do analysts receive extracts.
  • Resulting constraints
    • High-value data becomes last in line for predictive modeling.
    • Batch ETL ⇒ latency; data scientists work from stale snapshots.
    • In-memory analysis limits sample size → may induce sampling bias, hurting model accuracy.
    • Projects remain siloed, ad hoc, misaligned with enterprise strategy.
  • Strategic implication
    • To scale advanced analytics, firms must modernize architecture (e.g., data lakes, stream processing, cloud object stores + distributed compute) and integrate Data Science outputs back into operational loops.

Ethical, Philosophical & Practical Considerations

  • Privacy vs Personalization
    • Leveraging social posts or genomic data for targeting/prediction raises consent & confidentiality issues.
  • Fairness & Bias
    • Sampling limits in legacy architectures can skew models → unfair or erroneous decisions.
  • Governance
    • Spreadmarts undermine data lineage; centralized yet rigid EDWs protect data integrity but stifle innovation—need balanced policies.
  • Skills convergence
    • The “Data Scientist” role merges statistics, computer science, domain knowledge → cross-functional teams become mandatory.

Numeric & Statistical References

  • Facebook 2012: 700 posts per second.
  • Big-Data row/column magnitudes: >10^9 rows, >10^6 columns.
  • Future data growth: 80–90% from semi-/quasi-/unstructured sources.

Key Take-away Formulas (for recall)

  • Simple churn rate:
    $\text{Churn Rate} = \dfrac{\text{Customers lost in period}}{\text{Customers at start of period}}$
  • Lift in fraud detection (conceptual):
    $\text{Lift} = \dfrac{\text{True positives per \$}}{\text{Expected true positives per \$ under random}}$
  • Velocity metric example for streaming:
    $\text{Events per second} = \dfrac{\text{Total events}}{\text{Time window (s)}}$
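
These formulas translate directly into code. A small sketch with hypothetical inputs:

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Fraction of the starting customer base lost during the period."""
    return customers_lost / customers_at_start

def lift(true_positives_per_dollar: float, random_true_positives_per_dollar: float) -> float:
    """How many times better targeting is than random, per dollar spent."""
    return true_positives_per_dollar / random_true_positives_per_dollar

def events_per_second(total_events: int, window_seconds: float) -> float:
    """Simple streaming velocity metric."""
    return total_events / window_seconds

# Hypothetical numbers, purely for illustration
print(churn_rate(250, 10_000))            # 0.025 -> 2.5% churn
print(lift(0.9, 0.3))                     # 3.0x better than random
print(events_per_second(4_200_000, 60))   # 70,000 events/s
```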

Connections to Prior Knowledge

  • Relational algebra (SQL) handles structured data; Data Science introduces linear algebra, calculus & optimization to model complex relationships.
  • Traditional OLAP cubes apply aggregation functions (SUM, COUNT). Machine learning uses gradient-based methods to minimize a loss $L(\theta)$ and generalize beyond historical aggregates.
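
To make the contrast concrete, here is a minimal gradient-descent sketch that minimizes a toy squared-error loss $L(\theta)$ on synthetic data; the data, learning rate, and iteration count are illustrative choices, not course specifics.

```python
import numpy as np

# Synthetic data: y is roughly 3*x plus noise (for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)

theta = 0.0   # single parameter to learn
lr = 0.1      # learning rate

for _ in range(200):
    # L(theta) = mean squared error; gradient dL/dtheta = 2 * mean((theta*x - y) * x)
    grad = 2.0 * np.mean((theta * x - y) * x)
    theta -= lr * grad

print(f"learned theta ~ {theta:.2f}")   # close to 3.0
```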

Practical Recommendations

  • Invest early in scalable, schema-flexible storage (Hadoop/HDFS, cloud object storage) + distributed engines (Spark, Dask) to sidestep EDW bottlenecks.
  • Build analytic sandboxes colocated with raw & curated data to accelerate experimentation.
  • Foster data-governance frameworks that preserve lineage while permitting iterative science.
  • Combine BI & Data Science: use dashboards for monitoring + ML models for decision automation (e.g., real-time churn predictions feeding CRM actions).
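
As one illustration of the first two recommendations, a minimal PySpark sketch that reads schema-flexible JSON from object storage into a sandbox-style workspace and aggregates it across the cluster. The bucket path and column names (user_id, event_count) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spin up (or attach to) a Spark session for sandbox-style exploration.
spark = SparkSession.builder.appName("sandbox-exploration").getOrCreate()

# Schema-on-read: Spark infers structure from semi-structured JSON in object storage.
# The path and field names below are hypothetical placeholders.
events = spark.read.json("s3a://my-bucket/clickstream/")

# Distributed aggregation: events per user, computed across the cluster rather than in one process.
per_user = events.groupBy("user_id").agg(F.count("*").alias("event_count"))

per_user.orderBy(F.desc("event_count")).show(10)
```

The design point is schema-on-read plus scale-out compute: raw data lands without an upfront EDW schema, and exploration happens next to the data instead of on delayed extracts.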