Big Data Overview, Data Structures & Modern Analytics
Module Objectives
- By the end of this lesson, learners should be able to:
• Discuss what Big Data is and enumerate its defining characteristics.
• Differentiate the spectrum of data structures (structured, semi-structured, quasi-structured, unstructured).
• Identify and compare the principal data-storage repositories used by analysts (spreadmarts, data warehouses, analytic sandboxes).
• Examine the current state-of-practice in analytics and where opportunities for improvement lie.
• Distinguish Business Intelligence (BI) from Data Science along several analytical dimensions.
• Diagnose the weaknesses of today’s analytical architectures and explain why new platforms are demanded by modern data-science work.
Big Data: Core Ideas & Significance
- Continuous, exponential data creation
• Mobile phones, social media, medical imaging, and countless IoT devices generate data around the clock.
• Merely storing the influx is hard; extracting signal from it is harder, but doing so transforms business, government, science, and day-to-day life.
- Industry trailblazers
• Credit-card networks process billions of transactions to detect fraud via rule-based & machine-learning models.
• Telcos mine call-detail records to spot churn risk (e.g., customer’s friends switch carriers ⇒ offer a retention promo proactively).
• Data-native firms (LinkedIn, Facebook) treat their data as the product; company valuation grows with dataset size & richness.
- The three canonical “V’s”
• Volume – scale moves from 10^3–10^6 rows to 10^9 rows × 10^6 columns.
• Variety – heterogeneous formats: text, images, logs, genomic sequences, sensor streams, etc.
• Velocity – rapid ingestion + near real-time analytics; batch windows shrink from hours to seconds.
- Which “V” matters most?
• Although media fixates on size, variety and velocity usually drive the need for novel tools.
• Traditional RDBMSs choke either on schema-less data or on sub-second latency demands.
- Formal definition (McKinsey 2011)
• “Big Data is data whose scale, distribution, diversity, and/or timeliness require new technical architectures and analytics to enable insights that unlock new sources of business value.”
• Implication: success blends architecture + tooling + cross-disciplinary skill → the modern Data Scientist.
- Drivers of the data deluge: mobile, sensors, social media, video surveillance & rendering, smart grids, seismic exploration, medical imaging, gene sequencing.
• Facebook 2012 statistic: 700 status updates/second—each is fodder for ad targeting.
• Genomics: plummeting sequencing costs yield petabytes of nucleotide data; personalized medicine emerges (predictive prescriptions & early risk mitigation).
Data Structures: Taxonomy & Processing Needs
- 80–90 % of future growth is not in classic rows × columns → novel processing paradigms required.
- Distributed compute & MPP architectures dominate because they
• Parallelize ingest/ETL of messy inputs.
• Enable scale-out analytics across many nodes.
- Four structural classes (a parsing sketch follows this list)
- Structured Data
• Rigid schema: relational DB tables, OLAP cubes, CSV, spreadsheets.
• Optimized for SQL, indexing, ACID transactions.
- Semi-structured Data
• Self-describing markup (XML, JSON) – recognizable tags facilitate parsing, yet may nest arbitrarily.
- Quasi-structured Data
• Erratic but parse-able with effort (web clickstreams). Example: three consecutive URLs tied to one keyword reconstruct a user interest trail.
- Unstructured Data
• No inherent schema: free-text docs, PDFs, images, audio, video.
• NLP, computer vision, or deep learning required to extract features.
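To make the taxonomy concrete, here is a minimal Python sketch (standard library only) contrasting how much parsing effort each class demands; every record, field name, and URL below is invented for illustration:

```python
import json
import re

# Structured: the same event would simply be a row in a SQL table
# with a declared schema (user, action, amount).

# Semi-structured: self-describing JSON; named tags make parsing trivial.
record = '{"user": "u42", "action": "purchase", "amount": 19.99}'
event = json.loads(record)
print(event["user"], event["action"], event["amount"])

# Quasi-structured: a raw clickstream line is parse-able, but only with
# effort -- a regex must impose structure the data never declared.
log_line = "2024-05-01T12:00:03 u42 GET /search?q=running+shoes"
match = re.match(r"(\S+) (\S+) GET (\S+)", log_line)
if match:
    timestamp, user, url = match.groups()
    print(user, "visited", url, "at", timestamp)

# Unstructured: free text carries no schema at all; extracting even one
# feature (the product mentioned) already calls for NLP.
review = "Great running shoes, but delivery took two weeks."
```

The further along the spectrum, the more of the schema lives in the analyst’s code rather than in the data itself.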
Data Repositories: Evolution & Characteristics
- Spreadmarts (spreadsheets + small data marts)
• Ad hoc, desktop-centric, low-volume record keeping; analysts pull manual extracts → version-control chaos & poor governance.
- Data Warehouses (DW/EDW)
• Central, purpose-built, normalized storage; supports BI & standardized reporting.
• Strengths: security, backup, fail-over, “single source of truth.”
• Limitation: rigid schema restrains exploratory or compute-heavy analytics.
- Analytic Sandboxes / Workspaces
• Collate assets from many sources; live outside production constraints.
• Enable flexible, high-performance, in-database or cluster-based experimentation.
• Blueprint for modern data-science notebooks & feature-engineering pipelines.
State of Practice in Analytics: Opportunities & Challenges
- Traditional business problems—customer churn, up-selling, cross-selling—are now attacked with advanced analytics + Big Data, yielding bigger lifts than legacy methods.
- Four broad problem categories where analytics confers competitive edge:
- Revenue growth & sales optimization.
- Operational efficiency & process optimization.
- Risk mitigation, fraud detection, AML compliance.
- Regulatory & audit adherence (ever-expanding data & complexity).
- Compliance insight: Laws (e.g., AML) introduce high-dimensional patterns that only advanced statistical or machine-learning models can unravel.
Business Intelligence (BI) vs Data Science
- Time horizon & Question framing
• BI = hindsight & limited insight → “What happened? When? Where?”
• Data Science = insight & foresight → “How/Why did it happen? What will happen next?”
- Data requirements
• BI demands highly structured, aggregated tables.
• Data Science absorbs disaggregated, multi-modal, large/odd datasets.
- Analytical tooling
• BI delivers static or interactive dashboards, KPIs, OLAP cubes.
• Data Science employs statistics, machine learning, optimization, simulation.
- Example contrast
• BI: QTD Revenue $= \sum_{i=1}^{n} \text{salesAmount}_i$, reported against target.
• Data Science: time-series model $y_t = \alpha + \beta_1 y_{t-1} + \beta_2 x_t + \varepsilon_t$ to forecast next-quarter sales, incorporate exogenous variables, and quantify confidence intervals (a fitting sketch follows this list).
- Workspace implication
• Data-science projects need agile, experiment-friendly platforms (not monolithic EDWs) to iterate quickly.
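As a tiny illustration of the Data Science side of the contrast above, this Python sketch fits the lagged model $y_t = \alpha + \beta_1 y_{t-1} + \beta_2 x_t + \varepsilon_t$ by ordinary least squares; the quarterly figures are invented, and a real forecast would add diagnostics and interval estimates:

```python
import numpy as np

# Invented quarterly data: y = sales, x = an exogenous driver
# (e.g., marketing spend).
y = np.array([10.2, 11.1, 12.0, 12.8, 13.9, 14.7, 15.8, 16.9])
x = np.array([1.0, 1.2, 1.1, 1.4, 1.5, 1.4, 1.7, 1.8])

# Regress y_t on [1, y_{t-1}, x_t]: drop the first observation so the
# lagged series lines up.
Y = y[1:]
X = np.column_stack([np.ones(len(Y)), y[:-1], x[1:]])

# Ordinary least squares estimates of (alpha, beta1, beta2).
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha, beta1, beta2 = coef

# One-step-ahead forecast, assuming next quarter's driver is planned.
x_next = 1.9
y_next = alpha + beta1 * y[-1] + beta2 * x_next
print(f"forecast next-quarter sales: {y_next:.2f}")
```

Note how the BI number (the QTD sum) needs only aggregation, while the forecast requires a fitted model plus an assumption about the exogenous input.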
Current Analytical Architecture & Its Pain Points
- Canonical workflow
- Ingest: Raw sources → EDW; data must be normalized, typed, governed.
- Departmental Marts: Business units erect mini-warehouses for flexible queries.
- Operational BI: Enterprise apps & dashboards consume warehouse feeds.
- Downstream Analytics: Only after the above do analysts receive extracts.
- Resulting constraints
• High-value data becomes last in line for predictive modeling.
• Batch ETL ⇒ latency; data scientists work from stale snapshots.
• In-memory analysis limits sample size → may induce sampling bias, hurting model accuracy.
• Projects remain siloed, ad hoc, misaligned with enterprise strategy.
- Strategic implication
• To scale advanced analytics, firms must modernize architecture (e.g., data lakes, stream processing, cloud object stores + distributed compute) and integrate Data Science outputs back into operational loops.
Ethical, Philosophical & Practical Considerations
- Privacy vs Personalization
• Leveraging social posts or genomic data for targeting/prediction raises consent & confidentiality issues.
- Fairness & Bias
• Sampling limits in legacy architectures can skew models → unfair or erroneous decisions.
- Governance
• Spreadmarts undermine data lineage; centralized yet rigid EDWs protect data integrity but stifle innovation, so balanced policies are needed.
- Skills convergence
• The “Data Scientist” role merges statistics, computer science, domain knowledge → cross-functional teams become mandatory.
Numeric & Statistical References
- Facebook 2012: 700 posts per second.
- Big-Data row/column magnitudes: >10^9 rows, >10^6 columns.
- Future data growth: 80–90% from semi-/quasi-/unstructured sources.
- Simple churn rate:
  $\text{Churn Rate} = \dfrac{\text{Customers lost in period}}{\text{Customers at start of period}}$
- Lift in fraud detection (conceptual):
  $\text{Lift} = \dfrac{\text{True Positives per \$}}{\text{Expected True Positives per \$ under random}}$
- Velocity metric example for streaming:
  $\text{Events per second} = \dfrac{\text{Total events}}{\text{Time window (s)}}$
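A quick numeric sanity check of the three formulas in Python, with every count invented for illustration:

```python
# Churn rate: fraction of the starting customer base lost in the period.
customers_start = 10_000
customers_lost = 250
churn_rate = customers_lost / customers_start        # 0.025 -> 2.5%

# Lift: how much better the model spends a review dollar than random.
tp_per_dollar_model = 0.8      # fraud cases caught per $ with the model
tp_per_dollar_random = 0.1     # expected catch rate picking cases at random
lift = tp_per_dollar_model / tp_per_dollar_random    # 8.0x

# Streaming velocity: events per second over a one-hour window.
total_events = 4_200_000
window_seconds = 3_600
events_per_second = total_events / window_seconds    # ~1167 events/s

print(churn_rate, lift, round(events_per_second))
```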
Connections to Prior Knowledge
- Relational algebra (SQL) handles structured data; Data Science introduces linear algebra, calculus & optimization to model complex relationships.
- Traditional OLAP cubes apply aggregation functions (SUM, COUNT). Machine-learning uses gradient-based methods to minimize loss L(θ) and generalize beyond historical aggregates.
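The contrast fits in a few lines of Python: the OLAP-style answer is one aggregation pass, while the machine-learning answer iteratively minimizes a loss $L(\theta)$; the one-parameter model and all data points are invented for illustration:

```python
import numpy as np

# OLAP-style aggregation: a single SUM over historical rows.
sales = np.array([3.0, 4.5, 5.1, 6.2])
total = sales.sum()

# Gradient descent: minimize L(theta) = mean squared error of the model
# y ~ theta * x, i.e. fit a relationship instead of summarizing history.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])     # invented observations

theta, lr = 0.0, 0.01
for _ in range(500):
    grad = 2 * np.mean((theta * x - y) * x)   # dL/dtheta for squared error
    theta -= lr * grad

print(total, theta)   # theta settles near 2, ready to predict unseen x
```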
Practical Recommendations
- Invest early in scalable, schema-flexible storage (Hadoop/HDFS, cloud object storage) + distributed engines (Spark, Dask) to sidestep EDW bottlenecks (a minimal sketch follows this list).
- Build analytic sandboxes colocated with raw & curated data to accelerate experimentation.
- Foster data-governance frameworks that preserve lineage while permitting iterative science.
- Combine BI & Data Science: use dashboards for monitoring + ML models for decision automation (e.g., real-time churn predictions feeding CRM actions).
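A minimal sketch of the first recommendation, assuming a local pyspark installation; the bucket path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sandbox-sketch").getOrCreate()

# Read semi-structured JSON straight from flexible object storage; no
# EDW schema negotiation has to happen before exploration starts.
events = spark.read.json("s3a://example-bucket/raw/clickstream/")

# Distributed aggregation across the cluster (or local cores): daily
# event counts, assuming records carry a parseable "timestamp" field.
daily = (events
         .groupBy(F.to_date("timestamp").alias("day"))
         .agg(F.count("*").alias("events")))

daily.show()
spark.stop()
```

The same shape of job runs unchanged from a laptop to a cluster, which is the sandbox-next-to-the-data pattern the list recommends.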