Foundations of Data Science – Unit I Comprehensive Notes

Data Science: Definition, Benefits, Uses

  • Data
    • Information translated into a form that is efficient for movement or processing.
  • Data Science (DS)
    • Evolutionary extension of statistics.
    • Integrates computer-science methods to handle today’s massive, heterogeneous data.
  • Benefits & Typical Use-Cases
    • Commercial firms: customer insight, process optimisation, product personalisation, cross-/up-sell.
    • Government: internal analytics & open-data initiatives for public value.
    • NGOs: fundraising optimisation, advocacy impact measurement.
    • Universities & MOOCs: learning-analytics to complement traditional classes.
    • Ethical/Practical implication: responsible use of personal data, need for privacy policies (“Chinese walls”).

Facets (Types) of Data

  • Structured Data
    • Fixed fields; fits predefined data model; easily stored in relational tables.
    • Managed and queried via SQL.
  • Unstructured Data
    • Context-specific, irregular; e.g. free-form e-mails, text documents.
  • Natural-Language (NL) Data
    • Sub-class of unstructured data.
    • Challenges: domain dependence; partial success in entity/topic recognition, summarisation, sentiment.
  • Machine-Generated Data
    • Auto-created logs, telemetry, call-detail records.
    • High volume & velocity → needs scalable tools.
  • Graph-Based / Network Data
    • Focus on relationships (nodes, edges, properties).
    • Social-network analysis: influence metrics, shortest paths, community detection (a networkx sketch follows this list).
  • Audio, Image, Video
    • Perceptual data; hard for machines (object recognition, speech).
    • Example: MLBAM captures ≈7 TB video per baseball game; DeepMind RL agent learns from game frames via deep learning.
  • Streaming Data
    • Continuous real-time flow (Twitter trends, live sports, stock ticks).
    • Processed event-by-event instead of in batch loads (a running-mean sketch follows this list).
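
A minimal sketch of graph analysis with the networkx library; the network and the names in it are made up:

```python
import networkx as nx

# Hypothetical friendship network: nodes are people, edges are ties.
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Bob", "Cara"),
                  ("Cara", "Dan"), ("Ann", "Cara")])

print(nx.shortest_path(G, "Ann", "Dan"))   # e.g. ['Ann', 'Cara', 'Dan']
print(nx.degree_centrality(G))             # a simple influence metric per node
```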
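
And a minimal sketch of event-by-event processing: a running mean over a hypothetical tick stream (a real system would read from a live feed such as a socket or message queue):

```python
def running_mean(events):
    """Consume a stream one event at a time, keeping only constant state."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count        # mean updated after every event

# Hypothetical stream of stock ticks.
for mean in running_mean([101.2, 101.5, 100.9, 102.1]):
    print(f"running mean: {mean:.2f}")
```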

The Data Science Process (6 Iterative Steps)

1. Defining Research Goals

  • Craft a Project Charter containing:
    • Clear research goal; mission & context.
    • Planned analysis steps, resources, proofs of concept.
    • Deliverables, success metrics, timeline.
  • Success hinges on deep understanding of WHY, WHAT, HOW.

2. Retrieving Data

  • Internal Sources: databases, data marts, data warehouses (pre-processed), data lakes (raw).
  • External Sources: APIs (Twitter, LinkedIn), government open data, third-party vendors.
  • Barriers: data discovery, access control ("Chinese walls").

3. Data Preparation (Cleansing, Integrating, Transforming)

3.1 Cleansing
  • Goal: true & consistent representation.
  • Common error categories & fixes (a pandas sketch follows this list):
    • Interpretation errors (age > 300 years).
    • Inconsistencies ("Female" vs "F").
    • Data-entry typos detected via frequency tables ("Godo" → "Good").
    • Redundant whitespace: "FR " ≠ "FR" → use strip().
    • Capital-letter mismatch: "Brazil" vs "brazil" → .lower().
    • Impossible values & sanity rule: e.g. $0 \le \text{age} \le 120$.
    • Outliers identified by plots/table of min-max.
    • Missing-value strategies: deletion, imputation, model-based filling.
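
A minimal pandas sketch of several of these fixes; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data exhibiting the error categories above.
df = pd.DataFrame({
    "country": ["FR ", "brazil", "Brazil", "FR"],
    "gender":  ["Female", "F", "F", "Female"],
    "age":     [34.0, 29.0, 300.0, 41.0],              # 300 violates the sanity rule
})

df["country"] = df["country"].str.strip().str.lower()  # whitespace + capitalisation
df["gender"]  = df["gender"].replace({"Female": "F"})  # resolve inconsistent coding
df.loc[~df["age"].between(0, 120), "age"] = np.nan     # sanity rule: 0 <= age <= 120
print(df["country"].value_counts())                    # frequency table exposes typos
df["age"] = df["age"].fillna(df["age"].median())       # simple imputation
```
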
3.2 Integrating Data
  • Joining (horizontal merge) on keys / primary keys.
  • Appending/Stacking (vertical union) requires identical column structure (a sketch of both operations follows).
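
A pandas sketch of both operations; the tables and key are made up:

```python
import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "name": ["Ann", "Bob"]})
orders  = pd.DataFrame({"client_id": [1, 1, 2], "amount": [10, 25, 7]})

# Joining: horizontal merge on the shared key.
joined = clients.merge(orders, on="client_id", how="inner")

# Appending/stacking: vertical union of tables with identical columns.
jan = pd.DataFrame({"client_id": [1], "amount": [10]})
feb = pd.DataFrame({"client_id": [2], "amount": [7]})
stacked = pd.concat([jan, feb], ignore_index=True)
```
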
3.3 Transforming Data
  • Non-linear relationships: e.g. $y = ae^{bx}$ → log-transform.
  • Feature engineering: combine variables, create interaction terms.
  • Dimensionality reduction (PCA): component1 (27.8 %) + component2 (22.8 %) → 50.6 % variance retained.
  • Dummy/indicator variables: Weekdays → Monday…Sunday (0/1); a sketch of these transforms follows.
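
A sketch of the transformations above, assuming pandas and scikit-learn; the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

x = np.arange(1, 8, dtype=float)
df = pd.DataFrame({
    "x": x,
    "y": 2.0 * np.exp(0.5 * x),   # synthetic y = a * e^(b*x)
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
})

df["log_y"] = np.log(df["y"])     # log-transform makes the relation linear in x

dummies = pd.get_dummies(df["day"])            # one 0/1 indicator column per weekday

pca = PCA(n_components=2).fit(df[["x", "log_y"]])
print(pca.explained_variance_ratio_)           # fraction of variance per component
```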

4. Exploratory Data Analysis (EDA)

  • Visual & descriptive deep dive; still iterative with Step 3.
  • Graph types: bar, line, histogram, KDE, Sankey, network diagrams, interactive/animated composites (a plotting sketch follows this list).
  • Complementary non-visual techniques: clustering, summary stats, simple prototype models.
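
A minimal matplotlib sketch of two of the listed graph types, using synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(50, 10, 500)   # synthetic measurements
counts = {"A": 120, "B": 80, "C": 45}                    # hypothetical category counts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)                     # histogram: distribution shape, outliers
ax1.set_title("Histogram")
ax2.bar(list(counts), list(counts.values()))  # bar chart: counts per category
ax2.set_title("Bar chart")
plt.show()
```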

5. Model Building (Data Modeling)

  • Purpose-driven, iterative.
5.1 Model & Variable Selection
  • Criteria: performance, ease of production deployment, maintainability, interpretability.
5.2 Execution
  • Leverage libraries (Python StatsModels, Scikit-learn, etc.).
  • Example: linear regression; the original snippet is not reproduced here, but a minimal sketch follows.
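
A minimal StatsModels ordinary-least-squares sketch on synthetic data (not the book's original snippet):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, 100)   # synthetic linear data with noise

X = sm.add_constant(x)            # add an intercept column
model = sm.OLS(y, X).fit()        # ordinary least squares fit
print(model.summary())            # coefficients, R^2, residual diagnostics
```
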
5.3 Diagnostics & Comparison
  • Train–test (hold-out) split (e.g. 80 % vs 20 %).
  • Evaluate via error metrics, e.g. mean square error: $\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ (a split-and-score sketch follows this list).
  • Choose lowest-error model; verify assumptions (independence, homoscedasticity, etc.).
  • Insight: ensembles of simple models often beat one complex model.
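
A sketch of the hold-out split and error scoring described above, using scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, 200)

# 80/20 hold-out split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))  # (1/n) * sum((y - y_hat)^2)
print(f"hold-out MSE: {mse:.3f}")
```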

6. Presenting Findings & Building Applications

  • Translate insights into business impact; convince stakeholders.
  • Automate where repetitive: scheduled scoring, auto-updated dashboards, report generation.
  • Critical soft skills: storytelling, visualisation, change-management.

Data Mining

  • Definition: mathematical discovery of actionable patterns/trends in large data that normal exploration can’t reveal.
  • Typical Scenarios
    • Forecasting (sales, server load).
    • Risk & probability (targeted marketing, break-even analysis).
    • Recommendations (product bundling).
    • Sequencing (shopping cart analysis, next-event prediction).
    • Grouping (clustering customers/events).
  • 6-Step Mining Process
    1. Problem definition (business requirements, scope, evaluation metrics).
    2. Data preparation (consolidate, clean).
    3. Data exploration (descriptives, distribution checks).
    4. Model building (create mining structures, aggregates).
    5. Exploration & validation (compare multiple configs, test accuracy).
    6. Deployment & updating (prediction services, content queries, integration into apps, continuous improvement).
  • Microsoft SQL Server Ecosystem
    • Analysis Services for mining structures/models.
    • AMO for programmatic management.
    • Integration Services for intelligent ETL routing.

Data Warehousing

  • Definition: integrating heterogeneous data into a central repository for analytical reporting & decision support.
  • Key Characteristics
    • Subject-oriented (sales, inventory).
    • Integrated (consistent naming/format/coding).
    • Non-volatile (read-only history retained).
    • Time-variant (records tagged with date/period).
  • Database vs Data Warehouse
    • DB: supports transactional, real-time CRUD.
    • DW: supports large-scale analytical queries; historical snapshots.
  • Three-Tier Architecture
    • Bottom Tier: DW server (relational DB); backend ETL tools cleanse/transform/load.
    • Middle Tier: OLAP server, implemented as either
      • ROLAP (extended relational OLAP) or
      • MOLAP (native multidimensional OLAP).
    • Top Tier: front-end client (query, analysis, reporting, mining tools).
  • How It Works
    • Integrates multi-source data (POS, mailing list, web, HR, etc.).
    • Enables downstream mining & advanced analytics.
  • Types of DW
    • Enterprise Data Warehouse (EDW): organisation-wide, cross-functional view.
    • Operational Data Store (ODS): near-real-time; routine operational queries.
    • Data Mart: department/region specific subset; feeds into EDW via ODS.

Basic Statistical Descriptions (Quick Recap)

  • Central tendency: mean, median, mode.
  • Dispersion: variance $\sigma^2$, standard deviation $\sigma$, range, IQR.
  • Shape: skewness, kurtosis.
  • Association: covariance, correlation $\rho_{xy}$.
  • Practical note: these summaries underpin EDA graphs & sanity checks; the standard formulas are given below.
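
For reference, the standard formulas for a data set $x_1, \dots, x_n$ (variance in its population form; the sample version divides by $n-1$):

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\rho_{xy} = \frac{\operatorname{cov}(x,y)}{\sigma_x\,\sigma_y}
$$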

Summary of Unit I Workflow

  • Step 1 – Research Goal: articulate value & scope.
  • Step 2 – Retrieve Data: internal + external; surmount access barriers.
  • Step 3 – Prepare Data: cleanse, integrate, transform.
  • Step 4 – Explore: visual & statistical deep dive.
  • Step 5 – Model: machine-learning/statistical methods to meet goal.
  • Step 6 – Present & Automate: communicate insight, embed into operations.
  • Foundational infrastructures: Data Warehouses for storage; Data Mining for pattern extraction.