Foundations of Data Science – Unit I Comprehensive Notes
Data Science: Definition, Benefits, Uses
- Data
- Information translated into a form that is efficient for movement or processing.
- Data Science (DS)
- Evolutionary extension of statistics.
- Integrates computer-science methods to handle today’s massive, heterogeneous data.
- Benefits & Typical Use-Cases
- Commercial firms: customer insight, process optimisation, product personalisation, cross-/up-sell.
- Government: internal analytics & open-data initiatives for public value.
- NGOs: fundraising optimisation, advocacy impact measurement.
- Universities & MOOCs: learning-analytics to complement traditional classes.
- Ethical/practical implication: responsible use of personal data; need for privacy policies and internal barriers (“Chinese walls”).
Facets (Types) of Data
- Structured Data
- Fixed fields; fits predefined data model; easily stored in relational tables.
- Managed and queried via SQL.
- Unstructured Data
- Context-specific, irregular; e.g. free-form e-mails, text documents.
- Natural-Language (NL) Data
- Sub-class of unstructured data.
- Challenges: domain dependence; partial success in entity/topic recognition, summarisation, sentiment.
- Machine-Generated Data
- Auto-created logs, telemetry, call-detail records.
- High volume & velocity → needs scalable tools.
- Graph-Based / Network Data
- Focus on relationships (nodes, edges, properties).
- Social-network analysis: influence metrics, shortest paths, community detection (see the sketch after this list).
- Audio, Image, Video
- Perceptual data; hard for machines (object recognition, speech).
- Example: MLBAM captures ≈7 TB video per baseball game; DeepMind RL agent learns from game frames via deep learning.
- Streaming Data
- Continuous real-time flow (Twitter trends, live sports, stock ticks).
- Processed event-by-event instead of batch loads.
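A minimal sketch of graph-based analysis in Python, assuming the networkx library; the names and edges are illustrative toy data, not from the notes:

```python
# Toy social network: shortest paths and a simple influence metric.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Carol"), ("Carol", "Dave"),
    ("Ann", "Carol"), ("Dave", "Eve"),
])

# Shortest path between two members of the network.
print(nx.shortest_path(G, "Ann", "Eve"))   # ['Ann', 'Carol', 'Dave', 'Eve']

# Degree centrality as a crude influence metric: who is best connected?
print(nx.degree_centrality(G))
```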
The Data Science Process (6 Iterative Steps)
1. Defining Research Goals
- Craft a Project Charter containing:
- Clear research goal; mission & context.
- Planned analysis steps, resources, proofs of concept.
- Deliverables, success metrics, timeline.
- Success hinges on a deep understanding of the WHY, WHAT, and HOW of the project.
2. Retrieving Data
- Internal Sources: databases, data marts, data warehouses (pre-processed), data lakes (raw).
- External Sources: APIs (Twitter, LinkedIn), government open data, third-party vendors.
- Barriers: data discovery, access control ("Chinese walls").
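A minimal retrieval sketch using the requests library; the endpoint URL is a placeholder for any open-data JSON API, not a real service:

```python
# Pull JSON records from an external open-data API (placeholder URL).
import requests

resp = requests.get("https://example.org/open-data/api/records", timeout=10)
resp.raise_for_status()        # fail loudly on HTTP errors
records = resp.json()          # parse the JSON payload into Python objects
print(f"Retrieved {len(records)} records")
```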
3. Data Preparation ("ETL+T": cleanse, integrate, transform)
3.1 Cleansing
- Goal: true & consistent representation.
- Common error categories & fixes:
- Interpretation errors (age > 300 years).
- Inconsistencies ("Female" vs "F").
- Data-entry typos detected via frequency tables ("Godo" → "Good").
- Redundant whitespace: "FR " ≠ "FR" → use strip().
- Capital-letter mismatch: "Brazil" vs "brazil" → use .lower().
- Impossible values & sanity rule: 0 ≤ age ≤ 120.
- Outliers identified by plots/table of min-max.
- Missing-value strategies: deletion, imputation, model-based filling.
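A combined cleansing sketch in pandas covering the fixes above; column names and example values are illustrative, not from the notes:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["FR ", "brazil", "FR", "Brazil"],
    "gender":  ["Female", "F", "F", "Female"],
    "age":     [34, 345, 28, None],
})

df["country"] = df["country"].str.strip().str.lower()  # whitespace + case
df["gender"] = df["gender"].replace({"Female": "F"})   # inconsistent coding
df.loc[~df["age"].between(0, 120), "age"] = None       # sanity rule 0 ≤ age ≤ 120
df["age"] = df["age"].fillna(df["age"].median())       # simple imputation

print(df["country"].value_counts())  # frequency table flags surviving typos
```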
3.2 Integrating Data
- Joining (horizontal merge) on keys / primary keys.
- Appending/Stacking (vertical union) requires identical column structure.
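A minimal sketch of both integration patterns with pandas; table and column names are illustrative:

```python
import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "name": ["Acme", "Globex"]})
orders  = pd.DataFrame({"client_id": [1, 1, 2], "amount": [250, 80, 430]})

# Joining: horizontal merge on a key column.
joined = orders.merge(clients, on="client_id", how="left")

# Appending/stacking: vertical union of tables with identical columns.
extra   = pd.DataFrame({"client_id": [2], "amount": [99]})
stacked = pd.concat([orders, extra], ignore_index=True)

print(joined, stacked, sep="\n\n")
```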
3.3 Transforming Data
- Non-linear relationships: e.g. y = a·e^(bx) → log-transform, ln y = ln a + bx, which is linear in x (sketched after this list).
- Feature engineering: combine variables, create interaction terms.
- Dimensionality reduction (PCA): e.g. component 1 (27.8 %) + component 2 (22.8 %) → 50.6 % of variance retained.
- Dummy/indicator variables: Weekdays → Monday…Sunday (0/1).
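A minimal transformation sketch on synthetic data, covering the log-transform, PCA, and dummy variables:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Log-transform: y = a·e^(bx) becomes ln y = ln a + bx, linear in x.
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(0.0, 0.05, 50)
b, ln_a = np.polyfit(x, np.log(y), 1)       # recovers b ≈ 0.3, ln a ≈ ln 2

# PCA: keep the components that retain the most variance.
X = rng.normal(size=(100, 5))
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # variance retained per component

# Dummy/indicator variables: one 0/1 column per category.
days = pd.Series(["Mon", "Tue", "Mon", "Sun"], name="weekday")
print(pd.get_dummies(days))
```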
4. Exploratory Data Analysis (EDA)
- Visual & descriptive deep dive; still iterative with Step 3.
- Graph types: bar, line, histogram, KDE, Sankey, network diagrams, interactive/animated composites.
- Complementary non-visual techniques: clustering, summary stats, simple prototype models.
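A minimal EDA sketch on synthetic data, pairing one visual and one non-visual technique (pandas and matplotlib assumed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sales": np.random.default_rng(1).gamma(2.0, 100.0, 365)})

print(df.describe())                       # non-visual: summary statistics

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
df["sales"].plot.hist(bins=30, ax=ax1, title="Distribution")
df["sales"].rolling(30).mean().plot(ax=ax2, title="30-day moving average")
plt.tight_layout()
plt.show()
```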
5. Model Building (Data Modeling)
- Purpose-driven, iterative.
5.1 Model & Variable Selection
- Criteria: performance, ease of production deployment, maintainability, interpretability.
5.2 Execution
- Leverage libraries (Python StatsModels, Scikit-learn, etc.).
- Example: linear regression (original snippet not reproduced; a minimal stand-in sketch follows below).
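A minimal stand-in sketch (not the original snippet) of an ordinary-least-squares fit with StatsModels on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)         # add the intercept column
model = sm.OLS(y, X_const).fit()     # ordinary least squares fit
print(model.summary())               # coefficients, R², diagnostics
```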
5.3 Diagnostics & Comparison
- Train–test (hold-out) split (e.g. 80 % vs 20 %).
- Evaluate via error metrics, e.g. mean squared error: MSE = (1/n) · Σ (yᵢ − ŷᵢ)², summed over the n test points.
- Choose lowest-error model; verify assumptions (independence, homoscedasticity, etc.).
- Insight: ensembles of simple models often beat one complex model.
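A minimal hold-out evaluation sketch with scikit-learn: an 80/20 split, then MSE on the held-out 20 % (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

# 80 % train / 20 % test hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"hold-out MSE: {mse:.4f}")
```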
6. Presenting Findings & Building Applications
- Translate insights into business impact; convince stakeholders.
- Automate where repetitive: scheduled scoring, auto-updated dashboards, report generation.
- Critical soft skills: storytelling, visualisation, change-management.
Data Mining
- Definition: the mathematical discovery of actionable patterns/trends in large data sets that ordinary exploration can’t reveal.
- Typical Scenarios
- Forecasting (sales, server load).
- Risk & probability (targeted marketing, break-even analysis).
- Recommendations (product bundling).
- Sequencing (shopping cart analysis, next-event prediction).
- Grouping (clustering customers/events).
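A minimal grouping sketch: k-means clustering of synthetic customer features (spend, visits) with scikit-learn; the segments are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
customers = np.vstack([
    rng.normal([100, 2], [10, 0.5], size=(50, 2)),   # low-spend segment
    rng.normal([500, 8], [50, 1.0], size=(50, 2)),   # high-spend segment
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.cluster_centers_)   # one centre per discovered segment
```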
- 6-Step Mining Process
- Problem definition (business requirements, scope, evaluation metrics).
- Data preparation (consolidate, clean).
- Data exploration (descriptives, distribution checks).
- Model building (create mining structures, aggregates).
- Exploration & validation (compare multiple configs, test accuracy).
- Deployment & updating (prediction services, content queries, integration into apps, continuous improvement).
- Microsoft SQL Server Ecosystem
- Analysis Services for mining structures/models.
- AMO (Analysis Management Objects) for programmatic management.
- Integration Services for intelligent ETL routing.
Data Warehousing
- Definition: integrating heterogeneous data into a central repository for analytical reporting & decision support.
- Key Characteristics
- Subject-oriented (sales, inventory).
- Integrated (consistent naming/format/coding).
- Non-volatile (read-only history retained).
- Time-variant (records tagged with date/period).
- Database vs Data Warehouse
- DB: supports transactional, real-time CRUD.
- DW: supports large-scale analytical queries; historical snapshots.
- Three-Tier Architecture
- Bottom Tier: DW server (relational DB); backend ETL tools cleanse/transform/load.
- Middle Tier: OLAP server, implemented as either ROLAP (extended relational model) or MOLAP (native multidimensional storage).
- Top Tier: front-end client (query, analysis, reporting, mining tools).
- How It Works
- Integrates multi-source data (POS, mailing list, web, HR, etc.).
- Enables downstream mining & advanced analytics.
- Types of DW
- Enterprise Data Warehouse (EDW): organisation-wide, cross-functional view.
- Operational Data Store (ODS): near-real-time; routine operational queries.
- Data Mart: department/region specific subset; feeds into EDW via ODS.
Basic Statistical Descriptions (Quick Recap)
- Central tendency: mean, median, mode.
- Dispersion: variance σ², standard deviation σ, range, IQR.
- Shape: skewness, kurtosis.
- Association: covariance, correlation ρ_xy.
- Practical note: these summaries underpin EDA graphs & sanity checks.
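A minimal sketch computing these summaries with pandas on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
s = pd.Series(rng.normal(50, 10, 1000).round())   # rounded so the mode is meaningful

print(s.mean(), s.median(), s.mode().iloc[0])     # central tendency
print(s.var(), s.std(), s.max() - s.min())        # dispersion: σ², σ, range
print(s.quantile(0.75) - s.quantile(0.25))        # IQR
print(s.skew(), s.kurt())                         # shape
print(s.corr(2 * s + rng.normal(0, 1, 1000)))     # correlation ρ ≈ 1
```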
Summary of Unit I Workflow
- Step 1 – Research Goal: articulate value & scope.
- Step 2 – Retrieve Data: internal + external; surmount access barriers.
- Step 3 – Prepare Data: cleanse, integrate, transform.
- Step 4 – Explore: visual & statistical deep dive.
- Step 5 – Model: machine-learning/statistical methods to meet goal.
- Step 6 – Present & Automate: communicate insight, embed into operations.
- Foundational infrastructures: Data Warehouses for storage; Data Mining for pattern extraction.