Introduction to Data Science

Definition & Scope

  • Multidisciplinary field using scientific methods, processes, algorithms & systems to extract knowledge from data
  • Handles both structured & unstructured data; tasks: collect, clean, analyze, interpret
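
The four tasks above (collect, clean, analyze, interpret) can be sketched as a tiny pipeline. This is only an illustration: the records, field names and helper functions are hypothetical and do not come from any particular dataset or library.

```python
from statistics import mean

# "Collect": hypothetical raw records, as they might arrive from a form or log.
raw_records = [
    {"city": "Pune", "temp_c": "31.5"},
    {"city": "pune ", "temp_c": "29.0"},   # inconsistent spelling and spacing
    {"city": "Delhi", "temp_c": ""},       # missing reading
    {"city": "Delhi", "temp_c": "38.0"},
]

def clean(records):
    """Normalize city names and drop rows with a missing temperature."""
    cleaned = []
    for row in records:
        if not row["temp_c"].strip():
            continue
        cleaned.append({"city": row["city"].strip().title(),
                        "temp_c": float(row["temp_c"])})
    return cleaned

def analyze(records):
    """Compute the average temperature per city."""
    by_city = {}
    for row in records:
        by_city.setdefault(row["city"], []).append(row["temp_c"])
    return {city: mean(temps) for city, temps in by_city.items()}

def interpret(averages):
    """Turn the numbers into a one-line finding."""
    hottest = max(averages, key=averages.get)
    return f"{hottest} is the warmest city on average ({averages[hottest]:.1f} °C)"

averages = analyze(clean(raw_records))
print(interpret(averages))
```

Real projects usually spend most of their effort in the clean step; here it is reduced to two rules so the flow stays visible.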

Big Data Characteristics

  • 3 Vs: Volume (terabytes, records, transactions), Variety (structured, semi-structured, unstructured), Velocity (batch, near-time, real-time, streams)
  • Traditional RDBMS alone is not adequate at this scale; ML & algorithmic approaches are required
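
To make the Velocity dimension concrete, the sketch below contrasts a batch computation with a streaming one for the same statistic (a mean). The readings are made-up values and the class name RunningMean is an illustrative assumption, not part of any library.

```python
def batch_mean(values):
    """Batch: the whole dataset is available before computing."""
    return sum(values) / len(values)

class RunningMean:
    """Stream: update the mean one record at a time without storing history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: shift the mean toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean

readings = [12.0, 15.0, 9.0, 14.0]  # hypothetical sensor values

stream = RunningMean()
for r in readings:
    stream.update(r)  # each value is seen once, then discarded

print(batch_mean(readings), stream.mean)
```

Both approaches reach the same answer, but the streaming version needs constant memory, which matters when records arrive faster than they can be stored.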

Historical Milestones

  • 1960 – term used as an alternative name for computer science (Peter Naur)
  • 1974 – Naur’s review includes “Data Science”
  • 1996 – IFCS conference title features Data Science
  • 1997 – C.F. Jeff Wu lecture: statistics as data science

Motivations & Benefits

  • Deeper client understanding, storytelling, cross-industry applicability
  • Supports decision-making in travel, healthcare, education, retail, etc.
  • Drives product success/failure via big-data insights
  • Enables cross-sell, up-sell, personalization, e-governance
  • Benefits: discover patterns, innovate products, real-time optimization

Analytical Categories

  • Descriptive: what happened; visuals (pie, bar, line)
  • Diagnostic: why it happened; drill-down, correlations
  • Predictive: forecast future; ML, modeling
  • Prescriptive: recommend best action; simulations, recommendation engines
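
The four categories can be walked through on one toy dataset. The sketch below is a minimal illustration with made-up advertising and sales figures; the helper names (pearson, fit_line), the candidate budgets and the budget cap are all assumptions chosen for the example.

```python
from statistics import mean

# Hypothetical monthly figures, in thousands.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 44, 67, 84, 105]

# Descriptive: what happened?
total_sales = sum(sales)
avg_sales = mean(sales)

# Diagnostic: why? Check how strongly sales move with ad spend.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ad_spend, sales)

# Predictive: what is likely next? Fit a least-squares line and extrapolate.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 60 + intercept  # predicted sales at an ad spend of 60

# Prescriptive: what should we do? Pick the affordable budget with the
# best predicted net return (predicted sales minus spend).
candidate_budgets = [20, 40, 60]
budget_cap = 50
affordable = [b for b in candidate_budgets if b <= budget_cap]
best_budget = max(affordable, key=lambda b: slope * b + intercept - b)

print(total_sales, round(r, 3), forecast, best_budget)
```

Each step reuses the output of the previous one, which mirrors how the categories build on each other in practice. Note that a high correlation in the diagnostic step does not by itself establish that ad spend causes sales.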

Common Applications

  • NLP: spam filters, algorithmic trading, Q&A systems, summarization
  • IoT data streams, personalized ads, quantitative trading, people analytics

Key Technologies

  • Artificial Intelligence / ML
  • Cloud Computing for scalable processing
  • Internet of Things for data generation
  • Quantum Computing for complex algorithms

Professional Roles

  • Data Scientist: pattern discovery, modeling, ML, stakeholder communication, tool deployment (Python, R, SAS, SQL)
  • Data Analyst: work with structured data, acquire/clean, statistical analysis, visualization, business reporting

Required Skill Set

  • Python, SQL/NoSQL, Excel
  • Advanced statistics & high-level math
  • Data visualization
  • NLP / ML
  • Business acumen, communication, teamwork, social-media mining

Data Types & Facets

  • Structured vs Unstructured
  • Quantitative vs Qualitative
  • Four measurement levels: Nominal, Ordinal, Interval, Ratio
  • Other facets: natural language, machine-generated, graph-based, audio/video/images, streaming
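
The four measurement levels differ in which operations are meaningful. The sketch below shows one appropriate summary per level, using hypothetical sample values; the variable names and the unit-conversion helper are illustrative assumptions.

```python
from statistics import mean, mode

# Hypothetical samples, one per measurement level.
blood_type = ["A", "O", "B", "O", "AB"]                       # nominal
satisfaction = ["low", "high", "medium", "high", "medium"]    # ordinal
temp_c = [20.0, 25.0, 30.0]                                   # interval
weight_kg = [50.0, 75.0, 100.0]                               # ratio

# Nominal: labels with no order, so only counting (e.g. the mode) makes sense.
most_common = mode(blood_type)

# Ordinal: ordered labels, so a median is meaningful but a mean is not.
order = ["low", "medium", "high"]
ranks = sorted(order.index(s) for s in satisfaction)
median_level = order[ranks[len(ranks) // 2]]

# Interval: differences and means are meaningful, ratios are not,
# because the zero point is arbitrary. Changing units changes the ratio.
def c_to_f(c):
    return c * 9 / 5 + 32

mean_temp = mean(temp_c)
ratio_in_c = temp_c[2] / temp_c[0]
ratio_in_f = c_to_f(temp_c[2]) / c_to_f(temp_c[0])

# Ratio: a true zero exists, so ratios survive a change of units.
KG_TO_LB = 2.20462
ratio_in_kg = weight_kg[2] / weight_kg[0]
ratio_in_lb = (weight_kg[2] * KG_TO_LB) / (weight_kg[0] * KG_TO_LB)

print(most_common, median_level, mean_temp)
print(ratio_in_c, ratio_in_f)    # differ: "30 °C is 1.5x 20 °C" is not meaningful
print(ratio_in_kg, ratio_in_lb)  # agree: 100 kg really is twice 50 kg
```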

Structured vs Unstructured

  • Structured: easy storage & retrieval, SQL querying; hierarchical data is harder to fit into tables
  • Unstructured: context-specific content, complex processing, often natural language
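
The contrast can be sketched with Python's standard library: the same kind of information (payment amounts) is retrieved with one SQL query when it is structured, but needs ad hoc text processing when it is buried in natural language. The table, names, amounts and message text are all invented for illustration.

```python
import re
import sqlite3

# Structured: rows and columns with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("asha", 120.0), ("ben", 80.0), ("asha", 30.0)],
)
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
))
conn.close()

# Unstructured: free text where meaning depends on context.
message = ("Hi team, Asha paid 120 dollars on Monday and then "
           "another $30 later. Ben still owes us.")

# Fragile heuristic (an assumption for this sketch): treat every number
# in the text as a payment amount. Real text would need far more care.
amounts = [float(m) for m in re.findall(r"\$?(\d+(?:\.\d+)?)", message)]

print(totals)
print(amounts)
```

The regular expression recovers the numbers but not who paid them or whether "owes" implies an unpaid amount, which is the kind of context-specific processing the second bullet refers to.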