Introduction to Data Science
Definition & Scope
- Multidisciplinary field using scientific methods, processes, algorithms & systems to extract knowledge from data
- Handles both structured & unstructured data; tasks: collect, clean, analyze, interpret
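The four tasks above (collect, clean, analyze, interpret) can be sketched as a tiny pipeline. This is a minimal illustration only: the records, the function names, and the 13.0 threshold are made-up assumptions, not taken from any particular tool.

```python
from statistics import mean

# Collect: hypothetical raw readings as they might arrive from a source.
raw_records = ["12.5", " 14.0 ", "", "n/a", "13.5"]


def clean(records):
    """Strip whitespace and drop entries that cannot be parsed as numbers."""
    cleaned = []
    for item in records:
        try:
            cleaned.append(float(item.strip()))
        except ValueError:
            continue  # blank or non-numeric entry, skip it
    return cleaned


def analyze(values):
    """Compute simple summary figures."""
    return {"count": len(values), "mean": mean(values)}


def interpret(summary, threshold=13.0):
    """Turn the numbers into a plain statement (threshold is an arbitrary example)."""
    level = "above" if summary["mean"] > threshold else "at or below"
    return f"Average of {summary['count']} readings is {level} {threshold}"


values = clean(raw_records)
summary = analyze(values)
message = interpret(summary)
print(message)
```

Real pipelines are far larger, but the same order of steps usually holds: cleaning comes before analysis, and interpretation turns figures into a statement someone can act on.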
Big Data Characteristics
- 3 Vs: Volume (terabytes, records, transactions), Variety (structured, semi-structured, unstructured), Velocity (batch, near real-time, real-time, streams)
- Traditional RDBMS alone is often inadequate at this scale; ML & algorithmic approaches are typically needed
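The Velocity distinction between batch and streaming can be illustrated with a mean computed two ways. This is a sketch with made-up readings; the `RunningMean` class is a hypothetical helper, not part of any library.

```python
from statistics import mean

readings = [3.0, 5.0, 4.0, 8.0]  # made-up sensor values

# Batch: all data is available before the computation starts.
batch_mean = mean(readings)


class RunningMean:
    """Streaming: update the mean one value at a time, without storing the data."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, x):
        self.count += 1
        # Incremental update: shift the mean towards x by 1/count of the gap.
        self.value += (x - self.value) / self.count
        return self.value


stream = RunningMean()
for x in readings:  # in a real stream, values would arrive over time
    stream.update(x)

print(batch_mean, stream.value)
```

Both approaches reach the same result here; the streaming version keeps only two numbers in memory, which matters when the data never stops arriving.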
Historical Milestones
- 1960 – Peter Naur uses “data science” as a substitute term for computer science
- 1974 – Naur’s “Concise Survey of Computer Methods” uses the term “Data Science”
- 1996 – IFCS (International Federation of Classification Societies) conference title features Data Science
- 1997 – C.F. Jeff Wu lecture: statistics as data science
Motivations & Benefits
- Deeper client understanding, storytelling, cross-industry applicability
- Supports decision-making in travel, healthcare, education, retail, etc.
- Big-data insights can influence whether a product succeeds or fails
- Enables cross-sell, up-sell, personalization, e-governance
- Benefits: discover patterns, innovate products, real-time optimization
Analytical Categories
- Descriptive: what happened; visuals (pie, bar, line)
- Diagnostic: why it happened; drill-down, correlations
- Predictive: forecast future; ML, modeling
- Prescriptive: recommend best action; simulations, recommendation engines
Common Applications
- NLP: spam filters, algorithmic trading, Q&A systems, summarization
- IoT data streams, personalized ads, quantitative trading, people analytics
Key Technologies
- Artificial Intelligence / ML
- Cloud Computing for scalable processing
- Internet of Things for data generation
- Quantum Computing (emerging) for potentially faster solutions to certain complex problems
Professional Roles
- Data Scientist: pattern discovery, modeling, ML, stakeholder communication, tool deployment (Python, R, SAS, SQL)
- Data Analyst: work with structured data, acquire/clean, statistical analysis, visualization, business reporting
Required Skill Set
- Python, SQL/NoSQL, Excel
- Advanced statistics & high-level math
- Data visualization
- NLP / ML
- Business acumen, communication, teamwork, social-media mining
Data Types & Facets
- Structured vs Unstructured
- Quantitative vs Qualitative
- Four measurement levels: Nominal, Ordinal, Interval, Ratio
- Other facets: natural language, machine-generated, graph-based, audio/video/images, streaming
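The four measurement levels determine which summaries are meaningful. A short sketch with made-up values follows; the variable names are illustrative only.

```python
from statistics import mean, median_low, mode

# Nominal: labels with no order; only counting and the mode make sense.
blood_types = ["A", "O", "O", "B", "AB", "O"]
nominal_summary = mode(blood_types)

# Ordinal: ordered categories; the median is meaningful, the mean is not.
order = ["low", "medium", "high"]
ratings = ["low", "high", "medium", "medium", "high"]
ranks = [order.index(r) for r in ratings]
ordinal_summary = order[median_low(ranks)]

# Interval: differences are meaningful, but zero is arbitrary,
# so "20 °C is twice as hot as 10 °C" is not a valid statement.
temps_celsius = [10.0, 20.0, 30.0]
interval_mean = mean(temps_celsius)
interval_diff = temps_celsius[1] - temps_celsius[0]

# Ratio: a true zero exists, so ratios are meaningful as well.
weights_kg = [2.0, 4.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(nominal_summary, ordinal_summary, interval_mean, interval_diff, weight_ratio)
```

Nominal and ordinal data are qualitative; interval and ratio data are quantitative.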
Structured vs Unstructured
- Structured: easy storage & retrieval, SQL querying; hierarchical data is harder to fit into flat tables
- Unstructured: context-specific content, complex processing, often natural language
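The contrast can be sketched with Python's standard library: a SQL query over a small table versus a naive word count over free text. The table, the rows, and the review text are invented for illustration.

```python
import re
import sqlite3
from collections import Counter

# Structured: rows with a fixed schema, queried directly with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.0), ("ana", 15.0)],
)
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

# Unstructured: free text has no schema; even a basic question
# ("is this review positive?") needs extra processing.
review = "Great phone, great battery. Screen is not great."
words = re.findall(r"[a-z']+", review.lower())
counts = Counter(words)

print(totals)
print(counts["great"])
```

The word count finds "great" three times but misses that one occurrence is negated ("not great"), which shows why unstructured, context-specific content needs more complex processing than a table lookup.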