INTRODUCTION TO DATA SCIENCE LECTURE NOTES

UNIT - 1 Introduction to Data Science

Definition of Data Science
  • Data Science: A domain of study dealing with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.
        - Uses complex machine learning algorithms to build predictive models.
        - Analyzes data from various sources, in different formats.
        - Involves extraction, preparation, analysis, visualization, and maintenance of information.
        - A cross-disciplinary field using scientific methods and processes to draw insights from data.

Data Science Lifecycle

The Data Science Lifecycle consists of five distinct stages:

  1. Capture:
       - Data Acquisition, Data Entry, Signal Reception, Data Extraction.
       - Gathering raw structured and unstructured data.

  2. Maintain:
       - Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture.
       - Transforming raw data into usable formats.

  3. Process:
       - Data Mining, Clustering/Classification, Data Modeling, Data Summarization.
       - Examining patterns, ranges, and biases in prepared data to assess its value for predictive analysis.

  4. Analyze:
       - Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining, Qualitative Analysis.
       - Applying these analyses to the prepared data to test hypotheses and generate predictions.

  5. Communicate:
       - Data Reporting, Data Visualization, Business Intelligence, Decision Making.
       - Presenting analyses in readable formats such as charts and reports.
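
As a rough illustration of how the five stages fit together, the sketch below walks a tiny, made-up dataset through each one using only the Python standard library. The data (ad_spend, sales), the function names, and the choice of a least-squares line for the Analyze stage are all assumptions made for this example, not a prescribed method.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data: advertising spend vs. sales, with one incomplete row.
RAW_CSV = """ad_spend,sales
10,25
20,45
30,
40,85
50,105
"""


def capture(text):
    """Capture: read raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def maintain(rows):
    """Maintain: cleanse by dropping incomplete rows and casting values to float."""
    cleaned = []
    for row in rows:
        if row["ad_spend"] and row["sales"]:
            cleaned.append((float(row["ad_spend"]), float(row["sales"])))
    return cleaned


def process(pairs):
    """Process: summarize the prepared data (count, mean, and range of sales)."""
    sales = [y for _, y in pairs]
    return {
        "n": len(pairs),
        "mean_sales": mean(sales),
        "min_sales": min(sales),
        "max_sales": max(sales),
    }


def analyze(pairs):
    """Analyze: fit sales = slope * ad_spend + intercept by ordinary least squares."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def communicate(summary, model):
    """Communicate: present the findings in a readable text report."""
    slope, intercept = model
    return (
        f"Rows used: {summary['n']}\n"
        f"Sales range: {summary['min_sales']} to {summary['max_sales']} "
        f"(mean {summary['mean_sales']})\n"
        f"Fitted line: sales = {slope:.2f} * ad_spend + {intercept:.2f}"
    )


if __name__ == "__main__":
    rows = capture(RAW_CSV)
    pairs = maintain(rows)
    print(communicate(process(pairs), analyze(pairs)))
```

Each function stands in for a whole stage; in practice the stages would typically rely on dedicated tools (databases, data warehouses, ML libraries, dashboards) rather than a single script.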

Evolution of Data Science: Growth & Innovation

  • Emerged from the merging of applied statistics with computer science to leverage modern computing.
  • 1962: John W. Tukey articulates the "data science" vision in "The Future of Data Analysis."
  • 1977: Establishment of the International Association for Statistical Computing (IASC) to link statistical methods, computer technology, and domain expertise.
  • 1980s and 1990s: Significant strides, including the first Knowledge Discovery in Databases (KDD) workshop and the founding of the International Federation of Classification Societies (IFCS).
  • 1994: Business Week publishes a cover story on “Database Marketing.”
  • 1990s and early 2000s: Growth of data science as a recognized field with the emergence of academic journals.
  • 2000s: Increased internet connectivity enables massive data collection capabilities.
  • 2005: "Big Data" comes into use as companies like Google and Facebook collect data at a scale that calls for technologies such as Hadoop and, later, Cassandra and Spark.
  • 2014: Demand for data scientists surges as organizations seek data-driven insights.
  • 2015: Machine learning, deep learning, and Artificial Intelligence (AI) enter mainstream data science practice.
  • 2018: New regulations affecting data science practices emerge, such as the EU's General Data Protection Regulation (GDPR).
  • 2020s: Breakthroughs in AI and machine learning continue, along with rising demand for big data professionals.

Roles in Data Science

  1. Data Analyst:
       - Responsibilities: Visualizing, munging (cleaning and reshaping), and processing data; performing database queries.
       - Key skills: SQL, R, SAS, Python.
       - Important responsibilities include:
           - Extracting data from sources.
           - Maintaining databases.
           - Performing data analysis and report generation with recommendations.

  2. Data Engineer:
       - Responsibilities: Building and testing scalable Big Data ecosystems, updating systems for efficiency.
       - Key skills: Hive, NoSQL, R, Ruby, Java, C++, Matlab.
       - Important responsibilities include:
           - Designing and maintaining data management systems.
           - Collecting and managing data.
           - Conducting research.

  3. Database Administrator:
       - Responsibilities: Ensuring proper database functioning, managing data access services.
       - Important responsibilities include:
           - Managing database software.
           - Designing and developing databases.
           - Implementing security measures.

  4. Machine Learning Engineer:
       - Responsibilities: Designing ML systems, testing systems, implementing algorithms.
       - Key skills: SQL, REST APIs.
       - Important responsibilities include:
           - Developing ML systems.
           - Researching ML algorithms.

  5. Data Scientist:
       - Responsibilities: Understanding business challenges, performing predictive analysis.
       - Key skills: R, Matlab, SQL, Python.