Mahindra ÉCOLE CENTRALE - Data Engineering Course Notes

Introduction to Data Engineering

Course Material
  • Textbooks:
      - Probability & Statistics for Engineers & Scientists (9th Edn.) by Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, and Keying Ye, Prentice Hall Inc.
      - The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edn.) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer, 2014
      - An Introduction to Statistical Learning: with Applications in R by G. James, D. Witten, T. Hastie, and R. Tibshirani, Springer, 2013

  • Reference Book:
      - Data Analytics with R by Motwani, Wiley.

Course Outline

  • Introduction to Data Science

  • Descriptive Statistics

What is Data Science?

  • Definition: Data Science is the science of collecting, storing, processing, describing, and modeling data.

  • Tasks Involved:
      - Collect
      - Store
      - Process
      - Describe
      - Model

  • Focus on Tasks: How much attention each task receives depends on the application.

What is Data Analytics?

  • Definition: It refers to the process of examining datasets to draw conclusions about the information they contain.

  • Techniques: Data analytics techniques uncover patterns in raw data and extract valuable insights from them (e.g., using tools like Google Analytics).

  • Benefits of Data Analytics:
      1. Improved Decision Making
      2. More Effective Marketing
      3. Better Customer Service
      4. More Efficient Operations

Data Analytics vs. Data Science

  • Data Science:
      - An umbrella that encompasses data analytics, data mining, etc.
      - Data scientists forecast the future based on past patterns and generate new questions to investigate.

  • Data Analytics:
      - Data analysts extract meaningful insights from various data sources and find answers to existing questions.

Data Analytics vs. Others

  • Data Science: Deals with structured and unstructured data.

  • Data Analysis: Any human activity aimed at gaining insight into a dataset.

  • Big Data: Large volumes of data that cannot be effectively processed using traditional applications.

  • Data Mining: Uses machine learning algorithms to automate insights into a dataset.

  • Machine Learning: AI technique that builds models from training datasets to predict target variable values.

Collecting Data

  • Depends on:
      - The specific question a data scientist is trying to answer.
      - The operational environment.

  • Examples:
      1. E-commerce data on customer purchases retrieved via SQL.
      2. Political sentiments requiring web crawling and scraping.
      3. Agricultural data on crop yields for different input types, which requires a designed experiment.
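
As a minimal sketch of the first example (retrieving purchase data via SQL), the snippet below uses Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and only stand in for a real e-commerce database.

```python
import sqlite3

# Hypothetical in-memory database standing in for an e-commerce store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("asha", "book", 12.5), ("ravi", "pen", 2.0), ("asha", "lamp", 30.0)],
)

# Collect: total spend per customer, retrieved with a single SQL query.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)
```

In practice the connection would point at the operational database rather than an in-memory one; the query itself is what answers the data scientist's question.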

Storing Data

  • Types of Data:
      1. Transactional and Operational Data: e.g., patient records, insurance claims, inventory, customer records.
      2. Structured Data: Stored in relational databases and manipulated with CRUD operations (Create, Read, Update, Delete), which map to the SQL statements INSERT, SELECT, UPDATE, and DELETE.
      3. Unstructured Data: Text, images, video, speech, representing the Big Data era.

  • Statistics: By 2003, approximately 5 exabytes of data had been collected in total; since 2013, a similar amount has been generated every day.
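
The four basic operations on structured data in a relational database can be sketched as follows, again with Python's sqlite3 module. The inventory table and its contents are hypothetical, chosen only to illustrate how each operation maps to a SQL statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER)")

# Create -> INSERT
conn.execute("INSERT INTO inventory VALUES (?, ?)", ("A100", 5))
conn.execute("INSERT INTO inventory VALUES (?, ?)", ("B200", 9))

# Read -> SELECT
qty_before = conn.execute(
    "SELECT qty FROM inventory WHERE sku = ?", ("A100",)
).fetchone()[0]

# Update -> UPDATE (two units of A100 were sold)
conn.execute("UPDATE inventory SET qty = qty - 2 WHERE sku = ?", ("A100",))

# Delete -> DELETE (B200 is discontinued)
conn.execute("DELETE FROM inventory WHERE sku = ?", ("B200",))

remaining = conn.execute("SELECT sku, qty FROM inventory").fetchall()
conn.close()

print(qty_before, remaining)
```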

What is Big Data?

  • Definition: Big Data refers to a collection of datasets that are too large and complex for traditional database tools to process effectively.

  • 5 V's of Big Data:
      1. Volume: Large amounts of data being generated.
      2. Variety: Different types of data from various sources.
      3. Velocity: Rapid generation of data.
      4. Veracity: Dealing with uncertainties and inconsistencies.
      5. Value: Understanding the correct meaning from data.

The 5 V's Explained

  • Volume:
      - Data volume was projected to grow from 4.4 Zettabytes to 44 Zettabytes by 2020.
      - Conversion examples (binary units): 1 Exabyte = 1024 Petabytes = 1,048,576 Terabytes.

  • Variety & Examples:
      - Structured (e.g., databases), unstructured (e.g., documents), semi-structured (e.g., XML, JSON).
      - Different sources include emails, images, audio, and video data.

  • Velocity: Data generated every minute across platforms:
      - 98,000 tweets, 695,000 updates, 11 million instant messages, etc.

  • Veracity: Managing data quality by addressing uncertainty and inconsistencies in data sets, exemplified through statistical measures.

  • Value: Extracting meaningful insights from data through mechanisms that provide concrete meanings.
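
The unit conversions quoted under Volume can be checked with a short helper. This is a sketch that assumes binary units, where each step up the scale is a factor of 1024; the function name convert and the UNITS list are illustrative, not part of the course material.

```python
# Storage units in increasing order; each is 1024 times the previous one
# (binary convention, matching 1 Exabyte = 1024 Petabytes).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def convert(value, src, dst):
    """Convert value from unit src to unit dst."""
    steps = UNITS.index(src) - UNITS.index(dst)
    return value * 1024 ** steps

print(convert(1, "EB", "PB"))  # exabytes to petabytes
print(convert(1, "EB", "TB"))  # exabytes to terabytes
```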

Storing Data Strategies

  • Types of Storage:
      1. Relational Databases: For structured data.
      2. Data Warehouses: Optimized for analytics; curated data sets.
      3. Data Lakes: For big data; uncurated and can include structured or unstructured data.

Processing Data

  • Phases in Processing:
      1. Data Wrangling or Munging: Extract, transform, and load (ETL) processes.
      2. Data Cleaning: Handling missing values, standardizing information, correcting errors, and removing outliers.
      3. Data Scaling, Normalizing, Standardizing: Normalization (rescaling values to the range 0 to 1), standardization (shifting and scaling to zero mean and unit variance), and unit conversions (e.g., kilometers to miles).

  • Performance Considerations: Efficient processing is crucial when handling large datasets; often requires distributed processing techniques.
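
The third processing phase can be sketched in a few lines. The code follows the common convention that min-max normalization maps values to the range 0 to 1, while standardization (the z-score) yields zero mean and unit variance. The function names and the sample distances are assumptions made for illustration, and both functions assume the input values are not all identical (otherwise the denominator is zero).

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

KM_PER_MILE = 1.609344  # exact by definition of the international mile

def km_to_miles(km):
    """Plain unit conversion: a change of scale only."""
    return km / KM_PER_MILE

distances_km = [2.0, 4.0, 6.0, 8.0]  # hypothetical sample
normalized = min_max_normalize(distances_km)
standardized = standardize(distances_km)

print(normalized)
print(standardized)
print([km_to_miles(d) for d in distances_km])
```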

Describing Data

  • Techniques for Describing Data:
      1. Visualization: Charts, graphs, and plots.
      2. Summarization: Mean, median, mode, standard deviation, and variance (e.g., to summarize monthly sales data).
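
A minimal sketch of summarization using Python's standard statistics module. The twelve monthly sales figures are invented for illustration, and the variance and standard deviation shown are the sample versions (dividing by n - 1).

```python
import statistics

# Hypothetical monthly sales figures for one year.
monthly_sales = [120, 135, 150, 135, 160, 180, 135, 170, 190, 200, 210, 135]

summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "mode": statistics.mode(monthly_sales),
    "variance": statistics.variance(monthly_sales),  # sample variance
    "stdev": statistics.stdev(monthly_sales),        # sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```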

Statistical Modeling Data

  • Modeling Concepts: Assessing data distributions, conducting hypothesis tests, and establishing robustness of hypotheses (e.g., effectiveness of a drug).

  • Hypothesis Testing: Focus on relationships in the data while estimating key parameters and providing statistical guarantees.
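
One simple way to make the drug-effectiveness example concrete is an exact permutation test. This technique is not named in the notes; it is used here only as an illustrative sketch because it needs no distributional assumptions and no external libraries. The scores are hypothetical, and the function computes a one-sided p-value: the fraction of all possible regroupings whose difference in means is at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(treated, control):
    """One-sided exact permutation test for mean(treated) > mean(control)."""
    observed = mean(treated) - mean(control)
    pooled = treated + control
    n = len(treated)
    total = 0
    extreme = 0
    # Enumerate every way of relabelling n of the pooled values as "treated".
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean(group_a) - mean(group_b) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical recovery scores (higher is better).
treated = [8, 9, 10, 11]
control = [4, 5, 6, 7]
p_value = permutation_p_value(treated, control)
print(p_value)
```

A small p-value means the observed difference would be unusual if the drug had no effect. Full enumeration is only practical for small samples; larger studies would typically sample random relabellings or use a parametric test instead.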

Recap of Statistical Concepts

  • Key terms: Population, sample, parameter, statistic, sampling strategies, hypothesis testing, etc.

Measures of Centrality

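  • Definition and calculation of the mean, median, and mode. The distinction between samples and populations matters here: a statistic computed from a sample (such as the sample mean) serves as an estimate of the corresponding population parameter, and each measure captures a different notion of the average or typical data point.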

Measures of Spread

  • Discussion on range, interquartile range (IQR), variance, and standard deviation as critical metrics to quantify dispersion in datasets.
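
The four measures of spread can be computed with the standard statistics module, as in this sketch. The data values are made up, and the quartiles use the module's default "exclusive" method; other quartile conventions can give slightly different IQR values.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 14]  # hypothetical observations

# Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# Interquartile range: spread of the middle half of the data.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Variance: population version divides by n, sample version by n - 1.
pop_var = statistics.pvariance(data)
sample_var = statistics.variance(data)

# Standard deviation: square root of the variance, in the data's own units.
pop_sd = statistics.pstdev(data)

print(data_range, iqr, pop_var, sample_var, pop_sd)
```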

Utilization of Data Visualization

  • The importance of visualization techniques such as box plots, histograms, frequency plots, and scatter plots to interpret data and identify valuable trends and insights.

Use of Histograms and Frequency Polygons

  • Histograms and frequency polygons show how values are distributed across a dataset; the plotted frequencies tie directly to the underlying statistical calculations.
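
The numbers behind both plots can be computed without any plotting library. In this sketch, histogram counts how many values fall into each equal-width bin, and frequency_polygon pairs each bin's midpoint with its count, which gives the points a frequency polygon would connect. The function names, bin settings, and scores are all assumptions for illustration.

```python
def histogram(values, start, bin_width, n_bins):
    """Count values in n_bins equal-width bins [start + k*w, start + (k+1)*w)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // bin_width)
        if 0 <= k < n_bins:  # values outside the binned range are ignored
            counts[k] += 1
    return counts

def frequency_polygon(counts, start, bin_width):
    """Return (bin midpoint, count) pairs, the vertices of a frequency polygon."""
    return [(start + (k + 0.5) * bin_width, c) for k, c in enumerate(counts)]

scores = [52, 55, 61, 64, 67, 68, 71, 73, 75, 78, 82, 95]  # hypothetical
counts = histogram(scores, start=50, bin_width=10, n_bins=5)
points = frequency_polygon(counts, start=50, bin_width=10)

print(counts)
print(points)
```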

Measures of Spread and Their Significance

  • Importance of choosing a measure of spread that suits the nature of the data, and of interpreting it in context.

Standardizing Data

  • The significance of transforming raw data into a standardized form by shifting and scaling, and how these transformations affect different statistical measures.
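
The effect of shifting and scaling on summary statistics can be verified directly. In this sketch (with made-up data and constants), adding a constant moves the mean by that constant but leaves the standard deviation unchanged, while multiplying by a constant multiplies both the mean and the standard deviation by it.

```python
from statistics import mean, pstdev

data = [10, 20, 30, 40]            # hypothetical raw values
shifted = [x + 5 for x in data]    # shift: add a constant
scaled = [x * 3 for x in data]     # scale: multiply by a constant

print(mean(data), pstdev(data))
print(mean(shifted), pstdev(shifted))  # mean moves by 5, spread unchanged
print(mean(scaled), pstdev(scaled))    # mean and spread both multiplied by 3
```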

Summary and Conclusion

  • Comprehensive overview of the essential statistical concepts that support effective data engineering. Understanding these measures can greatly improve decision-making across various contexts.
