Mahindra ÉCOLE CENTRALE - Data Engineering Course Notes
Introduction to Data Engineering
Course Material
Textbooks:
- Probability & Statistics for Engineers & Scientists (9th Edn.) by Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, and Keying Ye, Prentice Hall Inc.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edn.) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer, 2014
- An Introduction to Statistical Learning: with Applications in R by G. James, D. Witten, T. Hastie, and R. Tibshirani, Springer, 2013
Reference Book:
- Data Analytics with R by Motwani, Wiley.
Course Outline
Introduction to Data Science
Descriptive Statistics
What is Data Science?
Definition: Data Science is the science of collecting, storing, processing, describing, and modeling data.
Tasks Involved:
- Collect
- Store
- Process
- Describe
- Model
Focus on Tasks: How much attention each task receives depends on the application.
What is Data Analytics?
Definition: Data analytics is the process of examining datasets to draw conclusions about the information they contain.
Techniques: Data analytics techniques uncover patterns in raw data and extract valuable insights from it (e.g., using tools like Google Analytics).
Benefits of Data Analytics:
1. Improved Decision Making
2. More Effective Marketing
3. Better Customer Service
4. More Efficient Operations
Data Analytics vs. Data Science
Data Science:
- An umbrella that encompasses data analytics, data mining, etc.
- Data scientists forecast the future based on past patterns and generate new questions.
Data Analytics:
- Data analysts extract meaningful insights from various data sources and find answers to existing questions.
Data Analytics vs. Others
Data Science: Deals with structured and unstructured data.
Data Analysis: Any human activity aimed at gaining insight into a dataset.
Big Data: Large volumes of data that cannot be effectively processed using traditional applications.
Data Mining: Uses machine learning algorithms to automate insights into a dataset.
Machine Learning: AI technique that builds models from training datasets to predict target variable values.
Collecting Data
Depends on:
- The specific question a data scientist is trying to answer.
- The operational environment.
Examples:
1. E-commerce data on customer purchases retrieved via SQL.
2. Political sentiments requiring web crawling and scraping.
3. Agricultural data on yields for different input types, which requires a designed experiment.
Storing Data
Types of Data:
1. Transactional and Operational Data: e.g., patient records, insurance claims, inventory, customer records.
2. Structured Data: Stored in relational databases (CRUD operations: Create, Read, Update, Delete; in SQL these are INSERT, SELECT, UPDATE, DELETE).
3. Unstructured Data: Text, images, video, and speech, characteristic of the Big Data era.
Statistics: As of 2003, approximately 5 Exabytes of data had been collected; since 2013, a similar amount has been generated daily.
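The CRUD operations on structured data mentioned above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `customers` table, its columns, and the rows are hypothetical and do not come from the course material.

```python
import sqlite3

# In-memory database; the table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Create: INSERT adds new rows.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Hyderabad"), ("Ravi", "Pune")],
)

# Read: SELECT retrieves rows.
rows_before = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()

# Update: UPDATE modifies existing rows.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Chennai", "Ravi"))

# Delete: DELETE removes rows.
conn.execute("DELETE FROM customers WHERE name = ?", ("Asha",))

rows_after = conn.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
conn.close()

print(rows_before)
print(rows_after)
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid SQL injection when queries are built from user input.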
What is Big Data?
Definition: Big Data refers to a collection of datasets that are too large and complex for traditional database tools to process effectively.
5 V's of Big Data:
1. Volume: Large amounts of data being generated.
2. Variety: Different types of data from various sources.
3. Velocity: Rapid generation of data.
4. Veracity: Dealing with uncertainties and inconsistencies.
5. Value: Extracting correct and useful meaning from the data.
The 5 V's Explained
Volume:
- Projected data growth from 4.4 Zettabytes to 44 Zettabytes by 2020.
- Conversion examples: 1 Exabyte = 1024 Petabytes = 1,048,576 Terabytes.
Variety & Examples:
- Structured (e.g., databases), unstructured (e.g., documents), semi-structured (e.g., XML, JSON).
- Different sources include emails, images, audio, and video data.
Velocity: Data generated every minute across platforms:
- 98,000 tweets, 695,000 updates, 11 million instant messages, etc.
Veracity: Managing data quality by addressing uncertainty and inconsistencies in datasets, exemplified through statistical measures.
Value: Extracting meaningful, actionable insights from data.
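The unit conversions quoted under Volume can be checked with a few lines of Python. The sketch below assumes the binary convention used in the notes (a factor of 1024 between adjacent units); the `convert` helper is a hypothetical name introduced only for this illustration.

```python
# Storage units in increasing size; each step is a factor of 1024,
# following the convention used in the notes.
STEP = 1024
UNITS = ["Terabyte", "Petabyte", "Exabyte", "Zettabyte"]

def convert(value, from_unit, to_unit):
    """Convert between the units in UNITS using a factor of 1024 per step."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * STEP ** steps

print(convert(1, "Exabyte", "Petabyte"))    # 1024
print(convert(1, "Exabyte", "Terabyte"))    # 1048576
print(convert(44, "Zettabyte", "Exabyte"))  # 45056
```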
Storing Data Strategies
Types of Storage:
1. Relational Databases: For structured data.
2. Data Warehouses: Optimized for analytics; curated data sets.
3. Data Lakes: For big data; uncurated and can include structured or unstructured data.
Processing Data
Phases in Processing:
1. Data Wrangling or Munging: Extract, transform, and load (ETL) processes.
2. Data Cleaning: Handling missing values, standardizing information, correcting errors, and removing outliers.
3. Data Scaling, Normalizing, Standardizing: Standardization (zero mean, unit variance), normalization (values range between 0 and 1), scaling conversions (e.g., kilometers to miles).
Performance Considerations: Efficient processing is crucial when handling large datasets and often requires distributed processing techniques.
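The three transformations in the last phase can be sketched as follows. The sketch uses the common convention in which min-max normalization maps values into the range 0 to 1 and standardization (z-scoring) gives zero mean and unit variance. The function names and the data are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

KM_PER_MILE = 1.609344  # exact definition of the international mile

def min_max_normalize(values):
    """Rescale values linearly into [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def km_to_miles(km):
    """Unit conversion: a pure rescaling with no shift."""
    return km / KM_PER_MILE

distances_km = [2.0, 4.0, 6.0, 8.0]  # hypothetical data
normalized = min_max_normalize(distances_km)
z_scores = standardize(distances_km)
distances_miles = [km_to_miles(d) for d in distances_km]

print(normalized)
print(z_scores)
print(distances_miles)
```

Min-max normalization is sensitive to extreme values because a single outlier sets the range, which is one reason outliers are usually handled during cleaning before scaling.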
Describing Data
Techniques for Describing Data:
1. Visualization: Charts, graphs, and plots.
2. Summarization: Mean, median, mode, standard deviation, and variance (e.g., to summarize monthly sales data).
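As a small illustration of summarization, the statistics listed above can be computed with Python's standard library. The monthly sales figures are hypothetical, and treating them as a sample (dividing by n - 1 for the variance) is an assumption made for this sketch.

```python
from statistics import mean, median, mode, stdev, variance

# Hypothetical monthly sales figures (units sold).
monthly_sales = [120, 135, 150, 135, 160, 140]

summary = {
    "mean": mean(monthly_sales),
    "median": median(monthly_sales),
    "mode": mode(monthly_sales),
    "variance": variance(monthly_sales),  # sample variance (n - 1 denominator)
    "std_dev": stdev(monthly_sales),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```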
Statistical Modeling of Data
Modeling Concepts: Assessing data distributions, conducting hypothesis tests, and establishing robustness of hypotheses (e.g., effectiveness of a drug).
Hypothesis Testing: Focus on relationships in the data while estimating key parameters and providing statistical guarantees.
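To make the drug-effectiveness example concrete, here is a minimal sketch of one possible hypothesis test, an exact permutation test on the difference in group means. The notes do not name a specific test, so the choice of test, the function name, and the measurements are all assumptions made for illustration.

```python
from itertools import combinations

def exact_permutation_test(treated, control):
    """One-sided exact permutation test of H0: the treatment has no effect.

    Returns the observed difference in means (treated minus control) and the
    fraction of all relabelings whose difference is at least as large.
    """
    pooled = list(treated) + list(control)
    n_treated, n_control = len(treated), len(control)
    total = sum(pooled)
    observed = sum(treated) / n_treated - sum(control) / n_control

    extreme = 0
    n_splits = 0
    # Enumerate every way of choosing which pooled values count as "treated".
    for idx in combinations(range(len(pooled)), n_treated):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / n_treated - (total - group_sum) / n_control
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return observed, extreme / n_splits

# Hypothetical outcome scores for treated and control patients.
treated = [8.0, 9.0, 10.0, 11.0]
control = [4.0, 5.0, 6.0, 7.0]

observed_diff, p_value = exact_permutation_test(treated, control)
print(f"observed difference in means: {observed_diff}")
print(f"one-sided p-value: {p_value:.4f}")
```

With these values only one of the 70 possible relabelings produces a difference as large as the observed one, so the p-value is 1/70. Full enumeration is practical only for small groups; larger studies would sample relabelings at random or use a parametric test.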
Recap of Statistical Concepts
Key terms: Population, sample, parameter, statistic, sampling strategies, hypothesis testing, etc.
Measures of Centrality
Definition and calculation of mean, median, and mode in data analysis. Also emphasizes the distinction between samples and populations, and the use of sample statistics to estimate averages and typical data points.
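The distinction between a population parameter and a sample statistic can be illustrated with a short simulation. The population (the integers 1 to 1000), the sample size, and the random seed are arbitrary choices made for this sketch.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..1000.
population = list(range(1, 1001))
mu = mean(population)  # population parameter

# Simple random sample of 100 values drawn without replacement.
sample = random.sample(population, 100)
x_bar = mean(sample)   # sample statistic, used as an estimate of mu

print(f"population mean (parameter): {mu}")
print(f"sample mean (statistic):     {x_bar}")
```

The sample mean will generally differ from the population mean, and the gap tends to shrink as the sample size grows.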
Measures of Spread
Discussion on range, interquartile range (IQR), variance, and standard deviation as critical metrics to quantify dispersion in datasets.
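A minimal sketch of these four measures using Python's standard library follows. The data are hypothetical, and the quartiles use the default "exclusive" method of `statistics.quantiles`; other quartile conventions give slightly different IQR values.

```python
from statistics import quantiles, stdev, variance

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical observations

# Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# Quartiles split the sorted data into four parts; IQR = Q3 - Q1.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Sample variance and standard deviation (n - 1 denominator).
var = variance(data)
sd = stdev(data)

print(f"range: {data_range}")
print(f"Q1, Q2, Q3: {q1}, {q2}, {q3}")
print(f"IQR: {iqr}")
print(f"variance: {var:.3f}, standard deviation: {sd:.3f}")
```

The range depends only on the two extreme values, while the IQR ignores the lowest and highest quarters of the data, which makes it less sensitive to outliers.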
Utilization of Data Visualization
The importance of visualization techniques such as box plots, histograms, frequency plots, and scatter plots to interpret data and identify valuable trends and insights.
Use of Histograms and Frequency Polygons
Histograms and frequency polygons show how values are distributed across a dataset, and connect directly with the statistical calculations.
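The calculation behind both plots is the same: count how many values fall in each bin. A histogram draws the counts as bars, and a frequency polygon joins the points (bin midpoint, count) with straight lines. The sketch below computes both ingredients without a plotting library. The scores and bin edges are hypothetical, and the function assumes every value lies between `lo` and `hi`.

```python
def histogram(values, lo, hi, n_bins):
    """Count values in n_bins equal-width bins spanning [lo, hi].

    Assumes every value lies in [lo, hi]. Returns (midpoints, counts).
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so that v == hi is counted.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    midpoints = [lo + (i + 0.5) * width for i in range(n_bins)]
    return midpoints, counts

# Hypothetical exam scores.
scores = [52, 55, 61, 64, 67, 68, 70, 73, 75, 78, 82, 88, 91, 99]
midpoints, counts = histogram(scores, lo=50, hi=100, n_bins=5)

# Text histogram: one bar of '#' characters per bin.
for m, c in zip(midpoints, counts):
    print(f"{m:5.1f} | {'#' * c}")

# Vertices of the frequency polygon.
polygon_points = list(zip(midpoints, counts))
print(polygon_points)
```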
Measures of Spread and Their Significance
The appropriate way to summarize spread depends on the nature of the data and on the context in which the data is interpreted.
Standardizing Data
The significance of transforming raw data into standardized formats through scaling, shifting, and their impacts on different statistical measures.
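The effect of shifting and scaling on summary statistics can be checked numerically. For a transformation y = a*x + b, the mean and median become a times the original plus b, while the standard deviation is multiplied by |a| and is unaffected by the shift b. The sketch below uses a Celsius-to-Fahrenheit conversion with hypothetical readings as the example.

```python
from statistics import mean, median, pstdev

def transform(values, scale, shift):
    """Apply y = scale * x + shift to every value."""
    return [scale * v + shift for v in values]

celsius = [10.0, 20.0, 30.0, 40.0]  # hypothetical temperature readings
fahrenheit = transform(celsius, scale=1.8, shift=32.0)

# Measures of centrality follow the full transformation (scale and shift).
print(mean(celsius), mean(fahrenheit))
print(median(celsius), median(fahrenheit))

# Measures of spread are affected by the scale only, not by the shift.
print(pstdev(celsius), pstdev(fahrenheit))
```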
Summary and Conclusion
Comprehensive overview of essential statistical concepts that support effective data engineering. Understanding these measures can greatly improve decision-making across various contexts.