Introduction to Data Processing
Author: Vardan Hovsepyan
Learning Objectives
Key Topics:
Data Science
Data Processing
Data Source
Data Types
Data Storing and Retrieving
Data Preparation
Data Cleaning
Data Formatting
Data Visualization
Data Science Overview
Definition
Interdisciplinary field that utilizes statistical and computational methods to extract insights from data.
Core Components
Involves:
Mathematics
Machine Learning
Programming
Visualization
Data Processing
Statistics
Purpose and Process
Data science aims to solve problems through an iterative process that involves:
Refinement and improvement
Four general steps in the process
Data Processing
Definition
A subset of data science focusing on specific early steps in the data science process.
Initial Steps
Includes:
Data Collection
Data Storage
Data Sources
Categories
Company Data:
Helps with data-driven decision making
Open Data:
Free to use, share, and build upon
Examples of Company Data
Web events
Survey data
Customer data
Logistics data
Financial transactions
Examples of Open Data
a. API Sources:
X (Twitter)
Instagram
Wikipedia
Yahoo! Finance
Google Public record
b. International Organizations:
Federal Reserve Economic Data (FRED)
OECD
World Bank
IMF
WTO
UN
c. National Statistical Offices:
SCB
PRV
Swedish Environmental Protection Agency
KTH Library
Data Types
Categories
Numerical (Quantitative Data)
Can be counted/measured (e.g., temperature)
Discrete: integer values
Continuous: any value within a range (integer or non-integer)
Categorical (Qualitative Data)
Descriptive and conceptual
Nominal: classified only
Ordinal: classified and ordered
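As a rough illustration of the four categories above, the sketch below treats each kind of value differently. The records and field names are made up for this example and are not from the lecture.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records; field names are illustrative assumptions.
records = [
    {"children": 2, "temperature_c": 21.5, "city": "Stockholm", "satisfaction": "high"},
    {"children": 0, "temperature_c": 19.0, "city": "Uppsala", "satisfaction": "low"},
    {"children": 1, "temperature_c": 22.3, "city": "Stockholm", "satisfaction": "medium"},
]

# Numerical data (discrete and continuous): arithmetic is meaningful.
mean_children = mean(r["children"] for r in records)
mean_temperature = mean(r["temperature_c"] for r in records)

# Nominal data: values can only be classified, so we count them per class.
city_counts = Counter(r["city"] for r in records)

# Ordinal data: values are classified and ordered, but the order has to be
# stated explicitly because alphabetical order would be wrong here.
SATISFACTION_ORDER = ["low", "medium", "high"]
ranked = sorted(records, key=lambda r: SATISFACTION_ORDER.index(r["satisfaction"]))
```

Averaging the city names or sorting them would carry no meaning, which is the practical difference between nominal data and the other categories.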
Other Data Types
Image & Video Data
Audio Data
Text Data
Binary Data
Geospatial Data
Network Data
Metrics Data
Data Storing and Retrieving
Storage Locations
Single computer/server (small/medium data)
Cluster of servers (large data)
Cloud storage:
Microsoft Azure
Amazon Web Services
Google Cloud
Types of Data
Structured Data: (e.g., Excel spreadsheets, CSV files)
Unstructured Data: (e.g., Emails, text, video, web pages)
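A small sketch of why the distinction matters in practice: structured data can be read directly by column name, while unstructured text needs a custom extraction rule. The sample CSV, the email text, and the regular expression are assumptions made up for this example.

```python
import csv
import io
import re

# Structured data: every row follows the same columns, so fields are
# addressed by name.
csv_text = "name,age\nAnna,34\nErik,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["age"]) for row in rows]

# Unstructured data: there is no schema, so the same information has to be
# extracted with a rule written for this particular text.
email_text = "Hi team, Anna (34) and Erik (29) will join the meeting on Friday."
found_ages = [int(match) for match in re.findall(r"\((\d+)\)", email_text)]
```

The extraction rule works only as long as the text keeps the same pattern, which is one reason unstructured data usually takes more preparation.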
Data Storing Solutions
Structured Data: Relational Database (SQL)
Unstructured Data: Non-Relational Database (NoSQL)
Relational vs. Non-Relational Databases
Relational Database
Organized in tables
Linked using operations like joins and queries
Non-Relational Database
Flexible storage (documents, key-value pairs, graphs)
Efficient for complex structures or large datasets
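The contrast above can be sketched with Python's standard library: an in-memory SQLite database stands in for a relational database, and a JSON document stands in for a document store. The table names, fields, and values are hypothetical, and a real NoSQL system would add its own query interface.

```python
import json
import sqlite3

# Relational: data is organized in tables that are linked through keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Anna'), (2, 'Erik');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.5);
""")

# A join links rows from both tables inside a single query.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

# Non-relational (document style): related data is nested in one flexible
# record, and fields may differ from one document to the next.
document = {
    "name": "Anna",
    "orders": [{"id": 10, "amount": 250.0}, {"id": 11, "amount": 100.0}],
    "tags": ["newsletter"],
}
restored = json.loads(json.dumps(document))
anna_total = sum(order["amount"] for order in restored["orders"])
```

Both approaches arrive at the same total for Anna. The relational version gets there through a join, while the document version reads everything from a single nested record.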
Data Preparation
Importance
Data scientists spend about 45% of their time on data preparation tasks.
Benefits
Improves result quality, increases efficiency, enables effective analysis, reduces risks.
Key Steps
Data Cleaning: Identify and correct inaccuracies, inconsistencies.
Data Formatting: Convert data to suitable formats for analysis.
Data Cleaning Techniques
Main Issues
Missing data
Duplicate data
Inconsistent data
Noise
Outliers
Handling Techniques
Missing Values: Dropping, keeping, or imputing (mean, median, or most frequent value)
Duplicate Data: Delete older records, merge duplicates
Inconsistent Data: Use external sources for validation, replace with reasonable values
Noise Reduction: Filtering, removing erroneous components
Outliers: Analyze closely or remove if not relevant
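A minimal sketch of three of the handling techniques above, using only the standard library. The data is invented for this example, and the 1.5 x IQR rule used to flag outliers is one common convention assumed here, not something the lecture prescribes.

```python
from statistics import mean, median, mode, quantiles

# Hypothetical sensor readings; None marks a missing value.
readings = [12.0, None, 15.0, 15.0, None, 18.0]
observed = [v for v in readings if v is not None]  # the "dropping" option

def impute(values, fill):
    """Replace every missing value with the given fill value."""
    return [fill if v is None else v for v in values]

imputed_mean = impute(readings, mean(observed))
imputed_median = impute(readings, median(observed))
imputed_mode = impute(readings, mode(observed))  # most frequent value

# Duplicate data: keep only the most recent record per id.
customers = [
    {"id": 1, "email": "a@example.com", "updated": "2023-01-10"},
    {"id": 2, "email": "b@example.com", "updated": "2023-02-01"},
    {"id": 1, "email": "a.new@example.com", "updated": "2023-03-05"},
]
latest = {}
for row in customers:
    kept = latest.get(row["id"])
    # ISO dates compare correctly as plain strings.
    if kept is None or row["updated"] > kept["updated"]:
        latest[row["id"]] = row
deduplicated = list(latest.values())

# Outliers: flag values outside 1.5 * IQR so they can be analyzed closely.
values = [10, 12, 11, 13, 12, 95]
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

The outliers are only flagged, not deleted, since whether to remove them depends on whether they are relevant to the analysis.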
Data Formatting
Feature Selection
Involves selecting appropriate features for analysis and decisions on:
Adding, removing, combining, recoding, or breaking up features.
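The operations listed above can be illustrated on a couple of invented records. The field names, the plan codes, and the choice of BMI as a combined feature are assumptions made for this sketch.

```python
from datetime import date

# Hypothetical raw records; every field name here is an assumption.
raw = [
    {"id": 1, "signup": "2023-04-15", "height_m": 1.80, "weight_kg": 81.0, "plan": "Premium"},
    {"id": 2, "signup": "2022-11-02", "height_m": 1.60, "weight_kg": 64.0, "plan": "basic"},
]

PLAN_CODES = {"basic": 0, "premium": 1}

def build_features(row):
    """Derive analysis-ready features from one raw record."""
    signup = date.fromisoformat(row["signup"])
    return {
        # Breaking up: one date string becomes two separate features.
        "signup_year": signup.year,
        "signup_month": signup.month,
        # Combining: height and weight are merged into a single feature.
        "bmi": row["weight_kg"] / row["height_m"] ** 2,
        # Recoding: inconsistent text labels become numeric codes.
        "plan_code": PLAN_CODES[row["plan"].lower()],
        # Removing: "id" is left out because it carries no analytical signal.
    }

features = [build_features(row) for row in raw]
```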
Feature Transformation
Adjusting values for better analysis (e.g., scaling, aggregation).
Scaling Techniques
Normalization: Rescales values to the range 0 to 1 (min-max scaling).
Standardization: Centers values on a mean of 0 with a standard deviation of 1.
Choose between normalization and standardization based on the data distribution.
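Both scaling techniques can be sketched in a few lines. This version assumes min-max scaling for normalization and the population standard deviation for standardization; libraries may use the sample standard deviation instead, which gives slightly different results.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling: (x - min) / (max - min), giving values from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score scaling: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
normalized = normalize(values)      # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = standardize(values)  # mean 0, standard deviation 1
```

Because min-max scaling depends on the smallest and largest value, a single extreme value compresses everything else into a narrow band, which is one reason the data distribution matters when choosing between the two.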
Data Exploration & Visualization
Visualization Techniques
Summary statistics: Mean, median, standard deviation
Statistical models: linear regression
Graphs: Scatter, bar, histogram
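The first two techniques above can be tried without any plotting library. The sketch below computes summary statistics and fits a straight line by ordinary least squares; the study-hours data is invented for this example.

```python
from statistics import mean, median, stdev

# Hypothetical data: hours studied and the resulting test score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

# Summary statistics describe a single variable.
summary = {
    "mean": mean(scores),
    "median": median(scores),
    "stdev": stdev(scores),  # sample standard deviation
}

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# A linear regression describes the relationship between two variables.
slope, intercept = fit_line(hours, scores)
```

Plotting the same points as a scatter graph with the fitted line on top is a typical next step in any of the tools listed under Visualization Tools.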
Visualization Tools
Microsoft Excel, Google Sheets
Tableau, Power BI
Python Libraries: Matplotlib, Seaborn, Plotly
R and ggplot2, R Shiny
D3.js
Online Education Platforms
edX, DataCamp, Coursera, MIT OpenCourseWare, LinkedIn Learning
Conclusion
Thank you for attending the lecture!