Introduction to Data Processing

  • Author: Vardan Hovsepyan

Learning Objectives

  • Key Topics:

    • Data Science

    • Data Processing

    • Data Source

    • Data Types

    • Data Storing and Retrieving

    • Data Preparation

    • Data Cleaning

    • Data Formatting

    • Data Visualization

Data Science Overview

Definition

  • Interdisciplinary field that utilizes statistical and computational methods to extract insights from data.

Core Components

  • Involves:

    • Mathematics

    • Machine Learning

    • Programming

    • Visualization

    • Data Processing

    • Statistics

Purpose and Process

  • Data science aims to solve problems through an iterative process that involves:

    • Refinement and improvement

    • Four general steps in the process

Data Processing

Definition

  • A subset of data science focusing on specific early steps in the data science process.

Initial Steps

  • Includes:

    • Data Collection

    • Data Storage

Data Sources

Categories

  • Company Data:

    • Helps with data-driven decision making

  • Open Data:

    • Free to use, share, and build upon

Examples of Company Data

  • Web events

  • Survey data

  • Customer data

  • Logistics data

  • Financial transactions

Examples of Open Data

a. API Sources:

  • X (Twitter)

  • Instagram

  • Wikipedia

  • Yahoo! Finance

  • Google Public record

b. International Organizations:

  • Federal Reserve Economic Data (FRED)

  • OECD

  • World Bank

  • IMF

  • WTO

  • UN

c. National Statistical Offices:

  • SCB (Statistics Sweden)

  • PRV (Swedish Intellectual Property Office)

  • Swedish Environmental Protection Agency

  • KTH Library

Data Types

Categories

  • Numerical (Quantitative Data)

    • Can be counted/measured (e.g., temperature)

    • Discrete: integer values

    • Continuous: any value within a range, including non-integers

  • Categorical (Qualitative Data)

    • Descriptive and conceptual

    • Nominal: classified only

    • Ordinal: classified and ordered
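
A minimal sketch of the four categories above, using only the Python standard library. All variable names and values are hypothetical and chosen for illustration. The main practical point is that ordinal data needs an explicit order, while nominal data can only be counted or compared for equality.

```python
from collections import Counter

# Hypothetical example values, one list per category.
household_size = [1, 3, 2, 4]             # numerical, discrete: counts
temperature_c = [21.5, 19.0, 23.25]       # numerical, continuous: measurements
eye_color = ["brown", "blue", "green"]    # categorical, nominal: no natural order
satisfaction = ["low", "high", "medium"]  # categorical, ordinal: ordered levels

# Ordinal data needs an explicit ranking; alphabetical sorting would be wrong.
LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}
sorted_satisfaction = sorted(satisfaction, key=LEVEL_ORDER.get)

# Nominal data supports counting and equality checks, not ordering.
color_counts = Counter(eye_color)

print(sorted_satisfaction)  # ['low', 'medium', 'high']
print(color_counts["blue"])  # 1
```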

Other Data Types

  • Image & Video Data

  • Audio Data

  • Text Data

  • Binary Data

  • Geospatial Data

  • Network Data

  • Metrics Data

Data Storing and Retrieving

Storage Locations

  • Single computer/server (small/medium data)

  • Cluster of servers (large data)

  • Cloud storage:

    • Microsoft Azure

    • Amazon Web Services

    • Google Cloud

Types of Data

  • Structured Data: (e.g., Excel spreadsheets, CSV files)

  • Unstructured Data: (e.g., Emails, text, video, web pages)
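
To make the distinction concrete, here is a small sketch using the Python standard library. The CSV content and the email text are invented for illustration. Structured data can be addressed by column name right away, while unstructured text has no fixed fields and needs extra processing before analysis.

```python
import csv
import io

# Structured: rows with named columns (hypothetical CSV content).
csv_text = "customer_id,country,amount\n1,SE,120.50\n2,NO,80.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_amount = sum(float(row["amount"]) for row in rows)

# Unstructured: free text with no fixed fields (hypothetical email body).
email_body = "Hi, my order arrived late. Please refund the shipping fee."
word_count = len(email_body.split())

print(rows[0]["country"])  # SE
print(total_amount)        # 200.5
print(word_count)          # 10
```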

Data Storing Solutions

  • Structured Data: Relational Database (SQL)

  • Unstructured Data: Non-Relational Database (NoSQL)

Relational vs. Non-Relational Databases

Relational Database

  • Organized in tables

  • Tables are linked through shared keys and combined with joins in queries

Non-Relational Database

  • Flexible storage (documents, key-value pairs, graphs)

  • Efficient for complex structures or large datasets
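
The contrast can be sketched with the Python standard library. The table names, columns, and values below are hypothetical. The relational part uses an in-memory SQLite database and a join. The non-relational part only imitates a document store with a nested dictionary serialized as JSON, so it is not a real NoSQL database.

```python
import json
import sqlite3

# Relational: two tables linked by a key and combined with a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Anna'), (2, 'Erik');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

# Document style: the customer and its orders are stored together,
# so no join is needed to read them back.
customer_doc = {
    "name": "Anna",
    "orders": [{"id": 10, "amount": 50.0}, {"id": 11, "amount": 25.0}],
}
serialized = json.dumps(customer_doc)
doc_total = sum(order["amount"] for order in json.loads(serialized)["orders"])

print(totals)     # [('Anna', 75.0), ('Erik', 40.0)]
print(doc_total)  # 75.0
```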

Data Preparation

Importance

  • Data scientists spend about 45% of their time on data preparation tasks.

Benefits

  • Improves result quality, increases efficiency, enables effective analysis, reduces risks.

Key Steps

  1. Data Cleaning: Identify and correct inaccuracies and inconsistencies.

  2. Data Formatting: Convert data to suitable formats for analysis.

Data Cleaning Techniques

Main Issues

  • Missing data

  • Duplicate data

  • Inconsistent data

  • Noise

  • Outliers

Handling Techniques

  1. Missing Values: Drop, keep, or impute (with the mean, median, or most frequent value)

  2. Duplicate Data: Delete older records, merge duplicates

  3. Inconsistent Data: Use external sources for validation, replace with reasonable values

  4. Noise Reduction: Filtering, removing erroneous components

  5. Outliers: Analyze closely or remove if not relevant
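
Three of the techniques above can be sketched with the Python standard library. The data is invented, and two choices are assumptions not taken from the lecture: duplicates are resolved by keeping the most recently updated record, and outliers are flagged with the common 1.5 × IQR rule. Other rules may suit a given dataset better.

```python
import statistics

# 1. Missing values: impute with the median of the observed values.
ages = [34, None, 29, 41, None, 38]
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed = [a if a is not None else median_age for a in ages]

# 2. Duplicates: keep only the newest record per id (assumed policy).
# ISO dates compare correctly as strings.
records = [
    {"id": 1, "updated": "2024-01-05", "city": "Stockholm"},
    {"id": 2, "updated": "2024-02-01", "city": "Uppsala"},
    {"id": 1, "updated": "2024-03-10", "city": "Gothenburg"},
]
latest = {}
for rec in records:
    kept = latest.get(rec["id"])
    if kept is None or rec["updated"] > kept["updated"]:
        latest[rec["id"]] = rec
deduplicated = list(latest.values())

# 3. Outliers: flag values outside 1.5 * IQR from the quartiles (assumed rule).
values = [10, 12, 11, 13, 12, 95]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print(imputed)            # [34, 36.0, 29, 41, 36.0, 38]
print(len(deduplicated))  # 2
print(outliers)           # [95]
```

Flagged outliers are only candidates. Whether to remove them depends on whether they are errors or genuine extreme observations.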

Data Formatting

Feature Selection

  • Involves selecting appropriate features for analysis and deciding whether to:

    • Add, remove, combine, recode, or break up features.

Feature Transformation

  • Adjusting values for better analysis (e.g., scaling, aggregation).
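
A short sketch of breaking up, recoding, and aggregating features, using the Python standard library. The transaction records, the tier names, and the numeric codes are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions.
transactions = [
    {"customer": "A", "day": date(2024, 1, 15), "amount": 120.0, "tier": "gold"},
    {"customer": "A", "day": date(2024, 2, 3), "amount": 80.0, "tier": "gold"},
    {"customer": "B", "day": date(2024, 1, 20), "amount": 40.0, "tier": "basic"},
]

# Recoding: map an ordinal category to integers that preserve its order.
TIER_CODES = {"basic": 0, "silver": 1, "gold": 2}

features = []
for t in transactions:
    features.append({
        "customer": t["customer"],
        "month": t["day"].month,             # breaking up: date -> month
        "tier_code": TIER_CODES[t["tier"]],  # recoding: text -> ordered code
        "amount": t["amount"],
    })

# Aggregation: total amount per customer.
total_by_customer = defaultdict(float)
for f in features:
    total_by_customer[f["customer"]] += f["amount"]

print(features[0]["month"], features[0]["tier_code"])  # 1 2
print(dict(total_by_customer))  # {'A': 200.0, 'B': 40.0}
```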

Scaling Techniques

  • Normalization (min-max scaling): Rescales values to the range 0 to 1.

  • Standardization: Rescales values to a mean of zero and unit standard deviation.

  • Choose between normalization and standardization based on the data distribution.
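
The two scaling techniques above can be written out as a minimal sketch in plain Python. The function names and the sample heights are hypothetical. Normalization here means min-max scaling, (x - min) / (max - min), and standardization means the z-score, (x - mean) / std, computed with the population standard deviation.

```python
import statistics


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
normalized = normalize(heights_cm)
standardized = standardize(heights_cm)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)
```

Min-max scaling is sensitive to extreme values because the minimum and maximum set the range. Standardization does not bound the output, which is often acceptable when the data is roughly bell-shaped.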

Data Exploration & Visualization

Exploration and Visualization Techniques

  • Summary statistics: Mean, median, standard deviation

  • Statistical models: linear regression

  • Graphs: Scatter, bar, histogram
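
As a rough sketch of the first two items, the snippet below computes summary statistics and fits a simple linear regression by ordinary least squares using only the Python standard library. The study-hours and score values are invented for illustration. Plotting is left out so the example stays free of third-party dependencies.

```python
import statistics

# Hypothetical data: hours studied and exam scores.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

# Summary statistics for one variable.
summary = {
    "mean": statistics.fmean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

# Simple linear regression: scores ~ intercept + slope * hours.
mean_x = statistics.fmean(hours)
mean_y = statistics.fmean(scores)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(summary["mean"], summary["median"])  # 60.0 61.0
print(round(slope, 2), round(intercept, 2))  # 4.1 47.7
```

The fitted line can then be drawn over a scatter plot of the same points to check visually whether a straight line is a reasonable description of the data.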

Visualization Tools

  • Microsoft Excel, Google Sheets

  • Tableau, Power BI

  • Python Libraries: Matplotlib, Seaborn, Plotly

  • R and ggplot2, R Shiny

  • D3.js

Online Education Platforms

  • edX, DataCamp, Coursera, MIT OpenCourseWare, LinkedIn Learning

Conclusion

  • Thank you for attending the lecture!