Intro to Data Concepts

Data Value and Importance
  • Data is a vital asset for businesses and organizations across multiple industries, often referred to as the "new oil" of the digital economy.

  • It provides insights to improve operations, understand customer behavior through predictive analytics, and foster innovation via machine learning models.

  • Non-profits and governments leverage data to address global challenges, such as tracking disease outbreaks or optimizing resource distribution during crises.

  • Data-Driven Decision Making (DDDM): The practice of basing decisions on the analysis of data rather than purely on intuition.

Types of Data
Structured Data
  • Organized in rows and columns (e.g., spreadsheets or relational tables).

  • Highly organized and easily searched using Structured Query Language (SQL).

  • Follows a rigid schema where data must fit into predefined fields.

  • Examples: Names, dates, addresses, credit card numbers, and stock trade information.
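
A minimal sketch of the rigid-schema idea above, using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the "customers" table and its columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        name   TEXT NOT NULL,
        signup TEXT NOT NULL,  -- ISO date string, e.g. 2024-01-15
        city   TEXT NOT NULL
    )
    """
)

rows = [
    ("Ada", "2024-01-15", "London"),
    ("Grace", "2024-02-03", "New York"),
    ("Linus", "2024-02-20", "Helsinki"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Because every row has the same predefined fields, a SQL search is straightforward.
february = conn.execute(
    "SELECT name FROM customers WHERE signup >= '2024-02-01' ORDER BY name"
).fetchall()
print(february)  # [('Grace',), ('Linus',)]

# The schema is enforced: a row that leaves required fields empty is rejected.
try:
    conn.execute("INSERT INTO customers (name) VALUES ('Incomplete')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The rejected insert is the practical meaning of "data must fit into predefined fields": the database refuses records that do not match the declared structure.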

Unstructured Data
  • Lacks a predefined format or organization, making it more complex to collect and process.

  • Examples: Images, audio files, videos, social media posts, and PDF documents.

  • Often stored in Data Lakes rather than traditional databases.

  • A commonly cited estimate is that about 80% of data generated globally is unstructured; companies use Natural Language Processing (NLP) and AI to extract meaning from it.
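
As a rough illustration of extracting meaning from unstructured text, the sketch below counts word frequencies in a few made-up social media posts. This is a toy stand-in for NLP, not a real NLP pipeline; the posts and the tokenize helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical free-form posts: no rows, columns, or predefined fields.
posts = [
    "Loving the new phone, battery life is great!",
    "Battery died after two hours. Not great.",
    "Great camera, great screen, average battery.",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Impose structure after the fact: a table of word -> count.
counts = Counter(word for post in posts for word in tokenize(post))
top_terms = [word for word, _ in counts.most_common(2)]
print(top_terms)  # ['great', 'battery']
```

Note that word counts alone cannot tell a positive "great" from "Not great", which is one reason real systems rely on more capable NLP models.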

Databases
  • Databases are organized collections of data stored electronically.

  • Relational Databases (RDBMS): Use tables to store data. Each row is uniquely identified by a Primary Key, and tables are linked through Foreign Keys that reference the primary key of another table.

  • Non-Relational Databases (NoSQL): Better suited for unstructured or semi-structured data (e.g., MongoDB).

  • Scalability: Databases can scale vertically (adding resources to a single server) or horizontally (spreading data across multiple servers) to handle growing amounts of information.
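
The relationship between primary and foreign keys can be sketched with Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each table has a primary key; orders.customer_id is a foreign key into customers.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    )
    """
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
)

# The join follows the foreign key back to the matching primary key.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
    """
).fetchall()
print(totals)  # [('Ada', 65.0), ('Grace', 15.5)]

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```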

Data Types
Quantitative Data
  • Numerical data that can be measured and quantified.

  • Discrete Data: Counted values that cannot be divided (e.g., Number of employees = 50).

  • Continuous Data: Values that can be measured on a scale and broken down into smaller parts (e.g., Temperature = 98.6 or Height = 1.75 meters).
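
A small sketch of the discrete versus continuous distinction, using made-up values. Integers stand in for counts and floats for measurements; the variable names are hypothetical.

```python
from statistics import mean

# Discrete: whole counts. There is no such thing as 7.5 employees.
employees_per_team = [5, 8, 12, 50]

# Continuous: measurements that can always be recorded more finely.
heights_m = [1.62, 1.75, 1.80]

all_counts_whole = all(isinstance(n, int) for n in employees_per_team)
print(all_counts_whole)  # True

# A summary of discrete data can be fractional even though no single
# observation can be: 18.75 is a valid average but not a valid headcount.
average_team_size = mean(employees_per_team)
print(average_team_size)  # 18.75

# Between any two continuous values there is always another valid value.
low, high = heights_m[0], heights_m[1]
midpoint = (low + high) / 2
print(low < midpoint < high)  # True
```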

Qualitative Data
  • Non-numerical data describing characteristics, qualities, or attributes.

  • Nominal Data: Categories without a natural order (e.g., Eye color, hair color, or nationality).

  • Ordinal Data: Categories with a specific, meaningful order but undefined intervals (e.g., Education level: High School, Bachelor's, Master's; or customer satisfaction ratings: Poor, Fair, Good).
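
The difference between nominal and ordinal data shows up in which summaries make sense. In the sketch below (sample values are made up, and the rank mapping is a hypothetical helper), nominal data supports only counting, while ordinal data also supports sorting and a median.

```python
from collections import Counter

# Nominal: labels with no order. The most frequent value (mode) is meaningful,
# but sorting or averaging the labels is not.
eye_colors = ["brown", "blue", "brown", "green", "brown"]
mode_color = Counter(eye_colors).most_common(1)[0][0]
print(mode_color)  # brown

# Ordinal: labels with a defined order, given here explicitly.
SATISFACTION_ORDER = ["Poor", "Fair", "Good"]
rank = {label: position for position, label in enumerate(SATISFACTION_ORDER)}

ratings = ["Good", "Poor", "Fair", "Good", "Fair"]
sorted_ratings = sorted(ratings, key=rank.__getitem__)
print(sorted_ratings)  # ['Poor', 'Fair', 'Fair', 'Good', 'Good']

# The median uses only the order, so it is valid for ordinal data.
median_rating = sorted_ratings[len(sorted_ratings) // 2]
print(median_rating)  # Fair
```

A mean is deliberately not computed for the ratings: because the intervals between categories are undefined, treating Poor, Fair, Good as 0, 1, 2 and averaging them would assume equal spacing that the data does not guarantee.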

Data Collection and Analysis
  • The Analysis Process:

    1. Collection: Gathering raw data from various sources.

    2. Cleaning (Wrangling): Removing errors, duplicates, and inconsistencies to ensure data integrity.

    3. Analysis: Exploring patterns, correlations, and trends.

    4. Visualization: Presenting findings through charts and dashboards for stakeholders.

  • Understanding data types and structures is a foundational requirement for selecting the correct statistical tests and analytical tools.
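
The four steps above can be sketched end to end with the Python standard library. The records are invented, and the cleaning rules (drop exact duplicates, drop non-numeric counts) are simplifying assumptions; real cleaning rules depend on the data source.

```python
from statistics import mean

# 1. Collection: hypothetical raw records (store, units sold) as text.
raw = [
    ("north", "12"),
    ("south", "7"),
    ("north", "12"),  # exact duplicate
    ("east", "n/a"),  # value that cannot be parsed as a number
    ("south", "9"),
    ("east", "4"),
]

# 2. Cleaning: skip exact duplicates and rows whose count is not numeric.
seen = set()
clean = []
for store, units in raw:
    if (store, units) in seen or not units.isdigit():
        continue
    seen.add((store, units))
    clean.append((store, int(units)))

# 3. Analysis: average units sold per store.
by_store = {}
for store, units in clean:
    by_store.setdefault(store, []).append(units)
averages = {store: mean(values) for store, values in by_store.items()}

# 4. Visualization: a plain-text bar chart, one '#' per unit of the average.
for store, avg in sorted(averages.items()):
    print(f"{store:<6} {'#' * round(avg)}")
```

In practice the visualization step would typically use a charting library or dashboard tool; the text chart keeps the sketch dependency-free.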

Data Types
Quantitative Data
  • Numerical data that can be measured and quantified.

  • Discrete Data: Represents whole numbers or values that can be counted and cannot be divided.

    • Examples: 3 children, 76 DVDs, or 12 chickens.

  • Continuous Data: Represents values that can be measured on a scale and broken down into ever finer parts.

    • Examples: 13.15 minutes, 56.327 kilometers per hour, or 1.6764 meters.

Qualitative Data
  • Non-numerical, descriptive data describing characteristics, qualities, or attributes.

  • Nominal Data: Categories or labels without a natural order or rank.

    • Examples: Nationality (e.g., Greek), marital status (e.g., Married), or hair color (e.g., Blonde).

  • Ordinal Data: Categories with a specific, meaningful order but undefined or unequal intervals.

    • Examples: Survey responses (e.g., Very likely, Likely, Neutral, Unlikely, Very unlikely) or education levels (e.g., High School, Bachelor's, Master's).
