03_Types_of_Data_and_DataTypes_annotated

INTRODUCTION TO DATA SCIENCE

  • Overview of data science

  • Importance of data types in data analysis

Types of Data, Data Types and Data Categories

Recap of Last Week

  • Code snippet for loading a dataset:

    • import pandas as pd; df = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')

    • Sample data includes features like longitude, latitude, and housing prices:

      • Example Rows:

        • Row 0: -122.23, 37.88, 41.0, ...

        • Row 1: -122.22, 37.86, 21.0, ...

What is Data?

  • Definition: Data refers to raw information, facts, or statistics in various forms (numbers, text, images, etc.).

Broad Categories of Data

  • Quantitative

    • Discrete: Countable set of values (e.g., number of cars)

    • Continuous: Can take any value within a range (e.g., height, weight)

  • Qualitative: Descriptive, non-numeric data (e.g., labels or free text), which can be further classified as structured or unstructured.
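
The categories above can be illustrated with a small sketch. The records and field names below are invented for illustration, and the type-based rule is only a rough heuristic (an integer column is not always a count), not a definitive classifier.

```python
# Hypothetical survey records; the field names are made up for illustration.
records = [
    {"num_cars": 2, "height_cm": 171.5, "favorite_color": "blue"},
    {"num_cars": 0, "height_cm": 158.2, "favorite_color": "green"},
    {"num_cars": 1, "height_cm": 183.0, "favorite_color": "blue"},
]


def describe_column(rows, name):
    """Label a column by the Python type of its values (a rough heuristic)."""
    values = [row[name] for row in rows]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative (continuous)"
    return "qualitative"


summary = {name: describe_column(records, name) for name in records[0]}
```

Here `num_cars` holds whole-number counts, `height_cm` holds measurements that can take any value in a range, and `favorite_color` holds descriptive labels.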

Types of Data

  • Tabular Data: Structured data formatted in rows and columns, resembling spreadsheets.

  • Text Data: Human-readable text (e.g., reviews, articles).

  • Graph Data: Represents relationships and connections among entities (e.g., social networks).

  • Unstructured Data: Lacks a predefined structure, includes video, audio, images, etc.
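
As a rough sketch of how two of these types might look in code, the example below stores tabular data as a list of rows and graph data as an adjacency list. All names and values are invented for illustration; real projects would more likely use a dedicated library.

```python
# Tabular data: each dict is a row, each key a column (values are invented).
students = [
    {"name": "Ana", "age": 20, "grade": 88},
    {"name": "Ben", "age": 22, "grade": 75},
]
# Column-wise operations are natural on tabular data.
average_grade = sum(row["grade"] for row in students) / len(students)

# Graph data: an adjacency list mapping each person to their connections.
friends = {
    "Ana": ["Ben", "Cy"],
    "Ben": ["Ana"],
    "Cy": ["Ana"],
}
# The degree of a node is the number of connections it has.
degree = {person: len(links) for person, links in friends.items()}
```

The tabular form answers questions about columns (such as an average), while the graph form answers questions about relationships (such as who is most connected).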

Examples of Types of Data

  • Tabular Data: Demographic info, grades.

  • Text Data: Emails, social media posts.

  • Graph Data: Roads, social connections.

  • Unstructured Data:

    • Videos: TikTok, face recognition

    • Images: Photos, road signs

    • Audio: Music, real-time translation

    • Biometrics: Fingerprints, facial recognition

Data Formats

  • Common Formats:

    • CSV, TSV: Plain-text formats for storing structured (tabular) data.

    • Image formats: .jpg, .png

    • Audio formats: .wav, .mp3

    • SQL databases: MySQL, PostgreSQL

    • NoSQL databases: Bigtable, MongoDB

  • CSV vs. TSV: Both are used for tabular data but differ by how they separate values (comma vs. tab).
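
To make the comma-versus-tab difference concrete, here is a minimal sketch using Python's standard `csv` module. The table contents are invented, and `read_rows` is a hypothetical helper written for this example.

```python
import csv
import io

# The same small table, once comma-separated and once tab-separated.
csv_text = "city,temp_c\nOslo,4.5\nCairo,29.0\n"
tsv_text = "city\ttemp_c\nOslo\t4.5\nCairo\t29.0\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")
```

Both calls produce the same rows; only the delimiter differs. With pandas, the equivalent for a TSV file is to pass `sep="\t"` to `pd.read_csv`.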

Image Data

  • Structure: Organized as a grid of pixels; each pixel stores its color as red, green, and blue (RGB) channel values, typically 0-255 each.

  • Compression Types:

    • Lossy Compression: Reduces file size with some quality loss (e.g., JPEG).

    • Lossless Compression: Maintains original quality (e.g., PNG).
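
The pixel grid and the two compression types can be sketched with the standard library alone. The 2x2 image is invented. The lossless step uses `zlib`, which implements DEFLATE, the algorithm PNG also builds on. The lossy step is a crude quantization chosen only to show information being discarded; it is not how JPEG actually works.

```python
import zlib

# A tiny 2x2 image: rows of pixels, each pixel an (R, G, B) tuple in 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (250, 250, 250)],
]
height, width = len(image), len(image[0])

# Flatten to raw bytes so the pixel data can be compressed.
raw = bytes(channel for row in image for pixel in row for channel in pixel)

# Lossless: decompressing gives back exactly the original bytes.
restored = zlib.decompress(zlib.compress(raw))

# Crude lossy step (not JPEG): round each channel down to a multiple of 64.
# Fewer distinct values are easier to compress, but the originals are gone.
quantized = bytes((value // 64) * 64 for value in raw)
```

After the lossless round trip `restored` equals `raw`, whereas `quantized` differs from `raw` and the original values cannot be recovered from it.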

Databases

  • Databases: Organized collections of structured data, enabling efficient data management and retrieval.
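
A minimal sketch of storing and retrieving structured data, using SQLite through Python's built-in `sqlite3` module. SQLite is an assumption made here for convenience (it needs no server); the table and its rows are invented.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO students (name, grade) VALUES (?, ?)",
    [("Ana", 88), ("Ben", 75), ("Cy", 93)],
)

# Retrieve only the rows that match a condition, highest grade first.
top = conn.execute(
    "SELECT name, grade FROM students WHERE grade >= ? ORDER BY grade DESC",
    (80,),
).fetchall()
conn.close()
```

The query returns only the matching rows as tuples, which is the kind of retrieval a database handles for you instead of scanning a file by hand.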

JSON Format

  • JSON (JavaScript Object Notation): A lightweight format for data interchange, easily readable and writable for both humans and machines.

  • Structure: Uses key-value pairs and can represent strings, numbers, booleans, null, arrays, and nested objects.
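
As an illustration of the structure described above, this sketch parses a small JSON document with Python's standard `json` module. The record itself is invented.

```python
import json

# A JSON object with a string, a number, an array, and a nested object.
text = '{"name": "Ana", "age": 20, "courses": ["stats", "ml"], "address": {"city": "Oslo"}}'

# json.loads turns JSON text into Python dicts, lists, strings, and numbers.
record = json.loads(text)
city = record["address"]["city"]
first_course = record["courses"][0]

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.dumps(record, sort_keys=True)
```

JSON objects map to Python dictionaries and JSON arrays map to lists, so nested values are reached with ordinary indexing.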

XML/HTML

  • HTML: Used for creating web pages with a fixed set of tags.

  • XML: Designed for transporting and storing data, allowing for custom tags.
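
To show what custom tags look like in practice, here is a small sketch that parses an XML snippet with the standard `xml.etree.ElementTree` module. The `students`, `student`, `name`, and `grade` tags are invented for this example, which is exactly what XML permits and HTML does not.

```python
import xml.etree.ElementTree as ET

# Tag names are chosen by the author of the data, not fixed by a standard.
xml_text = """
<students>
  <student id="1"><name>Ana</name><grade>88</grade></student>
  <student id="2"><name>Ben</name><grade>75</grade></student>
</students>
""".strip()

root = ET.fromstring(xml_text)

# Walk the <student> elements and pull out child text and attributes.
grades = {s.findtext("name"): int(s.findtext("grade")) for s in root.findall("student")}
ids = [s.get("id") for s in root.findall("student")]
```

Element text and attributes arrive as strings, so numeric fields such as `grade` need an explicit conversion.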

Data Acquisition

  • Sources for data include company data, databases, the internet, and RESTful APIs.

Beautiful Soup for Parsing HTML

  • A Python library used for web scraping to extract data from web pages.
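
Beautiful Soup is a third-party package, so to stay dependency-free the sketch below uses the standard library's `html.parser` as a stand-in. It shows the same basic idea of extracting data from tags. The HTML snippet and the `LinkCollector` class are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


html = '<html><body><a href="/page1">One</a><p>text</p><a href="/page2">Two</a></body></html>'
collector = LinkCollector()
collector.feed(html)
links = collector.links
```

With Beautiful Soup installed, a comparable result would come from `BeautifulSoup(html, "html.parser").find_all("a")` and reading each tag's `href`.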

RESTful APIs

  • Web interfaces that provide a structured way to request data from a service over HTTP, allowing applications to communicate with each other.
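
A hedged sketch of what a REST request might look like with the standard library. The endpoint `https://api.example.com/v1/houses` and its `city` and `limit` parameters are hypothetical; a real service documents its own URLs and parameters, and many also require an API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; replace with a real service's documented URL.
BASE_URL = "https://api.example.com/v1/houses"


def build_url(base_url, params):
    """Append query parameters to the endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url(BASE_URL, {"city": "Oakland", "limit": 5})
# data = fetch_json(url)  # left commented out: the endpoint above is made up
```

Most REST APIs return JSON, so the response can usually be handled like any other parsed JSON document.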

Data Types

  • Overview: Categorizing data by type so it can be stored, processed, and analyzed efficiently.

Summary

  • Types of data (tabular, text, graph, unstructured) are crucial in the data science lifecycle.

  • Understanding file formats (e.g., CSV, JSON) aids in data preparation.

  • Databases play a central role in data management.

  • Data acquisition methods like RESTful APIs and web scraping are vital for initial data gathering.
