INTRODUCTION TO DATA SCIENCE
Overview of data science
Importance of data types in data analysis
Types of Data, Data Types and Data Categories
Recap of Last Week
Code snippet for loading a dataset:
import pandas as pd
df = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
Sample data includes features like longitude, latitude, and housing prices:
Example Rows:
Row 0: -122.23, 37.88, 41.0, ...
Row 1: -122.22, 37.86, 21.0, ...
What is Data?
Definition: Data refers to raw information, facts, or statistics in various forms (numbers, text, images, etc.).
Broad Categories of Data
Quantitative
Discrete: Countable set of values (e.g., number of cars)
Continuous: Can take any value within a range (e.g., height, weight)
Qualitative: Descriptive data which can be further classified into structured and unstructured data.
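The categories above can be sketched with pandas, which the loading snippet earlier already uses. This is only an illustration: the column names and values are made up, and treating integer columns as discrete and float columns as continuous is a rule of thumb, not a guarantee.

```python
import pandas as pd

# Hypothetical toy records; names and values are invented for illustration.
df = pd.DataFrame(
    {
        "num_cars": [1, 2, 0],                   # quantitative, discrete (a count)
        "height_cm": [170.2, 165.5, 180.1],      # quantitative, continuous (a measurement)
        "city": ["Austin", "Boston", "Denver"],  # qualitative (a label)
    }
)

# Numeric columns are quantitative; everything else is treated as qualitative.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

# Rule of thumb: integer dtype suggests discrete, float dtype suggests continuous.
discrete = [c for c in quantitative if pd.api.types.is_integer_dtype(df[c])]
continuous = [c for c in quantitative if pd.api.types.is_float_dtype(df[c])]

print("quantitative:", quantitative)
print("qualitative:", qualitative)
print("discrete:", discrete)
print("continuous:", continuous)
```

The rule of thumb can fail in practice: a ZIP code stored as an integer is a label, not a count, so the inferred category should always be checked against what the column means.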
Types of Data
Tabular Data: Structured data formatted in rows and columns, resembling spreadsheets.
Text Data: Human-readable text (e.g., reviews, articles).
Graph Data: Represents relationships and connections among entities (e.g., social networks).
Unstructured Data: Lacks a predefined structure, includes video, audio, images, etc.
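Graph data is the least familiar of these types, so a small sketch may help. It stores a social network as an adjacency list using only the Python standard library; the names and friendships are hypothetical.

```python
from collections import defaultdict

# Hypothetical friendships; each pair is an undirected edge between two people.
edges = [("Ana", "Ben"), ("Ben", "Cho"), ("Ana", "Cho"), ("Cho", "Dev")]

# Adjacency list: each person maps to the set of people they are connected to.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def degree(node):
    """Return the number of direct connections a node has."""
    return len(adjacency[node])


print(sorted(adjacency["Cho"]))
print(degree("Cho"))
```

The same edges could also be stored as a two-column table, which shows that one dataset can often be represented as more than one type of data.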
Examples of Types of Data
Tabular Data: Demographic info, grades.
Text Data: Emails, social media posts.
Graph Data: Roads, social connections.
Unstructured Data:
Videos: TikTok, face recognition
Images: Photos, road signs
Audio: Music, real-time translation
Biometrics: Fingerprints, facial recognition
Data Formats
Common Formats:
CSV, TSV: Plain-text formats for storing structured (tabular) data.
Image formats: .jpg, .png
Audio formats: .wav, .mp3
SQL databases: MySQL, PostgreSQL
NoSQL databases: Bigtable, MongoDB
CSV vs. TSV: Both are used for tabular data but differ by how they separate values (comma vs. tab).
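A minimal sketch of the difference, using Python's built-in csv module on in-memory text. The values come from the example rows earlier; the third column name is an assumption made for illustration.

```python
import csv
import io

# The same record written with two different separators.
# The header "housing_median_age" is assumed for illustration.
csv_text = "longitude,latitude,housing_median_age\n-122.23,37.88,41.0\n"
tsv_text = "longitude\tlatitude\thousing_median_age\n-122.23\t37.88\t41.0\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")

print(csv_rows[0])
print(csv_rows == tsv_rows)
```

Only the delimiter argument changes between the two calls. With pandas the equivalent is pd.read_csv(path, sep="\t") for a TSV file.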
Image Data
Structure: Organized as a grid of pixels; each pixel stores its color as red, green, and blue (RGB) channel values, typically from 0 to 255.
Compression Types:
Lossy Compression: Reduces file size with some quality loss (e.g., JPEG).
Lossless Compression: Maintains original quality (e.g., PNG).
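The pixel grid and the two kinds of compression can be sketched with the standard library. This is a toy under stated assumptions: the 2x2 image is invented, zlib stands in for the DEFLATE-style compression that PNG builds on, and dropping the low bits of each channel is a simple stand-in for "lossy", not how JPEG actually works.

```python
import zlib

# A hypothetical 2x2 image: rows of (R, G, B) pixels, each channel in 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
height, width = len(image), len(image[0])

# Flatten the grid into raw bytes: 3 bytes per pixel.
raw = bytes(channel for row in image for pixel in row for channel in pixel)

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(raw))

# Lossy (toy): keep only the top 4 bits of every channel value.
# The discarded detail cannot be recovered afterwards.
quantized = bytes((value >> 4) << 4 for value in raw)

print(len(raw), restored == raw, quantized == raw)
```

On an image this small the compressed bytes are not actually shorter than the original; the point of the sketch is only that the lossless path restores every byte while the lossy path does not.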
Databases
Databases: Organized collections of structured data, enabling efficient data management and retrieval.
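As a small illustration of storing and retrieving structured data, the sketch below uses SQLite through Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are hypothetical.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Hypothetical table of student grades.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO students (name, grade) VALUES (?, ?)",
    [("Ana", 91.5), ("Ben", 78.0), ("Cho", 85.25)],
)
conn.commit()

# Retrieval: the query describes what is wanted, and the database finds it.
top = conn.execute(
    "SELECT name, grade FROM students WHERE grade >= ? ORDER BY grade DESC",
    (85,),
).fetchall()
conn.close()

print(top)
```

The ? placeholders pass values separately from the SQL text, which is the usual way to avoid SQL injection.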
JSON Format
JSON (JavaScript Object Notation): A lightweight format for data interchange, easily readable and writable for both humans and machines.
Structure: Uses key-value pairs and can represent various data types including strings, arrays, and objects.
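A short sketch of reading and writing JSON with Python's built-in json module. The record is hypothetical; it is chosen to show strings, numbers, booleans, arrays, nested objects, and null in one place.

```python
import json

# Hypothetical record covering the main JSON value types.
text = (
    '{"name": "Ana", "age": 30, "enrolled": true, '
    '"courses": ["stats", "ml"], "address": {"city": "Austin"}, '
    '"nickname": null}'
)

# Parsing: objects become dicts, arrays become lists, true/null become True/None.
record = json.loads(text)

# Serializing back to text; indent=2 makes it easier for people to read.
round_trip = json.dumps(record, indent=2)

print(record["courses"][1])
print(round_trip)
```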
XML/HTML
HTML: Used for creating web pages with a fixed set of tags.
XML: Designed for transporting and storing data, allowing for custom tags.
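To show what custom tags look like in practice, the sketch below parses a small XML snippet with Python's built-in xml.etree.ElementTree. The houses and house tags are invented for illustration; the coordinates reuse the example rows from earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document with custom tags chosen by its author.
xml_text = """
<houses>
  <house id="0"><longitude>-122.23</longitude><latitude>37.88</latitude></house>
  <house id="1"><longitude>-122.22</longitude><latitude>37.86</latitude></house>
</houses>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <house> element into a plain dict.
records = [
    {
        "id": house.get("id"),  # attribute value, kept as a string
        "longitude": float(house.findtext("longitude")),
        "latitude": float(house.findtext("latitude")),
    }
    for house in root.findall("house")
]

print(records)
```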
Data Acquisition
Sources for data include company data, databases, the internet, and RESTful APIs.
Beautiful Soup for Parsing HTML
A Python library used for web scraping to extract data from web pages.
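A minimal sketch of extracting data with Beautiful Soup, assuming the bs4 package is installed. The HTML is a hypothetical inline snippet rather than a real page, so no network access is needed; the tag and class names are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; a real scraper would download this first.
html = """
<html><body>
  <h1>Reviews</h1>
  <ul>
    <li class="review"><a href="/r/1">Great location</a></li>
    <li class="review"><a href="/r/2">Too noisy</a></li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser library is needed.
soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text(strip=True)

# Collect the text and link of every list item marked as a review.
reviews = [
    {"text": li.get_text(strip=True), "link": li.a["href"]}
    for li in soup.find_all("li", class_="review")
]

print(title)
print(reviews)
```

Before scraping a real site, it is worth checking its terms of service and robots.txt.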
RESTful APIs
Web interfaces that give applications a structured way to request data from a service over HTTP, allowing communication between applications.
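A sketch of what calling such an API can look like with Python's standard library. The endpoint https://api.example.com/v1/houses, its query parameters, and the response shape are all hypothetical; a real service documents its own. The network call is wrapped in a function that is not run here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.example.com/v1/houses"  # hypothetical endpoint


def build_request(params):
    """Build a GET request whose query string encodes the given parameters."""
    url = f"{BASE_URL}?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


def parse_body(body):
    """Decode a JSON response body (bytes) into Python objects."""
    return json.loads(body.decode("utf-8"))


def fetch(params):
    """Send the request; needs network access and a real endpoint."""
    with urlopen(build_request(params), timeout=10) as response:
        return parse_body(response.read())


# Build a request and parse a made-up response without touching the network.
request = build_request({"city": "Berkeley", "limit": 2})
sample = parse_body(b'[{"longitude": -122.23, "latitude": 37.88}]')

print(request.full_url)
print(sample[0]["latitude"])
```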
Data Types
Overview: assigning each piece of data the right type makes analysis more efficient.
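As one possible illustration, the pandas sketch below converts columns that arrive as text into more suitable types. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw data in which every value arrives as text.
raw = pd.DataFrame(
    {
        "rooms": ["3", "5", "4"],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    }
)

typed = raw.assign(
    # Digits stored as text become real numbers that support arithmetic.
    rooms=pd.to_numeric(raw["rooms"]),
    # Repeated labels become a categorical column with a fixed set of values.
    ocean_proximity=raw["ocean_proximity"].astype("category"),
)

print(typed.dtypes)
print(typed["rooms"].sum())
```

With the right types in place, numeric operations such as sum work directly, and repeated labels are typically stored more compactly.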
Summary
Types of data (tabular, text, graph, unstructured) are crucial in the data science lifecycle.
Understanding file formats (e.g., CSV, JSON) aids in data preparation.
Databases play a central role in data management.
Data acquisition methods like RESTful APIs and web scraping are vital for initial data gathering.