Overview of data science
Importance of data types in data analysis
Code snippet for loading a dataset:
import pandas as pd  # pandas provides read_csv
df = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
Sample data includes features like longitude, latitude, and housing prices:
Example Rows:
Row 0: -122.23, 37.88, 41.0, ...
Row 1: -122.22, 37.86, 21.0, ...
Definition: Data refers to raw information, facts, or statistics in various forms (numbers, text, images, etc.).
Quantitative: Numerical data, either discrete or continuous.
Discrete: Countable set of values (e.g., number of cars)
Continuous: Any value within a range (e.g., height, weight)
Qualitative: Descriptive data that can be further classified into structured and unstructured data.
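The discrete/continuous distinction can be sketched in a few lines of Python. The sample values and the helper name is_discrete are invented for illustration, and the whole-number check is only a rough heuristic, not a formal definition.

```python
# Hypothetical sample values, invented for illustration.
cars_per_household = [0, 1, 2, 2, 3]      # discrete: counts
heights_m = [1.62, 1.755, 1.80, 1.683]    # continuous: measurements


def is_discrete(values):
    """Heuristic: treat a column as discrete when every value is a whole number."""
    return all(float(v).is_integer() for v in values)


cars_discrete = is_discrete(cars_per_household)
heights_discrete = is_discrete(heights_m)
```

A real dataset would need more care, since continuous measurements are sometimes rounded to whole numbers.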
Tabular Data: Structured data formatted in rows and columns, resembling spreadsheets.
Text Data: Human-readable text (e.g., reviews, articles).
Graph Data: Represents relationships and connections among entities (e.g., social networks).
Unstructured Data: Lacks a predefined structure, includes video, audio, images, etc.
Examples:
Tabular Data: Demographic information, grades.
Text Data: Emails, social media posts.
Graph Data: Roads, social connections.
Unstructured Data:
Videos: TikTok, face recognition
Images: Photos, road signs
Audio: Music, real-time translation
Biometrics: Fingerprints, facial recognition
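As a rough sketch, tabular, text, and graph data can each be represented with plain Python structures. All names and values below are hypothetical and chosen only to mirror the examples above.

```python
# Tabular data: rows and columns (hypothetical student grades).
table = [
    {"name": "Ana", "grade": 91},
    {"name": "Ben", "grade": 78},
]

# Text data: a free-form, human-readable string (hypothetical review).
review = "Great location, but the rooms were small."

# Graph data: an adjacency list mapping each person to their connections.
friends = {"Ana": ["Ben", "Cy"], "Ben": ["Ana"], "Cy": ["Ana"]}

# Each shape supports a different kind of question.
average_grade = sum(row["grade"] for row in table) / len(table)
word_count = len(review.split())
degree = {person: len(neighbors) for person, neighbors in friends.items()}
```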
Common Formats:
CSV, TSV: Plain-text formats for storing structured (tabular) data.
Image formats: .jpg, .png
Audio formats: .wav, .mp3
SQL databases: MySQL, PostgreSQL
NoSQL databases: Bigtable, MongoDB
CSV vs. TSV: Both store tabular data but differ in how they separate values (comma vs. tab).
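A minimal sketch of the delimiter difference, using Python's standard csv module. The row values come from the sample rows above; the column names and the helper read_rows are assumptions made for illustration.

```python
import csv
import io

# The same record written with two different separators.
# Column names are assumed for illustration.
csv_text = "longitude,latitude,housing_median_age\n-122.23,37.88,41.0\n"
tsv_text = "longitude\tlatitude\thousing_median_age\n-122.23\t37.88\t41.0\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")
```

Once parsed, both inputs yield the same rows; only the separator passed to the reader changes.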
Image structure: Images are organized as grids of pixels; each pixel stores color in RGB channels, each with a value from 0 to 255.
Compression Types:
Lossy Compression: Reduces file size with some quality loss (e.g., JPEG).
Lossless Compression: Maintains original quality (e.g., PNG).
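The pixel grid and the two compression types can be illustrated with a small sketch. The 2x2 image is hypothetical. zlib stands in for a lossless codec (PNG's compression is based on the same DEFLATE method), and simple quantization stands in for lossy compression; it is not how JPEG actually works.

```python
import zlib

# A hypothetical 2x2 image: each pixel is an (R, G, B) tuple with values 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# Flatten the grid into raw bytes, three bytes per pixel.
raw = bytes(channel for row in image for pixel in row for channel in pixel)

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(raw))

# Lossy stand-in: rounding each channel down to a multiple of 64 discards detail
# that cannot be recovered afterwards.
quantized = bytes((value // 64) * 64 for value in raw)
```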
Databases: Organized collections of structured data, enabling efficient data management and retrieval.
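A minimal sketch of storing and retrieving structured data, using Python's built-in sqlite3 module with an in-memory database. The table name houses and its column names are assumptions; the values are taken from the sample rows above.

```python
import sqlite3

# An in-memory database avoids creating any file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (longitude REAL, latitude REAL, age REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?, ?)",
    [(-122.23, 37.88, 41.0), (-122.22, 37.86, 21.0)],
)

# Retrieval: a query selects only the rows and columns of interest.
rows = conn.execute(
    "SELECT latitude, age FROM houses WHERE age > ?", (30,)
).fetchall()
conn.close()
```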
JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate.
JSON structure: Uses key-value pairs and can represent data types including strings, numbers, arrays, and objects.
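A short sketch of JSON's key-value structure, using Python's standard json module. The record contents are hypothetical.

```python
import json

# A hypothetical record mixing strings, an array, and a nested object.
record = {
    "name": "housing",
    "columns": ["longitude", "latitude"],
    "source": {"format": "csv", "rows": 2},
}

# Serialize to JSON text, then parse it back into Python objects.
text = json.dumps(record, indent=2)
parsed = json.loads(text)
```

The round trip maps JSON objects to dictionaries and JSON arrays to lists.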
HTML: Used for creating web pages with a fixed set of tags.
XML: Designed for transporting and storing data, allowing for custom tags.
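To illustrate custom tags, here is a small sketch that parses an XML fragment with Python's standard xml.etree.ElementTree module. The tag names house, longitude, and latitude are invented for this example; unlike HTML tags, they carry whatever meaning the data's author assigns.

```python
import xml.etree.ElementTree as ET

# Custom tags describe the data itself (tag names are hypothetical).
xml_text = "<house><longitude>-122.23</longitude><latitude>37.88</latitude></house>"

root = ET.fromstring(xml_text)

# Map each child tag to its numeric value.
house = {child.tag: float(child.text) for child in root}
```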
Sources for data include company data, databases, the internet, and RESTful APIs.
Web scraping library: A Python library used to extract data from web pages.
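A minimal sketch of extracting data from a page, assuming the goal is to collect link targets. It uses the standard library's html.parser as a stand-in, which is not necessarily the library these notes refer to. The class name LinkCollector and the page content are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A hypothetical page held in a string; a real scraper would download it first.
page = (
    '<html><body><a href="/data.csv">Data</a>'
    '<p>No link</p><a href="/about">About</a></body></html>'
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
```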
APIs: Provide a structured way to access data from services and allow applications to communicate with each other.
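The request/response pattern of a RESTful API can be sketched without touching the network. The endpoint URL, the query parameters, and the response body below are all hypothetical; a real call would send the URL with a client such as urllib.request.urlopen and read the body from the response.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v1/houses"  # hypothetical endpoint


def build_request_url(base_url, **params):
    """Append query parameters to the base URL."""
    return f"{base_url}?{urlencode(params)}"


url = build_request_url(BASE_URL, city="Berkeley", limit=2)

# A canned JSON body standing in for what such a service might return.
response_body = '{"results": [{"latitude": 37.88}, {"latitude": 37.86}]}'
latitudes = [item["latitude"] for item in json.loads(response_body)["results"]]
```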
Summary: Categorizing data efficiently facilitates analysis.
Types of data (tabular, text, graph, unstructured) are crucial in the data science lifecycle.
Understanding file formats (e.g., CSV, JSON) aids in data preparation.
Databases play a central role in data management.
Data acquisition methods like RESTful APIs and web scraping are vital for initial data gathering.