Overview of data science concepts.
Definitions of types of data, data types, and data categories.
Code example: import pandas as pd; df = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
Display the first five rows of the dataset using df.head().
Example data displayed, with the following columns:
Longitude, Latitude, Housing Median Age, Total Rooms, Total Bedrooms, Population, Households, Median Income, Median House Value, and Ocean Proximity.
Introduction to various types of data relevant to data science.
Definition of Data: Raw information, facts, or statistics in various forms (numbers, text, images, etc.).
Quantitative Data: Numerical data that can be measured.
Discrete Data: Countable values (e.g., number of cars, laptops).
Continuous Data: Measurable values (e.g., height, weight).
Qualitative Data: Descriptive data that can be categorized but not measured numerically.
Includes Structured Data and Unstructured Data.
Types of data we will cover in the course:
Tabular, Text, Images, JSON, XML, HTML, Audio.
Tabular Data: Structured data organized in rows and columns; resembles spreadsheets or database tables.
Examples: Demographic information, grades, etc.
Text Data: Examples include reviews, articles, emails, and social media posts.
Focus on natural language and human-readable text.
Graph Data: Represents relationships between entities using nodes and edges.
Examples: Social connections, websites, network traffic.
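The node-and-edge idea can be sketched in a few lines of Python. This is a minimal illustration, not a full graph library: the names (ana, ben, cho, dev) and the helper build_adjacency are made up for the example.

```python
# Hypothetical friendship graph given as a list of undirected edges.
edges = [("ana", "ben"), ("ben", "cho"), ("ana", "dev")]

def build_adjacency(edge_list):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

graph = build_adjacency(edges)
print(sorted(graph["ana"]))  # neighbors of "ana"
```

An adjacency dictionary like this makes "who is connected to whom" a direct lookup, which is the usual starting point for traversals.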
Unstructured Data: Lacks a predefined structure; challenging to analyze.
Examples:
Videos (e.g., TikTok)
Images (James Webb Space Telescope photos, faces, handwriting)
Audio (Alexa, music)
Biometrics (fingerprints, facial recognition)
Haptics (phone notifications)
Tabular Data: Heights of class members.
Graph Data: Social networks and dependencies, coursework prerequisites.
Geo Data: Flight paths, weather patterns.
Raw Data: Images, video, audio, telemetry data.
Hierarchies:
Taxonomy, family trees, file directories.
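A hierarchy such as a file directory can be modeled as nested dictionaries. The sketch below assumes a made-up directory layout and a hypothetical helper, list_paths; a folder maps to its children and None marks a file.

```python
# Hypothetical directory tree: a dict is a folder, None is a file (a leaf).
tree = {
    "project": {
        "data": {"housing.csv": None},
        "notebooks": {"intro.ipynb": None},
        "README.md": None,
    }
}

def list_paths(node, prefix=""):
    """Return every file path in the tree, visiting children depth first."""
    paths = []
    for name, child in node.items():
        path = f"{prefix}/{name}" if prefix else name
        if child is None:
            paths.append(path)
        else:
            paths.extend(list_paths(child, path))
    return paths

paths = list_paths(tree)
print(paths)
```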
Common formats: CSV, image formats (.jpg, .png), audio formats (.wav, .mp3), SQL databases.
CSV (Comma-Separated Values): Plain-text format with one record per line and columns separated by commas.
TSV (Tab-Separated Values): Rows and columns separated by tabs.
These formats facilitate data import/export across various tools.
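To see how the two formats differ only in their delimiter, here is a small sketch using Python's standard csv module. The rows and the helper serialize are illustrative assumptions, not data from the course.

```python
import csv
import io

# Illustrative rows (not taken from the lecture's dataset).
rows = [
    ["name", "city"],
    ["Ada", "London, UK"],  # the comma inside this value forces quoting in CSV
    ["Grace", "New York"],
]

def serialize(rows, delimiter):
    """Write rows to a string using the given column delimiter."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

csv_text = serialize(rows, ",")
tsv_text = serialize(rows, "\t")
print(csv_text)
print(tsv_text)
```

Note that the CSV writer quotes the value containing a comma, while the TSV output leaves it untouched because a comma is not special there.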
An example CSV file (classic rock playlist).
Structure includes Artist, Music, Album, Year, Genre.
Example format of CSV file:
`Artist, Music, Album, Year, Genre`.
Use Python's pandas library for data manipulation.
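A minimal sketch of reading a file with this column layout follows. It uses the standard csv module rather than pandas so that it runs without extra installs, and the two rows are illustrative stand-ins, not the lecture's actual playlist file.

```python
import csv
import io

# Illustrative contents using the Artist/Music/Album/Year/Genre layout.
playlist_csv = """Artist,Music,Album,Year,Genre
Led Zeppelin,Stairway to Heaven,Led Zeppelin IV,1971,Rock
Queen,Bohemian Rhapsody,A Night at the Opera,1975,Rock
"""

# DictReader uses the first line as column names for every following row.
reader = csv.DictReader(io.StringIO(playlist_csv))
songs = list(reader)

# Every field is read as text, so numeric columns need explicit conversion.
years = [int(song["Year"]) for song in songs]
print(songs[0]["Artist"], years)
```

With pandas installed, passing the same file-like object to pd.read_csv should give a DataFrame with these five columns and the Year column already parsed as integers.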
Image Data: Visual content described by properties such as colors, shapes, and pixel values.
Images are composed of pixels organized in a grid.
Each pixel holds color information (RGB channels).
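The pixel grid can be illustrated with plain Python lists. This is a toy 2x2 image with made-up values; the helper to_grayscale uses a simple unweighted channel average, which is one of several possible conversions.

```python
# A tiny 2x2 image as a grid (list of rows) of (R, G, B) tuples, each 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # red, green
    [(0, 0, 255), (255, 255, 255)],  # blue, white
]

def to_grayscale(img):
    """Average the three channels of each pixel (simple, unweighted method)."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]

gray = to_grayscale(image)
height, width = len(image), len(image[0])
print(height, width, gray)
```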
Lossy Compression: Reduces size by sacrificing some data (e.g., JPEG).
Lossless Compression: Retains all original data, so quality is preserved; used for critical images (e.g., PNG).
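The difference between the two can be sketched with the standard zlib module. The pixel values are made up, and the "lossy" step is a toy quantization standing in for the idea of discarding detail; it is not how JPEG actually works.

```python
import zlib

# Hypothetical row of 8-bit grayscale pixel values, repeated to give some bulk.
pixels = bytes([12, 13, 13, 14, 200, 201, 203, 203] * 32)

# Lossless: compress with DEFLATE (the algorithm PNG uses), then restore exactly.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

# Lossy (toy stand-in, not real JPEG): round each value down to a multiple of 16.
quantized = bytes((value // 16) * 16 for value in pixels)

print(len(pixels), len(compressed))
print(restored == pixels)   # lossless round trip recovers every byte
print(quantized == pixels)  # quantization discarded detail for good
```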
Databases: Organized collections of structured information stored electronically.
Manages complex data relationships efficiently.
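As a rough illustration of how a database handles relationships, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The students and grades tables, their columns, and the rows are all assumptions made for the example.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE grades (student_id INTEGER REFERENCES students(id), "
    "course TEXT, score REAL)"
)
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [(1, "Math", 90.0), (1, "Art", 80.0), (2, "Math", 70.0)],
)

# A join follows the relationship between the two tables.
averages = conn.execute(
    "SELECT s.name, AVG(g.score) FROM students s "
    "JOIN grades g ON g.student_id = s.id "
    "GROUP BY s.name ORDER BY s.name"
).fetchall()
conn.close()
print(averages)
```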
JSON: Lightweight data interchange format, easy for humans to read and for machines to parse.
Used in web APIs and client-server communication.
Represents data with key-value pairs; organized hierarchically.
Supports various data types such as strings, numbers, arrays, objects, etc.
Example showcasing the structure of JSON data:
Demonstrates nested data with objects and arrays.
Use the json module to work with JSON data in Python:
json.dumps(): Convert Python objects to JSON format.
json.loads(): Convert JSON back to Python objects.
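A short round trip shows both calls together with nested data. The record below is hypothetical; it simply combines a nested object and an array.

```python
import json

# Hypothetical record with a nested object and an array.
record = {
    "name": "Ana",
    "age": 30,
    "address": {"city": "Lisbon", "country": "Portugal"},
    "courses": ["Math", "Art"],
}

text = json.dumps(record, indent=2)  # Python object -> JSON string
parsed = json.loads(text)            # JSON string -> Python object

print(text)
print(parsed["address"]["city"], parsed["courses"][0])
```

JSON objects come back as Python dicts and JSON arrays as lists, so nested values are reached with ordinary indexing.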
HTML: Used for webpage creation; predefined tags for content.
XML: Used for data transport and storage; allows custom tags.
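To illustrate XML's custom tags, here is a minimal sketch using the standard xml.etree.ElementTree module. The document and its tag names (playlist, song, artist, title) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; <playlist>, <song>, etc. are custom tag names.
xml_text = """
<playlist>
  <song year="1971"><artist>Led Zeppelin</artist><title>Stairway to Heaven</title></song>
  <song year="1975"><artist>Queen</artist><title>Bohemian Rhapsody</title></song>
</playlist>
""".strip()

root = ET.fromstring(xml_text)

# Child elements are found by tag name; attributes are read with .get().
titles = [song.findtext("title") for song in root.findall("song")]
years = [int(song.get("year")) for song in root.findall("song")]
print(root.tag, titles, years)
```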
Sources to get data:
Provided by companies.
Gathered from databases and the internet.
Using RESTful APIs.
Python library for parsing HTML and XML.
Facilitates web scraping and data extraction.
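The notes do not name the library here, so the sketch below uses the standard library's html.parser as a stand-in to show the general idea of extracting data from markup. The LinkCollector class and the page fragment are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical page fragment.
html_text = '<p>See <a href="/data.csv">data</a> and <a href="/about">about</a>.</p>'

collector = LinkCollector()
collector.feed(html_text)
links = collector.links
print(links)
```

Dedicated scraping libraries typically offer a higher-level search interface, but the underlying step is the same: walk the tags and pull out the pieces of interest.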
RESTful APIs: Structured way to access web data; relies on requests and responses.
Documentation is crucial for proper usage and data interpretation.
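The request/response pattern can be sketched without touching the network. Everything specific here is an assumption: the endpoint https://api.example.com/v1/houses, its parameters, the helper build_request, and the response body are placeholders, and a real API's documentation defines the actual ones.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_request(base_url, params):
    """Build (but do not send) a GET request with query parameters."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")

# Hypothetical endpoint and parameters.
request = build_request(
    "https://api.example.com/v1/houses", {"city": "Fresno", "limit": 2}
)

# Hypothetical response body, standing in for what a server might return.
response_body = '{"count": 2, "results": [{"id": 1}, {"id": 2}]}'
data = json.loads(response_body)

print(request.get_method(), request.full_url)
print(data["count"], [item["id"] for item in data["results"]])
```

Sending the request would be a matter of passing it to urllib.request.urlopen and reading the body, after which the JSON parsing step is unchanged.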
Overview of data types in the context of data science.
Revisits key data categories:
Quantitative (Discrete, Continuous) and Qualitative.
Further classification of data:
Continuous or Discrete.
Categorical or Non-Categorical.
Ordinal or Non-Ordinal.
Discrete Attribute: Finite/countable values (e.g., zip codes).
Continuous Attribute: Real numbers (e.g., weight measures).
Nominal: Categorical values (e.g., profession).
Ordinal: Values with order (e.g., rankings).
Binary: Only two states (0 and 1).
Interval: Equal-sized units, so differences are meaningful but ratios are not (e.g., temperature in Celsius).
Ratio: Both differences and ratios are meaningful (e.g., length).
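The attribute types above determine which operations make sense. The sketch below uses made-up values to show one sensible operation per type; the variable names are illustrative only.

```python
# Illustrative values for each attribute type (made up for this sketch).
professions = ["nurse", "teacher", "nurse"]  # nominal: only equality and counts
sizes = ["small", "large", "medium"]         # ordinal: order matters
temps_c = [10.0, 20.0]                       # interval: differences are meaningful
lengths_m = [2.0, 4.0]                       # ratio: true zero, ratios are meaningful

# Nominal: counting occurrences is valid; sorting or averaging is not.
nurse_count = professions.count("nurse")

# Ordinal: sort with an explicit rank, since alphabetical order would be wrong.
rank = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(sizes, key=rank.get)

# Interval: 20 °C is 10 degrees warmer than 10 °C, but not "twice as hot".
temp_difference = temps_c[1] - temps_c[0]
kelvin_ratio = (temps_c[1] + 273.15) / (temps_c[0] + 273.15)

# Ratio: 4 m really is twice as long as 2 m.
length_ratio = lengths_m[1] / lengths_m[0]

print(nurse_count, sorted_sizes, temp_difference, round(kelvin_ratio, 3), length_ratio)
```

Converting the Celsius values to Kelvin, which has a true zero, shows the actual ratio is close to 1 rather than 2.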
Types of Data: Impact on data preparation in data science.
File Formats: Essential for data ingestion and transformation.
Databases: Central to data management.
Data Acquisition: RESTful APIs and web scraping as data gathering methods.