Data Science Notes
Introduction to Data Science
Quote of the Day
"Knowledge without action is like a tree without fruit"
Hazrat Ali (S.A)
Overview
Course Logistics
Lecture No: 01
Instructor: Ms. Kauser Shaheen
Email: kauser.shaheen@iobm.edu.pk
Marks Distribution:
Assignments: 10
Quizzes: 15
Class Participation: 5
Midterm Exam: 30
Final Exam: 40
Outline
Data
Types of Data
Structured Data
Simplest Form of Data
Can Data Speak?
Objects and Features of a Data Table
Dimension, Vector, Proximity, Distance, and Similarity Measurements
Introduction to Data Science
Importance of Data Science
Big Data
Technology and Tools
Data Science Components
Business Intelligence vs. Data Science
Applications of Data Science
Understanding Data
Definition
Data: Facts and statistics collected for reference or analysis.
Three aspects of data:
Data comes from facts and statistics.
Data is collected.
Data is used for reference or analysis.
Focus of course: Analysis part of data.
Datum: Singular form of data.
Data Acquisition
Definition: The process of collecting data from a variety of sources.
Use available office data.
If unavailable, collect primary or secondary data.
Simplest Form of Data
A table is the simplest form of data. Most data science algorithms today utilize tabular data as inputs.
More complex data (text, images, etc.) is often converted to tabular format so that the analysis can leverage existing tools.
Example Table:
| Name | Salary ($) | Age (Years) |
|---|---|---|
| Jane | 90000 | 52 |
| John | 85000 | 48 |
| Delilah | 75000 | 32 |
Can Data Speak?
Examine the table and identify interesting findings:
Jane and John earn the highest salaries.
Delilah earns the least.
Jane and John are the oldest.
Delilah is the youngest.
These findings reveal a trend: in this table, salary rises with age, which suggests that experience (for which age is a rough proxy) correlates with salary.
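The findings above can be reproduced with a few lines of code. This is a minimal sketch using only the rows shown in the example table; the variable names are illustrative and not part of the lecture.

```python
# Rows copied from the example table (name, salary in $, age in years).
rows = [
    {"name": "Jane", "salary": 90000, "age": 52},
    {"name": "John", "salary": 85000, "age": 48},
    {"name": "Delilah", "salary": 75000, "age": 32},
]

# Extremes for each numeric column.
highest_paid = max(rows, key=lambda r: r["salary"])
lowest_paid = min(rows, key=lambda r: r["salary"])
oldest = max(rows, key=lambda r: r["age"])
youngest = min(rows, key=lambda r: r["age"])

# If sorting by age and sorting by salary give the same order of people,
# salary rises with age in this (very small) table.
by_age = [r["name"] for r in sorted(rows, key=lambda r: r["age"])]
by_salary = [r["name"] for r in sorted(rows, key=lambda r: r["salary"])]
salary_rises_with_age = by_age == by_salary

print("Highest paid:", highest_paid["name"])
print("Lowest paid:", lowest_paid["name"])
print("Oldest:", oldest["name"])
print("Youngest:", youngest["name"])
print("Salary rises with age:", salary_rises_with_age)
```

With only three rows this is an observation, not evidence of a general relationship.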
Objects and Features of a Data Table
Data is described using objects and features.
Objects: Rows in a dataset (individual entries).
Features: Columns in a dataset (attributes of entries).
Example: In the table above, each person (Jane, John, Delilah) is an object, and Name, Salary, and Age are the features.
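One way to picture objects and features in code is a list of records, where each record is an object and each key is a feature. The sketch below assumes the example table from these notes; the list-of-dictionaries layout is just one convenient representation.

```python
# Each dictionary is one object (a row); each key is one feature (a column).
table = [
    {"Name": "Jane", "Salary ($)": 90000, "Age (Years)": 52},
    {"Name": "John", "Salary ($)": 85000, "Age (Years)": 48},
    {"Name": "Delilah", "Salary ($)": 75000, "Age (Years)": 32},
]

objects = table                    # rows
features = list(table[0].keys())   # column names

n_objects = len(objects)
n_features = len(features)

print(f"{n_objects} objects, {n_features} features: {features}")
```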
Mathematical Spaces
Definition of Space: In data science, a space is the mathematical space whose axes are the dimensions (features) of the data.
Example: A dataset with two features can be plotted on a two-axis (x, y) coordinate system.
Dimensions in Data Science
Each dimension is a direction (axis) in the space, and the number of dimensions equals the number of features.
Example: A three-column table is considered a three-dimensional dataset.
Vectors in Data Science
Definition of Vector: Each object (data point) can be considered a vector.
Origin: The coordinate point where all dimensions equal zero.
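As an illustration of dimensions, vectors, and the origin, the sketch below treats each person in the example table as a vector. It assumes only the two numeric features (salary and age) are used as dimensions, which is a simplification made for this example.

```python
import math

# Assumption: only the numeric features form the axes of the space.
feature_names = ("salary", "age")
dimension = len(feature_names)

# Each object becomes a vector with one coordinate per feature.
vectors = {
    "Jane": (90000, 52),
    "John": (85000, 48),
    "Delilah": (75000, 32),
}

# The origin is the point where every dimension equals zero.
origin = (0,) * dimension

# The length of a vector is its straight-line distance from the origin.
lengths = {name: math.dist(origin, v) for name, v in vectors.items()}

for name, v in vectors.items():
    print(name, v, round(lengths[name], 2))
```

Because salary values are far larger than age values, salary dominates each vector's length here.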
Proximity and Distance
Proximity: How near or far two data points are from each other.
Distances between points can be measured through various methods including Euclidean and Manhattan distances.
Axioms of Distance:
Distance is always non-negative.
Distance is zero only when points coincide.
Distance is symmetric (D(p1, p2) = D(p2, p1)).
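The three axioms above can be checked mechanically for any candidate distance function. The following is a minimal sketch: `satisfies_distance_axioms` and `signed_gap` are hypothetical helpers written for this example, and the points are the (salary, age) pairs from the example table.

```python
import math
from itertools import product


def satisfies_distance_axioms(points, distance, tol=1e-9):
    """Return True if `distance` obeys the three listed axioms on `points`."""
    for p1, p2 in product(points, repeat=2):
        d = distance(p1, p2)
        if d < 0:                              # non-negativity
            return False
        if (d == 0) != (p1 == p2):             # zero only when points coincide
            return False
        if abs(d - distance(p2, p1)) > tol:    # symmetry
            return False
    return True


def signed_gap(p1, p2):
    """A deliberately bad 'distance': the signed difference of the first coordinate."""
    return p1[0] - p2[0]


points = [(90000, 52), (85000, 48), (75000, 32)]

euclidean_ok = satisfies_distance_axioms(points, math.dist)
signed_gap_ok = satisfies_distance_axioms(points, signed_gap)

print("Euclidean passes:", euclidean_ok)
print("Signed gap passes:", signed_gap_ok)
```

Passing on a handful of points does not prove a function is a valid distance; the check can only reveal violations.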
Importance of Proximity
Applications in Data Science:
Clustering: Grouping similar data points.
Recommendation Systems: Matching user profiles.
Anomaly Detection: Identifying outliers.
Distance Calculation Techniques
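Euclidean Distance Formula: For points p1 = (x1, ..., xn) and p2 = (y1, ..., yn), D(p1, p2) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2).
Example Calculation: Between Row 2 and Row 5 in a dataset.
Manhattan Distance Definition: The sum of the absolute differences along each dimension, D(p1, p2) = |x1 - y1| + ... + |xn - yn|.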
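Both distances are short to implement. Euclidean distance is the straight-line distance (the square root of the summed squared differences), while Manhattan distance is the sum of absolute differences along each dimension. The dataset referenced in these notes is not reproduced here, so the two points below are hypothetical values chosen to make the arithmetic easy to follow.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical points: differences are 3 and 4 along the two axes.
p1 = (1, 2)
p2 = (4, 6)

print(euclidean_distance(p1, p2))  # sqrt(3^2 + 4^2) = 5.0
print(manhattan_distance(p1, p2))  # 3 + 4 = 7
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, since it follows the axes rather than cutting diagonally.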
Similarity Calculation Techniques
Jaccard Index: Measures similarity between sets or vectors as the size of their intersection divided by the size of their union.
Weighted Jaccard Index Computation:
Example Rows: Row 1 and Row 3 in a dataset.
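A minimal sketch of both measures follows. The lecture's exact weighted formula is not shown in these notes, so this example assumes the common definition for non-negative vectors: the sum of element-wise minimums divided by the sum of element-wise maximums. The example inputs are hypothetical, since the referenced dataset rows are not reproduced here.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections treated as sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


def weighted_jaccard(x, y):
    """Assumed definition: sum(min(x_i, y_i)) / sum(max(x_i, y_i)), values >= 0."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in (*x, *y)):
        raise ValueError("weighted Jaccard assumes non-negative values")
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        return 1.0  # both vectors are all zeros
    return sum(min(a, b) for a, b in zip(x, y)) / denominator


# Hypothetical inputs.
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared / 4 total = 0.5
print(weighted_jaccard((1, 2, 0), (2, 1, 1)))           # (1+1+0) / (2+2+1) = 0.4
```

Both functions return 1.0 for identical inputs and 0.0 when nothing overlaps.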
Types of Data
Structured Data: Organized data in tables.
Unstructured Data: Data with no predefined organization, such as free text, images, or audio.
Semi-Structured Data: Data that does not fully adhere to a formal structure but contains some organizational properties, such as JSON or XML documents.
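The three types differ mainly in how much work is needed before analysis. The sketch below is illustrative only: the CSV rows come from the example table, while the JSON record and the sentence are hypothetical samples invented for this example.

```python
import csv
import io
import json
from collections import Counter

# Structured: fixed columns, every row has the same fields.
structured_text = "name,salary,age\nJane,90000,52\nJohn,85000,48\n"
structured_rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but fields may nest or vary per record.
semi_structured_text = '{"name": "Jane", "skills": ["Python", "R"]}'
record = json.loads(semi_structured_text)

# Unstructured: free text with no fields at all. Counting words is one
# simple way to turn it into something tabular (word -> count).
unstructured_text = "Data science turns raw data into insight"
word_counts = Counter(unstructured_text.lower().split())

print(structured_rows[0]["salary"])  # values are read as strings
print(record["skills"])
print(word_counts.most_common(1))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before analysis.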
Introduction to Data Science
Definition: Data science encompasses techniques and algorithms for knowledge extraction from data.
Components of Data Science:
Data Engineering
Scientific Method
Domain Expertise
Hacker Mindset
Data Visualization
Mathematics
Statistics
Advanced Computing
Importance of Data Science
Data is a critical business asset that enables data-driven decision-making.
Applications include predictive analysis, fraud detection, and business intelligence.
Big Data
Characteristics: Volume, Variety, Velocity.
Data Generated Each Day: Platforms such as Facebook generate immense volumes of data every day, which must be scanned and processed at high speed.
Tools and Technologies for Data Science
Programming Languages: Python, R.
Data Processing Tools: Apache Hadoop, Apache Spark.
Data Visualization: Power BI, Tableau.
Machine Learning Tools: TensorFlow, Keras, Scikit-learn.
Applications of Data Science
Healthcare: Improved diagnostics and treatment capabilities.
Fraud Detection: Identification of suspicious activities.
Image Recognition: Automated visual analysis.
Recommendation Systems: Personalized suggestions on platforms (Netflix, Amazon).
Dynamic Pricing and Demand Forecasting: Predicting demand and adjusting prices accordingly.
Conclusion
Data science combines multiple disciplines and uses large datasets to produce meaningful insights, which lead to informed decisions and advances across many industries.
Data science extracts knowledge from data through a range of techniques and algorithms. This course, led by Ms. Kauser Shaheen, covers data types, data acquisition, analysis, dimensionality, and distance measurement. Key components include data engineering, statistics, and data visualization, with applications in healthcare, fraud detection, and dynamic pricing. The course treats data as a business asset and introduces tools such as Python and R for programming, Apache Hadoop and Apache Spark for data processing, and Power BI and Tableau for visualization. Understanding structured, unstructured, and semi-structured data is fundamental, as is the role of proximity and distance calculations in clustering and recommendation systems. Students also explore real-world applications and how the big data characteristics of volume, variety, and velocity shape contemporary data challenges.