ECM3420 Learning from Data 1 Notes
Module Introduction: ECM3420 Learning from Data
Overview of the Module
Module focuses on learning from data in the context of the data economy paradigm.
Aims to provide life-changing insights and applicable techniques.
Covers data types and various techniques for data analysis.
Motivation for Studying Data
Inherent Interest: Some students find the module inherently fun, similar to data mining courses of the past.
Data Availability: The primary motivation arises from the abundance of available data.
Learning About Exeter Through Data
Initial Exploration: New students often learn about the city by asking people, consulting Wikipedia for population data, and using Google Maps.
Google Maps Insights:
Google Maps provides activity data for various places, indicating popular times.
Example: Sidwell Street is most active around 8:30-9 PM.
Example: Old Firehouse activity peaks around 11 PM.
Crime Data Analysis in Exeter
Data Source: Police.gov.uk (Devon and Cornwall) provides open crime data.
Crime Hotspots: Identify areas with high crime concentrations, such as Sidwell Street.
Law of Crime Concentration: Crime tends to concentrate in specific hotspots within a city.
Decision Support: Crime data can inform decisions about:
Patrolling strategies for the police.
Improvements to areas like Sidwell Street (e.g., increased lighting).
Personal choices, such as where to rent a property.
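The idea of crime hotspots can be sketched in a few lines of Python. This is a minimal illustration, not the module's method: the records below are invented toy values (only the name Sidwell Street comes from the notes; "Street B", "Street C" and "Street D" are placeholders), and the helper name concentration is hypothetical. It counts records per street and reports the share that falls in the busiest location.

```python
from collections import Counter

# Invented toy records: one street name per reported crime.
# These counts are NOT real police data; they only illustrate the idea.
crimes = [
    "Sidwell Street", "Sidwell Street", "Sidwell Street", "Sidwell Street",
    "Street B", "Street B",
    "Street C",
    "Street D",
]

def concentration(records, top_n=1):
    """Share of all records that fall in the top_n busiest locations."""
    counts = Counter(records)
    top = counts.most_common(top_n)
    return sum(n for _, n in top) / len(records)

counts = Counter(crimes)
hotspot, hotspot_count = counts.most_common(1)[0]
share = concentration(crimes, top_n=1)

print(f"Hotspot: {hotspot} ({hotspot_count} of {len(crimes)} records, {share:.0%})")
```

With real open crime data the same counting step would be applied to the location column of the downloaded file; a high share in a small number of streets is what the law of crime concentration describes.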
Diversity of Data
Internet Data: Enormous amounts of data generated every minute (e.g., WhatsApp messages, YouTube video uploads).
Tracking and Prediction: Data from smartphones and smartwatches can track and predict behavior.
Facebook Study Example
Timeline Posts: Analysis of the number of posts shared between two individuals on Facebook.
Observed Trend: The number of posts rises and then falls, and the turning point correlates with the two individuals entering a relationship.
Data Insights: Reveals potential insights into human behavior, though platform biases might exist.
Relevance of Data Skills
Broader Application: Even outside data science roles, dealing with data is increasingly common across various industries.
Data-Driven Positions: Many job positions now involve data analytics, regardless of the specific field.
Examples: Finance, healthcare, manufacturing, Formula One drivers.
Module Goal: Equip students with techniques to learn from data for broad applicability.
Module Content: Techniques for Learning from Data
Broad Perspective: The module aims to give a broad perspective of data handling techniques.
Supervised Learning: Introduction to linear regression and other regression types.
Error Measurement: Learn how to measure error and create better models.
Unsupervised Learning: Exploration of unsupervised learning techniques.
Bonus Topics:
Computer Vision: Analyzing images and videos.
Natural Language Processing: Analyzing text.
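As a preview of the supervised learning and error measurement topics listed above, here is a minimal sketch of simple linear regression with mean squared error (MSE). It assumes a single input variable and uses invented toy numbers; the function names fit_line and mse are illustrative and not taken from the module materials.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mse(ys, preds):
    """Mean squared error between observed values and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Invented toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
preds = [a + b * x for x in xs]

# Baseline model: always predict the mean of y.
baseline = [sum(ys) / len(ys)] * len(ys)

print(f"intercept={a:.3f}, slope={b:.3f}")
print(f"MSE of fitted line: {mse(ys, preds):.4f}")
print(f"MSE of mean baseline: {mse(ys, baseline):.4f}")
```

Comparing the fitted line against a simple baseline is one way to judge whether a model is actually better; in practice the error would be measured on data the model has not seen.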
Module Team
Expertise: The module will bring together a team of experts passionate about data and modelling.
Diogo: "These are weeks we are happier if you wish module than for Java"
Curiosity: Curiosity is the key to getting the most out of the module. The module covers its topics at an introductory level and aims to build confidence in them.
Coursework: Coursework is meant to be an open project where students can choose their own data and project topic.
Complex Networks: The module will include one lecture on complex networks, replacing the previously planned lecture on Linear Discriminant Analysis.
Workshop Details
Importance of Workshops: Workshops are fundamental for learning and practicing the techniques taught.
Hands-On Experience: Students will use different libraries commonly found in job postings during workshops.
Location: Workshops will be held in the new Lovelace Lab (Innovation Center).
Postgraduate Teaching Associates (PGTAs):
Owen: Focuses on science of science, using computer vision and NLP to analyze research articles.
Song Yuan: Facilitates workshop sessions to consolidate theoretical knowledge.
Assessment
Data Analysis Report (40%):
Involves using a dataset to derive insights using learned techniques.
Example: Analyzing crime data (from police.gov.uk) to understand crime dynamics and relationships.
Exam (60%):
Previously a multiple-choice exam, but this may change.
Project Alignment: The project assigned in this module can be used as a preliminary analysis for the ECM3401 project.
Resources and Books
Recommended Books:
An Introduction to Statistical Learning: Focuses on coding and understanding techniques (R, Python).
A second book by similar authors takes a more mathematical approach.
A third book focuses on scikit-learn, Keras, and TensorFlow (less mathematical).
Slides: The slides are self-contained.
Autonomy: Students are able to select their preferred learning level.
Data Characteristics: The Four Vs of Big Data
Volume: Large amounts of data.
Velocity: High speed of data generation.
Variety: Diversity in data types.
Veracity: Truthfulness and accuracy of data.
Example: Police crime data covers only reported incidents, so it does not represent total crime.
Data Types: Structured, Semi-structured, and Unstructured
Human Preference: Humans are good at processing unstructured data (audio, video, images).
Computer Preference: Computers prefer structured data (SQL tables).
Data Volume: Most of the data is unstructured.
Unstructured to Structured: Analysis usually requires transforming unstructured data into a structured format first, and this is a very common step in practice.
Semi-structured Data: Combination of structured and unstructured elements (e.g., emails with tags and unstructured text).
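The email example can be made concrete with a short sketch. It assumes a plain-text, non-multipart email; the message content and the function name email_to_record are invented for illustration. Python's standard email module parses the tagged header fields, and the free-text body is reduced to a simple numeric feature so that the result fits in one table row.

```python
from email import message_from_string

# Invented example message: tagged headers plus unstructured body text.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Workshop room change

Hi Bob, the workshop has moved to a different lab this week.
"""

def email_to_record(raw):
    """Turn one plain-text email into a flat, table-like record."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a str for non-multipart messages
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        # One very simple structured feature derived from the free text.
        "body_word_count": len(body.split()),
    }

record = email_to_record(raw_email)
print(record)
```

Each record has the same fixed set of fields, so a list of such records could be loaded into a SQL table or a data frame for analysis.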
Data Scientist Role
Survey Data: According to survey data, data scientists spend most of their time cleaning and organizing data.
Least Enjoyable: Data preparation (cleaning and organizing) is often the least enjoyable part of the job.
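To give a feel for what cleaning and organizing involves, here is a minimal sketch on invented toy rows. The field names, the clean function, and the rule that identical rows count as duplicates are all assumptions made for this example; real datasets need rules chosen for the data at hand.

```python
# Invented toy rows with typical problems: stray whitespace,
# inconsistent capitalisation, missing values, and duplicates.
raw_rows = [
    {"street": " Sidwell Street ", "crime_type": "Burglary"},
    {"street": "sidwell street", "crime_type": "burglary"},  # duplicate once normalised
    {"street": "", "crime_type": "Theft"},                    # missing street
    {"street": "Street B", "crime_type": None},               # missing crime type
    {"street": "Street B", "crime_type": " Theft"},
]

def clean(rows):
    """Normalise text fields, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        street = (row.get("street") or "").strip().title()
        crime_type = (row.get("crime_type") or "").strip().lower()
        if not street or not crime_type:
            continue  # drop rows with a missing value
        key = (street, crime_type)
        if key in seen:
            continue  # assumption: identical rows are duplicates
        seen.add(key)
        cleaned.append({"street": street, "crime_type": crime_type})
    return cleaned

cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

Even in this tiny example, three of the five rows are removed before any analysis starts, which hints at why preparation takes up so much of the job.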