ECM3420 Learning from Data 1 Notes

Module Introduction: ECM3420 Learning from Data

  • Overview of the Module

    • Module focuses on learning from data in the context of the data economy paradigm.

    • Aims to provide life-changing insights and applicable techniques.

    • Covers data types and various techniques for data analysis.

Motivation for Studying Data

  • Inherent Interest: Some students find the module inherently fun, similar to data mining courses of the past.

  • Data Availability: The primary motivation arises from the abundance of available data.

Learning About Exeter Through Data

  • Initial Exploration: New students often learn about the city through asking people, consulting Wikipedia for population data, and using Google Maps.

  • Google Maps Insights:

    • Google Maps provides activity data for various places, indicating popular times.

    • Example: Sidwell Street is most active around 8:30-9 PM.

    • Example: Old Firehouse activity peaks around 11 PM.

Crime Data Analysis in Exeter

  • Data Source: Police.gov.uk (Devon and Cornwall) provides open crime data.

  • Crime Hotspots: Identify areas with high crime concentrations, such as Sidwell Street.

  • Law of Crime Concentration: Crime tends to concentrate in specific hotspots within a city.

  • Decision Support: Crime data can inform decisions about:

    • Patrolling strategies for the police.

    • Improvements to areas like Sidwell Street (e.g., increased lighting).

    • Personal choices, such as where to rent a property.
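
The hotspot idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the module's own method: the street names and counts are made up (they are not taken from the police data), and `concentration` is a hypothetical helper that measures what share of all reports falls on the busiest streets.

```python
from collections import Counter

# Hypothetical records: one street name per reported crime.
# Made-up data for illustration only, not real police figures.
reports = [
    "Sidwell Street", "Sidwell Street", "Sidwell Street", "Sidwell Street",
    "High Street", "High Street", "High Street",
    "Queen Street", "Fore Street", "Longbrook Street",
]


def concentration(streets, top_n):
    """Return the share of all reports that fall on the top_n busiest streets."""
    counts = Counter(streets)
    top = counts.most_common(top_n)
    return sum(count for _, count in top) / len(streets)


# The two streets with the most reports, with their counts.
hotspots = Counter(reports).most_common(2)
share = concentration(reports, top_n=2)

print(hotspots)
print(f"Top 2 of {len(set(reports))} streets hold {share:.0%} of reports")
```

A high share on a small number of streets is what the law of crime concentration would lead one to expect. With real data the same count-and-rank step would run on the location field of each record.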

Diversity of Data

  • Internet Data: Enormous amounts of data generated every minute (e.g., WhatsApp messages, YouTube video uploads).

  • Tracking and Prediction: Data from smartphones and smartwatches can track and predict behavior.

Facebook Study Example

  • Timeline Posts: Analysis of the number of posts shared between two individuals on Facebook.

  • Observed Trend: The number of posts rose and then fell, a pattern that correlated with the two individuals entering a relationship.

  • Data Insights: Reveals potential insights into human behavior, though platform biases might exist.

Relevance of Data Skills

  • Broader Application: Even outside data science roles, dealing with data is increasingly common across various industries.

  • Data-Driven Positions: Many job positions now involve data analytics, regardless of the specific field.

  • Examples: Finance, healthcare, manufacturing, Formula One drivers.

  • Module Goal: Equip students with techniques to learn from data for broad applicability.

Module Content: Techniques for Learning from Data

  • Broad Perspective: The module aims to give a broad perspective of data handling techniques.

  • Supervised Learning: Introduction to linear regression and other regression types.

  • Error Measurement: Learn how to measure error and create better models.

  • Unsupervised Learning: Exploration of unsupervised learning techniques.

  • Bonus Topics:

    • Computer Vision: Analyzing images and videos.

    • Natural Language Processing: Analyzing text.
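
As a taste of the supervised learning and error measurement topics listed above, here is a minimal sketch of simple linear regression in plain Python. The data points are hypothetical, and `fit_line` and `mean_squared_error` are illustrative helper names; the workshops may well use a library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(ys, predictions):
    """Average squared difference between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy data: y is roughly 2 * x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
model_mse = mean_squared_error(ys, predictions)

# Baseline for comparison: always predict the mean of y.
mean_y = sum(ys) / len(ys)
baseline_mse = mean_squared_error(ys, [mean_y] * len(ys))

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"model MSE={model_mse:.3f}, baseline MSE={baseline_mse:.3f}")
```

Comparing the model's error against a simple baseline is one way to judge whether a model is actually better, which is the point of measuring error in the first place.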

Module Team

  • Expertise: The module will bring together a team of experts passionate about data and modelling.

  • Diogo: "These are weeks we are happier if you wish module than for Java"

  • Curiosity: Curiosity is the key to getting the most out of the module. The module favours breadth over depth and aims to build confidence in the topic.

  • Coursework: Coursework is meant to be an open project where students can choose their own data and project topic.

  • Complex Networks: The module will include one lecture on complex networks, replacing the previously planned lecture on Linear Discriminant Analysis.

Workshop Details

  • Importance of Workshops: Workshops are fundamental for learning and practicing the techniques taught.

  • Hands-On Experience: Students will use different libraries commonly found in job postings during workshops.

  • Location: Workshops will be held in the new Lovelace Lab (Innovation Center).

  • Postgraduate Teaching Associates (PGTAs):

    • Owen: Focuses on science of science, using computer vision and NLP to analyze research articles.

    • Song Yuan: Facilitates workshop sessions to consolidate theoretical knowledge.

Assessment

  • Data Analysis Report (40%):

    • Involves using a dataset to derive insights using learned techniques.

    • Example: Analyzing crime data (from police.gov.uk) to understand crime dynamics and relationships.

  • Exam (60%):

    • Previously a multiple-choice exam, but this may change.

  • Project Alignment: The project assigned in this module can be used as a preliminary analysis for the ECM3401 project.

Resources and Books

  • Recommended Books:

    • An Introduction to Statistical Learning: Focuses on coding and understanding techniques (R, Python).

    • A second book by similar authors takes a more mathematical approach.

    • A third book focuses on scikit-learn, Keras, and TensorFlow (less mathematical).

  • Slides: The slides are self-contained.

  • Autonomy: Students are able to select their preferred learning level.

Data Characteristics: The Four Vs of Big Data

  • Volume: Large amounts of data.

  • Velocity: High speed of data generation.

  • Variety: Diversity in data types.

  • Veracity: Truthfulness and accuracy of data.

    • Example: Police crime data is based on reports, so it does not represent all crime that occurs.

Data Types: Structured, Semi-structured, and Unstructured

  • Human Preference: Humans are good at processing unstructured data (audio, video, images).

  • Computer Preference: Computers prefer structured data (SQL tables).

  • Data Volume: Most of the data is unstructured.

  • Unstructured to Structured: Unstructured data usually has to be transformed into a structured format before it can be analysed, and this is a very common step in practice.

  • Semi-structured Data: Combination of structured and unstructured elements (e.g., emails with tags and unstructured text).
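
To make the email example concrete, the sketch below turns one semi-structured email into a structured record (a dictionary that could become a table row) using Python's standard `email` module. The message text, the addresses, and the `email_to_record` helper are all hypothetical and chosen only for illustration.

```python
from email import message_from_string

# Hypothetical email: tagged header fields (structured part) followed by
# free text (unstructured part).
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Workshop room change

Hi Bob, the workshop has moved to a different room this week.
"""


def email_to_record(raw):
    """Extract the tagged fields and a simple body feature into a flat record."""
    msg = message_from_string(raw)
    body = msg.get_payload().strip()
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "word_count": len(body.split()),
    }


record = email_to_record(raw_email)
print(record)
```

The header tags map directly to columns, while the free-text body needs some feature (here just a word count) to be computed before it fits into a structured table.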

Data Scientist Role

  • Survey Data: Data scientists spend most of their time cleaning and organizing data.

  • Least Enjoyable: Data preparation (cleaning and organizing) is often the least enjoyable part of the job.