CH2 - Data Understanding – Exam Notes

Page 1

  • Course: UCS551 Introduction to Data Analytics & Applications

  • Chapter: Data Understanding

    • This chapter focuses on the foundational concepts required to effectively work with data, including its various forms, characteristics, and initial preparation steps.

Page 2

  • Agenda: data types, data structures, levels of measurement, univariate vs. multivariate data, data representation

Page 3

  • Two primary data types: structured vs. unstructured

Page 4

  • Structured data: possesses a predefined schema; highly organised in rows & columns, typically within tables; easily stored, searched, and analysed due to its fixed format and clear relationships (e.g., data in SQL databases, Excel spreadsheets).

Page 5

  • Key traits:

    • Organised tables: Data is arranged in a tabular format, where each row represents a unique record and each column represents a specific attribute or field.

    • Strict schema: Requires a rigid, pre-defined structure with specified data types for each column (e.g., integer, string, date) and often constraints (e.g., primary keys, foreign keys).

    • Easy querying: Highly efficient for retrieval and manipulation using query languages like SQL, allowing for precise filtering, sorting, and aggregations.

    • Largely quantitative: Often comprises numerical data (e.g., sales figures, sensor readings) but can also include categorical data that fits within the predefined schema.

    • Stored in relational DBs: Typically resides in Relational Database Management Systems (RDBMS) which enforce relationships between tables.
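
The traits above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a recommended design: the `sales` table, its columns, and the rows are hypothetical, and SQLite is used only because it ships with Python (it enforces constraints such as primary keys but is more lenient about column types than most RDBMSs).

```python
import sqlite3

# In-memory database; table name, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        transaction_id INTEGER PRIMARY KEY,  -- each row is a unique record
        product_id     TEXT    NOT NULL,
        price          REAL    NOT NULL,
        quantity       INTEGER NOT NULL,
        sold_on        TEXT    NOT NULL      -- ISO date string
    )
    """
)
rows = [
    (1, "P-100", 2.50, 4, "2024-01-05"),
    (2, "P-200", 10.00, 1, "2024-01-05"),
    (3, "P-100", 2.50, 2, "2024-01-06"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# Easy querying: aggregation and sorting in a single SQL statement.
revenue_by_product = conn.execute(
    """
    SELECT product_id, SUM(price * quantity) AS revenue
    FROM sales
    GROUP BY product_id
    ORDER BY revenue DESC
    """
).fetchall()
print(revenue_by_product)

# Strict schema: a second row with the same primary key is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (1, 'P-300', 1.0, 1, '2024-01-07')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```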

Page 6

  • Examples:

    • CRM tables: Customer details (name, address, purchase history) in a database.

    • Spreadsheets: Data organised in rows and columns, like budget reports or inventory lists.

    • Sensor readings: Time-series data from IoT devices like temperature, pressure, or humidity measurements, often with timestamps.

    • Point-of-sale transactions: Records of sales, including product ID, price, quantity, and transaction date.

Page 7

  • Unstructured data: lacks a predefined format or organisational structure; significantly harder to handle, process, and extract insights from; often needs advanced techniques like Natural Language Processing (NLP) or Machine Learning (ML) for comprehensive analysis.

Page 8

  • Key traits:

    • No schema: Data does not conform to a fixed data model, allowing for high flexibility in content and format. Often described as 'schema-on-read'.

    • Diverse formats: Can include various types beyond tabular data, such as plain text documents, images, audio files, video files, and email content.

    • Harder search: Requires more sophisticated search techniques like full-text search, semantic search, or pattern recognition rather than simple keyword matching against predefined fields.

    • Qualitative focus: Often rich in contextual information and meaning, requiring interpretation to derive insights (e.g., sentiment from text, objects in images).

    • Stored in NoSQL/cloud/lakes: Commonly stored in NoSQL databases (e.g., document databases, graph databases), cloud storage (e.g., Amazon S3), or data lakes built to handle vast quantities of raw data.
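
As a rough illustration of "no schema" and "harder search", the sketch below stores three differently shaped items and only imposes structure when they are read. The items, the `extract_text` helper, and the `search` helper are all hypothetical, and the search is a naive substring match rather than a real full-text or semantic engine.

```python
import json
import re

# Hypothetical raw items: no two share the same shape.
raw_items = [
    '{"type": "email", "subject": "Refund request", "body": "The product arrived broken."}',
    '{"type": "chat", "messages": ["hi", "my order is late", "thanks"]}',
    '{"type": "post", "text": "Love the new update!", "likes": 12}',
]

def extract_text(item: dict) -> str:
    """Schema-on-read: pull whatever free text this particular item carries."""
    parts = [item[key] for key in ("subject", "body", "text") if key in item]
    parts.extend(item.get("messages", []))
    return " ".join(parts)

documents = [extract_text(json.loads(raw)) for raw in raw_items]

def search(term: str) -> list[int]:
    """Naive full-text search: indices of documents that contain the term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]

print(search("order"))
```

Every new item shape would need its own handling in `extract_text`, which is the practical cost of having no predefined fields.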

Page 9

  • Examples:

    • Emails: Content, attachments, and metadata, which vary widely in structure.

    • Social media posts: Text, images, videos, and hyperlinks with inconsistent formatting.

    • Multimedia: Images (e.g., JPG, PNG), audio (e.g., MP3, WAV), and video files (e.g., MP4, AVI).

    • Irregular IoT data: Sensor data that might not come in a consistent stream or format, or includes unstructured logs.

    • Webpages: HTML content, embedded media, and varying layouts.

    • Chat transcripts: Conversations from customer service or messaging apps, reflecting natural language.

Page 10

  • Data structures: specific methods for organising and storing data in a computer so that it can be accessed and modified efficiently; the choice impacts performance and memory usage.

Page 11

  • Array: a collection of homogeneous (same type) elements stored at contiguous memory locations; features O(1) (constant-time) index access due to direct memory address calculation; low memory overhead as no extra space for dynamic resizing is needed.

Page 12

  • Vector: a dynamic array that automatically resizes itself when elements are added or removed; retains O(1) index access; appending is O(1) amortised, because an occasional resize copies every element and is costly; requires extra memory for capacity management to pre-allocate space for future growth.

Page 13

  • Difference: The key distinction is that an array has a static, fixed size determined at compile-time or initialization, while a vector's size is dynamic and can auto-resize at run-time as needed.
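
The array/vector distinction can be sketched with a toy class. This is an illustration under assumptions, not how any particular library implements vectors: the `Vector` class, its starting capacity of 2, and the doubling growth rule are all choices made for the example, and a Python list of fixed length stands in for a fixed-size array.

```python
class Vector:
    """Toy dynamic array built on a fixed-size buffer (illustrative only)."""

    def __init__(self, capacity: int = 2):
        self._buf = [None] * capacity  # stands in for a fixed-size array
        self._size = 0                 # number of slots actually in use
        self.resizes = 0               # how many times the buffer was regrown

    def __len__(self) -> int:
        return self._size

    @property
    def capacity(self) -> int:
        return len(self._buf)

    def __getitem__(self, index: int):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        return self._buf[index]  # O(1): direct position lookup

    def append(self, value) -> None:
        if self._size == len(self._buf):
            # Buffer full: allocate double the space and copy everything, O(n).
            new_buf = [None] * (2 * len(self._buf))
            new_buf[: self._size] = self._buf
            self._buf = new_buf
            self.resizes += 1
        self._buf[self._size] = value
        self._size += 1


v = Vector()
for i in range(10):
    v.append(i * i)
print(len(v), v.capacity, v.resizes, v[3])  # 10 16 3 9
```

Ten appends trigger only three resizes (2 to 4, 4 to 8, 8 to 16), and the unused six slots are the extra memory a vector trades for cheap growth.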

Page 14

  • Matrix: a 2-D (two-dimensional) array, typically represented as rows × columns; contains homogeneous elements (all of the same data type); widely used for mathematical operations, especially in linear algebra (e.g., matrix multiplication, inversion) and image processing.
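
A minimal sketch of matrix multiplication, with each matrix stored as a list of rows. The `matmul` helper and the values are made up for illustration; real work would normally use a numerical library instead.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

A = [[1, 2], [3, 4], [5, 6]]    # 3 rows x 2 columns
B = [[7, 8, 9], [10, 11, 12]]   # 2 rows x 3 columns
C = matmul(A, B)                # 3 rows x 3 columns
print(C)  # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```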

Page 15

  • Data frame: a 2-D tabular data structure (similar to a spreadsheet or relational database table); columns may differ in type (heterogeneous); dynamic in size (rows and columns can be added or removed); features labelled rows and columns for easy referencing and manipulation; a fundamental structure in data science libraries like Pandas (Python) and R.
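
The idea of labelled, differently typed columns can be sketched in plain Python as a stand-in for a real data frame. The `table` contents and the `add_column` and `row` helpers are hypothetical and only mimic the concept; they are not the Pandas or R API.

```python
# Column-oriented table: each label maps to one column, and columns may
# hold different types. Names and values are hypothetical.
table = {
    "name": ["Aina", "Ben", "Chen"],  # strings
    "age": [21, 23, 22],              # integers
    "cgpa": [3.45, 3.10, 3.78],       # floats
}

def add_column(tbl, label, values):
    """Add a labelled column; it must match the existing number of rows."""
    n_rows = len(next(iter(tbl.values())))
    if len(values) != n_rows:
        raise ValueError("column length must match the number of rows")
    tbl[label] = list(values)

def row(tbl, index):
    """Return one record as a label -> value mapping."""
    return {label: column[index] for label, column in tbl.items()}

add_column(table, "enrolled", [True, True, False])  # size is dynamic
column_types = {label: type(col[0]).__name__ for label, col in table.items()}
print(column_types)
print(row(table, 2))
```

Unlike the matrix, the columns here mix `str`, `int`, `float`, and `bool`, which is the heterogeneity the definition refers to.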

Page 16

  • Levels of measurement: A classification system that describes the nature of information within values and determines which statistical analyses are appropriate.
    • Nominal/Ordinal = qualitative categories or non-numeric descriptions.

    • Interval/Ratio = quantitative numerical values, allowing for mathematical operations.

Page 17

  • Nominal: Data that consists of categories without any inherent order or numerical value. You can only check for equality or difference. Example: Gender (Male, Female), Marital Status.

  • Ordinal: Data with categories that have a meaningful order, but the differences or intervals between categories are not uniform or quantifiable. Example: Education level (High School, Bachelor's, Master's, PhD), Satisfaction ratings (Poor, Good, Excellent).

  • Interval: Data with ordered categories where the differences between values are meaningful and consistent, but there is no true or absolute zero point. Ratios are not meaningful. Example: Temperature in Celsius or Fahrenheit (0 °C does not mean absence of temperature), IQ scores.

  • Ratio: Data with ordered categories, meaningful and consistent differences, AND a true absolute zero point, meaning that zero signifies the complete absence of the measured quantity. This allows for meaningful ratios. Example: Height, Weight, Age, Income, Kilograms (0 kg means no mass).
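
One way to see why the measurement level matters is to look at which summary each level supports. The sketch below uses the standard library's `statistics` module with made-up data; the pairing of mode with nominal, median with ordinal, and mean with interval/ratio is the usual rule of thumb rather than something stated on the slides.

```python
from statistics import mean, median_low, mode

# Nominal: only equality is defined, so the mode is the natural summary.
marital_status = ["Single", "Married", "Single", "Divorced", "Single"]
most_common_status = mode(marital_status)

# Ordinal: order is defined, so a median works once categories are ranked.
levels = ["High School", "Bachelor's", "Master's", "PhD"]
rank = {level: i for i, level in enumerate(levels)}
education = ["Bachelor's", "PhD", "High School", "Bachelor's", "Master's"]
median_education = levels[median_low(rank[e] for e in education)]

# Interval and ratio: differences are meaningful, so the mean is defined.
temps_c = [20.0, 22.0, 27.0]
mean_temp = mean(temps_c)

print(most_common_status, median_education, mean_temp)
```

The ranks 0 to 3 only encode order; averaging them would treat the gaps between education levels as equal, which ordinal data does not guarantee.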

Page 18

  • True zero → absolute absence (e.g., 0 kg implies no mass) → allows for meaningful ratios (e.g., 4 kg is twice as much as 2 kg).

  • No true zero → an arbitrary zero point that does not indicate absence (e.g., 0 °C does not mean no temperature, it's just a point on the scale) → only differences are meaningful (e.g., 10 °C is 5 °C warmer than 5 °C, but not twice as hot).
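
A quick numeric check of the point above: re-expressing the same temperatures on the Kelvin scale (which does have a true zero; Kelvin is introduced here for illustration and is not part of the slides) keeps the difference but changes the ratio, whereas a mass ratio survives a change of unit.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Shift to the Kelvin scale, whose zero is absolute zero."""
    return celsius + 273.15

warm_c, cool_c = 10.0, 5.0

# Differences agree on both scales (5 degrees either way).
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios do not: 2.0 in Celsius, about 1.018 in Kelvin.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Mass has a true zero, so the ratio is the same in kilograms and grams.
heavy_kg, light_kg = 4.0, 2.0
ratio_kg = heavy_kg / light_kg
ratio_g = (heavy_kg * 1000) / (light_kg * 1000)

print(difference_c, round(difference_k, 6))
print(ratio_c, round(ratio_k, 3))
print(ratio_kg, ratio_g)
```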

Page 19

  • Univariate data: concerns a single variable per observation; analysis focuses on describing the distribution and characteristics of that one variable.

  • Multivariate data: involves two or more variables per observation; analysis focuses on understanding relationships, correlations, and interactions among these variables.

Page 20

  • Univariate example: A list of students' favourite colours, where only the 'colour' variable is collected for each student.

  • Multivariate example: An ad-performance table including variables like gender of the viewer, their age group, click-through rate, and conversion rate for each ad interaction, allowing for analysis of how these factors influence ad effectiveness.
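
The two examples can be sketched as follows. All values are invented, and the `impressions` and `clicks` fields are assumptions added so that a click-through rate can be computed per group.

```python
from collections import Counter, defaultdict

# Univariate: one variable (colour) per student, so we describe its distribution.
colours = ["blue", "red", "blue", "green", "blue", "red"]
colour_counts = Counter(colours)
print(colour_counts.most_common())

# Multivariate: several variables per observation, so we can relate them.
ad_rows = [
    {"gender": "F", "age_group": "18-24", "impressions": 100, "clicks": 8},
    {"gender": "M", "age_group": "18-24", "impressions": 100, "clicks": 4},
    {"gender": "F", "age_group": "25-34", "impressions": 200, "clicks": 6},
    {"gender": "M", "age_group": "25-34", "impressions": 200, "clicks": 10},
]

# Click-through rate per age group = total clicks / total impressions.
totals = defaultdict(lambda: [0, 0])
for r in ad_rows:
    totals[r["age_group"]][0] += r["clicks"]
    totals[r["age_group"]][1] += r["impressions"]
ctr_by_age = {group: clicks / imps for group, (clicks, imps) in totals.items()}
print(ctr_by_age)
```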

Page 21

  • Data representation checklist: A thorough evaluation ensures data is fit for analysis.

    • Variables collected: What specific attributes or characteristics have been measured?

    • Coding: How are categorical or qualitative variables numerically represented (e.g., Male = 0, Female = 1)?

    • Measurement level: Identifying if data is nominal, ordinal, interval, or ratio determines appropriate statistical tests.

    • Meaning: What do the data values truly represent in the real world?

    • Quality: Evaluation of several aspects:

      • Accuracy: Data reflects the true values.

      • Completeness: No missing values or records.

      • Validity: Data conforms to defined business rules or constraints.

      • Consistency: Data is uniform across systems and time.

      • Uniqueness: No duplicate records.

      • Timeliness: Data is relevant and up-to-date.

      • Fitness for Use: Data meets the requirements for its intended analytical purpose.

    • Missingness: Are there missing values, and if so, what is the pattern and underlying cause (e.g., missing at random, missing not at random)?

    • Relevance: Is the data pertinent to the analytical question or problem being addressed?
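
A few of the quality checks on this list can be automated. The sketch below covers completeness, uniqueness, and validity only; the records, the helper names, and the 0 to 120 age rule are hypothetical stand-ins for whatever business rules apply.

```python
# Hypothetical records with deliberate quality problems.
records = [
    {"id": 1, "age": 34, "country": "MY"},
    {"id": 2, "age": None, "country": "SG"},   # missing age
    {"id": 2, "age": None, "country": "SG"},   # duplicate of the row above
    {"id": 3, "age": 212, "country": "MY"},    # breaks the age rule
]

def completeness(rows, field):
    """Share of rows in which the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def duplicate_ids(rows):
    """IDs that occur more than once, in order of first repeat."""
    seen, repeats = set(), []
    for r in rows:
        if r["id"] in seen and r["id"] not in repeats:
            repeats.append(r["id"])
        seen.add(r["id"])
    return repeats

def invalid_ages(rows, low=0, high=120):
    """IDs whose age is present but outside the assumed valid range."""
    return [r["id"] for r in rows if r["age"] is not None and not low <= r["age"] <= high]

print(completeness(records, "age"))  # 0.5
print(duplicate_ids(records))        # [2]
print(invalid_ages(records))         # [3]
```

Accuracy, timeliness, and relevance generally cannot be checked from the data alone; they need an external reference or knowledge of the analytical question.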

Page 22

  • Data collection: The process of obtaining appropriate, high-quality data from various sources; it is critical to minimise error and bias and to ensure the data accurately represents the phenomena being studied. Methods include surveys, sensors, web scraping, and existing databases.

Page 23

  • Data cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset; involves handling missing values (e.g., imputation, deletion), outliers (e.g., removal, winsorization), and inconsistencies (e.g., standardisation, deduplication). This step is essential for ensuring accurate and reliable analysis results.
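
The three cleaning steps can be sketched on a tiny dataset. Everything here is an assumption made for illustration: the `(id, value)` rows, the choice of median imputation, and a simple winsorisation that clips to the k-th smallest and k-th largest values rather than to percentile cut-offs.

```python
from statistics import median

# Hypothetical (id, value) rows: one duplicate, one missing value, one outlier.
rows = [(1, 12.0), (2, None), (3, 15.0), (3, 15.0), (4, 400.0), (5, 13.0), (6, 14.0)]

def deduplicate(data):
    """Keep the first occurrence of each id."""
    seen, kept = set(), []
    for row_id, value in data:
        if row_id not in seen:
            seen.add(row_id)
            kept.append((row_id, value))
    return kept

def impute_median(values):
    """Replace missing values with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def winsorize(values, k=1):
    """Clip the k smallest and k largest values to their nearest neighbours."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

unique_rows = deduplicate(rows)
imputed = impute_median([value for _, value in unique_rows])
cleaned = winsorize(imputed)
print(cleaned)  # [13.0, 14.0, 15.0, 15.0, 13.0, 14.0]
```

The median is used for imputation because the outlier 400.0 would drag a mean upward; which strategy is appropriate depends on why the values are missing.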

Page 24

  • Descriptive statistics: Techniques used to summarise and describe the main features of a dataset without drawing conclusions beyond the data itself.

    • Univariate: Focus on a single variable.

      • Measures of central tendency: mean (average), median (middle value), mode (most frequent value).

      • Measures of dispersion: variance (spread of data points around the mean), standard deviation (σ, square root of variance), range (difference between max and min).

      • Visualisations: Histograms (show distribution shape and frequency of data), box-plots (display distribution summary, including quartiles and outliers).

    • Multivariate: Focus on relationships between multiple variables.

      • Covariance: Measures the directional relationship between two variables (how they change together).

      • Correlation: Standardised measure of the linear relationship between two variables, ranging from -1 to 1.

      • Contingency tables: Used for summarising the relationship between two or more categorical variables.

      • Visualisations: Scatter plots (show relationship between two numerical variables), heatmaps (represent correlation matrices or patterns in 2-D data).
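
The univariate and multivariate measures above can be computed with the standard library. The two samples are made up, and the population formulas (dividing by n) are used throughout; a sample estimate would divide by n - 1 instead.

```python
from math import sqrt
from statistics import mean, median, mode, pstdev, pvariance

# Hypothetical paired observations.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]

# Univariate: central tendency and dispersion of x.
x_mean, x_median, x_mode = mean(x), median(x), mode(x)
x_range = max(x) - min(x)
x_var = pvariance(x)   # average squared deviation from the mean
x_std = pstdev(x)      # sigma, the square root of the variance

# Multivariate: how x and y vary together.
y_mean = mean(y)
covariance = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / len(x)
correlation = covariance / (x_std * pstdev(y))  # standardised to [-1, 1]

print(x_mean, x_median, x_mode, x_range)  # 5.0 4.5 4.0 6.0
print(x_var, x_std)                       # 4.0 2.0
print(round(covariance, 4), round(correlation, 4))
```

Covariance depends on the units of x and y, while correlation divides those units out, which is why only correlation is bounded.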

Page 25

  • Modelling: The process of building predictive or explanatory models once data is understood and prepared; aims to uncover underlying patterns, make forecasts, or classify new data.

    • Univariate time-series models: Such as ARIMA (AutoRegressive Integrated Moving Average) models, used for forecasting future values of a single variable based on its past observations.

    • Multivariate models: Include decision trees, random forests, and neural networks, which can handle multiple input variables to make predictions or classifications, capturing complex, non-linear relationships.
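
For the univariate time-series case, the sketch below fits a first-order autoregressive model, AR(1), by ordinary least squares. This is a deliberately simpler stand-in for the autoregressive part of ARIMA, not a full ARIMA implementation (no differencing and no moving-average term). The `fit_ar1` helper is hypothetical and the series is synthetic and noise-free, generated from x[t] = 2 + 0.5 * x[t-1].

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    return c, phi

# Synthetic series built from x[t] = 2 + 0.5 * x[t-1], starting at 0.
series = [0.0, 2.0, 3.0, 3.5, 3.75, 3.875]

c, phi = fit_ar1(series)
forecast = c + phi * series[-1]  # one-step-ahead prediction
print(round(c, 6), round(phi, 6), round(forecast, 6))
```

Because the series has no noise, the fit recovers the generating coefficients; with real data the estimates would only approximate them, and a dedicated time-series library would be the usual choice.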

  • End of chapter