CH2 - Data Understanding – Exam Notes

Page 1

Course: UCS551 Introduction to Data Analytics & Applications
Chapter: Data Understanding
- This chapter focuses on the foundational concepts required to effectively work with data, including its various forms, characteristics, and initial preparation steps.

Page 2

Agenda: data types, data structures, levels of measurement, univariate vs. multivariate data, data representation

Page 3

Two primary data types: structured vs. unstructured

Page 4

Structured data: possesses a predefined schema; highly organised in rows & columns, typically within tables; easily stored, searched, and analysed due to its fixed format and clear relationships (e.g., data in SQL databases, Excel spreadsheets).

Page 5

Key traits:
- Organised tables: Data is arranged in a tabular format, where each row represents a unique record and each column represents a specific attribute or field.
- Strict schema: Requires a rigid, pre-defined structure with specified data types for each column (e.g., integer, string, date) and often constraints (e.g., primary keys, foreign keys).
- Easy querying: Highly efficient for retrieval and manipulation using query languages like SQL, allowing for precise filtering, sorting, and aggregations.
- Largely quantitative: Often comprises numerical data (e.g., sales figures, sensor readings) but can also include categorical data that fits within the predefined schema.
- Stored in relational DBs: Typically resides in Relational Database Management Systems (RDBMS) which enforce relationships between tables.
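
The traits above can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the rows are hypothetical, chosen to resemble point-of-sale data. Note that SQLite enforces the declared constraints but is more lenient about column types than most RDBMSs.

```python
import sqlite3

# In-memory database; table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        transaction_id INTEGER PRIMARY KEY,
        product_id     TEXT    NOT NULL,
        price          REAL    NOT NULL,
        quantity       INTEGER NOT NULL,
        sold_on        TEXT    NOT NULL
    )
    """
)
rows = [
    (1, "P-100", 2.50, 4, "2024-01-05"),
    (2, "P-200", 10.00, 1, "2024-01-05"),
    (3, "P-100", 2.50, 2, "2024-01-06"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# Easy querying: filter, aggregate, and sort in one declarative statement.
revenue_by_product = conn.execute(
    """
    SELECT product_id, SUM(price * quantity) AS revenue
    FROM sales
    GROUP BY product_id
    ORDER BY revenue DESC
    """
).fetchall()

# Strict schema: a second row with the same primary key is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (1, 'P-300', 1.0, 1, '2024-01-07')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(revenue_by_product)   # [('P-100', 15.0), ('P-200', 10.0)]
print(duplicate_rejected)   # True
conn.close()
```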

Page 6

Examples:
- CRM tables: Customer details (name, address, purchase history) in a database.
- Spreadsheets: Data organised in rows and columns, like budget reports or inventory lists.
- Sensor readings: Time-series data from IoT devices like temperature, pressure, or humidity measurements, often with timestamps.
- Point-of-sale transactions: Records of sales, including product ID, price, quantity, and transaction date.

Page 7

Unstructured data: lacks a predefined format or organisational structure; significantly harder to handle, process, and extract insights from; often needs advanced techniques like Natural Language Processing (NLP) or Machine Learning (ML) for comprehensive analysis.

Page 8

Key traits:
- No schema: Data does not conform to a fixed data model, allowing for high flexibility in content and format. Often described as 'schema-on-read'.
- Diverse formats: Can include various types beyond tabular data, such as plain text documents, images, audio files, video files, and email content.
- Harder search: Requires more sophisticated search techniques like full-text search, semantic search, or pattern recognition rather than simple keyword matching against predefined fields.
- Qualitative focus: Often rich in contextual information and meaning, requiring interpretation to derive insights (e.g., sentiment from text, objects in images).
- Stored in NoSQL/cloud/lakes: Commonly stored in NoSQL databases (e.g., document databases, graph databases), cloud storage (e.g., Amazon S3), or data lakes built to handle vast quantities of raw data.
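
As a rough illustration of "schema-on-read" and harder search, the sketch below imposes structure on free text only at the moment it is read. The messages and the order-number pattern are hypothetical, and real systems would typically use NLP tooling rather than a single regular expression.

```python
import re
from collections import Counter

# Hypothetical free-text messages: no columns, no fixed fields.
messages = [
    "Hi, my order #1234 arrived late and the box was damaged.",
    "Thanks! Delivery was quick. Order 5678 is perfect.",
    "Still waiting on a refund for order #1234...",
]

# Structure is imposed at read time: extract order numbers with a pattern.
order_pattern = re.compile(r"order\s+#?(\d+)", re.IGNORECASE)
order_ids = [m for text in messages for m in order_pattern.findall(text)]

# A crude full-text index: lowercase word counts across all messages.
words = Counter(
    w for text in messages for w in re.findall(r"[a-z']+", text.lower())
)

print(order_ids)        # ['1234', '5678', '1234']
print(words["order"])   # 3
```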

Page 9

Examples:
- Emails: Content, attachments, and metadata, which vary widely in structure.
- Social media posts: Text, images, videos, and hyperlinks with inconsistent formatting.
- Multimedia: Images (e.g., JPG, PNG), audio (e.g., MP3, WAV), and video files (e.g., MP4, AVI).
- Irregular IoT data: Sensor data that might not come in a consistent stream or format, or includes unstructured logs.
- Webpages: HTML content, embedded media, and varying layouts.
- Chat transcripts: Conversations from customer service or messaging apps, reflecting natural language.

Page 10

Data structures: specific methods for organising and storing data in a computer so that it can be accessed and modified efficiently; the choice impacts performance and memory usage.

Page 11

Array: a collection of homogeneous (same type) elements stored at contiguous memory locations; features $O(1)$ (constant time) index access due to direct memory address calculation; low memory overhead as no extra space for dynamic resizing is needed.

Page 12

Vector: a dynamic array that automatically resizes itself when elements are added or removed; retains $O(1)$ index access, while appending at the end is amortised $O(1)$ (an occasional resize copies every element and is costly); requires extra memory because spare capacity is pre-allocated for future growth.

Page 13

Difference: The key distinction is that an array has a static, fixed size determined at compile-time or initialization, while a vector's size is dynamic and can auto-resize at run-time as needed.
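
To make the array/vector distinction concrete, here is a toy vector built on top of a fixed-capacity buffer. The class name `DynamicArray` and the doubling growth policy are assumptions for illustration; real implementations differ in their growth factor and details.

```python
class DynamicArray:
    """Toy vector: a fixed-capacity buffer that doubles when it is full."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._buffer = [None] * self._capacity  # stands in for a fixed array
        self.resizes = 0

    def append(self, value):
        if self._size == self._capacity:
            self._grow()
        self._buffer[self._size] = value
        self._size += 1

    def _grow(self):
        # Allocate a buffer twice as large and copy every element across.
        new_capacity = self._capacity * 2
        new_buffer = [None] * new_capacity
        for i in range(self._size):
            new_buffer[i] = self._buffer[i]
        self._buffer = new_buffer
        self._capacity = new_capacity
        self.resizes += 1

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        return self._buffer[index]  # direct index access, no traversal

    def __len__(self):
        return self._size

    @property
    def capacity(self):
        return self._capacity


vec = DynamicArray()
for n in range(10):
    vec.append(n * n)

print(len(vec), vec.capacity, vec.resizes)  # 10 16 4
print(vec[3])                               # 9
```

Ten appends trigger only four resizes, which is why the cost of appending averages out to constant time, and the unused capacity (16 slots for 10 elements) is the extra memory a vector pays for that.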

Page 14

Matrix: a 2-D (two-dimensional) array, typically represented as rows $\times$ columns; contains homogeneous elements (all of the same data type); widely used for mathematical operations, especially in linear algebra (e.g., matrix multiplication, inversion) and image processing.
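
A minimal sketch of matrix multiplication, representing a matrix as a list of rows. The helper name `matmul` is hypothetical; in practice a numerical library would normally be used instead of nested lists.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (both lists of rows)."""
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("columns of a must equal rows of b")
    p = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
        for i in range(len(a))
    ]


a = [[1, 2], [3, 4], [5, 6]]     # 3 x 2
b = [[7, 8, 9], [10, 11, 12]]    # 2 x 3
product = matmul(a, b)           # 3 x 3

print(product)  # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```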

Page 15

Data frame: a 2-D tabular data structure (similar to a spreadsheet or relational database table); columns may differ in type (heterogeneous), although each column holds a single type; dynamic in size (rows and columns can be added or removed); features labelled rows and columns for easy referencing and manipulation; a fundamental structure in data science tools such as the pandas library (Python) and the R language.
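
The idea can be sketched with a column-oriented dictionary. This is a standard-library stand-in for illustration only, not how pandas or R implement data frames, and the column names and values are hypothetical.

```python
# Column-oriented table: one list per labelled column, all the same length.
frame = {
    "name": ["Aina", "Ben", "Chen"],     # str column
    "age": [21, 23, 22],                 # int column
    "cgpa": [3.6, 3.1, 3.9],             # float column
    "enrolled": [True, True, False],     # bool column
}

# Dynamic size: a new column can be added after creation.
frame["honours"] = [c >= 3.5 for c in frame["cgpa"]]


def row(table, i):
    """Return row i as a mapping from column label to value."""
    return {col: values[i] for col, values in table.items()}


# Heterogeneous columns: each column has its own type.
column_types = {col: type(values[0]).__name__ for col, values in frame.items()}

print(row(frame, 2))
print(column_types)
```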

Page 16

Levels of measurement: A classification system that describes the nature of information within values and determines which statistical analyses are appropriate.
- Nominal/Ordinal = qualitative categories or non-numeric descriptions.
- Interval/Ratio = quantitative numerical values, allowing for mathematical operations.

Page 17

Nominal: Data that consists of categories without any inherent order or numerical value. You can only check for equality or difference. Example: Gender (Male, Female), Marital Status.
Ordinal: Data with categories that have a meaningful order, but the differences or intervals between categories are not uniform or quantifiable. Example: Education level (High School, Bachelor's, Master's, PhD), Satisfaction ratings (Poor, Good, Excellent).
Interval: Data with ordered categories where the differences between values are meaningful and consistent, but there is no true or absolute zero point. Ratios are not meaningful. Example: Temperature in Celsius or Fahrenheit ($0^\circ\text{C}$ does not mean absence of temperature), IQ scores.
Ratio: Data with ordered values, meaningful and consistent differences, AND a true absolute zero point, meaning that zero signifies the complete absence of the measured quantity. This allows for meaningful ratios. Example: Height, Weight, Age, Income, Mass in kilograms ($0$ kg means no mass).

Page 18

- True zero $\rightarrow$ absolute absence (e.g., $0$ kg implies no mass) $\rightarrow$ allows for meaningful ratios (e.g., $4\text{ kg}$ is twice as much as $2\text{ kg}$).
- No true zero $\rightarrow$ an arbitrary zero point that does not indicate absence (e.g., $0^\circ\text{C}$ does not mean no temperature, it's just a point on the scale) $\rightarrow$ only differences are meaningful (e.g., $10^\circ\text{C}$ is $5^\circ\text{C}$ warmer than $5^\circ\text{C}$, but not twice as hot).
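
A short numeric check of why ratios fail on an interval scale: converting the same two temperatures to Kelvin (a ratio scale with a true zero) preserves the difference but changes the ratio. The variable names are illustrative.

```python
def celsius_to_kelvin(c):
    return c + 273.15


warm_c, cool_c = 10.0, 5.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

diff_c = warm_c - cool_c    # difference on the Celsius scale
diff_k = warm_k - cool_k    # same difference on the Kelvin scale
ratio_c = warm_c / cool_c   # 2.0, but only an artefact of the arbitrary zero
ratio_k = warm_k / cool_k   # about 1.018, the physically meaningful ratio

print(round(diff_c, 2), round(diff_k, 2))    # 5.0 5.0
print(round(ratio_c, 3), round(ratio_k, 3))  # 2.0 1.018
```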

Page 19

Univariate data: concerns a single variable per observation; analysis focuses on describing the distribution and characteristics of that one variable.
Multivariate data: involves two or more variables per observation; analysis focuses on understanding relationships, correlations, and interactions among these variables.

Page 20

Univariate example: A list of students' favourite colours, where only the 'colour' variable is collected for each student.
Multivariate example: An ad-performance table including variables like gender of the viewer, their age group, click-through rate, and conversion rate for each ad interaction, allowing for analysis of how these factors influence ad effectiveness.

Page 21

Data representation checklist: A thorough evaluation ensures data is fit for analysis.
- Variables collected: What specific attributes or characteristics have been measured?
- Coding: How are categorical or qualitative variables numerically represented (e.g., Male = 0, Female = 1)?
- Measurement level: Identifying if data is nominal, ordinal, interval, or ratio determines appropriate statistical tests.
- Meaning: What do the data values truly represent in the real world?
- Quality: Evaluation of several aspects:
  - Accuracy: Data reflects the true values.
  - Completeness: No missing values or records.
  - Validity: Data conforms to defined business rules or constraints.
  - Consistency: Data is uniform across systems and time.
  - Uniqueness: No duplicate records.
  - Timeliness: Data is relevant and up-to-date.
  - Fitness for Use: Data meets the requirements for its intended analytical purpose.
- Missingness: Are there missing values, and if so, what is the pattern and underlying cause (e.g., missing at random, missing not at random)?
- Relevance: Is the data pertinent to the analytical question or problem being addressed?
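
A few of the quality checks in this checklist can be sketched in code. The records, field names, and the validity rule (age between 0 and 120) are hypothetical assumptions; real rules come from the business context.

```python
# Hypothetical survey records with deliberate quality problems.
records = [
    {"id": 1, "age": 34, "gender": "F"},
    {"id": 2, "age": None, "gender": "M"},   # missing age
    {"id": 3, "age": 212, "gender": "F"},    # implausible age
    {"id": 1, "age": 34, "gender": "F"},     # duplicate id
]

n = len(records)

# Completeness: share of records with a non-missing age.
completeness = sum(r["age"] is not None for r in records) / n

# Uniqueness: number of records that repeat an existing id.
duplicate_ids = n - len({r["id"] for r in records})

# Validity: ids whose age breaks the assumed rule 0 <= age <= 120.
invalid_age = [
    r["id"] for r in records
    if r["age"] is not None and not 0 <= r["age"] <= 120
]

print(completeness, duplicate_ids, invalid_age)  # 0.75 1 [3]
```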

Page 22

Data collection: The process of obtaining appropriate, high-quality data from various sources; it is critical to minimise error and bias and to ensure the data accurately represents the phenomena being studied. Methods include surveys, sensors, web scraping, and existing databases.

Page 23

Data cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset; involves handling missing values (e.g., imputation, deletion), outliers (e.g., removal, winsorization), and inconsistencies (e.g., standardisation, deduplication). This step is essential for ensuring accurate and reliable analysis results.
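
The cleaning steps named above can be sketched with the standard library. The data values are hypothetical, median imputation is one choice among several, and `winsorize` here is a simplified count-based version (clipping the `k` smallest and largest values) rather than a percentile-based one.

```python
from statistics import median

raw = [12.0, 15.0, None, 14.0, 13.0, 250.0, None, 16.0]

# 1. Missing values: impute with the median of the observed values.
observed = [x for x in raw if x is not None]
fill = median(observed)
imputed = [fill if x is None else x for x in raw]


# 2. Outliers: clip the k smallest and k largest values to their neighbours.
def winsorize(values, k=1):
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


cleaned = winsorize(imputed, k=1)

# 3. Inconsistencies: standardise text, then drop duplicates (order kept).
cities = ["Kuala Lumpur", " kuala lumpur", "Penang", "PENANG ", "Ipoh"]
standardised = [c.strip().title() for c in cities]
deduplicated = list(dict.fromkeys(standardised))

print(fill)          # 14.5
print(cleaned)       # [13.0, 15.0, 14.5, 14.0, 13.0, 16.0, 14.5, 16.0]
print(deduplicated)  # ['Kuala Lumpur', 'Penang', 'Ipoh']
```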

Page 24

Descriptive statistics: Techniques used to summarise and describe the main features of a dataset without drawing conclusions beyond the data itself.
- Univariate: Focus on a single variable.
  - Measures of central tendency: mean (average), median (middle value), mode (most frequent value).
  - Measures of dispersion: variance (spread of data points around the mean), standard deviation ($\sigma$, square root of variance), range (difference between max and min).
  - Visualisations: Histograms (show distribution shape and frequency of data), box-plots (display distribution summary, including quartiles and outliers).
- Multivariate: Focus on relationships between multiple variables.
  - Covariance: Measures the directional relationship between two variables (how they change together).
  - Correlation: Standardised measure of the linear relationship between two variables, ranging from $-1$ to $1$.
  - Contingency tables: Used for summarising the relationship between two or more categorical variables.
  - Visualisations: Scatter plots (show relationship between two numerical variables), heatmaps (represent correlation matrices or patterns in 2-D data).
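
The measures listed above can be computed as follows. The two variables are hypothetical, and the sketch uses population formulas (dividing by $n$); sample versions divide by $n - 1$ instead.

```python
from statistics import mean, median, mode, pstdev, pvariance

hours = [2, 4, 4, 5, 7, 8]          # hypothetical study hours
scores = [50, 60, 65, 70, 85, 90]   # hypothetical exam scores

# Univariate: central tendency and dispersion of a single variable.
centre = (mean(hours), median(hours), mode(hours))
spread = (pvariance(hours), pstdev(hours), max(hours) - min(hours))


# Multivariate: how two variables move together.
def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    # Covariance rescaled by both standard deviations, so it lies in [-1, 1].
    return covariance(x, y) / (pstdev(x) * pstdev(y))


cov = covariance(hours, scores)
corr = correlation(hours, scores)

print(centre)          # mean 5, median 4.5, mode 4
print(spread)          # variance 4, standard deviation 2.0, range 6
print(cov)             # 27.5
print(round(corr, 3))  # 0.993
```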

Page 25

Modelling: The process of building predictive or explanatory models once data is understood and prepared; aims to uncover underlying patterns, make forecasts, or classify new data.
- Univariate time-series models: Such as ARIMA (AutoRegressive Integrated Moving Average) models, used for forecasting future values of a single variable based on its past observations.
- Multivariate models: Include decision trees, random forests, and neural networks, which can handle multiple input variables to make predictions or classifications, capturing complex, non-linear relationships.
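
As a rough illustration of a univariate time-series model, the sketch below fits an AR(1) model, $x_t = c + \phi x_{t-1}$, by ordinary least squares. This is only the autoregressive part of ARIMA (no differencing or moving-average terms); the function name `fit_ar1` and the series are hypothetical, and real ARIMA fitting is normally done with a statistics library.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    sxy = sum((p - mean_prev) * (q - mean_next) for p, q in zip(x_prev, x_next))
    sxx = sum((p - mean_prev) ** 2 for p in x_prev)
    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    return c, phi


# Hypothetical series generated exactly by x[t] = 2 + 0.5 * x[t-1].
series = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]
c, phi = fit_ar1(series)

# One-step-ahead forecast from the last observation.
forecast = c + phi * series[-1]

print(round(c, 6), round(phi, 6), round(forecast, 6))  # 2.0 0.5 4.09375
```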

End of chapter