What two things do we need to understand data?
Semantics and type
Semantics
Real world meaning of data
Type
Structural or mathematical interpretation of data
Data sources
sensors
survey
simulations
computations
Data can be
raw, or derived from raw data by applying processes such as noise reduction, smoothing, or scaling
A typical data set used in visualization consists of
list of n records
Each record consists of
m (one or more) observations or variables
observation
may be a single number, symbol, or string, or a more complex structure. A variable may be classified as either independent or dependent
independent variable
is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set
dependent variable
is one whose value is affected by a variation in one or more associated independent variables
Each observation can be categorized in the following two types
ordinal
nominal
ordinal
The data takes on numeric value
Nominal
The data takes on non-numeric value
types of ordinal data
binary
discrete
Continuous
types of nominal
categorical
ranked
arbitrary
discrete
taking on only integer values, or values from a specific subset (e.g., {2, 4, 6});
continuous
representing real values (e.g., in the interval [0, 5]).
arbitrary
a variable with a potentially infinite range of values with no implied ordering (e.g., addresses)
Fields in a record can be classified as:
scalar
vector
tensor
scalar
An individual number in a data record, e.g., the cost of an item or the age of a person (absolute data)
vector
Multiple variables within a single record. For example, the flow of water in a 2D plane.
Tensor
A tensor is defined by its rank and by the dimensionality of the space within which it is defined. Scalars and vectors are simple variants of tensors: a scalar is a tensor of rank 0, and a vector is a tensor of rank 1
scalar field
univariate, with a single value attribute at each point in space
example of a 3D scalar field
time-varying medical scan above; another is the temperature in a room at each point in 3D space. The geometric intuition is that each point in a scalar field has a single value. A point in space can have several different numbers associated with it; if there is no underlying connection between them, they are simply multiple separate scalar fields
vector field
multivariate, with a list of multiple attribute values at each point. The geometric intuition is that each point in a vector field has a direction and magnitude, like an arrow that can point in any direction and be any length. The length might mean a motion’s speed or a force’s strength
concrete example of a 3D vector field
the air velocity in the room at a specific time point, where each item has a direction and speed. The dimensionality of the field determines the number of components in the direction vector; its length can be computed directly from these components using the standard Euclidean distance formula. As above, the standard cases are two, three, or four components.
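The length computation described above is the standard Euclidean norm; a minimal Python sketch follows (the velocity values are hypothetical):

```python
import math

def magnitude(components):
    """Euclidean length of a vector given its components."""
    return math.sqrt(sum(c * c for c in components))

# A hypothetical 3D air-velocity sample, in m/s.
velocity = (3.0, 4.0, 0.0)
speed = magnitude(velocity)  # sqrt(9 + 16 + 0) = 5.0
```

The same function works unchanged for two, three, or four components, matching the dimensionalities mentioned above.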
tensor field
array of attributes at each point, representing a more complex multivariate mathematical structure than the list of numbers in a vector
A physical example tensor field
stress, which, in the case of a 3D field, can be defined by nine numbers that represent forces acting in three orthogonal directions. The geometric intuition is that just an arrow cannot represent the full information at each point in a tensor field and would require a more complex shape, such as an ellipsoid.
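The three field types can be illustrated with plain Python values. The numbers below are hypothetical, and `rank` is a helper introduced here only to show how rank counts nesting levels:

```python
# One spatial point in each kind of field (illustrative values only).
scalar_at_point = 21.5                # rank 0: e.g. temperature in degrees C
vector_at_point = [0.3, -1.2, 0.8]    # rank 1: e.g. air velocity, 3 components
tensor_at_point = [                   # rank 2: e.g. a 3x3 stress tensor
    [1.0, 0.2, 0.0],
    [0.2, 0.9, 0.1],
    [0.0, 0.1, 1.1],
]

def rank(value):
    """Tensor rank here = number of nested list levels (0 for a bare number)."""
    r = 0
    while isinstance(value, list):
        r += 1
        value = value[0]
    return r
```

This mirrors the text: a scalar is rank 0, a vector rank 1, and the 3D stress tensor is a rank-2 object with nine numbers.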
first step in data visualization.
data processing
metadata
data (information) about data
metadata helps what?
understanding the context of the data and provides guidance for the preprocessing.
metadata provides
information like the format of individual fields, base reference points, units of measurement, and symbols or numbers used
Statistical analysis of the data provides us
with mean, median, etc., and helps in outlier detection, clustering, and finding correlation.
Outlier detection can indicate
records with erroneous data fields
Cluster analysis
can help segment the data into groups exhibiting strong similarities
Correlation analysis
can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise.
types of statistical analysis
outlier detection
cluster
correlation
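The three analyses above can be sketched on a toy dataset using only the standard library. The readings and the 2-standard-deviation outlier rule are illustrative choices, not a prescribed method:

```python
import math
import statistics

data = [4.8, 5.1, 5.0, 4.9, 5.2, 12.0]  # hypothetical readings; 12.0 looks suspect

mean = statistics.mean(data)
median = statistics.median(data)

# Simple outlier detection: flag values more than 2 standard deviations from the mean.
stdev = statistics.stdev(data)
outliers = [x for x in data if abs(x - mean) > 2 * stdev]

# Correlation analysis: Pearson's r between two variables.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # 1.0: perfectly linear relationship
```

An r near 1 or -1 signals a redundant field that could be eliminated, as the text notes.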
Erroneous data
is most often caused by human error and is difficult to detect.
reasons for dirty data
A malfunctioning sensor, blank entry in surveys, omission on the part of the person doing data entry, etc.
Pros of deleting bad records
easy to implement
cons of deleting bad records
Data loss.
Sometimes the missing data is of more interest than the actual data, as with malfunctioning sensors.
never delete records when
the missing-value records make up more than 2% of the whole dataset.
Assigning a Sentinel Value CONS
Care must be taken not to perform statistical analysis on these sentinel values
Assigning a Sentinel Value PROS
Easy to visualize the erroneous data
when using a sentinel value, can you use statistical analysis?
No; statistical analysis should not be performed on the sentinel values themselves.
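A short sketch of why sentinels must be excluded before any statistics are computed; the sentinel -999 and the ages are hypothetical:

```python
import statistics

SENTINEL = -999  # hypothetical marker meaning "value missing"

ages = [34, 41, SENTINEL, 29, SENTINEL, 52]

# Naive mean is nonsense because the sentinels are treated as real ages:
naive_mean = statistics.mean(ages)      # -307

# Exclude sentinels before any statistical analysis:
valid = [a for a in ages if a != SENTINEL]
true_mean = statistics.mean(valid)      # (34 + 41 + 29 + 52) / 4 = 39
```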
Assigning Average Value:
Calculate the average value of the variable or dimension and use it to replace the missing value
Assigning Average Value: PROS
It minimally affects the overall statistics for the variable
Assigning Average Value: CONS
May not be a good guess.
May mask or obscure outliers.
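Mean imputation can be sketched as follows, with `None` marking missing entries (the heights are hypothetical):

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

heights = [170, None, 165, 175, None]   # hypothetical heights in cm
filled = impute_mean(heights)           # missing entries become 170
```

Because the fill value is the mean itself, the overall mean is unchanged, which is exactly the "minimally affects the statistics" property noted above.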
Assigning Value based on Nearest Neighbors:
Find the record with the highest similarity to the record in question, based on the differences in all other variables, and assign its value to the missing field
Assigning Value based on Nearest Neighbors: PROS
Better approximation
Assigning Value based on Nearest Neighbors: CONS
Variable in question may be most dependent on only a subset of the other dimensions, rather than on all dimensions
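A minimal sketch of nearest-neighbor imputation, assuming numeric records stored as dicts; the field names and values are hypothetical:

```python
import math

def impute_nearest(records, target, missing_field):
    """Fill records[target][missing_field] from the most similar complete record.

    Similarity = Euclidean distance over all the other fields.
    """
    other_fields = [f for f in records[target] if f != missing_field]

    def distance(rec):
        return math.sqrt(sum((rec[f] - records[target][f]) ** 2
                             for f in other_fields))

    candidates = [r for i, r in enumerate(records)
                  if i != target and r[missing_field] is not None]
    nearest = min(candidates, key=distance)
    records[target][missing_field] = nearest[missing_field]
    return records[target]

# Hypothetical records: 'score' is missing in the last one.
people = [
    {"age": 30, "income": 50, "score": 7},
    {"age": 55, "income": 90, "score": 3},
    {"age": 31, "income": 52, "score": None},
]
impute_nearest(people, target=2, missing_field="score")  # copies score from the 30/50 record
```

Using all other fields in the distance is exactly the weakness the CONS line points out: the missing variable may depend on only a subset of them.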
Compute a Substitute Value:
Use a computed estimate that has high statistical confidence in place of the missing value.
Compute a Substitute Value: PROS
Mostly accurate
Compute a Substitute Value: CONS
Requires a significant investment in research and experimentation.
Compute a Substitute Value: is based on
scientific research and is known as imputation
Compute a Substitute Value:In case of normal distribution
we impute the missing values with the mean.
Compute a Substitute Value: In case of a skewed distribution (right skewed/positive skewed or left skewed/negative skewed)
we use the median as the imputation value.
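The mean-vs-median rule can be sketched as below. Judging skew with Pearson's second skewness coefficient and a 0.5 cutoff is a simplifying assumption introduced here, not part of the original method:

```python
import statistics

def imputation_value(observed):
    """Mean for roughly symmetric data, median for skewed data.

    Skew is judged with Pearson's second coefficient,
    3 * (mean - median) / stdev (an assumed, simple estimate).
    """
    mean = statistics.mean(observed)
    median = statistics.median(observed)
    stdev = statistics.stdev(observed)
    skew = 3 * (mean - median) / stdev if stdev else 0
    return median if abs(skew) > 0.5 else mean

symmetric = [2, 4, 6, 8, 10]        # mean == median -> impute with the mean (6)
right_skewed = [1, 2, 2, 3, 3, 40]  # long right tail -> impute with the median (2.5)
```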
Normalization
the process of transforming a data set so that the results satisfy a particular statistical property
Normalization: convert all variables to a range of 0 to 1 (min-max normalization)
Normalized value = (Original – Min) / (Max – Min)
Standardization (z-score): Standardized value = (Original – Mean) / (Standard deviation)
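Both transformations as a minimal Python sketch (the scores are hypothetical):

```python
import statistics

def min_max_normalize(values):
    """Map values linearly into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Z-score: (x - mean) / stdev, giving mean 0 and stdev 1."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(x - mean) / stdev for x in values]

scores = [10, 20, 30, 40, 50]
min_max_normalize(scores)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```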
Normalization may also involve
bounding values
bounding values
values exceeding a threshold value are capped at that threshold.
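Bounding is a simple clamp; the threshold of 100 below is an arbitrary illustration:

```python
def bound(values, threshold):
    """Cap every value exceeding the threshold at the threshold itself."""
    return [min(v, threshold) for v in values]

incomes = [40, 55, 61, 480, 52]   # hypothetical; 480 would dominate any linear scale
bound(incomes, 100)               # [40, 55, 61, 100, 52]
```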
Segmentation:
separate data into contiguous regions, where each region corresponds to a particular classification of data.
Segmentation: Top-down approach -
Start with a single cluster containing all the data, then move down by increasing the number of clusters.
Segmentation:Bottom-up approach
Start with one cluster per record, so each record represents its own cluster, then repeatedly merge the clusters.
The most commonly used method for clustering is
K-means clustering
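A minimal 1D k-means sketch. Real implementations handle multiple dimensions, smarter initialization, and convergence checks; the fixed iteration count and starting centers here are simplifying assumptions:

```python
def kmeans_1d(values, centers, iterations=10):
    """Minimal 1D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups around 2 and 10 (hypothetical readings).
centers, clusters = kmeans_1d([1, 2, 3, 9, 10, 11], centers=[0.0, 5.0])
# centers -> [2.0, 10.0]; clusters -> [[1, 2, 3], [9, 10, 11]]
```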
Simple segmentation
can be performed by just mapping disjoint ranges of the data values to specific categories. However, in most situations, the assignment of values to a category is ambiguous
when category is ambiguous
It is important to look at the classification of neighbouring points to improve the confidence of the classification, or even to perform a probabilistic segmentation, where each data point is assigned a probability of belonging to each of the available classifications.
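Simple range-based segmentation can be sketched as below; the intensity thresholds for air/soft tissue/bone are hypothetical:

```python
def segment(value, boundaries, labels):
    """Map a value to the category of the first disjoint range it falls in.

    boundaries are upper bounds for each label except the last (the catch-all).
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]

# Hypothetical scan-intensity thresholds.
labels = ["air", "soft tissue", "bone"]
segment(120, boundaries=[50, 300], labels=labels)  # "soft tissue"
```

A value right at a boundary is exactly the ambiguous case the text describes, which is where neighbour-based or probabilistic segmentation helps.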
Methods used for data preprocessing
Data Cleaning
Assigning values
Imputation
Clustering and Segmentation