ITSS 4355 Data Visualization Module – 3 Data Foundations

68 Terms

1
New cards

To understand data, what two things do we need

semantics and type

2
New cards

Semantics

Real world meaning of data

3
New cards

Type

Structural or mathematical interpretation of data

4
New cards

Data sources

sensors 
surveys
simulations
computations

5
New cards

Data can be

raw, or derived from raw data after applying processes such as noise reduction, smoothing, and scaling

6
New cards

A typical data set used in visualization consists of

a list of n records

7
New cards

Each record consists of

m (one or more) observations or variables

8
New cards

observation

may be a single number, symbol, or string, or a more complex structure. A variable may be classified as either independent or dependent

9
New cards

independent variable

is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set

10
New cards

dependent variable

is one whose value is affected by a variation in one or more associated independent variables

11
New cards

Each observation can be categorized as one of the following two types

ordinal
nominal

12
New cards

ordinal

The data takes on numeric values

13
New cards

Nominal

The data takes on non-numeric values

14
New cards

types of ordinal data

binary
discrete
Continuous

15
New cards

types of nominal data

categorical
ranked
arbitrary

16
New cards

discrete

taking on only integer values or values from a specific subset (e.g., {2, 4, 6})

17
New cards

continuous

representing real values (e.g., in the interval [0, 5]).

18
New cards

arbitrary

a variable with a potentially infinite range of values with no implied ordering (e.g., addresses)

19
New cards

Fields in a record can be classified as:

scalar
vector
tensor

20
New cards

scalar

An individual number in a data record, e.g., the cost of an item or the age of a person (absolute data)

21
New cards

vector

Multiple variables within a single record. For example, the flow of water in a 2D plane.

22
New cards

Tensor

Defined by its rank and by the dimensionality of the space within which it is defined. Scalars and vectors are simple variants of tensors: a scalar is a tensor of rank 0, and a vector is a tensor of rank 1.

23
New cards

scalar field 

univariate, with a single value attribute at each point in space

24
New cards

example of a 3D scalar field

a time-varying medical scan; another is the temperature in a room at each point in 3D space. The geometric intuition is that each point in a scalar field has a single value. A point in space can have several different numbers associated with it; if there is no underlying connection between them, they are simply multiple separate scalar fields

25
New cards

vector field 

multivariate, with a list of multiple attribute values at each point. The geometric intuition is that each point in a vector field has a direction and magnitude, like an arrow that can point in any direction and be any length. The length might mean a motion’s speed or a force’s strength

26
New cards

concrete example of a 3D vector field

the air velocity in the room at a specific time point, where each item has a direction and speed. The dimensionality of the field determines the number of components in the direction vector; its length can be computed directly from these components using the standard Euclidean distance formula. As above, the standard cases are two, three, or four components.

27
New cards

tensor field 

array of attributes at each point, representing a more complex multivariate mathematical structure than the list of numbers in a vector

28
New cards

A physical example of a tensor field

stress, which, in the case of a 3D field, can be defined by nine numbers that represent forces acting in three orthogonal directions. The geometric intuition is that just an arrow cannot represent the full information at each point in a tensor field and would require a more complex shape, such as an ellipsoid.

29
New cards

first step in data visualization.

data processing

30
New cards

metadata

data (information) about data

31
New cards

metadata helps what?

understanding the context of the data and provides guidance for the preprocessing.

32
New cards

metadata provides

information like the format of individual fields, base reference points, units of measurement, and symbols or numbers used

33
New cards

Getting the statistical analysis on data provides us

with mean, median, etc., and helps in outlier detection, clustering, and finding correlation.

34
New cards

Outlier detection can indicate

records with erroneous data fields

35
New cards

Cluster analysis

can help segment the data into groups exhibiting strong similarities

36
New cards

Correlation analysis

can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise.

37
New cards

types of statistical analysis

outlier detection
cluster
correlation

38
New cards

Erroneous data

is most often caused by human error and is difficult to detect.

39
New cards

reasons for dirty data

A malfunctioning sensor, blank entry in surveys, omission on the part of the person doing data entry, etc.

40
New cards

Pros of deleting bad records

easy to implement

41
New cards

cons of deleting bad records

Data loss.

Sometimes the missing data is of more interest than the actual data, as with malfunctioning sensors.

42
New cards

never delete records when

missing value records are more than 2% of the whole dataset.

43
New cards

Assigning a Sentinel Value CONS

Care must be taken not to perform statistical analysis on these sentinel values

44
New cards

Assigning a Sentinel Value PROS

Easy to visualize the erroneous data

45
New cards

when using a sentinel value can you use statistical analysis?

No; statistical analysis must not be performed on the sentinel values.
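
A minimal sketch of the idea, assuming -999 as the sentinel value (both the dataset and the sentinel choice are made up for illustration):

```python
# Sentinel values mark missing entries; they must be filtered out
# before computing statistics, or they distort the results.
SENTINEL = -999  # assumed sentinel choice

readings = [12.0, SENTINEL, 14.5, 13.2, SENTINEL, 15.1]

valid = [v for v in readings if v != SENTINEL]   # drop sentinel entries
mean_valid = sum(valid) / len(valid)             # mean over real data only
naive_mean = sum(readings) / len(readings)       # wrong: sentinels included
```

The gap between `mean_valid` and `naive_mean` shows why the CONS card warns against including sentinels in any statistical analysis.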

46
New cards

Assigning Average Value:

Calculate and replace the missing value with the average value of the variable or dimension

47
New cards

Assigning Average Value: PROS

It minimally affects the overall statistics for the variable

48
New cards

Assigning Average Value: CONS

•May not be a good guess.

•May mask or obscure outliers

49
New cards

Assigning Value based on Nearest Neighbors:

we find the record that has the highest similarity with the record in question, based on analyzing the differences in all other variables, and then assign its value to the missing field

50
New cards

Assigning Value based on Nearest Neighbors: PROS

Better approximation

51
New cards

Assigning Value based on Nearest Neighbors: CONS

The variable in question may depend on only a subset of the other dimensions, rather than on all of them
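
A minimal sketch of nearest-neighbor imputation, using hypothetical (height, weight, age) records and Euclidean distance over the non-missing fields:

```python
import math

# Hypothetical records: (height, weight, age); one record is missing age.
records = [
    (170, 65, 30),
    (171, 66, 31),
    (190, 95, 50),
    (170, 64, None),  # missing value to impute
]

def distance(a, b, skip):
    # Euclidean distance over all fields except the missing one.
    return math.sqrt(sum((a[i] - b[i]) ** 2
                         for i in range(len(a)) if i != skip))

target = records[3]
skip = target.index(None)  # index of the missing field
# Most similar complete record, judged on the other variables.
neighbor = min((r for r in records if r[skip] is not None),
               key=lambda r: distance(target, r, skip))
imputed = neighbor[skip]   # copy the neighbor's value into the gap
```

Here the (170, 64) record is closest to (170, 65), so its age of 30 is used; this is the "better approximation" the PROS card refers to.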

52
New cards

Compute a Substitute Value:

Find values that have high statistical confidence.

53
New cards

Compute a Substitute Value: PROS

Mostly accurate

54
New cards

Compute a Substitute Value: CONS

A significant amount of money and effort must be devoted to research and experiments.

55
New cards

Compute a Substitute Value: is based on

scientific research and is known as imputation

56
New cards

Compute a Substitute Value: in case of a normal distribution

We impute the missing values with the mean.

57
New cards

Compute a Substitute Value: in case of a skewed distribution (right-skewed/positive or left-skewed/negative)

We use the median as the imputation value.
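
A small sketch of both rules, with made-up symmetric and skewed samples (missing values are marked with None):

```python
import statistics

# Roughly symmetric sample: mean is a reasonable fill value.
symmetric = [4, 5, 5, 6, None, 6, 5, 4]
# Right-skewed sample (one large outlier): median is more robust.
skewed = [1, 2, 2, 3, None, 3, 50]

def impute(values, is_skewed):
    present = [v for v in values if v is not None]
    # Median for skewed distributions, mean for normal ones.
    fill = statistics.median(present) if is_skewed else statistics.mean(present)
    return [fill if v is None else v for v in values]

sym_filled = impute(symmetric, is_skewed=False)
skw_filled = impute(skewed, is_skewed=True)
```

In the skewed sample the outlier (50) would pull the mean far above most values, which is exactly why the median is preferred there.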

58
New cards

Normalization

the process of transforming a data set so that the results satisfy a particular statistical property

59
New cards

Normalization: convert all variables to the range 0 to 1 (min-max normalization) or to zero mean and unit variance (standardization)

Min-max: Normalized value = (Original – Min) / (Max – Min)
Z-score: Normalized value = (Original – Mean) / (Standard deviation)
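
Both formulas can be sketched in a few lines (the sample values are made up):

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # made-up sample

# Min-max normalization: maps values into the range [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): zero mean, unit standard deviation.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation
zscores = [(v - mu) / sigma for v in values]
```

After min-max scaling the smallest value maps to 0 and the largest to 1; after standardization the transformed values average to 0.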

60
New cards

Normalization may also involve

bounding values

61
New cards

bounding values

values exceeding a threshold value are capped at that threshold.
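
A one-line sketch of bounding, assuming a threshold of 100.0 (e.g., a sensor's known physical maximum; both the threshold and the readings are made up):

```python
# Bounding: values exceeding the threshold are capped at the threshold.
THRESHOLD = 100.0  # assumed upper bound

readings = [23.5, 150.2, 99.9, 300.0, 87.1]
capped = [min(v, THRESHOLD) for v in readings]
```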

62
New cards

Segmentation:

separate data into contiguous regions, where each region corresponds to a particular classification of data.

63
New cards

Segmentation: Top-down approach -

Start with one cluster containing all the data, then move down by increasing the number of clusters.

64
New cards

Segmentation:Bottom-up approach

Start with each record as its own cluster, then iteratively merge clusters.

65
New cards

The most commonly used method for clustering is

K-means clustering
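
A minimal 1-D k-means sketch with k = 2 (the data points and initial centers are made up; real use would handle empty clusters and random initialization):

```python
import statistics

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]  # two obvious groups
centers = [1.0, 10.0]                      # assumed initial centers

for _ in range(10):  # a few refinement iterations
    # Assignment step: each point joins its nearest center's cluster.
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    # Update step: move each center to the mean of its cluster.
    centers = [statistics.mean(c) for c in clusters]
```

The two steps (assign, then update) repeat until the centers stop moving; here they settle at the means of the two groups.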

66
New cards

Simple segmentation

can be performed by just mapping disjoint ranges of the data values to specific categories. However, in most situations, the assignment of values to a category is ambiguous
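
The first sentence, mapping disjoint value ranges to categories, can be sketched as follows (the temperature thresholds and labels are assumed for illustration):

```python
# Simple segmentation: disjoint ranges of a value map to categories.
def classify(temp_c):
    if temp_c < 0:
        return "freezing"
    elif temp_c < 15:
        return "cold"
    elif temp_c < 25:
        return "mild"
    else:
        return "hot"

labels = [classify(t) for t in [-5, 10, 20, 30]]
```

Values near a boundary (e.g., 14.9 vs. 15.1) land in different categories, which is the ambiguity the card goes on to describe.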

67
New cards

when category is ambiguous

important to look at the classification of neighbouring points to improve the confidence of classification, or even to do a probabilistic segmentation, where each data point is assigned a probability for belonging to each of the available classifications.

68
New cards

Methods used for data preprocessing

Data Cleaning

Assigning Values

Imputation

Clustering and Segmentation
