What two things do we need to understand data?
Semantics and type
Semantics
Real world meaning of data
Type
Structural or mathematical interpretation of data
Data sources
sensors
survey
simulations
computations
Data can be
raw, or derived from raw data by applying processes such as noise reduction, smoothing, or scaling
A typical data set used in visualization consists of
list of n records
Each record consists of
m (one or more) observations or variables
observation
may be a single number, symbol, or string, or a more complex structure. A variable may be classified as either independent or dependent
independent variable
is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set
dependent variable
is one whose value is affected by a variation in one or more associated independent variables
Each observation can be categorized in the following two types
ordinal
nominal
ordinal
The data takes on numeric value
Nominal
The data takes on non-numeric value
types of ordinal data
binary
discrete
Continuous
types of nominal
categorical
ranked
arbitrary
discrete
taking on only integer values, or values from a specific subset (e.g., {2, 4, 6});
continuous
representing real values (e.g., in the interval [0, 5]).
arbitrary
a variable with a potentially infinite range of values with no implied ordering (e.g., addresses)
Fields in a record can be classified as:
scalar
vector
tensor
scalar
An individual number in a data record, e.g., the cost of an item or the age of a person (absolute data)
vector
Multiple variables within a single record. For example, the flow of water in a 2D plane.
Tensor
A tensor is defined by its rank and by the dimensionality of the space within which it is defined. Scalars and vectors are simple variants of tensors: a scalar is a tensor of rank 0, and a vector is a tensor of rank 1
scalar field
univariate, with a single value attribute at each point in space
example of a 3D scalar field
time-varying medical scan above; another is the temperature in a room at each point in 3D space. The geometric intuition is that each point in a scalar field has a single value. A point in space can have several different numbers associated with it; if there is no underlying connection between them, they are simply multiple separate scalar fields
vector field
multivariate, with a list of multiple attribute values at each point. The geometric intuition is that each point in a vector field has a direction and magnitude, like an arrow that can point in any direction and be any length. The length might mean a motion’s speed or a force’s strength
concrete example of a 3D vector field
the air velocity in the room at a specific time point, where each item has a direction and speed. The dimensionality of the field determines the number of components in the direction vector; its length can be computed directly from these components using the standard Euclidean distance formula. As above, the standard cases are two, three, or four components.
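The length computation described above is the standard Euclidean norm; a minimal Python sketch follows (the velocity values are hypothetical):

```python
import math

def magnitude(components):
    """Euclidean length of a vector given its components."""
    return math.sqrt(sum(c * c for c in components))

# A hypothetical 3D air-velocity sample, in m/s.
velocity = (3.0, 4.0, 0.0)
speed = magnitude(velocity)  # sqrt(9 + 16 + 0) = 5.0
```

The same function works unchanged for two, three, or four components, matching the dimensionalities mentioned above.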
tensor field
array of attributes at each point, representing a more complex multivariate mathematical structure than the list of numbers in a vector
A physical example tensor field
stress, which, in the case of a 3D field, can be defined by nine numbers that represent forces acting in three orthogonal directions. The geometric intuition is that just an arrow cannot represent the full information at each point in a tensor field and would require a more complex shape, such as an ellipsoid.
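The three field types can be illustrated with plain Python values. The numbers below are hypothetical, and `rank` is a helper introduced here only to show how rank counts nesting levels:

```python
# One spatial point in each kind of field (illustrative values only).
scalar_at_point = 21.5                # rank 0: e.g. temperature in degrees C
vector_at_point = [0.3, -1.2, 0.8]    # rank 1: e.g. air velocity, 3 components
tensor_at_point = [                   # rank 2: e.g. a 3x3 stress tensor
    [1.0, 0.2, 0.0],
    [0.2, 0.9, 0.1],
    [0.0, 0.1, 1.1],
]

def rank(value):
    """Tensor rank here = number of nested list levels (0 for a bare number)."""
    r = 0
    while isinstance(value, list):
        r += 1
        value = value[0]
    return r
```

This mirrors the text: a scalar is rank 0, a vector rank 1, and the 3D stress tensor is a rank-2 object with nine numbers.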
first step in data visualization.
data processing
metadata
data (information) about data
metadata helps what?
understanding the context of the data and provides guidance for the preprocessing.
metadata provides
information like the format of individual fields, base reference points, units of measurement, and symbols or numbers used
Statistical analysis of the data provides us
with mean, median, etc., and helps in outlier detection, clustering, and finding correlation.
Outlier detection can indicate
records with erroneous data fields
Cluster analysis
can help segment the data into groups exhibiting strong similarities
Correlation analysis
can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise.
types of statistical analysis
outlier detection
cluster
correlation
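The three analyses above can be sketched on a toy dataset using only the standard library. The readings and the 2-standard-deviation outlier rule are illustrative choices, not a prescribed method:

```python
import math
import statistics

data = [4.8, 5.1, 5.0, 4.9, 5.2, 12.0]  # hypothetical readings; 12.0 looks suspect

mean = statistics.mean(data)
median = statistics.median(data)

# Simple outlier detection: flag values more than 2 standard deviations from the mean.
stdev = statistics.stdev(data)
outliers = [x for x in data if abs(x - mean) > 2 * stdev]

# Correlation analysis: Pearson's r between two variables.
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # 1.0: perfectly linear relationship
```

An r near 1 or -1 signals a redundant field that could be eliminated, as the text notes.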
Erroneous data
is most often caused by human error and is difficult to detect.
reasons for dirty data
A malfunctioning sensor, blank entry in surveys, omission on the part of the person doing data entry, etc.
Pros of deleting bad records
easy to implement
cons of deleting bad records
Data loss.
Sometimes the missing data is of more interest than the actual data, as with malfunctioning sensors.
never delete records when
the missing-value records make up more than 2% of the whole dataset.
Assigning a Sentinel Value CONS
Care must be taken not to perform statistical analysis on these sentinel values
Assigning a Sentinel Value PROS
Easy to visualize the erroneous data
when using a sentinel value, can you use statistical analysis?
No; statistical analysis should not be performed on the sentinel values themselves.
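A short sketch of why sentinels must be excluded before any statistics are computed; the sentinel -999 and the ages are hypothetical:

```python
import statistics

SENTINEL = -999  # hypothetical marker meaning "value missing"

ages = [34, 41, SENTINEL, 29, SENTINEL, 52]

# Naive mean is nonsense because the sentinels are treated as real ages:
naive_mean = statistics.mean(ages)      # -307

# Exclude sentinels before any statistical analysis:
valid = [a for a in ages if a != SENTINEL]
true_mean = statistics.mean(valid)      # (34 + 41 + 29 + 52) / 4 = 39
```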
Assigning Average Value:
Calculate the average value of the variable or dimension and use it to replace the missing value
Assigning Average Value: PROS
It minimally affects the overall statistics for the variable
Assigning Average Value: CONS
May not be a good guess.
May mask or obscure outliers.
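Mean imputation can be sketched as follows, with `None` marking missing entries (the heights are hypothetical):

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

heights = [170, None, 165, 175, None]   # hypothetical heights in cm
filled = impute_mean(heights)           # missing entries become 170
```

Because the fill value is the mean itself, the overall mean is unchanged, which is exactly the "minimally affects the statistics" property noted above.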
Assigning Value based on Nearest Neighbors:
Find the record with the highest similarity to the record in question, based on the differences in all other variables, and assign its value to the missing field
Assigning Value based on Nearest Neighbors: PROS
Better approximation
Assigning Value based on Nearest Neighbors: CONS
Variable in question may be most dependent on only a subset of the other dimensions, rather than on all dimensions
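A minimal sketch of nearest-neighbor imputation, assuming numeric records stored as dicts; the field names and values are hypothetical:

```python
import math

def impute_nearest(records, target, missing_field):
    """Fill records[target][missing_field] from the most similar complete record.

    Similarity = Euclidean distance over all the other fields.
    """
    other_fields = [f for f in records[target] if f != missing_field]

    def distance(rec):
        return math.sqrt(sum((rec[f] - records[target][f]) ** 2
                             for f in other_fields))

    candidates = [r for i, r in enumerate(records)
                  if i != target and r[missing_field] is not None]
    nearest = min(candidates, key=distance)
    records[target][missing_field] = nearest[missing_field]
    return records[target]

# Hypothetical records: 'score' is missing in the last one.
people = [
    {"age": 30, "income": 50, "score": 7},
    {"age": 55, "income": 90, "score": 3},
    {"age": 31, "income": 52, "score": None},
]
impute_nearest(people, target=2, missing_field="score")  # copies score from the 30/50 record
```

Using all other fields in the distance is exactly the weakness the CONS line points out: the missing variable may depend on only a subset of them.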
Compute a Substitute Value:
Use a computed estimate that has high statistical confidence in place of the missing value.
Compute a Substitute Value: PROS
Mostly accurate
Compute a Substitute Value: CONS
Requires a significant investment in research and experimentation.
Compute a Substitute Value: is based on
scientific research and is known as imputation
Compute a Substitute Value:In case of normal distribution
we impute the missing values with the mean.
Compute a Substitute Value: In case of a skewed distribution (right skewed/positive skewed or left skewed/negative skewed)
we use the median as the imputation value.
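The mean-vs-median rule can be sketched as below. Judging skew with Pearson's second skewness coefficient and a 0.5 cutoff is a simplifying assumption introduced here, not part of the original method:

```python
import statistics

def imputation_value(observed):
    """Mean for roughly symmetric data, median for skewed data.

    Skew is judged with Pearson's second coefficient,
    3 * (mean - median) / stdev (an assumed, simple estimate).
    """
    mean = statistics.mean(observed)
    median = statistics.median(observed)
    stdev = statistics.stdev(observed)
    skew = 3 * (mean - median) / stdev if stdev else 0
    return median if abs(skew) > 0.5 else mean

symmetric = [2, 4, 6, 8, 10]        # mean == median -> impute with the mean (6)
right_skewed = [1, 2, 2, 3, 3, 40]  # long right tail -> impute with the median (2.5)
```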
Normalization
the process of transforming a data set so that the results satisfy a particular statistical property
Normalization: convert all variables to a range of 0 to 1 (min-max normalization)
Normalized value = (Original – Min) / (Max – Min)
Standardization (z-score): Standardized value = (Original – Mean) / (Standard deviation)
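Both transformations as a minimal Python sketch (the scores are hypothetical):

```python
import statistics

def min_max_normalize(values):
    """Map values linearly into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Z-score: (x - mean) / stdev, giving mean 0 and stdev 1."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(x - mean) / stdev for x in values]

scores = [10, 20, 30, 40, 50]
min_max_normalize(scores)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```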
Normalization may also involve
bounding values
bounding values
values exceeding a threshold value are capped at that threshold.
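Bounding is a simple clamp; the threshold of 100 below is an arbitrary illustration:

```python
def bound(values, threshold):
    """Cap every value exceeding the threshold at the threshold itself."""
    return [min(v, threshold) for v in values]

incomes = [40, 55, 61, 480, 52]   # hypothetical; 480 would dominate any linear scale
bound(incomes, 100)               # [40, 55, 61, 100, 52]
```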
Segmentation:
separate data into contiguous regions, where each region corresponds to a particular classification of data.
Segmentation: Top-down approach -
Start with a single cluster containing all the data, then move down by increasing the number of clusters.
Segmentation:Bottom-up approach
Start with one cluster per record, so each record represents its own cluster, then repeatedly merge the clusters.
The most commonly used method for clustering is
K-means clustering
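A minimal 1D k-means sketch. Real implementations handle multiple dimensions, smarter initialization, and convergence checks; the fixed iteration count and starting centers here are simplifying assumptions:

```python
def kmeans_1d(values, centers, iterations=10):
    """Minimal 1D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups around 2 and 10 (hypothetical readings).
centers, clusters = kmeans_1d([1, 2, 3, 9, 10, 11], centers=[0.0, 5.0])
# centers -> [2.0, 10.0]; clusters -> [[1, 2, 3], [9, 10, 11]]
```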
Simple segmentation
can be performed by just mapping disjoint ranges of the data values to specific categories. However, in most situations, the assignment of values to a category is ambiguous
when category is ambiguous
It is important to look at the classification of neighbouring points to improve the confidence of the classification, or even to perform a probabilistic segmentation, where each data point is assigned a probability of belonging to each of the available classifications.
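Simple range-based segmentation can be sketched as below; the intensity thresholds for air/soft tissue/bone are hypothetical:

```python
def segment(value, boundaries, labels):
    """Map a value to the category of the first disjoint range it falls in.

    boundaries are upper bounds for each label except the last (the catch-all).
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]

# Hypothetical scan-intensity thresholds.
labels = ["air", "soft tissue", "bone"]
segment(120, boundaries=[50, 300], labels=labels)  # "soft tissue"
```

A value right at a boundary is exactly the ambiguous case the text describes, which is where neighbour-based or probabilistic segmentation helps.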
Methods used for data preprocessing
Data Cleaning
Assigning values
Imputation
Clustering and Segmentation