the ubiquity of data opportunities
generation, collection, accessibility, risk management, and decision making
→ computers became very powerful
→ data networking is very fast
→ computer storage is very cheap
→ algorithms were developed to process large datasets quickly
comparative size of different volumes of storage
→ bit (b): the smallest unit of data; can represent a 0 or 1
→ byte (B): consists of 8 bits; it is the fundamental unit of storage in computers
→ kilobyte (KB): equal to 1,024 bytes; often used to describe the size of small text files or images
→ megabyte (MB): equal to 1,024 KB or 1,048,576 bytes; commonly used to measure the size of documents, photos, and short videos
→ gigabyte (GB): equal to 1,024 MB or 1,073,741,824 bytes; used for larger files such as high-resolution videos, software applications, and larger databases
→ terabyte (TB): equal to 1,024 GB or 1,099,511,627,776 bytes; often used to measure the storage capacity of hard drives, servers, and cloud storage
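The unit sizes above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 1,024-based definitions used in these notes; the helper names `bytes_in` and `human_readable` are made up for illustration.

```python
# Storage units in ascending order; each is 1,024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (1 KB = 1,024 B, and so on)."""
    return 1024 ** UNITS.index(unit)

def human_readable(n_bytes: int) -> str:
    """Format a byte count using the largest unit that keeps the value under 1,024."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(bytes_in("MB"))             # 1048576
print(bytes_in("TB"))             # 1099511627776
print(human_readable(5_000_000))  # 4.77 MB
```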
structured data
data that resides in a pre-defined row/column format
→ roughly 20% of all data (a commonly cited estimate)
unstructured data
does not conform to a pre-defined row/column format
→ roughly 80% of all data (a commonly cited estimate)
→ ex. emails, videos, images, etc.
4 V’s of Big Data
volume = amount
variety = type of data (structured or unstructured)
velocity = speed
veracity = quality and trustworthiness of data (freedom from bias, noise, and abnormalities)
types of data analytics
→ descriptive
→ diagnostic
→ predictive
→ prescriptive
important business analytics applications
consumer analytics, financial analytics, supply chain, HR, risk, etc.
supervised learning
discovering patterns in the data that relate data attributes / variables with a target variable
unsupervised learning
the data has no target variable
→ we want to explore the data to find intrinsic structures or similarities in it
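As an illustration of finding structure without a target variable, here is a minimal sketch of k-means clustering on one-dimensional data. K-means is only one of many unsupervised methods; the function name `kmeans_1d`, the starting-center rule, and the spend figures are all assumptions made for this example.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means for 1-D data (assumes k >= 2); returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

spend = [12, 15, 14, 90, 95, 88]  # made-up monthly spend figures, no labels
centers, labels = kmeans_1d(spend)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No label was supplied, yet the low spenders and high spenders end up in separate groups.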
types of data
→ numerical
→ categorical (nominal, ordinal)
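The difference between these types shows up when data is prepared for analysis. A minimal sketch, using a made-up record and hypothetical helper names: numerical values are used as-is, ordinal categories map to their rank, and nominal categories (which have no order) get one indicator per category.

```python
# Hypothetical category lists, used only for illustration.
SATISFACTION_ORDER = ["low", "medium", "high"]  # ordinal: order matters
CITIES = ["Austin", "Boston", "Chicago"]        # nominal: no natural order

def encode_ordinal(value, order):
    """Map an ordinal category to its rank in `order`."""
    return order.index(value)

def encode_nominal(value, categories):
    """One-hot encode a nominal category: 1 for the match, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

record = {"age": 34, "city": "Boston", "satisfaction": "high"}
features = (
    [record["age"]]                                            # numerical
    + [encode_ordinal(record["satisfaction"], SATISFACTION_ORDER)]
    + encode_nominal(record["city"], CITIES)
)
print(features)  # [34, 2, 0, 1, 0]
```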
descriptive data analytics
hindsight, low value and low difficulty
→ what happened?
diagnostic data analytics
insight, understanding, mid-value and difficulty
→ why did it happen?
predictive data analytics
insight/foresight, forecasting, mid-value and difficulty
→ what will happen?
prescriptive data analytics
foresight, high value and high difficulty
→ how can we make it happen?
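The four questions can be walked through on one toy dataset. This is a rough sketch with made-up sales figures; the "forecast" is deliberately naive (it reuses a group average) and only illustrates how each type of analytics builds on the previous one.

```python
from statistics import mean

# Made-up weekly sales and whether a promotion ran that week.
sales = [100, 120, 90, 130, 95, 140]
promo = [False, True, False, True, False, True]

# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? Compare weeks with and without a promotion.
promo_avg = mean(s for s, p in zip(sales, promo) if p)
no_promo_avg = mean(s for s, p in zip(sales, promo) if not p)

# Predictive: what will happen? Naive forecast from the matching group average.
def forecast(run_promo: bool) -> float:
    return promo_avg if run_promo else no_promo_avg

# Prescriptive: how can we make it happen? Pick the action with the best forecast.
best_action = max([True, False], key=forecast)

print(average_sales)            # 112.5
print(promo_avg, no_promo_avg)
print(best_action)              # True
```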
two categories of supervised learning
classification
regression
classification (supervised learning)
learns a method for predicting the instance class from pre-labeled instances
regression (supervised learning)
an attempt to predict a continuous attribute
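The two categories differ in what they predict: a class label versus a continuous number. A minimal sketch under stated assumptions: classification is shown with a 1-nearest-neighbor rule and regression with a least-squares line, both chosen here only because they are short; the function names and data are made up.

```python
def nearest_neighbor_class(train, x):
    """Classification: return the label of the closest pre-labeled instance."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Pre-labeled instances: (attribute value, class).
labeled = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
print(nearest_neighbor_class(labeled, 7.2))  # large

# Continuous target generated from y = 2x + 1.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```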
pre-attentive attributes
visual characteristics of objects or elements that the human brain can quickly and automatically perceive without conscious effort or focused attention
most used pre-attentive attributes
→ size
→ shape
→ color (hue)
primary colors
red, yellow, blue
secondary colors
purple, green, and orange
tertiary colors
red-orange, yellow-orange, yellow-green, blue-green, blue-violet, and red-violet
color schemes
→ categorical
→ sequential
→ diverging
categorical (color schemes)
contrasting colors for individual comparison
sequential (color schemes)
color is ordered from low/light to high/dark
diverging (color schemes)
two sequential color ramps that meet at a neutral midpoint (use an odd number of classes so that one class sits at the midpoint)
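Sequential and diverging palettes can be built by stepping evenly between colors. A minimal sketch, assuming simple linear interpolation in RGB (real palette tools often interpolate in a perceptual color space instead); the helper names and hex colors are made up for illustration.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of ints to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)

def sequential(light, dark, n):
    """Return n colors stepping evenly from `light` to `dark` (n >= 2)."""
    a, b = hex_to_rgb(light), hex_to_rgb(dark)
    return [
        rgb_to_hex(tuple(round(x + (y - x) * i / (n - 1)) for x, y in zip(a, b)))
        for i in range(n)
    ]

def diverging(low, neutral, high, n):
    """Join two sequential ramps at a neutral midpoint; n must be odd."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd number >= 3")
    half = n // 2 + 1
    left = sequential(low, neutral, half)
    right = sequential(neutral, high, half)
    return left + right[1:]  # drop the duplicated midpoint

palette = diverging("#0000ff", "#ffffff", "#ff0000", 5)
print(palette[2])  # #ffffff
```

The odd count is what guarantees that exactly one color lands on the neutral midpoint.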
what does CVD stand for?
color vision deficiency (color blindness)
→ most prevalent form is red/green CVD
three types of color sensitive cones:
short - respond to short wavelengths; most sensitive to blue colors
medium - respond to medium wavelengths; most sensitive to green colors
long - respond to long wavelengths; most sensitive to red colors