CPSC 4300 Final

214 Terms

New cards

Data mining

extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from a huge amount of data

New cards

Alternative names for data mining

knowledge discovery in databases (KDD), knowledge extraction, data analysis, etc.

New cards

Data science

the analysis of data using quantitative and qualitative techniques to be able to explore trends and patterns in data

New cards

Data science turns raw data into __________ ___________ that can be used for _______ ________

meaningful information, decision making

New cards

What are the steps of the data science process?

Ask an interesting question
Get the data
Explore the data
Model the data
Communicate/visualize the results

New cards

What types of things are asked for a question to be considered interesting?

What is the scientific goal?

What would you do if you had all of the data?

What do you want to predict/estimate?

New cards

What questions are asked when getting the data?

How was the data sampled?

Which data is relevant?

Are there privacy issues?

New cards

What questions are asked when exploring the data?

How can the data be plotted?

Are there anomalies?

Are there patterns?

New cards

What questions are asked when modeling the data?

How can a model be built?

How can a model be fitted?

How can a model be validated?

New cards

What questions are asked when communicating/visualizing the results?

What was learned?

Do the results make sense?

Will storytelling be effective?

New cards

Data

observations, facts, or measurements collected about the world

New cards

Where does data come from?

internal sources (already collected organizational data), external sources (data available for free or a fee), and external sources requiring collection efforts (data from external sources that require special processing)

New cards

What are the ways to gather online data?

API (application programming interface), RSS (rich site summary), or web scraping

New cards

What is an API?

a prebuilt set of functions developed by a company to access their services, often not free

New cards

What is RSS?

a summary of frequently updated online content in standard format for free

New cards

What is web scraping?

extracting data, by hand or with software or scripts, from what is displayed on a page or contained in its HTML files

New cards

What should be considered when web scraping?

Is it violating terms of service?

Are there privacy concerns?

Is there an API or fee that is being bypassed?

Is the company willing to share the data?

New cards

What is the most popular data type?

tabular (rows and columns of data)

New cards

What are features?

data fields representing characteristics or features of data (each column is a feature)

New cards

Nominal feature

categories, states, or names of things (ex: hair color)

New cards

Binary feature

a nominal attribute with only 2 states (0 and 1)

New cards

Symmetric binary attribute

both outcomes are equally important (ex: left vs. right handed)

New cards

Asymmetric binary attribute

outcomes are not equally important (ex: positive vs. negative medical test)

New cards

Ordinal feature

values have a meaningful order but magnitude in between values is unknown (ex: grades)

New cards

Quantity interval attribute

measured on a scale of equal sized units where values have order (ex: calendar dates); no true 0 point

New cards

Quantity ratio attribute

has an inherent 0 point, so values can be described as multiples of the unit of measurement (ex: temperature in K, where 10 K is twice as high as 5 K)

New cards

Is student ID nominal, ordinal, or interval?

nominal

New cards

Is eye color nominal, ordinal, or interval?

nominal

New cards

Is color in the color spectrum nominal, ordinal, or interval?

interval

New cards

Discrete attribute

has a finite or countably infinite set of values (ex: zip codes)

New cards

Continuous attribute

has real numbers as attribute values (ex: height)

New cards

Binary attributes are a special case of ______ attributes

discrete

New cards

Continuous attributes are usually represented as ________ ______ variables

floating point

New cards

What does a relational records table look like?

New cards

What does transaction data look like?

New cards

What is text data?

texts in various domains and languages

New cards

What is network/graph data?

information networks (ex: transportation and social networks)

New cards

What are some examples of sequential data?

video, genetic sequences, time-series data

New cards

What are some examples of spatial/image data?

maps, images

New cards

What are the 4 major tasks in data preprocessing?

cleaning, integration, reduction, and transformation and discretization (the last two count as one task)

New cards

What does data cleaning do?

handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies

New cards

What does data integration do?

integrate multiple databases, data cubes, or files

New cards

What does data reduction do?

reduce dimensionality and numerosity, and compress data

New cards

What do data transformation and discretization do?

normalize data and generate concept hierarchy

New cards

What are the most common issues with data?

messy format, missing values, wrong values, and unusable data

New cards

What is the best way to fix the messy data in this table? (number of produce deliveries over a weekend)

make each column represent a variable rather than a single value (ID, time, day, number), and fill in the data from there

New cards

Why might data be incomplete?

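equipment malfunctioned, data was inconsistent with other records and therefore deleted, data was not entered because of a misunderstanding, data was not considered important at the time of entry, or changes to the data were not saved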

New cards

What are the methods to handle missing data?

ignoring the tuple (usually done when the class label is missing); filling in the missing value manually (tedious); or filling it in automatically with a global constant, the attribute mean, the attribute mean for all samples in the same class, or the most probable value
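
A minimal sketch of two of the automatic fill-in methods above (global mean and same-class mean). The `records` data and both function names are made up for illustration; `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [("low", 20.0), ("low", None), ("low", 30.0),
           ("high", 80.0), ("high", 100.0), ("high", None)]

def impute_global_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    observed = [v for _, v in rows if v is not None]
    fill = mean(observed)
    return [(c, fill if v is None else v) for c, v in rows]

def impute_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, fills[c] if v is None else v) for c, v in rows]

print(impute_global_mean(records))
print(impute_class_mean(records))
```

The class-based version keeps the "low" and "high" groups apart, so a missing "low" income is not pulled upward by the "high" group.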

New cards

When is ignoring the tuple not effective?

when the % of missing values per attribute varies considerably

New cards

When can conditional imputation be used?

if certain variables correlate with others

New cards

What is the best method for imputing data?

using predictive modeling

New cards

What is hot deck imputation?

randomly selecting a value from a record that matches with other variables
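
One possible sketch of hot deck imputation, assuming records are dictionaries and that "matches" means equal values on a chosen set of other variables. The function name, the `dept`/`salary` fields, and the choice to leave a value missing when no donor exists are all assumptions for illustration.

```python
import random

def hot_deck_impute(rows, target, match_on, rng):
    """Fill missing `target` values with a random donor value taken from
    records that agree on every field listed in `match_on`."""
    donors = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[m] for m in match_on)
            donors.setdefault(key, []).append(row[target])

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            key = tuple(row[m] for m in match_on)
            if key in donors:  # left missing when no matching donor exists
                new_row[target] = rng.choice(donors[key])
        filled.append(new_row)
    return filled

rows = [
    {"dept": "sales", "salary": 40},
    {"dept": "sales", "salary": 45},
    {"dept": "sales", "salary": None},
    {"dept": "legal", "salary": None},
]
result = hot_deck_impute(rows, "salary", ["dept"], random.Random(0))
print(result)
```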

New cards

What is advanced text imputation?

using text mining/machine learning models that can predict the diagnosis based on similar records or related variables

New cards

What is noise?

random error or variance in a measured variable

New cards

Why might there be incorrect attribute values/noisy data?

faulty data collection instruments, data entry or transmission problems, technology limitations, or inconsistency in naming convention

New cards

How can noisy data be handled?

binning, regression, clustering, or semi-supervised

New cards

What is binning?

sorting data into equal frequency bins, then smoothing by each bin’s mean, median, or boundaries
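
A small sketch of smoothing by bin means and by bin boundaries, assuming the number of values divides evenly into the bins. The price list and function names are illustrative.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and cut it into bins holding the same number of values.
    Assumes len(values) is divisible by n_bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```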

New cards

What is regression?

smoothing by fitting the data into regression functions

New cards

What is clustering?

detecting and removing outliers

New cards

What is semi-supervised?

combined computer and human inspection of noisy data

New cards

What is data integration?

combining data from multiple sources

New cards

What is schema integration?

integrating metadata from different sources

New cards

What is entity identification?

identifying real world entities from multiple sources that often needs machine learning (ex: same person, different names/nicknames)

New cards

What are the possible reasons for data value conflicts?

different representations or scales

New cards

Redundant data often occurs when ________ multiple databases

integrating

New cards

What is object identification?

identifying if the same object has different names in different databases

New cards

What is derivable data?

attributes that can be derived from an attribute in another table

New cards

Redundant attributes may be detected by _________ analysis and ________ analysis

correlation, covariance

New cards

What does integrating data carefully from multiple sources help do?

reduce/avoid redundancies and improve mining speed/quality

New cards

What does the chi-square (χ²) test do?

discovers the correlation relationship between 2 nominal attributes (A and B)

New cards

In the chi-square test, what does the null hypothesis say?

the 2 variables are independent

New cards

The cells that contribute the most to the chi-square value are those whose actual count is ________ from the expected count

different
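
A sketch of how the chi-square statistic is computed from a contingency table: each cell's expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The counts below are illustrative only.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (o - expected) ** 2 / expected
    return stat

# Illustrative 2x2 table for two nominal attributes A and B.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))

# A table whose rows are proportional matches its expected counts exactly.
independent = [[10, 20],
               [20, 40]]
print(chi_square(independent))
```

In the first table every observed count is 160 away from its expected count, so the statistic is large; in the second table observed and expected counts coincide and the statistic is 0.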

New cards

The larger the chi-square value, the more likely that variables are ________

related

New cards

Correlation does not imply _______

causality

New cards

What is the correlation coefficient value range?

[-1, 1]
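
A minimal sketch of the Pearson correlation coefficient, assuming that is the coefficient meant here. It shows the two extremes of the range; the function name and data are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is 0)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4]
print(pearson(xs, [2, 4, 6, 8]))   # perfectly increasing: close to 1
print(pearson(xs, [8, 6, 4, 2]))   # perfectly decreasing: close to -1
```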

New cards

What does this graph show?

scatter plots whose correlation coefficients change from -1 to 1

New cards

After data reduction, the data set is much ______ in volume, yet produces almost the _____ analytical results

smaller, same

New cards

Why should data reduction occur?

a database may store massive amounts of data, and complex analysis may take a very long time on the complete data set

New cards

What are the methods for data reduction?

regression/log-linear models, histograms/clustering/sampling, data cube aggregation, and data compression

New cards

Simple random sampling

equal probability of selecting any particular item

New cards

Sampling without replacement

once an object is selected, it is removed from the population

New cards

Sampling with replacement

a selected object is not removed from the population

New cards

Stratified sampling

partition (or cluster) the data set and draw samples from each partition, usually in proportion to its size
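
A sketch of the sampling schemes above using Python's `random` module. The function names, the `fraction` parameter, and the rule of taking at least one item per group are assumptions for illustration.

```python
import random

def sample_without_replacement(population, k, rng):
    """Each selected item leaves the pool, so no item repeats."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """Each selected item stays in the pool, so items can repeat."""
    return [rng.choice(population) for _ in range(k)]

def stratified_sample(rows, key, fraction, rng):
    """Group rows by key(row), then sample the same fraction from each group
    (at least one row per group)."""
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    picked = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        picked.extend(rng.sample(group, k))
    return picked

rng = random.Random(0)
rows = [("a", i) for i in range(8)] + [("b", i) for i in range(2)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.5, rng=rng)
print(sample)
```

With simple random sampling the small "b" group could be missed entirely; sampling per group guarantees it is represented.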

New cards

What is data transformation?

a function that maps the entire set of values of a given attribute to a new set of replacement values (in other words each old value can be identified with one of the new values)

New cards

What are the methods for data transformation?

smoothing, attribute construction, aggregation, normalization, and discretization

New cards

What is the (min-max) normalization formula?

(given number - min) / (max - min) * (new max - new min) + new min
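
The formula above written as a small function. The parameter names and the income figures are illustrative; the default target range of [0, 1] is an assumption.

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale value from [old_min, old_max] to [new_min, new_max].
    Assumes old_max > old_min."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Illustrative: an income of 73,600 when incomes range from 12,000 to 98,000.
print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(min_max(5, 0, 10, new_min=-1, new_max=1))    # 0.0
```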

New cards

What is the z-score formula?

(number given - mean) / std dev
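
The z-score formula as code. The second helper assumes the population standard deviation (`statistics.pstdev`); the card does not say whether the population or sample version is intended, and the numbers are illustrative.

```python
from statistics import mean, pstdev

def z_score(value, mu, sigma):
    """Number of standard deviations that value lies from the mean."""
    return (value - mu) / sigma

def z_scores(values):
    """z-score of every value, using the list's own mean and population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [z_score(v, mu, sigma) for v in values]

# Illustrative: income 73,600 with mean 54,000 and std dev 16,000.
print(z_score(73_600, 54_000, 16_000))
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```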

New cards

What are the 3 types of attributes in data discretization?

nominal (values from unordered set), ordinal (values from ordered set), and numeric (real numbers)

New cards

Discretization divides the range of a continuous attribute into _______

intervals

New cards

What are the data discretization methods?

binning, histogram analysis, clustering analysis, decision tree analysis, correlation (chi-square) analysis

New cards

What is equal width binning?

divides data into intervals of equal size; not helpful with skewed data

New cards

What is equal depth binning?

divides data into intervals with approximately the same number of samples; not helpful with categorical attributes
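
A sketch contrasting the two binning schemes on deliberately skewed data. Function names and data are illustrative, and the equal width version assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), n_bins - 1)  # max value goes in the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted data into n_bins groups of roughly equal count."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(skewed, 3))   # [[1, 2, 3, 4, 5, 6, 7, 8], [], [100]]
print(equal_depth_bins(skewed, 3))   # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

The single large value stretches the equal width intervals so that almost everything lands in the first bin, which is why equal width binning handles skewed data poorly.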

New cards

Is this equal width or equal depth binning?

equal width

New cards

Is this equal width or equal depth binning?

equal depth

New cards

How does classification work?

it is supervised (class labels are given), entropy determines the split point, and the splitting is top-down and recursive
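
A rough sketch of how entropy can pick one split point for a numeric attribute: try the midpoint between each pair of adjacent distinct values and keep the one with the lowest weighted entropy. This shows a single split only; the recursive step would repeat it on each side. Names and data are illustrative.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))
```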

New cards

What is different about a trimmed mean?

extreme values at both ends are chopped off before the mean is computed

New cards

What is the median?

the middle value of the sorted data if the number of values is odd, or the average of the 2 middle values if it is even

New cards

The mean is sensitive to extreme _______

outliers

New cards

What is the mode?

value that occurs most frequently in data
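
A short sketch comparing the measures of central tendency above on data with one extreme value. `trimmed_mean` is a hypothetical helper that drops the given proportion of values from each end; the data is made up.

```python
from statistics import mean, median, mode

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)
    return mean(data[k:len(data) - k])

data = [30, 31, 32, 32, 35, 400]
print(mean(data))               # pulled far upward by 400
print(trimmed_mean(data, 0.2))  # 32.5 after dropping 30 and 400
print(median(data))             # average of the two middle values
print(mode(data))               # 32, the most frequent value
```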

New cards

What is the empirical formula in unimodal data?