CPSC 4300 Final

214 Terms

1

Data mining

extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from a huge amount of data

2

Alternative names for data mining

knowledge discovery in databases (KDD), knowledge extraction, data analysis, etc.

3

Data science

the analysis of data using quantitative and qualitative techniques to be able to explore trends and patterns in data

4

Data science turns raw data into __________ ___________ that can be used for _______ ________

meaningful information, decision making

5

What are the steps of the data science process?

  1. Ask an interesting question

  2. Get the data

  3. Explore the data

  4. Model the data

  5. Communicate/visualize the results

6

What types of things are asked for a question to be considered interesting?

What is the scientific goal?

What would you do if you had all of the data?

What do you want to predict/estimate?

7

What questions are asked when getting the data?

How was the data sampled?

Which data is relevant?

Are there privacy issues?

8

What questions are asked when exploring the data?

How can the data be plotted?

Are there anomalies?

Are there patterns?

9

What questions are asked when modeling the data?

How can a model be built?

How can a model be fitted?

How can a model be validated?

10

What questions are asked when communicating/visualizing the results?

What was learned?

Do the results make sense?

Will storytelling be effective?

11

Data

observations, facts, or measurements collected about the world

12

Where does data come from?

internal sources (already collected organizational data), external sources (data available for free or a fee), and external sources requiring collection efforts (data from external sources that require special processing)

13

What are the ways to gather online data?

API (application programming interface), RSS (rich site summary), or web scraping

14

What is an API?

a prebuilt set of functions developed by a company to access their services, often not free

15

What is RSS?

a summary of frequently updated online content in standard format for free

16

What is web scraping?

extracting data (using software, scripts, or by hand) from what is displayed on a page or contained in its HTML files

17

What should be considered when web scraping?

Is it violating terms of service?

Are there privacy concerns?

Is there an API or fee that is being bypassed?

Is the company willing to share the data?
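
As a rough sketch of the scripted form of web scraping, the snippet below pulls link targets out of an HTML string using Python's standard `html.parser` module. The `LinkExtractor` class name and the inline `page` string are illustrative assumptions; a real scraper would first fetch the page, and only after the considerations above have been checked.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Assumed sample HTML standing in for a downloaded page
page = '<html><body><a href="/a">A</a> <a href="/b">B</a><p>no link</p></body></html>'

parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

The same idea extends to other tags or attributes by adding cases to `handle_starttag`.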

18

What is the most popular data type?

tabular (rows and columns of data)

19

What are features?

data fields representing characteristics of a data object (each column is a feature)

20

Nominal feature

categories, states, or names of things (ex: hair color)

21

Binary feature

a nominal attribute with only 2 states (0 and 1)

22

Symmetric binary attribute

both outcomes are equally important (ex: left vs. right handed)

23

Asymmetric binary attribute

outcomes are not equally important (ex: positive vs. negative medical test)

24

Ordinal feature

values have a meaningful order but magnitude in between values is unknown (ex: grades)

25

Quantity interval attribute

measured on a scale of equal sized units where values have order (ex: calendar dates); no true 0 point

26

Quantity ratio attribute

has an inherent 0 point; a value can be expressed as a multiple (ratio) of another value (ex: temperature in K)

27

Is student ID nominal, ordinal, or interval?

nominal

28

Is eye color nominal, ordinal, or interval?

nominal

29

Is color in the color spectrum nominal, ordinal, or interval?

interval

30

Discrete attribute

has a finite or countably infinite set of values (ex: zip codes)

31

Continuous attribute

has real numbers as attribute values (ex: height)

32

Binary attributes are a special case of ______ attributes

discrete

33

Continuous attributes are usually represented as ________ ______ variables

floating point

34

What does a relational records table look like?

[image: example relational records table, not preserved]
35

What does transaction data look like?

[image: example of transaction data, not preserved]
36

What is text data?

texts in various domains and languages

37

What is network/graph data?

information networks (ex: transportation and social networks)

38

What are some examples of sequential data?

video, genetic sequences, time-series data

39

What are some examples of spatial/image data?

maps, images

40

What are the 4 major tasks in data preprocessing?

data cleaning, data integration, data reduction, and data transformation and discretization (the last two count as one task)

41

What does data cleaning do?

handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies

42

What does data integration do?

integrate multiple databases, data cubes, or files

43

What does data reduction do?

reduce dimensionality and numerosity, and compress data

44

What do data transformation and discretization do?

normalize data and generate concept hierarchy

45

What are the most common issues with data?

messy format, missing values, wrong values, and unusable data

46

What is the best way to fix the messy data in this table? (number of produce deliveries over a weekend)

make each column represent a variable rather than a single value (ID, time, day, number), and fill in the data from there

47

Why might data be incomplete?

equipment malfunctions, data that was inconsistent with other records and then deleted, data not entered due to a misunderstanding, data considered unimportant at entry time, or data that was not saved

48

What are the methods to handle missing data?

ignoring the tuple (done when class label is missing), filling in the missing value manually (tedious though), filling in automatically with a global constant, the mean, mean for all samples in the same class, or most probable value
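
To make two of the automatic fill-in strategies concrete, here is a minimal sketch of overall-mean imputation and class-conditional mean imputation. The function names and the `incomes`/`labels` data are made up for illustration, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, labels):
    """Replace None with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical attribute with two missing entries, plus a class label per record
incomes = [30, None, 50, 70, None, 90]
labels = ["low", "low", "low", "high", "high", "high"]

overall = impute_mean(incomes)                # fills both gaps with 60
per_class = impute_class_mean(incomes, labels)  # fills with 40 ("low") and 80 ("high")
```

The class-conditional version keeps the filled values closer to what similar records look like, which is why it is often preferred when class labels are available.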

49

When is ignoring the tuple not effective?

when the % of missing values per attribute varies considerably

50

When can conditional imputation be used?

if certain variables correlate with others

51

What is the best method for imputing data?

using predictive modeling

52

What is hot deck imputation?

randomly selecting a value from a record that matches on the other variables

53

What is advanced text imputation?

using text mining/machine learning models that can predict the diagnosis based on similar records or related variables

54

What is noise?

random error or variance in a measured variable

55

Why might there be incorrect attribute values/noisy data?

faulty data collection instruments, data entry or transmission problems, technology limitations, or inconsistency in naming convention

56

How can noisy data be handled?

binning, regression, clustering, or semi-supervised

57

What is binning?

sorting data into equal frequency bins, then smoothing by each bin’s mean, median, or boundaries
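
A small sketch of smoothing by binning, under the assumption that the number of values divides evenly into the bins. The function names and the `prices` list are illustrative; ties in boundary smoothing are sent to the lower boundary, which is a choice made here rather than a fixed rule.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and cut them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Hypothetical sorted price data
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # bin means are 9.0, 22.75, 29.25
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```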

58

What is regression?

smoothing by fitting the data to regression functions

59

What is clustering?

grouping similar values into clusters so that outliers (values outside any cluster) can be detected and removed

60

What is semi-supervised?

combined computer and human inspection of noisy data

61

What is data integration?

combining data from multiple sources

62

What is schema integration?

integrating metadata from different sources

63

What is entity identification?

identifying the same real-world entity across multiple sources, which often needs machine learning (ex: same person, different names/nicknames)

64

What are the possible reasons for data value conflicts?

different representations or scales

65

Redundant data often occurs when ________ multiple databases

integrating

66

What is object identification?

identifying if the same object has different names in different databases

67

What is derivable data?

attributes that can be derived from an attribute in another table

68

Redundant attributes may be detected by _________ analysis and ________ analysis

correlation, covariance

69

What does integrating data carefully from multiple sources help do?

reduce/avoid redundancies and improve mining speed/quality

70

What does the chi-square (χ²) test do?

discovers the correlation relationship between 2 nominal attributes (A and B)

71

In the chi-square test, what does the null hypothesis say?

the 2 variables are independent

72

The cells that contribute the most to the chi-square value are those whose actual count is ________ from the expected count

different

73

The larger the chi-square value, the more likely that variables are ________

related
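
The chi-square cards above can be sketched in code: the statistic sums (observed - expected)² / expected over every cell of a contingency table, where each expected count is (row total × column total) / grand total. The `chi_square` function name and both tables are illustrative assumptions.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 table: rows and columns are the states of two nominal attributes
related = [[250, 200],
           [50, 1000]]
# Hypothetical table whose rows are exactly proportional (independent attributes)
independent = [[10, 20],
               [20, 40]]

print(round(chi_square(related), 2))  # 507.94
print(chi_square(independent))        # 0.0
```

Cells whose observed count is far from the expected count dominate the sum, and a large total is evidence against the null hypothesis of independence.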

74

Correlation does not imply _______

causality

75

What is the correlation coefficient value range?

[-1, 1]
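
For reference, a minimal sketch of the Pearson correlation coefficient, which is assumed here to be the coefficient the card refers to. The `pearson_r` name and the sample lists are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


xs = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]     # perfectly increasing with xs
down = [10, 8, 6, 4, 2]   # perfectly decreasing with xs

r_up = pearson_r(xs, up)      # close to 1
r_down = pearson_r(xs, down)  # close to -1
```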

76

What does this graph show?

scatter plots whose correlation coefficients change from -1 to 1

77

After data reduction, the data set is much ______ in volume, yet produces almost the _____ analytical results

smaller, same

78

Why should data reduction occur?

a database may store massive amounts of data, and complex analysis may take a very long time on the complete data set

79

What are the methods for data reduction?

regression/log-linear models, histograms/clustering/sampling, data cube aggregation, and data compression

80

Simple random sampling

equal probability of selecting any particular item

81

Sampling without replacement

once an object is selected, it is removed from the population

82

Sampling with replacement

a selected object is not removed from the population

83

Stratified sampling

partition (or cluster) the data set into strata and draw samples from each partition
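
The sampling schemes above can be sketched with Python's standard `random` module. The `stratified_sample` helper, the seed, and the `records` data are illustrative assumptions; drawing the same fraction from every stratum is one common choice, not the only one.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(10))

# Simple random sampling without replacement: no item can appear twice
without_replacement = rng.sample(population, 4)

# Simple random sampling with replacement: an item may be drawn again
with_replacement = rng.choices(population, k=4)


def stratified_sample(records, key, fraction, rng):
    """Group records by key(record), then sample the same fraction per group."""
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical records: 8 "young" customers and 2 "senior" customers
records = [("young", i) for i in range(8)] + [("senior", i) for i in range(2)]
stratified = stratified_sample(records, key=lambda r: r[0], fraction=0.5, rng=rng)
```

With skewed groups like these, stratifying guarantees that the small "senior" group is represented, which plain random sampling does not.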

84

What is data transformation?

a function that maps the entire set of values of a given attribute to a new set of replacement values (in other words each old value can be identified with one of the new values)

85

What are the methods for data transformation?

smoothing, attribute construction, aggregation, normalization, and discretization

86

What is the normalization formula?

(given number - min) / (max - min) * (new max - new min) + new min
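
A direct translation of the min-max formula into code, as a sketch. The `min_max` name, the default target range of [0, 1], and the income figures are illustrative assumptions.

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale value from [old_min, old_max] to [new_min, new_max]."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Hypothetical income attribute ranging from 12,000 to 98,000
scaled = min_max(73_600, 12_000, 98_000)
print(round(scaled, 3))  # 0.716
```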

87

What is the z-score formula?

(number given - mean) / std dev
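
The z-score formula as a short sketch. It uses the population standard deviation (`statistics.pstdev`); whether the course intends the population or the sample standard deviation is an assumption here, and the `z_scores` name and data are illustrative.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std dev, not sample
    return [(v - mu) / sigma for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std dev 2
standardized = z_scores(data)
```

A value's z-score says how many standard deviations it lies above or below the mean, so 9 in this data maps to 2 and 2 maps to -1.5.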

88

What are the 3 types of attributes in data discretization?

nominal (values from unordered set), ordinal (values from ordered set), and numeric (real numbers)

89

Discretization divides the range of a continuous attribute into _______

intervals

90

What are the data discretization methods?

binning, histogram analysis, clustering analysis, decision tree analysis, correlation (chi-square) analysis

91

What is equal width binning?

divides data into intervals of equal size; not helpful with skewed data

92

What is equal depth binning?

divides data into intervals with approximately the same number of samples; not helpful with categorical attributes
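
To contrast the two binning schemes, here is a minimal sketch run on deliberately skewed data. The function names and the `skewed` list are illustrative assumptions.

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would land in bin n_bins, so clamp it to the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins with about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # one extreme value

print(equal_width_bins(skewed, 3))  # [[1, 2, 3, 4, 5, 6, 7, 8], [], [100]]
print(equal_depth_bins(skewed, 3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

The single extreme value stretches the equal-width intervals so that almost everything falls into the first bin, which illustrates why equal width handles skewed data poorly.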

93

Is this equal width or equal depth binning?

equal width

94

Is this equal width or equal depth binning?

equal depth

95

How does classification work?

it is supervised (class labels are given), entropy determines each split point, and the splits are applied top-down and recursively
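
As a sketch of how entropy can pick a split point, the code below tries every midpoint between adjacent distinct values and keeps the one with the lowest weighted entropy of the two sides. It shows a single split only; a recursive, top-down procedure would repeat it on each side. The function names and the age data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) for the best single binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_entropy, best_threshold = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if weighted < best_entropy:
            best_entropy = weighted
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_entropy


# Hypothetical continuous attribute with a class label per record
ages = [20, 25, 30, 35, 40, 45]
buys = ["no", "no", "no", "yes", "yes", "yes"]

threshold, split_entropy = best_split(ages, buys)
print(threshold)  # 32.5
```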

96

What is different about a trimmed mean?

extreme values are chopped

97

What is the median?

the middle value if the number of values is odd; the average of the 2 middle values if it is even

98

The mean is sensitive to extreme _______

outliers

99

What is the mode?

value that occurs most frequently in data
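
The measures of central tendency from the last few cards can be compared on one data set using Python's standard `statistics` module. The `trimmed_mean` helper and the `salaries` list are illustrative assumptions; `multimode` is used because the data has two equally frequent values.

```python
from statistics import mean, median, multimode


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical salaries (in thousands) with one extreme value
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # 58
print(trimmed_mean(salaries, 0.1))  # 55.6 (drops 30 and 110)
print(median(salaries))             # 54.0
print(multimode(salaries))          # [52, 70]
```

Dropping the extremes pulls the mean from 58 down to 55.6, closer to the median, which shows how sensitive the plain mean is to the single large salary.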

100

What is the empirical formula in unimodal data?

mean - mode ≈ 3 × (mean - median)