the practice of basing decisions on the analysis of data, rather than purely intuition
2
New cards
data science
involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data in order to improve decision making
3
New cards
(1) decisions for which discoveries need to be made within data and (2) decisions that repeat, especially at massive scale (so decision making can benefit from even small increases in decision-making accuracy based on data analysis)
two types of decisions:
4
New cards
CRISP-DM
codification of the data mining process
5
New cards
data mining
the extraction of knowledge from data, via technologies that incorporate these principles
6
New cards
information technology
used to find informative descriptive attributes of entities of interest from a large mass of data
7
New cards
context
formulating data mining solutions and evaluating the results involves thinking carefully about the ____ in which they will be used
8
New cards
business understanding, data understanding, data preparation, modeling, evaluation, deployment
CRISP-DM steps: (6)
9
New cards
classification and class probability estimation
attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to
10
New cards
scoring
model applied to an individual that produces a score (instead of a class prediction) representing the probability (or some other quantification of likelihood) that the individual belongs to each class
11
New cards
classification
scoring and ____ are closely related
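One way to see how scoring and classification relate: a score can be turned into a class prediction by comparing it with a cutoff. This is only a minimal sketch; the function name `classify_from_score`, the 0.5 threshold, and the example scores are illustrative assumptions, not part of the cards.

```python
def classify_from_score(score, threshold=0.5):
    """Turn a probability-like score into a class label.

    The threshold is a hypothetical choice; in practice it would depend on
    the costs and benefits of each kind of error.
    """
    return "positive" if score >= threshold else "negative"


# Hypothetical scores produced by some scoring model.
scores = {"customer_a": 0.82, "customer_b": 0.31}
labels = {name: classify_from_score(s) for name, s in scores.items()}
```

Keeping the score and applying the threshold later lets the same model serve decisions with different cutoffs.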
12
New cards
regression (value estimation)
attempts to estimate or predict, for each individual, the numerical value of some variable for that individual
13
New cards
regression
type of model that could be generated by looking at other, similar individuals in the population
14
New cards
regression procedure
produces a model that, given an individual, estimates the value of the particular variable specific to that individual
15
New cards
whether; how much
classification predicts _____ something will happen, whereas regression predicts ____ something will happen
16
New cards
similarity matching
identify similar individuals based on data known about them; can be used directly to find similar entities
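A rough sketch of similarity matching, assuming each individual is described by a few numeric attributes and that Euclidean distance is an acceptable notion of similarity (other measures are possible). The customer names and attribute values are made up for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=1):
    """Return the names of the k candidates closest to the target."""
    ranked = sorted(candidates, key=lambda name: euclidean(target, candidates[name]))
    return ranked[:k]


# Hypothetical customers described by (age, monthly spend).
customers = {
    "ann": (34, 52.0),
    "bob": (61, 18.0),
    "cho": (36, 49.0),
}
nearest = most_similar((35, 50.0), customers, k=2)
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.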
17
New cards
similarity matching
the basis for one of the most popular methods for making product recommendations
18
New cards
clustering
attempts to group individuals in a population together by their similarity, but not driven by any specific purpose
19
New cards
clustering
useful in preliminary domain exploration to see which natural groups exist because these groups in turn may suggest other data mining tasks or approaches
20
New cards
questions
clustering is used as an input to decision-making processes focusing on ___
21
New cards
co-occurrence grouping
useful to find associations between entities based on transactions involving them
aka: frequent itemset mining, association rule discovery, and market-basket analysis
22
New cards
market-basket analysis
co-occurrence of products in purchases is a common type of grouping known as ____
23
New cards
co-occurrence grouping
considers the similarity of objects based on their appearing together in transactions
24
New cards
associations
co-occurrence is useful in finding _____ between entities based on transactions involving them
25
New cards
frequency of the co-occurrence and an estimate of how surprising it is
the result of co-occurrence is a description of items that occur together and the description includes statistics on _____ and _____
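The two statistics can be sketched as follows, assuming "frequency" is measured as support (the fraction of transactions containing both items) and "surprise" as lift (support divided by what independence would predict). Those two measures are a common choice but an assumption here; the baskets are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(transactions):
    """Return {(item_a, item_b): (support, lift)} for every co-occurring pair."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        # Lift > 1 means the pair appears together more often than chance.
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = (support, lift)
    return stats


# Hypothetical market baskets.
baskets = [
    ["beer", "chips"],
    ["beer", "chips", "salsa"],
    ["milk", "bread"],
    ["beer", "bread"],
]
stats = cooccurrence_stats(baskets)
```

Pairs are keyed in alphabetical order, so `stats[("beer", "chips")]` holds the support and lift for that pair.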
26
New cards
profiling (behavior description)
attempts to characterize the typical behavior of an individual, group, or population
27
New cards
profiling
often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems
use the degree of mismatch as a suspicion score and issue an alarm if it is too high
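A minimal sketch of the suspicion-score idea, assuming the behavioral profile is just the mean and standard deviation of past observations and the mismatch is measured in standard deviations. The threshold of 3.0 and the spending figures are hypothetical.

```python
from statistics import mean, stdev


def suspicion_score(history, observation):
    """How many standard deviations the observation lies from the norm."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical values
    if sigma == 0:
        return 0.0 if observation == mu else float("inf")
    return abs(observation - mu) / sigma


def should_alarm(history, observation, threshold=3.0):
    """Issue an alarm when the mismatch exceeds the (assumed) threshold."""
    return suspicion_score(history, observation) > threshold


# Hypothetical past transaction amounts for one cardholder.
typical_spend = [20.0, 25.0, 22.0, 19.0, 24.0]
```

With this profile, `should_alarm(typical_spend, 23.0)` stays quiet while a 500.0 charge triggers the alarm.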
28
New cards
link prediction
attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link
29
New cards
link prediction
common in social networking systems
30
New cards
data reduction
attempts to take a large set of data and replace it with a small set of data that contains much of the important information in the larger set
31
New cards
data reduction
usually involves loss of information
32
New cards
causal modeling
attempts to help us understand what events or actions actually influence others
33
New cards
counterfactual analysis
both experimental and observational methods for causal modeling generally can be viewed as ___ ___ - they attempt to understand what would be the difference between the situations - which cannot both happen - where the "treatment" event were to happen and were not to happen
34
New cards
unsupervised
no target information
35
New cards
supervised
specific target can be provided
technique is given a specific purpose for the grouping - predicting the target
36
New cards
label
the value of the target variable for an individual
37
New cards
clustering, co-occurrence grouping, and profiling
unsupervised data mining tasks (3):
38
New cards
classification, regression, causal modeling
supervised data mining tasks (3):
39
New cards
similarity matching, link prediction, data reduction
data mining tasks that can be either supervised or unsupervised (3):
40
New cards
type of target
two main sub-classes (classification and regression) of SUPERVISED data mining are distinguished by the _____
41
New cards
classification and regression
two main sub-classes of supervised data mining:
42
New cards
binary
classification has a categorical (often ___) target (the customer either purchases or does not)
43
New cards
numeric
regression has a ___ target
44
New cards
numerical prediction
for business applications we often want a ___ ___ over a categorical target
45
New cards
specific quantity
a supervised target variable must be a ___ ___ that will be the focus of the data mining (and for which we can obtain values for some example data)
46
New cards
class probability estimation
model that predicts the probability that something will happen but the underlying target is categorical
47
New cards
leak
situation where a variable collected in historical data gives information on the target variable - information that appears in historical data but is not actually available when the decision has to be made
48
New cards
data preparation
phase of CRISP-DM in which you should beware of leaks
49
New cards
modeling
primary place where data mining techniques are applied to the data
50
New cards
evaluation
purpose of this phase is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on
this phase also serves to help ensure that the model satisfies the original business goals
51
New cards
analytics
the extensive use of data and advanced quantitative analysis to drive fact-based decision-making and actions
52
New cards
big data
a collection of data sets that are so large or complex that it is impossible to analyze them with traditional databases and tools
53
New cards
unstructured
the variety of big data is ___ (text, voice, video)
54
New cards
Gartner
who created the hype cycle of emerging technologies?
55
New cards
sector specialties
driven by deep industry expertise and tailored solutions
56
New cards
issue driven
start with the business issues and decisions that matter
57
New cards
technology enabled
leverage best-of-breed tools and analytics "assets"
58
New cards
central coordination
network individuals from around the world to field the right team
59
New cards
value approach
bring a commercial sophistication oriented toward creating value
60
New cards
alignment gaps
companies need to avoid ___ ___ between technical capabilities and behavior
61
New cards
technical capability
analytics production:
- data quality
- infrastructure and tools
- data science
62
New cards
behavioral alignment
analytics consumption:
- culture and mental models
- organization & process design
- learning and development
- incentives/rewards
63
New cards
behavioral alignment gap
high technical capability and low behavioral alignment
64
New cards
technical capability gap
high behavioral alignment and low technical capability
65
New cards
data science engineers
software engineers who have particular expertise both in production systems and in data science
66
New cards
exploration
the CRISP-DM cycle is based around ____
67
New cards
business understanding
phase of CRISP-DM: understand the problem to be solved, design a solution, create sub-problems
68
New cards
data understanding
phase of CRISP-DM: understand strength and limitations of data, estimate cost/benefit of data source
69
New cards
data preparation
phase of CRISP-DM: manipulate data for better results, remove missing values, convert data types
70
New cards
modeling
phase of CRISP-DM: apply data mining tasks/techniques
71
New cards
evaluation
phase of CRISP-DM: assess results, validity, reliability, business goal
72
New cards
deployment
phase of CRISP-DM: use of models, automatic model building
73
New cards
query
a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system
74
New cards
data warehouse
collects and combines data from across an enterprise, often from multiple transaction-processing systems, each with its own database
75
New cards
-1
MATCH type that requires numbers in lookup range be in descending order
76
New cards
0
MATCH type that returns the row location of the first exact match found
77
New cards
1
MATCH type that requires numbers in lookup range be in ascending order
78
New cards
1; 0
default match type = ___ however, most MATCH function applications use match type = ___
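The three match types can be mimicked in Python to make their ordering requirements concrete. This is an illustrative emulation, not Excel's implementation: the function `match` is hypothetical, it returns `None` where Excel would return `#N/A`, and it assumes the range is already sorted as each match type requires.

```python
from bisect import bisect_right


def match(lookup_value, lookup_range, match_type=1):
    """Return a 1-based position in the spirit of Excel's MATCH."""
    if match_type == 0:
        # Exact match: position of the first equal value, any ordering.
        for position, value in enumerate(lookup_range, start=1):
            if value == lookup_value:
                return position
        return None
    if match_type == 1:
        # Ascending range: largest value less than or equal to the lookup.
        position = bisect_right(lookup_range, lookup_value)
        return position if position > 0 else None
    if match_type == -1:
        # Descending range: smallest value greater than or equal to the lookup.
        position = 0
        for index, value in enumerate(lookup_range, start=1):
            if value >= lookup_value:
                position = index
            else:
                break
        return position if position > 0 else None
    raise ValueError("match_type must be -1, 0, or 1")


ascending = [10, 20, 30, 40]
descending = [40, 30, 20, 10]
```

For example, looking up 25 gives position 2 in both sorted ranges: the value 20 under match type 1 and the value 30 under match type -1.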
79
New cards
solver
Excel tool that optimizes values based on your objectives: target cell, changing cells, constraints
model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable
83
New cards
supervised segmentation
segment the population into subgroups that have different values for the target variable
84
New cards
supervised data mining
Model describes a relationship between a set of selected variables (attributes or features) and a predefined variable (target variable) and estimates the value of the target variable as a function of the attributes
85
New cards
information
a quantity that reduces uncertainty about something
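One common way to quantify "reduces uncertainty" is entropy and the information gain of a split. The sketch below assumes that measure (the card itself does not name one) and uses invented labels; both functions expect non-empty label lists.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its segments."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical target values before and after segmenting on some attribute.
parent = ["yes", "yes", "no", "no"]
segments = [["yes", "yes"], ["no", "no"]]
gain = information_gain(parent, segments)
```

A split that leaves each segment with a single class removes all the uncertainty in the parent, while a split that leaves the class mix unchanged yields a gain of zero.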
86
New cards
predictive model
a formula for estimating the unknown value of interest: the target
formula could be mathematical or a logical statement like a rule
87
New cards
instance (or example)
represents a fact or a data point
88
New cards
attributes (or features)
fields, columns, variables, or features
89
New cards
feature vector
an instance is sometimes called this because it can be represented as a fixed-length ordered collection (vector) of feature values
90
New cards
row; case
an instance is also referred to as a ___ for a database table or sometimes a ___ in statistics