Vocabulary flashcards covering central concepts, methods and ethical issues from the lecture on crafting data-mining problem statements and the seven-step analytical workflow.
Practical Motivation and Problem Identification
Stage where one verifies that the challenge is data-related and solvable with available or collectable data (e.g., reducing customer churn).
Data Collection
Process of gathering the relevant, representative, valid and reliable data required to address the formulated problem.
Relevance (in data collection)
The alignment of the collected data’s purpose, scope and specificity with the analytical question being asked.
Representativeness
Extent to which a sample reflects the diversity of the entire population, helping to avoid selection bias.
Validity and Reliability
Qualities that ensure data measures what it claims to measure (validity) and does so consistently (reliability).
Simple Random Sample
Sampling technique where every member of the population has an equal and independent chance of selection.
Systematic Sample
Sampling method that selects every k-th item from an ordered list, usually beginning at a randomly chosen starting point.
Stratified Sample
Sampling approach that divides the population into subgroups (strata) and samples from each to maintain proportional representation.
Cluster Sample
Sampling technique that randomly selects entire groups or clusters, then studies all or a subset of elements within those clusters.
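The four sampling techniques above can be compared side by side in a short sketch. This is an illustration only, using Python's standard library; the customer records, the `regions` grouping and all function names are hypothetical and not taken from the lecture.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n items so that every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Take every k-th item of an ordered list, beginning at index `start`."""
    return population[start::k]

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Sample the same fraction from each stratum to keep proportions."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical customer records: (customer_id, plan)
customers = [(i, "premium" if i % 5 == 0 else "basic") for i in range(20)]
# Hypothetical clusters: customers grouped by region
regions = {"north": customers[:5], "south": customers[5:10],
           "east": customers[10:15], "west": customers[15:]}

srs = simple_random_sample(customers, 5)
sys_s = systematic_sample(customers, k=4)
strat = stratified_sample(customers, lambda c: c[1], fraction=0.25)
clus = cluster_sample(regions, n_clusters=2)
```

In this sketch the stratified sample keeps the 1:4 premium-to-basic ratio of the full list, while the cluster sample returns every customer from two randomly chosen regions.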
Problem Formulation (Data Mining)
Crafting a clear, specific, answerable analytical question that guides the mining process (e.g., “Can usage patterns predict churn?”).
Data Preparation
Cleaning, integrating, structuring and formatting raw data so it becomes suitable for mining and modeling tasks.
Data from Different Sources
Integrated datasets (e.g., demographics, usage, feedback) that give a comprehensive view but may require complex merging.
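A minimal sketch of why merging can get complicated, assuming three hypothetical sources that share a `customer_id` key. The data, the field names and the `merge_sources` helper are invented for illustration.

```python
# Hypothetical sources, each keyed by customer_id
demographics = {101: {"age": 34, "region": "north"},
                102: {"age": 51, "region": "south"}}
usage = {101: {"monthly_minutes": 420},
         103: {"monthly_minutes": 95}}
feedback = {101: {"satisfaction": 4},
            102: {"satisfaction": 2}}

def merge_sources(*sources):
    """Outer-join dict-of-dict sources on their shared key."""
    merged = {}
    for source in sources:
        for key, fields in source.items():
            merged.setdefault(key, {}).update(fields)
    return merged

customers = merge_sources(demographics, usage, feedback)

# Customers that are missing from at least one source end up with
# fewer than the four expected fields and need attention in cleaning.
incomplete = sorted(k for k, v in customers.items() if len(v) < 4)
```

Only customer 101 appears in all three sources here, so the other two records come out incomplete, which is the kind of gap that data preparation has to resolve.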
Data in a Grid Format
Tabular (rows-columns) organization of data that simplifies analysis but may limit capture of complex relationships.
Exploratory Data Analysis (EDA)
Initial analytical step aimed at discovering patterns and anomalies and at computing basic statistics such as the mean, median, variance and distribution.
Mean
Arithmetic average of a numerical data set.
Median
Middle value of an ordered data set, splitting it into two equal halves (the average of the two middle values when the count is even).
Variance
Average squared deviation of the data points from the mean; a statistical measure of how widely the values spread.
Distribution
Overall pattern of values, indicating the shape, center and spread of the data points.
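The EDA statistics defined above (mean, median, variance, distribution) can be computed with Python's standard `statistics` module. The usage figures below are made up for illustration, and the 5-hour bin width is an arbitrary choice.

```python
import statistics
from collections import Counter

# Hypothetical monthly usage hours for nine customers
usage_hours = [2, 3, 3, 4, 5, 5, 5, 8, 19]

mean = statistics.mean(usage_hours)           # arithmetic average
median = statistics.median(usage_hours)       # middle of the sorted values
variance = statistics.pvariance(usage_hours)  # mean squared deviation from the mean

# Crude view of the distribution: counts per 5-hour bin
bins = Counter((h // 5) * 5 for h in usage_hours)
for start in sorted(bins):
    print(f"{start:>2}-{start + 4:<2} {'#' * bins[start]}")
```

Here the mean (6) sits above the median (5) because the single value 19 pulls the average upward, which is the sort of anomaly EDA is meant to surface.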
Pattern Recognition
Process of detecting meaningful trends, correlations or structures within data.
Analytical Visualization
Use of charts (e.g., scatter plots, bar charts, heat maps) to display statistical characteristics and support pattern recognition and decisions.
Descriptive Analytics
Techniques that summarize and describe historical data to reveal current patterns and trends.
Inferential Analytics
Statistical techniques that draw conclusions about a larger population based on a sample, often via hypothesis testing or regression.
Statistical Inference
Discipline of drawing conclusions about a population from sample data while quantifying the uncertainty in those conclusions.
Generalization (of a model)
Ability of a predictive model to perform accurately on new, unseen data rather than just the training set.
Cross-Validation
Model-evaluation strategy that repeatedly partitions the data into training and testing folds, so that each fold serves once as the test set, to estimate out-of-sample performance.
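A minimal sketch of k-fold cross-validation using only the standard library. The helper names and the `fit_and_score` callback are hypothetical; in practice the callback would train a model on the training indices and return its score on the test indices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Ten samples split into five folds of two test samples each
splits = list(k_fold_indices(10, k=5))
```

Averaging over the folds gives a steadier estimate of how the model generalizes than a single train/test split would.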
Confidence Interval
Range of values computed from a sample by a procedure that, over repeated sampling, captures the true parameter at a specified rate (the confidence level, e.g., 95%).
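As a rough illustration, an interval for a population mean can be sketched with the normal approximation. This is an assumption made for simplicity: for small samples a t-distribution would normally be used instead. The scores and the function name are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the population mean."""
    n = len(sample)
    center = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * std_err, center + z * std_err

# Hypothetical satisfaction scores from ten surveyed customers
scores = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0, 7.2, 6.7, 7.5, 7.1]
low, high = mean_confidence_interval(scores)
low99, high99 = mean_confidence_interval(scores, confidence=0.99)
```

Raising the confidence level from 95% to 99% widens the interval, reflecting the trade-off between certainty and precision.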
Actionable Intelligence
Insights derived from analysis that directly inform strategic, real-world decisions (e.g., targeting retention campaigns).
Ethical Considerations in Data Mining
Practices ensuring privacy, fairness and compliance with regulations during data handling and analysis.
GDPR (General Data Protection Regulation)
EU legislation governing how personal data must be collected, processed and protected.
Customer Churn Prediction
Data-mining application that identifies customers likely to leave, enabling proactive retention strategies.