Data Mining Course Notes

Dr. Srinivas Prasad is the instructor for the course DNSC 479 - Data Mining.
The course assumes prior knowledge from the prerequisite course DNSC 2001 (Statistics), ensuring students are comfortable with data distributions, probability, and basic statistical inference.
The instructor has been teaching this course since its inception (2008-2009), a period that coincided with the rapid rise of digital analytics and the growing adoption of tools like Google Analytics.
The course content has evolved significantly, shifting from traditional business analytics and decision technology to modern advancements in Artificial Intelligence, Generative AI, and Large Language Models (LLMs).

The course serves as a foundational introduction to statistical learning, focusing on how algorithms learn from data to produce insights.
Key discussions will center on the practical applications of language processing and generative models like ChatGPT to solve business problems.
This course prepares students for DNSC 480 (Machine Learning), which covers core AI/ML algorithms and computational efficiency in greater depth.

Evolution of Business Analytics: The Business Analytics major was established approximately six years ago. It remains selective, enrolling about 30 students annually, many of them from the business school's honors cohort (top 20%).
AI and Recent Trends: The curriculum integrates contemporary AI content to ensure relevance in the changing data landscape, reflecting the transition from predictive modeling to generative intelligence.

Modeling is taught as a tool to support strategic business decision-making rather than just a mathematical exercise.
Example - Customer Churn:
- Analyzes historical consumer data to identify patterns leading to defection (leaving a service).
- The goal is to develop strategies that reduce churn by a specific, measurable percentage.
- Investigates independent variables (features), such as contract length, monthly charges, and support tickets, that differentiate loyal customers from those who move to competitors.
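As a first step in the churn example above, one might simply compare average feature values for customers who left against those who stayed. The sketch below uses a handful of made-up records; the field names (`contract_months`, `monthly_charge`, `support_tickets`, `churned`) and all values are assumptions for illustration, not course data.

```python
from statistics import mean

# Hypothetical customer records (made-up values for illustration only).
customers = [
    {"contract_months": 1,  "monthly_charge": 80.0, "support_tickets": 4, "churned": True},
    {"contract_months": 1,  "monthly_charge": 75.0, "support_tickets": 3, "churned": True},
    {"contract_months": 12, "monthly_charge": 60.0, "support_tickets": 1, "churned": False},
    {"contract_months": 24, "monthly_charge": 55.0, "support_tickets": 0, "churned": False},
    {"contract_months": 12, "monthly_charge": 70.0, "support_tickets": 2, "churned": True},
    {"contract_months": 24, "monthly_charge": 50.0, "support_tickets": 1, "churned": False},
]

FEATURES = ["contract_months", "monthly_charge", "support_tickets"]

def feature_means(records, churned):
    """Average each feature over the customers with the given churn status."""
    group = [r for r in records if r["churned"] == churned]
    return {f: mean(r[f] for r in group) for f in FEATURES}

# Share of customers who left, then the average profile of each group.
churn_rate = mean(1 if r["churned"] else 0 for r in customers)
churned_profile = feature_means(customers, True)
loyal_profile = feature_means(customers, False)

print(f"churn rate: {churn_rate:.0%}")
print("churned:", churned_profile)
print("loyal:  ", loyal_profile)
```

In this toy data the churned group has shorter contracts, higher charges, and more support tickets, which is the kind of pattern a model would then try to quantify.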

Differentiate between the mechanics of supervised and unsupervised learning.
Apply theoretical methods to practical business problems using real-world datasets.
Establish the technical groundwork for complex data science roles and advanced analytics.

Supervised Learning: Training models on data where the outcome is already known (labeled data).
- Goal: Find a mapping function $f$ such that $Y = f(X) + \epsilon$, where $\epsilon$ represents irreducible error.
- Components:
  - $X$ (Features): Independent variables or predictors, such as the carat weight or clarity of a diamond.
  - $Y$ (Target): Dependent variable or response, such as the market price of the diamond.
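To make the role of $\epsilon$ concrete, the following sketch simulates diamond prices from a known function plus random noise. The pricing rule `f` and the noise level are invented for illustration; real diamond prices do not follow this formula.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def f(carat):
    """Hypothetical 'true' pricing rule (an assumption, not real market data)."""
    return 500.0 + 4000.0 * carat

carats = [random.uniform(0.2, 2.0) for _ in range(200)]   # X
noise = [random.gauss(0.0, 100.0) for _ in carats]        # epsilon
prices = [f(x) + e for x, e in zip(carats, noise)]        # Y = f(X) + epsilon

mean_noise = sum(noise) / len(noise)

# Even a model that knew f exactly would still miss each price by epsilon.
squared_errors = [(y - f(x)) ** 2 for x, y in zip(carats, prices)]
rmse_of_true_f = (sum(squared_errors) / len(squared_errors)) ** 0.5

print(f"mean noise: {mean_noise:.1f}")
print(f"RMSE when predicting with the true f: {rmse_of_true_f:.1f}")
```

The RMSE stays near the noise standard deviation (100 here) no matter how good the model is, which is what "irreducible" means.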
Unsupervised Learning: Identifying hidden structures, patterns, or anomalies in data without a predefined target variable.
- Example: Using cluster analysis to segment customers into groups $(C_1, C_2, \dots, C_k)$ based on similarity in purchasing behavior without knowing the segments beforehand.
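A minimal sketch of the segmentation idea, using a plain k-means loop written from scratch. The customer values and the choice of starting centroids are assumptions made for illustration; in coursework a library implementation would normally be used instead.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centroids)
        ]
    labels = [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Hypothetical customers: (purchases per month, average basket in dollars).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]

# Start from two of the observed points; no segment labels are supplied.
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees a target variable; the two groups emerge only from distances between the points.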

Linear Regression: Used to predict a continuous outcome by fitting a linear equation to the observed data points.
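As a small illustration of fitting a line, the sketch below applies the standard least-squares formulas for one predictor. The carat and price values are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical diamonds: carat weight and price in dollars.
carats = [0.5, 1.0, 1.5, 2.0]
prices = [1400.0, 3600.0, 5400.0, 7600.0]

intercept, slope = fit_line(carats, prices)

def predict(carat):
    return intercept + slope * carat

print(f"price = {intercept:.0f} + {slope:.0f} * carat")
print(f"predicted price at 1.2 carats: {predict(1.2):.0f}")
```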
Logistic Regression: A classification algorithm used when the outcome variable is categorical or binary (e.g., $Y \in \{0, 1\}$).
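The following sketch fits a one-feature logistic regression by gradient descent on the average log-loss. The data, learning rate, and iteration count are assumptions chosen to keep the example small; they are not tuned values.

```python
from math import exp

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical data: support tickets filed (x) and whether the customer churned (y).
tickets = [0, 1, 2, 3, 4, 5, 2, 3]
churned = [0, 0, 0, 1, 1, 1, 1, 0]

w, b = 0.0, 0.0       # slope and intercept of the log-odds
learning_rate = 0.1   # assumed step size
n = len(tickets)

for _ in range(5000):
    # Gradient of the average log-loss with respect to w and b.
    errors = [sigmoid(w * x + b) - y for x, y in zip(tickets, churned)]
    grad_w = sum(e * x for e, x in zip(errors, tickets)) / n
    grad_b = sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

def churn_probability(x):
    return sigmoid(w * x + b)

print(f"P(churn | 0 tickets) = {churn_probability(0):.2f}")
print(f"P(churn | 5 tickets) = {churn_probability(5):.2f}")
```

The model outputs a probability; turning it into a class label requires a cutoff, commonly 0.5.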
Decision Trees: Non-linear models that use a series of branching 'if-then' rules to make predictions. They consist of internal nodes for features, branches for decision rules, and leaf nodes for predicted outcomes.
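To show how the node, branch, and leaf structure produces a prediction, here is a hand-written tree stored as nested dictionaries. The features and thresholds are invented for illustration; a real tree would learn its splits from data.

```python
# Internal nodes hold a feature and threshold; leaf nodes hold a prediction.
# All split values below are hypothetical.
tree = {
    "feature": "contract_months", "threshold": 6,
    "left": {   # contract_months <= 6
        "feature": "support_tickets", "threshold": 2,
        "left": {"leaf": "stay"},    # support_tickets <= 2
        "right": {"leaf": "churn"},  # support_tickets > 2
    },
    "right": {"leaf": "stay"},       # contract_months > 6
}

def predict(node, record):
    """Follow if-then branches from the root until a leaf is reached."""
    while "leaf" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

print(predict(tree, {"contract_months": 1, "support_tickets": 4}))   # churn
print(predict(tree, {"contract_months": 24, "support_tickets": 5}))  # stay
```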
Neural Networks: Composed of layers of interconnected nodes (neurons), each applying weights and an activation function to its inputs. With sigmoid activations, a network can be conceptualized as many logistic regression models working together to capture complex, non-linear relationships.
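One way to see the "many logistic regressions" view is a tiny network of three logistic units that computes XOR, a pattern no single logistic regression can represent. The weights below are picked by hand for illustration; they are not the result of training.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def logistic_unit(inputs, weights, bias):
    """One neuron: a weighted sum passed through a sigmoid, as in logistic regression."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def tiny_network(x1, x2):
    # Hidden layer: two logistic units with hand-picked weights.
    h_or = logistic_unit([x1, x2], [20.0, 20.0], -10.0)     # roughly "x1 OR x2"
    h_nand = logistic_unit([x1, x2], [-20.0, -20.0], 30.0)  # roughly "NOT (x1 AND x2)"
    # Output layer: a third logistic unit combining the hidden outputs (roughly AND).
    return logistic_unit([h_or, h_nand], [20.0, 20.0], -30.0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [round(tiny_network(a, b)) for a, b in inputs]
print(outputs)  # [0, 1, 1, 0]
```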
Clustering and PCA:
- Clustering: Algorithms like K-means that group data points based on proximity or similarity.
- Principal Component Analysis (PCA): A dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables (principal components) while retaining as much variance as possible.
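For two correlated variables, PCA can be worked out by hand from the 2x2 covariance matrix. The sketch below does this with made-up points and the closed-form eigenvalues of a symmetric 2x2 matrix; it assumes the covariance between the two variables is non-zero.

```python
from math import sqrt, hypot

# Hypothetical observations of two correlated variables.
points = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
n = len(points)

# Center the data.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (sxx + syy) / 2
radius = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# Unit eigenvector for the larger eigenvalue (valid when sxy != 0).
vx, vy = sxy, lam1 - sxx
norm = hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Project each point onto the first principal component.
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
explained = lam1 / (lam1 + lam2)

print(f"first component direction: ({pc1[0]:.3f}, {pc1[1]:.3f})")
print(f"share of variance retained: {explained:.0%}")
```

Here one component retains 90% of the total variance, so the two original variables can be summarized by a single score per observation with little loss.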

Success in the course requires a grasp of Basic Calculus (specifically derivatives for optimization tasks like minimizing error) and Matrix Algebra (for representing and manipulating high-dimensional datasets as matrices $X$ and vectors $y$).
These mathematical concepts are implemented through programming languages like Python and R, which handle the underlying computation.
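The sketch below shows where the two prerequisites meet: the data are stored as a matrix $X$ and vector $y$, and the derivative of the mean squared error, $\frac{2}{n} X^\top (X\beta - y)$, drives a gradient descent loop. The data, step size, and iteration count are assumptions for illustration.

```python
# Each row of X is [1, x]; the leading 1 lets beta[0] act as the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]   # made-up targets lying on y = 1 + 2x

def mse_gradient(X, y, beta):
    """Gradient of mean squared error: (2/n) * X^T (X beta - y)."""
    n = len(y)
    residuals = [sum(b * x for b, x in zip(beta, row)) - yi
                 for row, yi in zip(X, y)]
    return [2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(len(beta))]

beta = [0.0, 0.0]
learning_rate = 0.1   # assumed step size

for _ in range(2000):
    gradient = mse_gradient(X, y, beta)
    # Step against the gradient to reduce the error.
    beta = [b - learning_rate * g for b, g in zip(beta, gradient)]

print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
```

Libraries perform the same kind of computation with optimized matrix routines, so the loops written out here are for exposition only.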

Programming: Python is the preferred language for assignments. GitHub repositories are provided containing scripts, datasets, and practice exercises to facilitate hands-on learning.
AI Policy: While AI tools can generate code snippets, the instructor emphasizes the importance of understanding the context and logic of the code rather than relying solely on automated generation.

Assessment: Evaluation includes multiple quizzes to check progress and two major exams to test comprehensive knowledge.
Midterm: Scheduled before spring break and covers the first half of the course material.
Late Policy: Each student has a total of 4 late days to apply to assignments over the semester, as a buffer for unforeseen circumstances.

Teams: Students work in pairs or groups of three to mimic real-world analytical environments.
Data Sourcing: Students are required to source or customize unique, real-world datasets rather than using standard pedagogical datasets (like Iris or Titanic).
Expectation: Projects must demonstrate model building, validation, and a rigorous interpretation of results to provide actionable business recommendations.

Active engagement and classroom discussion are critical. Students are encouraged to take their own detailed notes to supplement provided slides.
Students are responsible for all material covered in class and for keeping up with the schedule during any absences.