ACCT 331 LECTURE 9

Data collection and data sourcing

  • The ML workflow is being presented as a lifecycle with real-world engineering context, not just a theoretical loop.
  • One notable exception: production deployment is important but is handled by other teams; the lecture emphasizes awareness rather than deep hands-on production engineering for the course.
  • Different perspectives on the process: start with a business use case, then identify data sources and determine how to process and clean the data.
  • Data sources discussed:
    • Internal data: e.g., inventory maintenance records, parts inventory, maintenance history.
    • External data: weather, traffic patterns affecting railcars, and other factors that can affect repairs or availability.
    • Example use case: on-shelf availability – predicting the probability that an item (e.g., wood) won't be on the shelf so it can be sold to a consumer.
  • Creativity in data gathering: sometimes you must engineer or bring in new data sources to solve the problem, not just rely on data you already have.
  • A pragmatic question when starting: how much data is enough? The answer given is “as much as possible.” More data usually helps, and there is a cost to ignoring data that already exists.
  • Key data questions to consider early:
    • Do I have enough features (independent variables), and what does each x (column) represent?
    • What is the right scope of data to capture for the task?
    • Are there relevant features that are missing or need to be engineered?
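These early data questions can be sketched in pandas; the dataset below is hypothetical, and all column names and values are illustrative:

```python
import pandas as pd

# Hypothetical maintenance dataset; column names and values are illustrative.
df = pd.DataFrame({
    "part_id": [101, 102, 103, 104],
    "age_months": [12.0, 30.0, None, 18.0],
    "failures": [0, 2, 1, 0],
})

# How many features (columns) are there, and what does each represent?
print(df.columns.tolist())
print(df.shape)  # (rows, features)

# Which relevant values are missing and may need cleaning or imputation?
print(df.isna().sum())
```

Inspecting column names, shape, and missingness is usually the first pass before deciding whether features need to be engineered or sourced externally.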

Data cleaning and feature engineering

  • Second stage focuses on data cleaning, feature engineering, and exploratory data analysis (EDA).
  • Goals in data cleaning:
    • Handle incorrect, incomplete, or missing data.
    • Understand the features (variables, x’s) and extract essential variables while discarding nonessential derivatives.
  • Feature engineering focuses on selecting and creating the right features for computation; this is often driven by experiments and intuition.
  • Example narrative: in a dataset with car-related features, some features may be redundant or highly correlated (collinearity); this leads to the idea of consolidating features or forming ratios to reduce redundancy (e.g., a weight-to-height ratio).
  • Collinearity concept:
    • When two features are so similar that they do not contribute much to the output, they can be eliminated or consolidated.
    • Practical approach could be to create a ratio or combination that captures the information more effectively (e.g., weight/height).
  • Discussion of a practical data example involving RPM and net horsepower as potentially similar indicators:
    • Emphasis on domain knowledge and interpretation when deciding whether two features convey unique information.
  • Data transformation and scaling concepts introduced:
    • Transformations bring data onto a scale where computations are better conditioned, which can help models train more effectively.
    • Transformation examples include log transformations and standardization.
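A minimal sketch of the collinearity check and ratio consolidation described above, assuming a toy pandas DataFrame with made-up car features:

```python
import pandas as pd

# Toy car-style dataset (illustrative values) with two highly correlated features.
df = pd.DataFrame({
    "weight_kg": [1200, 1500, 1800, 2100],
    "height_cm": [140, 150, 160, 170],
})

# Pairwise correlation near 1.0 signals collinearity/redundancy.
corr = df["weight_kg"].corr(df["height_cm"])
print(f"correlation: {corr:.2f}")

# Consolidate the pair into a single ratio feature (weight / height)
# and drop the redundant originals.
df["weight_height_ratio"] = df["weight_kg"] / df["height_cm"]
df = df.drop(columns=["weight_kg", "height_cm"])
```

Whether a ratio actually captures the information better than the raw pair is an empirical question; as the notes say, domain knowledge and experiments drive the decision.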

Data transformation and scaling (with intuition and formulas)

  • Log transformations:
    • Purpose: compresses large numbers and stretches small numbers while preserving the relative order of data points; equal ratios in the original data map to equal differences after the transform.
    • Effect: reduces skewness and stabilizes variance, helping with features that span several orders of magnitude.
    • Common form: y = log(x), where log is typically the natural log or log base 10 depending on context.
    • Note: the ordering of values is maintained, and multiplicative relationships become additive differences.
  • Standardization (z-score normalization):
    • Purpose: recenters data around zero with unit standard deviation, enabling comparisons across different units.
    • Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
    • Effects:
    • Centers data at zero: mean of transformed feature is 0.
    • Scales by the spread (standard deviation) to achieve unit variance.
    • Preserves the relative positions of data: standardization is an affine rescaling, so the ordering and relative distances between points are unchanged.
    • Why it matters: when features have different units (e.g., height in cm, weight in kg), standardization allows direct comparison and fair treatment by many models.
  • Practical illustration concepts:
    • A visual showing how standardization re-centers around zero and changes the spread.
    • The goal is consistent scale across features rather than changing the underlying relationships.
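Both transformations above can be sketched in NumPy; the values are illustrative:

```python
import numpy as np

# One feature spanning several orders of magnitude (illustrative values).
x = np.array([1.0, 10.0, 100.0, 1000.0])

# Log transformation: compresses large values, stretches small ones.
y = np.log(x)
assert np.all(np.diff(y) > 0)                 # ordering preserved
assert np.allclose(np.diff(y), np.log(10.0))  # equal ratios -> equal gaps

# Standardization (z-score): center at zero, scale to unit std deviation.
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # mean ~0, std ~1
```

After standardization, features measured in centimeters and kilograms live on the same scale, which is what lets many models treat them fairly.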

Feature extraction, selection, and model-related considerations

  • Coding/encoding of categorical features as part of feature engineering (feature extraction step).
  • Feature importance and ablation (dropout) techniques:
    • Start with a set of features (e.g., 14 features) and systematically drop each to observe the impact on model performance.
    • This helps identify which features are contributing meaningfully to the output and which can be eliminated or re-engineered.
  • Commonly discussed feature engineering insights:
    • Reducing dimensionality via design of new features or elimination of redundant ones improves learning efficiency.
    • The process is iterative and often requires testing with objective metrics.
  • Practical point: data science includes both art and science; curiosity and experience drive effective feature design and model refinement.
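The ablation idea above can be sketched with synthetic data, using a plain least-squares fit as a stand-in model (the lecture's 14-feature dataset is represented here by 5 synthetic features, only 3 of which drive the target):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 features, of which only the first 3 are informative.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

def r2(X, y):
    """R^2 of an ordinary least-squares fit (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

baseline = r2(X, y)
# Ablation: drop one feature at a time and observe the impact on fit.
for i in range(X.shape[1]):
    score = r2(np.delete(X, i, axis=1), y)
    print(f"without feature {i}: R^2 change = {score - baseline:+.4f}")
```

Dropping an informative feature degrades the score noticeably, while dropping a noise feature barely moves it; that contrast is what identifies candidates for elimination.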

Model selection and evaluation

  • After data preprocessing and feature engineering, you select a modeling approach, train, and evaluate.
  • Many algorithms can be tried to compare performance; standard examples include:
    • Logistic regression
    • Support Vector Machines (SVMs)
    • Probit-style variants (the transcript's term "probe vector machine" is likely a mis-transcription, possibly of "support vector machine")
    • K-Nearest Neighbors (KNN)
  • The goal is to produce a model that generalizes well, not just one that performs best on the training data.
  • When selecting a model, consider more than performance metrics alone:
    • Training time and computational resources required
    • Interpretability and explainability to stakeholders
    • Ease of maintenance and updating the model over time
    • How well the model aligns with business expectations and prior solutions
  • Model evaluation concepts addressed:
    • Performance is not the sole arbiter of “best” model; stakeholders and practicality matter.
    • Evaluation should consider real-world deployment context and maintenance implications.
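A sketch of trying several candidate algorithms and comparing cross-validated performance, assuming scikit-learn and a synthetic dataset standing in for a real business problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic classification task (illustrative stand-in for real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Cross-validated accuracy estimates generalization, not training-set fit.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

As the notes stress, the top cross-validation score is not automatically the "best" choice; training cost, interpretability, and maintainability weigh in as well.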

From model to production: deployment and MLOps (high-level overview)

  • Production deployment is treated as a separate area handled by engineers, not the core focus of the course.
  • High-level ML operations (MLOps) loop:
    • Continuous data collection and labeling
    • Experimentation to improve performance
    • Ongoing evaluation and monitoring of performance in production
    • Handling data drift and changes in incoming data distributions over time
  • A deployed model typically involves a data pipeline that automates:
    • Data extraction from production sources
    • Data cleaning and feature engineering in production
    • Training/inference pipelines that update and apply the model
    • A user interface or API layer to deliver predictions to end users
  • Real-world implications of production pipelines:
    • Production data may differ from training data; pipelines must adapt to these changes.
    • The phrase "data pipelines" implies automated, ongoing data flow from raw data to predictions.
    • Monitoring is essential: if performance drops, engineers respond by updating data collection, features, or models.
  • The lecture’s emphasis: understanding the end-to-end production reality and recognizing that deploying ML in production is non-trivial and requires cross-functional collaboration.
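One minimal way to sketch the drift-monitoring idea above, assuming a single numeric feature and a hypothetical mean-shift threshold (production monitoring would typically track richer statistics per feature):

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature values seen at training time vs. arriving in production (synthetic).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = rng.normal(loc=0.8, scale=1.0, size=5000)  # distribution shifted

# Naive drift check: has the production mean moved more than a threshold,
# measured in training standard deviations?
DRIFT_THRESHOLD = 0.5  # illustrative value; tune per feature in practice
shift = abs(prod_feature.mean() - train_feature.mean()) / train_feature.std()

if shift > DRIFT_THRESHOLD:
    print(f"drift detected: mean shifted by {shift:.2f} training std devs")
```

When a check like this fires, the MLOps loop responds by revisiting data collection, features, or the model itself, as described above.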

Real-world examples and case contexts mentioned

  • Inventory optimization example to maximize uptime of production cores; involved discussion of internal data (inventory maintenance records) and external data (weather, traffic) to influence maintenance and repair planning.
  • On-shelf availability example: predicting the probability of stockouts to adjust sales and marketing decisions.
  • The data pipeline in practice includes continuous loops from data collection to deployment, with periodic re-evaluation and adjustments as business conditions change.
  • Case study and assignments two through four to be introduced later by Anne; there will be a reading list for next week.

Practical considerations, tools, and professional workflows

  • The instructor touches on practical tools for research and collaboration:
    • Generating tailored content (e.g., prompts that guide AI in producing outputs). The idea is to customize prompts for more useful or relevant outputs.
    • Demonstrations of AI-assisted slide customization and prompt systems (e.g., a system prompt that guides the AI to act in a certain way).
    • Google AI tools (Gemini) to describe documents, summarize data, or describe folders and content; example: image generation with different styles.
    • Notebooks and PDFs: tools that allow uploading PDFs and querying the content, and tools that generate podcasts or mind maps from content.
    • Fathom (note-taking and meeting summarization): integrates with Zoom/Teams/Meet to provide detailed meeting summaries; free tier offers limited meetings, premium offers more features.
  • AI coding and code assistance: AI can help with coding tasks in Python, offering autocompletion and problem-solving aid; caution is advised in interview or assessment contexts to avoid inappropriate reliance.
  • The overall takeaway: AI-enabled tooling can accelerate research, notes, summarization, and coding tasks, but it should be used ethically and in alignment with coursework and exam expectations.

Ethical, philosophical, and practical implications

  • Data governance: legal and security issues related to data collection and usage are acknowledged as important but are handled by other teams; students should be aware of these constraints.
  • The balance between automation and human oversight: while AI tools can speed up work, human judgment remains critical for interpretation, ethical considerations, and strategic decisions.
  • Continuous learning: ML practice requires curiosity, experimentation, and willingness to iterate—reinforcing that ML is an evolving field.
  • Real-world relevance: the pipeline concepts map directly to business outcomes (uptime, stock availability, efficiency) and emphasize the need for practical problem-solving beyond theoretical models.

Core formulas, terminology, and quick references

  • Standardization (z-score): z = (x - μ) / σ
    • where μ is the mean and σ is the standard deviation of the feature.
    • Purpose: center at zero and scale to unit variance; enables comparison across features with different units.
  • Log transformation: y = log(x)
    • Purpose: compresses large values and stretches small ones; preserves ordering (equal ratios become equal differences); reduces skewness and stabilizes variance.
  • Ratio/derived features example:
    • weight-to-height ratio: ratio = weight / height
    • useful as a normalized metric when the two raw features are redundant or highly correlated.
  • Ablation/feature importance approach: drop one feature at a time and observe impact on performance to determine its contribution.
  • Model evaluation metric (illustrative): R^2 = 1 - SS_res / SS_tot
    • SS_res is the sum of squared residuals, and SS_tot is the total sum of squares.
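A worked numeric check of the R^2 formula above, with toy values:

```python
import numpy as np

# Toy actual vs. predicted values (illustrative).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")  # 0.9950
```

Here ss_res = 0.10 and ss_tot = 20, so R^2 = 1 - 0.10/20 = 0.995, i.e., the predictions explain nearly all the variance in the toy target.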

Quick connections to broader themes

  • The ML process is iterative and non-linear; revisit earlier steps based on evaluation outcomes.
  • The pipeline encompasses data collection, cleaning, feature engineering, transformation, model training, evaluation, and deployment, with monitoring in production for data drift and performance changes.
  • The material emphasizes practical engineering realities (data pipelines, production challenges) alongside core modeling concepts, illustrating why data engineering and MLOps matter just as much as model accuracy.
  • The lecture situates the course within the broader context of business objectives, stakeholder needs, and real-world constraints, preparing students for industry-facing ML projects.

Next steps and study notes alignment

  • Expect coverage of JAI case study tools and assignments two through four in upcoming sessions.
  • Readings and exercises for next week will likely reinforce data collection, EDA, and model evaluation practices, with hands-on practice on real datasets.
  • Students are encouraged to explore AI tools for productivity and to practice designing data pipelines and evaluating models in production settings.