ACCT 331 LECTURE 9

Data collection and data sourcing

  • The ML workflow is being presented as a lifecycle with real-world engineering context, not just a theoretical loop.
  • One notable exception: production deployment is important but is handled by other teams; the lecture emphasizes awareness rather than deep hands-on production engineering for the course.
  • Different perspectives on the process: start with a business use case, then identify data sources and determine how to process and clean the data.
  • Data sources discussed:
    • Internal data: e.g., inventory maintenance records, parts inventory, maintenance history.
    • External data: weather, traffic patterns affecting railcars, and other factors that can affect repairs or availability.
    • Example use case: on-shelf availability – predicting the probability that an item (e.g., wood) won't be on the shelf so it can be sold to a consumer.
  • Creativity in data gathering: sometimes you must engineer or bring in new data sources to solve the problem, not just rely on data you already have.
  • A pragmatic question when starting: how much data is enough? The answer given is “as much as possible.” More data usually helps, and there is a cost to ignoring data that already exists.
  • Key data questions to consider early:
    • Do I have enough features (independent variables), and what does each x (column) represent?
    • What is the right scope of data to capture for the task?
    • Are there relevant features that are missing or need to be engineered?
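These early data questions can be sketched in pandas; the dataset below is hypothetical, and all column names and values are illustrative:

```python
import pandas as pd

# Hypothetical maintenance dataset; column names and values are illustrative.
df = pd.DataFrame({
    "part_id": [101, 102, 103, 104],
    "age_months": [12.0, 30.0, None, 18.0],
    "failures": [0, 2, 1, 0],
})

# How many features (columns) are there, and what does each represent?
print(df.columns.tolist())
print(df.shape)  # (rows, features)

# Which relevant values are missing and may need cleaning or imputation?
print(df.isna().sum())
```

Inspecting column names, shape, and missingness is usually the first pass before deciding whether features need to be engineered or sourced externally.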

Data cleaning and feature engineering

  • Second stage focuses on data cleaning, feature engineering, and exploratory data analysis (EDA).
  • Goals in data cleaning:
    • Handle incorrect, incomplete, or missing data.
    • Understand the features (variables, x’s) and extract essential variables while discarding nonessential derivatives.
  • Feature engineering focuses on selecting and creating the right features for computation; this is often driven by experiments and intuition.
  • Example narrative: in a dataset with car-related features, some features may be redundant or highly correlated (collinearity); this leads to the idea of consolidating features or forming ratios to reduce redundancy (e.g., a weight-to-height ratio).
  • Collinearity concept:
    • When two features are so similar that they do not contribute much to the output, they can be eliminated or consolidated.
    • Practical approach could be to create a ratio or combination that captures the information more effectively (e.g., weight/height).
  • Discussion of a practical data example involving RPM and net horsepower as potentially similar indicators:
    • Emphasis on domain knowledge and interpretation when deciding whether two features convey unique information.
  • Data transformation and scaling concepts introduced:
    • Transformations bring data onto a scale where computations are better conditioned, which can help models train more effectively.
    • Transformation examples include log transformations and standardization.
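A minimal sketch of the collinearity check and ratio consolidation described above, assuming a toy pandas DataFrame with made-up car features:

```python
import pandas as pd

# Toy car-style dataset (illustrative values) with two highly correlated features.
df = pd.DataFrame({
    "weight_kg": [1200, 1500, 1800, 2100],
    "height_cm": [140, 150, 160, 170],
})

# Pairwise correlation near 1.0 signals collinearity/redundancy.
corr = df["weight_kg"].corr(df["height_cm"])
print(f"correlation: {corr:.2f}")

# Consolidate the pair into a single ratio feature (weight / height)
# and drop the redundant originals.
df["weight_height_ratio"] = df["weight_kg"] / df["height_cm"]
df = df.drop(columns=["weight_kg", "height_cm"])
```

Whether a ratio actually captures the information better than the raw pair is an empirical question; as the notes say, domain knowledge and experiments drive the decision.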

Data transformation and scaling (with intuition and formulas)

  • Log transformations:
    • Purpose: compresses large numbers and stretches small numbers while preserving the relative order of data points; equal ratios in the original data map to equal differences after the transform.
    • Effect: reduces skewness and stabilizes variance, helping with features that span several orders of magnitude.
    • Common form: y = log(x), where log is typically the natural log or log base 10 depending on context.
    • Note: the ordering of values is maintained, and multiplicative relationships become additive differences.
  • Standardization (z-score normalization):
    • Purpose: recenters data around zero with unit standard deviation, enabling comparisons across different units.
    • Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
    • Effects:
    • Centers data at zero: mean of transformed feature is 0.
    • Scales by the spread (standard deviation) to achieve unit variance.
    • Preserves the relative positions of data: standardization is an affine rescaling, so the ordering and relative distances between points are unchanged.
    • Why it matters: when features have different units (e.g., height in cm, weight in kg), standardization allows direct comparison and fair treatment by many models.
  • Practical illustration concepts:
    • A visual showing how standardization re-centers around zero and changes the spread.
    • The goal is consistent scale across features rather than changing the underlying relationships.
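Both transformations above can be sketched in NumPy; the values are illustrative:

```python
import numpy as np

# One feature spanning several orders of magnitude (illustrative values).
x = np.array([1.0, 10.0, 100.0, 1000.0])

# Log transformation: compresses large values, stretches small ones.
y = np.log(x)
assert np.all(np.diff(y) > 0)                 # ordering preserved
assert np.allclose(np.diff(y), np.log(10.0))  # equal ratios -> equal gaps

# Standardization (z-score): center at zero, scale to unit std deviation.
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # mean ~0, std ~1
```

After standardization, features measured in centimeters and kilograms live on the same scale, which is what lets many models treat them fairly.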

Feature extraction, selection, and model-related considerations

  • Coding/encoding of categorical features as part of feature engineering (feature extraction step).
  • Feature importance and ablation (dropout) techniques:
    • Start with a set of features (e.g., 14 features) and systematically drop each to observe the impact on model performance.
    • This helps identify which features are contributing meaningfully to the output and which can be eliminated or re-engineered.
  • Commonly discussed feature engineering insights:
    • Reducing dimensionality via design of new features or elimination of redundant ones improves learning efficiency.
    • The process is iterative and often requires testing with objective metrics.
  • Practical point: data science includes both art and science; curiosity and experience drive effective feature design and model refinement.
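The ablation idea above can be sketched with synthetic data, using a plain least-squares fit as a stand-in model (the lecture's 14-feature dataset is represented here by 5 synthetic features, only 3 of which drive the target):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 features, of which only the first 3 are informative.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

def r2(X, y):
    """R^2 of an ordinary least-squares fit (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

baseline = r2(X, y)
# Ablation: drop one feature at a time and observe the impact on fit.
for i in range(X.shape[1]):
    score = r2(np.delete(X, i, axis=1), y)
    print(f"without feature {i}: R^2 change = {score - baseline:+.4f}")
```

Dropping an informative feature degrades the score noticeably, while dropping a noise feature barely moves it; that contrast is what identifies candidates for elimination.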

Model selection and evaluation

  • After data preprocessing and feature engineering, you select a modeling approach, train, and evaluate.
  • Many algorithms can be tried to compare performance; standard examples include:
    • Logistic regression
    • Support Vector Machines (SVMs)
    • Probit-style variants (the transcript's term "probe vector machine" is likely a mis-transcription, possibly of "support vector machine")
    • K-Nearest Neighbors (KNN)
  • The goal is to produce a model that generalizes well, not just one that performs best on the training data.
  • When selecting a model, consider more than performance metrics alone:
    • Training time and computational resources required
    • Interpretability and explainability to stakeholders
    • Ease of maintenance and updating the model over time
    • How well the model aligns with business expectations and prior solutions
  • Model evaluation concepts addressed:
    • Performance is not the sole arbiter of “best” model; stakeholders and practicality matter.
    • Evaluation should consider real-world deployment context and maintenance implications.
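A sketch of trying several candidate algorithms and comparing cross-validated performance, assuming scikit-learn and a synthetic dataset standing in for a real business problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic classification task (illustrative stand-in for real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Cross-validated accuracy estimates generalization, not training-set fit.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

As the notes stress, the top cross-validation score is not automatically the "best" choice; training cost, interpretability, and maintainability weigh in as well.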

From model to production: deployment and MLOps (high-level overview)

  • Production deployment is treated as a separate area handled by engineers, not the core focus of the course.
  • High-level ML operations (MLOps) loop:
    • Continuous data collection and labeling
    • Experimentation to improve performance
    • Ongoing evaluation and monitoring of performance in production
    • Handling data drift and changes in incoming data distributions over time
  • A deployed model typically involves a data pipeline that automates:
    • Data extraction from production sources
    • Data cleaning and feature engineering in production
    • Training/inference pipelines that update and apply the model
    • A user interface or API layer to deliver predictions to end users
  • Real-world implications of production pipelines:
    • Production data may differ from training data; pipelines must adapt to these changes.
    • The phrase "data pipelines" implies automated, ongoing data flow from raw data to predictions.
    • Monitoring is essential: if performance drops, engineers respond by updating data collection, features, or models.
  • The lecture’s emphasis: understanding the end-to-end production reality and recognizing that deploying ML in production is non-trivial and requires cross-functional collaboration.
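One minimal way to sketch the drift-monitoring idea above, assuming a single numeric feature and a hypothetical mean-shift threshold (production monitoring would typically track richer statistics per feature):

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature values seen at training time vs. arriving in production (synthetic).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = rng.normal(loc=0.8, scale=1.0, size=5000)  # distribution shifted

# Naive drift check: has the production mean moved more than a threshold,
# measured in training standard deviations?
DRIFT_THRESHOLD = 0.5  # illustrative value; tune per feature in practice
shift = abs(prod_feature.mean() - train_feature.mean()) / train_feature.std()

if shift > DRIFT_THRESHOLD:
    print(f"drift detected: mean shifted by {shift:.2f} training std devs")
```

When a check like this fires, the MLOps loop responds by revisiting data collection, features, or the model itself, as described above.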

Real-world examples and case contexts mentioned

  • Inventory optimization example to maximize uptime of production cores; involved discussion of internal data (inventory maintenance records) and external data (weather, traffic) to influence maintenance and repair planning.
  • On-shelf availability example: predicting the probability of stockouts to adjust sales and marketing decisions.
  • The data pipeline in practice includes continuous loops from data collection to deployment, with periodic re-evaluation and adjustments as business conditions change.
  • Case study and assignments two through four to be introduced later by Anne; there will be a reading list for next week.

Practical considerations, tools, and professional workflows

  • The instructor touches on practical tools for research and collaboration:
    • Generating tailored content (e.g., prompts that guide AI in producing outputs). The idea is to customize prompts for more useful or relevant outputs.
    • Demonstrations of AI-assisted slide customization and prompt systems (e.g., a system prompt that guides the AI to act in a certain way).
    • Google AI tools (Gemini) to describe documents, summarize data, or describe folders and content; example: image generation with different styles.
    • Notebooks and PDFs: tools that allow uploading PDFs and querying the content, and tools that generate podcasts or mind maps from content.
    • Fathom (note-taking and meeting summarization): integrates with Zoom/Teams/Meet to provide detailed meeting summaries; free tier offers limited meetings, premium offers more features.
  • AI coding and code assistance: AI can help with coding tasks in Python, offering autocompletion and problem-solving aid; caution is advised in interview or assessment contexts to avoid inappropriate reliance.
  • The overall takeaway: AI-enabled tooling can accelerate research, notes, summarization, and coding tasks, but it should be used ethically and in alignment with coursework and exam expectations.

Ethical, philosophical, and practical implications

  • Data governance: legal and security issues related to data collection and usage are acknowledged as important but are handled by other teams; students should be aware of these constraints.
  • The balance between automation and human oversight: while AI tools can speed up work, human judgment remains critical for interpretation, ethical considerations, and strategic decisions.
  • Continuous learning: ML practice requires curiosity, experimentation, and willingness to iterate—reinforcing that ML is an evolving field.
  • Real-world relevance: the pipeline concepts map directly to business outcomes (uptime, stock availability, efficiency) and emphasize the need for practical problem-solving beyond theoretical models.

Core formulas, terminology, and quick references

  • Standardization (z-score): z = (x - μ) / σ
    • where μ is the mean and σ is the standard deviation of the feature.
    • Purpose: center at zero and scale to unit variance; enables comparison across features with different units.
  • Log transformation: y = log(x)
    • Purpose: compresses large values and stretches small ones; preserves ordering (equal ratios become equal differences); reduces skewness and stabilizes variance.
  • Ratio/derived features example:
    • weight-to-height ratio: ratio = weight / height
    • useful as a normalized metric when the two raw features are redundant or highly correlated.
  • Ablation/feature importance approach: drop one feature at a time and observe impact on performance to determine its contribution.
  • Model evaluation metric (illustrative): R^2 = 1 - SS_res / SS_tot
    • SS_res is the sum of squared residuals, and SS_tot is the total sum of squares.
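A worked numeric check of the R^2 formula above, with toy values:

```python
import numpy as np

# Toy actual vs. predicted values (illustrative).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")  # 0.9950
```

Here ss_res = 0.10 and ss_tot = 20, so R^2 = 1 - 0.10/20 = 0.995, i.e., the predictions explain nearly all the variance in the toy target.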

Quick connections to broader themes

  • The ML process is iterative and non-linear; revisit earlier steps based on evaluation outcomes.
  • The pipeline encompasses data collection, cleaning, feature engineering, transformation, model training, evaluation, and deployment, with monitoring in production for data drift and performance changes.
  • The material emphasizes practical engineering realities (data pipelines, production challenges) alongside core modeling concepts, illustrating why data engineering and MLOps matter just as much as model accuracy.
  • The lecture situates the course within the broader context of business objectives, stakeholder needs, and real-world constraints, preparing students for industry-facing ML projects.

Next steps and study notes alignment

  • Expect coverage of JAI case study tools and assignments two through four in upcoming sessions.
  • Readings and exercises for next week will likely reinforce data collection, EDA, and model evaluation practices, with hands-on practice on real datasets.
  • Students are encouraged to explore AI tools for productivity and to practice designing data pipelines and evaluating models in production settings.