Exhaustive Study Notes on Data Engineering in Professional Sports

Introduction and Overview

  • Greetings and acknowledgment of the audience and participants.

    • Mention of Southwest Research Institute and their presence in the coming days.

    • Introduction of Tegan Ashby, director of software engineering at the Philadelphia Phillies.

Background of Speaker: Tegan Ashby

  • Leading application development and data engineering for the Phillies.

  • Experienced in professional sports analytics.

  • Argues that data engineering should be established before hiring data scientists.

Importance of Data Engineering

  • Suggests that the first technical hire should be a data engineer.

  • Explanation of the rationale behind this hiring strategy:

    • Emphasizes that data engineering lays the foundation for analytics success.

    • Underlines the role of data engineers in unifying data, ensuring reliability, and delivering actionable insights.

Structure of the Philadelphia Phillies Software Engineering Department

  • Overview of the organization:

    • Three core teams: application development, data engineering, infrastructure.

    • Two cross-functional teams: Machine Learning (ML) platform and applied biomechanics.

    • Total of approximately two dozen engineers.

  • Main goal: "Turn data into information into action."

Historical Context of Analytics in the Phillies

  • The Phillies are noted as the franchise with the most losses in professional sports history and as the last team in MLB to fully embrace analytics.

    • Reference to a 2015 ESPN analytics ranking where the Phillies ranked last.

  • The journey from a skeptical stance on analytics to becoming a leading analytics department in MLB.

  • Importance of learning from past failures and recognizing the need for engineering fundamentals.

Establishing an Analytics Framework

  • The Gift of Hindsight: Learning from prior failures to create robust engineering practices.

  • Principal goals shared across organizations involved in sports analytics:

    • Gaining competitive advantage.

    • Providing actionable insights based on data to enhance performance.

Key Responsibilities of the Data Engineer

  • Role and responsibilities discussed in depth:

    • Unifying data sources.

    • Ensuring data reliability.

    • Delivering insights through user-friendly tools.

  • Emphasis on the essential nature of incorporating engineering early on in the analytics process.

Common Pitfalls in Data Management

  • A storytelling approach to common initial pitfalls:

    • Disorganized data ingestion methods (using PDFs, various formats without a schema).

    • Slow processing of data and mislabeling issues resulting in unreliable reports.

    • Need for integration between analytics and operational systems.

  • Realization that the existence of data does not equate to actionable insights.
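
The schema-less ingestion pitfall above can be illustrated with a minimal Python sketch that validates rows at the point of ingestion. The record type, field names, and the velocity bound are all hypothetical and were not given in the talk; the sketch only shows the general idea of rejecting bad rows with a reason instead of letting them reach reports.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PitchRecord:
    """Hypothetical schema for one tracked pitch (field names are illustrative)."""
    player_id: int
    session_date: date
    velocity_mph: float


def parse_row(row: dict) -> PitchRecord:
    """Convert a raw ingested row into a typed record, or raise ValueError."""
    try:
        record = PitchRecord(
            player_id=int(row["player_id"]),
            session_date=date.fromisoformat(row["session_date"]),
            velocity_mph=float(row["velocity_mph"]),
        )
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc
    except TypeError as exc:
        raise ValueError(f"wrong type: {exc}") from exc
    if not 0 < record.velocity_mph < 130:  # assumed plausibility bound
        raise ValueError(f"implausible velocity: {record.velocity_mph}")
    return record


def ingest(rows: list[dict]) -> tuple[list[PitchRecord], list[tuple[dict, str]]]:
    """Split rows into valid records and rejected rows paired with a reason."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(parse_row(row))
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return valid, rejected
```

Keeping the rejected rows alongside their reasons, instead of silently dropping them, gives the team a way to trace mislabeled or malformed inputs back to their source.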

Data Hierarchy of Needs

  • The Pyramid Model of Data Requirements:

    • Engineering as the base layer of the hierarchy, listed from top to bottom: Explore, Transform, Move, Store, Collect.

    • Analytics (including machine learning) as the upper layer, dependent upon the stability of the lower foundational layers.

Data Flow and Lifecycle Stages

  • Breakdown of the data lifecycle:

    • Data Generation: Sourcing data from competitions, tracking sessions, etc.

    • Data Storage and Processing:

      • Application databases (e.g., MySQL, Postgres).

      • Transition to using data warehouses (e.g., BigQuery, Snowflake) for complex data needs.

    • Pipeline Management (ETL and ELT processes): Automation to streamline data workflows and minimize errors.
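
As a rough illustration of the ETL pattern named above, the following Python sketch runs extract, transform, and load steps against an in-memory SQLite database. SQLite stands in for a warehouse such as BigQuery or Snowflake, and the table name, column names, and sample rows are hypothetical and not taken from the talk.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a tracking session export.
RAW_ROWS = [
    {"player": " Smith ", "velo": "94.5"},
    {"player": "Jones", "velo": "88.1"},
    {"player": "Smith", "velo": "95.5"},
]


def extract() -> list[dict]:
    """Extract: a real pipeline would read a file, an API, or a database."""
    return RAW_ROWS


def transform(rows: list[dict]) -> list[tuple[str, float]]:
    """Transform: trim whitespace from names and cast velocities to floats."""
    return [(r["player"].strip(), float(r["velo"])) for r in rows]


def load(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Load: write cleaned rows into a table (sqlite3 stands in for a warehouse)."""
    conn.execute("CREATE TABLE IF NOT EXISTS pitches (player TEXT, velo REAL)")
    conn.executemany("INSERT INTO pitches VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Run extract -> transform -> load, then query average velocity per player."""
    load(conn, transform(extract()))
    cursor = conn.execute(
        "SELECT player, AVG(velo) FROM pitches GROUP BY player ORDER BY player"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(run_pipeline(sqlite3.connect(":memory:")))
```

An ELT variant would load the raw rows first and run the cleanup as SQL inside the warehouse; the ordering of steps differs, but the goal of replacing manual handling with automation is the same.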

Concepts Relevant to Data Engineering

  • DAG (Directed Acyclic Graph): Explained as a model for workflow management.

    • Contains components like scheduling, task management, dependencies, and callbacks.

  • Serving Layer and its Impact:

    • Focus on delivering insights effectively to end users (coaches, athletes).

    • Importance of APIs for data accessibility.
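
The DAG model noted in this section can be sketched in Python with the standard library's graphlib module. The task names and the failure callback below are hypothetical; the sketch covers dependencies and callbacks only, and leaves scheduling and retries to a real orchestration tool.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical task names; each maps to the set of tasks it depends on.
DEPENDENCIES: dict[str, set[str]] = {
    "extract_tracking": set(),
    "extract_roster": set(),
    "transform_join": {"extract_tracking", "extract_roster"},
    "load_warehouse": {"transform_join"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_dag(
    tasks: dict[str, Callable[[], None]],
    dependencies: dict[str, set[str]],
    on_failure: Callable[[str, Exception], None],
) -> list[str]:
    """Run tasks in dependency order and return the names that completed.

    Simplification: the run stops at the first failure after calling the
    on_failure callback, even for tasks that do not depend on the failed one.
    """
    completed: list[str] = []
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            on_failure(name, exc)
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Each placeholder task just reports its own name when run.
    demo_tasks = {name: (lambda n=name: print("ran", n)) for name in DEPENDENCIES}
    run_dag(demo_tasks, DEPENDENCIES, on_failure=lambda n, e: print("failed", n, e))
```

Because every task declares what it depends on, a task such as the dashboard refresh never runs before the warehouse load it relies on, which is the property that makes the DAG a useful workflow model.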

Hiring Strategy for Data Engineers

  • Insights on selecting the right candidates:

    • Need for technical competency in Python, SQL, cloud data warehouses, and orchestration tools.

    • Discussion of roles and skills necessary for technical hires to avoid burdening analysts and data scientists with engineering tasks.

Technical Considerations for Startups

  • Evaluation of utilizing foundational tools:

    • Importance of starting with a simple system and incrementally upgrading.

    • Concerns around over-engineering, especially for small teams.

Closing Remarks

  • Reinforcement of essential takeaways:

    • Data is pivotal for performance in elite sports, but its value depends on data management and engineering practices being in place first.

    • Foundation-building is paramount to the growth of analytics operations and insights.