Exhaustive Study Notes on Data Engineering in Professional Sports
Introduction and Overview
Greetings and acknowledgment of the audience and participants.
Mention of Southwest Research Institute and their presence in the coming days.
Introduction of Tegan Ashby, director of software engineering at the Philadelphia Phillies.
Background of Speaker: Tegan Ashby
Leading application development and data engineering for the Phillies.
Experienced in professional sports analytics.
Argues that data engineering should be in place before data scientists are hired.
Importance of Data Engineering
Suggests that the first technical hire should be a data engineer.
Explanation of the rationale behind this hiring strategy:
Emphasizes that data engineering lays the foundation for analytics success.
Underlines the role of data engineers in unifying data, ensuring reliability, and delivering actionable insights.
Structure of the Philadelphia Phillies Software Engineering Department
Overview of the organization:
Three core teams: application development, data engineering, infrastructure.
Two cross-functional teams: Machine Learning (ML) platform and applied biomechanics.
Total of approximately two dozen engineers.
Main goal: "Turn data into information into action."
Historical Context of Analytics in the Phillies
Acknowledgment that the Phillies have more losses than any other franchise in professional sports history and were the last team in MLB to fully embrace analytics.
Reference to a 2015 ESPN analytics ranking where the Phillies ranked last.
The journey from a skeptical stance on analytics to becoming a leading analytics department in MLB.
Importance of learning from past failures and recognizing the need for engineering fundamentals.
Establishing an Analytics Framework
The Gift of Hindsight: Learning from prior failures to create robust engineering practices.
Principal goals shared across organizations involved in sports analytics:
Gaining competitive advantage.
Providing actionable insights based on data to enhance performance.
Key Responsibilities of the Data Engineer
Role and responsibilities discussed in depth:
Unifying data sources.
Ensuring data reliability.
Delivering insights through user-friendly tools.
Emphasis on bringing engineering into the analytics process early.
Common Pitfalls in Data Management
A storytelling approach to common initial pitfalls:
Disorganized data ingestion methods (using PDFs, various formats without a schema).
Slow data processing and mislabeled data, resulting in unreliable reports.
Need for integration between analytics and operational systems.
Realization that the existence of data does not equate to actionable insights.
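The schema-less ingestion pitfall above can be illustrated with a small validation step. This is a minimal sketch, not the Phillies' actual approach: the record type `PitchRecord`, its field names, the allowed pitch labels, and the velocity bounds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PitchRecord:
    """Hypothetical schema for one tracked pitch; field names are illustrative."""
    player_id: int
    session_date: date
    pitch_type: str
    velocity_mph: float


ALLOWED_PITCH_TYPES = {"FF", "SL", "CH", "CU"}  # assumed label set


def parse_pitch(raw: dict) -> PitchRecord:
    """Convert one raw row into a typed record, raising ValueError on bad input."""
    try:
        record = PitchRecord(
            player_id=int(raw["player_id"]),
            session_date=date.fromisoformat(raw["session_date"]),
            pitch_type=str(raw["pitch_type"]).strip().upper(),
            velocity_mph=float(raw["velocity_mph"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"missing or malformed field: {exc}") from exc
    if record.pitch_type not in ALLOWED_PITCH_TYPES:
        raise ValueError(f"unknown pitch type: {record.pitch_type}")
    if not 30.0 <= record.velocity_mph <= 110.0:  # assumed plausibility bounds
        raise ValueError(f"implausible velocity: {record.velocity_mph}")
    return record


def ingest(rows):
    """Split raw rows into accepted records and (row, reason) rejections."""
    accepted, rejected = [], []
    for raw in rows:
        try:
            accepted.append(parse_pitch(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return accepted, rejected
```

Rejected rows are kept with a reason rather than silently dropped, so mislabeled or incomplete data surfaces at ingestion instead of in a report.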
Data Hierarchy of Needs
The Pyramid Model of Data Requirements:
Engineering as the base layers of the hierarchy, listed here from top to bottom: Explore, Transform, Move, Store, Collect (collection sits at the very bottom).
Analytics (including machine learning) as the upper layer, dependent upon the stability of the lower foundational layers.
Data Flow and Lifecycle Stages
Breakdown of the data lifecycle:
Data Generation: Sourcing data from competitions, tracking sessions, etc.
Data Storage and Processing:
Application databases (e.g., MySQL, Postgres).
Transition to using data warehouses (e.g., BigQuery, Snowflake) for complex data needs.
Pipeline Management (ETL and ELT processes): Automation to streamline data workflows and minimize errors.
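The lifecycle above can be sketched as a tiny ETL pipeline. This is an illustrative example only: it uses an in-memory SQLite database as a stand-in for a real warehouse, and the sample CSV, the `pitches` table, and the function names are assumptions, not anything described in the talk. Transforming before loading makes this ETL; an ELT variant would load the raw rows first and do the cleanup in warehouse SQL.

```python
import csv
import io
import sqlite3

# Made-up sample data; the last row is missing its velocity on purpose.
RAW_CSV = """player_id,session_date,velocity_mph
12,2024-03-01,95.2
12,2024-03-01,93.8
34,2024-03-01,
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no velocity and cast fields to proper types."""
    clean = []
    for row in rows:
        if not row["velocity_mph"]:
            continue
        clean.append(
            (int(row["player_id"]), row["session_date"], float(row["velocity_mph"]))
        )
    return clean


def load(conn, rows):
    """Load: write typed rows into the (hypothetical) pitches table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pitches "
        "(player_id INTEGER, session_date TEXT, velocity_mph REAL)"
    )
    conn.executemany("INSERT INTO pitches VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def run_pipeline(conn, text):
    """Run extract, transform, and load in order; return rows loaded."""
    return load(conn, transform(extract(text)))
```

A usage example: `run_pipeline(sqlite3.connect(":memory:"), RAW_CSV)` loads the two complete rows and skips the incomplete one.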
Concepts Relevant to Data Engineering
DAG (Directed Acyclic Graph): Explained as a model for workflow management.
Contains components like scheduling, task management, dependencies, and callbacks.
Serving Layer and its Impact:
Focus on delivering insights effectively to end users (coaches, athletes).
Importance of APIs for data accessibility.
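Returning to the DAG model in this section, the sketch below shows task management, dependencies, and a failure callback using Python's standard-library `graphlib`. It is a toy under stated assumptions: `run_dag` and its parameters are hypothetical names, scheduling is left out entirely, and a real team would typically rely on an orchestration tool rather than hand-rolled code like this.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies, on_failure=None):
    """Run tasks in dependency order and return the names that completed.

    tasks:        maps task name -> zero-argument callable
    dependencies: maps task name -> set of upstream task names
    on_failure:   optional callback(name, exception) invoked when a task raises
    """
    # Building the full order eagerly raises graphlib.CycleError if the
    # graph contains a cycle, i.e. if it is not actually a DAG.
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    failed = set()
    for name in order:
        upstream = set(dependencies.get(name, ()))
        if upstream & failed:
            # An upstream task failed or was skipped, so skip this one too.
            failed.add(name)
            continue
        try:
            tasks[name]()
            completed.append(name)
        except Exception as exc:
            failed.add(name)
            if on_failure is not None:
                on_failure(name, exc)
    return completed
```

Skipping downstream tasks after a failure is a design choice made here to keep bad or partial data from propagating; other policies (retries, partial runs) are equally possible.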
Hiring Strategy for Data Engineers
Insights on selecting the right candidates:
Need for technical competency in Python, SQL, cloud data warehouses, and orchestration tools.
Discussion of the roles and skills technical hires need so that analysts and data scientists are not burdened with engineering tasks.
Technical Considerations for Startups
Evaluation of utilizing foundational tools:
Importance of starting with a simple system and incrementally upgrading.
Concerns around over-engineering, especially for small teams.
Closing Remarks
Reinforcement of essential takeaways:
Data is pivotal for performance in high-caliber sports, but its value depends on data management and engineering practices being in place first.
Foundation-building is paramount to the growth of analytics operations and insights.