AWS Data Lake Migration and Analytical Roadmap Strategy Notes for Data Warehouse Roadmap Meeting Roadmap
Project Overview and Strategic Objectives
The primary goal of this initiative is to migrate the existing reporting environment into a centralized AWS-based data lake architecture.
The current environment is identified as highly complex, consisting of over physical tables, hundreds of logical alliances, and a vast number of inter-table relationships.
The strategy emphasizes a phased approach to avoid the risks associated with moving all data simultaneously.
Success of the data warehouse is predicated on universal agreement regarding definitions and Key Performance Indicators (KPIs).
High-Level Roadmap: The Three-Phase Approach
Phase 1: Data Acquisition and Ingestion
Objective: Establish a centralized data lake by bringing all source systems into AWS S3.
Methodology: Data will be moved "as-is" without transformation or business logic changes to create a consistent landing zone.
Technical Focus: Understanding source system database structures, documenting relationships, and establishing Change Data Capture (CDC) injection pipelines for streaming data.
Phase 2: Trusted Analytical Data (Silver Layer)
Objective: Convert raw operational data into trusted analytical data.
Process: Focus on data quality, standardization, and validation. This phase addresses operational structures optimized for transactions and transitions them to analytical formats.
Business Logic Definition: Developing core definitions for clinical and operational terms, including:
Definitions for an "active patient."
Criteria for "active prescriptions."
Standards for a "complete order."
Identifying "valid financial transactions."
Phase 3: Business Friendly Structures (Gold Layer)
Objective: Create structures ready for stakeholder reporting and advanced analytics using tools like DBT (Data Build Tool) or Athena for transformation logic.
Key Outputs:
Prescription performance metrics.
Facility performance metrics.
Patient analytics and financial reports.
Stakeholder Collaboration: Working closely with business leaders to define specific KPIs and calculation methodologies.
Phase 4: Integration and Testing (Optional)
Objective: Implementation of an integration testing suite to ensure data integrity across the pipeline.
Source System Inventory and Data Scale
Framework LTC: The core pharmacy platform containing patient information, facility details, pharmacy operations, and daily business activities. This represents the largest data source, exceeding .
Drug Database: Houses Medispan reference data and information essential for drug classification.
CCM: Responsible for storing prescription-related documentation and physical attachments.
Past Framework: Handles supporting processing activities.
ACPAC (Sage/Active): Contains financial statements, general ledger accounts, and payroll data.
Data Volume:
Total estimated data volume: .
Daily processing volume: approximately .
Technical Infrastructure and Tools
AWS S3: Serving as the primary landing zone and storage for Bronze, Silver, and Gold layers.
AWS DMS (Data Migration Service): Used for ingestion; specifically utilizing the serverless configuration to optimize costs.
AWS Glue: Orchestration for landing and data preparation.
Amazon Athena: Used for scanning metadata and executing transformation logic; estimated cost based on data scanning volume.
DBT (Data Build Tool): Employed for the orchestration of transformations into business-friendly structures.
SAP Business Objects (BO): The legacy reporting environment; access is required to analyze existing SQL queries and 330 specific reports to understand current data relationships.
Financial Estimates and Cost Projection
Monthly Operational Estimate: Approximately for core services.
Annual Lower-End Estimate: Approximately .
Realistic/High-End Annual Estimate: Approximately , accounting for potential increases in data size, orchestration complexity, and reporting tool licenses.
Component Costs:
AWS DMS: Estimated at for high-volume framework LTC data.
Amazon Athena: Approximately for metadata scanning.
Immediate Action Items and Future Deadlines
Validation and Approval: The plan must be reviewed and approved by Satya.
Infrastructure Setup: Work with Satya and Sandeep to establish a secure AWS zone for development.
POC (Proof of Concept): Begin moving a subset of tables to AWS to refine cost estimates and technical feasibility.
KPI Development: Start defining common KPIs with stakeholders before the full commencement of Phase 2.
Timeline for Data Object Breakdown: Odile to provide a high-level breakdown of data objects (Drug, Participant, Company, etc.) and a specific timeframe by the following week.
Questions & Discussion
Q (Seth): Regarding DMS costs, are you using serverless or a static system?
A (Odile): We are using the serverless system.
Q (Gaurav): How will you identify the specific details for drugs and patients if Rich does not like the legacy 330 reports?
A (Odile): I will work with Rich to analyze existing SQL queries in the SQL management server. Even if the old reports are not favored for the future state, they contain the logic and relationships needed to build the new warehouse.
Discussion on SAP BO Access: Rich is currently working on providing the URL and credentials for SAP Business Object access so the team can analyze current reporting logic. This is a critical dependency for understanding the "Universe" of data relationships currently in use.