CPMAI & Agile Data Projects: Comprehensive Study Notes

AI Project Foundations: Data-Centric, Iterative Approaches

  • AI projects are not like traditional software development. AI is about data; the code is a small, often non-central part of making AI work, and treating AI projects like application development leads to failure.
  • Because AI is data-driven, data-centric patterns matter more than application functionality alone.
  • Because they depend on data, AI projects follow distinct patterns and require data governance, privacy protections, and continuous maintenance.

The Seven Patterns of AI

  • AI is driven by data and can be categorized into distinct patterns. The seven patterns (as introduced) are:
    • Hyper-Personalization
    • Recognition
    • Conversation & Human Interaction
    • Predictive Analytics & Decisions
    • Autonomous Systems
    • Goal-Driven Systems
    • Patterns & Anomalies
  • These patterns influence the development approach, data requirements, evaluation criteria, and governance needs.

ROI and ROI Measurement for AI Projects

  • ROI should be justified before starting; AI projects are not cheap and require upfront investments (money, time, resources).
  • ROI can be determined via multiple metrics, including:
    • Resource savings
    • Cost savings
    • Time savings
  • General ROI formula (illustrative; a minimal calculation sketch follows this list):
    \text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%.
  • When measuring ROI, remember: the project should produce a positive return. Do not pursue AI for the sake of AI.
  • Critical pre‑project questions (to be answered BEFORE starting):
    • What problem are we attempting to solve?
    • Should we solve this with AI / cognitive technology?
    • Which portions require vs. do not require AI?
    • What skills/resources are necessary and what will that cost vs. ROI?
  • ROI calculation methods may include different combinations of savings and costs; ensure a positive ROI before committing.
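  A minimal sketch of the ROI arithmetic above, in Python; the function name and the benefit/cost figures are hypothetical, not from the source:

```python
def roi_percent(benefits: float, costs: float) -> float:
    """ROI = (Benefits - Costs) / Costs * 100%."""
    if costs <= 0:
        raise ValueError("Costs must be positive to compute ROI.")
    return (benefits - costs) / costs * 100.0

# Hypothetical example: $500k of projected benefits against $200k of upfront
# investment yields 150% -- a positive return, so the project may proceed.
print(roi_percent(benefits=500_000, costs=200_000))  # 150.0
```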

Data Quantity and Data Quality

  • A common reason AI projects fail is lack of data understanding, including questions about data quantity and data types.
  • Data quantity and sourcing questions to consider:
    • How much data is needed for the AI project?
    • What specific types of data are needed?
    • Which internal and external data sources are necessary?
    • What does the overall data environment look like?
    • What are the data manipulation and transformation requirements?
  • Data quality issues (Garbage In, Garbage Out):
    • Do you know the overall quality of your data?
    • Do you have enough of the right kind of data?
    • Do you need to augment or enhance existing data?
    • What are the ongoing data gathering and preparation requirements?
  • Real-world data issues examples:
    • Amazon’s biased recruiting tool, as reported by Reuters (2018): biased training data led to biased scoring of candidates.
  • Data quality practices:
    • Ensure data quality before training; plan for data augmentation when data is insufficient or noisy.
    • Define data sources and data environment, and address data sharing, access, and governance.
  • Data quantity vs. data quality trade-offs are central to project success: too little data limits training, while even large volumes introduce noise if not cleaned (a minimal profiling sketch follows this list).
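  A minimal data-profiling sketch for the questions above, assuming a tabular dataset loadable with pandas; the file name and columns are hypothetical:

```python
import pandas as pd

# Hypothetical dataset; substitute your own internal or external source.
df = pd.read_csv("candidates.csv")

# Quantity: do you have enough of the right kind of data?
print(f"Rows: {len(df)}, Columns: {df.shape[1]}")

# Quality ("garbage in, garbage out" starts with knowing where the garbage is):
print(df.isna().mean().sort_values(ascending=False))  # fraction missing per column
print(f"Duplicate rows: {df.duplicated().sum()}")
print(df.describe(include="all"))  # ranges, cardinality, obvious outliers
```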

The Proof of Concept Trap, POC vs Pilot vs MVP

  • A Proof‑of‑Concept (POC) demonstrates that a concept works, typically in a lab or near-real environment, but is not deployed to production.
  • A Pilot is a real-world project that uses emerging technology in a protected environment and is deployed to production with limited scope.
  • An MVP (Minimum Viable Product) is a tangible early release used to validate value with minimal functionality.
  • Most AI POCs fail in even a basic real‑world pilot because extensive upfront work (algorithm selection, data curation, model testing, hyperparameter tuning) is required, and POCs/prototypes often don’t deliver real ROI.
  • The real value comes from iterative development, not from a long, isolated prototype phase.

The Real World vs the Model

  • A model that performs well in a controlled setting may fail in the real world due to distribution differences, environment changes, or data drift.
  • Key questions for model validation and deployment:
    • Does the model meet accuracy, precision, recall, and other metrics in real deployments?
    • Is the model compatible with the operationalization approach (inference, latency, scalability)?
    • How will you monitor model performance and versioning, and how will you iterate?
  • Illustrative example: a radiology model trained on Stanford Hospital data may degrade when deployed to an older hospital with different imaging equipment; humans can adapt, but AI models may not without re-training and broader data coverage.
  • Model validation and ongoing monitoring are essential to ensure reliability after deployment (a minimal metrics check is sketched below).
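  A minimal sketch of checking deployment metrics with scikit-learn; the labels and predictions are hypothetical stand-ins for data collected from a real deployment:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth and model predictions sampled from production.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75

# If these fall well below offline evaluation results, suspect a distribution
# difference between the training data and the deployment environment.
```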

Data Drift, Model Drift, and Continuous Lifecycles

  • AI lifecycles are continuous, not one-off. Data and the world evolve, so models drift and degrade over time.
  • What drives failure in production?
    • Not budgeting for model maintenance and monitoring
    • Ignoring drift, data quality changes, and unseen edge cases
  • Example: Tay, Microsoft’s AI chatbot, quickly began producing racist output after exposure to adversarial user input on Twitter, illustrating the need for ongoing content monitoring and safeguards.
  • Practical takeaway: budget for ongoing retraining, data updates, governance, and monitoring to maintain model accuracy and safety (a minimal drift check is sketched below).
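  The transcript does not prescribe a drift metric, so as one common illustrative choice, here is a minimal Population Stability Index (PSI) sketch in numpy; the bin count, the ~0.25 threshold, and the simulated data are conventional assumptions, not from the source:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """Compare a feature's training-time distribution ('expected')
    against its current production distribution ('actual')."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) in sparsely populated bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Simulated drift: production data has shifted mean and variance.
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
prod_feature = rng.normal(0.5, 1.2, 10_000)
psi = population_stability_index(train_feature, prod_feature)
print(psi)  # PSI above ~0.25 conventionally signals drift worth retraining for
```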

Vendor Hype, Product Mismatch, and Overpromising

  • Common vendor pitfalls: misalignment between promises and real capabilities, product mismatches, overhype, and oversell.
  • Examples discussed:
    • IBM Watson Health case (over-promised and under-delivered in the healthcare context)
    • Amazon’s biased recruiting tool (bias results from biased training data)
  • Key remedy: rigorous upfront questions, independent validation beyond vendor demos, and alignment of vendor capabilities with concrete project requirements.

Data-Centric Agile and Iterative Development for AI Projects

  • Data projects differ from traditional application development; data continuously changes and requires governance, privacy, and security controls.
  • Agile and data-centric approaches can be combined, but require adaptation:
    • Agile focuses on iteration, delivery, and responding to change.
    • Data-centric approaches focus on data understanding, quality, governance, and the data pipeline.
    • Neither approach alone suffices; combine them wisely to address data realities.
  • Four key ideas:
    • Data projects are not merely about functionality; they revolve around extracting insights and actions from data.
    • Data iteration is essential: you must deliver value early with rapid iterations while improving data quality and understanding.
    • Governance, privacy, and data ownership are central concerns.
    • You need to evolve agile practices to accommodate data prep, data quality, and data governance tasks.

Agile and Data-Centric Methodologies: A Family of Frameworks

  • CRISP-DM (Cross-Industry Standard Process for Data Mining): a data-centric framework with six phases used in iterative cycles. Phases:
    1) Business Understanding
    2) Data Understanding
    3) Data Preparation
    4) Modeling
    5) Evaluation
    6) Deployment
  • CPMAI (Cognitive Project Management for AI): extends CRISP‑DM with AI/ML specifics and agile practices. Six phases (iterative):
    1) Business Understanding
    2) Data Understanding
    3) Data Preparation
    4) Model Development
    5) Model Evaluation
    6) Model Operationalization
  • TDSP (Team Data Science Process): a data science lifecycle framework with a defined start and end; less iterative than CPMAI.
  • Key difference: CPMAI explicitly integrates agile practices and AI specifics and remains highly iterative; CRISP-DM provides a solid data-mining backbone; TDSP emphasizes the data science lifecycle with tooling guidance (the shared iterative structure is sketched below).
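  A minimal sketch of that shared iterative structure, with CPMAI's six phases expressed as a repeating loop; the function names are hypothetical placeholders:

```python
CPMAI_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Model Development",
    "Model Evaluation",
    "Model Operationalization",
]

def run_iteration(execute_phase, passes_go_no_go) -> bool:
    """One CPMAI iteration: walk the six phases in order, applying a
    go/no-go gate after each; returns True if the iteration completes."""
    for phase in CPMAI_PHASES:
        execute_phase(phase)
        if not passes_go_no_go(phase):
            return False  # revisit earlier phases in the next iteration
    return True

# Example gate: proceed only while each phase's checklist is satisfied.
run_iteration(lambda p: print(f"Executing {p}"), lambda p: True)

# Unlike a one-off waterfall, iterations repeat over the life of the system,
# re-entering business and data understanding as data and the world change.
```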

Adapting Agile for Data Projects

  • Agility for data requires adapting roles and processes:
    • Re-imagined Product Owner: someone who understands the data product and data lifecycle (collection, ingestion, preparation, transformation, usage, consumption)
    • Re-imagined Scrum Master/Team Lead: understands data lifecycle impact on timeframes and supports data roles
    • Development Members: data scientists, data engineers, ETL specialists, BI, data governance owners, etc.
    • Stakeholders: data analysts, data scientists, data end-users, governance, legal, compliance, privacy officers, etc.
  • Time-boxed iterations are essential; plan, execute, demo, and retrospective within each sprint.
  • Documentation and governance remain important, especially for regulatory/compliance contexts.

What to Implement: CPMAI in Practice

  • CPMAI workbook provides a practical, artifact-driven approach across six phases, with phase-specific questions, go/no-go checklists, ROI expectations, and data governance considerations.
  • The CPMAI framework emphasizes iteration for each project and supports continuous improvement through artifact-based tracking.
  • Practical takeaway: use CPMAI as a practical methodology for AI projects, with an emphasis on data-first, AI-relevant, iterative cycles.

Case Examples and Practical Metrics

  • Barclays (financial services):
    • 300% increase in the development of new data projects
    • 50% reduction in code and data complexity across 80+ applications
    • Ability to test and verify code at completion improved by over 50%
  • Netflix: about 75% of views are driven by the recommendation engine; data powers many projects with an agile approach enabling faster delivery
  • Panera Bread: digital sales grew to account for 25% of overall sales; improvements in speed and efficiency of delivery
  • Fitbit: launched 4 new products and shipped 22 million devices in the first year; enhanced team engagement and forecasting capabilities

Real-World Implementation: Practical Considerations

  • Agile data challenges include: lack of defined end, difficulty estimating data tasks, data quality uncertainties, governance and privacy concerns, and the need for ongoing data preparation and data ops.
  • The data product mindset requires broad stakeholder involvement across data analysts, data scientists, end users, governance, legal, and privacy functions.
  • Vendor procurement should emphasize aligning capabilities with actual project needs, avoiding a “we’re an X-vendor shop” mentality that forces a favored vendor’s tools onto every problem.
  • A well-executed data initiative often combines agile iterations with data-centric governance, testing, and monitoring to ensure ongoing value delivery.

Data Governance, Privacy, and Trust in AI Systems

  • Privacy vs. convenience: increasing personalization raises concerns about data sharing and surveillance.
  • Transparency vs. security: greater openness can reduce security; need to balance understanding of AI decision-making with robust safeguards.
  • Trust and user experience: addressing the Uncanny Valley, ensuring the human touch remains where appropriate, and designing for user comfort with automation.
  • Case implications: incidents like the Henn‑Na Hotel robot failures illustrate the risk of uncanny user experiences and the value of blending automation with human interaction.

The Uncanny Valley and User Trust

  • The Uncanny Valley describes how objects that look almost-but-not-quite-human can provoke eeriness or revulsion.
  • For data systems, the data uncanny valley manifests in privacy concerns, perceived surveillance, and erosion of trust when systems know too much about users.
  • Practical design takeaway: avoid overreach, preserve user agency, and prioritize human-centric design where appropriate.

Practical Takeaways and Strategic Guidance

  • Think big and strategically, but start small, implement incrementally, and iterate often.
  • Ensure data understanding and quality are at the core of your AI program; invest heavily in data collection, cleaning, and governance (the 80/20 rule often cited: 80% data effort, 20% model development).
  • Align ROI with explicit data-driven savings: cost savings, time savings, resource savings, and measurable business value.
  • Use iterative, data-centric methodologies (CPMAI/CRISP-DM) to guide AI project lifecycles, with agile practices adapted to data realities.
  • Manage expectations with stakeholders using clear problem definitions, realistic milestones, and transparent communication about capabilities and limits of AI.
  • Build an integrated team and governance structure that includes data engineers, data scientists, business analysts, and governance/legal/privacy experts.

Key Formulas and Quantitative References

  • ROI (illustrative):
    \text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%.
  • Data Quality Principle: "Garbage in, garbage out" emphasizes the need for high-quality data inputs to achieve reliable AI outputs.
  • When discussing improvements, use data-driven metrics such as:
    • Number of data projects started/completed (e.g., Barclays: +300% in new data projects)
    • Reduction in data/code complexity (e.g., Barclays: -50%)
    • Proportion of business outcomes driven by AI (e.g., Netflix: 75% views from recommendations)
  • Model drift and data drift concepts imply ongoing monitoring equations and maintenance budgets, though explicit numerical formulas for drift are domain-specific and not provided in this transcript.

Quick Reference: Terminology and Frameworks

  • POC: Proof of Concept – lab/experimental demonstration, not production.
  • Pilot: Real-world test in a controlled production environment.
  • MVP: Minimum Viable Product – minimal functionality to validate value.
  • CRISP-DM: Six phases for data mining projects, iterative.
  • CPMAI: Cognitive Project Management for AI – six iterative phases, AI-specific and agile-friendly.
  • TDSP: Team Data Science Process – data science lifecycle with tooling guidance; less iterative than CPMAI.
  • Agile: Iterative, incremental development with emphasis on responding to change.
  • Data-centric: Emphasis on data quality, data governance, data preparation, data lifecycle.
  • The Seven Patterns of AI: Hyper-Personalization, Recognition, Conversation & Human Interaction, Predictive Analytics & Decisions, Autonomous Systems, Goal-Driven Systems, Patterns & Anomalies.

Final Takeaway

  • To succeed in AI projects, organizations must adopt a data-first, iterative, governance-aware approach, blending agile practices with data-centric methodologies (CRISP-DM, CPMAI, TDSP). ROI should be established upfront with explicit data-driven metrics, and real-world validation must account for data drift, model drift, and user trust. Continuous monitoring, responsible data practices, and prudent vendor selection are essential to sustainable AI success.