CPMAI & Agile Data Projects: Comprehensive Study Notes
AI Project Foundations: Data-Centric, Iterative Approaches
- AI projects are not like traditional software development. AI is about data; the code is a small, often non‑central part of making AI work. Treating AI projects like application development leads to failure.
- AI is data-driven; data-centric patterns matter more than application functionality alone.
- Because AI projects depend on data, they follow distinct patterns and require data governance, privacy protections, and continuous maintenance.
The Seven Patterns of AI
- AI is driven by data and can be categorized into distinct patterns. The seven patterns (as introduced) include:
- Hyper-Personalization
- Recognition
- Conversation & Human Interaction
- Predictive Analytics & Decisions
- Autonomous Systems/Analytics
- Goal-Driven Systems
- Patterns & Anomalies
- These patterns influence the development approach, data requirements, evaluation criteria, and governance needs.
ROI and ROI Measurement for AI Projects
- ROI should be justified before starting; AI projects are not cheap and require upfront investments (money, time, resources).
- ROI can be determined via multiple metrics, including:
- Resource savings
- Cost savings
- Time savings
- General ROI formula (illustrative; a minimal calculation sketch follows this list):
  $$\text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%$$
- When measuring ROI, remember: the project should produce a positive return. Do not pursue AI for the sake of AI.
- Critical pre‑project questions (to be answered BEFORE starting):
- What problem are we attempting to solve?
- Should we solve this with AI / cognitive technology?
- Which portions require vs. do not require AI?
- What skills/resources are necessary and what will that cost vs. ROI?
- ROI calculation methods may include different combinations of savings and costs; ensure a positive ROI before committing.
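To make the formula concrete, here is a minimal Python sketch. The dollar figures and the way benefits are aggregated from the three savings categories are hypothetical placeholders, not values from the course:

```python
# Minimal ROI sketch using the formula above; all figures are hypothetical.

def roi_percent(benefits: float, costs: float) -> float:
    """ROI = (Benefits - Costs) / Costs * 100%."""
    if costs <= 0:
        raise ValueError("Costs must be positive to compute ROI.")
    return (benefits - costs) / costs * 100.0

# Hypothetical example: benefits aggregated from the three savings types above.
benefits = 120_000 + 45_000 + 30_000   # resource + cost + time savings ($)
costs = 150_000                        # upfront money, time, resources ($)

print(f"ROI: {roi_percent(benefits, costs):.1f}%")  # ROI: 30.0%
```

A positive result like this would pass the "positive return" gate; a negative one is the signal to stop before committing resources.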
Data Quantity and Data Quality
- A common reason AI projects fail is lack of data understanding, including questions about data quantity and data types.
- Data quantity issues to consider:
- How much data is needed for the AI project?
- What specific types of data are needed?
- Which internal and external data sources are necessary?
- What are the data environment and data sources?
- What are the data manipulation and transformation requirements?
- Data quality issues (Garbage In, Garbage Out):
- Do you know the overall quality of your data?
- Do you have enough of the right kind of data?
- Do you need to augment or enhance existing data?
- What are the ongoing data gathering and preparation requirements?
- Real-world data issues examples:
- Amazon’s biased recruiting tool revealed by Reuters (2018): biased data led to biased scoring of candidates.
- Data quality practices:
- Ensure data quality before training; plan for data augmentation when data is insufficient or noisy.
- Define data sources and data environment, and address data sharing, access, and governance.
- Data quantity vs. data quality trade-offs are central to project success; too little data limits training, too much data can introduce noise if not cleaned.
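As a starting point for the quality questions above, here is a minimal pandas profiling sketch. The sample columns and the 30% missingness threshold are hypothetical; adapt them to your own dataset:

```python
# Minimal data-quality profiling sketch with pandas; sample data is hypothetical.
import pandas as pd

def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column dtype, missingness, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,  # share of missing values
        "n_unique": df.nunique(),               # cardinality check
    })

# Hypothetical sample standing in for a real source table.
df = pd.DataFrame({
    "age": [34, 29, None, 41, None],
    "role": ["eng", "eng", "sales", None, "sales"],
})
report = profile_quality(df)
print(report)

# Simple go/no-go style check: flag columns too sparse to train on.
too_sparse = report[report["missing_pct"] > 30]  # illustrative threshold
print("Columns needing augmentation or exclusion:", list(too_sparse.index))
```

Flagged columns feed directly into the augmentation question above: gather more data, impute, or exclude.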
The Proof of Concept Trap, POC vs Pilot vs MVP
- A Proof‑of‑Concept (POC) demonstrates a concept in an almost real environment but is not deployed in production.
- A Pilot is a real-world project that uses emerging technology in a protected environment and is deployed to production with limited scope.
- An MVP (Minimum Viable Product) is a tangible early release used to validate value with minimal functionality.
- Most AI POCs fail in even a basic real‑world pilot because extensive upfront work (algorithm selection, data curation, model testing, hyperparameter tuning) is required, and POCs/prototypes often don’t deliver real ROI.
- The real value comes from iterative development, not from a long, isolated prototype phase.
The Real World vs the Model
- A model that performs well in a controlled setting may fail in the real world due to distribution differences, environment changes, or data drift.
- Key questions for model validation and deployment:
- Does the model meet accuracy, precision, recall, and other metrics in real deployments?
- Is the model compatible with the operationalization approach (inference, latency, scalability)?
- How will you monitor model performance and versioning, and how will you iterate?
- Illustrative example: a radiology model trained on Stanford Hospital data may degrade when deployed to an older hospital with different imaging equipment; humans can adapt, but AI models may not without re-training and broader data coverage.
- Model validation and ongoing monitoring are essential to ensure reliability after deployment.
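A minimal scikit-learn sketch of checking deployment-time metrics against labeled real-world cases; the labels and predictions shown are illustrative:

```python
# Sketch: validate a deployed model against labeled real-world data.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # labels from the deployment environment
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions on the same cases

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")

# Compare against the metrics observed in the controlled setting; a large
# gap suggests distribution shift (e.g., different imaging equipment) and
# a need for retraining on broader data.
```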
Data Drift, Model Drift, and Continuous Lifecycles
- AI lifecycles are continuous, not one-off. Data and the world evolve, so models drift and degrade over time.
- What drives failure in production?
- Not budgeting for model maintenance and monitoring
- Ignoring drift, data quality changes, and unseen edge cases
- Example: Tay, Microsoft’s AI chatbot, quickly began producing racist output after exposure to malicious online input, illustrating the need for ongoing content monitoring and safeguards.
- Practical takeaway: budget for ongoing retraining, data updates, governance, and monitoring to maintain model accuracy and safety.
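One common way to operationalize drift monitoring is the Population Stability Index (PSI); this metric is an illustrative choice, not one prescribed by the course. A minimal NumPy sketch:

```python
# Drift-monitoring sketch using the Population Stability Index (PSI).
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time feature sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) / division by zero in empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5_000)   # feature as seen at training time
prod = rng.normal(0.4, 1.2, 5_000)    # shifted production data

print(f"PSI = {psi(train, prod):.3f}")  # > 0.2 is often read as major drift
```

Running such a check on a schedule, and budgeting for the retraining it triggers, is the practical form of the maintenance takeaway below.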
Vendor Hype, Product Mismatch, and Overpromising
- Common vendor pitfalls: misalignment between promises and real capabilities, product mismatches, overhype, and oversell.
- Examples discussed:
- IBM Watson Health case (over-promised and under-delivered in the healthcare context)
- Amazon’s biased recruiting tool (bias results from biased training data)
- Key remedy: rigorous upfront questions, independent validation beyond vendor demos, and alignment of vendor capabilities with concrete project requirements.
Data-Centric Agile and Iterative Development for AI Projects
- Data projects differ from traditional application development; data continuously changes and requires governance, privacy, and security controls.
- Agile and data-centric approaches can be combined, but require adaptation:
- Agile focuses on iteration, delivery, and responding to change.
- Data-centric approaches focus on data understanding, quality, governance, and the data pipeline.
- Neither approach alone suffices; combine them wisely to address data realities.
- Four key ideas:
- Data projects are not merely about functionality; they revolve around extracting insights and actions from data.
- Data iteration is essential: you must deliver value early with rapid iterations while improving data quality and understanding.
- Governance, privacy, and data ownership are central concerns.
- You need to evolve agile practices to accommodate data prep, data quality, and data governance tasks.
Agile and Data-Centric Methodologies: A Family of Frameworks
- CRISP-DM (Cross-Industry Standard Process for Data Mining): a data-centric framework with six phases used in iterative cycles. Phases:
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Modeling
5) Evaluation
6) Deployment
- CPMAI (Cognitive Project Management for AI): extends CRISP‑DM with AI/ML specifics and agile practices. Six phases (iterative):
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Model Development
5) Model Evaluation
6) Model Operationalization
- TDSP (Team Data Science Process): a data science lifecycle framework focusing on data science needs, with a defined start and end; less iterative than CPMAI.
- Key difference: CPMAI explicitly integrates agile and AI specifics, and remains highly iterative; CRISP-DM provides a solid data mining backbone; TDSP emphasizes data science lifecycle with tooling guidance.
Adapting Agile for Data Projects
- Agility for data requires adapting roles and processes:
- Re-imagined Product Owner: someone who understands the data product and data lifecycle (collection, ingestion, preparation, transformation, usage, consumption)
- Re-imagined Scrum Master/Team Lead: understands data lifecycle impact on timeframes and supports data roles
- Development Members: data scientists, data engineers, ETL specialists, BI, data governance owners, etc.
- Stakeholders: data analysts, data scientists, data end-users, governance, legal, compliance, privacy officers, etc.
- Time-boxed iterations are essential; plan, execute, demo, and retrospective within each sprint.
- Documentation and governance remain important, especially for regulatory/compliance contexts.
What to Implement: CPMAI in Practice
- CPMAI workbook provides a practical, artifact-driven approach across six phases, with phase-specific questions, go/no-go checklists, ROI expectations, and data governance considerations.
- The CPMAI framework emphasizes iteration for each project and supports continuous improvement through artifact-based tracking.
- Practical takeaway: use CPMAI as a practical methodology for AI projects, with an emphasis on data-first, AI-relevant, iterative cycles.
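To illustrate the iterative, gated structure, here is a minimal sketch modeling one CPMAI iteration with go/no-go gates. The gate callback and its logic are hypothetical stand-ins for the workbook's phase checklists; only the six phase names come from the notes above:

```python
# Sketch: one CPMAI iteration with go/no-go gates between phases.
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Model Development",
    "Model Evaluation",
    "Model Operationalization",
]

def run_iteration(go_no_go) -> bool:
    """Run one CPMAI iteration; stop early if a phase gate fails."""
    for phase in PHASES:
        if not go_no_go(phase):
            print(f"No-go at {phase}: revisit earlier phases next iteration.")
            return False
        print(f"{phase}: go")
    return True

# Hypothetical gate, e.g. checking the phase's workbook artifacts/checklists.
completed = run_iteration(lambda phase: phase != "Model Evaluation")
```

The point of the loop shape is that a failed gate does not kill the project; it sends the team back into the next iteration with better data or a reframed problem.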
Case Examples and Practical Metrics
- Barclays (financial services):
- 300% increase in the development of new data projects
- 50% reduction in code and data complexity across 80+ applications
- Ability to test and verify over 50% of code at completion
- Netflix: about 75% of views are driven by the recommendation engine; data powers many projects with an agile approach enabling faster delivery
- Panera Bread: digital sales grew to account for 25% of overall sales; improvements in speed and efficiency of delivery
- Fitbit: 4 new products and 22 million devices in the first year; enhanced team engagement and forecasting capabilities
Real-World Implementation: Practical Considerations
- Agile data challenges include: lack of defined end, difficulty estimating data tasks, data quality uncertainties, governance and privacy concerns, and the need for ongoing data preparation and data ops.
- The data product mindset requires broad stakeholder involvement across data analysts, data scientists, end users, governance, legal, and privacy functions.
- Vendor procurement should emphasize aligning capabilities with actual project needs and avoiding “X vendor shop” mentality.
- A well-executed data initiative often combines agile iterations with data-centric governance, testing, and monitoring to ensure ongoing value delivery.
Data Governance, Privacy, and Trust in AI Systems
- Privacy vs. convenience: increasing personalization raises concerns about data sharing and surveillance.
- Transparency vs. security: greater openness can reduce security; need to balance understanding of AI decision-making with robust safeguards.
- Trust and user experience: addressing the Uncanny Valley, ensuring the human touch remains where appropriate, and designing for user comfort with automation.
- Case implications: incidents like the Henn‑Na Hotel robot failures illustrate the risk of uncanny user experiences and the value of blending automation with human interaction.
The Uncanny Valley and User Trust
- The Uncanny Valley describes how objects that look almost-but-not-quite-human can provoke eeriness or revulsion.
- For data systems, the data uncanny valley manifests in privacy concerns, perceived surveillance, and erosion of trust when systems know too much about users.
- Practical design takeaway: avoid overreach, preserve user agency, and prioritize human-centric design where appropriate.
Practical Takeaways and Strategic Guidance
- Think big and strategically, but start small and iterate often.
- Ensure data understanding and quality are at the core of your AI program; invest heavily in data collection, cleaning, and governance (the 80/20 rule often cited: 80% data effort, 20% model development).
- Align ROI with explicit data-driven savings: cost savings, time savings, resource savings, and measurable business value.
- Use iterative, data-centric methodologies (CPMAI/CRISP-DM) to guide AI project lifecycles, with agile practices adapted to data realities.
- Manage expectations with stakeholders using clear problem definitions, realistic milestones, and transparent communication about capabilities and limits of AI.
- Build an integrated team and governance structure that includes data engineers, data scientists, business analysts, and governance/legal/privacy experts.
- ROI (illustrative):
  $$\text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%$$
- Data Quality Principle: "Garbage in, garbage out" emphasizes the need for high-quality data inputs to achieve reliable AI outputs.
- When discussing improvements, use data-driven metrics such as:
- Number of data projects started/completed (e.g., Barclays: +300% in new data projects)
- Reduction in data/code complexity (e.g., Barclays: -50%)
- Proportion of business outcomes driven by AI (e.g., Netflix: 75% views from recommendations)
- Model drift and data drift imply ongoing monitoring and maintenance budgets; explicit numerical drift formulas are domain-specific and not provided in this transcript.
Quick Reference: Terminology and Frameworks
- POC: Proof of Concept – lab/experimental demonstration, not production.
- Pilot: Real-world test in a controlled production environment.
- MVP: Minimum Viable Product – minimal functionality to validate value.
- CRISP-DM: Six phases for data mining projects, iterative.
- CPMAI: Cognitive Project Management for AI – six iterative phases, AI-specific and agile-friendly.
- TDSP: Team Data Science Process – data science lifecycle with tooling guidance; less iterative than CPMAI.
- Agile: Iterative, incremental development with emphasis on responding to change.
- Data-centric: Emphasis on data quality, data governance, data preparation, data lifecycle.
- The Seven Patterns of AI: Hyper-Personalization, Recognition, Conversation & Human Interaction, Predictive Analytics & Decisions, Autonomous Analytics & Systems, Goal-Driven Systems, Patterns/Anomalies.
Final Takeaway
- To succeed in AI projects, organizations must adopt a data-first, iterative, governance-aware approach, blending agile practices with data-centric methodologies (CRISP-DM, CPMAI, TDSP). ROI should be established upfront with explicit data-driven metrics, and real-world validation must account for data drift, model drift, and user trust. Continuous monitoring, responsible data practices, and prudent vendor selection are essential to sustainable AI success.