CPMAI & Agile Data Projects: Comprehensive Study Notes
AI Project Foundations: Data-Centric, Iterative Approaches
- AI projects are not like traditional software development. AI is about data; the code is a small, often non‑central part of making AI work. Treating AI projects like application development leads to failure.
- AI is data-driven; data-centric patterns matter more than application functionality alone.
- Because AI projects depend on data, they follow distinct patterns and require data governance, privacy protections, and continuous maintenance.
The Seven Patterns of AI
- AI is driven by data and can be categorized into distinct patterns. The seven patterns (as introduced) include:
- Hyper-Personalization
- Recognition
- Conversation & Human Interaction
- Predictive Analytics & Decisions
- Autonomous Systems/Analytics
- Goal-Driven Systems
- Patterns & Anomalies
- These patterns influence the development approach, data requirements, evaluation criteria, and governance needs.
ROI and ROI Measurement for AI Projects
- ROI should be justified before starting; AI projects are not cheap and require upfront investments (money, time, resources).
- ROI can be determined via multiple metrics, including:
- Resource savings
- Cost savings
- Time savings
- General ROI formula (illustrative; a minimal calculation sketch follows this list):
  $$\text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%$$
- When measuring ROI, remember: the project should produce a positive return. Do not pursue AI for the sake of AI.
- Critical pre‑project questions (to be answered BEFORE starting):
- What problem are we attempting to solve?
- Should we solve this with AI / cognitive technology?
- Which portions require vs. do not require AI?
- What skills/resources are necessary and what will that cost vs. ROI?
- ROI calculation methods may include different combinations of savings and costs; ensure a positive ROI before committing.
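To make the formula concrete, here is a minimal Python sketch. The dollar figures and the way benefits are aggregated from the three savings categories are hypothetical placeholders, not values from the course:

```python
# Minimal ROI sketch using the formula above; all figures are hypothetical.

def roi_percent(benefits: float, costs: float) -> float:
    """ROI = (Benefits - Costs) / Costs * 100%."""
    if costs <= 0:
        raise ValueError("Costs must be positive to compute ROI.")
    return (benefits - costs) / costs * 100.0

# Hypothetical example: benefits aggregated from the three savings types above.
benefits = 120_000 + 45_000 + 30_000   # resource + cost + time savings ($)
costs = 150_000                        # upfront money, time, resources ($)

print(f"ROI: {roi_percent(benefits, costs):.1f}%")  # ROI: 30.0%
```

A positive result like this would pass the "positive return" gate; a negative one is the signal to stop before committing resources.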
Data Quantity and Data Quality
- A common reason AI projects fail is lack of data understanding, including questions about data quantity and data types.
- Data quantity issues to consider:
- How much data is needed for the AI project?
- What specific types of data are needed?
- Which internal and external data sources are necessary?
- What are the data environment and data sources?
- What are the data manipulation and transformation requirements?
- Data quality issues (Garbage In, Garbage Out):
- Do you know the overall quality of your data?
- Do you have enough of the right kind of data?
- Do you need to augment or enhance existing data?
- What are the ongoing data gathering and preparation requirements?
- Real-world data issues examples:
- Amazon’s biased recruiting tool revealed by Reuters (2018): biased data led to biased scoring of candidates.
- Data quality practices:
- Ensure data quality before training; plan for data augmentation when data is insufficient or noisy.
- Define data sources and data environment, and address data sharing, access, and governance.
- Data quantity vs. data quality trade-offs are central to project success; too little data limits training, too much data can introduce noise if not cleaned.
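As a starting point for the quality questions above, here is a minimal pandas profiling sketch. The sample columns and the 30% missingness threshold are hypothetical; adapt them to your own dataset:

```python
# Minimal data-quality profiling sketch with pandas; sample data is hypothetical.
import pandas as pd

def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column dtype, missingness, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,  # share of missing values
        "n_unique": df.nunique(),               # cardinality check
    })

# Hypothetical sample standing in for a real source table.
df = pd.DataFrame({
    "age": [34, 29, None, 41, None],
    "role": ["eng", "eng", "sales", None, "sales"],
})
report = profile_quality(df)
print(report)

# Simple go/no-go style check: flag columns too sparse to train on.
too_sparse = report[report["missing_pct"] > 30]  # illustrative threshold
print("Columns needing augmentation or exclusion:", list(too_sparse.index))
```

Flagged columns feed directly into the augmentation question above: gather more data, impute, or exclude.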
The Proof of Concept Trap, POC vs Pilot vs MVP
- A Proof‑of‑Concept (POC) demonstrates a concept in an almost real environment but is not deployed in production.
- A Pilot is a real-world project that uses emerging technology in a protected environment and is deployed to production with limited scope.
- An MVP (Minimum Viable Product) is a tangible early release used to validate value with minimal functionality.
- Most AI POCs fail in even a basic real‑world pilot because extensive upfront work (algorithm selection, data curation, model testing, hyperparameter tuning) is required, and POCs/prototypes often don’t deliver real ROI.
- The real value comes from iterative development, not from a long, isolated prototype phase.
The Real World vs the Model
- A model that performs well in a controlled setting may fail in the real world due to distribution differences, environment changes, or data drift.
- Key questions for model validation and deployment:
- Does the model meet accuracy, precision, recall, and other metrics in real deployments?
- Is the model compatible with the operationalization approach (inference, latency, scalability)?
- How will you monitor model performance and versioning, and how will you iterate?
- Illustrative example: a radiology model trained on Stanford Hospital data may degrade when deployed to an older hospital with different imaging equipment; humans can adapt, but AI models may not without re-training and broader data coverage.
- Model validation and ongoing monitoring are essential to ensure reliability after deployment.
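A minimal scikit-learn sketch of checking deployment-time metrics against labeled real-world cases; the labels and predictions shown are illustrative:

```python
# Sketch: validate a deployed model against labeled real-world data.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # labels from the deployment environment
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions on the same cases

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")

# Compare against the metrics observed in the controlled setting; a large
# gap suggests distribution shift (e.g., different imaging equipment) and
# a need for retraining on broader data.
```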
Data Drift, Model Drift, and Continuous Lifecycles
- AI lifecycles are continuous, not one-off. Data and the world evolve, so models drift and degrade over time.
- What drives failure in production?
- Not budgeting for model maintenance and monitoring
- Ignoring drift, data quality changes, and unseen edge cases
- Example: Tay, Microsoft’s AI chatbot, quickly began producing racist output after exposure to malicious online input, illustrating the need for ongoing content monitoring and safeguards.
- Practical takeaway: budget for ongoing retraining, data updates, governance, and monitoring to maintain model accuracy and safety.
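One common way to operationalize drift monitoring is the Population Stability Index (PSI); this metric is an illustrative choice, not one prescribed by the course. A minimal NumPy sketch:

```python
# Drift-monitoring sketch using the Population Stability Index (PSI).
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time feature sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) / division by zero in empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5_000)   # feature as seen at training time
prod = rng.normal(0.4, 1.2, 5_000)    # shifted production data

print(f"PSI = {psi(train, prod):.3f}")  # > 0.2 is often read as major drift
```

Running such a check on a schedule, and budgeting for the retraining it triggers, is the practical form of the maintenance takeaway below.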
Vendor Hype, Product Mismatch, and Overpromising
- Common vendor pitfalls: misalignment between promises and real capabilities, product mismatches, overhype, and oversell.
- Examples discussed:
- IBM Watson Health case (over-promised and under-delivered in the healthcare context)
- Amazon’s biased recruiting tool (bias results from biased training data)
- Key remedy: rigorous upfront questions, independent validation beyond vendor demos, and alignment of vendor capabilities with concrete project requirements.
Data-Centric Agile and Iterative Development for AI Projects
- Data projects differ from traditional application development; data continuously changes and requires governance, privacy, and security controls.
- Agile and data-centric approaches can be combined, but require adaptation:
- Agile focuses on iteration, delivery, and responding to change.
- Data-centric approaches focus on data understanding, quality, governance, and the data pipeline.
- Neither approach alone suffices; combine them wisely to address data realities.
- Four key ideas:
- Data projects are not merely about functionality; they revolve around extracting insights and actions from data.
- Data iteration is essential: you must deliver value early with rapid iterations while improving data quality and understanding.
- Governance, privacy, and data ownership are central concerns.
- You need to evolve agile practices to accommodate data prep, data quality, and data governance tasks.
Agile and Data-Centric Methodologies: A Family of Frameworks
- CRISP-DM (Cross-Industry Standard Process for Data Mining): a data-centric framework with six phases used in iterative cycles. Phases:
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Modeling
5) Evaluation
6) Deployment
- CPMAI (Cognitive Project Management for AI): extends CRISP‑DM with AI/ML specifics and agile practices. Six phases (iterative):
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Model Development
5) Model Evaluation
6) Model Operationalization
- TDSP (Team Data Science Process): a data science lifecycle framework focusing on data science needs, with a defined start and end; less iterative than CPMAI.
- Key difference: CPMAI explicitly integrates agile and AI specifics, and remains highly iterative; CRISP-DM provides a solid data mining backbone; TDSP emphasizes data science lifecycle with tooling guidance.
Adapting Agile for Data Projects
- Agility for data requires adapting roles and processes:
- Re-imagined Product Owner: someone who understands the data product and data lifecycle (collection, ingestion, preparation, transformation, usage, consumption)
- Re-imagined Scrum Master/Team Lead: understands data lifecycle impact on timeframes and supports data roles
- Development Members: data scientists, data engineers, ETL specialists, BI, data governance owners, etc.
- Stakeholders: data analysts, data scientists, data end-users, governance, legal, compliance, privacy officers, etc.
- Time-boxed iterations are essential; plan, execute, demo, and retrospective within each sprint.
- Documentation and governance remain important, especially for regulatory/compliance contexts.
What to Implement: CPMAI in Practice
- CPMAI workbook provides a practical, artifact-driven approach across six phases, with phase-specific questions, go/no-go checklists, ROI expectations, and data governance considerations.
- The CPMAI framework emphasizes iteration for each project and supports continuous improvement through artifact-based tracking.
- Practical takeaway: use CPMAI as a practical methodology for AI projects, with an emphasis on data-first, AI-relevant, iterative cycles.
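To illustrate the iterative, gated structure, here is a minimal sketch modeling one CPMAI iteration with go/no-go gates. The gate callback and its logic are hypothetical stand-ins for the workbook's phase checklists; only the six phase names come from the notes above:

```python
# Sketch: one CPMAI iteration with go/no-go gates between phases.
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Model Development",
    "Model Evaluation",
    "Model Operationalization",
]

def run_iteration(go_no_go) -> bool:
    """Run one CPMAI iteration; stop early if a phase gate fails."""
    for phase in PHASES:
        if not go_no_go(phase):
            print(f"No-go at {phase}: revisit earlier phases next iteration.")
            return False
        print(f"{phase}: go")
    return True

# Hypothetical gate, e.g. checking the phase's workbook artifacts/checklists.
completed = run_iteration(lambda phase: phase != "Model Evaluation")
```

The point of the loop shape is that a failed gate does not kill the project; it sends the team back into the next iteration with better data or a reframed problem.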
Case Examples and Practical Metrics
- Barclays (financial services):
- 300% increase in the development of new data projects
- 50% reduction in code and data complexity across 80+ applications
- Ability to test and verify over 50% of code at completion
- Netflix: about 75% of views are driven by the recommendation engine; data powers many projects with an agile approach enabling faster delivery
- Panera Bread: digital sales grew to account for 25% of overall sales; improvements in speed and efficiency of delivery
- Fitbit: 4 new products and 22 million devices in the first year; enhanced team engagement and forecasting capabilities
Real-World Implementation: Practical Considerations
- Agile data challenges include: lack of defined end, difficulty estimating data tasks, data quality uncertainties, governance and privacy concerns, and the need for ongoing data preparation and data ops.
- The data product mindset requires broad stakeholder involvement across data analysts, data scientists, end users, governance, legal, and privacy functions.
- Vendor procurement should emphasize aligning capabilities with actual project needs and avoiding “X vendor shop” mentality.
- A well-executed data initiative often combines agile iterations with data-centric governance, testing, and monitoring to ensure ongoing value delivery.
Data Governance, Privacy, and Trust in AI Systems
- Privacy vs. convenience: increasing personalization raises concerns about data sharing and surveillance.
- Transparency vs. security: greater openness can reduce security; need to balance understanding of AI decision-making with robust safeguards.
- Trust and user experience: addressing the Uncanny Valley, ensuring the human touch remains where appropriate, and designing for user comfort with automation.
- Case implications: incidents like the Henn‑Na Hotel robot failures illustrate the risk of uncanny user experiences and the value of blending automation with human interaction.
The Uncanny Valley and User Trust
- The Uncanny Valley describes how objects that look almost-but-not-quite-human can provoke eeriness or revulsion.
- For data systems, the data uncanny valley manifests in privacy concerns, perceived surveillance, and erosion of trust when systems know too much about users.
- Practical design takeaway: avoid overreach, preserve user agency, and prioritize human-centric design where appropriate.
Practical Takeaways and Strategic Guidance
- Think big and strategically, but start small and iterate often.
- Ensure data understanding and quality are at the core of your AI program; invest heavily in data collection, cleaning, and governance (the 80/20 rule often cited: 80% data effort, 20% model development).
- Align ROI with explicit data-driven savings: cost savings, time savings, resource savings, and measurable business value.
- Use iterative, data-centric methodologies (CPMAI/CRISP-DM) to guide AI project lifecycles, with agile practices adapted to data realities.
- Manage expectations with stakeholders using clear problem definitions, realistic milestones, and transparent communication about capabilities and limits of AI.
- Build an integrated team and governance structure that includes data engineers, data scientists, business analysts, and governance/legal/privacy experts.
- ROI (illustrative):
  $$\text{ROI} = \frac{\text{Benefits} - \text{Costs}}{\text{Costs}} \times 100\%$$
- Data Quality Principle: "Garbage in, garbage out" emphasizes the need for high-quality data inputs to achieve reliable AI outputs.
- When discussing improvements, use data-driven metrics such as:
- Number of data projects started/completed (e.g., Barclays: +300% in new data projects)
- Reduction in data/code complexity (e.g., Barclays: -50%)
- Proportion of business outcomes driven by AI (e.g., Netflix: 75% views from recommendations)
- Model drift and data drift imply ongoing monitoring and maintenance budgets; explicit numerical drift formulas are domain-specific and not provided in this transcript.
Quick Reference: Terminology and Frameworks
- POC: Proof of Concept – lab/experimental demonstration, not production.
- Pilot: Real-world test in a controlled production environment.
- MVP: Minimum Viable Product – minimal functionality to validate value.
- CRISP-DM: Six phases for data mining projects, iterative.
- CPMAI: Cognitive Project Management for AI – six iterative phases, AI-specific and agile-friendly.
- TDSP: Team Data Science Process – data science lifecycle with tooling guidance; less iterative than CPMAI.
- Agile: Iterative, incremental development with emphasis on responding to change.
- Data-centric: Emphasis on data quality, data governance, data preparation, data lifecycle.
- The Seven Patterns of AI: Hyper-Personalization, Recognition, Conversation & Human Interaction, Predictive Analytics & Decisions, Autonomous Analytics & Systems, Goal-Driven Systems, Patterns/Anomalies.
Final Takeaway
- To succeed in AI projects, organizations must adopt a data-first, iterative, governance-aware approach, blending agile practices with data-centric methodologies (CRISP-DM, CPMAI, TDSP). ROI should be established upfront with explicit data-driven metrics, and real-world validation must account for data drift, model drift, and user trust. Continuous monitoring, responsible data practices, and prudent vendor selection are essential to sustainable AI success.