Data Analysis

Learning Objective

  • Explain how data can be analyzed to provide business intelligence.

Introduction

  • Reflection on previous topic: qualities of good information and generation of business intelligence from data.

  • Focus on:

    • Data collection and preparation for analysis.

    • Development of AI analytics tools and their impact.

    • Role of financial professionals in processing data and collaboration with ICT professionals.

The Data Analysis Process

  • The specific processes for analyzing data vary among businesses and shift with technological advancements, so professionals must continually update their knowledge.

  • Basic stages:

    • Selection of the data

    • Pre-processing of the data for quality improvement

    • Transformation of the dataset for analysis readiness

    • Data mining for pattern and relationship recognition

    • Evaluation of findings to derive insights
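The five stages above can be sketched as a simple pipeline. This is a minimal illustration in Python, not a standard implementation: the sales records, field names, and evaluation threshold are all invented for the example.

```python
# A minimal sketch of the five stages as a pipeline of functions.
# The stage names come from the notes; the data and rules are illustrative.

def select(records, fields):
    """Selection: keep only the fields relevant to the business question."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Pre-processing: drop rows with missing values (one simple cleaning rule)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: standardize amounts to two decimal places (illustrative)."""
    return [{**r, "amount": round(r["amount"], 2)} for r in records]

def mine(records):
    """Data mining: a trivial 'pattern' - average amount per region."""
    totals = {}
    for r in records:
        totals.setdefault(r["region"], []).append(r["amount"])
    return {region: sum(v) / len(v) for region, v in totals.items()}

def evaluate(patterns, threshold):
    """Evaluation: flag regions whose average exceeds a threshold."""
    return [region for region, avg in patterns.items() if avg > threshold]

sales = [
    {"region": "North", "amount": 120.0, "rep": "A"},
    {"region": "North", "amount": None,  "rep": "B"},   # incomplete record
    {"region": "South", "amount": 80.0,  "rep": "C"},
]
data = select(sales, ["region", "amount"])
data = preprocess(data)
data = transform(data)
patterns = mine(data)
print(evaluate(patterns, 100.0))   # → ['North']
```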

Knowledge Discovery in Databases (KDD)

  • Definition: Process of analyzing data to generate knowledge.

  • Origin: the late 1980s, before big data and the use of AI algorithms.

  • The five basic stages listed above remain unchanged despite technological developments.

Preparing the Data

Selection
  • Big data and AI allow businesses to answer almost any relevant question, though at a cost.

  • Data selection contingent upon specific questions that reflect strategic plans, goals, and objectives.

  • Goal identification and question formulation are discussed further in this competency area.

Pre-processing
Purpose
  • Improve data quality prior to analysis through cleaning.

Specific Problems Addressed
  • Noise: corrupted or unwanted data, including:

    • Faulty data (e.g., erroneous product codes)

    • Irrelevant data (e.g., customer addresses for age analysis)

    • Meaningless conversions (e.g., numbers reformatted as dates)

    • Note: Leaving some noise may prevent overfitting in algorithms.

  • Outliers: Significant deviations in data that may indicate errors or natural variations. Evaluation before removal is crucial to avoid losing valid data.

  • Duplicates: occur when the same data is recorded multiple times or in different formats. Identify the duplicate instances, correct discrepancies, and retain a single version.

  • Omissions: incomplete datasets, which can be addressed by:

    • Regenerating data from other sources.

    • Re-attempting data collection.

    • Adjusting algorithms to accommodate the omissions.

Cleaning Data
  • Cleaning tasks are specialized and often performed by algorithms, but finance professionals are involved in:

    • Correcting data errors.

    • Reviewing suggested outliers.

    • Regenerating missing data.
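The cleaning steps above can be sketched in plain Python. This is an illustrative example only: the transaction values are invented, and the median-absolute-deviation rule used to flag outliers is just one simple choice of method.

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def clean(amounts, k=10):
    # Duplicates: keep a single copy of each value, preserving order.
    deduped = list(dict.fromkeys(amounts))
    # Omissions: drop missing entries here; regeneration or re-collection
    # would happen upstream if the business needs those records.
    present = [a for a in deduped if a is not None]
    # Outliers: flag (not delete) values far from the median, using a
    # median-absolute-deviation rule, so a human can review them first.
    med = median(present)
    mad = median([abs(a - med) for a in present]) or 1.0
    flagged = [a for a in present if abs(a - med) > k * mad]
    kept = [a for a in present if a not in flagged]
    return kept, flagged

kept, flagged = clean([10, 12, 12, None, 11, 500])
print(kept, flagged)   # → [10, 12, 11] [500]
```

Flagging rather than deleting outliers reflects the point above: the value 500 may be an error, or a genuine (and important) natural variation.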

Transformation
  • Prepares data for analysis by reducing dataset size and converting formats to suit the analytics tools.

Techniques for Dataset Size Reduction
  1. Sampling:

    • Selecting a representative sample versus full dataset.

    • Commonly used in:

      • Quality control (checking products)

      • Drug testing (conducting trials)

      • Auditing (inspecting item samples)

    • Statistical models help determine necessary sample sizes.

    • Example: sampling temperature readings in a factory, with sample size depending on production complexity.

  2. Aggregation:

    • Combining features based on analysis purpose.

    • Example:

      • Exam data aggregated by school for institution performance (e.g., summing results).

      • Caution: generic aggregation in drug trials has led to follow-on decisions detrimental to specific groups.

    • Analysts need guidance from business on aggregation strategy to avoid misinterpretations.
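The two size-reduction techniques above can be sketched together: draw a random sample, then aggregate it. The exam-style data, the 10% sampling rate, and the choice of averaging are all illustrative assumptions.

```python
import random

random.seed(1)   # fixed seed so the example is repeatable

# 1. Sampling: analyse a 10% random sample instead of the full dataset.
results = [("School A" if i % 2 else "School B", 50 + i % 50)
           for i in range(1_000)]                  # (school, mark) records
sample = random.sample(results, k=len(results) // 10)

# 2. Aggregation: combine marks per school. Whether to sum or average is a
# business decision; the two can tell very different stories.
by_school = {}
for school, mark in sample:
    by_school.setdefault(school, []).append(mark)
averages = {s: round(sum(m) / len(m), 1) for s, m in by_school.items()}

print(len(sample))   # → 100
```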

The Order of Stages

  • Data analysis processes continue to evolve; the order of the stages is determined largely by how the data is stored:

    • ETL (Extract, Transform, Load): data is transformed before being loaded into a data warehouse.

    • ELT (Extract, Load, Transform): raw data is loaded into a data lake and transformed later.
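The difference between the two orderings can be sketched with placeholder functions; the functions themselves are invented, and only the position of the transform step matters here.

```python
# A sketch contrasting ETL and ELT. extract/transform/load are placeholders.

def extract():            return [" 42 ", "17", None]     # raw source rows
def transform(rows):      return [int(r) for r in rows if r is not None]
def load(rows, store):    store.extend(rows); return store

warehouse, lake = [], []

# ETL: transform before loading, so the warehouse holds only clean data.
load(transform(extract()), warehouse)

# ELT: load raw data first, transform later inside the data lake.
load(extract(), lake)
lake_transformed = transform(lake)

print(warehouse)   # → [42, 17]
print(lake)        # → [' 42 ', '17', None]
```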

The Power of Artificial Intelligence (AI)

Definition of AI
  • The use of computers to perform tasks traditionally requiring human cognition, including processing large volumes of data rapidly.

  • Automated decision-making raises issues; however, algorithms function within human-defined parameters.

Key Capabilities Transforming Data Analytics
  1. Object Recognition:

    • AI classifies images based on visual cues (e.g., in photos and videos).

  2. Natural Language Processing (NLP):

    • AI analyzes and interprets language, adapting to various accents and contexts.

  3. Human-AI Interaction:

    • Used in customer service through chatbots and emotion detection in calls.

  4. Machine Learning:

    • Algorithms learn from data rather than explicit instructions, improving over time as they process vast datasets.

Types of Machine Learning
  1. Supervised Learning

    • Data is tagged for classification into categories.

    • Classification types:

      • Binary (e.g., spam vs. legitimate emails)

      • Multi-class (e.g., customer categories)

      • Multi-label (e.g., books with multiple genres)

    • Steps involved in the supervised learning process:

    1. Labeling of data.

    2. Training the machine with subsets of data to identify features.

    3. Learning with feedback for accuracy.

    4. Application of rules to classify new data.

    • Metrics: precision and recall are used to assess how effective the classification is.

  2. Regression Analysis:

    • Used for predictive analysis considering numerous variables.

    • Business example:

      • Predicting the impact of advertising strategies from variables such as demographic traffic and weather patterns.

  3. Challenges: Overfitting and Underfitting:

    • Overfitting: rules fit the training dataset's specifics so closely that they fail to generalize to new data.

    • Underfitting: inadequate training leaves the algorithm unable to discern relationships in the data.

  • Input from business professionals can help resolve these issues.
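The precision and recall metrics mentioned under supervised learning can be sketched for a binary spam classifier. The labels and predictions are invented for illustration (1 = spam, 0 = legitimate).

```python
# Precision: of the emails flagged as spam, how many really were spam?
# Recall: of the actual spam, how much did the classifier catch?

actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # true labels
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # classifier output

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)   # → 0.75 0.75
```

The two metrics pull in different directions: a filter that flags everything has perfect recall but poor precision, which is why both are reported.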
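The regression idea above can be sketched as a simple least-squares line relating advertising spend to sales. Real business models would include many more variables; the figures here are illustrative.

```python
# Fit y = slope * x + intercept by ordinary least squares (one variable).

spend = [1.0, 2.0, 3.0, 4.0]      # e.g. ad spend, $000s (illustrative)
sales = [3.0, 5.0, 7.0, 9.0]      # observed sales, $000s (illustrative)

n = len(spend)
mean_x = sum(spend) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, sales))
         / sum((x - mean_x) ** 2 for x in spend))
intercept = mean_y - slope * mean_x

predicted = slope * 5.0 + intercept   # forecast sales at $5k spend
print(slope, intercept, predicted)    # → 2.0 1.0 11.0
```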

Unsupervised Learning

Definition
  • Machine analyzes unlabelled datasets to discover patterns without specific prompts.

Techniques
  1. Cluster Analysis:

    • Groups data based on shared characteristics (e.g., customer queries).

  2. Anomaly Detection:

    • Identifies outliers which may indicate fraud or other concerns.

  3. Association Rules Mining:

    • Discovers relationships using logical inference for predictive insights (e.g., market basket analysis, health symptom identification).
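Association rules mining can be sketched as counting how often items appear together across market baskets, as in the example above. The baskets and the confidence measure shown are illustrative.

```python
from itertools import combinations
from collections import Counter

# Each basket is the set of items one customer bought (illustrative data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

item_counts = Counter(item for basket in baskets for item in basket)

# Confidence of the rule "bread → butter": of the baskets containing
# bread, what fraction also contain butter?
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
print(round(confidence, 2))   # → 0.67
```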

Reinforcement Learning

  • Mimics behavioral training through positive reinforcement for achieving set objectives.

  • Algorithms adjust based on performance outcomes, improving through repeated trials.
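The reward loop described above can be sketched as an epsilon-greedy agent choosing between two actions. The setup is an illustrative assumption: action "B" always earns a reward and "A" never does, so repeated trials should shift the agent toward "B".

```python
import random

random.seed(0)                  # fixed seed so the run is repeatable
value = {"A": 0.0, "B": 0.0}    # agent's estimated value of each action
counts = {"A": 0, "B": 0}

for step in range(200):
    if random.random() < 0.1:                 # explore: try a random action
        action = random.choice(["A", "B"])
    else:                                     # exploit: pick the best so far
        action = max(value, key=value.get)
    reward = 1.0 if action == "B" else 0.0    # positive reinforcement
    counts[action] += 1
    # update the running-average estimate for the chosen action
    value[action] += (reward - value[action]) / counts[action]

print(max(value, key=value.get))   # the agent settles on the rewarding action
```

The occasional random exploration is what lets the agent discover "B" at all; pure exploitation would lock in the first action it tried.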

Setting Parameters

  • Importance of defining operational bounds for AI systems to prevent unintended consequences.

  • Businesses need to anticipate potential paths AI might take to ensure outcomes align with ethical and operational standards.

Key Roles in Data Analytics

  1. Data Scientists:

    • Specialize in data mining, modeling processes, and developing predictive algorithms.

  2. Data Analysts:

    • Focus on answering business questions through detailed data evaluation and interpretations.

  3. Business Analysts:

    • Bridge between business needs and technical data solutions to ensure project alignment.

  4. Finance Professionals:

    • Integral in ensuring data accuracy, ethical considerations, and effective communication of findings.

Conclusion

  • This topic covered data collection, preparation, analytical methodology, and AI's role in data-driven decision-making.

  • Upcoming sections discuss data's influence on business decision-making.