Final Study Guide

Definition: NoSQL databases are flexible, scalable databases that do not require a fixed relational schema. Example: MongoDB
NoSQL vs. Traditional Databases:
- Schema: Dynamic (NoSQL) vs. Fixed (Traditional)
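- Scalability:
  - Horizontal scaling (typical of NoSQL): Adds more nodes to spread the load, giving better performance as data grows
  - Vertical scaling (typical of Traditional): Upgrades a single server, which is costlier and limited by hardware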
- Data Types: Manages diverse and unstructured data (NoSQL) vs. Structured data (Traditional)
- Speed: Optimized for high-speed operations (NoSQL) vs. ACID transaction guarantees that may slow down performance (Traditional)

Characteristics: Distributed, scalable, high-performance NoSQL database for vast data handling
Components:
- HMaster: Manages cluster metadata, region assignments, load balancing
- Region Servers: Manage specific data regions and handle read/write requests
- ZooKeeper: Monitors cluster health and coordinates distributed processes
Data Handling: Reads and writes using row keys, column families, and qualifiers.
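
The addressing scheme above (row key, then column family, then qualifier) can be pictured as a nested map. The following is only a toy in-memory sketch of that idea; `ToyTable`, `put`, and `get` are hypothetical names chosen for illustration and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers can vary per row.
        self.column_families = set(column_families)
        # Layout: row key -> column family -> qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        """Write one cell, addressed by row key, family, and qualifier."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        """Read one cell; returns None if the cell does not exist."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyTable(["info", "stats"])
table.put("user#001", "info", "name", "Ada")
table.put("user#001", "stats", "logins", 3)
print(table.get("user#001", "info", "name"))
```

A real cluster adds versioning, region splitting, and persistence on top of this lookup pattern; the sketch only shows how a cell is addressed.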

Function: Data warehousing and query language tool for Hadoop
Limitations:
- Slower query response
- Query complexity
- Inefficient data updates and deletions
- Limited real-time processing
- Limited schema evolution
Ideal Use Cases: Large-scale data processing, batch processing, log analysis, ETL operations, historical data analysis.

Definition: Open-source workflow scheduling and coordination system.
- Workflows: Automate task execution.
- Coordinators: Schedule workflows based on time or data availability.

Function: Open-source unified analytics engine for large-scale data processing.
Competition: Competes with MapReduce as a cluster computing framework.
File System: Has no file system of its own; commonly reads from and writes to HDFS.
Benefits: In-memory processing increases speed; high-level APIs are available for several languages (Scala, Python, Java, R); built-in libraries make it versatile.
Key Components:
- Spark Driver: Control node, coordinates tasks and manages application lifecycle.
- Cluster Manager: Manages resources allocated to Spark applications.
- Executors: Processes running on worker nodes that execute tasks in parallel.

Can be used with HDFS, HBase, and Hive.
Advantage: Simplifies data processing by focusing on transformations, reducing complexity.

Definition: Algorithms designed to learn from historical data to make predictions.
Steps:
1. Problem definition and understanding
2. Data collection
3. Data preprocessing
4. Data exploration and visualization
5. Data splitting
6. Model selection
7. Model training
8. Model evaluation
9. Documentation and reporting
10. Continuous improvement
When to use: Addresses five fundamental questions in data science.

Goal: Provide free implementations of scalable machine learning algorithms, focused on linear algebra.
Integration: Works with Hadoop and Spark for large datasets not fitting in memory.
Algorithms Offered:
- Clustering: K-means, Canopy
- Classification: Naïve Bayes, Random Forest

Methods:
- Pearson Correlation
- Euclidean Distance
- Cosine Measure
- Tanimoto Coefficient (Jaccard Coefficient)
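Pearson Correlation: Measures the covariance of two variables relative to the product of their standard deviations; ranges from -1 to 1.
- Issues:
  - Ignores the number of preferences the two series have in common (the overlap).
  - Undefined when either series consists of identical values (zero standard deviation).
- Mean-centering each series makes Cosine and Pearson similarity equivalent.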
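
The four measures listed above can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the function names and the choice to return `None` for undefined cases are assumptions made here, and the Tanimoto version assumes preferences are given as sets of items.

```python
import math


def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None  # undefined: a series with identical values
    return cov / (sx * sy)


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    """Cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return None  # undefined for a zero vector
    return dot / (nx * ny)


def tanimoto(items_a, items_b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def center(x):
    """Subtract the mean from every value (mean-centering)."""
    m = sum(x) / len(x)
    return [a - m for a in x]


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# After mean-centering, cosine similarity matches Pearson correlation.
print(pearson(x, y), cosine(center(x), center(y)))
```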

Categories:
- Supervised Learning: Uses labeled datasets (e.g., classification, regression).
- Unsupervised Learning: Works with unlabeled datasets (e.g., clustering).
- Semi-supervised Learning
- Reinforcement Learning
Applications:
- Supervised Learning Uses: Risk assessments, fraud detection, and spam filtering.
- Unsupervised Learning: Identifies hidden patterns in datasets.
Clustering: Groups objects together; identifies hidden patterns and aids decision-making.

Definition: Divides a dataset into k clusters based on similarity.
Steps:
1. Select k initial centroids from distinct data points.
2. Measure distances and assign points to closest centroid.
3. Recompute centroids of formed clusters.
4. Repeat until stopping criteria are met.
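
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of floats, the first k distinct points serve as initial centroids (a deterministic choice made here for simplicity), and the stopping criteria are a maximum iteration count or centroid movement below `tol`. The name `kmeans` and its parameters are illustrative.

```python
import math


def kmeans(points, k, max_iter=100, tol=1e-9):
    # Step 1: take the first k distinct points as initial centroids.
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid

        # Step 4: stop once the centroids no longer move.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

Results depend on the initial centroids, which is why practical implementations often use random or smarter seeding and several restarts.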

Methodology:
- Instantiated via KMeansDriver.
- Requires SequenceFiles for input vectors and initial cluster centers.
- Parameters include a distance measure (e.g., Euclidean distance) and a convergence threshold.

Definition: A supervised learning task that predicts the categorical class label of each data point.
Purpose: Organizes and categorizes data for efficient decision-making.
Common Algorithms:
- Decision Trees
- Support Vector Machines
- K Nearest Neighbors
- Neural Networks
Decision Trees: Flowchart structure where:
- Internal nodes: Tests on attributes.
- Branches: Outcomes of tests.
- Leaf nodes: Class labels.
Splitting Methods:
- Information Gain
- Gini Impurity
- Gain Ratio
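
The three splitting criteria can be computed directly from class-label counts. The sketch below uses the standard textbook definitions (entropy-based information gain, Gini impurity, and gain ratio as information gain divided by split information); the function names and the example labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance of mislabeling a randomly drawn item."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain normalized by the entropy of the split sizes."""
    n = len(parent)
    split_info = -sum(
        (len(ch) / n) * math.log2(len(ch) / n) for ch in children if ch
    )
    return information_gain(parent, children) / split_info if split_info else 0.0


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
print(information_gain(parent, perfect_split), gini(parent))
```

A split that separates the classes perfectly recovers all of the parent's entropy, while a split whose children keep the parent's class proportions gains nothing.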

Overview: Predicts a continuous outcome variable based on predictors.
Types:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Logistic Regression (despite the name, predicts a class probability and is used for classification)
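
As a small worked example of the first type, simple linear regression fits a line y = slope * x + intercept by ordinary least squares. The sketch below uses the closed-form solution; `fit_simple_linear` and the sample data are hypothetical, chosen only for illustration.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Data generated from y = 2x + 1, so the fit should recover those values.
slope, intercept = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```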

Use: Binary classification tasks.
Evaluation Metrics:
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP)
- Recall (Sensitivity): TP / (TP + FN)
- F1 Score: Harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall)
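
The metrics above follow directly from the four confusion-matrix counts. A minimal sketch, assuming the counts are already known; the function name, the sample counts, and the choice to return 0.0 when a denominator is zero are illustrative assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts, accuracy is 0.85 while recall is 0.80, which shows why accuracy alone can hide missed positives.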

Methods:
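- Mean Absolute Error (MAE): Average of the absolute errors; lower is better.
- Mean Squared Error (MSE): Average of the squared errors; penalizes large errors more heavily.
- Root Mean Squared Error (RMSE): Square root of MSE; expresses typical error size in the same units as the target.
- R-Squared: Proportion of the variance in the outcome explained by the model; values closer to 1 indicate a better fit.
- Adjusted R-Squared: R-Squared penalized for the number of predictors; more reliable for models with many predictors.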
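
These metrics can be computed from actual and predicted values as sketched below. The function name and sample values are illustrative assumptions; adjusted R-squared uses the usual formula 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Compute MAE, MSE, RMSE, R-squared, and adjusted R-squared."""
    n = len(y_true)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    if ss_tot == 0:
        raise ValueError("y_true is constant; R-squared is undefined")
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Hypothetical actual vs. predicted values from a one-predictor model.
scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
print(scores)
```

Adjusted R-squared comes out lower than R-squared here because the formula charges the model for its predictor.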

Challenges: Producing an effective visualization is difficult; how the data is represented must be chosen carefully.