Final Study Guide

Page 1: NoSQL and HBase

NoSQL Overview

  • Definition: Non-relational databases built for flexible schemas and horizontal scalability. Example: MongoDB

  • NoSQL vs. Traditional Databases:

    • Schema: Dynamic (NoSQL) vs. Fixed (Traditional)

    • Scalability:

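      • Horizontal scaling (adding more nodes): Typical of NoSQL; capacity and throughput grow with the cluster

      • Vertical scaling (upgrading a single server): Typical of traditional databases; costlier and bounded by hardware limits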

    • Data Types: Manages diverse and unstructured data (NoSQL) vs. Structured data (Traditional)

    • Speed: Optimized for high-speed operations (NoSQL) vs. ACID transactions may slow down performance (Traditional)

HBase

  • Characteristics: Distributed, scalable, high-performance NoSQL database for vast data handling

  • Components:

    • HMaster: Manages cluster metadata, region assignments, load balancing

    • Region Servers: Manage specific data regions and handle read/write requests

    • ZooKeeper: Monitors cluster health and coordinates distributed processes

  • Data Handling: Each cell is addressed by a row key, a column family, and a column qualifier; reads and writes locate data by these coordinates.
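
The addressing scheme above can be pictured as a nested lookup. The sketch below is a toy in-memory model, not the HBase client API; the class name TinyTable, its methods, and the sample row keys and column families are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class TinyTable:
    """Toy stand-in for an HBase table (hypothetical, not the real HBase API)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> value
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        # Qualifiers can be added freely, but the family must already exist.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family, qualifier):
        # Returns None when the row or the cell is absent.
        return self.rows.get(row_key, {}).get(f"{family}:{qualifier}")


table = TinyTable(column_families=["info", "stats"])
table.put("user#001", "info", "name", "Ada")
table.put("user#001", "stats", "logins", 42)

name = table.get("user#001", "info", "name")
missing = table.get("user#002", "info", "name")
```

The real system also versions each cell by timestamp and stores rows sorted by row key; both are left out here to keep the sketch short.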

Hive

  • Function: Data warehouse tool for Hadoop that provides a SQL-like query language (HiveQL)

  • Limitations:

    • Slower query response

    • Query complexity

    • Inefficient data updates and deletions

    • Limited real-time processing

    • Limited schema evolution

  • Ideal Use Cases: Large-scale data processing, batch processing, log analysis, ETL operations, historical data analysis.

Oozie

  • Definition: Open-source workflow scheduling and coordination system.

    • Workflows: Automate task execution.

    • Coordinators: Schedule workflows based on time or data availability.

Spark

  • Function: Open-source unified analytics engine for large-scale data processing.

  • Competition: Competes with MapReduce as a cluster computing framework.

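  • File System: Has no storage layer of its own; commonly reads from and writes to HDFS, and can also use other storage such as Amazon S3 or the local file system.

  • Benefits: In-memory processing increases speed; high-level APIs in several languages (Scala, Java, Python, R); built-in libraries (Spark SQL, MLlib, GraphX, Structured Streaming).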

  • Key Components:

    • Spark Driver: Control process; coordinates tasks and manages the application lifecycle.

    • Cluster Manager: Manages resources allocated to Spark applications.

    • Executors: Processes on worker nodes that execute tasks in parallel.

Pig

  • Function: High-level platform for data analysis and processing in Hadoop; scripts are written in Pig Latin.

Page 2: Machine Learning and Mahout

Pig Integration

  • Can be used with HDFS, HBase, and Hive.

  • Advantage: Simplifies data processing by focusing on transformations, reducing complexity.

Machine Learning

  • Definition: Algorithms designed to learn from historical data to make predictions.

  • Steps:

    1. Problem definition and understanding

    2. Data collection

    3. Data preprocessing

    4. Data exploration and visualization

    5. Data splitting

    6. Model selection

    7. Model training

    8. Model evaluation

    9. Documentation and reporting

    10. Continuous improvement

  • When to use: Addresses five fundamental questions in data science.
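
Step 5 (data splitting) is easy to get subtly wrong, so here is a minimal sketch of a shuffled train/test split. The function name split_rows, the 20% test fraction, and the fixed seed are illustrative assumptions, not part of the guide.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(10), test_fraction=0.2)
```

Shuffling before splitting avoids a test set that contains only the last rows collected; the model is then trained on train and evaluated on test only.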

Mahout

  • Goal: Provide free implementations of scalable machine learning algorithms, focused on linear algebra.

  • Integration: Runs on Hadoop and Spark to process datasets too large to fit in a single machine's memory.

  • Algorithms Offered:

    • Clustering: K Means, Canopy.

    • Classification: Naïve Bayes, Random Forest.

Page 3: Similarity Measurements and Clustering

Similarity Measurements

  • Methods:

    • Pearson Correlation

    • Euclidean Distance

    • Cosine Measure

    • Tanimoto Coefficient (Jaccard Coefficient)

  • Pearson Correlation: Covariance of two variables divided by the product of their standard deviations; ranges from -1 to 1.

    • Issues:

      • Ignores the number of overlapping preferences between the two series.

      • Undefined when either series is constant (zero variance), e.g., all values identical.

    • Mean-centering the data makes Cosine and Pearson similarity equivalent.
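
The four measures can be sketched in plain Python as below. The function names and sample vectors are illustrative assumptions, and the Tanimoto version shown is the set-based (Jaccard) form. The last lines check that the cosine similarity of mean-centered vectors matches the Pearson correlation.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None  # undefined: a constant series has zero variance
    return cov / (sx * sy)


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def tanimoto(items_a, items_b):
    # Set-based form: size of the intersection over size of the union.
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b)


def center(x):
    # Subtract the mean so the series has mean zero.
    m = sum(x) / len(x)
    return [a - m for a in x]


u = [5.0, 3.0, 4.0, 1.0]
v = [4.0, 2.0, 5.0, 2.0]

r = pearson(u, v)
c = cosine(center(u), center(v))  # agrees with r up to rounding
undefined = pearson([3.0, 3.0, 3.0], [1.0, 2.0, 3.0])  # None
```

Note that Euclidean distance is a distance (smaller means more similar), while the other three are similarities (larger means more similar).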

Machine Learning Types

  • Categories:

    • Supervised Learning: Uses labeled datasets (e.g., classification, regression).

    • Unsupervised Learning: Works with unlabeled datasets (e.g., clustering).

    • Semi-supervised Learning

    • Reinforcement Learning

  • Applications:

    • Supervised Learning Uses: Risk assessments, fraud detection, and spam filtering.

    • Unsupervised Learning: Identifies hidden patterns in datasets.

  • Clustering: Groups similar objects together; reveals hidden patterns and aids decision-making.

K Means Clustering

  • Definition: Divides a dataset into k clusters based on similarity.

  • Steps:

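    1. Choose k and select k distinct data points as the initial centroids.

    2. Measure the distance from each point to every centroid and assign the point to the closest centroid.

    3. Recompute each centroid as the mean of the points assigned to its cluster.

    4. Repeat steps 2 and 3 until a stopping criterion is met (e.g., centroids stop moving or a maximum number of iterations is reached).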
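
The steps above can be sketched as a short pure-Python routine. This is a minimal illustration, not Mahout's implementation. It assumes points are tuples of floats, uses Euclidean distance, and picks the first k distinct points as initial centroids for determinism (real implementations usually choose them randomly). The names kmeans, max_iter, and tol are illustrative.

```python
import math


def kmeans(points, k, max_iter=100, tol=1e-9):
    # Step 1: take the first k distinct points as initial centroids.
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid

        # Step 4: stop once no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

On this sample the two initial centroids both come from the lower-left group, yet the iterations still separate the points into the lower-left and upper-right groups.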

Page 4: K Means in Mahout and Classification

K Means in Mahout

  • Methodology:

    • Run via the KMeansDriver class.

    • Requires SequenceFiles for input vectors and initial cluster centers.

    • Key parameters include a distance measure (e.g., Euclidean distance), a convergence threshold, and a maximum number of iterations.

Classification

  • Definition: A type of supervised learning that predicts categorical class labels.

  • Purpose: Organizes and categorizes data for efficient decision-making.

  • Common Algorithms:

    • Decision Trees

    • Support Vector Machines

    • K Nearest Neighbors

    • Neural Networks

  • Decision Trees: Flowchart structure where:

    • Internal nodes: Tests on attributes.

    • Branches: Outcomes of tests.

    • Leaf nodes: Class labels.

  • Splitting Methods:

    • Information Gain

    • Gini Impurity

    • Gain Ratio.
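
The three splitting criteria can be computed directly from class labels. The sketch below uses their standard textbook definitions; the function names and the yes/no sample labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    # Information gain divided by the entropy of the split sizes themselves.
    n = len(parent)
    split_info = -sum(
        len(ch) / n * math.log2(len(ch) / n) for ch in children if ch
    )
    if split_info == 0:
        return 0.0
    return information_gain(parent, children) / split_info


parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"]
right = ["yes"] + ["no"] * 3

ig = information_gain(parent, [left, right])
gr = gain_ratio(parent, [left, right])
parent_gini = gini(parent)  # 0.5 for a 50/50 node
left_gini = gini(left)      # 0.375 for a 3-to-1 node
```

A tree builder would evaluate one of these scores for every candidate test at a node and keep the split with the highest gain (or the lowest weighted impurity).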

Regression

  • Overview: Predicts a continuous outcome variable based on predictors.

  • Types:

    • Simple Linear Regression

    • Multiple Linear Regression

    • Polynomial Regression

    • Logistic Regression (predicts class probabilities; despite its name, it is used for classification).
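
As a small illustration of the first type, simple linear regression with one predictor has a closed-form least-squares solution. The function name fit_simple_linear and the sample data are illustrative assumptions.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

slope, intercept = fit_simple_linear(xs, ys)  # slope 2.0, intercept 1.0
prediction = slope * 5.0 + intercept          # 11.0
```

Multiple and polynomial regression follow the same least-squares idea with more columns in the design matrix.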

Page 5: Logistic Regression and Evaluation Methods

Logistic Regression

  • Use: Binary classification tasks.

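  • Evaluation Metrics (TP, TN, FP, FN are the counts of true positives, true negatives, false positives, and false negatives):

    • Accuracy: (TP + TN) / (TP + TN + FP + FN)

    • Precision: TP / (TP + FP)

    • Recall (Sensitivity): TP / (TP + FN)

    • F1 Score: Harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall)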
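
A worked example of these formulas, using made-up confusion-matrix counts; the function name classification_metrics and the numbers are illustrative assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 100 predictions: 40 TP, 45 TN, 5 FP, 10 FN
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
# accuracy = 85/100, precision = 40/45, recall = 40/50, f1 = 80/95
```

Accuracy alone can mislead on imbalanced classes, which is why precision, recall, and F1 are usually reported alongside it.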

Regression Evaluation

  • Methods:

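    • Mean Absolute Error (MAE): Average absolute difference between predicted and actual values; lower is better.

    • Mean Squared Error (MSE): Average squared difference; penalizes large errors more heavily.

    • Root Mean Squared Error (RMSE): Square root of MSE; expressed in the same units as the target.

    • R-Squared: Proportion of the variance in the target explained by the model; values closer to 1 indicate a better fit.

    • Adjusted R-Squared: R-Squared penalized for the number of predictors; more reliable for models with many predictors.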
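
The five measures can be computed together from actual and predicted values. The sketch below uses the standard definitions; the function name regression_metrics, the parameter n_predictors, and the sample numbers are illustrative assumptions.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot

    # Adjusted R-squared shrinks r2 as predictors are added (needs n > n_predictors + 1).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5]

scores = regression_metrics(y_true, y_pred, n_predictors=1)
# mae = 0.5, mse = 0.35, r2 = 0.95625
```

In this sample the single error of 1.0 contributes more than half of the MSE, which shows how squaring emphasizes large errors.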

Data Visualization Techniques

  • Basic: Bar charts, pie charts, line graphs.

  • Advanced: Heat maps, scatter plots, bubble charts, tree maps, word clouds.

  • Geospatial: Choropleth maps, heat maps, point maps.

  • Time Series: Line graphs, area charts, candlestick charts.

  • Network: Node-link diagrams, chord diagrams, Sankey diagrams.

Page 6: Visualization Challenges

  • Challenges: Building an effective visualization is not automatic; how the data is represented requires careful consideration.