Final Study Guide
Page 1: NoSQL and HBase
NoSQL Overview
Definition: Non-relational databases designed for flexible schemas and horizontal scalability. Example: MongoDB
NoSQL vs. Traditional Databases:
Schema: Dynamic (NoSQL) vs. Fixed (Traditional)
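Scalability:
Horizontal scaling (NoSQL): Adds more nodes to spread data and load; typically cheaper at large scale
Vertical scaling (Traditional): Adds CPU, memory, or storage to a single server; costlier and limited by hardware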
Data Types: Manages diverse and unstructured data (NoSQL) vs. Structured data (Traditional)
Speed: Optimized for high-throughput reads and writes, often with relaxed consistency (NoSQL) vs. ACID transaction guarantees that can slow performance (Traditional)
HBase
Characteristics: Distributed, scalable, high-performance NoSQL database for vast data handling
Components:
HMaster: Manages cluster metadata, region assignments, load balancing
Region Servers: Manage specific data regions and handle read/write requests
ZooKeeper: Monitors cluster health and coordinates distributed processes
Data Handling: Reads and writes address a cell by row key, column family, and column qualifier.
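To make the row key / column family / qualifier addressing concrete, here is a minimal in-memory sketch. It is not the HBase client API: ToyTable and its methods are hypothetical names invented for illustration, and the model only assumes that rows are kept in sorted row-key order and that a scan's stop row is exclusive.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, illustration only)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> column family -> qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        # Chained .get calls avoid creating empty entries for missing rows.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_row, stop_row):
        """Yield (row_key, row) for start_row <= row_key < stop_row, in key order."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]


table = ToyTable(["info", "stats"])
table.put("user#001", "info", "name", "Ada")
table.put("user#002", "info", "name", "Grace")
table.put("user#001", "stats", "logins", 3)

print(table.get("user#001", "info", "name"))
print([key for key, _ in table.scan("user#001", "user#002")])
```

Qualifiers can be added freely per row, while the set of column families is checked on every write, which mirrors the distinction the notes draw between the two.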
Hive
Function: Data warehousing and query language tool for Hadoop
Limitations:
Slower query response
Query complexity
Inefficient data updates and deletions
Limited real-time processing
Limited schema evolution
Ideal Use Cases: Large-scale data processing, batch processing, log analysis, ETL operations, historical data analysis.
Oozie
Definition: Open-source workflow scheduling and coordination system.
Workflows: Automate task execution.
Coordinators: Schedule workflows based on time or data availability.
Spark
Function: Open-source unified analytics engine for large-scale data processing.
Competition: Competes with MapReduce as a cluster computing framework.
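File System: Does not have its own; reads from and writes to external storage, most commonly HDFS.
Benefits: In-memory processing increases speed; offers high-level APIs in several languages (Scala, Java, Python, R); versatile, with built-in libraries for SQL, streaming, machine learning, and graph processing.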
Key Components:
Spark Driver: Control node, coordinates tasks and manages application lifecycle.
Cluster Manager: Manages resources allocated to Spark applications.
Executors: Processes on worker nodes that execute tasks in parallel.
Pig
Function: High-level platform for data analysis and processing in Hadoop.
Page 2: Machine Learning and Mahout
Pig Integration
Can be used with HDFS, HBase, and Hive.
Advantage: Simplifies data processing by focusing on transformations, reducing complexity.
Machine Learning
Definition: Algorithms designed to learn from historical data to make predictions.
Steps:
Problem definition and understanding
Data collection
Data preprocessing
Data exploration and visualization
Data splitting
Model selection
Model training
Model evaluation
Documentation and reporting
Continuous improvement
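As an illustration of the data splitting step above, the sketch below shuffles a dataset and holds out a fraction for evaluation. The function name split_rows, the 20% default, and the fixed seed are assumptions made for this example, not part of the original notes.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and return (train, test); hypothetical helper for illustration."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(10), test_fraction=0.2)
print(len(train), len(test))
```

The model is trained only on the training portion and evaluated on the held-out portion, so the evaluation reflects data the model has not seen.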
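When to use: When the problem maps to one of the five fundamental questions in data science: Is this A or B? Is this weird? How much or how many? How is this organized? What should I do next?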
Mahout
Goal: Provide free implementations of scalable machine learning algorithms, focused on linear algebra.
Integration: Works with Hadoop and Spark for large datasets not fitting in memory.
Algorithms Offered:
Clustering: K-Means, Canopy
Classification: Naïve Bayes, Random Forest.
Page 3: Similarity Measurements and Clustering
Similarity Measurements
Methods:
Pearson Correlation
Euclidean Distance
Cosine Measure
Tanimoto Coefficient (Jaccard Coefficient)
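Pearson Correlation: Covariance of two variables divided by the product of their standard deviations; ranges from -1 to 1.
Issues:
Ignores the number of preferences two users have in common (the overlap).
Undefined when either series has identical values throughout, because its standard deviation is zero.
Mean-centering the data makes Cosine and Pearson similarity equivalent.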
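The four measures listed above can be sketched in plain Python as follows. This is a minimal illustration, not Mahout's implementation: the function names are chosen for this example, vectors are assumed to be equal-length lists of numbers, cosine assumes non-zero vectors, and the Tanimoto version assumes set-valued preferences.

```python
import math


def pearson(x, y):
    """Covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when a series has no variation
    return cov / (sx * sy)


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)


def tanimoto(a, b):
    """Size of the intersection over size of the union of two item sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def center(x):
    """Subtract the mean from every value."""
    m = sum(x) / len(x)
    return [a - m for a in x]


x, y = [1, 2, 3, 4], [2, 4, 6, 9]
print(pearson(x, y), cosine(center(x), center(y)))   # the two values agree
print(pearson([3, 3, 3], [1, 2, 3]))                 # nan: identical values
print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))    # 0.5
```

The first print shows the equivalence noted above: once both series are mean-centered, cosine similarity and Pearson correlation give the same value up to floating-point rounding.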
Machine Learning Types
Categories:
Supervised Learning: Uses labeled datasets (e.g., classification, regression).
Unsupervised Learning: Works with unlabeled datasets (e.g., clustering).
Semi-supervised Learning
Reinforcement Learning
Applications:
Supervised Learning Uses: Risk assessments, fraud detection, and spam filtering.
Unsupervised Learning: Identifies hidden patterns in datasets.
Clustering: Groups objects together; identifies hidden patterns and aids decision-making.
K Means Clustering
Definition: Divides a dataset into k clusters based on similarity.
Steps:
Select initial centroids from distinct data points.
Measure distances and assign points to closest centroid.
Recompute centroids of formed clusters.
Repeat until stopping criteria are met.
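The four steps above can be sketched as a short, single-machine implementation. This is a teaching sketch under stated assumptions, not Mahout's distributed version: points are tuples of numbers, Euclidean distance is used, initial centroids are sampled at random from the data (distinct positions, so the input is assumed to contain at least k distinct points), and the stopping criteria are a maximum iteration count and a centroid-shift tolerance. The names k_means, max_iter, and tol are chosen for this example.

```python
import math
import random


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (centroids, clusters) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial centroids from the data points.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 4: stop once no centroid moves more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On this small, well-separated dataset the two groups near the origin and near (10, 10) end up in separate clusters. On harder data the result depends on the initial centroids, which is why the seed is exposed as a parameter.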
Page 4: K Means in Mahout and Classification
K Means in Mahout
Methodology:
Run via the KMeansDriver class.
Requires SequenceFiles for input vectors and initial cluster centers.
Parameters include a distance measure (e.g., Euclidean distance), a convergence threshold, and a maximum number of iterations.
Classification
Definition: A type of supervised learning that predicts categorical class labels.
Purpose: Organizes and categorizes data for efficient decision-making.
Common Algorithms:
Decision Trees
Support Vector Machines
K Nearest Neighbors
Neural Networks
Decision Trees: Flowchart structure where:
Internal nodes: Tests on attributes.
Branches: Outcomes of tests.
Leaf nodes: Class labels.
Splitting Methods:
Information Gain
Gini Impurity
Gain Ratio.
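The three splitting criteria above can be computed from class labels alone. The sketch below is a minimal illustration with function names chosen for this example; it assumes a candidate split is given as a list of child label lists that together make up the parent.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance of mislabeling a random item drawn from labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by the entropy of the split sizes."""
    n = len(parent)
    split_info = -sum(
        (len(ch) / n) * math.log2(len(ch) / n) for ch in children if ch
    )
    return information_gain(parent, children) / split_info if split_info else 0.0


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split), gain_ratio(parent, perfect_split))
```

A split that separates the classes perfectly removes all of the parent's entropy, so its information gain equals the parent entropy. Gain ratio divides by the split information to counteract the bias of information gain toward attributes with many distinct values.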
Regression
Overview: Predicts a continuous outcome variable based on predictors.
Types:
Simple Linear Regression
Multiple Linear Regression
Polynomial Regression
Logistic Regression (despite its name, it predicts class probabilities and is used for classification).
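To illustrate the simplest of these types, the sketch below fits a simple linear regression with the closed-form least-squares solution, and shows the logistic (sigmoid) function that logistic regression applies to a linear score. The function names and the sample data are assumptions made for this example.

```python
import math


def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # assumes xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


slope, intercept = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)   # 2.0 1.0
print(sigmoid(0.0))       # 0.5
```

Linear regression returns the score itself as a continuous prediction, whereas logistic regression passes a linear score through the sigmoid and interprets the result as a class probability.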
Page 5: Logistic Regression and Evaluation Methods
Logistic Regression
Use: Binary classification tasks.
Evaluation Metrics:
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall (Sensitivity): TP / (TP + FN)
F1 Score: Harmonic mean of precision and recall: 2 * Precision * Recall / (Precision + Recall)
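The metrics above can be computed directly from true and predicted labels. The sketch below is a minimal illustration: the function names and sample labels are assumptions, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)    # (3, 3, 1, 1)
print(metrics)
```

In this example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all come out to 0.75.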
Regression Evaluation
Methods:
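Mean Absolute Error (MAE): Average absolute difference between predicted and actual values; lower is better.
Mean Squared Error (MSE): Average squared difference; penalizes large errors more heavily.
Root Mean Squared Error (RMSE): Square root of MSE; expresses error in the same units as the outcome variable.
R-Squared: Proportion of variance in the outcome explained by the model; values closer to 1 indicate a better fit.
Adjusted R-Squared: R-Squared penalized for the number of predictors; more reliable for models with many predictors.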
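The regression measures above can be sketched in a few lines. The function name, the sample values, and the n_predictors parameter are assumptions for this example; the adjusted R-Squared formula used is 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return MAE, MSE, RMSE, R-squared, and adjusted R-squared as a dict."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    # Requires n > n_predictors + 1, otherwise the denominator is not positive.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
print(scores)
```

For these values MAE and MSE are both 0.5, R-Squared is 0.9, and adjusted R-Squared is lower at 0.85 because it charges the model for the predictor it uses.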
Data Visualization Techniques
Basic: Bar charts, Pie charts, Line graphs.
Advanced: Heatmaps, Scatter plots, Bubble charts, Treemaps, Word clouds.
Geospatial: Choropleth maps, Heatmaps, Point maps.
Time Series: Line graphs, Area charts, Candlestick charts.
Network: Node-link diagrams, Chord diagrams, Sankey diagrams.
Page 6: Visualization Challenges
Challenges: Producing an effective visualization is difficult and requires careful choices about how the data is represented.