Big Data Analytics and Hadoop - Study Notes
Course Code and Overview
- Course Code: CSC702
- Course Name: Big Data Analytics
- Credits: 3
- Institute: SIES Graduate School of Technology (Autonomous)
- Delivery: CE – BE – BDA (Dr. Kalyani Pampattiwar, Associate Professor, Dept. of Computer Engg.)
- Assessment overview (as per slides):
- Internal Assessment: two class tests of 20 marks each; first test after ~40% syllabus; second after additional ~40% syllabus; duration of each test = 1 hour.
- End Semester Theory Examination: 6 questions, each 20 marks; must solve 4 questions; Question 1 is compulsory and based on the entire syllabus; Q2–Q6 from modules.
- Useful links (as references):
- https://nptel.ac.in/courses/106104189
- https://www.coursera.org/specializations/big-data#courses
- https://www.digimat.in/nptel/courses/video/106106169/L01.html
- https://www.coursera.org/learn/nosql-databases#syllabus
- https://www.coursera.org/learn/basic-recommender-systems#syllabus
Course Objectives
- Objective 1: Provide an overview of the exciting, growing field of big data analytics.
- Objective 2: Introduce programming skills to build simple solutions using big data technologies such as MapReduce, scripting for NoSQL, and R.
- Objective 3: Learn fundamental techniques and principles for achieving big data analytics with scalability and streaming capability.
- Objective 4: Enable students to acquire skills to solve complex real-world problems for decision support.
Course Outcomes
- Outcome 1: Understand the building blocks of Big Data Analytics.
- Outcome 2: Apply fundamental enabling techniques like Hadoop and MapReduce to solve real-world problems.
- Outcome 3: Understand different NoSQL systems and how they handle big data.
- Outcome 4: Apply advanced techniques for emerging applications like stream analytics.
- Outcome 5: Gain perspectives of big data analytics in applications such as recommender systems, social media, etc.
- Outcome 6: Apply statistical computing techniques and graphics for analyzing big data.
Detailed Syllabus Content
- Modules covered:
- 1) Introduction to Big Data and Hadoop
- 2) Hadoop HDFS and MapReduce
- 3) NoSQL
- 4) Mining Data Streams
- 5) Real-Time Big Data Models
- 6) Data Analytics with R
- Prerequisites: Database, Data Mining
Syllabus Modules (Detailed Content)
- Module 1: Introduction to Big Data and Hadoop
- 1.1 Introduction to Big Data: Big Data characteristics and Types of Big Data
- 1.2 Traditional vs. Big Data business approach
- 1.3 Case Study of Big Data Solutions
- 1.4 Concept of Hadoop, Core Hadoop Components; Hadoop Ecosystem
- Module 2: Hadoop HDFS and MapReduce
- 2.1 Distributed File Systems: Physical Organization of Compute Nodes, Large-Scale File-System Organization
- 2.2 MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures
- 2.3 Algorithms Using MapReduce: Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce
- Module 3: NoSQL
- 3.1 Introduction to NoSQL, NoSQL Business Drivers; NoSQL Data Architecture Patterns: Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns; NoSQL Case Study
- 3.2 NoSQL solution for big data, Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models: master-slave versus peer-to-peer; NoSQL systems to handle big data problems
- Module 4: Mining Data Streams
- 4.1 The Stream Data Model: A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing
- 4.2 Sampling Data techniques in a Stream
- 4.3 Filtering Streams: Bloom Filter with Analysis
- 4.4 Distinct Elements in a Stream, Count-Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements
- Module 5: Real-Time Big Data Models
- 5.1 A Model for Recommendation Systems, Content-Based Recommendations, Collaborative Filtering
- 5.2 Case Study: Product Recommendation
- 5.3 Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities in a social graph
- Module 6: Data Analytics with R
- 6.1–6.3 Basic features of R, RGUI, RStudio, Handling Basic Expressions in R, Variables in R, Working with Vectors, Storing and Calculating Values in R, Creating and using Objects, Interacting with users, Handling data in R workspace, Executing Scripts, Creating Plots, Accessing help and documentation in R, Reading datasets and Exporting data from R, Manipulating and Processing Data in R, Using functions instead of scripts, built-in functions in R, Data Visualization: Types, Applications
- Textbooks and References provided in the slides.
Textbooks and References
- Textbooks:
- Anand Rajaraman and Jeff Ullman - Mining of Massive Datasets, Cambridge University Press
- Alex Holmes - Hadoop in Practice, Manning Press, Dreamtech Press
- Dan McCreary and Ann Kelly - Making Sense of NoSQL – A guide for managers and the rest of us, Manning Press
- DT Editorial Services, "Big Data Black Book", Dreamtech Press
- EMC Education Services, "Data Science and Big Data Analytics", Wiley
- References:
- Bill Franks, Taming The Big Data Tidal Wave: Finding Opportunities In Huge Data Streams With Advanced Analytics, Wiley
- Chuck Lam, Hadoop in Action, Dreamtech Press
- Jared Dean, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners, Wiley India Private Limited, 2014
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 3rd ed, 2010
- Lior Rokach and Oded Maimon, Data Mining and Knowledge Discovery Handbook, Springer, 2nd edition, 2010
- Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006
- Vojislav Kecman, Learning and Soft Computing, MIT Press, 2010
Big Data Fundamentals and Concepts
What is Big Data?
- Definition: A collection of data sets that are large and complex, making them difficult to store or process with traditional DBMS or data processing applications.
- Key challenges: capture, curation, storage, search, sharing, transfer, analysis, and visualization.
- Informational context: data is generated from diverse sources and used to derive correlations for business insight, research quality, disease prevention, legal analytics, traffic conditions, etc.
Definition (alternative wording from slides):
- "Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
Data characteristics and counts
- Example daily data scales from various platforms: Walmart transactions, Facebook photo uploads, flight data, mobile traffic, genome decoding time reductions, etc. Illustrative figures include statements like:
- Facebook stores, accesses, and analyzes 30+ PB of user-generated data.
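- A flight generates a large volume of data in 6–8 hours (exact figure not preserved in these notes).
- Facebook inserts a large volume of new data every day (exact figure not preserved in these notes).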
Five Vs of Big Data (Quantified ideas):
- Volume: the sheer amount of data generated.
- Velocity: the speed of data generation and the need for real-time processing.
- Variety: different data types and formats (structured, semi-structured, unstructured).
- Veracity: trustworthiness and quality of data; data quality and uncertainty management.
- Value: turning raw data into meaningful insights for business.
- Notation: $\text{5 Vs} = \{\text{Volume}, \text{Velocity}, \text{Variety}, \text{Veracity}, \text{Value}\}$
Data types (Big Data varieties)
- Structured data: predefined schema, easy access, typically in databases; e.g., employee tables in DBMS.
- Semi-structured data: some structure, but not as rigid; contains tags/markers (CSV, XML, JSON).
- Unstructured data: no consistent structure; examples include audio, images, emails.
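As a minimal sketch of the three varieties, the Python snippet below handles one made-up record of each kind using only the standard library. The record contents, field names, and the helper `skills_of` are hypothetical and chosen purely for illustration.

```python
import json
from collections import namedtuple

# Structured: every record follows the same predefined schema.
Employee = namedtuple("Employee", ["emp_id", "name", "salary"])
row = Employee(emp_id=101, name="Asha", salary=52000)

# Semi-structured: self-describing keys/tags; fields may vary per record.
doc_a = json.loads('{"emp_id": 101, "name": "Asha", "skills": ["R", "Hadoop"]}')
doc_b = json.loads('{"emp_id": 102, "name": "Ravi"}')  # no "skills" key

# Unstructured: free text with no schema; any structure must be inferred.
email_body = "Hi team, the Hadoop cluster was slow again this morning."

def skills_of(doc):
    """Return the skills list, tolerating records that omit the field."""
    return doc.get("skills", [])

print(row.salary)                          # direct access by column name
print(skills_of(doc_a), skills_of(doc_b))  # optional field handled per record
print("hadoop" in email_body.lower())      # crude keyword search on raw text
```

The structured row can be queried by column name directly, the JSON documents need a per-record check for optional fields, and the email text can only be searched or mined.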
Data sources and formats (examples cited): Web logs, images, sensor data, audio/video, emails, streaming data, social media posts, etc.
Big Data realities and growth projections
- As of 2025: global data volume estimated around 175 zettabytes; by 2030 projected to ~300 zettabytes.
- Daily data growth and real-world stats described in the slides: about 3 quintillion bytes/day, 8.5B Google searches/day, 65B WhatsApp messages/day, etc.
- Daily data created: $\approx 3 \times 10^{18}$ bytes/day (3 quintillion bytes).
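To get a feel for these magnitudes, the short Python sketch below converts the figures quoted above into bytes. It assumes decimal (SI) prefixes, where each step is a factor of 1000; the helper name `to_bytes` is an illustration only.

```python
# Decimal (SI) prefixes: each step up is a factor of 1000.
PREFIX_EXPONENT = {"KB": 3, "MB": 6, "GB": 9, "TB": 12, "PB": 15, "EB": 18, "ZB": 21}

def to_bytes(value, unit):
    """Convert a value with an SI unit (e.g. 175, 'ZB') to bytes."""
    return value * 10 ** PREFIX_EXPONENT[unit]

global_2025 = to_bytes(175, "ZB")   # 175 zettabytes
daily = 3 * 10 ** 18                # 3 quintillion bytes per day

print(global_2025)                             # 175 followed by 21 zeros
print(daily // to_bytes(1, "EB"))              # daily volume in exabytes
print(to_bytes(30, "PB") // to_bytes(1, "TB")) # Facebook's 30 PB in terabytes
```

Three quintillion bytes is therefore 3 exabytes, and one zettabyte is a thousand exabytes.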
Real-world data growth visuals and anecdotes
- Analogy: when one horse can no longer pull the cart, you add more horses rather than looking for a bigger horse. The same intuition applies to data: upgrading a single machine is not enough, so Big Data requires distributed, scalable architectures.
Big Data analytics: purpose and value
- Big Data analytics aims to extract patterns, correlations, market trends, and customer preferences to inform decisions in real time or near-real time.
- Example use cases include recommender systems, social media analytics, fraud detection, and real-time decision support.
Stages in Big Data Analytics (as per slides)
- Identifying the problem (step 1)
- Designing data requirements (step 2)
- Pre-processing data (step 3)
- Performing analytics over data (step 4)
- Visualizing data (step 5)
Types of Big Data Analytics
- Descriptive analytics: What happened? (historical view)
- Diagnostic analytics: Why did it happen? (root cause)
- Predictive analytics: What could happen? (forecasting, patterns)
- Prescriptive analytics: What should we do? (recommendations, optimization)
- These map to Hindsight, Insight, Foresight, with corresponding questions like "What happened?", "Why did it happen?", "What will happen?", "What should we do?"
Example: Hacking of a Facebook account
- Descriptive: What happened? account hacked; involved fraudulent money request.
- Diagnostic: Why did it happen? insufficient verification of the friend request and details.
- Predictive: What will happen if trends continue? more victims may occur.
- Prescriptive: What is the solution? verify requests, contact sender to confirm identity, implement safeguards.
Data veracity: issues and mitigation
- Common veracity issues: sensor calibration errors, biased social media data, missing data.
- Approaches involve data quality checks, uncertainty handling, and cleaning to improve reliability of insights.
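A simple data quality check of the kind described above might look like the following Python sketch. The readings, the plausible range, and the function name `clean_readings` are all assumptions made for the example; real pipelines would use domain-specific rules.

```python
def clean_readings(readings, low, high):
    """Split readings into (valid, rejected) using simple quality checks.

    A reading is rejected if it is missing (None) or falls outside the
    plausible range [low, high], e.g. because of a calibration error.
    """
    valid, rejected = [], []
    for r in readings:
        if r is None or not (low <= r <= high):
            rejected.append(r)
        else:
            valid.append(r)
    return valid, rejected

# Hypothetical temperature readings in degrees Celsius.
raw = [21.5, 22.0, None, 950.0, 21.8, -300.0]
valid, rejected = clean_readings(raw, low=-50.0, high=60.0)

print(valid)                     # readings kept for analysis
print(rejected)                  # missing or implausible readings
print(sum(valid) / len(valid))   # mean computed on trusted values only
```

Keeping the rejected values, rather than silently dropping them, makes it possible to report how much of the input was unreliable.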
Data value and impact
- Value: turning raw data into actionable insights; steps include data cleaning, transformation, and analysis that benefit business decisions.
Big data needs and scalability
- Needs: processing large data volumes, intensive computations, and scalable infrastructure that can grow/shrink with workload.
- Scalability concept: ability to handle growing work and capacity; two primary strategies:
- Horizontal scaling (scale out): add more machines; distribute workload across multiple servers.
- Vertical scaling (scale up): upgrade a single machine with more CPU, memory, storage.
- Visual aids: horizontal scaling involves distributing workload across many servers; vertical scaling involves more powerful hardware on a single server.
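Horizontal scaling can be sketched in Python by hashing record keys onto a configurable number of servers and watching the load on the busiest server fall as machines are added. The key names and helper functions below are hypothetical, and simple modulo hashing is an assumption chosen for brevity.

```python
import zlib
from collections import Counter

def assign_server(key, num_servers):
    """Map a record key to a server index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_servers

def load_per_server(keys, num_servers):
    """Count how many records each server would hold."""
    counts = Counter(assign_server(k, num_servers) for k in keys)
    return [counts.get(i, 0) for i in range(num_servers)]

keys = [f"user-{i}" for i in range(10_000)]   # hypothetical record keys
for n in (1, 4, 16):                          # scale out: add machines
    loads = load_per_server(keys, n)
    print(n, max(loads))                      # busiest server's share shrinks
```

Plain modulo hashing moves most keys whenever the server count changes; consistent hashing is a common way to reduce that movement.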
Distributed, parallel, and cloud concepts
- Distributed computing: independent components on different machines sharing messages to achieve common goals; appears as a single interface to users.
- Parallel computing: breaking problems into smaller tasks that can be executed simultaneously by multiple processors; results combined for final output.
- Cloud computing characteristics: on-demand services, resource pooling, scalability, accountability, broad network access; access via Internet.
- Grid computing: dispersed, heterogeneous resources connected to achieve a common task.
- Cluster computing: tightly coupled homogeneous systems working together for load balancing and task distribution.
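The split, execute, and combine pattern behind parallel computing can be sketched with Python's standard `concurrent.futures` module. Threads are used here only to keep the example self-contained; in CPython they do not speed up CPU-bound work, so a process pool would be the usual swap for real computation. The function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, num_chunks):
    """Split data into at most num_chunks nearly equal contiguous slices."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum_of_squares(data, workers=4):
    """Break the problem into chunks, process them concurrently, combine."""
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: sum(x * x for x in c), chunks))
    return sum(partials)  # combine partial results into the final output

numbers = list(range(1, 1001))
print(parallel_sum_of_squares(numbers))
```

Each worker sees only its own chunk, and the final answer is obtained by combining the partial results, which is the same shape as the Map and Reduce steps used in Hadoop.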
Distributed computing paradigm for Big Data
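- Big Data workloads involve data sets too large for a single machine (typically terabytes and beyond), so processing is distributed, parallel, and scalable, often on a shared-nothing architecture in which each node uses its own CPU, memory, and disk and shares none of them with other nodes.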
Hadoop and ecosystem overview
- Hadoop is an open-source framework for storing and processing large datasets in a distributed environment using the MapReduce paradigm.
- Core components:
- Hadoop HDFS: storage layer; distributed file system with name node (master) and data nodes (slaves); large files broken into blocks (e.g., 128 MB or 256 MB) and replicated (often 3x) for fault tolerance.
- Hadoop MapReduce: processing layer; Map phase processes data chunks into key-value pairs; Reduce phase groups and combines to produce final results.
- Hadoop YARN: Yet Another Resource Negotiator; resource management and job scheduling; acts as the OS for Hadoop; manages containers and resources across a cluster.
- HDFS architecture specifics:
- One name node; multiple data nodes; heartbeats from data nodes to name node to report status.
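The block and replication figures above translate into simple arithmetic. The Python sketch below estimates how many blocks a file occupies and how much raw storage it consumes, assuming a 128 MB block size and a replication factor of 3 as defaults; the function name `hdfs_footprint` is hypothetical.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (num_blocks, num_block_replicas, raw_bytes_stored)."""
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    return num_blocks, num_blocks * replication, file_size_bytes * replication

blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)
```

For a 1000 MB file this gives 8 blocks, 24 block replicas spread over the data nodes, and 3000 MB of raw disk usage.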
Hadoop ecosystem tools (selected)
- Sqoop: transfers data between Hadoop and external relational databases/data warehouses; imports data into HDFS, Hive, or HBase and can export it back; runs each transfer as MapReduce tasks.
- Flume: ingestion tool for collecting and moving large amounts of log data; sources streaming into HDFS.
- Pig: high-level data flow language (Pig Latin) for analyzing large datasets without heavy Java coding; Pig Latin Compiler translates to executable MapReduce jobs; ETL support.
- Hive: SQL-like interface for reading/writing large datasets in distributed storage; components include Hive Command Line and JDBC/ODBC drivers.
- Spark: distributed computing engine for fast in-memory processing; up to ~100x faster than MapReduce in certain workloads; supports real-time streaming analytics.
- Mahout: scalable machine learning algorithms (clustering, classification, regression) built to run on Hadoop/Spark.
- Ambari: open-source tool for provisioning, managing, and monitoring Hadoop clusters; provides a central management interface; uses agents on each host and a master server.
- Kafka: distributed streaming platform for building real-time data pipelines; durable, scalable messaging system.
- Storm: real-time computation system for streaming data; high-throughput, low-latency processing; integrates with Hadoop.
- Ranger: data security framework for Hadoop ecosystem; centralized authorization and policy management; supports RBAC and ABAC.
- Knox: gateway to securely access Hadoop REST APIs; provides proxying, authentication, and client services.
- Oozie: workflow scheduler for Hadoop jobs; supports DAG-based workflows and coordination triggered by time or data availability; a typical workflow runs a sequence of MapReduce jobs and sends email notifications on success or failure.
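The idea of a DAG-based workflow can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow format or API; the job names and dependencies are made up, and the sketch only shows how a scheduler can derive a valid run order from dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_data": {"ingest_logs"},
    "count_by_user": {"clean_data"},
    "count_by_page": {"clean_data"},
    "build_report": {"count_by_user", "count_by_page"},
    "send_email": {"build_report"},
}

def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

The two counting jobs have no dependency on each other, so a scheduler is free to run them in either order or in parallel, while the email step always comes last.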
Hadoop in practice: Jack and the data analogy
- A farmer scenario illustrates distributed storage and parallel processing: multiple workers can harvest different fruits simultaneously, but separate storage is needed to avoid bottlenecks, highlighting the need for distributed storage and processing.
Big Data case studies (applications)
- Walmart: Optimizing retail operations and customer experience; data used for inventory management, customer insights, supply chain optimization, personalized marketing; dynamic pricing and store layout optimization; fraud detection.
- Uber: Route optimization, demand forecasting, driver incentives, safety measures, and customer experience improvements; dynamic surge pricing; real-time routing with traffic patterns; driver performance metrics.
- Netflix: Personalized content recommendations; content creation and licensing decisions based on user preferences and regional trends; optimization of streaming quality; churn prediction; distribution strategies.
Common Big Data technologies used by Walmart, Uber, and Netflix
- Hadoop and Spark for distributed storage and large-scale (including real-time) processing; machine learning for recommendations and demand forecasting; NoSQL databases (e.g., MongoDB, Cassandra) for unstructured and semi-structured data storage.
Big Data Characteristics — The Data Landscape
Data sources and growth drivers
- Major sources: social media, sensor networks, digital images/videos, cell phones, transaction records, web logs, medical records, archives, military surveillance, e-commerce, complex scientific research.
- Estimates (2025): total global data ~175 zettabytes; by 2030 ~300 zettabytes.
Data volumes and real-world scales (illustrative figures)
- Daily scale examples: billions of emails, billions of messages, terabytes to petabytes of data from platforms like Facebook, Google, and mobile networks.
- Important numbers (as cited):
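- Facebook: 30+ PB of user-generated data stored and analyzed; the daily volume of new data is not preserved in these notes.
- A flight: generates a large volume of data in 6–8 hours (exact figure not preserved in these notes).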
2023 statistics and trends
- Google handles about 8.5 billion searches daily; WhatsApp carries roughly 65 billion messages daily; the world is expected to produce more than 175 zettabytes of data by 2025; 95% of businesses cite unstructured data management as a problem; 45% of businesses run at least one Big Data workload in the cloud; 80–90% of data today is unstructured.
Big Data drivers and ecosystems
- Sensor proliferation (IoT), wearables, connected devices, mobile devices, cloud and edge computing, etc.
- Some visualizations describe this growth as 5x faster than the adoption of electricity or telephony; the “We hides” label in the slides (as transcribed) refers to the rapid buildup of smart devices and data collection.
Big Data Analytics — Concepts and Stages
- What is Big Data Analytics?
- The process of examining large data sets to uncover hidden patterns, correlations, market trends, and customer preferences to inform decisions.
- Desired outcomes: better, faster decisions in real time; move code to data for efficiency; richer insights; handling datasets beyond traditional DBMS capabilities; cross-functional collaboration among IT, business users, and data scientists.
- Stages in Big Data Analytics (visualized)
- Stage 1: Identifying the problem
- Stage 2: Designing data requirements
- Stage 3: Pre-processing data
- Stage 4: Performing analytics over data
- Stage 5: Visualizing results
- Types of analytics (progression from hindsight to foresight)
- Descriptive analytics (What happened?)
- Diagnostic analytics (Why did it happen?)
- Predictive analytics (What could happen?)
- Prescriptive analytics (What should we do?)
- Descriptive → Diagnostic → Predictive → Prescriptive (Hindsight → Insight → Foresight)
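The progression from hindsight to foresight can be illustrated on a toy series in Python. The sales figures, stock level, and reorder rule below are invented for the example, and a least-squares straight line stands in for whatever predictive model a real system would use. Diagnostic analytics is left out because it needs context beyond a single series.

```python
def linear_forecast(values):
    """Fit y = a + b*x by least squares (x = 0, 1, ...) and predict the next y."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n

sales = [100, 110, 120, 130]          # hypothetical monthly unit sales
stock_on_hand = 125                   # hypothetical inventory level

average = sum(sales) / len(sales)     # descriptive: what happened?
forecast = linear_forecast(sales)     # predictive: what could happen?
action = "reorder" if forecast > stock_on_hand else "hold"  # prescriptive

print(average, forecast, action)
```

The descriptive step summarizes history, the predictive step extrapolates the trend one month ahead, and the prescriptive step turns that forecast into a recommended action.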
Practical Big Data and Real-World Scenarios
Example: Descriptive, Diagnostic, Predictive, and Prescriptive analytics for a fraud scenario (Facebook hack)
- Descriptive: what happened (accounts compromised, messages sent, etc.).
- Diagnostic: why it happened (insufficient verification, lack of identity checks).
- Predictive: what could happen if trends continue (more victims).
- Prescriptive: recommended actions (verify identity, tighten friend-requests verification).
Descriptive analytics example for website data
- Google Analytics: describes what happened in historical data, e.g., page views, user sessions, traffic sources.
Diagnostic analytics example in social media marketing
- Assessing post counts, mentions, followers, reviews to analyze campaign performance and pinpoint failures or successes.
Predictive analytics example in aviation
- Southwest Airlines analyzing sensor data to identify patterns indicating potential malfunctions; proactive maintenance.
Prescriptive analytics example in autonomous systems
- Google’s self-driving car analyzing environment and data to decide direction and action.
Data veracity and quality challenges (with examples)
- Sensor data calibration errors, environmental interference; biased social posts; missing data; need for techniques to identify and filter unreliable data.
Data value and value creation
- Process involves turning raw data into meaningful insights, cleaning and preparing data, and ensuring analysis informs business outcomes.
Big Data Tools: Definitions and Core Concepts
- HDFS (Hadoop Distributed File System)
- Storage layer; data stored in blocks (typical block sizes: 128 MB or 256 MB).
- Data replication (commonly 3x) for fault tolerance.
- Architecture: 1 NameNode (master) and multiple DataNodes (slaves).
- Features: Scalability, reliability via replication, high throughput.
- MapReduce
- Processing layer; Map phase outputs key-value pairs; Reduce phase aggregates, producing final results.
- YARN (Yet Another Resource Negotiator)
- Resource management layer; schedules jobs, manages cluster resources; separates resource management from job execution.
- Hadoop ecosystem overview
- Core: HDFS, MapReduce, YARN
- Data ingestion: Sqoop, Flume
- Data processing: Pig, Hive, Spark, Mahout
- Cluster management: Ambari
- Real-time streaming: Kafka, Storm
- Security: Ranger, Knox
- Workflow scheduling: Oozie
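The Map and Reduce phases described above can be imitated on a single machine with a few lines of Python. This is a sketch of the programming model only, not of Hadoop's Java API; the function names and the two input lines are made up, and the grouping step that Hadoop performs between the phases is written out explicitly as `shuffle`.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group by key: collect all values emitted for the same word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values into one result per key."""
    return key, sum(values)

lines = ["big data needs big storage", "hadoop stores big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each input split is mapped on the node that stores it, and the grouped pairs are sent over the network to the reducers.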
Big Data Technologies and Use Cases
- Hadoop ecosystem tools usage summary
- Sqoop: transfer between Hadoop and RDBMS/enterprise warehouses
- Flume: log data ingestion into HDFS
- Pig: high-level data flow for processing; reduces Java coding effort; ETL support
- Hive: SQL-like interface on top of Hadoop for data manipulation
- Spark: fast, in-memory computation; real-time streaming support
- Mahout: machine learning algorithms
- Ambari: cluster provisioning/monitoring
- Kafka: distributed streaming platform for real-time data pipelines
- Storm: real-time streaming processing
- Ranger/Knox: security framework and gateway for Hadoop services
- Oozie: workflow scheduler for Hadoop jobs
Traditional Systems vs Big Data Approach
- Traditional systems
- Centralized data storage; offline analysis; primarily structured data; limited scalability; standard DB tools suffice; conventional configurations.
- Big Data approach
- Distributed storage (data resides across clusters); real-time and offline analysis; support for structured, semi-structured, and unstructured data; advanced analytics platforms; high system configuration requirements for large-scale workloads.
- Key contrasts
- Data type flexibility, scalability, distribution, and support for exploratory analysis vs centralized, predefined schemas.
Data Security, Governance, and Access
- Data security concepts in Hadoop ecosystem
- Ranger provides centralized authorization and policy management; RBAC and ABAC support.
- Knox provides gateway-based access to Hadoop REST APIs; proxy, authentication, and client services.
- Data workflow governance
- Oozie coordinates workflow execution; ensures job sequences run as designed with notifications.
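The difference between RBAC and ABAC can be shown with a small Python sketch. The roles, attributes, and rules below are hypothetical and do not reflect Ranger's policy format or API; they only contrast a decision based on role with one based on attributes.

```python
# Hypothetical policy data; not Ranger's policy format or API.
ROLE_PERMISSIONS = {
    "analyst": {("sales_db", "read")},
    "admin": {("sales_db", "read"), ("sales_db", "write")},
}

def rbac_allows(role, resource, action):
    """RBAC: the decision depends only on the user's role."""
    return (resource, action) in ROLE_PERMISSIONS.get(role, set())

def abac_allows(user_attrs, resource_attrs, action):
    """ABAC: the decision depends on attributes of the user and the resource."""
    same_dept = user_attrs.get("department") == resource_attrs.get("department")
    cleared = user_attrs.get("clearance", 0) >= resource_attrs.get("sensitivity", 0)
    return action == "read" and same_dept and cleared

print(rbac_allows("analyst", "sales_db", "write"))
print(abac_allows({"department": "retail", "clearance": 2},
                  {"department": "retail", "sensitivity": 1}, "read"))
```

RBAC is easier to audit because permissions hang off a small set of roles, while ABAC allows finer-grained rules at the cost of more complex policies.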
Application Areas of Big Data
- Industries and domains mentioned in slides:
- Healthcare, Telecom, Insurance, Government, Finance, Automobile, Education, Retail, and other sectors leveraging Big Data Analytics.
Traditional vs Big Data: Key Differences (Summary)
- Data storage and access
- Traditional: centralized storage, offline analysis, structured data; relatively smaller scale.
- Big Data: distributed storage, real-time and offline analysis, mixed data types (structured, semi-structured, unstructured).
- Analytical approach
- Traditional: structured, repeatable analyses; IT-driven question-defining; monthly/periodic reports.
- Big Data: iterative and exploratory analysis; business users and data scientists collaborating; dynamic questions and discovery of insights.
Quick Reference: Mathematical and Quantitative Aspects
- Big Data metrics and notations (selected):
- Daily data generation examples:
- Twitter activity: daily tweet count (figure not preserved in these notes).
- Emails: daily email count (approximate figure from slides, not preserved in these notes).
- Data growth:
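- Global data volume in 2025: $> 175$ zettabytes, projected to reach about 300 zettabytes by 2030.
- Quintillion-byte scale (illustrative):
- Daily data creation: $\approx 3 \times 10^{18}$ bytes/day (3 quintillion bytes).
- 5 Vs (formal representation): $\{\text{Volume}, \text{Velocity}, \text{Variety}, \text{Veracity}, \text{Value}\}$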
- Block and replication concepts (HDFS):
- Block size: typically 128 MB or 256 MB per block.
- Replication factor: 3 (common default) for fault tolerance.
Summary of Key Takeaways
- Big Data Analytics combines large-scale data collection with advanced analytic techniques to derive real-time and strategic insights.
- Hadoop provides a scalable, fault-tolerant framework for storage (HDFS) and processing (MapReduce, YARN) across commodity hardware.
- The Hadoop ecosystem offers a wide set of tools (Sqoop, Flume, Pig, Hive, Spark, Mahout, Ambari, Kafka, Storm, Ranger, Knox, Oozie) to cover ingestion, processing, analytics, security, and workflow orchestration.
- Data comes in three main varieties (structured, semi-structured, unstructured) and is characterized by Volume, Velocity, Variety, Veracity, and Value.
- Real-world case studies (Walmart, Uber, Netflix) illustrate how Big Data analytics drives supply chain optimization, pricing, route optimization, personalization, churn reduction, and content strategy.
- The course emphasizes the move from traditional, centralized analytics to distributed, iterative, and exploratory analytics, with both real-time and offline capabilities.