Spark & Hadoop – IDS 200 Lecture Vocabulary
Distributed Databases
- Data too large for a single machine ➔ split across multiple nodes
- Key challenges: data placement, fast retrieval, avoiding device-level bottlenecks
- Real-world scale: Google's web index, Facebook photos, YouTube videos
MapReduce Paradigm
- Two stages (word-count sketch below):
  - Map: divide the job into sub-tasks and assign them to different nodes
  - Reduce: aggregate the partial outputs into the final result
- Original use case: Google search index processing
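A minimal plain-Python sketch of the two stages, using the classic word-count example (the chunk data is made up for illustration; a real cluster would run the map calls on different nodes and shuffle the pairs to the reducers):

```python
# Plain-Python sketch of the two MapReduce stages (word count).
# Everything runs locally here; Hadoop would distribute these calls.
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for each word in this node's chunk
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: aggregate the partial counts emitted by all mappers
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["spark and hadoop", "hadoop stores data", "spark caches data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))
# {'spark': 2, 'and': 1, 'hadoop': 2, 'stores': 1, 'data': 2, 'caches': 1}
```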
Hadoop Ecosystem
- Cluster of commodity nodes; each stores & processes its own data chunk
- Scheduler: YARN (Yet Another Resource Negotiator)
- Automatic replication & node scaling for fault-tolerance / elasticity
Hadoop Distributed File System (HDFS)
- Data on disk in fixed-size blocks (default 64 MB or 128 MB, depending on version; block arithmetic sketched below)
- NameNode holds metadata (file ➔ blocks ➔ DataNodes)
- Optimized for petabyte-scale, sequential read workloads
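A quick back-of-envelope sketch (file size and replication factor are assumed for illustration; 3x replication is the commonly cited HDFS default) of how one file decomposes into the blocks the NameNode tracks:

```python
# Back-of-envelope arithmetic: how one file maps onto HDFS blocks.
import math

file_size_mb = 1024      # hypothetical 1 GB file
block_size_mb = 128      # HDFS default block size (newer versions)
replication = 3          # assumed replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # blocks the NameNode tracks
copies = blocks * replication                      # block replicas on disk
print(f"{blocks} blocks -> {copies} replicas ({copies * block_size_mb} MB on disk)")
# 8 blocks -> 24 replicas (3072 MB on disk)
```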
Hadoop Limitations
- Disk-based I/O ➔ too slow for real-time and iterative ML workloads
- Introduced ~20 years ago; many orgs now moving beyond classic MapReduce
Spark Highlights
- Built to run atop Hadoop (reuses HDFS & YARN) but keeps working data in RAM (PySpark sketch below)
- Supports MapReduce-style jobs plus richer APIs (MLlib, Spark SQL, streaming)
- Performance: markedly faster; Cost: higher RAM requirements
- Demands greater technical skill for deployment & tuning
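As a rough sketch (assuming a local Spark installation; the HDFS input path is hypothetical), the same word count expressed in PySpark; `cache()` is what pins the working set in memory for reuse:

```python
# The same word count in PySpark, with the intermediate data kept in RAM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()              # pin the RDD in memory for subsequent actions
print(counts.take(5))       # first five (word, count) pairs
spark.stop()
```

Note the contrast with classic MapReduce: the chained transformations stay in memory between steps instead of being written back to disk after each stage.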
Choosing Hadoop vs. Spark
- Hadoop = lower cost, acceptable for batch or less time-critical jobs
- Spark = higher speed & flexibility, preferred for iterative analytics / ML
- Mixed environments common (e.g., Amazon retail, many government systems)
Takeaways for IDS Majors
- Learn underlying principles, not just current tools
- Legacy concepts persist in evolved forms ➔ foundation for new tech
- Demonstrating breadth + growth mindset valued in interviews