
Spark & Hadoop – IDS 200 Lecture Vocabulary

Distributed Databases

  • Data too large for a single machine ➔ split across multiple nodes
  • Key challenges: data placement, fast retrieval, avoiding device-level bottlenecks
  • Real-world scale: Google's web index, Facebook photos, YouTube videos

MapReduce Paradigm

  • Two stages (see the word-count sketch after this list):
    • Map: split the input across nodes; each node applies the same function to its chunk, emitting intermediate key-value pairs
    • Reduce: aggregate the partial outputs, grouped by key, into the final result
  • Original use case: Google search index processing
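A minimal single-machine sketch of the two stages in plain Python (the sample lines are made up). The real framework runs the map function on many nodes in parallel and shuffles pairs by key over the network, but the shape is the same:

    from collections import defaultdict

    # Map stage: each node would run this on its own input chunk,
    # emitting a (word, 1) pair for every word it sees.
    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    # Shuffle: the framework groups all pairs by key so each reducer
    # sees every partial count for one word.
    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce stage: aggregate the partial outputs into the final result.
    def reduce_phase(grouped):
        return {word: sum(counts) for word, counts in grouped.items()}

    chunk = ["the quick brown fox", "the lazy dog"]  # stand-in for one input split
    print(reduce_phase(shuffle(map_phase(chunk))))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}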

Hadoop Ecosystem

  • Cluster of commodity nodes; each stores & processes its own data chunk
  • Scheduler: YARN (Yet Another Resource Negotiator)
  • Automatic replication & node scaling for fault-tolerance / elasticity
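A toy sketch of why replication buys fault tolerance, using HDFS's default replication factor of 3 (the node names are made up, and real HDFS placement is rack-aware rather than uniformly random):

    import random

    NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
    REPLICATION = 3  # HDFS default

    # Place each block's replicas on distinct nodes.
    def place_replicas(block_ids, nodes=NODES, k=REPLICATION):
        return {b: random.sample(nodes, k) for b in block_ids}

    placement = place_replicas(["blk_001", "blk_002"])

    # Simulate one node failing: every block still has two live copies.
    failed = "node2"
    for block, replicas in placement.items():
        print(block, "survivors:", [n for n in replicas if n != failed])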

Hadoop Distributed File System (HDFS)

  • Data stored on disk in fixed-size blocks (default 64 or 128 MB)
  • NameNode holds metadata (file ➔ blocks ➔ DataNodes); see the sketch after this list
  • Optimized for petabyte-scale, sequential read workloads
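Back-of-the-envelope block math, plus a toy version of the NameNode's file ➔ blocks map, assuming the 128 MB default (the file name and block IDs are made up):

    import math

    BLOCK_SIZE = 128 * 1024 * 1024  # default block size, in bytes

    # Number of fixed-size blocks a file occupies (last block may be partial).
    def block_count(file_size_bytes):
        return math.ceil(file_size_bytes / BLOCK_SIZE)

    print(block_count(1 * 1024**3))  # a 1 GB file -> 8 blocks

    # Toy NameNode metadata: file -> ordered block IDs. Real HDFS also
    # tracks which DataNodes hold each block's replicas.
    namenode = {"/logs/2024-01-01.log": ["blk_1001", "blk_1002", "blk_1003"]}
    print(namenode["/logs/2024-01-01.log"])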

Hadoop Limitations

  • Writes intermediate results to disk between stages ⇒ too slow for real-time and iterative ML workloads
  • Introduced ~20 years ago; many orgs now moving beyond classic MapReduce

Spark Highlights

  • Built to run atop Hadoop (reuses HDFS & YARN) but keeps working data in RAM
  • Supports MapReduce-style jobs plus richer APIs (MLlib, Spark SQL, streaming); see the PySpark sketch after this list
  • Performance: markedly faster; cost: higher RAM requirements
  • Demands greater technical skill for deployment & tuning
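A minimal PySpark sketch, assuming pyspark is installed and running in local mode (the input path is hypothetical). It is the same word count as the MapReduce sketch above, but cache() keeps the intermediate result in RAM so a second action reuses it instead of recomputing from disk:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("wordcount-sketch")
             .master("local[*]")
             .getOrCreate())

    # Hypothetical path; on a cluster this would be an HDFS URI.
    lines = spark.sparkContext.textFile("sample.txt")

    counts = (lines.flatMap(lambda line: line.split())  # Map: emit words
                   .map(lambda w: (w.lower(), 1))       # Map: (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b)     # Reduce: sum per word
                   .cache())                            # keep the result in RAM

    print(counts.take(5))   # first action: computes and caches
    print(counts.count())   # second action: served from memory

    spark.stop()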

Choosing Hadoop vs. Spark

  • Hadoop = lower cost, acceptable for batch or less time-critical jobs
  • Spark = higher speed & flexibility, preferred for iterative analytics / ML
  • Mixed environments common (e.g., Amazon retail, many government systems)

Takeaways for IDS Majors

  • Learn underlying principles, not just current tools
  • Legacy concepts persist in evolved forms ➔ foundation for new tech
  • Demonstrating breadth + growth mindset valued in interviews