This set of vocabulary flashcards covers core Big Data concepts including Hadoop architecture, types of digital data, 5 Vs, HDFS components, MapReduce phases, and various ecosystem tools like Hive, Pig, Spark, and HBase.
Digital Data
Any information that is stored or processed using computers, such as photos, WhatsApp messages, or YouTube videos.
Structured Data
Data organized in a fixed format of rows and columns, making it easy to search and store in spreadsheets like Excel or relational databases like MySQL.
Unstructured Data
Data that has no fixed format and is difficult to organize into tables, such as photos, videos, or free-form text.
Semi-Structured Data
Data that is partly organized using tags or keys rather than a full table form, such as a JSON file.
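As a small illustration of the definition above, the sketch below parses a JSON record with Python's standard `json` module. The record and its field names are hypothetical; the point is only that values are labeled by keys, may be nested, and may be missing without breaking a fixed table layout.

```python
import json

# Hypothetical record: keys label each value, but there is no fixed table schema.
raw = '{"name": "Asha", "tags": ["hadoop", "spark"], "address": {"city": "Pune"}}'

record = json.loads(raw)

# Values are reached by key; nesting is allowed.
city = record["address"]["city"]

# A key that is absent simply yields None here; no schema is violated.
phone = record.get("phone")

print(city, phone)
```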
5 Vs of Big Data
The core characteristics of Big Data: Volume (amount), Velocity (speed of data), Variety (different types), Veracity (accuracy/reliability), and Value (usefulness).
Batch Processing
A method where large amounts of data are processed in chunks or bulk rather than instantly, such as processing a whole day's sales data at night.
Real-Time Message Ingestion
A process that collects data as soon as it is created, such as a sensor sending temperature readings every second.
Orchestration
The component that schedules and coordinates the steps of a Big Data architecture, acting as a controller so that each step runs in the right order.
Apache Hadoop
An open-source framework designed to store and process large datasets in a distributed manner across clusters of inexpensive commodity computers.
HDFS (Hadoop Distributed File System)
The storage system used in Hadoop that breaks large files into blocks and distributes them across a cluster.
NameNode
The 'master' or manager in HDFS that keeps track of the metadata and where each data block is stored, but does not store actual data.
DataNodes
The 'worker' nodes in HDFS that store the actual blocks of data and follow instructions from the NameNode.
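The split between NameNode and DataNodes can be pictured with a toy model. This is a minimal sketch, not how HDFS is implemented: the dictionaries, block IDs, node names, and file path are all made up for illustration.

```python
# Toy "DataNodes": each one holds the actual bytes of some blocks.
datanodes = {
    "dn1": {"blk_0": b"hello "},
    "dn2": {"blk_1": b"world"},
}

# Toy "NameNode": only metadata, i.e. which blocks make up a file and where they live.
namenode = {
    "/docs/greeting.txt": [("blk_0", "dn1"), ("blk_1", "dn2")],
}

def read_file(path):
    # Ask the NameNode for block locations, then fetch the bytes from the DataNodes.
    return b"".join(datanodes[node][block] for block, node in namenode[path])

content = read_file("/docs/greeting.txt")
print(content)
```

Note that the `namenode` dictionary never contains file contents, which mirrors the idea that the master tracks locations while the workers store data.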
MapReduce
A programming model in Hadoop that processes data in two steps: Map (turning each input record into key-value pairs) and Reduce (combining all values that share a key into a final result).
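The two steps can be sketched with a word count on a single machine. This is plain Python that imitates the model, not the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions made for the example. In a real cluster the framework performs the grouping step between Map and Reduce.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values that share a key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)

lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```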
YARN (Yet Another Resource Negotiator)
A component of Hadoop that manages resources like memory and CPU across the cluster and decides which computer performs which job.
Block Abstraction
In HDFS, the splitting of files into fixed-size blocks (128 MB by default in Hadoop 2 and later, often configured to 256 MB) so that a file is treated as a set of blocks rather than a single unit.
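A quick worked example of the arithmetic, assuming a 128 MB block size and a hypothetical 500 MB file. The helper name `split_into_blocks` is made up for this sketch.

```python
BLOCK_SIZE_MB = 128  # assumed block size; it is configurable per cluster

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of the given size is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left over
    return sizes

blocks = split_into_blocks(500)
print(blocks)  # [128, 128, 128, 116]
```

The file becomes four blocks, and the final block is smaller than the block size because it only holds the remaining 116 MB.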
Data Replication
The process where every block of a file is copied multiple times (default: 3 copies) and saved on different machines for fault tolerance.
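The sketch below shows why keeping each copy on a different machine gives fault tolerance. It uses a simple round-robin placement, which is an assumption for illustration only; real HDFS uses a rack-aware placement policy. Node and block names are hypothetical.

```python
REPLICATION_FACTOR = 3  # the default number of copies mentioned above

def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    # Simplified round-robin placement: each block goes to `replication` distinct nodes.
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_after_failure(placement, failed_node):
    # A block stays readable if at least one replica lives on a healthy node.
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
print(placement)
print(readable_after_failure(placement, "dn2"))
```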
Apache Flume
A tool used to bring live, real-time streaming data (like logs from web servers) into Hadoop.
Sqoop
A tool used for importing and exporting data between Hadoop and relational databases like MySQL or Oracle.
Hadoop Archives (HAR)
A feature that packs many small files into a single archive file, reducing the number of files the NameNode must track and so easing its memory load in HDFS.
Avro
A data serialization system that stores and exchanges data in a compact binary format with the schema included alongside the data, with support for many programming languages.
Kerberos
The authentication protocol used in Hadoop to check the identity of users trying to access the cluster.
Delegation Token
A temporary key given to a user or application to access Hadoop services without needing repeated logins.
Intelligent Data Analysis (IDA)
Using smart methods like machine learning, AI, and statistical tools to automatically understand data and find hidden patterns.
Scale-Out
The process of improving system performance by adding more machines (nodes) to a cluster rather than upgrading a single machine's power.
NoSQL Database
A database type used for Big Data that has a flexible or optional schema and handles unstructured or semi-structured data more easily than traditional relational (SQL) databases.
MongoDB
A popular document-oriented NoSQL database that stores data in flexible, JSON-like documents rather than tables and rows.
Apache Spark
A fast processing engine that can keep intermediate data in memory, making it significantly faster than Hadoop MapReduce for many workloads, especially iterative ones.
Resilient Distributed Dataset (RDD)
The fundamental data structure in Spark that breaks big data into pieces across computers and can rebuild data using lineage if a crash occurs.
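The idea of rebuilding data from lineage can be sketched with a toy class. This is not the Spark API: `ToyRDD` is a hypothetical single-machine stand-in that records which parent it came from and which transformation produced it, so a lost result can be recomputed instead of restored from a copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how to compute its data."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, only set at the root of the lineage
        self.parent = parent        # the dataset this one was derived from
        self.transform = transform  # function applied to the parent's data
        self.cache = None           # computed result, which may be lost

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Recompute from the lineage whenever the cached result is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

numbers = ToyRDD(source=range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.cache = None          # simulate losing the computed data in a crash
rebuilt = even_squares.collect()   # recomputed by replaying map, then filter
print(first, rebuilt)
```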
Scala
A programming language that runs on the Java Virtual Machine and combines object-oriented and functional programming features, commonly used with Apache Spark for Big Data processing.
Closure
A function in Scala that remembers the values of variables from the environment where it was created.
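The same concept can be shown in a few lines. The sketch below is written in Python rather than Scala, purely as an illustration of the idea; the names `make_multiplier` and `factor` are made up for the example.

```python
def make_multiplier(factor):
    # `multiply` is a closure: it keeps access to `factor`
    # even after make_multiplier has returned.
    def multiply(x):
        return x * factor
    return multiply

triple = make_multiplier(3)
result = triple(5)
print(result)  # 15
```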
Apache Pig
A data flow platform that uses a language called Pig Latin to analyze big data, converting the scripts into MapReduce jobs.
Grunt
The command-line interface (CLI) for Pig where users can type and run Pig commands step-by-step.
Apache Hive
A data warehouse tool built on top of Hadoop that allows users to manage and query large datasets using a SQL-like language called HiveQL.
Hive Metastore
A service that acts as a catalog, storing information about Hive tables, columns, and data types, and their locations in HDFS.
HBase
A distributed, column-oriented NoSQL database that runs on top of HDFS, optimized for real-time read and write access to billions of rows.
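One rough way to picture the data model is a nested mapping from row key to `family:qualifier` columns. This is a toy in-memory sketch, not the HBase client API; the table contents and the `get` and `put` helpers are hypothetical, and real HBase additionally keeps rows sorted by key and versions each cell by timestamp.

```python
# Toy table: row key -> {"family:qualifier": value}
table = {
    "user#001": {"info:name": "Asha", "info:city": "Pune", "stats:logins": 42},
    "user#002": {"info:name": "Ravi"},  # rows need not share the same columns
}

def get(row_key, column):
    # Look up a single cell by row key and column; None if it does not exist.
    return table.get(row_key, {}).get(column)

def put(row_key, column, value):
    # Write a single cell, creating the row if it is new.
    table.setdefault(row_key, {})[column] = value

put("user#002", "stats:logins", 1)
print(get("user#001", "info:city"), get("user#002", "stats:logins"))
```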
ZooKeeper
A coordination tool used in Hadoop and HBase to track node status, manage leader election, and ensure smooth cluster operation.
BigSQL
An IBM tool that allows users to write standard SQL queries to interact with and analyze data stored in HDFS, Hive, or HBase.
BigSheets
An IBM tool within BigInsights that provides a spreadsheet-style interface for non-technical users to analyze Big Data without coding.