Big Data practicing concepts (BCS061/BCDS-601/KOE-097)

Description and Tags

This set of vocabulary flashcards covers core Big Data concepts including Hadoop architecture, types of digital data, 5 Vs, HDFS components, MapReduce phases, and various ecosystem tools like Hive, Pig, Spark, and HBase.

Last updated 6:41 PM on 5/18/26

38 Terms

New cards

Digital Data

Any information that is stored or processed using computers, such as photos, WhatsApp messages, or YouTube videos.

Structured Data

Data organized in a fixed format of rows and columns, making it easy to search and store in spreadsheets like Excel or relational databases like MySQL.

Unstructured Data

Data that has no fixed format and is difficult to organize into rows and columns, such as a photo, a video, or free-form text.

Semi-Structured Data

Data that is partly organized using tags or keys rather than a full table form, such as a JSON file.
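
As a small illustration of "partly organized", the sketch below parses two hypothetical JSON records that both use keys but do not share one fixed set of columns. The record contents and variable names are assumptions made up for this example.

```python
import json

# Hypothetical records: both are keyed, but each carries a different set of fields.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["555-0100", "555-0101"], "city": "Pune"}
]
"""

records = json.loads(raw)

# The keys each record actually carries (no shared table schema is enforced).
keys_per_record = [sorted(record.keys()) for record in records]

# The union of keys is what a full table would need as columns.
all_keys = sorted(set().union(*(record.keys() for record in records)))

print(keys_per_record)
print(all_keys)
```

A structured table would force every row to have all four columns; here each record only carries the fields it needs.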

5 Vs of Big Data

The core characteristics of Big Data: Volume (amount), Velocity (speed of data), Variety (different types), Veracity (accuracy/reliability), and Value (usefulness).

Batch Processing

A method where large amounts of data are processed in chunks or bulk rather than instantly, such as processing a whole day's sales data at night.
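
The nightly sales example can be sketched roughly as follows: records accumulate during the day and are processed together in one pass. The store names, amounts, and the function `run_nightly_batch` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records collected during one day: (store, amount).
days_sales = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)]


def run_nightly_batch(records):
    """Process the whole day's records in one pass and return totals per store."""
    totals = defaultdict(float)
    for store, amount in records:
        totals[store] += amount
    return dict(totals)


totals = run_nightly_batch(days_sales)
print(totals)
```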

Real-Time Message Ingestion

A process that collects data as soon as it is created, such as a sensor sending temperature readings every second.

Orchestration

The component that schedules and coordinates the steps of a Big Data pipeline (such as ingestion, processing, and storage), acting as a controller that runs them in the right order and keeps the workflow running properly.

Apache Hadoop

An open-source framework designed to store and process large datasets across multiple computers in a distributed manner using inexpensive commodity hardware.

HDFS (Hadoop Distributed File System)

The storage system used in Hadoop that breaks large files into blocks and distributes them across a cluster.

NameNode

The 'master' or manager in HDFS that keeps track of the metadata and where each data block is stored, but does not store actual data.

DataNodes

The 'worker' nodes in HDFS that store the actual blocks of data and follow instructions from the NameNode.

MapReduce

A programming model in Hadoop that processes data in two steps: Map (turning input records into intermediate key-value pairs) and Reduce (combining the values for each key into final results).
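
A minimal single-process sketch of the two steps, using word counting as the task. This is not Hadoop's API: the function names are hypothetical, and the `shuffle` step (grouping values by key between Map and Reduce) is included as an assumption about how the two steps connect.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key reaches one reduce call."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

In a real cluster the map calls would run in parallel on different nodes, close to the data blocks they read.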

YARN (Yet Another Resource Negotiator)

A component of Hadoop that manages resources like memory and CPU across the cluster and decides which node runs which task.

Block Abstraction

Specific to HDFS, the splitting of files into fixed-size parts (128 MB by default in Hadoop 2.x and later, often configured to 256 MB) so that a file is treated as a set of blocks rather than as a whole.
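
The arithmetic behind block splitting can be sketched as below, assuming a 128 MB block size and sizes given in whole megabytes. The helper `split_into_blocks` is hypothetical and only models block sizes, not HDFS itself.

```python
BLOCK_SIZE_MB = 128  # assumed block size for this sketch


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data, not a full block.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(500)
print(blocks)
```

A 500 MB file therefore becomes three full blocks plus one smaller final block.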

Data Replication

The process where every block of a file is copied multiple times (default: 3 copies) and saved on different machines for fault tolerance.
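
To see why copies on different machines give fault tolerance, the sketch below places three replicas of each block on distinct nodes and checks what is still readable after one node fails. The round-robin placement and the node names are simplifying assumptions; real HDFS uses its own placement policy.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica on a surviving node."""
    return [
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
placement = place_replicas(4, nodes)
survivors = readable_blocks(placement, failed_node="dn1")
print(placement)
print(survivors)
```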

Apache Flume

A tool used to collect streaming data (like logs from web servers) and move it into Hadoop as it is generated.

Sqoop

A tool used for importing and exporting data between Hadoop and relational databases like MySQL or Oracle.

Hadoop Archives (HAR)

A feature used to pack many small files into a single archive file, which reduces the amount of file metadata the NameNode must keep in memory and improves small-file handling in HDFS.

Avro

A data serialization system for storing and exchanging data in a compact binary format that carries its schema with it, with support for many programming languages.

Kerberos

The authentication protocol used in Hadoop to check the identity of users trying to access the cluster.

Delegation Token

A temporary key given to a user or application to access Hadoop services without needing repeated logins.

Intelligent Data Analysis (IDA)

Using smart methods like machine learning, AI, and statistical tools to automatically understand data and find hidden patterns.

Scale-Out

The process of improving system performance by adding more machines (nodes) to a cluster rather than upgrading a single machine's power.

NoSQL Database

A database type used for Big Data that has a flexible schema or no fixed schema, and that handles unstructured or semi-structured data better than traditional relational (SQL) databases.

MongoDB

A popular document-oriented NoSQL database that stores data in flexible, JSON-like documents rather than tables and rows.

Apache Spark

A fast, general-purpose processing engine that can keep data in memory between steps, often making it significantly faster than Hadoop MapReduce, which writes intermediate results to disk.

Resilient Distributed Dataset (RDD)

The fundamental data structure in Spark, which splits a dataset into partitions spread across computers and can rebuild lost partitions using lineage if a node fails.
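
The idea of rebuilding data from lineage can be sketched with a toy class. `ToyRDD` is a hypothetical stand-in written for this example, not Spark's API, and it assumes the root data is stored durably so it can always be re-read.

```python
class ToyRDD:
    """Toy dataset: partitions plus the recipe (lineage) needed to rebuild them."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.source = partitions  # only set on the root dataset (assumed durable)
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # lineage: how each element is derived
        self.cache = {}           # computed partitions, which may be lost

    def map(self, fn):
        # Record the transformation instead of copying any data.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        # Recompute a partition from its lineage if it is not cached.
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source[index])
            else:
                parent_data = self.parent.compute(index)
                self.cache[index] = [self.fn(x) for x in parent_data]
        return self.cache[index]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

before = squared.collect()
squared.cache.pop(1)        # simulate losing partition 1 in a crash
after = squared.collect()   # partition 1 is rebuilt from base via the recorded fn
print(before, after)
```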

Scala

A programming language that runs on the Java Virtual Machine and combines object-oriented and functional programming features, commonly used with Apache Spark for Big Data processing.

Closure

A function in Scala that remembers the values of variables from the environment where it was created.
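
The same behavior can be shown in a few lines. The sketch below is written in Python rather than Scala, purely as an analogous illustration of a function capturing a variable from its enclosing scope; `make_multiplier` is a hypothetical name.

```python
def make_multiplier(factor):
    """Return a function that remembers `factor` from this enclosing call."""

    def multiply(x):
        # `factor` is not a parameter of multiply; it is captured from outside.
        return x * factor

    return multiply


triple = make_multiplier(3)
result = triple(5)  # make_multiplier has returned, yet factor=3 is remembered
print(result)
```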

Apache Pig

A data flow platform that uses a language called Pig Latin to analyze big data, converting the scripts into MapReduce jobs.

Grunt

The command-line interface (CLI) for Pig where users can type and run Pig commands step-by-step.

Apache Hive

A data warehouse tool built on top of Hadoop that allows users to manage and query large datasets using a SQL-like language called HiveQL.

Hive Metastore

A service that acts as a catalog, storing metadata about Hive tables, such as their columns, data types, and locations in HDFS.

HBase

A distributed, column-oriented NoSQL database that runs on top of HDFS, optimized for real-time read and write access to billions of rows.

ZooKeeper

A coordination tool used in Hadoop and HBase to track node status, manage leader election, and ensure smooth cluster operation.

BigSQL

An IBM tool that allows users to write standard SQL queries to interact with and analyze data stored in HDFS, Hive, or HBase.

BigSheets

An IBM tool within BigInsights that provides a spreadsheet-style interface for non-technical users to analyze Big Data without coding.