Flashcards for review of the Big Data Processing Concepts lecture.
Parallel Data Processing
Involves the simultaneous execution of multiple sub-tasks that collectively make up a larger task, reducing overall execution time; typically occurs within a single machine with multiple processors or cores.
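A minimal sketch of the idea, assuming Python's standard multiprocessing module (the squaring task and worker count are illustrative): a larger task is broken into per-element sub-tasks that execute simultaneously on one machine's cores.

```python
from multiprocessing import Pool

def square(n):
    # One sub-task of the larger "square every number" task.
    return n * n

if __name__ == "__main__":
    numbers = range(10)
    # The pool executes the sub-tasks simultaneously on multiple
    # cores of a single machine, reducing total execution time.
    with Pool(processes=4) as pool:
        results = pool.map(square, numbers)
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```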
Distributed Data Processing
Similar to parallel data processing but always achieved through physically separate machines networked together as a cluster, applying the divide-and-conquer principle.
Hadoop
An open-source framework for large-scale data storage and processing that is compatible with commodity hardware; it serves as the de facto industry platform for Big Data solutions and implements the MapReduce processing framework.
Batch Processing
Also known as offline processing; involves processing data in batches, usually imposing delays and resulting in high-latency responses. Typically involves large quantities of data with sequential reads and writes.
Transactional Processing
Also known as online processing; data is processed interactively without delay, resulting in low-latency responses. Typically involves small amounts of data with random reads and writes.
Cluster
Provides the mechanism to enable distributed data processing with linear scalability, ideal for Big Data processing as large datasets can be divided and processed in parallel.
MapReduce
A widely used implementation of a batch processing framework. It is highly scalable and reliable, and is based on the divide-and-conquer principle, which provides built-in fault tolerance and redundancy.
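The canonical word-count example, sketched here as a single-process simulation (the function names are hypothetical); a real framework such as Hadoop would run the map and reduce phases on separate cluster nodes and handle the shuffle, fault tolerance, and redundancy itself.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in a split of the input.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group values by key so each key is reduced in one place.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big cluster", "data data processing"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle_phase(mapped)))
# {'big': 2, 'data': 3, 'cluster': 1, 'processing': 1}
```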
Task Parallelism
Parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate processor, generally on a separate node in a cluster.
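A minimal sketch, assuming Python's concurrent.futures: two different sub-tasks of a larger text-analysis job (the count_words and count_chars helpers are hypothetical) run on separate processes at the same time.

```python
from concurrent.futures import ProcessPoolExecutor

def count_words(text):
    # Sub-task 1: count the words.
    return len(text.split())

def count_chars(text):
    # Sub-task 2: count the characters.
    return len(text)

if __name__ == "__main__":
    text = "task parallelism runs different sub-tasks at the same time"
    # Each sub-task runs on a separate processor; in a cluster,
    # each would generally run on a separate node.
    with ProcessPoolExecutor() as executor:
        words = executor.submit(count_words, text)
        chars = executor.submit(count_chars, text)
        print(words.result(), chars.result())
```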
Data Parallelism
Parallelization of data processing by dividing a dataset into multiple sub-datasets and processing each sub-dataset in parallel.
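A minimal sketch in Python (the chunk size and worker count are illustrative): the same operation is applied to each sub-dataset in parallel, and the partial results are combined afterwards.

```python
from multiprocessing import Pool

def partial_sum(sub_dataset):
    # Each worker processes its own sub-dataset independently.
    return sum(sub_dataset)

if __name__ == "__main__":
    dataset = list(range(1_000_000))
    # Divide the dataset into four sub-datasets.
    chunk = len(dataset) // 4
    sub_datasets = [dataset[i:i + chunk] for i in range(0, len(dataset), chunk)]
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, sub_datasets)
    # Combine the partial results into the final answer.
    print(sum(partials))  # 499999500000
```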
Realtime Mode (Big Data)
Data is processed in-memory as it is captured, before being persisted to disk. Response times generally range from sub-second to under a minute. Also called event or stream processing.
Speed, Consistency, Volume (SCV) Principle
States that a distributed data processing system can be designed to support only two of the following three requirements: Speed, Consistency, and Volume.
Event Stream Processing (ESP)
During ESP, an incoming stream of events, generally from a single source and ordered by time, is continuously analyzed. The analysis can occur via simple queries or the application of algorithms that are mostly formula-based.
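A minimal sketch of the idea in plain Python (the sensor stream and the moving-average formula are illustrative assumptions): each event in a time-ordered stream from a single source is analyzed as it arrives, using a simple formula-based computation over a sliding window.

```python
from collections import deque

def rolling_average(events, window_size=3):
    # Continuously analyze an incoming, time-ordered event stream,
    # emitting a formula-based result (a moving average) per event.
    window = deque(maxlen=window_size)
    for timestamp, value in events:
        window.append(value)
        yield timestamp, sum(window) / len(window)

# A time-ordered stream of (timestamp, reading) events from one source.
stream = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 15.0)]
for ts, avg in rolling_average(stream):
    print(f"t={ts} avg={avg:.2f}")
```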