Data_Engineering_Part3


24 Terms

1

Apache Spark

An open-source distributed computing engine for large-scale data processing and analytics.

2

Delta Lake

An open-source storage layer that brings ACID transactions and other data reliability features to Apache Spark.

3

Native Execution Engine

An enhancement for Apache Spark workloads that runs Spark queries directly on lakehouse infrastructure to improve performance.

4

Fabric Runtime

An Azure-integrated platform based on Apache Spark that runs and manages data engineering and data science workloads.

5

Vendor Lock-in

A situation where customers cannot easily switch suppliers due to compatibility or cost considerations.

6

ETL

Extract, Transform, Load; a process that extracts data from a source system, transforms it into a standard format, and loads it into a target system.
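
A minimal sketch of the three ETL stages, assuming a made-up CSV string as the source and an in-memory SQLite database as the target. The `sales` table, the column names, and the function names are hypothetical and only serve the illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or an API.
RAW_CSV = "id,amount,currency\n1, 10.50 ,usd\n2,3.00,USD\n"

def extract(text):
    """Read raw rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Normalize types, whitespace, and casing into one standard format."""
    return [
        (int(r["id"]), float(r["amount"].strip()), r["currency"].strip().upper())
        for r in rows
    ]

def load(records, conn):
    """Write the cleaned records into the target system."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT id, amount, currency FROM sales ORDER BY id").fetchall()
print(rows)
```

Keeping each stage in its own function makes it possible to test the transform step without touching either system.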

7

ACID Transactions

A set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee database transactions are processed reliably.
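
Atomicity is the easiest of the four properties to see in code. The sketch below uses Python's built-in `sqlite3` module rather than Delta Lake or Spark; the `accounts` table and the `transfer` helper are hypothetical. A transfer that would overdraw an account fails a CHECK constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it fails
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```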

8

Parquet

A columnar storage file format optimized for use with big data processing frameworks.
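
To illustrate what "columnar" means, the sketch below pivots a few made-up records from a row layout into a column layout using plain Python. It does not read or write actual Parquet files, and the field names are hypothetical. It only shows why an aggregate over one field needs to touch just that field's values when data is stored by column.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Oslo", "temp_c": 4.0},
    {"id": 2, "city": "Lima", "temp_c": 19.5},
    {"id": 3, "city": "Oslo", "temp_c": 6.5},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An average over one field reads a single contiguous list,
# without visiting the "id" or "city" values at all.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
print(columns["temp_c"], avg_temp)
```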

9

TPC-DS Benchmark

A benchmark designed to measure the performance of decision support systems.

10

Data Lakehouse

An architectural pattern that combines the low-cost, flexible storage of a data lake with the transaction support and query performance of a data warehouse.

11

Auto-scaling

The capability of a system to automatically adjust its resource allocation based on the current workload.
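
One simple way to picture an auto-scaling decision is a function that maps the current workload to a node count within fixed bounds. This is a sketch under assumed rules, not how any particular platform sizes its pools; `target_nodes` and all of its parameters are hypothetical.

```python
import math

def target_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count needed for the workload, clamped to the allowed range."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))

# Idle system stays at the minimum, heavy load is capped at the maximum.
print(target_nodes(0, 4, 1, 10))
print(target_nodes(17, 4, 1, 10))
print(target_nodes(100, 4, 1, 10))
```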

12

Job Admission Logic

Rules that govern how jobs are queued and executed within a computing environment.
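
As an illustration of one possible admission rule, the sketch below admits queued jobs in strict first-in, first-out order as long as enough cores are free, and leaves the rest waiting. The rule, the job names, and the core counts are assumptions made for the example; real schedulers may use different policies.

```python
from collections import deque

def admit_jobs(queue, free_cores):
    """Admit jobs from the front of the queue while enough cores remain.

    Each queue entry is a (job_name, cores_required) tuple. A job that
    does not fit blocks everything behind it (strict FIFO).
    """
    running = []
    while queue and queue[0][1] <= free_cores:
        name, cores = queue.popleft()
        free_cores -= cores
        running.append(name)
    return running, free_cores

queue = deque([("etl_nightly", 8), ("report", 4), ("adhoc", 2)])
running, free = admit_jobs(queue, free_cores=12)
print(running, free, list(queue))
```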

13

Interactive Queries

Queries that allow users to interact with data in real time, often used for data exploration.

14

Runtime Version

The specific version of software that executes a program, which may affect compatibility and performance.

15

High concurrency mode

A feature that lets multiple users share the same compute resources, improving resource utilization and performance.

16

Structured Streaming

A stream processing engine built on the Spark SQL engine that enables scalable and fault-tolerant stream processing.
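
The core idea of incremental stream processing can be sketched without Spark: data arrives in small batches, and a running result is updated after each one. The plain-Python example below is only an analogy and does not use the Structured Streaming API; `process_stream` and the sample batches are hypothetical.

```python
from collections import Counter

def process_stream(batches):
    """Yield the running word counts after each micro-batch is processed."""
    state = Counter()  # state carried over from one batch to the next
    for batch in batches:
        for line in batch:
            state.update(line.split())
        yield dict(state)

# Three micro-batches of text lines; the last one is empty.
batches = [["spark delta"], ["spark"], []]
snapshots = list(process_stream(batches))
print(snapshots)
```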

17

Custom Spark Pools

User-defined configurations for clusters in Apache Spark environments to manage specific workloads.

18

Migration Scenarios

Procedures and considerations for transitioning from one version of software to another.

19

Lakehouse Architecture

Combines data lakes and data warehouses into a unified architecture that supports both storage and analytics.

20

Open-source

Software with source code that is made available to the public for use, modification, and distribution.

21

Job Definition

A set of configurations and scripts that specify how a Spark job should be executed.

22

Library Management

The process of handling software libraries within a computing environment to ensure compatibility and availability.

23

Workspace Settings

Configuration options that control the parameters and resources available to a user within a software platform.

24

Experimental Preview

A phase of a software release where new features are tested by users before general availability.
