Data_Engineering_Part3

24 Terms

Apache Spark

An open-source distributed computing engine for large-scale data processing and analytics.

Delta Lake

An open-source storage layer that adds ACID transactions to Apache Spark workloads, improving data reliability.

Native Execution Engine

An enhancement for Apache Spark workloads that executes Spark queries directly on lakehouse infrastructure, improving performance.

Fabric Runtime

An Azure-integrated platform based on Apache Spark that facilitates data engineering and data science experiences.

Vendor Lock-in

A situation where customers cannot easily switch suppliers due to compatibility or cost considerations.

ETL

Extract, Transform, Load; a process that extracts data from a source system, transforms it into a standard format, and loads it into a target system.
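
The three stages can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the sample CSV, the `customers` table, and the function names are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file or an API.
RAW_CSV = """name,signup_date,amount
 Alice ,2024-01-05,10.50
BOB,2024-01-06,7
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy up names and convert amounts to integer cents."""
    return [
        (row["name"].strip().title(), row["signup_date"], round(float(row["amount"]) * 100))
        for row in rows
    ]

def load(records, conn):
    """Load: write the standardised records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, amount_cents INTEGER)"
    )
    with conn:  # commit the inserts as one transaction
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount_cents FROM customers ORDER BY name").fetchall()
```

Keeping each stage in its own function makes it possible to test the transformation without touching the source or the target.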

ACID Transactions

A set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee database transactions are processed reliably.
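
Atomicity is the easiest of the four properties to demonstrate. The sketch below uses SQLite from the Python standard library, not Delta Lake or Spark; the `accounts` table and the `transfer` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # CHECK fails, the credit to bob is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.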

Parquet

A columnar storage file format optimized for use with big data processing frameworks.
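
To see what "columnar" means, compare a row-oriented layout with a column-oriented one in plain Python. This only illustrates the idea and is not Parquet's actual on-disk format; the sample records and the `to_columns` helper are assumptions for the example.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "US", "amount": 7.5},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [record[key] for record in records] for key in records[0]}

# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)

# An aggregate over one field only needs to touch that column.
total_amount = sum(columns["amount"])
```

Analytical queries usually read a few columns across many rows, which is why storing each column contiguously suits them.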

TPC-DS Benchmark

A benchmark designed to measure the performance of decision support systems.

Data Lakehouse

An architectural pattern that combines the low-cost, flexible storage of data lakes with the data management and query capabilities of data warehouses.

Auto-scaling

The capability of a system to automatically adjust its resource allocation based on the current workload.
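
One simple way to picture the adjustment is a function that sizes a cluster from the pending work and clamps the result to configured limits. This is a hypothetical policy for illustration only; the function name and parameters are assumptions, and real platforms use their own rules.

```python
import math

def target_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count needed for the pending work, clamped to the allowed range."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))

idle = target_nodes(pending_tasks=0, tasks_per_node=4, min_nodes=1, max_nodes=10)
busy = target_nodes(pending_tasks=17, tasks_per_node=4, min_nodes=1, max_nodes=10)
overloaded = target_nodes(pending_tasks=100, tasks_per_node=4, min_nodes=1, max_nodes=10)
```

The lower bound keeps a minimum of capacity available when the system is idle, and the upper bound caps cost under heavy load.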

Job Admission Logic

Rules that govern how jobs are queued and executed within a computing environment.
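
As a rough illustration, the sketch below admits a job when enough cores are free and queues it otherwise, releasing queued jobs in first-in, first-out order. It is a hypothetical policy, not the admission logic of any particular platform; `AdmissionController` and its methods are names invented for the example.

```python
from collections import deque

class AdmissionController:
    """Admit jobs while cores are free; queue the rest in FIFO order."""

    def __init__(self, total_cores):
        self.free_cores = total_cores
        self.queue = deque()  # (job_id, cores) pairs waiting for capacity
        self.running = {}     # job_id -> cores currently held

    def _start(self, job_id, cores):
        self.free_cores -= cores
        self.running[job_id] = cores

    def submit(self, job_id, cores):
        # Admit immediately only if nothing is waiting, so jobs cannot jump the queue.
        if not self.queue and cores <= self.free_cores:
            self._start(job_id, cores)
            return "running"
        self.queue.append((job_id, cores))
        return "queued"

    def finish(self, job_id):
        self.free_cores += self.running.pop(job_id)
        # Release queued jobs in order while the job at the head of the queue fits.
        while self.queue and self.queue[0][1] <= self.free_cores:
            self._start(*self.queue.popleft())

controller = AdmissionController(total_cores=8)
states = [controller.submit("a", 6), controller.submit("b", 4), controller.submit("c", 2)]
controller.finish("a")
```

Job "c" would fit next to "a" at submission time, but it waits behind "b" because this sketch favours strict ordering over utilisation.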

Interactive Queries

Queries that allow users to interact with data in real time, often used for data exploration.

Runtime Version

The specific version of software that executes a program, which may affect compatibility and performance.

High concurrency mode

A feature that allows multiple users to share the same computation resources to optimize performance.

Structured Streaming

A scalable, fault-tolerant stream processing engine built on the Spark SQL engine.

Custom Spark Pools

User-defined configurations for clusters in Apache Spark environments to manage specific workloads.

Migration Scenarios

Procedures and considerations for transitioning from one version of software to another.

Lakehouse Architecture

Combines data lakes and data warehouses into a unified architecture that supports both storage and analytics.

Open-source

Software with source code that is made available to the public for use, modification, and distribution.

Job Definition

A set of configurations and scripts that specify how a Spark job should be executed.

Library Management

The process of handling software libraries within a computing environment to ensure compatibility and availability.

Workspace Settings

Configuration options that control the parameters and resources available to a user within a software platform.

Experimental Preview

A phase of a software release where new features are tested by users before general availability.
