Data_Engineering_Part3

Page 1: Introduction to Fabric Runtime

Overview

  • Microsoft Fabric Runtime: An Azure-integrated platform based on Apache Spark.

  • Combines internal and open-source components to enable data engineering and data science experiences.

  • Often referred to simply as Fabric Runtime.

Major Components

  • Apache Spark: An open-source distributed computing library designed for large-scale data processing and analytics.

  • Delta Lake: An open-source storage layer that adds ACID transactions and data reliability features to Apache Spark.

  • Native Execution Engine: Runs Spark queries directly on lakehouse infrastructure, with up to 4x faster query performance than traditional open-source (OSS) Spark.

    • Compatible with Parquet and Delta formats supported in Runtime 1.3.

    • Built on Meta's Velox and Intel's Apache Gluten.

  • Default-level Packages: Java/Scala, Python, and R packages that come pre-installed for ease of use.

Page 2: Runtime Versions and Optimizations

Current Production Version

  • Use the most recent General Availability (GA) runtime version for production workloads; among the runtimes compared below, that is Runtime 1.3.

Runtime Comparison

Version     | Apache Spark | Operating System | Java | Scala   | Python | Delta Lake | R
Runtime 1.1 | 3.3.1        | Ubuntu 18.04     | 8    | 2.12.15 | 3.10   | 2.2.0      | 4.2.2
Runtime 1.2 | 3.4.1        | Mariner 2.0      | 11   | 2.12.17 | 3.10   | 2.4.0      | 4.2.2
Runtime 1.3 | 3.5.0        | Mariner 2.0      | 11   | 2.12.17 | 3.11   | 3.2        | 4.4.1

  • See the documentation for each runtime for its detailed features and migration scenarios.
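When planning a move between runtimes, it can help to capture the comparison above as data and list what changes. The following is a minimal Python sketch: `RUNTIMES` and `diff_runtimes` are illustrative names, not part of any Fabric API, and the values are simply copied from the comparison.

```python
# Component versions copied from the runtime comparison above.
RUNTIMES = {
    "1.1": {"spark": "3.3.1", "os": "Ubuntu 18.04", "java": "8",
            "scala": "2.12.15", "python": "3.10", "delta": "2.2.0", "r": "4.2.2"},
    "1.2": {"spark": "3.4.1", "os": "Mariner 2.0", "java": "11",
            "scala": "2.12.17", "python": "3.10", "delta": "2.4.0", "r": "4.2.2"},
    "1.3": {"spark": "3.5.0", "os": "Mariner 2.0", "java": "11",
            "scala": "2.12.17", "python": "3.11", "delta": "3.2", "r": "4.4.1"},
}


def diff_runtimes(old, new):
    """Return {component: (old_version, new_version)} for components that differ."""
    a, b = RUNTIMES[old], RUNTIMES[new]
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


# Example: list what changes when moving from Runtime 1.2 to Runtime 1.3.
for component, (before, after) in diff_runtimes("1.2", "1.3").items():
    print(f"{component}: {before} -> {after}")
```

For the 1.2 to 1.3 move, the diff contains Spark, Python, Delta Lake, and R, while the operating system, Java, and Scala entries are unchanged.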

Fabric Optimizations

  • Both the Spark engine and the Delta Lake implementation incorporate Fabric-specific optimizations designed for native integration within the platform.

  • Spark includes nearly 100 built-in query performance enhancements.

Page 3: Writer Capabilities and Runtime Changes

Writer Capabilities in Fabric

  • Optimized writing processes for better performance.

  • Default V-Order optimization for Delta Parquet files enhances read performance.

Support for Multiple Runtimes

  • Users can switch between multiple runtimes, which minimizes the risk of incompatibilities or disruptions.

  • Changing the runtime version at the workspace level affects all system-created items in that workspace, so review the migration guidance before switching.

Page 4: Consequences of Runtime Changes on Settings

Migration of Spark Settings

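  • Fabric aims to migrate all Spark settings to the new runtime; if a setting is incompatible with the target runtime, a warning is issued and the setting is not applied.

  • Spark settings are either mutable or immutable, and the two kinds are handled differently.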
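The migrate-or-warn behavior described above can be pictured with a small Python sketch. Everything here is hypothetical: `migrate_settings`, the `INCOMPATIBLE_IN_TARGET` set, and the setting name inside it are invented for illustration, and the sketch assumes an incompatible setting is skipped rather than applied. The real compatibility rules are internal to Fabric.

```python
import warnings

# Hypothetical set of settings rejected by the target runtime. The key below is
# a made-up placeholder, not a real Spark property.
INCOMPATIBLE_IN_TARGET = {"spark.example.legacy.flag"}


def migrate_settings(settings, incompatible=INCOMPATIBLE_IN_TARGET):
    """Copy compatible settings; warn about and skip incompatible ones."""
    migrated = {}
    for key, value in settings.items():
        if key in incompatible:
            warnings.warn(
                f"Setting {key!r} is not compatible with the target runtime; skipped"
            )
            continue
        migrated[key] = value
    return migrated
```

Calling `migrate_settings({"spark.sql.shuffle.partitions": "64", "spark.example.legacy.flag": "true"})` would keep the first entry and emit one warning for the second.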

Page 5: Library Management During Runtime Changes

Handling Libraries

  • Python and R libraries generally keep working after a runtime change, provided the Python and R versions are unchanged.

  • JAR files may stop working because dependencies change between runtimes; users are responsible for updating or replacing libraries that conflict.
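The rule of thumb above can be turned into a quick pre-migration checklist. This is a sketch only: `library_review_list` is an illustrative helper, the language versions are copied from the runtime comparison earlier in this document, and it assumes JARs always deserve a recheck.

```python
# Python and R versions per runtime, copied from the runtime comparison.
LANGUAGE_VERSIONS = {
    "1.1": {"python": "3.10", "r": "4.2.2"},
    "1.2": {"python": "3.10", "r": "4.2.2"},
    "1.3": {"python": "3.11", "r": "4.4.1"},
}


def library_review_list(old, new):
    """List the library kinds worth rechecking after switching runtimes."""
    review = ["jar"]  # assumption: JAR dependencies may shift, so always recheck
    for language in ("python", "r"):
        if LANGUAGE_VERSIONS[old][language] != LANGUAGE_VERSIONS[new][language]:
            review.append(language)
    return review


print(library_review_list("1.1", "1.2"))  # only JARs: Python and R are unchanged
print(library_review_list("1.2", "1.3"))  # JARs plus Python and R libraries
```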

Page 6: Delta Lake Protocol Management

Protocol Upgrade

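  • Delta Lake features are backward compatible: tables created with a lower Delta Lake version work with higher versions. Forward compatibility with lower versions may be lost once certain features are enabled.

  • Use the delta.upgradeTableProtocol(minReaderVersion, minWriterVersion) method with caution: after a protocol upgrade, lower Delta Lake versions may no longer be able to read or write the table.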
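Why a protocol upgrade can lock out older clients is easier to see in a small model. The sketch below is plain Python and does not call Delta Lake; `TableProtocol`, `Client`, and the helper functions are illustrative names, the version numbers are examples, and table features in newer protocol versions are ignored. It assumes protocol versions only move upward. In Delta Lake's Python API the upgrade itself is exposed as `DeltaTable.upgradeTableProtocol(readerVersion, writerVersion)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableProtocol:
    min_reader_version: int
    min_writer_version: int


@dataclass(frozen=True)
class Client:
    max_reader_version: int
    max_writer_version: int


def can_read(client, table):
    """A client can read only if it supports the table's minimum reader version."""
    return client.max_reader_version >= table.min_reader_version


def can_write(client, table):
    """Writing also requires support for the table's minimum writer version."""
    return can_read(client, table) and client.max_writer_version >= table.min_writer_version


def upgrade(table, reader_version, writer_version):
    """Model of a protocol upgrade; this sketch rejects any downgrade request."""
    if (reader_version < table.min_reader_version
            or writer_version < table.min_writer_version):
        raise ValueError("protocol versions cannot be lowered")
    return TableProtocol(reader_version, writer_version)
```

In this model, a client that supports reader version 1 and writer version 2 can use a table at protocol (1, 2), but loses both read and write access once the table is upgraded to (2, 5).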

Page 7: Default Table Format Changes

Table Format Transition

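  • Runtimes 1.2 and 1.3 set the default table format (spark.sql.sources.default) to Delta; Runtime 1.1 and earlier Spark 3.3-based runtimes default to Parquet.

  • Scripts that create tables without naming a format, and therefore assume Parquet, should be revised to state the format explicitly.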
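One way to find scripts that silently rely on the default format is to look for CREATE TABLE statements that lack a USING clause. The following is a rough Python sketch, not a SQL parser: `statements_relying_on_default_format` is an illustrative helper, it splits naively on semicolons, and it ignores comments, string literals, and Hive-style STORED AS clauses.

```python
import re

CREATE_TABLE = re.compile(r"\bCREATE\s+(?:OR\s+REPLACE\s+)?TABLE\b", re.IGNORECASE)
USING_CLAUSE = re.compile(r"\bUSING\s+\w+", re.IGNORECASE)


def statements_relying_on_default_format(sql_script):
    """Return CREATE TABLE statements that do not name a format with USING."""
    flagged = []
    for statement in sql_script.split(";"):  # naive split; fine for simple scripts
        statement = statement.strip()
        if CREATE_TABLE.search(statement) and not USING_CLAUSE.search(statement):
            flagged.append(statement)
    return flagged


script = """
CREATE TABLE sales (id INT);
CREATE TABLE archive (id INT) USING PARQUET;
SELECT 1
"""
print(statements_relying_on_default_format(script))
```

Statements flagged this way can then be given an explicit `USING PARQUET` or `USING DELTA` clause so their behavior no longer depends on the runtime default.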

Page 8: Feedback Request

Feedback

  • Provide product feedback and engage with community for further questions.

Page 9: Release Cadence of Apache Spark Runtimes

General Release Information

  • Minor versions of Apache Spark are released every 6 to 9 months.

  • The Microsoft Fabric Spark team aims to deliver new runtime versions rapidly.

Page 10: Lifecycle and Support Date for Runtimes

Runtime Lifecycle

  • Each runtime moves through distinct support phases: Experimental, Public Preview, General Availability (GA), Long-Term Support (LTS), and End of Support.

Page 11: Lifecycle Continued

End-of-Support Phase

  • Once a runtime's end-of-support date arrives, the runtime is removed and will not receive any updates.

Page 12: Version Numbering

Runtime Versioning

  • The runtime major version corresponds to the Apache Spark major version; in the runtime comparison, every Runtime 1.x release is built on a Spark 3.x version.
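The pairing of major versions can be checked against the runtime comparison earlier in this document. A minimal Python sketch follows, with `major` and `major_version_mapping` as illustrative helper names and the version pairs copied from that comparison.

```python
# Runtime-to-Spark pairs copied from the runtime comparison.
SPARK_BY_RUNTIME = {"1.1": "3.3.1", "1.2": "3.4.1", "1.3": "3.5.0"}


def major(version):
    """Return the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def major_version_mapping(pairs):
    """Map each runtime major version to its Spark major version.

    Raises ValueError if one runtime major version pairs with two Spark majors.
    """
    mapping = {}
    for runtime, spark in pairs.items():
        runtime_major, spark_major = major(runtime), major(spark)
        # setdefault stores the first pairing and returns the stored value afterwards.
        if mapping.setdefault(runtime_major, spark_major) != spark_major:
            raise ValueError(f"Runtime {runtime} breaks the major-version pairing")
    return mapping


print(major_version_mapping(SPARK_BY_RUNTIME))  # prints {1: 3}
```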

Page 13: Fabric Runtime 1.3 Overview

Runtime 1.3 Features

  • Runtime 1.3 is the latest GA version; it introduces Apache Spark 3.5 with numerous performance enhancements.

Page 14: Apache Spark 3.5 Improvements

Notable Enhancements

  • Compatibility upgrades, new features for structured streaming, and expanded functionality in PySpark.

Page 15: Delta Lake 3.2 Improvements

Enhanced Interoperability

  • Improvements for Delta Lake 3.2 focused on performance and ease of use.

Page 16: Fabric Runtime 1.2 Overview

Runtime 1.2 Features

  • Built on Apache Spark 3.4.1 and Delta Lake 2.4.0; includes a range of performance improvements and patches.

Page 17: Apache Spark 3.4.1 Features

New Enhancements

  • Various fixes and enhancements including improvements in stability.

Page 18: Limitations and Best Practices

Limitations for Concurrent Writes

  • New algorithms minimize data loss issues common with parallel insert operations.

Page 19: Delta Lake Advantages

ACID Transactions and Reliability

  • Overview of Delta Lake capabilities reinforcing its use in lakehouse architecture.

Pages 21-100: Continued Documentation

Further Documentation and Topics

  • These pages cover advanced settings, REST API usage, Git integration, Spark job definitions, and administration in Microsoft Fabric.
