Data_Engineering_Part3
Page 1: Introduction to Fabric Runtime
Overview
Microsoft Fabric Runtime: An Azure-integrated platform based on Apache Spark.
Combines internal and open-source components to enable data engineering and data science experiences.
Often referred to simply as Fabric Runtime.
Major Components
Apache Spark: An open-source distributed computing library designed for large-scale data processing and analytics.
Delta Lake: An open-source storage layer that adds ACID transactions and data reliability features to Apache Spark.
Native Execution Engine: Improves performance by running Spark queries directly on lakehouse infrastructure, with query speedups of up to 4x over traditional OSS Spark.
Works with both the Parquet and Delta formats, and is supported in Runtime 1.3.
Built on Meta's Velox and Intel's Apache Gluten.
Default-level Packages: Java/Scala, Python, and R packages that come pre-installed for ease of use.
Page 2: Runtime Versions and Optimizations
Current Production Version
For production workloads, always use the most recent General Availability (GA) runtime version.
Runtime Comparison
| Version | Apache Spark | Operating System | Java | Scala | Python | Delta Lake | R |
|---|---|---|---|---|---|---|---|
| Runtime 1.1 | 3.3.1 | Ubuntu 18.04 | 8 | 2.12.15 | 3.10 | 2.2.0 | 4.2.2 |
| Runtime 1.2 | 3.4.1 | Mariner 2.0 | 11 | 2.12.17 | 3.10 | 2.4.0 | 4.2.2 |
| Runtime 1.3 | 3.5.0 | Mariner 2.0 | 11 | 2.12.17 | 3.11 | 3.2 | 4.4.1 |
See each runtime's documentation for detailed features and migration scenarios.
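The comparison table can be sketched as a small lookup structure, which makes it easy to see what actually changes between two runtimes. This is only an illustration: the names RuntimeSpec, RUNTIMES, and changed_components are hypothetical, and the values are copied from the table above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RuntimeSpec:
    """Component versions for one Fabric Runtime release."""
    spark: str
    os: str
    java: str
    scala: str
    python: str
    delta_lake: str
    r: str


# Values copied from the runtime comparison table.
RUNTIMES = {
    "1.1": RuntimeSpec("3.3.1", "Ubuntu 18.04", "8", "2.12.15", "3.10", "2.2.0", "4.2.2"),
    "1.2": RuntimeSpec("3.4.1", "Mariner 2.0", "11", "2.12.17", "3.10", "2.4.0", "4.2.2"),
    "1.3": RuntimeSpec("3.5.0", "Mariner 2.0", "11", "2.12.17", "3.11", "3.2", "4.4.1"),
}


def changed_components(old, new):
    """Return {component: (old_value, new_value)} for every component that differs."""
    before, after = RUNTIMES[old], RUNTIMES[new]
    return {
        f.name: (getattr(before, f.name), getattr(after, f.name))
        for f in fields(before)
        if getattr(before, f.name) != getattr(after, f.name)
    }


# Moving from Runtime 1.2 to 1.3 changes Spark, Python, Delta Lake, and R.
print(changed_components("1.2", "1.3"))
```

Listing the changed components before a switch gives a quick checklist of what to re-test.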
Fabric Optimizations
Fabric's Spark engine and Delta Lake implementation incorporate platform-specific optimizations designed for native integration within Fabric.
These include nearly 100 built-in query performance enhancements.
Page 3: Writer Capabilities and Runtime Changes
Writer Capabilities in Fabric
Optimized writing processes for better performance.
V-Order optimization, applied by default to Delta Parquet files, improves read performance.
Support for Multiple Runtimes
Users can switch between multiple runtimes, which minimizes the risk of incompatibilities or disruptions.
Changing the workspace-level runtime version affects all system-created items in the workspace, such as lakehouses, Spark job definitions, and notebooks.
Page 4: Consequences of Runtime Changes on Settings
Migration of Spark Settings
When the runtime changes, Fabric aims to migrate all Spark settings; a setting that is incompatible with the target runtime produces a warning and is not applied.
Spark configuration settings are divided into mutable and immutable settings.
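The migrate-or-warn behavior described above can be sketched in plain Python. This is not how Fabric implements it; the function name migrate_spark_settings, the incompatible_keys input, and the setting spark.example.legacyFlag are all hypothetical, since these notes do not list which settings are incompatible with which runtime.

```python
import warnings


def migrate_spark_settings(settings, incompatible_keys):
    """Copy settings to the target runtime, skipping incompatible ones with a warning.

    incompatible_keys is a hypothetical input supplied by the caller.
    """
    migrated = {}
    for key, value in settings.items():
        if key in incompatible_keys:
            warnings.warn(
                f"Spark setting {key!r} is not compatible with the target runtime; skipped"
            )
            continue
        migrated[key] = value
    return migrated


# spark.example.legacyFlag is a made-up setting name used only for illustration.
current = {"spark.sql.shuffle.partitions": "200", "spark.example.legacyFlag": "true"}
print(migrate_spark_settings(current, incompatible_keys={"spark.example.legacyFlag"}))
```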
Page 5: Library Management During Runtime Changes
Handling Libraries
Python and R libraries generally keep working when the Python and R versions are unchanged between runtimes.
JARs may face compatibility issues due to dependency changes; users must resolve conflicts in their own libraries.
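As a rough sketch of the rule above, the check below lists which kinds of library deserve re-testing for a given runtime switch. The helper name libraries_to_review is hypothetical; the Python and R versions are taken from the runtime comparison table, and treating JARs as always needing review is an assumption drawn from the note about dependency changes.

```python
# Python and R versions per runtime, copied from the comparison table.
LANGUAGE_VERSIONS = {
    "1.1": {"python": "3.10", "r": "4.2.2"},
    "1.2": {"python": "3.10", "r": "4.2.2"},
    "1.3": {"python": "3.11", "r": "4.4.1"},
}


def libraries_to_review(old_runtime, new_runtime):
    """List the library kinds worth re-testing when moving between two runtimes."""
    old, new = LANGUAGE_VERSIONS[old_runtime], LANGUAGE_VERSIONS[new_runtime]
    # JARs are always listed: their dependencies can change even when the
    # language versions do not.
    review = ["jar"]
    for language in ("python", "r"):
        if old[language] != new[language]:
            review.append(language)
    return review


print(libraries_to_review("1.1", "1.2"))  # ['jar']
print(libraries_to_review("1.2", "1.3"))  # ['jar', 'python', 'r']
```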
Page 6: Delta Lake Protocol Management
Protocol Upgrade
Delta Lake features are backward compatible, but forward compatibility may be compromised when certain features are enabled.
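Use the delta.upgradeTableProtocol(minReaderVersion, minWriterVersion) method with caution: after a protocol upgrade, clients on lower Delta Lake versions may no longer be able to read or write the table.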
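To make the forward-compatibility risk concrete, the sketch below models the protocol check in plain Python and shows how the upgrade could be issued through the delta-spark Python API (DeltaTable.upgradeTableProtocol). The helper names can_access and upgrade_protocol and the version numbers in the example are illustrative assumptions; upgrade_protocol needs a live Spark session with Delta Lake, so it is defined here but not called.

```python
def can_access(table_min_reader, table_min_writer, client_reader, client_writer):
    """Return (can_read, can_write) for a client against a table's protocol.

    A client can read when it supports at least the table's minimum reader
    version, and can write when it also supports the minimum writer version.
    """
    can_read = client_reader >= table_min_reader
    can_write = can_read and client_writer >= table_min_writer
    return can_read, can_write


def upgrade_protocol(spark, table_name, min_reader, min_writer):
    """Upgrade a table's protocol; requires a Spark session with Delta Lake."""
    from delta.tables import DeltaTable  # imported lazily: needs delta-spark

    DeltaTable.forName(spark, table_name).upgradeTableProtocol(min_reader, min_writer)


# Hypothetical older client that supports reader version 1 and writer version 2.
print(can_access(1, 2, client_reader=1, client_writer=2))  # (True, True)
# After the table is upgraded to reader 2 / writer 5, that client is locked out.
print(can_access(2, 5, client_reader=1, client_writer=2))  # (False, False)
```

Running a check like can_access against every client that touches the table is one way to decide whether an upgrade is safe before calling upgrade_protocol.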
Page 7: Default Table Format Changes
Table Format Transition
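Runtime 1.3 changes the default table format from Parquet to Delta: Spark commands that create a table without naming a format now create a Delta table.
Scripts that expect or assume the Parquet format should be revised, for example by setting the format explicitly.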
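One way to revise such scripts is to state the table format explicitly instead of relying on the default. The sketch below assumes the standard Spark setting spark.sql.sources.default holds the session default; the helper names and the table names sales and sales_raw are hypothetical. The spark argument is any object exposing conf.get, such as a SparkSession.

```python
def default_table_format(spark):
    """Read the session's default data source format.

    spark.sql.sources.default is the standard Spark setting for this.
    """
    return spark.conf.get("spark.sql.sources.default")


def create_table_sql(table_name, columns_ddl, table_format=None):
    """Build a CREATE TABLE statement, adding USING only when a format is given."""
    statement = f"CREATE TABLE {table_name} ({columns_ddl})"
    if table_format is not None:
        statement += f" USING {table_format.upper()}"
    return statement


# Relies on the runtime default (Delta on Runtime 1.3, per these notes).
print(create_table_sql("sales", "id INT, amount DOUBLE"))
# Pins Parquet for a script that still needs it.
print(create_table_sql("sales_raw", "id INT, amount DOUBLE", "parquet"))
```

In a notebook, the generated statement would be passed to spark.sql(...).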
Page 8: Feedback Request
Feedback
Provide product feedback and engage with the community for further questions.
Page 9: Release Cadence of Apache Spark Runtimes
General Release Information
Minor versions of Apache Spark are typically released every 6 to 9 months.
The Microsoft Fabric Spark team aims to deliver new runtime versions quickly.
Page 10: Lifecycle and Support Date for Runtimes
Runtime Lifecycle
Each runtime moves through distinct support phases: Experimental, Public Preview, General Availability (GA), Long-Term Support (LTS), and End of Support.
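The phases can be modeled as an ordered enumeration, which is handy when a script needs to gate workloads by support phase. The names RuntimePhase and suitable_for_production are hypothetical, and the rule that production workloads should run only on GA or LTS runtimes is an assumption layered on the guidance to use the most recent GA version.

```python
from enum import IntEnum


class RuntimePhase(IntEnum):
    """Support phases in the order these notes list them."""
    EXPERIMENTAL = 1
    PUBLIC_PREVIEW = 2
    GA = 3
    LTS = 4
    END_OF_SUPPORT = 5


def suitable_for_production(phase):
    """Assumed policy: production workloads run only on GA or LTS runtimes."""
    return phase in (RuntimePhase.GA, RuntimePhase.LTS)


print(suitable_for_production(RuntimePhase.GA))              # True
print(suitable_for_production(RuntimePhase.PUBLIC_PREVIEW))  # False
```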
Page 11: Lifecycle Continued
End-of-Support Phase
Once a runtime's end-of-support date arrives, the runtime is removed and will not receive any updates.
Page 12: Version Numbering
Runtime Versioning
The runtime major version corresponds to the Apache Spark major version; in the runtime comparison table, every Runtime 1.x release is built on Spark 3.x.
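A minimal sketch of this mapping follows. Only the Runtime 1 to Spark 3 pairing is backed by the runtime comparison table, so the lookup deliberately refuses to guess for any other major version; the names SPARK_MAJOR_BY_RUNTIME_MAJOR and spark_major_for are hypothetical.

```python
# Only the Runtime 1 -> Spark 3 pairing is supported by the comparison table.
SPARK_MAJOR_BY_RUNTIME_MAJOR = {1: 3}


def spark_major_for(runtime_version):
    """Map a runtime version string such as '1.3' to its Spark major version."""
    runtime_major = int(runtime_version.split(".")[0])
    try:
        return SPARK_MAJOR_BY_RUNTIME_MAJOR[runtime_major]
    except KeyError:
        raise ValueError(
            f"No Spark mapping recorded for Runtime {runtime_major}.x"
        ) from None


print(spark_major_for("1.3"))  # 3
```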
Page 13: Fabric Runtime 1.3 Overview
Runtime 1.3 Features
Runtime 1.3, the latest GA version, is based on Apache Spark 3.5 and includes numerous performance enhancements.
Page 14: Apache Spark 3.5 Improvements
Notable Enhancements
Spark 3.5 brings compatibility upgrades, new Structured Streaming features, and expanded PySpark functionality.
Page 15: Delta Lake 3.2 Improvements
Enhanced Interoperability
Delta Lake 3.2 improvements focus on performance and ease of use.
Page 16: Fabric Runtime 1.2 Overview
Runtime 1.2 Features
Runtime 1.2, based on Apache Spark 3.4.1 and Delta Lake 2.4.0, includes a range of performance improvements and patches.
Page 17: Apache Spark 3.4.1 Features
New Enhancements
Spark 3.4.1 brings various fixes and enhancements, including stability improvements.
Page 18: Limitations and Best Practices
Limitations for Concurrent Writes
New algorithms minimize data loss issues common with parallel insert operations.
Page 19: Delta Lake Advantages
ACID Transactions and Reliability
Delta Lake's ACID transactions and data reliability features reinforce its use in lakehouse architecture.
Pages 21-100: Continued Documentation
Further Documentation and Topics
The remaining pages cover advanced settings, REST API usage, Git integration, Spark job definitions, and administration in Microsoft Fabric.