How do you run notebooks in Databricks from Data Factory?
Generate an access token for your Azure Databricks workspace, then create a linked service in your Azure Data Factory resource that uses the access token to connect to Azure Databricks
How do you save a dataframe as a partitioned set of files?
use the partitionBy method; dated_df.write.partitionBy("Year").mode("overwrite").parquet("/data")
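A hedged follow-up sketch, assuming the /data path and Year column from the card above; reading the data back and filtering on the partition column lets Spark prune the folders for other years:

# Read the partitioned output back; the filter on the partition column
# means Spark only scans the matching Year=... folder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_2024 = spark.read.parquet("/data").filter("Year = 2024")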
How does ADLS organize data?
organizes the stored data into a hierarchy of directories and subdirectories, much like a file system. As a result, data processing requires fewer computational resources, reducing both time and cost
How does ADLS store data?
stores data in an HDFS-compatible file system hosted in Azure Storage; with this feature, you can store the data in one place and access it through compute technologies like Azure Databricks and Azure Synapse Analytics (ASA) without moving the data between environments
How is Azure Databricks used for Data Science and Engineering?
provides Apache Spark-based processing and analysis of large volumes of data in a data lake
How is Azure Databricks used for machine learning?
supports machine learning workloads that involve data exploration and preparation, training and evaluating machine learning models, and serving models to generate predictions for applications and analyses
How is Azure Databricks used for SQL?
supports SQL-based querying for data stored in tables in a SQL Warehouse. This capability enables data analysts to query, aggregate, summarize, and visualize data using familiar SQL syntax and a wide range of SQL-based data analytical tools
How is data in a lake database stored?
stored in the data lake as Parquet or CSV files. The files can be managed independently of the database tables
How is granting access done in Azure Data Lake Gen2?
through RBAC
How is the underlying data in delta lakes stored?
in Parquet format, which is commonly used in data lake ingestion pipelines
What are 4 stages for processing big data solutions?
Ingest, store, prep & train, model & serve
What are Apache Spark Clusters?
Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs
What are data lakes?
a storage repository that holds large amounts of data in native, raw formats; optimized for scaling to massive volumes of data
What are data pipelines?
used to orchestrate activities that transfer and transform data. Pipelines are the primary way in which DE's implement repeatable ETL solutions that can be triggered based on a schedule or in response to events
What are hopping windows?
hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size
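A hedged PySpark Structured Streaming analog (not Azure Stream Analytics syntax) contrasting tumbling and hopping windows; the events dataframe and its event_time timestamp column are assumptions:

from pyspark.sql.functions import window, count

# Tumbling: fixed 10-minute windows that repeat and never overlap.
tumbling = events.groupBy(window("event_time", "10 minutes")).agg(count("*"))

# Hopping: 10-minute windows that start every 5 minutes, so they overlap
# and a single event can fall into more than one window.
hopping = events.groupBy(window("event_time", "10 minutes", "5 minutes")).agg(count("*"))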
What are notebooks in Azure Databricks?
an interactive environment in which you can combine text and graphics in Markdown format with cells containing code that you run interactively in the notebook sessions
What are parameters in Azure Data Factory?
key-value pairs of read-only configuration. Parameters are defined in the pipeline
What are session windows?
group events that arrive at similar times, filtering out periods of time where there is no data
What are sliding windows?
output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window.
What are the benefits of a delta lake?
Relational tables that support querying and data modification, support for ACID transactions, data versioning and time travel, support for batch and streaming data, standard formats and interoperability
What are the benefits of a serverless SQL pool?
Familiar Transact-SQL syntax, integrated connectivity, distributed query processing, built-in query execution fault tolerance, no infrastructure to set up or clusters to maintain
What are the two SQL pools that ASA supports?
Serverless SQL pool and Dedicated SQL pool
What are tumbling windows?
used to segment a data stream into distinct time segments and perform a function against them; the key differentiators of tumbling windows are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window
What are typical operations on a dataframe?
Filtering rows and columns, renaming columns, creating new columns (often derived from existing ones), replacing null or other values
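A minimal PySpark sketch of these operations; the orders dataframe and its cust_name, amount, and discount columns are assumptions:

from pyspark.sql.functions import col

cleaned = (orders
    .select("cust_name", "amount", "discount")            # filter columns
    .filter(col("amount") > 0)                            # filter rows
    .fillna({"discount": 0.0})                            # replace null values
    .withColumnRenamed("cust_name", "customer")           # rename a column
    .withColumn("net", col("amount") - col("discount")))  # derive a new column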
What can the LAG function be used for?
to retrieve a value from a prior event, for example to find the difference between the current event and the one before it
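The card refers to the LAG function in Azure Stream Analytics queries; as a hedged illustration of the same idea, PySpark's lag window function can compute the gap between consecutive events (the events dataframe and its device_id and event_time columns are assumptions):

from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

w = Window.partitionBy("device_id").orderBy("event_time")

# Seconds elapsed since the prior event from the same device.
with_gap = events.withColumn(
    "seconds_since_prev",
    col("event_time").cast("long") - lag("event_time").over(w).cast("long"))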
What data integration capabilities does integration runtime provide?
Data flow, data movement, activity dispatch, SSIS package execution
What does a Linked Service in Data Factory enable users to do?
enables you to ingest the data from a data source in readiness to prepare the data for transformation and/or analysis
What does it mean to define the data source?
Identify source details such as the resource group, subscription, and identity information such as a key or secret
What does it mean to define the data?
Identify the data to be extracted. Define the data by using a database query, a set of files, or an Azure Blob storage name for blob storage
What does RBAC mean?
In Azure Data Lake Gen2, role-based access control is available; built in security groups include ReadOnlyUsers, WriteAccessUsers, and FullAccessUsers
What does the lookup task allow you to do in a data flow activity?
allows you to take a value and search for it in a different source to retrieve another value or set of values
What function is used to define and use parameters in a notebook?
the dbutils.widgets functions; for example, dbutils.widgets.text defines a text widget and dbutils.widgets.get reads its value
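A short hedged sketch of the widgets API in a Databricks notebook cell, where dbutils is provided automatically by the notebook environment (the folder widget name is an assumption):

# Define a text widget with a default value, then read its current value.
dbutils.widgets.text("folder", "/data")
folder = dbutils.widgets.get("folder")
print(f"Processing files in {folder}")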
What function is used to find the time difference?
DATEDIFF
What is a control flow?
an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger
What is a databricks file system (DBFS)?
while each cluster node has its own local file system, the nodes in a cluster have access to a shared, distributed file system in which they can access and operate on data files
What is a delta lake?
an open-source storage layer that adds relational database semantics to Spark-based data lake processing
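A minimal hedged sketch of those semantics using the standard delta format in Spark; the /delta/products path is an assumption:

# Write a dataframe as a Delta table, then use time travel to read an
# earlier version of the data back.
df.write.format("delta").mode("overwrite").save("/delta/products")

previous = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/delta/products"))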
What is a Hive metastore?
The Hive metastore is a repository that stores the metadata for all tables in a Hive database. It allows that metadata to be queried, while also protecting it from writes that may ultimately undermine the structure of other tables within Hive.
What is a lake database?
provides a relational metadata layer over one or more files in a data lake
What is a managed table?
defined without a specified location, and the data files are stored within the storage used by the metastore; dropping the table not only removes its metadata from the catalog, but also deletes the folder in which its data files are stored
What is aggregate?
define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns
What is alter row?
set insert, delete, update, and upsert policies on rows. You can add one-to-many conditions as expressions
What is an activity in Azure Data Factory?
typically contain the transformation logic or the analysis commands of Azure Data Factory; activities include the Copy Activity, which can be used to ingest data from a variety of data sources
What is an external table?
defined for a custom file location, where the data for the table is stored; dropping the table deletes the metadata from the catalog, but doesn't affect the data files
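A hedged Spark sketch contrasting the table types from the last two cards; the table names and path are assumptions:

# Managed table: the metastore controls both the metadata and the data files.
df.write.saveAsTable("sales_managed")

# External table: metadata in the catalog, data at an explicit location;
# dropping the table leaves the files under /data/sales_ext intact.
df.write.option("path", "/data/sales_ext").saveAsTable("sales_ext")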
What is analytical data?
data that has been optimized for analysis and reporting, often in a data warehouse
What is Apache Spark?
distributed data processing framework that enables large-scale data analytics by coordinating work across multiple processing nodes in a cluster
What is Azure Cosmos DB?
a globally distributed, multi-model database
What is Azure Data Factory?
the cloud-based ETL and data integration service that allows you to create data-driven workflows (pipelines) for orchestrating data movement and transforming data at scale
What is Azure Data Lake Storage Gen2?
provides a cloud-based solution for data lake storage in Azure, and underpins many large-scale analytics solutions built on Azure
What is Azure Databricks?
a comprehensive platform that is optimized for three specific types of data workload and associated user personas: Data science and Engineering, Machine learning, and SQL
What is Azure HDInsight?
provides technologies to help you ingest, process, and analyze big data. It supports batch processing, data warehousing, IoT, and data science
What is Azure Purview?
a unified data governance service that helps you manage and govern your on-premises, multi-cloud and software-as-a-service (SaaS) data
What is Azure SQL Database?
supports structures such as relational data and unstructured formats such as spatial and XML data
What is Azure Stream Analytics used for?
process streaming data and respond to data anomalies in real time; use it for Internet of Things (IoT) monitoring, web logs, remote patient monitoring, and point of sale (POS) systems
What is Azure Synapse Analytics?
provides capabilities for running data pipelines and managing analytical data in a data lake or relational data warehouse; a cloud-based data platform that brings together enterprise data warehousing and Big Data analytics
What is conditional split?
route rows of data to different streams based on matching conditions
What is data consolidation?
the process of combining data that has been extracted from multiple data sources into a consistent structure usually to support analysis
What is data integration?
establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems
What is data transformation?
operational data usually needs to be transformed into a suitable structure and format for analysis, often as part of an extract, transform, and load (ETL) process
What is derived column?
generate new columns or modify existing fields using the data flow expression language
What is flatten?
take array values inside hierarchical structures such as JSON and unroll them into individual rows
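The flatten transformation belongs to ADF mapping data flows; a hedged PySpark equivalent uses explode to unroll array values into individual rows (the orders dataframe and its items array column are assumptions):

from pyspark.sql.functions import explode

# One output row per element of the items array.
flat = orders.select("order_id", explode("items").alias("item"))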
What is integration runtime in Azure Data Factory?
enables Azure Data Factory to bridge between the activity and linked services objects
What is lookup?
enables you to reference data from another source
What is managed identity?
this identity can be used to authorize requests for data access in Azure Storage; before accessing the data, the storage administrator must grant permissions to the managed identity
What is mapping data flow?
provides an environment for building a wide range of data transformations visually, without the need to write code
What is needed in the Azure SQL database to ensure lineage?
You need a master key in the Azure SQL database for lineage to work.
What is new branch?
apply multiple sets of operations and transformations against the same data stream
What is one of the core resources in a synapse analytics workspace?
data lake
What is operational data?
transactional data that is generated and stored by applications, often in a relational or non-relational database
What is partitioning?
an optimization technique that enables Spark to maximize performance across the worker nodes
What is pivot?
an aggregation where one or more grouping columns has distinct row values transformed into individual columns
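This card describes the pivot transformation in ADF mapping data flows; a hedged PySpark analog is shown below (the sales dataframe and its columns are assumptions):

# Distinct values of "year" become individual columns, aggregated by sum.
pivoted = sales.groupBy("product").pivot("year").sum("revenue")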
What is semi-structured data?
data such as JavaScript Object Notation (JSON) files, which may require flattening prior to loading into your source system
What is Shared access signature (SAS)?
provides delegated access to resources in a storage account without sharing account keys; gives you granular control over the type of access you grant to clients who hold the SAS
What is streaming data?
refers to perpetual sources of data that generate data values in real time, often relating to specific events
What is structured data?
primarily comes from table-based source systems such as a relational database or from a flat file such as a CSV file
What is the benefit of ELT?
you can store data in its original format, be it JSON, XML, PDF, or images; you can define the data's structure during the transformation phase, so you can use the source data in multiple downstream systems
What is the primary structure for working with data in Apache Spark?
dataframe
What is the right design pattern to follow when it comes to the folder structure and file naming convention for the streaming data?
\DataSet\YYYY\MM\DD\datafile_YYYY_MM_DD.csv
What is Type 1 SCD?
always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten; this approach is common for columns that store supplementary values
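A hedged Delta Lake sketch of a Type 1 overwrite-on-change load in Python; the dim_customer table, updates dataframe, and column names are assumptions:

from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "dim_customer")

# Type 1: matched rows are overwritten with the latest source values,
# so no change history is kept.
(dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())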
What is Type 2 SCD?
keeps a history of changes in dimension members by adding a new row to the table for each change
What is Type 3 SCD?
supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member
What is Type 6 SCD?
Combines Type 1, 2, and 3; in this design you also store the current value in all versions of that entity so you can easily report on the current value or the historical value
What is unpivot?
pivot columns into row values
What is unstructured data?
includes data stored as key-value pairs that don't adhere to standard relational models; other commonly used types of unstructured data include Portable Document Format (PDF)
What is useful about Azure Blob storage?
you can store large amounts of unstructured data in a flat namespace within a blob container
What is user identity?
also known as "pass-through", is an authorization type where the identity of the Azure AD user that logged into the serverless SQL pool is used to authorize access to the data
What key feature does Azure Synapse Analytics have?
the Data Movement Service (DMS) coordinates and transports data between compute nodes as necessary, but you can use a replicated table to reduce data movement and improve performance
What occurs during the extraction process?
data engineers define the data and its source
What occurs during the load process?
many Azure destinations can accept data formatted as JavaScript Object Notation (JSON), a file, or a blob. You might need to write code to interact with application APIs
What occurs during the transformation process?
splitting, combining, deriving, adding, removing, or pivoting columns; mapping fields between the data source and the data destination. You might also need to aggregate or merge data
What requirements must be met to create and manage data factory objects?
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above; to create and manage resources with PowerShell or the SDK, the Contributor role at the resource level or above is sufficient
What service would you use for a requirement listed as such: provide an environment to support an Analytical data store
Azure Synapse Analytics
What statement can be used to implement column level security in the table?
GRANT
What statement can be used to implement row level security in the table?
CREATE SECURITY POLICY
What steps are typically performed for pipelines in Azure Data Factory?
Connect and collect, Transform and enrich, Publish, Monitor
What two types of authentication are supported in the serverless SQL pools?
SQL Authentication and Azure Active Directory Authentication