
DP-900

Explore Core Data Concepts

Identify Data Formats

Data: a collection of facts such as numbers, descriptions, and observations used to record information

Data Structures: organized data often represents entities (each typically having one or more attributes or characteristics) that are important to an organization (e.g., customers, products, sales orders, and so on)

  • Structured Data:

    • data that adheres to a fixed schema so that the data has the same fields or properties

    • schema is tabular: data is represented in one or more tables that consist of rows to represent each instance of a data entity and columns to represent attributes of the entity

    • often stored in a database in which multiple tables can reference one another by using key values in a relational model

  • Semi-Structured Data:

    • information that has some structure but allows for some variation between entity instances

    • a common format is JavaScript Object Notation (JSON)

    • JSON is one of many ways to represent this data (see the sample after this list)

  • Unstructured Data

    • documents, images, audio and video data, and binary files don’t have a specific structure
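
A small, hypothetical example of semi-structured data: two customer documents in JSON that share a core structure but vary in their attributes (names and fields are made up for illustration):

  { "id": 1, "name": "Taylor", "email": "taylor@contoso.com" }

  { "id": 2, "name": "Jordan", "loyaltyTier": "Gold", "address": { "city": "Seattle" } }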

Data Stores: organizations typically store data in any of these three formats to record details of entities, specific events, and other business information. Stored data can be retrieved later for analysis and reporting (file stores, databases)

Explore File Storage

File Storage: the ability to store data in files is a core element of any computing system. Files can be stored in local file systems on the hard disk of your personal computer, and on removable media such as USB drives (but in most organizations, important data files are stored centrally in some kind of shared file storage system). The central storage location is often hosted in the cloud, enabling cost-effective, secure, and reliable storage for large volumes of data

Specific file format used to store data depends on a number of factors including:

  • type of data being stored (structured, semi-structured, or unstructured)

  • applications and services that will need to read, write, and process the data

  • need for the data files to be readable by humans, or optimized for efficient storage and processing

File Formats:

  • Delimited Text Files

    • data is often stored in plain text format with specific field delimiters and row terminators; the most common format is comma-separated values (CSV), where fields are separated by commas and rows are terminated by a carriage return / new line (see the sample after this list)

    • optionally, first line may include the field names

    • other common formats include tab-separated values (TSV) and space-delimited (in which tabs or spaces are used to separate fields) and fixed-width data in which each field is allocated a fixed number of characters

    • good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format

  • JavaScript Object Notation (JSON)

    • a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes, each attribute might be an object (or collection of objects); making JSON a flexible format that’s good for both structured and semi-structured data

  • Extensible Markup Language (XML)

    • a human-readable data format that was popular in the 1990s and 2000s; it has largely been superseded by the less verbose JSON format, but some systems still use XML to represent data

    • uses tags enclosed in angle-brackets (<…/>) to define elements and attributes

  • Binary Large Object (BLOB)

    • all files are stored as binary data, but in the human-readable formats, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode)

    • some file formats, particularly for unstructured data, store the data as raw binary that must be interpreted by applications and rendered

    • common types of data stored as binary include images, video, audio, and application-specific documents

    • files are often referred to as BLOBs

  • Optimized File Formats

    • human-readable formats for structured and semi-structured data can be useful but aren't typically optimized for storage space or processing

    • over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed

    • common optimized formats you might see include:

      • Avro: row-based format created by Apache

        • each record contains a header that describes the structure of the data in the record; the header is stored as JSON and the data is stored as binary information

        • an application uses the information in the header to parse the binary data and extract the fields it contains; a good format for compressing data and minimizing storage and network bandwidth requirements

      • ORC (Optimized Row Columnar format) organizes data in columns rather than rows

        • developed by HortonWorks for optimizing read and write operations in Apache Hive (a data warehouse system that supports fast data summarization and querying over large datasets)

        • contains stripes of data, each stripe holds the data for a column or set of columns and a stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column

      • Parquet, created by Cloudera and Twitter, is another columnar data format

        • contains row groups; the data for each column is stored together in the same row group, and each row group contains one or more chunks of data

        • includes metadata that describes the set of rows found in each chunk; an application can use the metadata to quickly locate the correct chunk for a given set of rows and retrieve the data in the specific columns for these rows

        • specializes in storing and processing nested data types efficiently, supports very efficient compression and encoding schemes
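
A small, hypothetical example contrasting the delimited text and JSON formats described above: the same two product records as CSV and as a JSON document (values are made up for illustration):

  ProductID,ProductName,Price
  123,Widget,9.99
  124,Gadget,19.99

  { "products": [
      { "id": 123, "name": "Widget", "price": 9.99 },
      { "id": 124, "name": "Gadget", "price": 19.99 }
  ] }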

Explore Databases

Database: used to define a central system in which data can be stored and queried. The file system on which files are stored is a kind of database, but when we use the term in a professional data context, we usually mean a dedicated system for managing data records rather than files

  • Relational Databases

    • commonly used to store and query structured data, which is stored in tables that represent entities such as customers, products, or sales orders

    • each instance of an entity is assigned a primary key that uniquely identifies it; the keys are used to reference the entity instances in other tables

    • the use of keys to reference data entities enables a relational database to be normalized, which includes the elimination of duplicate data values

    • tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it's similar across multiple database systems

  • Non-Relational Databases

    • data management systems that don’t apply a relational schema to the data, often referred to as NoSQL database even though some support a variant of the SQL language

    • four common types in use:

      • Key-Value Databases: each record consists of a unique key and an associated value which can be in any format

      • Document Databases: a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query)

      • Column Family Databases: store tabular data comprising rows and columns, but the columns can be divided into groups known as column-families (each one holds a set of columns that are logically related)

      • Graph Databases: store entities as nodes with links to define relationships between them

Explore Transactional Data Processing

Transactional Data Processing System: what most people consider the primary function of business computing; records transactions (a small, discrete unit of work) that encapsulate specific events that the organization wants to track. Transactions can be financial, such as the movement of money between accounts in a banking system, or part of a retail system, tracking payments for goods and services from customers. Often high-volume, sometimes handling many millions of transactions in a single day; the data being processed has to be accessible very quickly.

Online Transactional Processing (OLTP): the work performed by transactional systems. Solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). Operations are applied transactionally in a way that ensures the integrity of the data stored in the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics (see the sketch after this list):

  • Atomicity: each transaction is treated as a single unit which succeeds or fails completely

  • Consistency: transactions can only take the data in the database from one valid state to another

  • Isolation: concurrent transactions cannot interfere with one another and, as a result, must result in a consistent database state

  • Durability: when a transaction has been committed, it will remain committed

OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications
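
A minimal sketch of a transactional operation in T-SQL-style SQL, assuming a hypothetical Accounts table; both updates succeed or fail together, illustrating the ACID properties above:

  BEGIN TRANSACTION;

    -- debit one account and credit another (hypothetical table and column names)
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;

  COMMIT TRANSACTION;  -- once committed, the change is durable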

Explore Analytical Data Processing

Analytical Data Processing: typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics; analytics can be based on a snapshot of the data at a given point in time or a series of snapshots. Specific details can vary between solutions, but a common architecture for enterprise-scale analytics looks like this:

  1. operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis

  2. data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine

  3. data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables (see the query sketch after this list)

  4. the data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards
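
A minimal sketch of the kind of aggregation described in step 3, assuming a hypothetical star schema with FactSales and DimDate tables: a measure is summed at the intersection of dimension values:

  SELECT d.CalendarYear, d.MonthName, SUM(f.SalesAmount) AS TotalSales
  FROM FactSales AS f
  JOIN DimDate AS d ON f.DateKey = d.DateKey
  GROUP BY d.CalendarYear, d.MonthName;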

Data Lakes: common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed

Data Warehouses: an established way to store data in a relational schema that is optimized for read operations — primarily queries to support reporting and data visualization

Data Lakehouses: a more recent innovation that combines the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse; the table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster)

OLAP Model: an aggregated type of data storage that is optimized for analytical workloads. data aggregations are across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels

different types of users might perform data analytical work at different stages of the overall architecture, for example:

  • data scientists might work directly with data files in a data lake to explore and model data

  • data analysts might query tables directly in the data warehouse to produce complex reports and visualizations

  • business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards

Knowledge Check

  1. How is data in a relational table organized?

  • Rows and Columns

  • Header and Footer

  • Pages and Paragraphs

  2. Which of the following is an example of unstructured data?

  • An Employee table with columns EmployeeID, EmployeeName, and EmployeeDesignation

  • Audio and Video files

  • A table within a relational database

  3. What is a data warehouse?

  • A nonrelational database optimized for read and write operations

  • A relational database optimized for read operations

  • A storage location for unstructured data files

Explore Data Roles and Services

Explore Job Roles in the World of Data

Three key job roles that deal with data in most organizations are (not always referred to by these names, but this illustrates the common breakdown of these types of roles):

  • Database Administrators: manage databases, assigning permissions to users, storing backup copies of data, and restoring data in the event of a failure

    • responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database systems

    • responsible for the overall availability and consistent performance and optimization of databases

    • work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error

    • also responsible for managing the security of the data in the database, granting privileges over the data, and granting or denying access to users as appropriate

  • Data Engineers: manage infrastructure and processes for data integration across the organization, applying data cleaning routines, identifying data governance rules, and implementing pipelines to transfer and transform data between systems

    • collaborate with stakeholders to design and implement data-related workloads, including data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads

    • they use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data streams

    • they're also responsible for ensuring that the privacy of data is maintained within the cloud, or when data spans from on-premises to cloud data stores

    • they own the management and monitoring of data pipelines to ensure that data loads perform as expected

  • Data Analysts: explore and analyze data to create visualizations and charts that enable organizations to make informed decisions

    • enable organizations to maximize the value of their data assets

    • responsible for exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced analytics capabilities through reports and visualizations

    • process raw data into relevant insights based on identified business requirements

Identify Data Services

Microsoft Azure: a cloud platform that powers the applications and IT infrastructure for some of the world's largest organizations; includes many services to support cloud solutions, including transactional and analytical data workloads

Some of the most commonly used cloud services for data are:

  • Azure SQL: the collective name for a family of relational database solutions based on the MS SQL Server database engine. Services include:

    • Azure SQL Database: a fully managed platform-as-a-service (PaaS) database hosted in Azure

    • Azure SQL Managed Instance: hosted instance of SQL Server with automated maintenance, which allows more flexible configuration than Azure SQL DB but with more administrative responsibility for the owner

    • Azure SQL VM: a virtual machine with an installation of SQL Server, allowing maximum configurability with full management responsibility

    • database administrators typically provision and manage Azure SQL database systems to support line of business (LOB) applications that need to store transactional data

    • data engineers may use Azure SQL database systems as sources for data pipelines that perform extract, transform, and load (ETL) operations to ingest the transactional data into an analytical system

    • data analysts may query Azure SQL databases directly to create reports, though in large organizations the data is generally combined with data from other sources in an analytical data store to support enterprise analytics

  • Azure Database for Open-Source Relational Databases: includes managed services for popular open-source relational database systems, including:

    • Azure Database for MySQL: a simple-to-use open-source database management system that is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps

    • Azure Database for MariaDB: a newer database management system, created by the original developers of MySQL, the database engine has been rewritten and optimized to improve performance, MariaDB offers compatibility with Oracle Database (another popular commercial database management system)

    • Azure Database for PostgreSQL: a hybrid relational-object database, you can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties

    • similar to Azure SQL database systems, open-source relational databases are managed by database administrators to support transactional applications, and provide a data source for data engineers building pipelines for analytical solutions and data analysts creating reports

  • Azure Cosmos DB: a global-scale non-relational (NoSQL) database system that supports multiple application programming interfaces (APIs), enabling you to store and manage data as JSON documents, key-value pairs, column-families, and graphs

    • in some organizations, instances may be provisioned and managed by a database administrator; though often software developers manage NoSQL data storage as part of the overall application architecture

    • data engineers often need to integrate Cosmos DB data sources into enterprise analytical solutions that support modeling and reporting by data analysts

  • Azure Storage: a core Azure service that enables you to store data in:

    • Blob Containers: scalable, cost-effective storage for binary files

    • File Shares: network file shares such as you typically find in corporate networks

    • Tables: key-value storage for applications that need to read and write data values quickly

    • data engineers use Azure Storage to host data lakes - blob storage with a hierarchical namespace that enables files to be organized in folders in a distributed file system

  • Azure Data Factory: an Azure service that enables you to define and schedule data pipelines to transfer and transform data

    • pipelines can be integrated with other Azure services, enabling you to ingest data from cloud data store, process the data using cloud-based compute and persist the results in another data store

    • used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization

  • Azure Synapse Analytics: a comprehensive, unified Platform-as-a-Service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities including:

    • Pipelines: based on the same technology as Azure Data Factory

    • SQL: a highly scalable SQL database engine, optimized for data warehouse workloads

    • Apache Spark: an open-source distributed data processing system that supports multiple programming languages and APIs including Java, Scala, Python, and SQL

    • Azure Synapse Data Explorer: a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)

    • data engineers can use this to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service

    • data analysts can use SQL and Spark pools through interactive notebooks to explore and analyze data, and take advantage of integration with services such as Azure Machine Learning and Microsoft Power BI to create data models and extract insights from the data

  • Azure Databricks: an Azure-integrated version of the popular Databricks platform, which combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics

    • data engineers can use existing Databricks and Spark skills to create analytical data stores in this

    • data analysts can use the native notebook support in this to query and visualize data in an easy to use web-based interface

  • Azure HDInsight

    • an Azure service that provides Azure-hosted clusters for popular Apache open-source big data processing technologies including:

      • Apache Spark: a distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python and SQL

      • Apache Hadoop: a distributed system that uses MapReduce jobs (which can be written in Java or abstracted by interfaces such as Apache Hive — a SQL-based API that runs on Hadoop) to process large volumes of data efficiently across multiple cluster nodes

      • Apache HBase: an open-source system for large-scale NoSQL data storage and querying

      • Apache Kafka: a message broker for data stream processing

    • data engineers can use Azure HDInsight to support big data analytics workloads that depend on multiple open-source technologies

  • Azure Stream Analytics

    • a real-time stream processing engine that captures a stream of data from an input, applies a query to extract and manipulate data from the input stream and writes the results to an output for analysis or further processing

    • data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture streaming data for ingesting into an analytical data store or for real-time visualization

  • Azure Data Explorer

    • a standalone service that offers the same high-performance querying of log and telemetry data as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics

    • data analysts can use this to query and analyze data that includes a timestamp attribute, such as is typically found in log files and Internet-of-Things (IoT) telemetry data

  • Microsoft Purview

    • provides a solution for enterprise-wide data governance and discoverability

    • you can use this to create a map of your data and track data lineage across multiple data sources and systems, enabling you to find trustworthy data for analysis and reporting

    • data engineers can use this to enforce data governance across the enterprise and ensure the integrity of data used to support analytical workloads

  • Microsoft Fabric

    • a unified Software-as-a-Service (SaaS) analytics platform based on an open and governed lakehouse that includes functionality to support:

      • data ingestion and ETL

      • data lakehouse analytics

      • data warehouse analytics

      • data science and machine learning

      • real-time analytics

      • data visualization

      • data governance and management

Knowledge Check

  1. Which one of the following tasks is the responsibility of a database administrator?

  • Backing up and restoring databases

  • Creating dashboards and reports

  • Creating pipelines to process data in a data lake

  2. Which role is most likely to use Azure Data Factory to define a data pipeline for an ETL process?

  • Database Administrator

  • Data Engineer

  • Data Analyst

  3. Which services would you use as a SaaS solution for data analytics?

  • Azure SQL Database

  • Azure Synapse Analytics

  • Microsoft Fabric

Explore Fundamental Relational Data Concepts

Understand Relational Data

Relational Database: you model collections of entities from the real world as tables; a format for structured data in which each row in a table has the same columns (though in some cases, not all columns need to have a value) and each column stores data of a specific datatype; the available datatypes that you can use when defining a table depend on the database system you are using, though there are standard datatypes defined by the American National Standards Institute (ANSI) that are supported by most database systems

Entity: can be anything for which you want to record information; typically important objects and events

Table: contains rows, and each row represents a single instance of an entity

Understand Normalization

Normalization: a term used by database professionals for a schema design process that minimizes data duplication and enforces data integrity; while there are many complex rules that define the process of refactoring data into various levels (or forms) of normalization, a simple definition for practical purposes is:

  1. separate each entity into its own table

  2. separate each discrete attribute into its own column

  3. uniquely identify each entity instance (row) using a primary key

  4. use foreign key columns to link related entities

Each entity is stored in its own table and each discrete attribute of those entities is in its own column. Recording each instance of an entity as a row in an entity-specific table removes duplication of data. Decomposition of attributes into individual columns ensures that each value is constrained to an appropriate data type. Additionally, the creation of individual columns provides a useful level of granularity in the data for querying. Instances of each entity are uniquely identified by an ID or other key value, known as a primary key; and when one entity references another, the primary key of the related entity is stored as a foreign key. Typically, a relational database management system (RDBMS) can enforce referential integrity to ensure that a value entered into a foreign key field has an existing corresponding primary key in the related table. In some cases, a key (primary or foreign) can be defined as a composite key based on a unique combination of multiple columns
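
A minimal sketch of these normalization principles in SQL, using hypothetical Customer and SalesOrder tables: each entity has its own table, a primary key, and a foreign key linking the related entity:

  CREATE TABLE Customer (
      CustomerID INT PRIMARY KEY,          -- uniquely identifies each customer instance
      CustomerName VARCHAR(100) NOT NULL
  );

  CREATE TABLE SalesOrder (
      OrderID INT PRIMARY KEY,
      OrderDate DATE NOT NULL,
      CustomerID INT NOT NULL,
      FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerID)  -- referential integrity
  );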

Explore SQL

Structured Query Language (SQL): used to communicate with a relational database; the standard language for RDBMS. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Common RDBMS that use SQL include MS SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle

SQL statements like SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP can be used to accomplish almost everything that you need to do with a database. Even though these are standard statements, many DBMS also have their own additional proprietary extensions to handle the specifics of that DBMS. Extensions provide functionality that isn't covered by the SQL standard and include areas such as security management and programmability. For example, a vendor's implementation typically includes proprietary extensions for writing stored procedures and triggers (application code that can be stored in the database) and for managing user accounts; PostgreSQL and MySQL also have their own versions of these features. Popular dialects of SQL include:

  • Transact-SQL (T-SQL): used by MS SQL Server and Azure SQL Services

  • pgSQL: implemented in PostgreSQL with extensions

  • PL/SQL: used by Oracle and stands for Procedural Language/SQL

Users who plan to work specifically with a single database system should learn the intricacies of their preferred SQL dialect and platform

SQL statements are grouped into three main logical groups:

  • Data Definition Language (DDL)

    • create, modify, and remove tables and other objects in a database (tables, stored procedures, views, and so on)

    • most common statements are CREATE, ALTER, DROP, or RENAME

    • datatypes available for columns will vary between DBMS but most systems will support numeric types such as INT, DECIMAL, and string types such as VARCHAR

  • Data Control Language (DCL)

    • database administrators generally use these statements to manage access to objects in a database by granting, denying, or revoking permissions to specific users or groups

    • main statements are: GRANT, DENY, and REVOKE

  • Data Manipulation Language (DML)

    • used to manipulate rows in tables; enables you to retrieve (query) data, insert new rows, or modify existing rows, as well as delete rows you no longer need

    • four main statements are: SELECT, INSERT, UPDATE, and DELETE
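
A minimal sketch combining the statement types above, using a hypothetical Product table and a hypothetical user name; exact syntax can vary slightly between SQL dialects:

  -- DDL: define the table
  CREATE TABLE Product (
      ProductID INT PRIMARY KEY,
      ProductName VARCHAR(50) NOT NULL,
      Price DECIMAL(10, 2)
  );

  -- DCL: grant read access to a (hypothetical) user
  GRANT SELECT ON Product TO reporting_user;

  -- DML: insert, query, update, and delete rows
  INSERT INTO Product (ProductID, ProductName, Price) VALUES (1, 'Widget', 9.99);
  SELECT ProductID, ProductName, Price FROM Product WHERE Price < 20;
  UPDATE Product SET Price = 10.99 WHERE ProductID = 1;
  DELETE FROM Product WHERE ProductID = 1;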

Describe Database Objects

Relational databases contain other structures to help optimize data organization, encapsulate programmatic actions, and improve the speed of access. Three of these structures are: 

  • Views

    • virtual tables based on the results of a SELECT query, like a window on specified rows in one or more underlying tables

    • you can query the view and filter the data in much the same way as a table

  • Stored Procedures

    • define SQL statements that can be run on command; used to encapsulate programmatic logic in a database for actions that applications need to perform when working with data

    • you can define a stored procedure with parameters to create a flexible solution for common actions that might need to be applied to data based on a specific key or criteria

  • Indexes

    • helps search for data in a table, like an index at the back of a book

    • a book index contains a sorted set of references with the pages on which each reference occurs

    • when creating an index in a database, you specify a column from the table and the index contains a copy of this data in a sorted order with pointers to the corresponding rows in the table

    • when a user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row
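
A minimal T-SQL-style sketch of the three object types above, reusing the hypothetical Product table from the earlier example (in a real script each CREATE statement would typically run as its own batch):

  -- view: a virtual table based on a SELECT query
  CREATE VIEW LowPricedProducts AS
  SELECT ProductID, ProductName, Price FROM Product WHERE Price < 20;

  -- stored procedure: encapsulated logic with parameters
  CREATE PROCEDURE RenameProduct @ProductID INT, @NewName VARCHAR(50)
  AS
  UPDATE Product SET ProductName = @NewName WHERE ProductID = @ProductID;

  -- index: a sorted copy of the ProductName column with pointers back to the rows
  CREATE INDEX idx_Product_ProductName ON Product (ProductName);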

Knowledge Check

  1. Which one of the following statements is a characteristic of a relational database?

  • All columns in a table must be of the same data type

  • A row in a table represents a single instance of an entity

  • Rows in the same table can contain different columns

  2. Which SQL statement is used to query tables and return data?

  • QUERY

  • READ

  • SELECT

  3. What is an index?

  • A structure that enables queries to locate rows in a table quickly

  • A virtual table based on the results of a query

  • A predefined SQL statement that modifies data

Explore Relational Database Services in Azure

Describe Azure SQL Services and Capabilities

Azure SQL is a collective term for a family of MS SQL Server-based database services in Azure; specific services include:

  • SQL Server on Azure Virtual Machines (VMs)

    • a virtual machine running in Azure with an installation of SQL Server

    • use of a VM makes this option an infrastructure-as-a-service (IaaS) solution that virtualizes hardware infrastructure for compute, storage, and networking in Azure; making it a great option for "lift and shift" migration of existing on-premises SQL Server installations to the cloud

    • enable you to use full versions of SQL Server in the Cloud without having to manage any on-premises hardware (an example of the IaaS approach)

    • replicates the database running on real on-premises hardware; migrating from the system running on-premises to an Azure virtual machine is no different than moving the database from one on-premises server to another

    • approach is suitable for migrations and applications requiring access to operating system features that might be unsupported at the PaaS level

    • lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes

    • can use this to extend existing on-premises applications to the cloud in hybrid deployments (a system where part of the operation runs on-premises, and part in the cloud; database may be part of a larger system that runs on-premises although the database elements might be hosted in the cloud)

    • can develop and test traditional applications, with full administrative rights over the DBMS and operating system; a perfect choice when an organization already has IT resources available to maintain the VM; enables you to:

      • create rapid development and test scenarios when you don’t want to buy on-premises non-production SQL Server hardware

      • become lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes or no changes

      • scale up the platform on which SQL Server is running, by allocating more memory, CPU power, and disk space to the VM; you can quickly resize it without the requirement that you reinstall the software that is running on it

    • Business Benefits

      • allows you to meet unique and diverse business needs through a combination of on-premises and cloud-hosted deployments, while using the same set of server products, development tools, and expertise across these environments

      • it's not always easy for businesses to switch their DBMS to a fully managed service; there may be specific requirements that must be satisfied in order to migrate to a managed service, requiring changes to the database and the applications that use it; using a VM can offer a solution in such cases, but it doesn't eliminate the need to administer your DBMS as carefully as you would on-premises

  • Azure SQL Managed Instance

    • a platform-as-a-service (PaaS) option that provides near-100% compatibility with on-premises SQL Server instances while abstracting the underlying hardware and operating system

    • service includes automated software update management, backups, and other maintenance tasks, reducing the administrative burden of supporting a database server instance

    • runs a fully controllable instance of SQL Server in the cloud; you can install multiple databases on the same instance, with full control as you would have on an on-premises server

    • automates backups, software patching, database monitoring, and other general tasks, and you have full control over security and resource allocation for your databases

    • depends on other services such as Azure Storage for backups, Azure Event Hubs for telemetry, Microsoft Entra ID for authentication, Azure Key Vault for Transparent Data Encryption (TDE), and a couple of Azure platform services that provide security and supportability features; managed instances make connections to these services

    • all communications are encrypted and signed using certificates

    • to check the trustworthiness of communicating parties, managed instances constantly verify these certificates through certificate revocation lists; if certificates are revoked, the managed instance closes the connections to protect the data

    • Use Cases

      • consider this if you want to lift-and-shift an on-premises SQL Server instance and all its databases to the cloud without incurring the management overhead of running SQL Server on a VM

      • provides features not available in an Azure SQL Database

      • if your system uses features such as linked servers, Service Broker (a message processing system that can be used to distribute work across servers), or Database Mail (which enables your database to send email messages to users), then you should use managed instances

    • Business Benefits

      • enables a system administrator to spend less time on administrative tasks because the service either performs them for you or greatly simplifies those tasks; automated tasks include operating system and database management system software installation and patching, dynamic instance resizing and configuration, backups, database replication (including system databases), high availability configuration, and configuration of health and performance monitoring data streams

      • has a near 100% compatibility with SQL Server Enterprise Edition, running on-premises

      • supports SQL Server database engine logins and logins integrated with MS Entra ID; engine logins include a username and a password, requiring you to enter your credentials each time you connect to the server

  • Azure SQL Database

    • fully managed, highly scalable PaaS database service that is designed for the cloud

    • service includes the core database-level capabilities of on-premises SQL Server and is a good option when you need to create a new application in the cloud

    • a PaaS offering from Microsoft, create a managed database server in the cloud and then deploy the database on this server

    • available as:

      • Single Database

        • enables you to quickly set up and run a single SQL Server database; create and run a database server in the cloud and access your database through this server

        • MS manages the server so that all you have to do is configure the database, create your tables, and populate them with your data

        • you can scale the database if you need more storage space, memory, or processing power

        • by default, resources are pre-allocated, and you’re charged per hour for the resources you’ve requested

        • you can also specify a serverless configuration (MS creates its own server, which might be shared by databases belonging to other Azure subscribers)

        • MS ensures the privacy of your database; the database automatically scales and resources are allocated or deallocated as required

      • Elastic Pool

        • similar to single but, by default, multiple databases can share the same resources such as memory, data storage space, and processing power through multiple-tenancy

        • you create the pool and only your database can use the pool

        • model is useful if you have databases with resource requirements that vary over time and can help you to reduce costs

    • Use Cases

      • gives you the best option for low cost with minimal administration

      • isn’t fully compatible with on-premises installations

      • often used in new cloud projects where the application design can accommodate any required changes to your applications

      • often used for:

        • modern cloud applications that need to use the latest stable SQL Server features

        • applications that require high availability

        • systems with a variable load that need the database server to scale up and down quickly

    • Business Benefits

      • automatically updates and patches the SQL Server software to ensure that you’re always running the latest and most secure version of the service

      • scalability features ensure that you can increase the resources available to store and process data without having to perform a costly manual upgrade

      • provides high availability guarantees, to ensure that your databases are available at least 99.995% of the time; supports point-in-time restore, enabling you to recover a database to the state it was in at any point in the past; databases can be replicated to different regions to provide more resiliency and disaster recovery

      • advanced threat protection provides advanced security capabilities, such as vulnerability assessments, to help detect and remediate potential security problems with your database; threat protection also detects anomalous activities that indicate unusual and potentially harmful attempts to access or exploit your database, continuously monitoring your database for suspicious activities, and provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns; threat detection alerts provide details of the suspicious activity and recommend actions on how to investigate and mitigate the threat

      • auditing tracks database events and writes them to an audit log in your Azure storage account, which can help you maintain regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that might indicate business concerns or suspected security violations

      • helps secure your data by providing encryption that protects data that is stored in the database (at rest) and while it is being transferred across the network (in motion)

  • Azure SQL Edge

    • a SQL engine that is optimized for Internet-of-Things (IoT) scenarios that need to work with streaming time-series data

Describe Azure Services for Open-Source Databases

Azure also includes managed services for other popular relational database systems, primarily to enable organizations that use them in on-premises apps to move to Azure quickly without making significant changes to their applications:

  • MySQL

    • started as a simple-to-use open-source DBMS; the leading open-source relational database for Linux, Apache, MySQL, and PHP (LAMP) stack apps

    • a PaaS implementation

    • includes high availability at no additional cost and scalability as required, pay for what you use, automatic backups are provided, with point-in-time restore

    • server provides connection security to enforce firewall rules and optionally require SSL connections; many parameters enable you to configure server settings such as lock modes, maximum number of connections and timeouts

    • provides a global database system that scales up to large databases without the need to manage hardware, network components, virtual servers, software patches, and other underlying components

    • certain operations aren't available, primarily those concerned with security and administration; these are managed by Azure itself

    • Benefits of Azure Database for MySQL

      • high availability features built-in

      • predictable performance

      • easy scaling that responds quickly to demand

      • secure data, both at rest and in motion

      • automatic backups and point-in-time restore for the last 35 days

      • enterprise-level security and compliance with legislation

      • pay-as-you-go pricing so you only pay for what you use

      • provides monitoring functionality to add alerts and to view metrics and logs

  • MariaDB

    • a newer DBMS, created by the original developers of MySQL, but the database engine has been rewritten and optimized to improve performance; offers compatibility with Oracle Database and has notable built-in support for temporal data: a table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past

    • fully managed and controlled by Azure; once you’ve provisioned the service and transferred your data, the system requires almost no additional administration

    • Benefits of Azure Database for MariaDB

      • built-in high availability with no additional cost

      • predictable performance, using inclusive pay-as-you-go pricing

      • scaling as needed within seconds

      • secured protection of sensitive data at rest and in motion

      • automatic backups and point-in-time-restore for up to 35 days

      • enterprise-grade security and compliance

  • PostgreSQL

    • a hybrid relational-object database: you can store data in relational tables, but it also enables you to store custom data types with their own non-relational properties, and it can store and manipulate geometric data such as lines, circles, and polygons

    • has its own query language called pgSQL, a variant of SQL with features that enable you to write stored procedures that run inside the database

    • Azure Database for PostgreSQL Flexible Server

      • flexible-server deployment option is a fully managed database service

      • provides a high level of control and server configuration customizations and provides optimization controls

    • Benefits of Azure Database for PostgreSQL

      • highly available service, contains built-in failure detection and failover mechanisms

      • users will be familiar with the pgAdmin tool which can be used to manage and monitor a database; you can continue to use this tool to connect to Azure Database for PostgreSQL but some server-focused functionality, such as performing server backup and restore, aren’t available because the server is managed and maintained by Microsoft

      • records information about queries run against databases on the server and saves them in a database named azure_sys; query the query_store.qs_view view to see this information and use it to monitor the queries that users are running (see the sketch below); this information can prove invaluable if you need to fine-tune the queries performed by your applications
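
A minimal sketch of checking the query store mentioned above; run it while connected to the azure_sys database on the server (SELECT * is used here to avoid assuming an exact column list):

  SELECT *
  FROM query_store.qs_view
  LIMIT 20;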

Knowledge Check

  1. Which deployment option offers the best compatibility when migrating an existing SQL Server on-premises solution?

  • Azure SQL Database (single database)

  • Azure SQL Database (elastic pool)

  • Azure SQL Managed Instance

  2. Which of the following statements is true about Azure SQL Database?

  • Most database maintenance tasks are automated

  • You must purchase a SQL Server license

  • It can only support one database

  3. Which database service is the simplest option for migrating a LAMP application to Azure?

  • Azure SQL Managed Instance

  • Azure Database for MySQL

  • Azure Database for PostgreSQL

Explore Azure Storage for Non-Relational Data

Explore Azure Blob Storage

Azure Blob Storage: enables you to store massive amounts of unstructured data as binary large objects (blobs) in the cloud; an efficient way to store data files in a format that is optimized for cloud-based storage, and applications can read and write them by using the blob storage API; blobs are stored in containers

Containers: a convenient way of grouping related blobs together; you can control who can read and write blobs inside a container at the container level; containers help organize blobs in a hierarchy of virtual folders, similar to files in a file system on disk, but by default these folders are simply a way of using a "/" character in a blob name to organize the blobs into namespaces; the folders are purely virtual, and you can't perform folder-level operations to control access or perform bulk operations

Three Types of Blobs:

  • Block Blobs

    • handled as a set of blocks, each block can vary in size up to 4000 MiB

    • a block blob can contain up to 50,000 blocks, giving a maximum size of over 190.7 TiB (4000 MiB X 50,000 blocks)

    • a block is the smallest amount of data that can be read or written as an individual unit

    • best used to store discrete, large, binary objects that change infrequently

  • Page Blobs

    • organized as a collection of fixed-size 512-byte pages

    • optimized to support random read and write operations; you can fetch and store data for a single page if necessary

    • can hold up to 8 TB of data

    • Azure uses page blobs to implement virtual disk storage for virtual machines

  • Append Blobs

    • a block blob optimized to support append operations

    • can only add blocks to the end of an append blob; updating or deleting existing blocks isn't supported

    • each block can vary in size, up to 4 MB; the maximum size of an append blob is just over 195 GB

Three Access Tiers to Help Balance Access Latency and Storage Cost:

  • Hot Tier

    • the default

    • use this tier for blobs that are accessed frequently

    • stored on high-performance media

  • Cool Tier

    • lower performance and incurs reduced storage charges compared to the Hot Tier

    • for data that is accessed infrequently

    • common for newly created blobs to be accessed frequently initially but less so as time passes

    • can create a blob in the Hot Tier and then move it to the Cool Tier later, or vice versa

  • Archive Tier

    • lowest storage cost and increased latency

    • intended for historical data that shouldn't be lost but is required only rarely

    • blobs are effectively stored in an offline state

    • while the other tiers have a read latency of a few milliseconds, a blob in the Archive tier can take hours to become available

    • to retrieve a blob, its access tier needs to be changed to Hot or Cool; the blob will then be rehydrated, which allows you to read it

Lifecycle Management Policy: can automatically move a blob from Hot to Cool, and then to the Archive tier as it ages and is used less frequently (policy is based on the number of days since modification); this can also arrange to delete outdated blobs
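
A sketch of what such a lifecycle management policy can look like as a JSON rule definition; the rule name and day thresholds are made up, and the property names reflect the Azure Storage management policy format as best recalled here, so treat it as illustrative:

  {
    "rules": [
      {
        "name": "age-based-tiering",
        "enabled": true,
        "type": "Lifecycle",
        "definition": {
          "filters": { "blobTypes": [ "blockBlob" ] },
          "actions": {
            "baseBlob": {
              "tierToCool": { "daysAfterModificationGreaterThan": 30 },
              "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
              "delete": { "daysAfterModificationGreaterThan": 365 }
            }
          }
        }
      }
    ]
  }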

Explore Azure Data Lake Storage Gen2

Azure Data Lake Store (Gen1): a separate service for hierarchical data storage for analytical data lakes, often used by so-called big data analytical solutions that work with structured, semi-structured, and unstructured data stored in files

Azure Data Lake Storage Gen2: a newer version of the service that is integrated into Azure Storage; enabling you to take advantage of the scalability of blob storage and the cost-control of storage tiers, combined with the hierarchical file system capabilities and compatibility with major analytics systems of Azure Data Lake Store

Systems like Hadoop in Azure HDInsight, Azure Databricks, and Azure Synapse Analytics can mount a distributed file system hosted in Azure Data Lake Store Gen2 and use it to process huge volumes of data

To create this file system, you must enable the Hierarchical Namespace option of a storage account, which can be done when initially creating the storage account, or you can upgrade an existing account to support it. However, upgrading is a one-way process; it can't be reverted to a flat namespace

Explore Azure Files

File Shares: many on-premises systems made up of a network of in-house computers make use of file shares, which enable you to store a file on one computer and grant access to that file to users and applications running on other computers. This strategy can work well for computers in the same local area network, but doesn't scale well as the number of users increases or if users are located at different sites

Azure Files: a way to create cloud-based network shares, like those you would typically find in on-premises organizations, to make documents and other files available to multiple users. By hosting file shares in Azure, organizations can eliminate hardware costs and maintenance overhead and benefit from high availability and scalable cloud storage for files. You create Azure Files shares in a storage account, which enables you to share up to 100 TB of data in a single storage account; this data can be distributed across any number of file shares in the account. The maximum size of a single file is 1 TB, but you can set quotas to limit the size of each share below this figure. Currently, Azure File Storage supports up to 2000 concurrent connections per shared file. After creating a storage account, you can upload files using the Azure portal or tools like the AzCopy utility. The Azure File Sync service can also be used to synchronize locally cached copies of shared files with the data in Azure Files. Two Performance Tiers:

  • Standard Tier: uses hard disk-based hardware in a datacenter

  • Premium Tier: uses solid-state disks and offers greater throughput at a higher charge rate

Network File Sharing Protocols

  • Server Message Block (SMB): commonly used across multiple operating systems (Windows, Linux, macOS)

  • Network File System (NFS): used by some Linux and macOS versions; to create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled

Explore Azure Tables

Azure Table Storage: a NoSQL storage solution that makes use of tables containing key/value data items; each item is represented by a row that contains columns for the data fields that need to be stored, but it's not like a table in a relational database. Instead, it enables you to store semi-structured data: all rows in a table must have a unique key (composed of a partition key and a row key), and when you modify data in a table, a timestamp column records the date and time the modification was made; beyond that, the columns in each row can vary. These tables have no concept of foreign keys, relationships, stored procedures, views, or other objects you might find in a relational database. Data in Azure Table Storage is usually denormalized, with each row holding the entire data for a logical entity

for fast access, Azure Table Storage splits a table into partitions

Partitioning: a mechanism for grouping related rows, based on a common property or partition key; rows that share the same partition key will be stored together.

Partitioning not only helps to organize data, but can also improve scalability and performance because:

  • partitions are independent from each other and can grow or shrink as rows are added/removed; a table can contain any number of partitions

  • when you search for data, you can include the partition key in the search criteria which helps to narrow down the volume of data to be examined and improves performance by reducing the amount of I/O (input and output operations, and read and writes) needed to locate the data

The key in an Azure Table Storage table comprises two elements:

  • Partition Key

    • identifies the partition containing the row

  • Row Key

    • unique to each row in the same partition

    • items in the same partition are stored in row key order

If an application adds a new row to a table, Azure ensures that the row is placed in the correct position in the table. This scheme enables an application to quickly perform point queries that identify a single row, and range queries that fetch a contiguous block of rows in a partition
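
A small, hypothetical illustration of rows in a table partitioned by city; the partition key groups related rows, the row key is unique within its partition, and the remaining columns can vary between rows:

  PartitionKey  RowKey  Name    Email               LoyaltyTier
  Seattle       001     Taylor  taylor@contoso.com
  Seattle       002     Jordan                      Gold
  Portland      001     Casey   casey@contoso.com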

Knowledge Check

  1. What are the elements of an Azure Table storage key?

  • Table name and column name

  • Partition key and row key

  • Row number

  2. What should you do to an existing Azure Storage account in order to support a data lake for Azure Synapse Analytics?

  • Add an Azure Files share

  • Create Azure Storage tables for the data you want to analyze

  • Upgrade the account to enable hierarchical namespace and create a blob container

  3. Why might you use Azure File storage?

  • To share files that are stored on-premises with users located at other sites

  • To enable users at different sites to share files

  • To store large binary data files containing images or other unstructured data

Explore Fundamentals of Azure Cosmos DB

Describe Azure Cosmos DB

Azure Cosmos DB: supports multiple application programming interfaces (APIs) that enable developers to use the programming semantics of many common kinds of data store to work with data in a Cosmos DB database; the internal data structure is abstracted, enabling developers to use Cosmos DB to store and query data using APIs with which they're already familiar. Uses indexing and partitioning to provide fast read and write performance and can scale to massive volumes of data. Can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that globally distributed users can each work with data in their local replica

Cosmos DB is a highly scalable DBMS and automatically allocates space in containers for your partitions, where they can grow up to 10 GB in size. Indexes are created and maintained automatically with virtually no administrative overhead.

Cosmos DB is a foundational service in Azure and has been used by many of MS's products for mission critical applications at global scale, including Skype, Xbox, MS 365, and many others. Cosmos DB is highly suitable for the following scenarios:

  • IoT and Telematics

    • typically ingest large amounts of data in frequent bursts of activity

    • can accept and store this information quickly and then be used by analytics services such as Azure Machine Learning, Azure HDInsight, and Power BI

    • you can also process the data in real-time using Azure Functions that are triggered as data arrives in the database

  • Retail and Marketing

    • MS uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live

    • can also be used in the retail industry for storing catalog data and for event sourcing in order processing pipelines

  • Gaming

    • a crucial component of gaming applications

    • modern games perform graphical processing on mobile/console clients but rely on the cloud to deliver customized and personalized content like in-game stats, social media integrations, and high-score leaderboards

    • games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience

    • game databases need to be fast and able to handle massive spikes in request rates during new game launches and feature updates

  • Web and Mobile Applications

    • commonly used within web and mobile applications and is well suited for modeling social interactions, integrating with third-party services, and building rich personalized experiences

    • Cosmos DB SDKs can be used to build rich iOS and Android applications using the popular Xamarin framework

Identify Azure Cosmos DB APIs

Azure Cosmos DB is a fully managed, serverless database for applications of any size or scale, with support for both relational and non-relational workloads, allowing developers to build and migrate applications quickly using their preferred open-source database engines (i.e., PostgreSQL, MongoDB, and Apache Cassandra). When provisioning a new Cosmos DB instance, select the database engine to use; the engine choice depends on many factors including the type of data to be stored, the need to support existing applications, and the skills of the developers who will work with the data store

  • Azure Cosmos DB for NoSQL

    • MS’s native non-relational service for working with the document data model

    • manages data in JSON document format

    • despite being NoSQL, uses SQL syntax to work with the data (see the sketch after this list)

  • Azure Cosmos DB for MongoDB

    • popular open source database in which data is stored in a Binary JSON (BSON) format

    • enables developers to use MongoDB client libraries and code to work with data in Azure Cosmos DB

    • MongoDB Query Language (MQL) uses a compact, object-oriented syntax in which developers use objects to call methods

  • Azure Cosmos DB for PostgreSQL

    • a native PostgreSQL, globally distributed relational database that automatically shards data to help you build highly scalable apps

    • can start building apps on a single node server group, the same way you would with PostgreSQL anywhere else

    • as the app’s scalability and performance requirements grow, you can seamlessly scale to multiple nodes by transparently distributing your tables

    • an RDBMS in which you define relational tables of data

  • Azure Cosmos DB for Table

    • used to work with data in key-value tables, similar to Azure Table Storage

    • offers greater scalability and performance than Azure Table Storage

    • you can use the Table API through one of the language-specific SDKs to make calls to your service endpoint to retrieve data from the table

  • Azure Cosmos DB for Apache Cassandra

    • compatible with Apache Cassandra, a popular open source database that uses a column-family storage structure

    • tables, similar to those in a relational database, with the exception that it’s not mandatory for each row to have the same columns

    • syntax based on SQL

  • Azure Cosmos DB for Apache Gremlin

    • used to work with data in a graph structure in which entities are defined as vertices that form nodes in a connected graph

    • nodes are connected by edges that represent relationships

    • syntax includes functions to operate on vertices and edges, enabling you to insert, update, delete and query data in the graph
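
As a rough illustration of the SQL-like syntax mentioned for the NoSQL API above, here is a minimal sketch using the azure-cosmos Python SDK; the account endpoint, key, database, container, and document fields are hypothetical.

```python
from azure.cosmos import CosmosClient

# Hypothetical account endpoint and key
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("StoreDB").get_container_client("Products")

# SQL-like query over JSON documents; parameters keep the query safe and reusable
items = container.query_items(
    query="SELECT c.id, c.name, c.price FROM c WHERE c.category = @cat",
    parameters=[{"name": "@cat", "value": "surfboards"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"], item["name"], item["price"])
```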

Knowledge Check

  1. Which API should you use to store and query JSON documents in Azure Cosmos DB?

  • Azure Cosmos DB for NoSQL

  • Azure Cosmos DB for Apache Cassandra

  • Azure Cosmos DB for Table

  2. Which Azure Cosmos DB API should you use to work with data in which entities and their relationships to one another are represented in a graph using vertices and edges?

  • Azure Cosmos DB for MongoDB

  • Azure Cosmos DB for NoSQL

  • Azure Cosmos DB for Apache Gremlin

  3. How can you enable globally distributed users to work with their own local replica of a Cosmos DB database?

  • Create an Azure Cosmos DB account in each region where you have users

  • Use the API for Table to copy data to Azure Table Storage in each region where you have users

  • Enable multi-region writes and add the regions where you have users

Explore Fundamentals of Large-Scale Analytics

Describe Data Warehousing Architecture

Large-scale data analytics architecture can vary, as can the specific technologies used to implement it; but in general, it includes the following elements:

  1. Data Ingestion and Processing

    • data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse

    • load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and structured for analysis

      • ETL Process: data is transformed before being loaded into an analytical store

      • ELT Process: data is copied to the store and then transformed

      • either way the resulting data structure is optimized for analytical queries

    • data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters

    • data ingestion includes both batch processing of static data and real-time processing of streaming data

  2. Analytical Data Store

    • data stores for large-scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases)

  3. Analytical Data Model

    • while data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce reports, dashboards, and interactive visualizations

    • often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions

    • model encapsulates the relationship between data values and dimensional entities to support “drill-up/drill-down” analysis

  4. Data Visualization

    • data analysts consume data from analytical models and directly from analytical stores to create reports, dashboards, and other visualizations

    • users in an organization who may not be technology professionals might perform self-service data analysis and reporting

    • the visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually

Explore Data Ingestion Pipelines

Large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes. Pipelines can be created and run using Azure Data Factory. Azure Synapse Analytics or Microsoft Fabric can also be used if you want to manage all of the components of your data analytics solution in a unified workspace. Either way, pipelines consist of one or more activities that operate on data. An input dataset provides the source data, and activities can be defined as a data flow that incrementally manipulates the data until an output dataset is produced. Pipelines can connect to external data sources to integrate with a wide variety of data services.

Explore Analytical Data Stores

Two Common Types of Analytical Data Stores:

  • Data Warehouses

    • relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads

    • data from a transactional store is transformed into a schema in which numeric values are stored in central fact tables, which are related to one or more dimension tables that represent entities by which the data can be aggregated; this is referred to as a star schema

    • a star schema can be extended into a snowflake schema by adding additional tables related to the dimension tables to represent dimensional hierarchies

    • this is a great choice when you have transactional data that can be organized into a structured schema of tables, and you want to use SQL to query them

  • Data Lakehouses

    • a file store, usually on a distributed file system for high performance data access

    • technologies like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and analytics; these systems often apply a schema-on-read approach to define tabular schemas on semi-structured data files at the point where the data is read for analysis without applying constraints when it’s stored

    • data lakes are great for supporting a mix of structured, semi-structured, and even unstructured data that you want to analyze without the need for schema enforcement when the data is written to the store

    • raw data is stored as files in a data lake and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL

    • SQL pools in Azure Synapse Analytics include PolyBase which enables you to define external tables based on files in a data lake (and other sources) and query them using SQL

    • Synapse Analytics also supports this approach in which you can use database templates to define the relational schema of your data warehouse while storing the underlying data in data lake storage, separating the storage and compute for your data warehousing solution

    • this is a relatively new approach in Spark-based systems and is enabled through technologies like Delta Lake (which adds relational storage capabilities to Spark, so you can define tables that enforce schemas and transactional consistency, support batch-loaded and streaming data sources, and provide a SQL API for querying)
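
A minimal sketch of the lakehouse pattern described above, assuming a Spark environment with Delta Lake support (for example a Synapse or Databricks Spark pool) and the spark session provided by the notebook; the storage path, table name, and column names are hypothetical.

```python
# Read raw delimited files from the data lake (schema-on-read)
df = spark.read.csv(
    "abfss://data@<account>.dfs.core.windows.net/sales/raw/", header=True, inferSchema=True
)

# Persist as a Delta table: the data stays in the lake but gains a schema,
# transactional consistency, and a relational (SQL) interface
df.write.format("delta").mode("overwrite").saveAsTable("sales_orders")

# Query the lake-backed table with SQL
spark.sql(
    "SELECT Category, SUM(Amount) AS TotalSales FROM sales_orders GROUP BY Category"
).show()
```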

Explore Platform-as-a-Service (PaaS) Solutions

On Azure, there are three main PaaS services that you can use to implement a large-scale analytical store:

  • Azure Synapse Analytics

    • a unified, end-to-end solution for large scale data analytics

    • brings together multiple technologies and capabilities, enabling you to combine the data integrity and reliability of a scalable, high-performance SQL Server-based relational data warehouse with the flexibility of a data lake and open-source Apache Spark

    • also includes native support for log and telemetry analytics with Azure Synapse Data Explorer pools as well as built-in data pipelines for data ingestion and transformation

    • all these services can be managed through a single, interactive user interface called Azure Synapse Studio which includes the ability to create interactive notebooks in which Spark code and markdown content can be combined

    • a great choice for when you want to create a single, unified analytics solution on Azure

  • Azure Databricks

    • Azure implementation of the popular Databricks platform

    • a comprehensive data analytics solution built on Apache Spark that offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science

    • provides an interactive user interface through which the system can be managed and data can be explored in interactive notebooks

    • due to its common use on multiple cloud platforms, you might want to consider using this as your analytical store if you want to use existing expertise with the platform or if you need to operate in a multi-cloud environment or support a cloud-portable solution

  • Azure HDInsight

    • supports multiple open-source data analytics cluster types

    • not as user-friendly as Azure Synapse Analytics and Azure Databricks but is a suitable option if analytics rely on multiple open-source frameworks or if you need to migrate an existing on-premises Hadoop-based solution to the cloud

Explore Microsoft Fabric

Scalable analytics with PaaS can be complex, fragmented, and expensive, but with MS Fabric you don’t have to spend all of your time combining various services and implementing interfaces through which business users can access them. It’s a single product that is easy to understand, set up, create, and manage. Fabric is a unified Software-as-a-Service (SaaS) offering with all your data stored in a single open format in OneLake.

OneLake is Fabric’s lake-centric architecture that provides a single, integrated environment for data professionals and the business to collaborate on data projects. Similar to OneDrive, but for data: OneLake combines storage locations across different regions and clouds into a single logical lake without moving or duplicating data. Data can be stored in any file format in OneLake and can be structured or unstructured. For tabular data, analytics engines in Fabric will write data in delta format when writing to OneLake. All engines will know how to read this format and treat delta files as tables no matter which engine wrote them.

Knowledge Check

  1. Which Azure PaaS services can you use to create a pipeline for data ingestion and processing?

  • Azure SQL Database and Azure Cosmos DB

  • Azure Synapse Analytics and Azure Data Factory

  • Azure HDInsight and Azure Databricks

  2. What must you define to implement a pipeline that reads data from Azure Blob Storage?

  • A linked service for your Azure Blob Storage account

  • A dedicated SQL pool in your Azure Synapse Analytics workspace

  • An Azure HDInsight cluster in your subscription

  3. Which open-source distributed processing engine does Azure Synapse Analytics include?

  • Apache Hadoop

  • Apache Spark

  • Apache Storm

Explore Fundamentals of Real-Time Analytics

Understand Batch and Stream Processing

Data processing is simply the conversion of raw data to meaningful information through a process. Two general ways to process data:

  • Batch Processing

    • multiple data records are collected and stored before being processed together in a single operation

    • newly arriving data elements are collected and stored and the whole group is processed together as a batch

    • exactly when each group is processed can be determined in many ways

    • advantages of batch processing:

      • large volumes of data can be processed at a convenient time

      • can be scheduled to run at a time when computers or systems might be otherwise idle, such as overnight or during off-peak hours

    • disadvantages of batch processing:

      • time delay between ingesting the data and getting the results

      • all of a batch job’s data must be ready before a batch can be processed which means data must be carefully checked

      • problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt so the input data must be carefully checked before the job can be run again

      • even minor data errors can prevent a batch job from running

  • Stream Processing

    • a source of data is constantly monitored and processed in real time as new data events occur

    • each new piece of data is processed when it arrives, unlike batch processing where there’s a wait until the next batch processing interval

    • data is processed as individual units in real-time

    • beneficial for scenarios where new, dynamic data is generated on a continual basis

    • ideal for time-critical operations that require an instant real-time response

Differences Between Batch and Streaming Data

  • Data Scope

    • batch: processes all the data in the dataset

    • stream: typically only has access to the most recent data received or within a rolling time window 

  • Data Size

    • batch: suitable for handling large datasets efficiently

    • stream: intended for individual records or micro-batches consisting of a few records

  • Performance (latency is the time taken for the data to be received and processed)

    • batch: latency is typically a few hours

    • stream: typically occurs immediately, with latency in the order of seconds or milliseconds

  • Analysis

    • batch: to perform complex analytics

    • stream: for simple response functions, aggregates, or calculations such as rolling averages

Many large-scale analytics solutions include a mix of batch and stream processing, enabling both historical and real-time data analysis. It’s common for stream processing solutions to capture real-time data, process it by filtering and aggregating it, and present it through real-time dashboards and visualizations, while also persisting the processed results in a data store for historical analysis alongside batch-processed data. Even when real-time analysis or visualization of data is not required, streaming technologies are often used to capture real-time data and store it in a data store for subsequent batch processing.

  1. data events from a streaming data source are captured in real-time

  2. data from other sources is ingested into a data store (often a data lake) for batch processing

  3. if real-time analytics is not required, the captured streaming data is written to the data store for subsequent batch processing

  4. when real-time analytics is required, a stream processing technology is used to prepare the streaming data for real-time analysis or visualization; often by filtering or aggregating the data over temporal windows

  5. the non-streaming data is periodically batch processed to prepare it for analysis and the results are persisted in an analytical data store (often referred to as a data warehouse) for historical analysis

  6. the results of stream processing may also be persisted in the analytical data store to support historical analysis

  7. analytical and visualization tools are used to present and explore the real-time and historical data

Explore Common Elements of Stream Processing Architecture

There are many technologies that you can use to implement a stream processing solution, but while specific implementation details may vary, there are common elements to most streaming architectures. In its simplest form, a high-level architecture for stream processing looks like:

  1. an event generates some data; this might be a signal being emitted by a sensor, a social media message being posted, a log file entry being written, or any other occurrence that results in some digital data

  2. generated data is captured in a streaming source for processing

  3. event data is processed, often by a perpetual query that operates on the event data to select data for specific types of events, project data values, or aggregate data values over temporal (time-based) periods (or windows)

  4. results of the stream processing operation are written to an output (or sink) which may be a file, a database table, a real-time visual dashboard, or another queue for further processing by a subsequent downstream query

MS Azure supports multiple technologies that can be used to implement real-time analytics of streaming data including:

  • Azure Stream Analytics: a PaaS solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query and write the results to an output

  • Spark Structured Streaming: open-source library that enables you to develop complex streaming solutions on Apache Spark based services (including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight)

  • Azure Data Explorer: high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace

Services that are commonly used to ingest data for stream processing on Azure include:

  • Azure Event Hubs: a data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once (see the sketch after this list)

  • Azure IoT Hub: a data ingestion service that is similar to Azure Event Hubs but is optimized for managing event data from Internet-of-Things (IoT) devices

  • Azure Data Lake Store Gen2: a highly scalable storage service that is often used in batch processing scenarios but can also be used as a source of streaming data

  • Apache Kafka: an open-source data ingestion solution that is commonly used together with Apache Spark (Azure HDInsight can be used to create a Kafka cluster)
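
As an example of the ingestion side, the hedged sketch below sends events to Azure Event Hubs with the azure-eventhub Python SDK; the connection string, hub name, and event payloads are hypothetical.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection string and event hub name
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>", eventhub_name="device-telemetry"
)

# Send a small batch of JSON events; a stream processor (Stream Analytics,
# Spark Structured Streaming, etc.) can consume them downstream
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "sensor-01", "temperature": 21.7})))
    batch.add(EventData(json.dumps({"deviceId": "sensor-02", "temperature": 19.4})))
    producer.send_batch(batch)
```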

The output from stream processing is often sent to the following services:

  • Azure Event Hubs: used to queue the processed data for further downstream processing

  • Azure Data Lake Store Gen2 or Azure Blob Storage: used to persist the processed results as a file

  • Azure SQL Database or Azure Synapse Analytics or Azure Databricks: used to persist the processed results in a database table for querying and analysis

  • Microsoft Power BI: used to generate real-time data visualizations in reports and dashboards

Explore Azure Stream Analytics

Azure Stream Analytics: a service for complex event processing and analysis of streaming data. This is used to:

  • ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage Blob Container

  • process the data by using a query to select, project, and aggregate data values

  • write the results to an output, such as Azure Data Lake Gen2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure Event Hub, Microsoft Power BI, or others

Once started, this query runs perpetually, processing new data as it arrives in the input and storing results in the output. This is a great technology choice when you need to continually capture data from a streaming source, filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting

The easiest way to use this is to create a Stream Analytics job in an Azure subscription, configure its input(s) and output(s), and define the query that the job will use to process the data. The query is expressed using SQL syntax and can incorporate static reference data from multiple data sources to supply lookup values that can be combined with the streaming data ingested from an input. If your stream processing requirements are complex or resource-intensive, you can create a Stream Analytics cluster, which uses the same underlying processing engine as a Stream Analytics job, but in a dedicated tenant (so your processing isn’t affected by other customers) and with configurable scalability that enables you to define the right balance of throughput and cost for your specific scenario.

Explore Apache Spark on Microsoft Azure

Apache Spark is a distributed processing framework for large-scale data analytics that can be used in the following services:

  • Azure Synapse Analytics

  • Azure Databricks

  • Azure HDInsight

Spark can be used to run code (usually written in Python, Scala, or Java) in parallel across multiple cluster nodes, enabling it to process very large volumes of data efficiently. It can be used for both batch and stream processing.
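
A minimal batch-processing sketch, assuming the spark session provided by a Synapse, Databricks, or HDInsight notebook; the storage paths and column names are hypothetical.

```python
# Load delimited files from the data lake into a distributed dataframe
orders = spark.read.csv(
    "abfss://data@<account>.dfs.core.windows.net/orders/", header=True, inferSchema=True
)

# Aggregate in parallel across the cluster nodes
totals = orders.groupBy("ProductCategory").sum("SalesAmount")

# Write the results back to the lake in Parquet format
totals.write.mode("overwrite").parquet(
    "abfss://data@<account>.dfs.core.windows.net/curated/category_totals/"
)
```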

  • Spark Structured Streaming

    • provides an API for ingesting, processing, and outputting results from perpetual streams of data

    • built on a ubiquitous structure in Spark called a dataframe (encapsulates a table of data)

    • can use this to read data from a real-time data source such as a Kafka hub, a file store, or a network port into a “boundless” dataframe that is continually populated with new data from the stream; a query is then defined on the dataframe that selects, projects, or aggregates the data, often in temporal windows

    • results of the query generate another dataframe, which can be persisted for analysis or further processing

    • a great choice for real-time analytics when you need to incorporate streaming data into a Spark based data lake or analytical data store

  • Delta Lake

    • an open-source storage layer that adds support for transactional consistency, schema enforcement, and other common data warehousing features to data lake storage

    • also unifies storage for streaming and batch data, and can be used in Spark to define relational tables for both batch and stream processing

    • a Delta Lake table can be used as a streaming source for queries against real-time data, or as a sink to which a stream of data is written

    • Spark runtimes in Azure Synapse Analytics and Azure Databricks include support for Delta Lake

Used together, Spark Structured Streaming and Delta Lake are a good solution when you need to abstract batch- and stream-processed data in a data lake behind a relational schema for SQL-based querying and analysis
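
A hedged sketch of that combination, assuming a Spark environment with Delta Lake and the Kafka connector available; the broker, topic, event schema, and paths are hypothetical.

```python
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

# Hypothetical schema of the incoming JSON events
schema = (StructType()
          .add("deviceId", StringType())
          .add("temperature", DoubleType())
          .add("eventTime", TimestampType()))

# "Boundless" dataframe that is continually populated from the stream
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "<broker>:9092")
       .option("subscribe", "device-telemetry")
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Aggregate over 5-minute temporal windows
averages = (events
            .withWatermark("eventTime", "10 minutes")
            .groupBy(window("eventTime", "5 minutes"), "deviceId")
            .avg("temperature"))

# Use a Delta Lake table as the streaming sink
query = (averages.writeStream.format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/delta/checkpoints/device_averages")
         .start("/delta/device_averages"))
```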

Explore Real-Time Analytics in Microsoft Fabric

Microsoft Fabric includes native support for real-time data analytics, including real-time data ingestion from multiple streaming sources. You can use an event stream to capture real-time event data from a streaming source and persist it in a destination such as a table in a Lakehouse or KQL database. When writing event stream data to a Lakehouse table, you can apply aggregations and filters to summarize the captured data. KQL databases support tables based on the Data Explorer engine, enabling you to perform real-time analytics on the data in tables by running KQL queries. After capturing real-time data in a table, you can use Power BI in Microsoft Fabric to create real-time data visualizations
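
As a rough illustration of querying a KQL database’s captured events from Python, the sketch below uses the azure-kusto-data package; the query URI, database, table, and columns are hypothetical (the same KQL could also be run interactively in a KQL queryset).

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical query endpoint; authentication via the signed-in Azure CLI session
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

# Summarize the most recent hour of captured events per device
query = """
DeviceTelemetry
| where eventTime > ago(1h)
| summarize avgTemp = avg(temperature) by deviceId
"""
for row in client.execute("TelemetryDB", query).primary_results[0]:
    print(row["deviceId"], row["avgTemp"])
```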

Knowledge Check

  1. Which definition of stream processing is correct?

  • Data is processed continually as new data records arrive

  • Data is collected in a temporary store, and all records are processed together as a batch

  • Data is incomplete and cannot be analyzed

  2. Which service would you use to continually capture data from an IoT Hub, aggregate it over temporal periods, and store results in Azure SQL Database?

  • Azure Cosmos DB

  • Azure Stream Analytics

  • Azure Storage

Explore Fundamentals of Data Visualization

Describe Power BI Tools and Workflow

Microsoft Power BI: a suite of tools and services that data analysts can use to build interactive data visualizations for business users to consume

  • typical workflow for creating a data visualization solution starts with Power BI Desktop (a Microsoft Windows application where you can import data from a wide range of data sources, combine and organize the data from these sources in an analytics data model, and create reports that contain interactive visualizations of the data)

  • after creating data models and reports, they can be published to the Power BI service (a cloud service in which reports can be published and interacted with by business users)

  • basic data modeling and report editing can be directly done in the service using a web browser, but the functionality for this is limited compared to the desktop tool

  • the service can also be used to schedule refreshes of the data sources on which your reports are based and to share reports with other users

  • can define dashboards and apps that combine related reports in a single, easy to consume location

  • users can consume reports, dashboards, and apps in the Power BI service through a web browser or on mobile devices using the Power BI phone app

Describe Core Concepts of Data Modeling

Dimension Tables: represent entities by which you want to aggregate numeric measures; each entity is represented by a row with a unique key value and the remaining columns represent attributes of an entity. It’s common in most analytical models to include a time dimension to aggregate numeric measures associated with events over time

Fact Tables: store the numeric measures that will be aggregated by the various dimensions in the model; each row represents a recorded event that has numeric measures associated with it

Star Schema: where a fact table is related to one or more dimension tables; a more complex schema can be defined in which dimensional tables are related to additional tables containing more details

Hierarchies enable you to drill up or drill down to find aggregated values at different levels, which is worth considering when designing analytical models. Models can be built with pre-aggregated values for each level of a hierarchy, enabling you to quickly change the scope of your analysis

Power BI can be used to define an analytical model from tables of data, which can be imported from one or more data sources. Then you can use the data modeling interface on the Model tab of Power BI Desktop to define your analytical model by creating relationships between fact and dimension tables, defining hierarchies, setting data types and display formats for fields in the tables, and managing other properties of your data that help define a rich model for analysis
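
To make the fact/dimension relationship concrete outside of Power BI, here is a tiny hypothetical sketch using pandas (table and column names invented): the fact table holds numeric measures, the dimension table holds descriptive attributes, and a measure is aggregated by a dimension attribute.

```python
import pandas as pd

# Fact table: one row per recorded sales event, with numeric measures
fact_sales = pd.DataFrame({
    "DateKey": [20240101, 20240101, 20240102],
    "ProductKey": [1, 2, 1],
    "SalesAmount": [120.0, 80.0, 60.0],
})

# Dimension table: one row per product, with descriptive attributes
dim_product = pd.DataFrame({
    "ProductKey": [1, 2],
    "Category": ["Bikes", "Accessories"],
})

# Join fact to dimension and aggregate a measure by a dimension attribute
totals = (fact_sales.merge(dim_product, on="ProductKey")
          .groupby("Category", as_index=False)["SalesAmount"].sum())
print(totals)
```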

Describe Considerations for Data Visualization

After a model is created, it can be used to generate data visualizations that can be included in a report. There are many different kinds of data visualizations, some of which include:

  • Tables and Text

    • simplest way to communicate

    • useful when numerous related values must be displayed, and individual text values in cards can be a useful way to show important figures or metrics

  • Bar and Column Charts

    • good to visually compare numeric values for discrete categories

  • Line Charts

    • compare categorized values and useful when you need to examine trends, often over time

  • Pie Charts

    • visually compare categorized values as proportions of a total

  • Scatter Plots

    • compare two numeric measures and identify a relationship or correlation between them

  • Maps

    • visually compare values for different geographic areas or locations

  • Interactive Reports in Power BI

    • visual elements for related data in a report are automatically linked to one another and provide interactivity

Knowledge Check

  1. Which tool should you use to import data from multiple data sources and create a report?

  • Power BI Desktop

  • Power BI Phone App

  • Azure Data Factory

  2. What should you define in your data model to enable drill-up/down analysis?

  • A measure

  • A hierarchy

  • A relationship

  3. Which kind of visualization should you use to analyze pass rates for multiple exams over time?

  • A pie chart

  • A scatter plot

  • A line chart

Practice Assessment

Notes:

  • Azure Files is used to share files by using NFS and SMB

  • Data Lake Storage Gen2 is used for storing huge amounts of data to be processed and not to be shared among virtual machines

  • Page Blobs are used for VHDs

  • Table Storage is used for two dimensional tables

  • Archive tier is for data that will be used once per year and where access times of more than an hour are acceptable

  • Azure Data Lake Storage and Azure Blob Storage can both be mounted in Azure Synapse Analytics and be used to process large volumes of data

  • Azure Files and Table Storage cannot be mounted in Azure Synapse Analytics

  • in Azure Table Storage, the RowKey is unique in each partition (not table)

  • Append Blobs allow you to frequently add new data to a file but doesn’t allow for modification or deletion of existing data

  • Apache Cassandra is for tabular data in column-family storage and is queried by using a SQL-based syntax

  • Gremlin API is for graph databases

  • MongoDB API stores data in the Binary JSON (BSON) format

  • Table is used to retrieve key-value pairs

  • Azure Cosmos DB allows you to aggregate data stored in JSON files for use in analytical reports without additional development effort

  • Azure SQL Database doesn’t store data in JSON files

  • Blob Storage and Data Lake Storage don’t allow you to aggregate data for analytical reports without additional development effort

  • Data Warehouse uses fact and dimension tables in a star/snowflake schema

  • Relational Databases don’t use fact and dimension tables

  • Data Factory and Synapse Analytics allow you to create a pipeline in response to an event

  • Databricks and HDInsight run data processing tasks

  • Databricks and the Spark pool run data processing for large amounts of data by using Scala

  • a Linked Service must be created for an integration process that copies data from MS Excel files to Parquet files by using Azure Data Factory

  • Pipelines use existing linked services to load and process data

  • Datasets are the input and output

  • Activities can be defined as the data flow

  • Azure Stream Analytics handles stream processing from Kafka to Data Lake

  • Azure Synapse Analytics doesn’t process streaming data

  • Azure Cosmos DB and Data Factory don’t handle streams

  • Data Lake Storage Gen2 can be used to store files in stream processing architecture

  • Relational Database is the best option for CRUD operations and uses the least amount of storage space