Explore Core Data Concepts
Identify Data Formats
Data: a collection of facts such as numbers, descriptions, and observations used to record information
Data Structures: organized data often represents entities (each typically having one or more attributes or characteristics) that are important to an organization (e.g., customers, products, sales orders, and so on)
Structured Data:
data that adheres to a fixed schema so that the data has the same fields or properties
schema is often tabular: data is represented in one or more tables that consist of rows to represent each instance of a data entity and columns to represent the attributes of the entity
often stored in a database in which multiple tables can reference one another by using key values in a relational model
Semi-Structured Data:
information that has some structure but allows for some variation between entity instances
a common format is JavaScript Object Notation (JSON)
JSON is one of many ways to represent this data
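for example, a hypothetical pair of customer documents in JSON; note that the second instance has an address and an extra contact entry that the first doesn't, variation a fixed relational schema wouldn't allow without schema changes:

```json
[
  {
    "id": 1,
    "name": "Joe Jones",
    "contact": [ { "type": "home", "number": "555-0100" } ]
  },
  {
    "id": 2,
    "name": "Samir Nadoy",
    "address": { "city": "Seattle", "country": "USA" },
    "contact": [
      { "type": "home", "number": "555-0101" },
      { "type": "email", "address": "samir@contoso.example" }
    ]
  }
]
```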
Unstructured Data
documents, images, audio and video data, and binary files don’t have a specific structure
Data Stores: organizations typically store data in any of the 3 formats to record details of entities, specific events, and other information. Stored data can be retrieved later for analysis and reporting (file stores, databases)
Explore File Storage
File Storage: the ability to store data in files is a core element of any computing system. Files can be stored in local file systems on the hard disk of your personal computer, and on removable media such as USB drives, but in most organizations important data files are stored centrally in some kind of shared file storage system. Increasingly, that central storage location is hosted in the cloud, enabling cost-effective, secure, and reliable storage for large volumes of data
Specific file format used to store data depends on a number of factors including:
type of data being stored (structured, semi-structured, or unstructured)
applications and services that will need to read, write, and process the data
need for the data files to be readable by humans, or optimized for efficient storage and processing
File Formats:
Delimited Text Files
data is often stored in plain text format with specific field delimiters and row terminators; the most common format is comma-separated values (CSV), where fields are separated by commas and rows are terminated by a carriage return / new line
optionally, first line may include the field names
other common formats include tab-separated values (TSV) and space-delimited (in which tabs or spaces are used to separate fields), and fixed-width data in which each field is allocated a fixed number of characters
good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format
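for example, a small hypothetical product file in CSV format, with an optional header row:

```csv
ProductID,ProductName,Price
123,Widget,9.99
124,Gadget,19.49
```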
JavaScript Object Notation (JSON)
a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes, each attribute might be an object (or collection of objects); making JSON a flexible format that’s good for both structured and semi-structured data
Extensible Markup Language (XML)
a human-readable data format that was popular in the 1990s and 2000s, largely been superseded by the less verbose JSON format but there are still some systems that use XML to represent data
uses tags enclosed in angle-brackets (<…/>) to define elements and attributes
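for example, a hypothetical customer record expressed in XML:

```xml
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555-0100"/>
      <Contact type="email" address="joe@contoso.example"/>
    </ContactDetails>
  </Customer>
</Customers>
```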
Binary Large Object (BLOB)
all files are stored as binary data, but in the human-readable formats, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode)
some file formats, particularly for unstructured data, store the data as raw binary that must be interpreted by applications and rendered
common types of data stored as binary include images, video, audio, and application-specific documents
files are often referred to as BLOBs
Optimized File Formats
human-readable formats for structured and semi-structured data can be useful, but they're not typically optimized for storage space or processing
over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed
common optimized formats you might see include:
Avro: row-based format created by Apache
each record contains a header that describes the structure of the data in the record; the header is stored as JSON and the data is stored as binary information
an application uses the information in the header to parse the binary data and extract the fields it contains; a good format for compressing data and minimizing storage and network bandwidth requirements
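as a rough sketch, the JSON header of an Avro file contains a schema describing the record structure; the field names below are hypothetical:

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": ["null", "string"], "default": null }
  ]
}
```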
ORC (Optimized Row Columnar format) organizes data in columns rather than rows
developed by HortonWorks for optimizing read and write operations in Apache Hive (a data warehouse system that supports fast data summarization and querying over large datasets)
contains stripes of data, each stripe holds the data for a column or set of columns and a stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column
Parquet, created by Cloudera and Twitter, is another columnar data format
contains row groups, data for each column is stored together in the same row group, each row group contains one or more chunks of data
includes metadata that describe the set of rows found in each chunk, an application can use the metadata to quickly locate the correct chunk for a given set of rows and retrieve the data in the specific columns for these rows
specializes in storing and processing nested data types efficiently, supports very efficient compression and encoding schemes
Explore Databases
Database: used to describe a central system in which data can be stored and queried. the file system on which files are stored is a kind of database, but when we use the term in a professional data context, we usually mean a dedicated system for managing data records rather than files
Relational Databases
commonly used to store and query structured data which is stored in tables that represent entities such as customers, products or sale orders
each instance of an entity is assigned a primary key that uniquely identifies it; the keys are used to reference the entity instances in other tables
the use of keys to reference data entities enables a relational database to be normalized, eliminating duplicate data values
tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard so that it's similar across multiple database systems
Non-Relational Databases
data management systems that don’t apply a relational schema to the data, often referred to as NoSQL database even though some support a variant of the SQL language
four common types in use:
Key-Value Databases: each record consists of a unique key and an associated value which can be in any format
Document Databases: a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query)
Column Family Databases: store tabular data comprising rows and columns, but the columns can be divided into groups known as column families (each one holds a set of columns that are logically related)
Graph Databases: store entities as nodes with links to define relationships between them
Explore Transactional Data Processing
Transactional Data Processing System: what most people consider the primary function of business computing; records transactions: small, discrete units of work that encapsulate specific events the organization wants to track. transactions can be financial, such as the movement of money between accounts in a banking system, or non-financial, such as a retail system tracking payments for goods and services from customers. often high-volume, sometimes handling many millions of transactions in a single day. the data being processed has to be accessible very quickly.
Online Transactional Processing (OLTP): the work performed by transactional systems; solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). operations are applied transactionally in a way that ensures the integrity of the data stored in the database. to accomplish this, OLTP systems enforce transactions that support so-called ACID semantics:
Atomicity: each transaction is treated as a single unit which succeeds or fails completely
Consistency: transactions can only take the data in the database from one valid state to another
Isolation: concurrent transactions cannot interfere with one another and, as a result, must result in a consistent database state
Durability: when a transaction has been committed, it will remain committed
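as a minimal T-SQL-style sketch (the table and column names are hypothetical), a bank transfer applied as a single atomic transaction: either both updates are committed or neither is

```sql
BEGIN TRANSACTION;

-- debit one account and credit another; a failure before COMMIT rolls back both changes
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;

COMMIT TRANSACTION;
```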
OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications
Explore Analytical Data Processing
Analytical Data Processing: typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics; analytics can be based on a snapshot of the data at a given point in time or on a series of snapshots. specific details can vary between solutions, but a common architecture for enterprise-scale analytics looks like this:
operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis
data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine
data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables
the data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards
Data Lakes: common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed
Data Warehouses: an established way to store data in a relational schema that is optimized for read operations — primarily queries to support reporting and data visualization
Data Lakehouses: a more recent innovation that combines the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse; the table schema may require some denormalization of data from an OLTP data source (introducing some duplication to make queries perform faster)
OLAP Model: an aggregated type of data storage that is optimized for analytical workloads. data aggregations are across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels
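the kind of aggregation such a model pre-computes can be sketched in SQL against hypothetical fact and dimension tables, with measures summed at the intersection of dimension attributes:

```sql
SELECT d.CalendarYear, d.MonthName, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimDate AS d ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear, d.MonthName;
```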
different types of users might perform data analytical work at different stages of the overall architecture, for example:
data scientists might work directly with data files in a data lake to explore and model data
data analysts might query tables directly in the data warehouse to produce complex reports and visualizations
business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards
Knowledge Check
How is data in a relational table organized?
Rows and Columns
Header and Footer
Pages and Paragraphs
Which of the following is an example of unstructured data?
An Employee table with columns EmployeeID, EmployeeName, and EmployeeDesignation
Audio and Video files
A table within a relational database
What is a data warehouse?
A nonrelational database optimized for read and write operations
A relational database optimized for read operations
A storage location for unstructured data files
Explore Data Roles and Services
Explore Job Roles in the World of Data
Three key job roles that deal with data in most organizations (not always with these exact titles, but they illustrate the common breakdown of these types of roles):
Database Administrators: manage databases, assigning permissions to users, storing backup copies of data, and restoring data in the event of a failure
responsible for the design, implementation, maintenance, and operational aspects of on premises and cloud-based database systems
responsible for the overall availability and consistent performance and optimization of databases
work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error
also responsible for managing the security of the data in the database, granting privileges over the data, and granting or denying access to users as appropriate
Data Engineers: manage infrastructure and processes for data integration across the organization, applying data cleaning routines, identifying data governance rules, and implementing pipelines to transfer and transform data between systems
collaborates with stakeholders to design and implement data-related workloads, including data ingestion pipelines, cleansing and transformation activities, and datastore for analytical workloads
they use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data streams
they're also responsible for ensuring that the privacy of data is maintained, whether data stores are in the cloud or span from on-premises to the cloud
they own the management and monitoring of data pipelines to ensure that data loads perform as expected
Data Analysts: explore and analyze data to create visualizations and charts that enable organizations to make informed decisions
enabling organizations to maximize the value of their data assets
responsible for exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced analytics capabilities through reports and visualizations
processes raw data into relevant insights based on identified business requirements
Identify Data Services
Microsoft Azure: a cloud platform that powers the applications and IT infrastructure for some of the world's largest organizations; includes many services to support cloud solutions, including transactional and analytical data workloads
Some of the most commonly used cloud services for data are:
Azure SQL: the collective name for a family of relational database solutions based on the Microsoft SQL Server database engine. services include:
Azure SQL Database: a fully managed platform-as-a-service (PaaS) database hosting in Azure
Azure SQL Managed Instance: hosted instance of SQL Server with automated maintenance, which allows more flexible configuration than Azure SQL DB but with more administrative responsibility for the owner
Azure SQL VM: a virtual machine with an installation of SQL Server, allowing maximum configurability with full management responsibility
database administrators typically provision and manage Azure SQL database systems to support line of business (LOB) applications that need to store transactional data
data engineers may use Azure SQL database systems as sources for data pipelines that perform extract, transform, and load (ETL) operations to ingest the transactional data into an analytical system
data analysts may query Azure SQL databases directly to create reports, though in large organizations the data is generally combined with data from other sources in an analytical data store to support enterprise analytics
Azure Database for Open-Source Relational Databases: includes managed services for popular open-source relational database systems, including:
Azure Database for MySQL: a simple-to-use open-source database management system that is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps
Azure Database for MariaDB: a newer database management system created by the original developers of MySQL; the database engine has been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system)
Azure Database for PostgreSQL: a hybrid relational-object database, you can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties
similar to Azure SQL database systems, open-source relational databases are managed by database administrators to support transactional applications, and provide a data source for data engineers building pipelines for analytical solutions and data analysts creating reports
Azure Cosmos DB: a global-scale non-relational (NoSQL) database system that supports multiple application programming interfaces (APIs), enabling you to store and manage data as JSON documents, key-value pairs, column-families, and graphs
in some organizations, instances may be provisioned and managed by a database administrator; though often software developers manage NoSQL data storage as part of the overall application architecture
data engineers often need to integrate Cosmos DB data sources into enterprise analytical solutions that support modeling and reporting by data analysts
Azure Storage: a core Azure service that enables you to store data in:
Blob Containers: scalable, cost-effective storage for binary files
File Shares: network file shares such as you typically find in corporate networks
Tables: key-values for applications that need to read and write data values quickly
data engineers use Azure Storage to host data lakes - blob storage with a hierarchical namespace that enables files to be organized in folders within a distributed file system
Azure Data Factory: an Azure service that enables you to define and schedule data pipelines to transfer and transform data
pipelines can be integrated with other Azure services, enabling you to ingest data from cloud data stores, process the data using cloud-based compute, and persist the results in another data store
used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization
Azure Synapse Analytics: a comprehensive, unified Platform-as-a-Service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities including:
Pipelines: based on the same technology as Azure Data Factory
SQL: a highly scalable SQL database engine, optimized for data warehouse workloads
Apache Spark: an open-source distributed data processing system that supports multiple programming languages and APIs including Java, Scala, Python, and SQL
Azure Synapse Data Explorer: a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)
data engineers can use this to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service
data analysts can use SQL and Spark pools through interactive notebooks to explore and analyze data, and take advantage of integration with services such as Azure Machine Learning and Microsoft Power BI to create data models and extract insights from the data
Azure Databricks: an Azure-integrated version of the popular Databricks platform, which combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics
data engineers can use existing Databricks and Spark skills to create analytical data stores in this
data analysts can use the native notebook support in this to query and visualize data in an easy to use web-based interface
Azure HDInsight
an Azure service that provides Azure-hosted clusters for popular Apache open-source big data processing technologies including:
Apache Spark: a distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python and SQL
Apache Hadoop: a distributed system that uses MapReduce jobs (which can be written in Java or abstracted by interfaces such as Apache Hive, a SQL-based API that runs on Hadoop) to process large volumes of data efficiently across multiple cluster nodes
Apache HBase: an open-source system for large-scale NoSQL data storage and querying
Apache Kafka: a message broker for data stream processing
data engineers can use Azure HDInsight to support big data analytics workloads that depend on multiple open-source technologies
Azure Stream Analytics
a real-time stream processing engine that captures a stream of data from an input, applies a query to extract and manipulate data from the input stream and writes the results to an output for analysis or further processing
data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture streaming data for ingesting into an analytical data store or for real-time visualization
Azure Data Explorer
a standalone service that offers the same high-performance querying of log and telemetry data as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics
data analysts can use this to query and analyze data that includes a timestamp attribute, such as is typically found in log files and Internet-of-Things (IoT) telemetry data
Microsoft Purview
provides a solution for enterprise-wide data governance and discoverability
you can use this to create a map of your data and track data lineage across multiple data sources and systems, enabling you to find trustworthy data for analysis and reporting
data engineers can use this to enforce data governance across the enterprise and ensure the integrity of data used to support analytical workloads
Microsoft Fabric
a unified Software-as-a-Service (SaaS) analytics platform based on an open and governed lakehouse that includes functionality to support:
data ingestion and ETL
data lakehouse analytics
data warehouse analytics
data science and machine learning
real-time analytics
data visualization
data governance and management
Knowledge Check
Which one of the following tasks is the responsibility of a database administrator?
Backing up and restoring databases
Creating dashboards and reports
Creating pipelines to process data in a data lake
Which role is most likely to use Azure Data Factory to define a data pipeline for an ETL process?
Database Administrator
Data Engineer
Data Analyst
Which services would you use as a SaaS solution for data analytics?
Azure SQL Database
Azure Synapse Analytics
Microsoft Fabric
Explore Fundamental Relational Data Concepts
Understand Relational Data
Relational Database: you model collections of real-world entities as tables, a format for structured data; each row in a table has the same columns (though in some cases, not all columns need to have a value); each column stores data of a specific datatype; the available datatypes that you can use when defining a table depend on the database system you are using, though there are standard datatypes defined by the American National Standards Institute (ANSI) that are supported by most database systems
Entity: can be anything for which you want to record information; typically important objects and events
Table: contains rows, and each row represents a single instance of an entity
Understand Normalization
Normalization: a term used by database professionals for a schema design process that minimizes data duplication and enforces data integrity; while there are many complex rules that define the process of refactoring data into various levels (or forms) of normalization, a simple definition for practical purposes is:
separate each entity into its own table
separate each discrete attribute into its own column
uniquely identify each entity instance (row) using a primary key
use foreign key columns to link related entities
Each entity is stored in its own table and each discrete attribute of those entities is in its own column. Recording each instance of an entity as a row in an entity-specific table removes duplication of data. Decomposition of attributes into individual columns ensures that each value is constrained to an appropriate data type. Additionally, the creation of individual columns provides a useful level of granularity in the data for querying. Instances of each entity are uniquely identified by an ID or other key value, known as a primary key; and when one entity references another, the primary key of the related entity is stored as a foreign key. Typically, a relational database management system (RDBMS) can enforce referential integrity to ensure that a value entered into a foreign key field has an existing corresponding primary key in the related table. In some cases, a key (primary or foreign) can be defined as a composite key based on a unique combination of multiple columns
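a minimal sketch in SQL (hypothetical tables): Customer has a primary key, and SalesOrder references it with a foreign key so that customer details aren't duplicated on every order

```sql
CREATE TABLE Customer (
    CustomerID   INT PRIMARY KEY,          -- uniquely identifies each customer
    CustomerName VARCHAR(100) NOT NULL,
    City         VARCHAR(50)
);

CREATE TABLE SalesOrder (
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE NOT NULL,
    CustomerID INT NOT NULL REFERENCES Customer (CustomerID)  -- foreign key enforcing referential integrity
);
```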
Explore SQL
Structured Query Language (SQL): used to communicate with a relational database; the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. common RDBMSs that use SQL include MS SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle
SQL statements like SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP can be used to accomplish almost everything that you need to do with a database. Even though these are standard statements, many DBMSs also have their own additional proprietary extensions to handle the specifics of that DBMS. Extensions provide functionality that isn't covered by the SQL standard and include areas such as security management and programmability. For example, an implementation may include proprietary extensions for writing stored procedures and triggers (application code that can be stored in the database) and for managing user accounts. PostgreSQL and MySQL also have their own versions of these features. Popular dialects of SQL include:
Transact-SQL (T-SQL): used by MS SQL Server and Azure SQL Services
pgSQL: implemented in PostgreSQL with extensions
PL/SQL: used by Oracle; stands for Procedural Language/SQL
Users who plan to work specifically with a single database system should learn the intricacies of their preferred SQL dialect and platform
SQL statements are grouped into three main logical groups:
Data Definition Language (DDL)
create, modify, and remove tables and other objects in a database (table, stored procedures, views, and so on)
most common statements are CREATE, ALTER, DROP, or RENAME
datatypes available for columns will vary between DBMS but most systems will support numeric types such as INT, DECIMAL, and string types such as VARCHAR
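for example (hypothetical table names):

```sql
CREATE TABLE Product (ProductID INT PRIMARY KEY, Name VARCHAR(50), Price DECIMAL(8,2));
ALTER TABLE Product ADD Description VARCHAR(200);   -- modify an existing table
DROP TABLE ObsoleteStaging;                          -- remove a table and its data
```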
Data Control Language (DCL)
database administrators generally use these statements to manage access to objects in a database by granting, denying, or revoking permissions to specific users or groups
main statements are: GRANT, DENY, and REVOKE
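for example (hypothetical user and table; DENY is a Transact-SQL extension rather than standard SQL):

```sql
GRANT SELECT, INSERT ON Product TO SalesUser;   -- allow reading and adding rows
DENY DELETE ON Product TO SalesUser;            -- explicitly block deletes (T-SQL)
REVOKE INSERT ON Product FROM SalesUser;        -- remove a previously granted permission
```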
Data Manipulation Language (DML)
use to manipulate rows in tables, enables you to retrieve (query) data, insert new rows, or modify existing rows as well as deleting rows if you don’t need them anymore
four main statements are: SELECT, INSERT, UPDATE, and DELETE
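for example (hypothetical table):

```sql
SELECT ProductID, Name, Price FROM Product WHERE Price > 10;                  -- retrieve (query) rows
INSERT INTO Product (ProductID, Name, Price) VALUES (125, 'Sprocket', 4.25);  -- add a new row
UPDATE Product SET Price = 4.50 WHERE ProductID = 125;                        -- modify an existing row
DELETE FROM Product WHERE ProductID = 125;                                    -- remove a row
```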
Describe Database Objects
Relational databases contain other structures to help optimize data organization, encapsulate programmatic actions, and improve the speed of access. Three of these structures are:
Views
virtual tables based on the results of a SELECT query, like a window on specified rows in one or more underlying tables
you can query the view and filter the data in much the same way as a table
Stored Procedures
define SQL statements that can be run on command; used to encapsulate programmatic logic in a database for actions that applications need to perform when working with data
you can define a stored procedure with parameters to create a flexible solution for common actions that might need to be applied to data based on a specific key or criteria
Indexes
helps search for data in a table, like an index at the back of a book
a book index contains a sorted set of references with the pages on which each reference occurs
when creating an index in a database, you specify a column from the table and the index contains a copy of this data in a sorted order with pointers to the corresponding rows in the table
when a user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row
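a minimal sketch of all three (hypothetical objects; the stored procedure syntax shown is Transact-SQL and varies by database system):

```sql
-- view: a virtual table defined by a query
CREATE VIEW vProductPrices AS
SELECT ProductID, Name, Price FROM Product WHERE Price IS NOT NULL;

-- stored procedure: encapsulated, parameterized logic (T-SQL syntax)
CREATE PROCEDURE ChangeProductPrice @ProductID INT, @NewPrice DECIMAL(8,2)
AS
UPDATE Product SET Price = @NewPrice WHERE ProductID = @ProductID;

-- index: a sorted structure that speeds up lookups on the indexed column
CREATE INDEX idx_Product_Name ON Product (Name);
```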
Knowledge Check
Which one of the following statements is a characteristic of a relational database?
All columns in a table must be of the same data type
A row in a table represents a single instance of an entity
Rows in the same table can contain different columns
Which SQL statement is used to query tables and return data?
QUERY
READ
SELECT
What is an index?
A structure that enables queries to locate rows in a table quickly
A virtual table based on the results of a query
A predefined SQL statement that modifies data
Explore Relational Database Services in Azure
Describe Azure SQL Services and Capabilities
Azure SQL is a collective term for a family of MS SQL Server based database services in Azure, specific services include:
SQL Server on Azure Virtual Machines (VMs)
a virtual machine running in Azure with an installation of SQL Server
use of a VM makes this option an infrastructure-as-a-service (IaaS) solution that virtualizes hardware infrastructure for compute, storage, and networking in Azure; making it a great option for "lift and shift" migration of existing on-premises SQL Server installations to the cloud
enable you to use full versions of SQL Server in the Cloud without having to manage any on-premises hardware (an example of the IaaS approach)
replicates the database running on real on-premises hardware, migrating from the system running on-premises to an Azure virtual machine is no different than moving the database from one on-premises server to another
approach is suitable for migrations and applications requiring access to operating system features that might be unsupported at the PaaS level
lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes
can use this to extend existing on-premises applications to the cloud in hybrid deployments (a system where part of the operation runs on-premises, and part in the cloud; database may be part of a larger system that runs on-premises although the database elements might be hosted in the cloud)
can develop and test traditional applications; you have full administrative rights over the DBMS and operating system; a good choice when an organization already has IT resources available to maintain the VM; enables you to:
create rapid development and test scenarios when you don’t want to buy on-premises non-production SQL Server hardware
become lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes or no changes
scale up the platform on which SQL Server is running, by allocating more memory, CPU power, and disk space to the VM; you can quickly resize it without the requirement that you reinstall the software that is running on it
Business Benefits
allows you to meet unique and diverse business needs through a combination of on-premises and cloud-hosted deployments, while using the same set of server products, development tools, and expertise across these environments
it's not always easy for businesses to switch their DBMS to a fully managed service; there may be specific requirements that must be satisfied in order to migrate to a managed service, requiring changes to the database and the applications that use it. using a VM can offer a solution in these cases, but it doesn't eliminate the need to administer your DBMS as carefully as you would on-premises
Azure SQL Managed Instance
a platform-as-a-service (PaaS) option that provides near-100% compatibility with on-premises SQL Server instances while abstracting the underlying hardware and operating system
service includes automated software update management, backups, and other maintenance tasks, reducing the administrative burden of supporting a database server instance
runs a fully controllable instance of SQL Server in the cloud; you can install multiple databases on the same instance and have full control, as you would with an on-premises server
automates backups, software patching, database monitoring, and other general tasks, and you have full control over security and resource allocation for your databases
depends on other services such as Azure Storage for backups, Azure Event Hubs for telemetry, Microsoft Entra ID for authentication, Azure Key Vault for Transparent Data Encryption (TDE), and a couple of Azure platform services that provide security and supportability features; managed instances make connections to these services
all communications are encrypted and signed using certificates
to check the trustworthiness of communicating parties, managed instances constantly verify these certificates through certificate revocation lists; if certificates are revoked, the managed instance closes the connections to protect the data
Use Cases
consider this if you want to lift-and-shift an on-premises SQL Server instance and all its databases to the cloud without incurring the management overhead of running SQL Server on a VM
provides features not available in an Azure SQL Database
if system uses features such as linked servers, Service Broker (a message processing system that can be used to distribute work across servers), or Database Mail (which enables your database to send email messages to users), then you should use managed instances
Business Benefits
enables a system administrator to spend less time on administrative tasks because the service either performs them for you or greatly simplifies those tasks; automated tasks include operating system and database management system software installation and patching, dynamic instance resizing and configuration, backups, database replication (including system databases), high availability configuration, and configuration of health and performance monitoring data streams
has a near 100% compatibility with SQL Server Enterprise Edition, running on-premises
supports SQL Server database engine logins and logins integrated with Microsoft Entra ID; engine logins include a username and a password, requiring you to enter your credentials each time you connect to the server
Azure SQL Database
fully managed, highly scalable PaaS database service that is designed for the cloud
service includes the core database-level capabilities of on-premises SQL Server and is a good option when you need to create a new application in the cloud
a PaaS offering from Microsoft, create a managed database server in the cloud and then deploy the database on this server
available as:
Single Database
enables you to quickly set up and run a single SQL Server database; you create and run a database server in the cloud and access your database through this server
MS manages the server so that all you have to do is configure the database, create your tables, and populate them with your data
you can scale the database if you need more storage space, memory, or processing power
by default, resources are pre-allocated, and you’re charged per hour for the resources you’ve requested
you can also specify a serverless configuration (MS creates its own server, which might be shared by databases belonging to other Azure subscribers)
MS ensures the privacy of your database; the database automatically scales, and resources are allocated or deallocated as required
Elastic Pool
similar to single but, by default, multiple databases can share the same resources such as memory, data storage space, and processing power through multiple-tenancy
you create the pool, and only your databases can use the pool
this model is useful if you have databases with resource requirements that vary over time, and it can help you to reduce costs
Use Cases
gives you the best option for low cost with minimal administration
isn’t fully compatible with on-premises installations
often used in new cloud projects where the application design can accommodate any required changes to your applications
often used for:
modern cloud applications that need to use the latest stable SQL Server features
applications that require high availability
systems with a variable load that need the database server to scale up and down quickly
Business Benefits
automatically updates and patches the SQL Server software to ensure that you’re always running the latest and most secure version of the service
scalability features ensure that you can increase the resources available to store and process data without having to perform a costly manual upgrade
provides high availability guarantees, ensuring that your databases are available at least 99.995% of the time; supports point-in-time restore, enabling you to recover a database to the state it was in at any point in the past; databases can be replicated to different regions to provide more resiliency and disaster recovery
advanced threat protection provides advanced security capabilities, such as vulnerability assessments, to help detect and remediate potential security problems with your databases. threat protection also detects anomalous activities that indicate unusual and potentially harmful attempts to access or exploit your database; it continuously monitors your database for suspicious activities and provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns. threat detection alerts provide details of the suspicious activity and recommend actions on how to investigate and mitigate the threat
auditing tracks database events and writes them to an audit log in your Azure storage account, which can help you maintain regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that might indicate business concerns or suspected security violations
helps secure your data by providing encryption that protects data that is stored in the database (at rest) and while it is being transferred across the network (in motion)
Azure SQL Edge
a SQL engine that is optimized for Internet-of-Things (IoT) scenarios that need to work with streaming time-series data
Describe Azure Services for Open-Source Databases
Azure includes managed services for other popular relational database systems, primarily to enable organizations that use them in on-premises apps to move to Azure quickly without making significant changes to their applications:
MySQL
started as a simple-to-use open-source DBMS and is the leading open-source relational database for Linux, Apache, MySQL, and PHP (LAMP) stack apps
a PaaS implementation
includes high availability at no additional cost and scalability as required, pay for what you use, automatic backups are provided, with point-in-time restore
server provides connection security to enforce firewall rules and optionally require SSL connections; many parameters enable you to configure server settings such as lock modes, maximum number of connections and timeouts
provides a global database system that scales up to large databases without the need to manage hardware, network components, virtual servers, software patches, and other underlying components
certain operations aren't available, primarily those concerned with security and administration; these are managed by Azure itself
Benefits of Azure Database for MySQL
high availability features built-in
predictable performance
easy scaling that responds quickly to demand
secure data, both at rest and in motion
automatic backups and point-in-time restore for the last 35 days
enterprise-level security and compliance with legislation
pay-as-you-go pricing so you only pay for what you use
provides monitoring functionality to add alerts and to view metrics and logs
MariaDB
a newer DBMS, created by the original developers of MySQL, that has been rewritten and optimized to improve performance; offers compatibility with Oracle Database. a notable feature is built-in support for temporal data: a table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past
fully managed and controlled by Azure; once you’ve provisioned the service and transferred your data, the system requires almost no additional administration
Benefits of Azure Database for MariaDB
built-in high availability with no additional cost
predictable performance, using inclusive pay-as-you-go pricing
scaling as needed within seconds
secured protection of sensitive data at rest and in motion
automatic backups and point-in-time-restore for up to 35 days
enterprise-grade security and compliance
PostgreSQL
a hybrid relational-object database; you can store data in relational tables, but a PostgreSQL database also enables you to store custom data types with their own non-relational properties, and it has the ability to store and manipulate geometric data, such as lines, circles, and polygons
has its own query language called pgSQL, a variant of SQL with features that enable you to write stored procedures that run inside the database
Azure Database for PostgreSQL Flexible Server
flexible-server deployment option is a fully managed database service
provides a high level of control and server configuration customizations and provides optimization controls
Benefits of Azure Database for PostgreSQL
highly available service, contains built-in failure detection and failover mechanisms
users will be familiar with the pgAdmin tool, which can be used to manage and monitor a database; you can continue to use this tool to connect to Azure Database for PostgreSQL, but some server-focused functionality, such as performing server backup and restore, isn't available because the server is managed and maintained by Microsoft
Azure Database for PostgreSQL records information about queries run against databases on the server and saves it in a database named azure_sys; query the query_store.qs_view view to see this information and use it to monitor the queries that users are running; this information can prove invaluable if you need to fine-tune the queries performed by your applications
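for example, a quick look at the captured query statistics (the exact columns exposed by qs_view vary by service version, so this is just a sketch):

```sql
SELECT * FROM query_store.qs_view;
```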
Knowledge Check
Which deployment option offers the best compatibility when migrating an existing SQL Server on-premises solution?
Azure SQL Database (single database)
Azure SQL Database (elastic pool)
Azure SQL Managed Instance
Which of the following statements is true about Azure SQL Database?
Most database maintenance tasks are automated
You must purchase a SQL Server license
It can only support one database
Which database service is the simplest option for migrating a LAMP application to Azure?
Azure SQL Managed Instance
Azure Database for MySQL
Azure Database for PostgreSQL
Explore Azure Storage for Non-Relational Data
Explore Azure Blob Storage
Azure Blob Storage: enables you to store massive amounts of unstructured data as binary large objects (blobs) in the cloud; an efficient way to store data files in a format that is optimized for cloud-based storage, and applications can read and write them by using the Azure Blob Storage API. blobs are stored in containers
Containers: a convenient way of grouping related blobs together; you can control who can read and write the blobs inside a container at the container level; containers can help organize blobs in a hierarchy of virtual folders, similar to files in a file system on disk, but by default these folders are simply a way of using a "/" character in a blob name to organize blobs into a virtual namespace; the folders are purely virtual, and you can't perform folder-level operations to control access or perform bulk operations
Three Types of Blobs:
Block Blobs
handled as a set of blocks, each of which can vary in size up to 4000 MiB
can contain up to 50,000 blocks, giving a maximum blob size of up to 190.7 TiB (4000 MiB X 50,000 blocks)
a block is the smallest amount of data that can be read or written as an individual unit
best used to store discrete, large, binary objects that change infrequently
Page Blobs
organized as a collection of fixed-size 512-byte pages
optimized to support random read and write operations; you can fetch and store data for a single page if necessary
can hold up to 8 TB of data
Azure uses page blobs to implement virtual disk storage for virtual machines
Append Blobs
a block blob optimized to support append operations
you can only add blocks to the end of an append blob; updating or deleting existing blocks isn't supported
each block can vary in size, up to 4 MB; the maximum size of an append blob is just over 195 GB
Three Access Tiers to Help Balance Access Latency and Storage Cost:
Hot Tier
the default
use this tier for blobs that are accessed frequently
stored on high-performance media
Cool Tier
lower performance and incurs reduced storage charges compared to the Hot Tier
for data that is accessed infrequently
it's common for newly created blobs to be accessed frequently initially, but less so as time passes
can create a blob in the Hot Tier and then move it to the Cool Tier later, or vice versa
Archive Tier
lowest storage cost and increased latency
intended for historical data that shouldn't be lost but is required only rarely
blobs are effectively stored in an offline state
while the other tiers have a reading latency of a few milliseconds, here it can take hours for the blob to become available
to retrieve a blob, its access tier needs to be changed to Hot or Cool; the blob is then rehydrated, which allows you to read it
Lifecycle Management Policy: can automatically move a blob from Hot to Cool, and then to the Archive tier, as it ages and is used less frequently (the policy is based on the number of days since modification); it can also arrange to delete outdated blobs
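as a sketch, a lifecycle policy is defined as JSON rules on the storage account; the structure below follows the Azure lifecycle management schema as commonly documented, but the rule name and day thresholds are arbitrary examples:

```json
{
  "rules": [
    {
      "name": "age-based-tiering",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool":    { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete":        { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}
```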
Explore Azure Data Lake Storage Gen2
Azure Data Lake Store (Gen1): a separate service for hierarchical data storage for analytical data lakes, often used by so-called big data analytical solutions that work with structured, semi-structured, and unstructured data stored in files
Azure Data Lake Storage Gen2: a newer version of the service that is integrated into Azure Storage; enabling you to take advantage of the scalability of blob storage and the cost-control of storage tiers, combined with the hierarchical file system capabilities and compatibility with major analytics systems of Azure Data Lake Store
Systems like Hadoop in Azure HDInsight, Azure Databricks, and Azure Synapse Analytics can mount a distributed file system hosted in Azure Data Lake Store Gen2 and use it to process huge volumes of data
To create this file system, you must enable the Hierarchical Namespace option of a storage account, which can be done when initially creating the storage account, or you can upgrade an existing account to support it. However, upgrading is a one-way process; it can't be reverted to a flat namespace
Explore Azure Files
File Shares: many on-premises systems made up of a network of in-house computers make use of file shares. a file share enables you to store a file on one computer and grant access to that file to users and applications running on other computers. this strategy can work well for computers in the same local area network, but doesn't scale well as the number of users increases or if users are located at different sites
Azure Files: a way to create cloud-based network shares, like those you would typically find in on-premises organizations, to make documents and other files available to multiple users. by hosting file shares in Azure, organizations can eliminate hardware costs and maintenance overhead and benefit from high availability and scalable cloud storage for files. you create Azure file shares in a storage account, which enables you to share up to 100 TB of data in a single storage account. this data can be distributed across any number of file shares in the account. the maximum size of a single file is 1 TB, but you can set quotas to limit the size of each share below this figure. currently, Azure File Storage supports up to 2000 concurrent connections per shared file. after creating a storage account, you can upload files using the Azure portal or tools like the AzCopy utility. the Azure File Sync service can also be used to synchronize locally cached copies of shared files with the data in Azure Files. Two Performance Tiers:
Standard Tier: uses hard disk-based hardware in a datacenter
Premium Tier: uses solid-state disks and offers greater throughput at a higher charge rate
Network File Sharing Protocols
Server Message Block (SMB): commonly used across multiple operating systems (Windows, Linux, macOS)
Network File System (NFS): used by some Linux and macOS versions; to create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled
Explore Azure Tables
Azure Table Storage: a NoSQL storage solution that makes use of tables containing key/value data items; each item is represented by a row that contains columns for the data fields that need to be stored. but it's not like a table in a relational database: instead, it enables you to store semi-structured data. all rows in a table must have a unique key (composed of a partition key and a row key), and when you modify data in a table, a timestamp column records the date and time the modification was made; apart from that, the columns in each row can vary. these tables have no concept of foreign keys, relationships, stored procedures, views, or other objects you might find in a relational database. data here is usually denormalized, with each row holding the entire data for a logical entity
for fast access, Azure Table Storage splits a table into partitions
Partitioning: a mechanism for grouping related rows, based on a common property or partition key; rows that share the same partition key will be stored together.
Partitioning not only helps to organize data, but can also improve scalability and performance because:
partitions are independent from each other and can grow or shrink as rows are added/removed; a table can contain any number of partitions
when you search for data, you can include the partition key in the search criteria which helps to narrow down the volume of data to be examined and improves performance by reducing the amount of I/O (input and output operations, and read and writes) needed to locate the data
The key in an Azure Table Storage table comprises two elements:
Partition Key
identifies the partition containing the row
Row Key
unique to each row in the same partition
items in the same partition are stored in row key order
If an application adds a new row to a table, Azure ensures that the row is placed in the correct position in the table. This scheme enables an application to quickly perform point queries that identify a single row, and range queries that fetch a contiguous block of rows in a partition
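for example, a hypothetical Customers table might hold rows like these; note the two-part key and the differing columns between rows:

```text
PartitionKey | RowKey | Name        | Email                 | Phone
-------------+--------+-------------+-----------------------+---------
Sales        | 001    | Joe Jones   | joe@contoso.example   |
Sales        | 002    | Samir Nadoy | samir@contoso.example | 555-0101
Marketing    | 001    | Lin Yao     |                       | 555-0102
```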
Knowledge Check
What are the elements of an Azure Table storage key?
Table name and column name
Partition key and row key
Row number
What should you do to an existing Azure Storage account in order to support a data lake for Azure Synapse Analytics?
Add an Azure Files share
Create Azure Storage tables for the data you want to analyze
Upgrade the account to enable hierarchical namespace and create a blob container
Why might you use Azure File storage?
To share files that are stored on-premises with users located at other sites
To enable users at different sites to share files
To store large binary data files containing images or other unstructured data
Explore Fundamentals of Azure Cosmos DB
Describe Azure Cosmos DB
Azure Cosmos DB: supports multiple application programming interfaces (APIs) that enable developers to use the programming semantics of many common kinds of data store to work with data in a Cosmos DB database; the internal data structure is abstracted, enabling developers to use Cosmos DB to store and query data using APIs with which they're already familiar. uses indexing and partitioning to provide fast read and write performance and can scale to massive volumes of data. can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that globally distributed users can each work with data in their local replica
Cosmos DB is a highly scalable DBMS and automatically allocates space in containers for your partitions, where they can grow up to 10 GB in size. Indexes are created and maintained automatically with virtually no administrative overhead.
Cosmos DB is a foundational service in Azure and has been used by many of MS's products for mission-critical applications at global scale, including Skype, Xbox, MS 365, and many others. Cosmos DB is highly suitable for the following scenarios:
IoT and Telematics
typically ingest large amounts of data in frequent bursts of activity
can accept and store this information quickly and then be used by analytics services such as Azure Machine Learning, Azure HDInsight, and Power BI
you can also process the data in real-time using Azure Functions that are triggered as data arrives in the database
Retail and Marketing
MS uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live
can also be used in the retail industry for storing catalog data and for event sourcing in order processing pipelines
Gaming
a crucial component of gaming applications
modern games perform graphical processing on mobile/console clients but rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards
games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience
game databases need to be fast and able to handle massive spikes in request rates during new game launches and feature updates
Web and Mobile Applications
commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and building rich personalized experiences
Cosmos DB SDKs can be used to build rich iOS and Android applications using the popular Xamarin framework
Identify Azure Cosmos DB APIs
Azure Cosmos DB is a fully managed and serverless database for applications of any size or scale, with support for both relational and non-relational workloads, allowing developers to build and migrate applications fast using their preferred open-source database engines (i.e., PostgreSQL, MongoDB, and Apache Cassandra). when provisioning a new Cosmos DB instance, you select the database engine to use; the choice of engine depends on many factors, including the type of data to be stored, the need to support existing applications, and the skills of the developers who will work with the data store
Azure Cosmos DB for NoSQL
MS’s native non-relational service for working with the document data model
manages data in JSON document format
despite being NoSQL, uses SQL syntax to work with the data
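for example, a hypothetical query over JSON customer documents in a container (aliased as c) uses familiar SQL syntax with dotted paths into the document:

```sql
SELECT c.id, c.name, c.address.city
FROM c
WHERE c.address.city = 'Seattle'
```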
Azure Cosmos DB for MongoDB
popular open source database in which data is stored in a Binary JSON (BSON) format
enables developers to use MongoDB client libraries and code to work with data in Azure Cosmos DB
MongoDB Query Language (MQL) uses a compact, object-oriented syntax in which developers use objects to call methods
Azure Cosmos DB for PostgreSQL
a native PostgreSQL, globally distributed relational database that automatically shards data to help you build highly scalable apps
can start building apps on a single node server group, the same way you would with PostgreSQL anywhere else
as app’s scalability and performance requirements grow, you can seamlessly scale to multiple nodes by transparently distributing your tables
an RDBMS in which you define relational tables of data
Azure Cosmos DB for Table
used to work with data in key-value tables, similar to Azure Table Storage
offers greater scalability and performance than Azure Table Storage
you can use the Table API through one of the language-specific SDKs to make calls to your service endpoint to retrieve data from the table
Azure Cosmos DB for Apache Cassandra
compatible with Apache Cassandra, a popular open source database that uses a column-family storage structure
data is stored in tables, similar to those in a relational database, with the exception that it's not mandatory for each row to have the same columns
syntax based on SQL
Azure Cosmos DB for Apache Gremlin
used with data in a graph structure, in which entities are defined as vertices that form nodes in a connected graph
nodes are connected by edges that represent relationships
syntax includes functions to operate on vertices and edges, enabling you to insert, update, delete and query data in the graph
Knowledge Check
Which API should you use to store and query JSON documents in Azure Cosmos DB?
Azure Cosmos DB for NoSQL
Azure Cosmos DB for Apache Cassandra
Azure Cosmos DB for Table
Which Azure Cosmos DB API should you use to work with data in which entities and their relationships to one another are represented in a graph using vertices and edges?
Azure Cosmos DB for MongoDB
Azure Cosmos DB for NoSQL
Azure Cosmos DB for Apache Gremlin
How can you enable globally distributed users to work with their own local replica of a Cosmos DB database?
Create an Azure Cosmos DB account in each region where you have users
Use the API for Table to copy data to Azure Table Storage in each region where you have users
Enable multi-region writes and add the regions where you have users
Explore Fundamentals of Large-Scale Analytics
Describe Data Warehousing Architecture
Large-scale data analytics architecture can vary, as can the specific technologies used to implement it; but in general, it includes the following elements:
Data Ingestion and Processing
data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse
load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis
ETL Process: data is transformed before being loaded into an analytical store
ELT Process: data is copied to the store and then transformed
either way the resulting data structure is optimized for analytical queries
data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters
data ingestion includes both batch processing of static data and real-time processing of streaming data
Analytical Data Store
data stores for large-scale analytics include relational data warehouses, file-system-based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases)
Analytical Data Model
while data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce reports, dashboards, and interactive visualizations
often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions
model encapsulates the relationships between data values and dimensional entities to support “drill-up/drill-down” analysis (see the aggregation sketch at the end of this section)
Data Visualization
data analysts consume data from analytical models and directly from analytical stores to create reports, dashboards, and other visualizations
users in an organization who may not be technology professionals might perform self-service data analysis and reporting
the visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually
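A small sketch of the drill-up/drill-down idea using pandas; the sales figures and dimension values are invented purely for illustration:

```python
# Minimal sketch: numeric measures aggregated across dimensions at different
# levels, which is the idea behind a cube-style analytical model.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["EU", "US", "EU", "US", "EU", "US"],
    "revenue": [100, 150, 120, 160, 130, 170],
})

# Drill-up: aggregate the measure to the year level
print(sales.groupby("year")["revenue"].sum())

# Drill-down: break the same measure out by year, quarter, and region
print(sales.groupby(["year", "quarter", "region"])["revenue"].sum())
```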
Explore Data Ingestion Pipelines
Large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes. Pipelines can be created and run using Azure Data Factory. Azure Synapse Analytics or Microsoft Fabric can also be used if you want to manage all of the components of your data analytics solution in a unified workspace. Either way, pipelines consist of one or more activities that operate on data. An input dataset provides the source data, and activities can be defined as a data flow that incrementally manipulates the data until an output dataset is produced. Pipelines can connect to external data sources to integrate with a wide variety of data services.
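A conceptual sketch of the pipeline idea in Python (this mirrors the concept, not Data Factory's actual API): activities incrementally manipulate an input dataset until an output dataset is produced:

```python
# Conceptual sketch only: a pipeline as an ordered chain of activities that
# incrementally manipulate an input dataset until an output dataset results.
def drop_invalid_rows(rows):
    return [r for r in rows if r.get("amount") is not None]

def add_tax(rows, rate=0.2):
    return [{**r, "total": r["amount"] * (1 + rate)} for r in rows]

def run_pipeline(input_dataset, activities):
    data = input_dataset
    for activity in activities:   # each activity operates on the data flow
        data = activity(data)
    return data                   # the output dataset

output = run_pipeline(
    [{"amount": 10.0}, {"amount": None}, {"amount": 5.0}],
    [drop_invalid_rows, add_tax],
)
print(output)
```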
Explore Analytical Data Stores
Two Common Types of Analytical Data Stores:
Data Warehouses
relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads
data from a transactional store is transformed into a schema in which numeric values are stored in central fact tables, which are related to one or more dimension tables that represent the entities by which the data can be aggregated; this design is referred to as a star schema
a star schema can be extended into a snowflake schema by adding additional tables related to the dimension tables to represent dimensional hierarchies
this is a great choice when you have transactional data that can be organized into a structured schema of tables, and you want to use SQL to query them (see the star-schema sketch below)
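A minimal star-schema illustration using an in-memory SQLite database (not Azure-specific); the fact and dimension tables and their columns are invented for the example:

```python
# Tiny star-schema sketch: a fact table of numeric measures joined to a
# dimension table and aggregated by a dimension attribute.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DimProduct (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE FactSales  (product_key INTEGER, sales_amount REAL);
    INSERT INTO DimProduct VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO FactSales  VALUES (1, 250.0), (1, 300.0), (2, 45.0);
""")

# Aggregate the measure in the fact table by a dimension attribute
for row in con.execute("""
    SELECT d.category, SUM(f.sales_amount) AS total_sales
    FROM FactSales AS f
    JOIN DimProduct AS d ON f.product_key = d.product_key
    GROUP BY d.category
"""):
    print(row)
```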
Data Lakehouses
a file store, usually on a distributed file system for high performance data access
technologies like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and analytics; these systems often apply a schema-on-read approach to define tabular schemas on semi-structured data files at the point where the data is read for analysis, without applying constraints when it’s stored (see the schema-on-read sketch after this list)
data lakes are great for supporting a mix of structured, semi-structured, and even unstructured data that you want to analyze without the need for schema enforcement when the data is written to the store
raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL
SQL pools in Azure Synapse Analytics include PolyBase which enables you to define external tables based on files in a data lake (and other sources) and query them using SQL
Synapse Analytics also supports this approach in which you can use database templates to define the relational schema of your data warehouse while storing the underlying data in data lake storage, separating the storage and compute for your data warehousing solution
this is a relatively new approach in Spark-based systems, enabled through technologies like Delta Lake (which adds relational storage capabilities to Spark, so you can define tables that enforce schemas and transactional consistency, support batch-loaded and streaming data sources, and provide a SQL API for querying)
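A hedged PySpark sketch of the schema-on-read / lakehouse idea: a tabular schema is applied to raw files at read time and the result is exposed as a table for SQL queries; the data lake path and column names are placeholders, and storage credentials are assumed to be configured:

```python
# Hedged sketch: apply a schema to raw CSV files in a data lake at read time
# (schema-on-read) and query them as a table. Path and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# The schema is applied when the files are read, not when they were written;
# the abfss path assumes access to the storage account has been configured.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("abfss://data@mydatalake.dfs.core.windows.net/raw/orders/"))

# Expose the files as a queryable table for SQL-based analysis
orders.createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id").show()
```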
Explore Platform-as-a-Service (PaaS) Solutions
On Azure, there are three main PaaS services that you can use to implement a large-scale analytical store:
Azure Synapse Analytics
a unified, end-to-end solution for large scale data analytics
brings together multiple technologies and capabilities, enabling you to combine the data integrity and reliability of a scalable, high-performance SQL Server-based relational data warehouse with the flexibility of a data lake and of open-source Apache Spark
also includes native support for log and telemetry analytics with Azure Synapse Data Explorer pools, as well as built-in data pipelines for data ingestion and transformation
all these services can be managed through a single, interactive user interface called Azure Synapse Studio, which includes the ability to create interactive notebooks in which Spark code and markdown content can be combined
a great choice for when you want to create a single, unified analytics solution on Azure
Azure Databricks
Azure implementation of the popular Databricks platform
a comprehensive data analytics solution built on Apache Spark that offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science
provides an interactive user interface through which the system can be managed and data can be explored in interactive notebooks
due to its common use on multiple cloud platforms, you might want to consider using this as your analytical store if you want to use existing expertise with the platform, or if you need to operate in a multi-cloud environment or support a cloud-portable solution
Azure HDInsight
supports multiple open-source data analytics cluster types
not as user-friendly as Azure Synapse Analytics and Azure Databricks but is a suitable option if analytics rely on multiple open-source frameworks or if you need to migrate an existing on-premises Hadoop-based solution to the cloud
Explore Microsoft Fabric
Scalable analytics with PaaS can be complex, fragmented, and expensive, but with MS Fabric you don’t have to spend all of your time combining various services and implementing interfaces through which business users can access them. It’s a single product that is easy to understand, set up, create, and manage. Fabric is a unified Software-as-a-Service (SaaS) offering with all your data stored in a single open format in OneLake.
OneLake is Fabric’s lake-centric architecture that provides a single, integrated environment for data professionals and the business to collaborate on data projects. Similar to OneDrive, but for data, OneLake combines storage locations across different regions and clouds into a single logical lake without moving or duplicating data. Data can be stored in any file format in OneLake and can be structured or unstructured. For tabular data, the analytics engines in Fabric write data in Delta format when writing to OneLake. All engines know how to read this format and treat Delta files as tables no matter which engine wrote them
Knowledge Check
Which Azure PaaS services can you use to create a pipeline for data ingestion and processing?
Azure SQL Database and Azure Cosmos DB
Azure Synapse Analytics and Azure Data Factory
Azure HDInsight and Azure Databricks
What must you define to implement a pipeline that reads data from Azure Blob Storage?
A linked service for your Azure Blob Storage account
A dedicated SQL pool in your Azure Synapse Analytics workspace
An Azure HDInsight cluster in your subscription
Which open-source distributed processing engine does Azure Synapse Analytics include?
Apache Hadoop
Apache Spark
Apache Storm
Explore Fundamentals of Real-Time Analytics
Understand Batch and Stream Processing
Data processing is simply the conversion of raw data to meaningful information through a process. Two general ways to process data:
Batch Processing
multiple data records are collected and stored before being processed together in a single operation
newly arriving data elements are collected and stored and the whole group is processed together as a batch
exactly when each group is processed can be determined in many ways
advantages of batch processing:
large volumes of data can be processed at a convenient time
can be scheduled to run at a time when computers or systems might be otherwise idle, such as overnight or during off-peak hours
disadvantages of batch processing:
time delay between ingesting the data and getting the results
all of a batch job’s data must be ready before the batch can be processed, which means the data must be carefully checked
problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt, so the input data must be carefully checked before the job can be run again
even minor data errors can prevent a batch job from running
Stream Processing
a source of data is constantly monitored and processed in real time as new data events occur
each new piece of data is processed when it arrives, unlike batch processing where there’s a wait until the next batch processing interval
data is processed as individual units in real-time
beneficial for scenarios where new, dynamic data is generated on a continual basis
ideal for time-critical operations that require an instant real-time response
Differences Between Batch and Streaming Data
Data Scope
batch: processes all the data in the dataset
stream: typically only has access to the most recent data received or within a rolling time window
Data Size
batch: suitable for handling large datasets efficiently
stream: intended for individual records or micro-batches consisting of a few records
Performance (latency is the time taken for the data to be received and processed)
batch: latency is typically a few hours
stream: typically occurs immediately, with latency in the order of seconds or milliseconds
Analysis
batch: typically used to perform complex analytics
stream: typically used for simple response functions, aggregates, or calculations such as rolling averages (see the rolling-average sketch after this list)
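A tiny Python illustration of the rolling-average idea: each event is processed as it arrives, using only the most recent values in a window rather than waiting for a complete batch; the event values are invented:

```python
# Streaming-style calculation sketch: a rolling average updated per event,
# holding only the most recent values (a window) instead of a whole batch.
from collections import deque

def rolling_average(events, window_size=3):
    window = deque(maxlen=window_size)   # only the most recent data is kept
    for value in events:                 # each event is processed on arrival
        window.append(value)
        yield sum(window) / len(window)

for avg in rolling_average([10, 12, 11, 30, 28]):
    print(round(avg, 2))
```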
Many large-scale analytics solutions include a mix of batch and stream processing, enabling both historical and real-time data analysis. It’s common for stream processing solutions to capture real-time data, process it by filtering and aggregating it and present it through real-time dashboards and visualizations while also persisting the processed results in a data store for historical analysis alongside batch processed data. Even when real-time analysis or visualization of data is not required, streaming technologies are often used to capture real-time data and store it in a data store for subsequent batch processing
data events from a streaming data source are captured in real-time
data from other sources is ingested into a data store (often a data lake) for batch processing
if real-time analytics is not required, the captured streaming data is written to the data store for subsequent batch processing
when real-time analytics is required, a stream processing technology is used to prepare the streaming data for real-time analysis or visualization; often by filtering or aggregating the data over temporal windows
the non-streaming data is periodically batch processed to prepare it for analysis and the results are persisted in an analytical data store (often referred to as a data warehouse) for historical analysis
the results of stream processing may also be persisted in the analytical data store to support historical analysis
analytical and visualization tools are used to present and explore the real-time and historical data
Explore Common Elements of Stream Processing Architecture
There are many technologies that you can use to implement a stream processing solution, but while specific implementation details may vary, there are common elements to most streaming architectures. At the simplest form, high-level architecture for stream processing looks like:
an event generates some data; this might be a signal being emitted by a sensor, a social media message being posted, a log file entry being written, or any other occurrence that results in some digital data
generated data is captured in a streaming source for processing
event data is processed, often by a perpetual query that operates on the event data to select data for specific types of events, project data values, or aggregate data values over temporal (time-based) periods (or windows)
results of the stream processing operation are written to an output (or sink) which may be a file, a database table, a real-time visual dashboard, or another queue for further processing by a subsequent downstream query
MS Azure supports multiple technologies that can be used to implement real-time analytics of streaming data including:
Azure Stream Analytics: a PaaS solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query and write the results to an output
Spark Structured Streaming: open-source library that enables you to develop complex streaming solutions on Apache Spark based services (including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight)
Azure Data Explorer: high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace
Services that are commonly used to ingest data for stream processing on Azure include:
Azure Event Hubs: a data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once (see the producer sketch after this list)
Azure IoT Hub: a data ingestion service that is similar to Azure Event Hubs but is optimized for managing event data from Internet-of-Things (IoT) devices
Azure Data Lake Store Gen2: a highly scalable storage service that is often used in batch processing scenarios but can also be used as a source of streaming data
Apache Kafka: an open-source data ingestion solution that is commonly used together with Apache Spark (Azure HDInsight can be used to create a Kafka cluster)
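A hedged sketch of sending event data to Azure Event Hubs for downstream stream processing, using the azure-eventhub Python SDK; the connection string, hub name, and payload are placeholders:

```python
# Hedged sketch: send a small batch of event data to an event hub.
# The connection string, hub name, and payload are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="device-telemetry",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "sensor-1", "temperature": 21.5})))
    producer.send_batch(batch)
```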
The output from stream processing is often sent to the following services:
Azure Event Hubs: used to queue the processed data for further downstream process
Azure Data Lake Store Gen2 or Azure Blob Storage: used to persist the processed results as a file
Azure SQL Database, Azure Synapse Analytics, or Azure Databricks: used to persist the processed results in a database table for querying and analysis
Microsoft Power BI: used to generate real-time data visualizations in reports and dashboards
Explore Azure Stream Analytics
Azure Stream Analytics: a service for complex event processing and analysis of streaming data. This is used to:
ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container
process the data by using a query to select, project, and aggregate data values
write the results to an output, such as Azure Data Lake Gen2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure Event Hub, Microsoft Power BI, or others
Once started, this query runs perpetually, processing new data as it arrives in the input and storing results in the output. This is a great technology choice when you need to continually capture data from a streaming source, filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting
The easiest way to use this is to create a Stream Analytics job in an Azure subscription, configure its input(s) and output(s), and define the query that the job will use to process the data. The query is expressed using SQL syntax and can incorporate static reference data from multiple data sources to supply lookup values that can be combined with the streaming data ingested from an input (see the example query below). If your stream processing requirements are complex or resource-intensive, you can create a Stream Analytics cluster, which uses the same underlying processing engine as a Stream Analytics job, but in a dedicated tenant (so your processing isn’t affected by other customers) and with configurable scalability that enables you to define the right balance of throughput and cost for your specific scenario.
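A sketch of what a Stream Analytics query can look like (SQL-like syntax with a tumbling window); the input/output aliases and field names are placeholders, and the query is held in a Python string here only to keep the examples in one language; in practice it is defined directly in the job:

```python
# Illustrative Stream Analytics query text (placeholders for input/output
# aliases and fields); in practice this is pasted into the job definition.
STREAM_ANALYTICS_QUERY = """
SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    System.Timestamp() AS windowEnd
INTO [sqldb-output]
FROM [iothub-input] TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(second, 60)
"""
print(STREAM_ANALYTICS_QUERY)
```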
Explore Apache Spark on Microsoft Azure
Apache Spark is a distributed processing framework for large-scale data analytics that can be used in the following services:
Azure Synapse Analytics
Azure Databricks
Azure HDInsight
Spark can be used to run code (usually written in Python, Scala, or Java) in parallel across multiple cluster nodes, enabling it to process very large volumes of data efficiently. It can be used for both batch and stream processing.
Spark Structured Streaming
provides an API for ingesting, processing, and outputting results from perpetual streams of data
built on a ubiquitous structure in Spark called a dataframe, which encapsulates a table of data
can use this to read data from a real-time data source like a Kafka hub, a file store, or a network port into a “boundless” dataframe that is continually populated with new data from the stream; a query is then defined on the dataframe that selects, projects, or aggregates the data, often in temporal windows (see the streaming sketch after this list)
results of the query generate another dataframe, which can be persisted for analysis or further processing
a great choice for real-time analytics when you need to incorporate streaming data into a Spark based data lake or analytical data store
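A hedged Spark Structured Streaming sketch: read a perpetual stream into a boundless dataframe, aggregate it over temporal windows, and write the results out; the Kafka broker, topic, and schema are assumptions for the example, and the Kafka connector package is assumed to be available on the cluster:

```python
# Hedged sketch: windowed aggregation over a streaming source.
# Broker, topic, and schema are placeholders; the spark-sql-kafka connector
# is assumed to be installed on the Spark runtime.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg, col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = (StructType()
          .add("deviceId", StringType())
          .add("temperature", DoubleType())
          .add("eventTime", TimestampType()))

# Boundless dataframe continually populated from the stream
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "device-telemetry")
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Average temperature per device over 1-minute tumbling windows
averages = (events
            .withWatermark("eventTime", "5 minutes")
            .groupBy(window("eventTime", "1 minute"), "deviceId")
            .agg(avg("temperature").alias("avgTemperature")))

# Write results to the console sink for illustration; runs until stopped
query = averages.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```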
Delta Lake
an open-source storage layer that adds support for transactional consistency, schema enforcement, and other common data warehousing features to data lake storage
also unifies storage for streaming and batch data, and can be used in Spark to define relational tables for both batch and stream processing
a Delta Lake table can be used as a streaming source for queries against real-time data, or as a sink to which a stream of data is written
Spark runtimes in Azure Synapse Analytics and Azure Databricks include support for Delta Lake
Spark Structured Streaming and Delta Lake combined are a good solution when you need to abstract batch- and stream-processed data in a data lake behind a relational schema for SQL-based querying and analysis (see the sketch below)
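A hedged Delta Lake sketch, assuming a Spark runtime with Delta Lake support (as in Azure Synapse Analytics or Azure Databricks); the paths and table name are placeholders:

```python
# Hedged sketch: a Delta table shared by batch SQL queries and streaming reads.
# Assumes a Delta-enabled Spark runtime; paths and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Batch-load some sample data into a Delta table (schema is enforced on writes)
readings = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-2", 19.8)], ["deviceId", "temperature"])
readings.write.format("delta").mode("overwrite").save("/delta/readings")

# Register the Delta data as a table and query it with SQL
spark.sql("CREATE TABLE IF NOT EXISTS readings USING DELTA LOCATION '/delta/readings'")
spark.sql("SELECT deviceId, AVG(temperature) FROM readings GROUP BY deviceId").show()

# The same Delta table can also serve as a streaming source, so batch and
# stream processing share one storage layer
stream = spark.readStream.format("delta").load("/delta/readings")
```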
Explore Real-Time Analytics in Microsoft Fabric
Microsoft Fabric includes native support for real-time data analytics, including real-time data ingestion from multiple streaming sources. You can use an event stream to capture real-time event data from a streaming source and persist it in a destination such as a table in a Lakehouse or KQL database. When writing event stream data to a Lakehouse table, you can apply aggregations and filters to summarize the captured data. KQL databases support tables based on the Data Explorer engine, enabling you to perform real-time analytics on the data in tables by running KQL queries. After capturing real-time data in a table, you can use Power BI in Microsoft Fabric to create real-time data visualizations
Knowledge Check
Which definition of stream processing is correct?
Data is processed continually as new data records arrive
Data is collected in a temporary store, and all records are processed together as a batch
Data is incomplete and cannot be analyzed
Which service would you use to continually capture data from an IoT Hub, aggregate it over temporal periods, and store results in Azure SQL Database?
Azure Cosmos DB
Azure Stream Analytics
Azure Storage
Explore Fundamentals of Data Visualization
Describe Power BI Tools and Workflow
Microsoft Power BI: a suite of tools and services that data analysts can use to build interactive data visualizations for business users to consume
typical workflow for creating a data visualization solution starts with Power BI Desktop (a Microsoft Windows application where you can import data from a wide range of data sources, combine and organize the data from these sources in an analytics data model, and create reports that contain interactive visualizations of the data)
after creating data models and reports, they can be published to the Power BI service (a cloud service in which reports can be published and interacted with by business users)
basic data modeling and report editing can be directly done in the service using a web browser, but the functionality for this is limited compared to the desktop tool
the service can also be used to schedule refreshes of the data sources on which your reports are based and to share reports with other users
can define dashboards and apps that combine related reports in a single, easy to consume location
users can consume reports, dashboards, and apps in the Power BI service through a web browser or on mobile devices using the Power BI phone app
Describe Core Concepts of Data Modeling
Dimension Tables: represent entities by which you want to aggregate numeric measures; each entity is represented by a row with a unique key value and the remaining columns represent attributes of an entity. common in most analytical models to include a time dimension to aggregate numeric measures associated with events over time
Fact Tables: store numeric measures that will be aggregated by various dimensions in the model; each row represents a recorded event that has numeric measures associated with it
Star Schema: where a fact table is related to one or more dimension tables; a more complex schema can be defined in which dimensional tables are related to additional tables containing more details
Hierarchies enable you to drill up or drill down to find aggregated values at different levels, which is worth considering when designing analytical models. Models can be built with pre-aggregated values for each level of a hierarchy, enabling you to quickly change the scope of your analysis
Power BI can be used to define an analytical model from tables of data, which can be imported from one or more data sources. Then you can use the data modeling interface on the Model tab of Power BI Desktop to define your analytical model by creating relationships between fact and dimension tables, defining hierarchies, setting data types and display formats for fields in the tables, and managing other properties of your data that help define a rich model for analysis
Describe Considerations for Data Visualization
After a model is created, it can be used to generate data visualizations that can be included in a report. There are many different kinds of data visualizations, some of which include:
Tables and Text
simplest way to communicate
useful when numerous related values must be displayed, and individual text values in cards can be a useful way to show important figures or metrics
Bar and Column Charts
good to visually compare numeric values for discrete categories
Line Charts
compare categorized values and are useful when you need to examine trends, often over time
Pie Charts
visually compare categorized values as proportions of a total
Scatter Plots
compare two numeric measures and identify a relationship or correlation between them
Maps
visually compare values for different geographic areas or locations
Interactive Reports in Power BI
visual elements for related data in a report are automatically linked to one another and provide interactivity
Knowledge Check
Which tool should you use to import data from multiple data sources and create a report?
Power BI Desktop
Power BI Phone App
Azure Data Factory
What should you define in your data model to enable drill-up/down analysis?
A measure
A hierarchy
A relationship
Which kind of visualization should you use to analyze pass rates for multiple exams over time?
A pie chart
A scatter plot
A line chart
Practice Assessment
Notes:
Azure Files is used to share files by using NFS and SMB
Data Lake Storage Gen2 is used for storing huge amounts of data to be processed and not to be shared among virtual machines
Page Blobs are used for VHDs
Table Storage is used for two dimensional tables
Archive tier is for data that will be used about once per year and for which access times of more than an hour are acceptable
Azure Data Lake Storage and Azure Blob Storage can both be mounted in Azure Synapse Analytics and be used to process large volumes of data
Azure Files and Table Storage cannot be mounted in Azure Synapse Analytics
in Azure Table Storage, the RowKey is unique in each partition (not table)
Append Blobs allow you to frequently add new data to a file but doesn’t allow for modification or deletion of existing data
Apache Cassandra is for tabular data in a column-family storage structure and is queried by using a SQL-based syntax
Gremlin API is for graph databases
MongoDB API stores data in the Binary JSON (BSON) format
Table is used to retrieve key-value pairs
Azure Cosmos DB allows you to aggregate data stored in JSON files for use in analytical reports without additional development effort
Azure SQL Database doesn’t store data in JSON files
Blob Storage and Data Lake Storage don’t allow you to aggregate data for analytical reports without additional development effort
Data Warehouse uses fact and dimension tables in a star/snowflake schema
Relational Databases don’t use fact and dimension tables
Data Factory and Synapse Analytics allow you to create a pipeline in response to an event
Databricks and HDInsight run data processing tasks
Databricks and the Spark pool run data processing for large amounts of data by using Scala
a Linked Service must be created for an integration process that copies data from MS Excel files to Parquet files by using Azure Data Factory
Pipelines use existing linked services to load and process data
Datasets are the input and output
Activities can be defined as the data flow
Azure Stream Analytics handles stream processing from Kafka to Data Lake
Azure Synapse Analytics doesn’t process streaming data
Azure Cosmos DB and Data Factory don’t handle streams
Data Lake Storage Gen2 can be used to store files in stream processing architecture
Relational Database is the best option for CRUD operations and uses the least amount of storage space