1/147
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced | Call with Kai |
|---|
No analytics yet
Send a link to your students to track their progress
What is Cloud Computing?
The practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using your own local computer.
What is "the cloud" (simple definition)?
Essentially someone else's computer(s) connected to the internet, usually a company charging you to use their servers. There's no single cloud; many companies each provide cloud services on their own infrastructure.
Why host on the cloud instead of your local machine?
You avoid buying and maintaining your own hardware, you only pay for what you use (cheaper and faster than running your own IT room), and you get scalability and availability you couldn't easily achieve locally.
What is AWS?
Amazon Web Services, a collection of hundreds of cloud computing services that provide the building blocks of IT infrastructure for individuals or companies to rent and use for a fee. It is a cloud provider.
AWS vs GCP vs Azure (quick comparison)?
All three are major cloud providers that do the same fundamental thing: sell cloud resources. AWS (Amazon) leads at ~31% market share, Azure (Microsoft) is ~20%, and GCP (Google, ~12%) is the one we use for BigQuery.
What do cloud providers do for us, and is it free?
They provide on-demand infrastructure and services so you don't build everything from scratch. It is not free in general (you pay for what you use), though many offer limited free-tier resources.
Why does Ben say cloud providers are "filling a niche in our field"?
A small business or solo developer doesn't need to buy hardware and build everything just to host one app. Providers supply those assets to rent, so it's a smaller investment for us and the provider makes money off us.
Name three benefits of using a cloud provider like AWS.
Pay only for what you use (cheaper/faster than your own IT room), highly scalable (easily upsize or downsize), and highly available (resources in locations worldwide, via availability zones).
What are the 3 Cloud Computing service models?
Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). They're defined by how much of the stack the provider manages versus how much you handle yourself.
What is IaaS? Give an example.
Infrastructure as a Service: provides infrastructure-level tech like virtual machines, storage, firewalls (security groups), load balancers, and VLANs. Example: Amazon EC2, a raw virtual server you configure and maintain yourself.
What is PaaS? Give examples.
Platform as a Service: provides the cloud infrastructure to run software you've written, with the provider managing servers and databases. Examples: AWS RDS (managed DB server), AWS Elastic Beanstalk (deploys your app and handles the EC2s/infra), AWS Lambda (serverless).
What is SaaS? Give examples.
Software as a Service: a complete, ready-to-use cloud application delivered to the user; one instance that many users consume, with no configuration needed. Examples: the apps on your phone, Squarespace, even the AWS.com console itself.
The car analogy for IaaS / PaaS / SaaS?
IaaS = leasing a car (you handle gas, insurance, maintenance). PaaS = renting a car (gassed up and insured, you just drive it). SaaS = using a rideshare app (you manage nothing, you just get the service).
How does price differ across IaaS, PaaS, and SaaS?
Price increases from IaaS to PaaS to SaaS (more managed = more expensive). Example: hosting a site on S3 (IaaS) is ~2 cents/month, while Squarespace (SaaS) starts ~$16/month.
What are the 3 cloud privacy/deployment models?
Public, Private, and Hybrid. They describe whether your infrastructure is managed by a provider, by yourself, or a mix.
What is the Public cloud model? Use case?
Cloud computing run by a third-party provider (e.g., AWS). Use case: startups, solo devs, or any org wanting scalable infrastructure without owning hardware; most web apps and SaaS products are public.
What is the Private cloud model? Use case?
Cloud services and infrastructure hosted privately within a company's own intranet; the company owns and maintains the hardware with no third party. Use case: enterprises or government agencies with strict security, compliance, or regulatory requirements.
What is the Hybrid cloud model? Use case?
Applications run across a combination of public and private environments. Use case: orgs keeping sensitive data on private infrastructure but offloading everything else to the public cloud, like hospitals storing patient records on-prem but using the cloud for everything else.
What is Big Data, and why is it called that?
Any dataset so massive and complex that it can't be efficiently stored or processed by traditional means like Excel or a relational SQL database. "Big" refers to that scale and complexity.
What do Big Data workflows do better than traditional SQL workflows?
They stay operationally efficient on enormous datasets and can handle various data types (unstructured and semi-structured), not just the structured, fixed-schema data of a SQL table.
Common Big Data use cases beyond storage?
Data analytics, data cleansing, forecasting, reporting, and machine learning modeling.
What are the Five Vs of Big Data?
Volume, Variety, and Velocity (the original three), plus Veracity and Value (the newer two that reflect maturing industry priorities).
Five Vs: Volume?
The amount of data. We process "big" amounts, in the petabyte range or an ever-growing dataset of unknown size.
Five Vs: Variety?
The type of data. Big data goes beyond traditional SQL types to include unstructured data (like video) and semi-structured data (like JSON).
Five Vs: Velocity?
The speed at which data is received and processed. Efficiency is critical because huge datasets lose value if they take too long to process.
Five Vs: Veracity?
How reliably accurate the data is. You need high quality and integrity, or you can't trust your analysis, conclusions, and resulting business decisions.
Five Vs: Value?
Data has intrinsic value to businesses. Data with actionable insights (location, age, purchase history) is worth far more than simple record-keeping.
The "SELECT * on 200 million records in 30 seconds" demo, what's the point?
It illustrates Big Data's velocity/processing speed: a query that would choke a traditional database returns almost instantly on a Big Data platform like BigQuery.
What is "data," and what are common sources of it?
Any recordable, storable piece of information. Common sources: transaction records, activity logs, images, emails, databases, social media feeds, and form submissions.
The typical components of a Big Data architecture?
Data Sources, then Data Ingestion (Consolidation) Tools, then Data Streaming/Processing Tools, then Data Storage Tools, then Analysis & Visualization Tools.
Big Data component: Data Sources?
Where the raw data comes from, such as databases, social media feeds, or form submissions.
Big Data component: Data Ingestion / Consolidation Tools?
Tools that pull data from the sources and funnel it into your Big Data tool of choice. Example: Apache Kafka.
Big Data component: Data Streaming / Processing Tools?
Tools that actually process the data, either in real time (streaming) or in large batches (batch processing); where transformation and cleaning often happen. Examples: Apache Spark, Apache Flink.
Big Data component: Data Storage Tools?
Where the processed data lives, ready to be used, such as a data lake or data warehouse. Examples: Google BigQuery, Amazon S3.
Big Data component: Analysis and Visualization Tools?
Tools to make sense of the data and communicate findings, where the payoff happens. Examples: BigQuery for querying, Matplotlib or Tableau for visualization, NumPy/Pandas for programmatic analysis.
Key benefits of Big Data?
Scalable analytics (scale compute and storage independently), handling diverse data, high processing speeds, and support for machine learning and business intelligence. (Be sure to hit scalability and speed.)
Key challenges of Big Data?
Complexity and a steep learning curve, cost (more power means more expensive), data quality (dirty data must be cleaned), and sheer-scale issues like compliance, security, latency, and migrations. (Be sure to mention complexity and cost.)
What are the steps of the Data Lifecycle?
Create/Capture, Store, Process, Analyze, Archive/Retain, and Delete.
Data Lifecycle: Create/Capture?
Data is received, either created on the spot or pulled from an external source like logs, transaction records, or form inputs.
Data Lifecycle: Store?
Data gets saved to a storage solution such as a database, data lake, or data warehouse.
Data Lifecycle: Process?
The data is cleaned, transformed, and aggregated to fit business analysis needs.
Data Lifecycle: Analyze?
The data is used to create reports, dashboards, and ML models; arguably the most impactful step of all.
Data Lifecycle: Archive/Retain?
Older data is moved into cheaper, long-term storage solutions.
Data Lifecycle: Delete?
Data is sometimes (not always) intentionally destroyed, often due to company data-retention policies.
Do the Data Lifecycle steps always happen in that exact order?
No. It's a general guideline, not a strict sequence; steps can overlap or switch places depending on context.
What are the three main types of data?
Structured, Semi-Structured, and Unstructured.
Structured data, with file types?
Highly organized data with a fixed schema. File types: SQL tables and Parquet (a columnar format optimized for analytics).
Semi-Structured data, with file types?
Has some structure but no rigid schema. File types: JSON, XML, CSV, and logs.
Unstructured data, with file types?
Free-form data with no particular structure. File types: images, videos, PDFs, and social media comments.
Why is Parquet worth knowing?
It's a columnar file format designed specifically for storing and analyzing big data, making it a key structured format in Big Data workflows.
What is Data Modeling?
Defining how data should be organized and structured so it can be stored and used as effectively as possible.
Data model: Conceptual?
A high-level view of business entities and their relationships, leaving out finer details like attributes and data types; mainly for non-technical stakeholder understanding.
Data model: Logical?
More detailed than conceptual: defines attributes, relationships, and PKs/FKs, but doesn't yet account for hardware or specific database technology.
Data model: Physical?
The specific implementation that accounts for database-specific factors, focusing on how data is actually stored in the particular database technology you're using.
What is Data Warehousing (DWH)?
The practice of collecting, storing, and managing large amounts of structured data for the sake of data analytics, helping draw conclusions from historical data.
What is an "informed business decision"?
A decision backed by analysis of a company's historical data, leveraging a data warehouse to act on evidence rather than guesswork (often to make more money or improve the product).
Name some DWH vendors.
Google BigQuery, Snowflake, and Amazon AWS Redshift.
What is OLTP?
Online Transaction Processing: fast, small, real-time transactions like INSERT, UPDATE, and DELETE; what you do with small SQL databases.
What is OLAP?
Online Analytical Processing: complex analytical queries on large datasets (like SELECT, but far more involved).
Which processing type does Data Warehousing tend to be?
OLAP (analytical), not OLTP.
What is an ODS, and where does it sit?
An Operational Data Store: a staging-area-style database used for real-time operational reporting and data integration. It sits between your core data sources and the data warehouse.
What kind of data does an ODS deal with?
Current/real-time data, used either to quickly generate reports on live data or to integrate that data into the warehouse, where it joins the historical data.
What is a "staging area"?
A temporary holding area where raw data gets cleaned, transformed, and loaded into the data warehouse. (Like git's staging area: git add stages changes before git commit.)
What is a Data Mart?
A smaller, department-specific subset of the data warehouse, a focused slice of data for a purpose like sales, marketing, finance, or HR. It's an organizational tool.
What's the niche/benefit of a Data Mart?
It supports faster, simpler querying and gives teams focused datasets without needing access to the warehouse's full scope.
Why does Ben like the name "Data Mart"?
Because it's like a little convenience store of data that stocks its "merchandise" from the data warehouse (or directly from data sources).
Data Warehouse vs Data Lake, what is each optimized for?
A Data Lake stores raw data of any structure and is optimized for cheap, long-term storage (schema-on-read). A Data Warehouse stores cleaned, structured/semi-structured data and is optimized for data analytics (schema-on-write).
Schema-on-read vs schema-on-write?
Data Lake = schema-on-read (a schema is applied when you query). Data Warehouse = schema-on-write (data must fit a predefined schema before being loaded and queried).
Are we working with Data Warehouses or Data Lakes in BigQuery?
Data Warehouses. We're doing only data warehousing with BigQuery because it's free for that.
What could we use for Data Lakes?
Google Cloud Storage with BigQuery's BigLake (queries data living outside BigQuery), AWS S3 with Lake Formation, or Azure Data Lake Storage (ADLS). All three cloud providers have data lake offerings.
What is Dimension Modeling, and why do we do it?
An approach to structuring data into facts and dimensions to support fast, understandable analytical queries; typically used in data warehousing/analytics (OLAP) workflows.
Is Dimension Modeling typically used in OLAP or OLTP?
OLAP (analytical workflows).
What is a Fact (in dimension modeling)?
A measurable business event, such as sales, orders, clicks, or views. Facts are the things you measure.
What is a Dimension?
An attribute that provides context for facts, such as customers, products, regions, or time/date. Dimensions are what you filter or group by.
What does it mean for a Dimension to "provide context to" a Fact?
A raw measurement (the fact) is meaningless alone; dimensions tell you the who/what/where/when. Example: "500 trips" (fact) only becomes useful when tied to a station and a date (dimensions).
What is a Star Schema?
A central fact table connected directly to dimension tables (just one hop). The fact table holds FKs to the dimension tables. Common in BI tools like Tableau and Power BI.
What is a Snowflake Schema?
Still a central fact table, but dimensions are more normalized, spread out more granularly to reduce redundancy. Dimension tables can have FKs to other dimension tables.
Star vs Snowflake, pros and cons?
Star: easier to read with simpler queries (fewer joins), but more redundancy (e.g., the same timestamp stored in multiple places). Snowflake: no redundancy and quick to update, but more complex relationships, harder to read, with more joins and slightly slower queries.
What is SCD in dimension modeling?
Slowly Changing Dimensions: how dimension data changes over time and how that history is stored.
SCD scenario: overwriting old values?
You don't keep history; the old value is simply overwritten. Example: a customer changes their email and the old email is overwritten.
SCD scenario: preserving history with new columns?
Add a new column for the new value, storing old and new data in the same row. Example: a customer moves and their old address goes into an "old address" column on the same row.
SCD scenario: preserving history with new records?
Add a new row for the new value and keep the old row as a historical record. Example: a customer moves and the old row/address stays on file. (This tends to be the industry standard over adding new columns.)
What is BigQuery?
A cloud-based big data analytics platform from Google, great for large-scale SQL queries, providing a fully managed environment for data warehousing (OLAP) jobs.
What vendor is BigQuery from?
Google (Google Cloud Platform / GCP).
What is BigQuery optimized for?
Processing massive amounts of read-only data: big queries and analytics (OLAP). It is not optimized for data manipulation.
Does BigQuery do OLAP or OLTP?
OLAP: analytics and processing of large datasets, not small real-time transactions.
Why use BigQuery over a traditional SQL database?
It's optimized for OLAP/analytics, distributes queries across thousands of Google machines in parallel (huge speed), offers GoogleSQL's vast function set, and uses denormalization (a NoSQL-leaning approach). Traditional SQL databases are built for OLTP instead.
WHY is BigQuery so fast?
It separates compute and storage, so the two can scale independently, and it runs queries in parallel across thousands of machines on Google's infrastructure, letting it process terabytes in seconds.
What is GoogleSQL?
BigQuery's SQL dialect: similar to other SQL dialects but with a vast set of built-in functions supporting querying and analysis, including some not seen in traditional SQL.
GoogleSQL function: SELECT EXCEPT()?
Selects all columns except the ones you name, e.g., SELECT * EXCEPT(col1, col2). Not standard in traditional SQL.
GoogleSQL function: REPLACE()?
Modifies values inline without rewriting the whole table, great for data cleaning. Example: SELECT REPLACE(city_name, 'NYC', 'New York') to unify inconsistent entries.
GoogleSQL function: EXTRACT()?
Pulls a specific part of a date or timestamp, e.g., EXTRACT(MONTH FROM date_col) or EXTRACT(YEAR FROM date_col). Useful for grouping/filtering by time; BigQuery can also use it with complex types like arrays and structs.
GoogleSQL function: SAFE prefix?
A prefix (e.g., SAFE.DIVIDE()) that returns NULL instead of throwing an error when something goes wrong. Handy for messy data.
Joins in BigQuery?
Similar to regular SQL, and you can join across datasets, which is powerful (though it makes queries a bit bloated). Note there are no real PK/FK relationships; joins match on column values because the data is denormalized.
Advanced BigQuery techniques worth knowing?
Window functions, ARRAY and STRUCT data types, and Partitioning & Clustering, plus joining across datasets and denormalization.
What is a Window Function?
A function that performs calculations across a set of rows without collapsing them into one result (unlike GROUP BY), enabling things like running totals or rankings. It can be very expensive and is defined by the OVER() clause.
What is OVER()?
Not a function itself; it's the clause that makes something a window function, defining the partitions and ordering for the calculation. No OVER(), no window function.