3rd interview

0.0(0)

Studied by 0 people

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/147

There's no tags or description

Looks like no tags are added yet.

Last updated 6:57 PM on 6/19/26

Name	Mastery	Learn	Test	Matching	Spaced	Call with Kai

No analytics yet

Send a link to your students to track their progress

148 Terms

New cards

What is Cloud Computing?

The practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using your own local computer.

New cards

What is "the cloud" (simple definition)?

Essentially someone else's computer(s) connected to the internet, usually a company charging you to use their servers. There's no single cloud; many companies each provide cloud services on their own infrastructure.

New cards

Why host on the cloud instead of your local machine?

You avoid buying and maintaining your own hardware, you only pay for what you use (cheaper and faster than running your own IT room), and you get scalability and availability you couldn't easily achieve locally.

New cards

What is AWS?

Amazon Web Services, a collection of hundreds of cloud computing services that provide the building blocks of IT infrastructure for individuals or companies to rent and use for a fee. It is a cloud provider.

New cards

AWS vs GCP vs Azure (quick comparison)?

All three are major cloud providers that do the same fundamental thing: sell cloud resources. AWS (Amazon) leads at ~31% market share, Azure (Microsoft) is ~20%, and GCP (Google, ~12%) is the one we use for BigQuery.

New cards

What do cloud providers do for us, and is it free?

They provide on-demand infrastructure and services so you don't build everything from scratch. It is not free in general (you pay for what you use), though many offer limited free-tier resources.

New cards

Why does Ben say cloud providers are "filling a niche in our field"?

A small business or solo developer doesn't need to buy hardware and build everything just to host one app. Providers supply those assets to rent, so it's a smaller investment for us and the provider makes money off us.

New cards

Name three benefits of using a cloud provider like AWS.

Pay only for what you use (cheaper/faster than your own IT room), highly scalable (easily upsize or downsize), and highly available (resources in locations worldwide, via availability zones).

New cards

What are the 3 Cloud Computing service models?

Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). They're defined by how much of the stack the provider manages versus how much you handle yourself.

New cards

What is IaaS? Give an example.

Infrastructure as a Service: provides infrastructure-level tech like virtual machines, storage, firewalls (security groups), load balancers, and VLANs. Example: Amazon EC2, a raw virtual server you configure and maintain yourself.

New cards

What is PaaS? Give examples.

Platform as a Service: provides the cloud infrastructure to run software you've written, with the provider managing servers and databases. Examples: AWS RDS (managed DB server), AWS Elastic Beanstalk (deploys your app and handles the EC2s/infra), AWS Lambda (serverless).

New cards

What is SaaS? Give examples.

Software as a Service: a complete, ready-to-use cloud application delivered to the user; one instance that many users consume, with no configuration needed. Examples: the apps on your phone, Squarespace, even the AWS.com console itself.

New cards

The car analogy for IaaS / PaaS / SaaS?

IaaS = leasing a car (you handle gas, insurance, maintenance). PaaS = renting a car (gassed up and insured, you just drive it). SaaS = using a rideshare app (you manage nothing, you just get the service).

New cards

How does price differ across IaaS, PaaS, and SaaS?

Price increases from IaaS to PaaS to SaaS (more managed = more expensive). Example: hosting a site on S3 (IaaS) is ~2 cents/month, while Squarespace (SaaS) starts ~$16/month.

New cards

What are the 3 cloud privacy/deployment models?

Public, Private, and Hybrid. They describe whether your infrastructure is managed by a provider, by yourself, or a mix.

New cards

What is the Public cloud model? Use case?

Cloud computing run by a third-party provider (e.g., AWS). Use case: startups, solo devs, or any org wanting scalable infrastructure without owning hardware; most web apps and SaaS products are public.

New cards

What is the Private cloud model? Use case?

Cloud services and infrastructure hosted privately within a company's own intranet; the company owns and maintains the hardware with no third party. Use case: enterprises or government agencies with strict security, compliance, or regulatory requirements.

New cards

What is the Hybrid cloud model? Use case?

Applications run across a combination of public and private environments. Use case: orgs keeping sensitive data on private infrastructure but offloading everything else to the public cloud, like hospitals storing patient records on-prem but using the cloud for everything else.

New cards

What is Big Data, and why is it called that?

Any dataset so massive and complex that it can't be efficiently stored or processed by traditional means like Excel or a relational SQL database. "Big" refers to that scale and complexity.

New cards

What do Big Data workflows do better than traditional SQL workflows?

They stay operationally efficient on enormous datasets and can handle various data types (unstructured and semi-structured), not just the structured, fixed-schema data of a SQL table.

New cards

Common Big Data use cases beyond storage?

Data analytics, data cleansing, forecasting, reporting, and machine learning modeling.

New cards

What are the Five Vs of Big Data?

Volume, Variety, and Velocity (the original three), plus Veracity and Value (the newer two that reflect maturing industry priorities).

New cards

Five Vs: Volume?

The amount of data. We process "big" amounts, in the petabyte range or an ever-growing dataset of unknown size.

New cards

Five Vs: Variety?

The type of data. Big data goes beyond traditional SQL types to include unstructured data (like video) and semi-structured data (like JSON).

New cards

Five Vs: Velocity?

The speed at which data is received and processed. Efficiency is critical because huge datasets lose value if they take too long to process.

New cards

Five Vs: Veracity?

How reliably accurate the data is. You need high quality and integrity, or you can't trust your analysis, conclusions, and resulting business decisions.

New cards

Five Vs: Value?

Data has intrinsic value to businesses. Data with actionable insights (location, age, purchase history) is worth far more than simple record-keeping.

New cards

The "SELECT * on 200 million records in 30 seconds" demo, what's the point?

It illustrates Big Data's velocity/processing speed: a query that would choke a traditional database returns almost instantly on a Big Data platform like BigQuery.

New cards

What is "data," and what are common sources of it?

Any recordable, storable piece of information. Common sources: transaction records, activity logs, images, emails, databases, social media feeds, and form submissions.

New cards

The typical components of a Big Data architecture?

Data Sources, then Data Ingestion (Consolidation) Tools, then Data Streaming/Processing Tools, then Data Storage Tools, then Analysis & Visualization Tools.

New cards

Big Data component: Data Sources?

Where the raw data comes from, such as databases, social media feeds, or form submissions.

New cards

Big Data component: Data Ingestion / Consolidation Tools?

Tools that pull data from the sources and funnel it into your Big Data tool of choice. Example: Apache Kafka.

New cards

Big Data component: Data Streaming / Processing Tools?

Tools that actually process the data, either in real time (streaming) or in large batches (batch processing); where transformation and cleaning often happen. Examples: Apache Spark, Apache Flink.

New cards

Big Data component: Data Storage Tools?

Where the processed data lives, ready to be used, such as a data lake or data warehouse. Examples: Google BigQuery, Amazon S3.

New cards

Big Data component: Analysis and Visualization Tools?

Tools to make sense of the data and communicate findings, where the payoff happens. Examples: BigQuery for querying, Matplotlib or Tableau for visualization, NumPy/Pandas for programmatic analysis.

New cards

Key benefits of Big Data?

Scalable analytics (scale compute and storage independently), handling diverse data, high processing speeds, and support for machine learning and business intelligence. (Be sure to hit scalability and speed.)

New cards

Key challenges of Big Data?

Complexity and a steep learning curve, cost (more power means more expensive), data quality (dirty data must be cleaned), and sheer-scale issues like compliance, security, latency, and migrations. (Be sure to mention complexity and cost.)

New cards

What are the steps of the Data Lifecycle?

Create/Capture, Store, Process, Analyze, Archive/Retain, and Delete.

New cards

Data Lifecycle: Create/Capture?

Data is received, either created on the spot or pulled from an external source like logs, transaction records, or form inputs.

New cards

Data Lifecycle: Store?

Data gets saved to a storage solution such as a database, data lake, or data warehouse.

New cards

Data Lifecycle: Process?

The data is cleaned, transformed, and aggregated to fit business analysis needs.

New cards

Data Lifecycle: Analyze?

The data is used to create reports, dashboards, and ML models; arguably the most impactful step of all.

New cards

Data Lifecycle: Archive/Retain?

Older data is moved into cheaper, long-term storage solutions.

New cards

Data Lifecycle: Delete?

Data is sometimes (not always) intentionally destroyed, often due to company data-retention policies.

New cards

Do the Data Lifecycle steps always happen in that exact order?

No. It's a general guideline, not a strict sequence; steps can overlap or switch places depending on context.

New cards

What are the three main types of data?

Structured, Semi-Structured, and Unstructured.

New cards

Structured data, with file types?

Highly organized data with a fixed schema. File types: SQL tables and Parquet (a columnar format optimized for analytics).

New cards

Semi-Structured data, with file types?

Has some structure but no rigid schema. File types: JSON, XML, CSV, and logs.

New cards

Unstructured data, with file types?

Free-form data with no particular structure. File types: images, videos, PDFs, and social media comments.

New cards

Why is Parquet worth knowing?

It's a columnar file format designed specifically for storing and analyzing big data, making it a key structured format in Big Data workflows.

New cards

What is Data Modeling?

Defining how data should be organized and structured so it can be stored and used as effectively as possible.

New cards

Data model: Conceptual?

A high-level view of business entities and their relationships, leaving out finer details like attributes and data types; mainly for non-technical stakeholder understanding.

New cards

Data model: Logical?

More detailed than conceptual: defines attributes, relationships, and PKs/FKs, but doesn't yet account for hardware or specific database technology.

New cards

Data model: Physical?

The specific implementation that accounts for database-specific factors, focusing on how data is actually stored in the particular database technology you're using.

New cards

What is Data Warehousing (DWH)?

The practice of collecting, storing, and managing large amounts of structured data for the sake of data analytics, helping draw conclusions from historical data.

New cards

What is an "informed business decision"?

A decision backed by analysis of a company's historical data, leveraging a data warehouse to act on evidence rather than guesswork (often to make more money or improve the product).

New cards

Name some DWH vendors.

Google BigQuery, Snowflake, and Amazon AWS Redshift.

New cards

What is OLTP?

Online Transaction Processing: fast, small, real-time transactions like INSERT, UPDATE, and DELETE; what you do with small SQL databases.

New cards

What is OLAP?

Online Analytical Processing: complex analytical queries on large datasets (like SELECT, but far more involved).

New cards

Which processing type does Data Warehousing tend to be?

OLAP (analytical), not OLTP.

New cards

What is an ODS, and where does it sit?

An Operational Data Store: a staging-area-style database used for real-time operational reporting and data integration. It sits between your core data sources and the data warehouse.

New cards

What kind of data does an ODS deal with?

Current/real-time data, used either to quickly generate reports on live data or to integrate that data into the warehouse, where it joins the historical data.

New cards

What is a "staging area"?

A temporary holding area where raw data gets cleaned, transformed, and loaded into the data warehouse. (Like git's staging area: git add stages changes before git commit.)

New cards

What is a Data Mart?

A smaller, department-specific subset of the data warehouse, a focused slice of data for a purpose like sales, marketing, finance, or HR. It's an organizational tool.

New cards

What's the niche/benefit of a Data Mart?

It supports faster, simpler querying and gives teams focused datasets without needing access to the warehouse's full scope.

New cards

Why does Ben like the name "Data Mart"?

Because it's like a little convenience store of data that stocks its "merchandise" from the data warehouse (or directly from data sources).

New cards

Data Warehouse vs Data Lake, what is each optimized for?

A Data Lake stores raw data of any structure and is optimized for cheap, long-term storage (schema-on-read). A Data Warehouse stores cleaned, structured/semi-structured data and is optimized for data analytics (schema-on-write).

New cards

Schema-on-read vs schema-on-write?

Data Lake = schema-on-read (a schema is applied when you query). Data Warehouse = schema-on-write (data must fit a predefined schema before being loaded and queried).

New cards

Are we working with Data Warehouses or Data Lakes in BigQuery?

Data Warehouses. We're doing only data warehousing with BigQuery because it's free for that.

New cards

What could we use for Data Lakes?

Google Cloud Storage with BigQuery's BigLake (queries data living outside BigQuery), AWS S3 with Lake Formation, or Azure Data Lake Storage (ADLS). All three cloud providers have data lake offerings.

New cards

What is Dimension Modeling, and why do we do it?

An approach to structuring data into facts and dimensions to support fast, understandable analytical queries; typically used in data warehousing/analytics (OLAP) workflows.

New cards

Is Dimension Modeling typically used in OLAP or OLTP?

OLAP (analytical workflows).

New cards

What is a Fact (in dimension modeling)?

A measurable business event, such as sales, orders, clicks, or views. Facts are the things you measure.

New cards

What is a Dimension?

An attribute that provides context for facts, such as customers, products, regions, or time/date. Dimensions are what you filter or group by.

New cards

What does it mean for a Dimension to "provide context to" a Fact?

A raw measurement (the fact) is meaningless alone; dimensions tell you the who/what/where/when. Example: "500 trips" (fact) only becomes useful when tied to a station and a date (dimensions).

New cards

What is a Star Schema?

A central fact table connected directly to dimension tables (just one hop). The fact table holds FKs to the dimension tables. Common in BI tools like Tableau and Power BI.

New cards

What is a Snowflake Schema?

Still a central fact table, but dimensions are more normalized, spread out more granularly to reduce redundancy. Dimension tables can have FKs to other dimension tables.

New cards

Star vs Snowflake, pros and cons?

Star: easier to read with simpler queries (fewer joins), but more redundancy (e.g., the same timestamp stored in multiple places). Snowflake: no redundancy and quick to update, but more complex relationships, harder to read, with more joins and slightly slower queries.

New cards

What is SCD in dimension modeling?

Slowly Changing Dimensions: how dimension data changes over time and how that history is stored.

New cards

SCD scenario: overwriting old values?

You don't keep history; the old value is simply overwritten. Example: a customer changes their email and the old email is overwritten.

New cards

SCD scenario: preserving history with new columns?

Add a new column for the new value, storing old and new data in the same row. Example: a customer moves and their old address goes into an "old address" column on the same row.

New cards

SCD scenario: preserving history with new records?

Add a new row for the new value and keep the old row as a historical record. Example: a customer moves and the old row/address stays on file. (This tends to be the industry standard over adding new columns.)

New cards

What is BigQuery?

A cloud-based big data analytics platform from Google, great for large-scale SQL queries, providing a fully managed environment for data warehousing (OLAP) jobs.

New cards

What vendor is BigQuery from?

Google (Google Cloud Platform / GCP).

New cards

What is BigQuery optimized for?

Processing massive amounts of read-only data: big queries and analytics (OLAP). It is not optimized for data manipulation.

New cards

Does BigQuery do OLAP or OLTP?

OLAP: analytics and processing of large datasets, not small real-time transactions.

New cards

Why use BigQuery over a traditional SQL database?

It's optimized for OLAP/analytics, distributes queries across thousands of Google machines in parallel (huge speed), offers GoogleSQL's vast function set, and uses denormalization (a NoSQL-leaning approach). Traditional SQL databases are built for OLTP instead.

New cards

WHY is BigQuery so fast?

It separates compute and storage, so the two can scale independently, and it runs queries in parallel across thousands of machines on Google's infrastructure, letting it process terabytes in seconds.

New cards

What is GoogleSQL?

BigQuery's SQL dialect: similar to other SQL dialects but with a vast set of built-in functions supporting querying and analysis, including some not seen in traditional SQL.

New cards

GoogleSQL function: SELECT EXCEPT()?

Selects all columns except the ones you name, e.g., SELECT * EXCEPT(col1, col2). Not standard in traditional SQL.

New cards

GoogleSQL function: REPLACE()?

Modifies values inline without rewriting the whole table, great for data cleaning. Example: SELECT REPLACE(city_name, 'NYC', 'New York') to unify inconsistent entries.

New cards

GoogleSQL function: EXTRACT()?

Pulls a specific part of a date or timestamp, e.g., EXTRACT(MONTH FROM date_col) or EXTRACT(YEAR FROM date_col). Useful for grouping/filtering by time; BigQuery can also use it with complex types like arrays and structs.

New cards

GoogleSQL function: SAFE prefix?

A prefix (e.g., SAFE.DIVIDE()) that returns NULL instead of throwing an error when something goes wrong. Handy for messy data.

New cards

Joins in BigQuery?

Similar to regular SQL, and you can join across datasets, which is powerful (though it makes queries a bit bloated). Note there are no real PK/FK relationships; joins match on column values because the data is denormalized.

New cards

Advanced BigQuery techniques worth knowing?

Window functions, ARRAY and STRUCT data types, and Partitioning & Clustering, plus joining across datasets and denormalization.

New cards

What is a Window Function?

A function that performs calculations across a set of rows without collapsing them into one result (unlike GROUP BY), enabling things like running totals or rankings. It can be very expensive and is defined by the OVER() clause.

100

New cards

What is OVER()?

Not a function itself; it's the clause that makes something a window function, defining the partitions and ordering for the calculation. No OVER(), no window function.