Cloud Computing and Data Engineering: Comprehensive Notes

Yesterday's recap

  • Topics discussed yesterday:

    • Data engineering: data governance, data ingestion, data manipulation, data standardization, data cleaning, and building data engineering pipelines.

    • Three types of data processing: batch processing, real-time processing, and near-real-time processing.

    • Big data overview: what big data is, how it is generated, and sources of big data.

    • Five V's of big data: volume, velocity, variety, veracity, and value.

    • Data types: structured data, unstructured data, semi-structured data.

    • File formats: CSV, JSON, Excel (and other formats mentioned).

  • Emphasis on theory as a foundation before labs: three to four days of theory, then labs from day five.

Today's topic: Cloud computing

  • Cloud computing defined: a model for delivering computing resources (servers, storage, databases, networking, software, and analytics) on demand over the internet, typically to host applications.

    • Typical lifecycle: develop locally on a laptop, then deploy to the cloud so that users worldwide can access the application through a link.

    • Resources provided by cloud: virtual machines, storage, databases, networking, software, analytics, AI, and more.

Types of cloud models

  • Public cloud

    • Definition: resources (servers, storage, etc.) provided by third-party providers (e.g., AWS, Azure, Google Cloud Platform).

    • Major players by market share: AWS, Microsoft Azure, Google Cloud Platform (GCP).

    • Advantages:

      • Highly scalable and cost-effective.

      • Pay-for-what-you-use model (auto-scaling based on demand).

      • No need to manage physical infrastructure.

    • Example usage: Netflix uses AWS to stream content; Dropbox built its file storage on the public cloud (AWS) in its early years.

    • Auto-scaling concept: cost tracks usage. If the system is capped at 5,000 users but the actual number of users varies from day to day, you pay only for the users actually served.

    • Example used in lecture: Day 1 = 1,000 users, Day 2 = 200 users, Day 3 = 2,400 users, with a maximum capacity of 5,000 users. At the \$1 per user per day implied by the lecture's figures, the three-day charge is \$3,600 (1,000 + 200 + 2,400), instead of \$15,000 if all 5,000 were paid for each day.
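
The lecture's pay-per-use arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a real provider's billing model: the flat rate of 1.00 per user per day is an assumption inferred from the 15,000 figure (5,000 users over 3 days), and the names `PRICE_PER_USER_PER_DAY`, `MAX_CAPACITY`, `daily_cost`, and `total_cost` are hypothetical.

```python
# Assumed flat rate, inferred from the lecture's 15,000 figure
# (5,000 users x 3 days); not an actual cloud price.
PRICE_PER_USER_PER_DAY = 1.00

# Maximum capacity used in the lecture example.
MAX_CAPACITY = 5_000


def daily_cost(active_users, price=PRICE_PER_USER_PER_DAY):
    """Cost for one day: price per user per day times active users that day."""
    return price * active_users


def total_cost(daily_users, price=PRICE_PER_USER_PER_DAY):
    """Total over a period: the sum of the daily costs."""
    return sum(daily_cost(users, price) for users in daily_users)


# Active users on Day 1, Day 2, and Day 3 from the lecture example.
daily_users = [1_000, 200, 2_400]

# Pay only for the users actually served each day.
pay_per_use = total_cost(daily_users)

# Comparison: paying for the full 5,000-user capacity every day.
fixed_capacity = total_cost([MAX_CAPACITY] * len(daily_users))

print(f"Pay-per-use total:    {pay_per_use:,.2f}")
print(f"Fixed-capacity total: {fixed_capacity:,.2f}")
```

Changing `daily_users` or the assumed rate shows how the bill follows demand, while the fixed-capacity figure depends only on the cap and the number of days.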

    • Basic formula: $\text{cost per day} = \text{price per user per day} \times \text{active users that day}$; the total over the period is:
      $$ \text{Total} = \