Datacentre Design, Interconnection Network & Cloud Logging

Datacentre Design & Interconnection Network

Page 2:

  • Definition of a data center

    • Facility made up of networked computers, storage systems, and computing infrastructure

    • Used by businesses and organizations to organize, process, and store large amounts of data

  • Importance of data centers for businesses

    • Businesses rely heavily on applications, services, and data within data centers for everyday operations

  • Size of data centers

    • Small data centers can have 1000 servers

    • Warehouse-level data centers can have 400,000 to 1 million servers

Page 3:

  • Function of data centers

    • Enable organizations to assemble resources and infrastructure for data processing, storage, and communication

    • Includes systems for storing, sharing, accessing, and processing data

    • Requires physical infrastructure and utilities such as cooling, electricity, and network access

Page 4:

  • Components of data centers

    • Server with sockets, CPU, and internal cache

    • Storage subsystems: Local and coherent DRAM and drives

    • Networking switches, routers, and firewalls

    • Cabling and physical racks for organizing and interconnecting IT equipment

    • Power distribution and supplementary power subsystems

    • Electrical switching, UPS, backup generator

    • Ventilation and data center cooling systems

    • Network carrier (telecom) connectivity

Page 5:

  • Design of warehouse-scale data centers

    • Illustration of server, rack, and switch configurations

    • Different specifications for DRAM and disk resources

    • Cluster-level switch for accessing resources in all racks

Page 6:

  • Server and switch-centric designs for datacenter-scale networks

    • Switch-centric design connects server nodes using switches

    • Server-centric design modifies the operating system on servers and uses special drivers for traffic relay

    • Both designs require organizing switches for connections

Page 7:

  • Cooling system in data centers

    • Raised floors for hiding cables, power lines, and cooling supplies

    • CRAC (computer room air conditioning) unit pressurizes the raised floor plenum with cold air

Page 9:

  • Different levels of data centers

    • Tier I: Basic data centers with UPS and 99.671% uptime guarantee

    • Tier II: Data centers with redundancy and 99.741% uptime guarantee

    • Tier III: Data centers with partial fault tolerance, redundancy, and 99.982% uptime guarantee

    • Tier IV: Data centers with full fault tolerance, redundancy, and 99.995% uptime guarantee
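The uptime percentages above are easier to compare when converted into allowed downtime per year. The following is a minimal sketch of that arithmetic; the 8,760-hour year (non-leap) and the function name are assumptions made for illustration.

```python
# Uptime guarantees as listed in the tier definitions above.
TIER_UPTIME_PERCENT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the maximum downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, uptime in TIER_UPTIME_PERCENT.items():
        print(f"{tier}: {annual_downtime_hours(uptime):.1f} hours/year")
```

Under these assumptions the tiers work out to roughly 28.8, 22.7, 1.6, and 0.4 hours of downtime per year, respectively.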

Page 10:

  • Importance of interconnection network in data centers

    • Critical core design for connecting all servers in a data center cluster

    • Requirements for the interconnection network: low latency, high bandwidth, fault tolerance, low cost, message passing interface

Page 11:

  • Fat-tree topology for interconnecting server nodes in data centers

    • Two-layer topology with server nodes in the bottom layer and edge switches connecting them

    • Upper layer aggregates the lower-layer edge switches

    • Provides fault-tolerant capability with multiple paths between server nodes

    • Uses low-cost Ethernet switches and modifies routing table and algorithms for redundancy

Page 12:

  • Advantages of fat-tree topology in data centers

    • Identical bandwidth at any bisection

    • Each layer has the same aggregated bandwidth

    • Can be built using cheap devices with uniform capacity

    • Great scalability with k-port switch supporting k^3/4 servers
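The k^3/4 scaling rule can be checked with a few lines of arithmetic. This is a small sketch; the function name and the requirement that k be even (so that ports split evenly between up-links and down-links) are assumptions for illustration.

```python
def fat_tree_servers(k: int) -> int:
    """Number of servers a fat tree built from k-port switches can support: k^3 / 4."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even port count")
    return k ** 3 // 4


if __name__ == "__main__":
    for k in (4, 24, 48):
        print(f"k={k}: {fat_tree_servers(k)} servers")
```

For example, 48-port switches give 48^3/4 = 27,648 servers, which is why commodity switches are sufficient for large clusters.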

Page 13:

  • Details of the fat-tree topology in data centers

    • k-ary fat tree with three-layer topology (edge, aggregation, and core)

    • Pods consisting of servers, edge switches, and aggregation switches, with core switches sitting above the pods

    • Core switches connect to multiple pods, providing fault tolerance and high bandwidth
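To make the pod structure concrete, the sketch below tabulates the component counts of a k-ary fat tree. It assumes the standard construction, which the slides do not spell out: k pods, each with k/2 edge and k/2 aggregation switches, k/2 servers per edge switch, and (k/2)^2 core switches. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatTreeLayout:
    pods: int
    edge_switches_per_pod: int
    aggregation_switches_per_pod: int
    servers_per_edge_switch: int
    core_switches: int

    @property
    def total_servers(self) -> int:
        return self.pods * self.edge_switches_per_pod * self.servers_per_edge_switch

    @property
    def total_switches(self) -> int:
        per_pod = self.edge_switches_per_pod + self.aggregation_switches_per_pod
        return self.pods * per_pod + self.core_switches

    @property
    def inter_pod_paths(self) -> int:
        # Assumption: one shortest path per core switch between
        # two servers that sit in different pods.
        return self.core_switches


def k_ary_fat_tree(k: int) -> FatTreeLayout:
    """Component counts for a k-ary fat tree (standard construction assumed)."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return FatTreeLayout(
        pods=k,
        edge_switches_per_pod=half,
        aggregation_switches_per_pod=half,
        servers_per_edge_switch=half,
        core_switches=half * half,
    )


if __name__ == "__main__":
    layout = k_ary_fat_tree(4)
    print(layout)
    print("servers:", layout.total_servers)
    print("switches:", layout.total_switches)
    print("paths between pods:", layout.inter_pod_paths)
```

With k = 4 this gives 4 pods, 16 servers, 20 switches, and 4 distinct shortest paths between servers in different pods, which is the source of the fault tolerance and bandwidth noted above.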

Page 14:

  • Challenges in using fat-tree topology in data centers

    • Routing protocols for switches

    • Layer 2 switch algorithm: data plane flooding

    • Layer 3 IP routing: shortest path routing may not utilize path diversity
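The path-diversity problem can be illustrated with a toy model. The sketch below contrasts a router that always installs a single shortest path with one that hashes each flow onto one of several equal-cost paths. The hashing approach (in the style of equal-cost multipath, ECMP) is not described in the slides and is shown only as a commonly used remedy; the addresses, ports, and function names are hypothetical.

```python
import zlib
from collections import Counter


def single_path_choice(flow: tuple, num_paths: int) -> int:
    """Model plain shortest-path routing that installs one next hop for all flows."""
    return 0  # every flow is sent over the same path


def hashed_path_choice(flow: tuple, num_paths: int) -> int:
    """Model hash-based spreading of flows over equal-cost paths (ECMP-style)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_paths


if __name__ == "__main__":
    num_paths = 4  # e.g. (k/2)^2 inter-pod paths for k = 4
    # Hypothetical flows: (source IP, destination IP, source port, destination port)
    flows = [("10.0.0.2", "10.2.0.3", 5000 + i, 80) for i in range(100)]

    single = Counter(single_path_choice(f, num_paths) for f in flows)
    hashed = Counter(hashed_path_choice(f, num_paths) for f in flows)
    print("single shortest path:", dict(single))
    print("hashed over equal-cost paths:", dict(hashed))
```

In the single-path model all 100 flows share one path and the remaining paths stay idle, whereas hashing on the flow identifier spreads the flows across the available paths while keeping each flow on one path.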

Page 15:

  • Cloud Logs: Logs Explorer (screenshot)

    • Query: resource.type="k8s_container"

    • Time range: last 1 hour, from 7/1/21, 12:41 PM to 7/1/21, 1:41 PM (PDT)

    • Log fields: severity (Info 2,228; Error 24), log name (stdout 2,147; stderr 105), project ID (ygrinshteyn-sandbox 2,252), location (us-central1 2,252), cluster name (prod-cluster-autopilot 2,252), namespace name (default 2,168; kube-system 84)

    • Query results: entries from the 1bm-slis-server container alternating between "request made" and "Status log 200", with one "Status log 500" at 13:40:47.838 PDT

Page 16:

  • Google Cloud Logging: Case Study

    • Cloud Logging is a service for storing, viewing, and interacting with logs.

    • Answers the question “Who did what, where, and when?” within GCP projects

    • Maintains non-tamperable audit logs for each project and organization

    • Log buckets are a regional resource, which means the infrastructure that stores, indexes, and searches the logs is located in a specific geographical location.

    • Google manages that infrastructure so that the applications are available redundantly across the zones within that region.

Page 17:

  • Cloud Logs: Logs Explorer with an expanded log entry (screenshot)

    • Query: resource.type="k8s_container", last 1 hour (7/1/21, 12:41 PM to 7/1/21, 1:41 PM PDT)

    • Expanded entry: 2021-07-01 13:41:08.028 PDT, 1bm-slis-server, "Status log 200"

    • Fields shown: insertId, labels (compute.googleapis.com/resource_name, k8s-pod/app, k8s-pod/pod-template-hash), logName, receiveTimestamp, resource.labels (cluster_name: "prod-cluster-autopilot", container_name, location: "us-central1", namespace_name: "default", pod_name, project_id), resource.type: "k8s_container", severity: "INFO", textPayload, timestamp
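The query shown in the Logs Explorer, resource.type="k8s_container", can be narrowed with further clauses on resource labels and severity. The helper below is a small sketch that only assembles such a filter string; the function name and parameters are hypothetical, and it does not call any Google Cloud API.

```python
from typing import Dict, Optional


def build_log_filter(
    resource_type: str,
    labels: Optional[Dict[str, str]] = None,
    min_severity: Optional[str] = None,
) -> str:
    """Assemble a Logging query filter from a resource type, labels, and severity."""
    clauses = [f'resource.type="{resource_type}"']
    for name, value in (labels or {}).items():
        clauses.append(f'resource.labels.{name}="{value}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(
        build_log_filter(
            "k8s_container",
            labels={"cluster_name": "prod-cluster-autopilot", "namespace_name": "default"},
            min_severity="ERROR",
        )
    )
```

The resulting string can be pasted into the Logs Explorer query box to limit the results to error-level entries from the default namespace of the cluster shown in the capture.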

Page 18:

  • Log Sources

    • Platform logs

      • Services

      • Audit logs

    • Custom logs

      • Agent

      • API and client libraries

      • User apps

      • Third-party applications

      • System software

    • Network logs

      • VPC flow

      • Firewall rules

      • NAT gateways

      • Load Balancer

Page 19:

  • Log Sources

    • Cloud Platform Logs

      • Cloud Platform logs are service-specific logs that help with troubleshooting and debugging, and with understanding Google Cloud services better.

      • Cloud Platform logs are generated by GCP services and vary depending on which Google Cloud resources are used in your Google Cloud project or organization.

      • Access Transparency logs record actions taken by Google staff when accessing your Google Cloud content.

        • can help track compliance with the organization’s legal and regulatory requirements.

        • have 400-day retention

Page 20:

  • Log Sources

    • Security Logs

      • Audit Logs

        • Cloud Audit Logs includes three types of audit logs:

          • Admin Activity

          • Data Access

          • System Event

        • Cloud Audit Logs provide audit trails of administrative changes to, and data accesses of, Google Cloud resources.

      • Admin Activity

        • captures user-initiated resource configuration changes

        • enabled by default

        • no additional charge

        • admin activity – administrative actions and API calls

        • have 400-day retention

      • System Event

        • captures system-initiated resource configuration changes

        • enabled by default

        • no additional charge

        • system events – GCE system events like live migration

        • have 400-day retention

      • Data Access logs

        • log API calls that create, modify, or read user-provided data, e.g. an object created in a GCS bucket

        • 30-day retention

        • disabled by default

        • size can be huge

        • charged beyond free limits

        • Available for GCP-visible services only. Not available for public resources.
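The differences between the three audit log types can be captured in a small lookup table, together with a helper that builds the filter for selecting one type in a project. This is a sketch: the retention, default, and charging values are taken from the bullets above, the log IDs are the standard Cloud Audit Logs names, and the class name, function name, and the project ID "my-project" are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class AuditLogType:
    log_id: str
    enabled_by_default: bool
    retention_days: int
    chargeable: bool  # True if charged beyond free limits


# Values transcribed from the audit log bullets above.
AUDIT_LOG_TYPES = {
    "admin_activity": AuditLogType("cloudaudit.googleapis.com/activity", True, 400, False),
    "system_event": AuditLogType("cloudaudit.googleapis.com/system_event", True, 400, False),
    "data_access": AuditLogType("cloudaudit.googleapis.com/data_access", False, 30, True),
}


def audit_log_filter(project_id: str, kind: str) -> str:
    """Build a logName filter for one audit log type in a project."""
    # The log ID is URL-encoded inside logName, so "/" becomes "%2F".
    log_id = quote(AUDIT_LOG_TYPES[kind].log_id, safe="")
    return f'logName="projects/{project_id}/logs/{log_id}"'


if __name__ == "__main__":
    for kind, info in AUDIT_LOG_TYPES.items():
        print(kind, info.retention_days, "days,", "on" if info.enabled_by_default else "off", "by default")
    print(audit_log_filter("my-project", "data_access"))
```

Keeping the properties in one table makes the trade-off visible at a glance: only Data Access logs are off by default, short-lived, and potentially billable.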

Page 21:

  • Log Sources

    • User Logs

      • User logs are generated by user software, services, or applications and written to Cloud Logging using a logging agent, the Cloud Logging API, or the Cloud Logging client libraries

    • Agent logs

      • produced by an installed logging agent that collects logs from user applications and VMs

      • covers log data from third