Definition of a data center
Facility made up of networked computers, storage systems, and computing infrastructure
Used by businesses and organizations to organize, process, and store large amounts of data
Importance of data centers for businesses
Businesses rely heavily on applications, services, and data within data centers for everyday operations
Size of data centers
Small data centers can have 1000 servers
Warehouse-level data centers can have 400,000 to 1 million servers
Function of data centers
Enable organizations to assemble resources and infrastructure for data processing, storage, and communication
Includes systems for storing, sharing, accessing, and processing data
Requires physical infrastructure and utilities such as cooling, electricity, and network access
Components of data centers
Server with sockets, CPU, and internal cache
Storage subsystems: Local and coherent DRAM and drives
Networking switches, routers, and firewalls
Cabling and physical racks for organizing and interconnecting IT equipment
Power distribution and supplementary power subsystems
Electrical switching, UPS, backup generator
Ventilation and data center cooling systems
Network carrier (telecom) connectivity
Design of warehouse-scale data centers
Illustration of server, rack, and switch configurations
Different specifications for DRAM and disk resources
Cluster-level switch for accessing resources in all racks
Server and switch-centric designs for datacenter-scale networks
Switch-centric design connects server nodes using switches
Server-centric design modifies the operating system on servers and uses special drivers for traffic relay
Both designs require organizing switches for connections
Cooling system in data centers
Raised floors for hiding cables, power lines, and cooling supplies
CRAC (computer room air conditioning) unit pressurizes the raised floor plenum with cold air
Different levels of data centers
Tier I: Basic data centers with UPS and 99.671% uptime guarantee
Tier II: Data centers with redundancy and 99.741% uptime guarantee
Tier III: Data centers with partial fault tolerance, redundancy, and 99.982% uptime guarantee
Tier IV: Data centers with full fault tolerance, redundancy, and 99.995% uptime guarantee
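The uptime percentages above are easier to compare as allowed downtime per year. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored (assumption)

# Uptime guarantees as listed in the tier descriptions above.
TIER_UPTIME_PERCENT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}


def max_downtime_hours(uptime_percent: float) -> float:
    """Return the yearly downtime (in hours) that an uptime percentage allows."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for tier, uptime in TIER_UPTIME_PERCENT.items():
    print(f"{tier}: {max_downtime_hours(uptime):.1f} hours/year")
```

Under these assumptions the allowed downtime falls from roughly 28.8 hours per year for Tier I to under half an hour per year for Tier IV.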
Importance of interconnection network in data centers
Critical core design for connecting all servers in a data center cluster
Requirements for the interconnection network: low latency, high bandwidth, fault tolerance, low cost, message passing interface
Fat-tree topology for interconnecting server nodes in data centers
Two-layer topology with server nodes in the bottom layer and edge switches connecting them
Upper layer aggregates the lower-layer edge switches
Provides fault-tolerant capability with multiple paths between server nodes
Uses low-cost Ethernet switches and modifies routing table and algorithms for redundancy
Advantages of fat-tree topology in data centers
Identical bandwidth at any bisection
Each layer has the same aggregated bandwidth
Can be built using cheap devices with uniform capacity
Great scalability with k-port switch supporting k^3/4 servers
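The scalability and equal-bandwidth claims above can be checked with simple counting. This sketch assumes the usual k-ary fat-tree wiring (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches) and links of uniform capacity, so that equal link counts imply equal aggregated bandwidth. The function name is hypothetical.

```python
def fat_tree_link_counts(k: int) -> dict:
    """Count the links at each layer of a k-ary fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number of ports")
    half = k // 2
    pods = k
    edge_per_pod = half
    agg_per_pod = half
    core_switches = half * half
    return {
        # Each edge switch uses half of its ports for servers.
        "server-edge": pods * edge_per_pod * half,
        # Inside a pod, every edge switch links to every aggregation switch.
        "edge-aggregation": pods * edge_per_pod * agg_per_pod,
        # Each core switch has exactly one link into every pod.
        "aggregation-core": core_switches * pods,
    }


for k in (4, 24, 48):
    links = fat_tree_link_counts(k)
    print(k, links["server-edge"], "servers;", links)
```

For k = 48 every layer carries 27,648 links, which equals k^3/4 and is also the number of servers supported.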
Details of the fat-tree topology in data centers
K-ary fat tree with three-layer topology (edge, aggregation, and core)
Pods consisting of servers, edge switches, and aggregation switches, with core switches above the pods
Core switches connect to multiple pods, providing fault tolerance and high bandwidth
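As a rough sizing aid, the number of switches in each layer follows from k alone. The sketch below assumes the common construction in which each of the k pods holds k/2 edge and k/2 aggregation switches and there are (k/2)^2 core switches. The function name is hypothetical.

```python
def fat_tree_switch_counts(k: int) -> dict:
    """Return the number of pods and of switches per layer in a k-ary fat tree."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number of ports")
    half = k // 2
    counts = {
        "pods": k,
        "edge": k * half,          # k/2 edge switches in each of k pods
        "aggregation": k * half,   # k/2 aggregation switches in each of k pods
        "core": half * half,       # (k/2)^2 core switches shared by all pods
    }
    counts["total_switches"] = counts["edge"] + counts["aggregation"] + counts["core"]
    return counts


print(fat_tree_switch_counts(4))
print(fat_tree_switch_counts(48))
```

With k = 4 this gives 8 edge, 8 aggregation, and 4 core switches (20 in total); with k = 48 it gives 2,880 switches in total.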
Challenges in using fat-tree topology in data centers
Routing protocols for switches
Layer 2 switch algorithm: data plane flooding
Layer 3 IP routing: shortest path routing may not utilize path diversity
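To make the path-diversity point concrete, the sketch below builds a small fat tree and counts the equal-length shortest paths between two servers. The wiring (aggregation switch a in every pod connects to core group a) and the node naming are assumptions for illustration; a routing protocol that installs a single shortest path per destination would use only one of the paths counted here.

```python
from collections import deque


def build_fat_tree(k: int) -> dict:
    """Return an adjacency map for a k-ary fat tree (assumed standard wiring)."""
    half = k // 2
    adj: dict = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for pod in range(k):
        for e in range(half):
            edge = ("edge", pod, e)
            for h in range(half):
                link(("server", pod, e, h), edge)
            for a in range(half):
                link(edge, ("agg", pod, a))
        for a in range(half):
            for c in range(half):
                link(("agg", pod, a), ("core", a, c))
    return adj


def count_shortest_paths(adj: dict, src, dst) -> tuple:
    """Breadth-first search returning (hop count, number of shortest paths)."""
    dist = {src: 0}
    paths = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                paths[nxt] = paths[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                paths[nxt] += paths[node]
    return dist[dst], paths[dst]


tree = build_fat_tree(4)
print(count_shortest_paths(tree, ("server", 0, 0, 0), ("server", 1, 0, 0)))
```

For k = 4, two servers in different pods are 6 hops apart with 4 equal-cost paths between them, one per core switch.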
Cloud Logs
[Figure: Logs Explorer screenshot. Query resource.type="k8s_container" over the last hour (7/1/21, 12:41 PM to 1:41 PM PDT). The results list alternates "request made" and "Status log 200" entries, with one "Status log 500". The log fields panel shows counts by severity (Info 2,228; Error 24), log name (stdout 2,147; stderr 105), project ID, location, cluster name, namespace name (default 2,168; kube-system 84), and pod name.]
Google Cloud Logging : Case Study
Cloud Logging is a service for storing, viewing and interacting with logs.
Answers the question "Who did what, where, and when?" within GCP projects
Maintains non-tamperable audit logs for each project and organization
Log buckets are a regional resource, which means the infrastructure that stores, indexes, and searches the logs is located in a specific geographical location.
Google manages that infrastructure so that the applications are available redundantly across the zones within that region.
Cloud Logs : Parameter
[Figure: Logs Explorer screenshot with one "Status log 200" entry expanded. The entry shows the fields insertId, labels, logName, receiveTimestamp, resource (type "k8s_container" with labels cluster_name, container_name, location, namespace_name, pod_name, project_id), severity ("INFO"), textPayload, and timestamp.]
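The Logs Explorer query resource.type="k8s_container" selects entries by a field of the log entry. The sketch below imitates that selection on plain Python dictionaries so the idea can be tried without a cloud project. It does not call the Cloud Logging API; the sample entries are hypothetical, and only the field names (resource.type, severity, textPayload) and the severity ordering follow the log entry structure.

```python
# Severity levels from lowest to highest.
SEVERITY_ORDER = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def filter_entries(entries, resource_type, min_severity="DEFAULT"):
    """Keep entries of one resource type at or above a minimum severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    return [
        entry
        for entry in entries
        if entry["resource"]["type"] == resource_type
        and SEVERITY_ORDER.index(entry.get("severity", "DEFAULT")) >= threshold
    ]


# Hypothetical sample data, shaped like simplified log entries.
SAMPLE_ENTRIES = [
    {"resource": {"type": "k8s_container"}, "severity": "INFO",
     "textPayload": "Status log 200"},
    {"resource": {"type": "k8s_container"}, "severity": "ERROR",
     "textPayload": "Status log 500"},
    {"resource": {"type": "gce_instance"}, "severity": "INFO",
     "textPayload": "instance started"},
]

for entry in filter_entries(SAMPLE_ENTRIES, "k8s_container", min_severity="ERROR"):
    print(entry["severity"], entry["textPayload"])
```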
Log Sources
Platform logs: Services, Audit logs
Custom logs: Agent, API and client libraries, User apps, Third-party applications, System software
Network logs: VPC flow, Firewall rules, NAT gateways, Load Balancer
Log Sources
Cloud Platform Logs
Cloud platform logs are service-specific logs that can help troubleshoot and debug issues, as well as better understand the Google Cloud services.
Cloud Platform logs are logs generated by GCP services and vary depending on which Google Cloud resources are used in your Google Cloud project or organization.
Access Transparency Logs provides logs of actions taken by Google staff when accessing the Google Cloud content.
can help track compliance with the organization's legal and regulatory requirements.
have 400-day retention
Log Sources
Security Logs
Audit Logs
Cloud Audit Logs includes three types of audit logs:
Admin Activity
Data Access
System Event
Cloud Audit Logs provide audit trails of administrative changes and data accesses of the Google Cloud resources.
Admin Activity
captures user-initiated resource configuration changes
enabled by default
no additional charge
admin activity: administrative actions and API calls
have 400-day retention
System Events
captures system initiated resource configuration changes
enabled by default
no additional charge
system events: GCE system events such as live migration
have 400-day retention
Data Access logs
Log API calls that create, modify, or read user-provided data, for example an object created in a GCS bucket.
30-day retention
disabled by default
size can be huge
charged beyond free limits
Available for GCP-visible services only. Not available for public resources.
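The defaults listed above for the three audit log types can be summarized in a small lookup table. This is an illustrative sketch only: the class, field, and function names are hypothetical, and the values are copied from the bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditLogPolicy:
    enabled_by_default: bool
    retention_days: int
    may_incur_charges: bool


# Values taken from the audit log descriptions above.
AUDIT_LOG_POLICIES = {
    "Admin Activity": AuditLogPolicy(True, 400, False),
    "System Event": AuditLogPolicy(True, 400, False),
    # Data Access logs are charged only beyond the free limits.
    "Data Access": AuditLogPolicy(False, 30, True),
}


def is_retained(log_type: str, age_days: int) -> bool:
    """Return True if an entry of this age is still within the retention period."""
    return age_days <= AUDIT_LOG_POLICIES[log_type].retention_days


print(is_retained("Admin Activity", 90))  # within 400 days
print(is_retained("Data Access", 90))     # past the 30-day retention
```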
Log Sources
User Logs
User logs are generated by user software, services, or applications and written to Cloud Logging using a logging agent, the Cloud Logging API, or the Cloud Logging client libraries
Agent logs
produced by an installed logging agent that collects logs from user applications and VMs
covers log data from third