Hybrid Cloud Cost Optimization, Security Architecture & Automation – Meeting Notes
Summary of progress and priorities
- APAC and EMEA teams are steadily implementing PWC and CloudHealth rightsizing options; follow-up with the Americas team is ongoing to get an update as its work starts picking up.
- Azure China tenancy activities are gaining momentum: auto power off, reconsolidation of services, decommissioning, and retention policy trimming are all underway as part of a broader set of waste and service reductions. Adam is tracking the cost savings we expect to realize.
- A more scientific approach is being used to quantify the hybrid benefit with Microsoft; Adam pulled a fresh report after many VM size changes. VM size correlates with OS licensing cost: bigger VMs carry higher OS licensing costs.
- For next year, there is an emphasis on being more conservative with software licensing (hard licensing) and on identifying which small VMs provide more benefit on basic licensing rather than giving hub licenses to larger servers. This drives further work on the PXQ model for cost charging.
- The cost breakdown is being analyzed across three regions, with line items for NetOps, infrastructure services (on‑prem backup protection and NetOps on top of data), cloud ops costs, and tooling. Feedback and time with Sean and Alex Boaz are being sought to refine the view; alignment with BXQ and the new TCS contract is also in scope to understand contributions to central technology costs.
- Navec migration progress: the non-production environment has migrated without issues; production grouping plans are in progress, with a focus on pushing production servers across.
- A staging approach is being introduced: the non-prod environment will use jump box servers to constrain management connectivity to predictable IP addresses, allowing access lists to secure management traffic.
- In-house vs regional/global teaming: NaviX relies on two DevOps engineers and ISIS in Brazil as the local IT team; regional and global teams provide governance and security alignment. TCS architecture is ensuring security patterns are followed while central management is maintained.
- Security PoC and decommissioning decisions: ITS IPS PoC (Azure Firewall Premium with traffic inspection and Sentinel integration) is completed; DOC environment may be decommissioned if not needed due to cost. Americas discussions are ongoing to determine the migration path.
- Planning for next steps: better engagement from Sean and Blair is expected, likely next week; Alex Boaz's availability remains uncertain.
- Scheduling and availability issues: Wahegon’s schedule clashes with school drop-off, so the weekly calls may be rescheduled. Separately, a finance discussion with Steve revisited the earlier PXQ model calculations and how spend is determined (a 26% figure was cited as a reference point when discussing contributory spend).
- Sensor technology and multi-entity data sharing (security, EEC, data, global marketing, HR) make cost calculation complicated; the team is trying to understand the initial calculation that led to the 26% spend and how to project the PXQ model for the next year.
- DENSO-related security initiatives: pursuing a security-driven centralized logging configuration; discussions with TCS ops are underway to enable central logging for syslogs and to determine how to present the scope and ingestion links. Kevin Stone (Merkle) is the go‑to person for deeper details on configuration.
- Merkle/Syslog/Sentinel architecture overview (high level): Windows and Linux servers push logs to a Merkle-managed Syslog forwarder, which forwards to regional collectors (APAC, EMEA, US). Filtering occurs at the Syslog collector level to reduce data going to Sentinel. Sentinel is Azure-based; retention policies are defined in Log Analytics, and purging can be limited to what is ingested. The aim is to trim non‑interesting events before they reach Sentinel to control costs.
- Agent model vs. syslog filtering: there is a potential alternative to direct server-to-Sentinel logging that allows filtering, but current discussions indicate that filtering is primarily performed by the Merkle intermediaries and the Syslog forwarders. Kevin can clarify the exact filtering rules and configurations.
- Coverage across clouds: Merkle has expanded logging beyond Windows/Linux servers to include cloud platforms (AWS, GCP), and is starting to include VMware and other devices where appropriate. There is ongoing work to ensure consistent logging architecture across cloud environments.
- Azure/AWS/GCP tagging and integration: the security team confirmed Azure tagging works; tagging is now enabled for AWS and slated for GCP. There is a plan to harmonize tagging across Merkle and TCS: a central tag name with a value indicating “centrally supported” vs “Merkle” vs others. The proposed values include terms like “central supported Merkle” and “central supported TCS”. There is a preference to align on a single tag schema to avoid confusion.
- AWS tagging status and next steps: the Merkle team is implementing tagging in AWS using the same tag structure; an update on AWS tagging is to be confirmed. Changes to tagging scripts may be needed to reflect the updated universal tag values.
- Future alignment on tagging is important to ensure proper queueing in security workflows and avoidance of misrouted tickets.
- Additional Azure changes: Microsoft is changing public IP addressing rules—basic SKU IPs must be converted to standard by the end of the month. This will require brief interruptions and changes to IP addresses attached to VMs and load balancers.
- Internal automation and platform-level improvements: a proposal to start an internal global project focused on automation across teams. Current discussions focus on building an automation platform that reduces manual steps and formalizes standard processes.
- Service catalog automation discussion: improvements to ServiceNow requests are needed to support multi-item requests (e.g., Synapse environments with data protection requirements). There is interest in building automation on the back end to translate front-end requests into repeatable, standardized deployments (consider automation with active directory groups and provisioning based on predefined templates).
- Rundeck exploration: suggestion to evaluate Rundeck (community edition) as an automation tool; potential personal deployment on a laptop or via Docker for a playground, with VPN access for testing. James is supportive of evaluating Rundeck for a platform that could be scaled to the TCS level.
- Next steps and meetings: Andrew’s Friday morning call is a potential forum to align on requirements, including security configurations; the team will coordinate with Kevin for deeper technical details. James and the broader team plan to regroup through Sean’s All Hands and align on where things stand.
- Overall tone: the meeting signals steady progress across multiple streams, with a focus on automation, cost transparency, secure and scalable logging, cloud governance, and structured project planning. The team anticipates better engagement and progress in the coming week, while acknowledging some scheduling challenges and ongoing decision points.
Key concepts and mechanisms
- Rightsizing and cost control
  - PWC and CloudHealth rightsizing options are being implemented regionally.
  - Auto power off, reconsolidation, and retention policy trimming reduce waste and running costs.
  - Adam’s hybrid-benefit analysis aims to quantify the cost savings of VM size changes and licensing strategies.
  - VM size changes affect OS licensing costs; a move toward smaller VMs or conservative licensing could yield savings.
  - PXQ model for cost charging is being refined to map cost drivers and allocations across lines of business and regions.
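The licensing point above (deciding which VMs should receive a limited pool of hybrid-benefit licenses) can be illustrated with a small ranking sketch. This is not the team's actual method or Microsoft's licensing rules: the `Vm` fields, the per-VM charges, the core pool, and the "saving per core" ranking are all hypothetical assumptions chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    cores: int
    monthly_os_license_cost: float  # hypothetical pay-as-you-go OS charge


def allocate_license_cores(vms, core_pool):
    """Greedy heuristic: cover the VMs with the highest saving per core first.

    Returns the names of covered VMs and the total monthly charge avoided.
    Not guaranteed optimal; it only illustrates the ranking idea.
    """
    ranked = sorted(
        vms,
        key=lambda vm: vm.monthly_os_license_cost / vm.cores,
        reverse=True,
    )
    covered = []
    saving = 0.0
    remaining = core_pool
    for vm in ranked:
        if vm.cores <= remaining:
            covered.append(vm.name)
            saving += vm.monthly_os_license_cost
            remaining -= vm.cores
    return covered, saving


# Illustrative fleet; every figure here is invented.
FLEET = [
    Vm("large-a", cores=8, monthly_os_license_cost=100.0),
    Vm("small-a", cores=2, monthly_os_license_cost=20.0),
    Vm("small-b", cores=4, monthly_os_license_cost=36.0),
]

if __name__ == "__main__":
    print(allocate_license_cores(FLEET, core_pool=8))
```

Real licensing terms (minimum cores per VM, edition differences) would change the ranking, so the figures from Adam's report and the actual license terms should replace these placeholders before drawing conclusions.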
- Cost accounting across regions
  - Costs are broken down into three regions and include line items for NetOps, on‑prem backups, cloud operations, and tooling.
  - The goal is to map central technology costs and understand what information drives uplift.
- Migration and environment strategy
  - Non-prod environments migrate first; production servers are to be grouped and migrated later.
  - Jump box servers will serve as controlled entry points for management traffic with fixed IPs and ACLs.
  - Local teams (DevOps engineers, ISIS in Brazil) support migrations; TCS architecture ensures security patterns are followed.
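The jump box idea above works because management traffic then comes from a small set of fixed source addresses that an access list can name explicitly. A minimal sketch of that check, assuming hypothetical jump box addresses and management ports (none of these values come from the meeting):

```python
import ipaddress

# Hypothetical fixed addresses for the jump boxes.
ALLOWED_MGMT_SOURCES = [
    ipaddress.ip_network("10.20.0.4/32"),
    ipaddress.ip_network("10.20.0.5/32"),
]

# Assumed management ports: SSH and RDP.
MGMT_PORTS = {22, 3389}


def is_management_allowed(source_ip: str, dest_port: int) -> bool:
    """Return True only for management ports reached from a jump box."""
    if dest_port not in MGMT_PORTS:
        # This rule only governs management traffic.
        return False
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_MGMT_SOURCES)


if __name__ == "__main__":
    print(is_management_allowed("10.20.0.4", 22))   # jump box over SSH
    print(is_management_allowed("10.20.0.99", 22))  # some other host
```

In practice the same allow-list would be expressed as network security rules rather than application code; the sketch only shows why predictable source IPs make the rule short.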
- Security tooling and PoCs
  - ITS IPS PoC in Azure: Azure Firewall Premium with traffic inspection and Sentinel integration.
  - Discussion about DOC environment decommissioning due to cost and future direction.
  - Potential sequencing: starting with Americas for future deployments.
- Logging architecture and cost control
  - Merkle’s syslog-forwarding model captures Windows and Linux logs, plus cloud platform logs, and routes them through regional collectors.
  - Filtering happens at the Syslog collectors to minimize data reaching Sentinel, reducing storage and processing costs.
  - There is an ongoing evaluation of agent-based filtering vs. syslog-based filtering; a deeper dive with Kevin Stone is suggested.
  - Sentinel retention is defined in Log Analytics; options for purging data after ingestion are limited, so pre-ingestion filtering is essential.
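The pre-ingestion filtering described above can be sketched as a severity check at the collector. The actual Merkle filter rules are not known here (Kevin Stone holds those details); the threshold below is an assumption. The only fixed fact used is the standard syslog encoding, where the leading `<PRI>` value equals facility × 8 + severity and lower severity numbers are more urgent.

```python
import re

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")

# Assumption: forward warning (4) and anything more severe; drop 5 to 7.
MAX_SEVERITY_TO_FORWARD = 4


def severity_of(line):
    """Extract the syslog severity (0-7) from the PRI header, or None."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    return int(match.group(1)) % 8


def should_forward(line):
    """Decide whether a syslog line is worth sending on to the SIEM."""
    severity = severity_of(line)
    if severity is None:
        # Unparseable lines are forwarded rather than silently dropped.
        return True
    return severity <= MAX_SEVERITY_TO_FORWARD


def filter_events(lines):
    """Keep only the lines that pass the severity threshold."""
    return [line for line in lines if should_forward(line)]


if __name__ == "__main__":
    sample = [
        "<34>Oct 11 22:14:15 host1 su: 'su root' failed",   # severity 2
        "<30>Oct 11 22:14:16 host1 cron[99]: job started",  # severity 6
    ]
    print(filter_events(sample))
```

A real collector would more likely apply this in its own configuration language and filter on facility, program name, or event ID as well; severity alone is the simplest illustration.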
- Cloud tagging and governance
  - Tagging has started across Azure and AWS and is planned for GCP; alignment is key to separating central Merkle and TCS ownership.
  - A unified tag strategy is being discussed to avoid ticket routing confusion; plan to adjust scripted tags accordingly.
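To show why a single tag schema matters for ticket routing, here is a small sketch that maps one tag's value to a support queue and flags resources that do not conform. The tag name `support-model`, the exact value spellings, and the queue names are hypothetical; the agreed schema was still open at the time of the meeting.

```python
# Hypothetical tag key and values; the real schema is still to be agreed.
SUPPORT_TAG = "support-model"
ROUTING = {
    "central-supported-merkle": "merkle-queue",
    "central-supported-tcs": "tcs-queue",
}
FALLBACK_QUEUE = "triage-queue"


def route_ticket(resource_tags):
    """Pick a queue from the resource's support tag, falling back to triage."""
    value = resource_tags.get(SUPPORT_TAG, "").strip().lower()
    return ROUTING.get(value, FALLBACK_QUEUE)


def find_nonconforming(resources):
    """Return names of resources whose support tag is missing or unknown."""
    return [
        name
        for name, tags in resources.items()
        if tags.get(SUPPORT_TAG, "").strip().lower() not in ROUTING
    ]


if __name__ == "__main__":
    inventory = {
        "vm-app-01": {"support-model": "central-supported-tcs"},
        "vm-app-02": {"support-model": "Merkle"},  # older, non-unified value
        "vm-app-03": {},
    }
    print(route_ticket(inventory["vm-app-01"]))
    print(find_nonconforming(inventory))
```

With one tag key and a closed set of values, misrouted tickets reduce to the resources this check reports, which is also the list the tagging scripts would need to correct.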
- Azure IP address transition and network changes
  - End-of-month deadline to convert basic IPs to standard across Azure; may cause brief Internet interruptions.
  - Related changes to public IP addresses and load balancer associations require coordination across networks.
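Before scheduling the conversions, it helps to know which public IPs are still on the basic SKU and what they are attached to. The sketch below filters an exported inventory; it assumes records shaped like the JSON produced by listing public IPs with the Azure CLI (a `sku.name` field and an optional `ipConfiguration.id`). The sample data is invented, and the field names should be checked against a real export.

```python
import json


def basic_sku_ips(inventory_json):
    """List public IPs still on the Basic SKU, with their attachment if any."""
    records = json.loads(inventory_json)
    result = []
    for record in records:
        sku_name = (record.get("sku") or {}).get("name", "")
        if sku_name.lower() == "basic":
            attached = (record.get("ipConfiguration") or {}).get("id")
            result.append({"name": record.get("name"), "attached_to": attached})
    return result


# Invented sample; the shape is assumed, not taken from the meeting.
SAMPLE_INVENTORY = """[
  {"name": "pip-web-01", "sku": {"name": "Basic"},
   "ipConfiguration": {"id": "nic-web-01/ipconfig1"}},
  {"name": "pip-lb-01", "sku": {"name": "Standard"},
   "ipConfiguration": {"id": "lb-01/frontend"}},
  {"name": "pip-spare", "sku": {"name": "Basic"}, "ipConfiguration": null}
]"""

if __name__ == "__main__":
    for entry in basic_sku_ips(SAMPLE_INVENTORY):
        print(entry)
```

Attached addresses are the ones that need a coordinated interruption window; unattached ones can be converted or released without affecting traffic.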
- Automation mindset and platform strategy
  - Proposal to establish a global internal project to modernize automation capabilities.
  - Exploration of Rundeck as a production-grade automation platform; potential for a shared playground and VPN-backed deployment.
  - Service catalog improvements to enable multi-item requests and standardized deployment content (e.g., Synapse with data protection, resource group access, etc.).
  - Align automation with ServiceNow workflows and Active Directory groups for scalable provisioning.
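The back-end translation discussed above (one front-end request expanding into a repeatable set of deployment steps) can be sketched as a template lookup. The template names and step names are hypothetical placeholders, and nothing here calls ServiceNow, Active Directory, or Rundeck; it only illustrates the mapping such a platform would need to hold.

```python
# Hypothetical catalog: each requestable item maps to ordered deployment steps.
TEMPLATES = {
    "synapse-environment": [
        "resource-group",
        "ad-access-group",
        "synapse-workspace",
        "data-protection-policy",
    ],
    "storage-account": [
        "resource-group",
        "ad-access-group",
        "storage-account",
    ],
}


def expand_request(items):
    """Expand a multi-item request into an ordered, de-duplicated step list."""
    steps = []
    for item in items:
        if item not in TEMPLATES:
            raise ValueError(f"no template defined for catalog item: {item}")
        for step in TEMPLATES[item]:
            if step not in steps:
                steps.append(step)
    return steps


if __name__ == "__main__":
    print(expand_request(["synapse-environment", "storage-account"]))
```

Shared steps such as the resource group appear once, which is the main benefit of handling a multi-item request as a single unit rather than as separate tickets.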
- People, roles, and collaboration
  - Stakeholders mentioned: Adam (cost savings and hybrid benefit), Sean, Blair, Alex Boaz, JP, Andrew, Achilles, Kevin Stone (Merkle), James, and Nath (noted in the call).
  - The team emphasizes cross-region collaboration (APAC, EMEA, Americas) and the ongoing involvement of regional teams for implementation and governance.
- Open questions and follow-ups
  - Confirm AWS tagging status and next steps for GCP tagging; determine a single, agreed-upon tagging scheme.
  - Clarify exact scope and ingestion targets for central logging (Syslog forwarders, Merkle collectors, and Sentinel integration).
  - Validate whether Linux logs are being captured and processed equally across cloud environments.
  - Decide on the Friday call on logging architecture alignment and secure onboarding with Kevin Stone.
  - Determine the best path for onboarding and automation tooling (Rundeck) to support global operations and reduce manual work.
- Practical implications
  - The changes touch cost, security posture, and operational efficiency across multiple clouds and regions.
  - There may be brief service interruptions during IP standardization and migration events.
  - The move toward automation and standardized service provisioning should yield long-term time savings and more predictable cost management.
- Summary takeaway
  - The team is actively pursuing cost optimization, tighter security/logging controls, and scalable automation. Several decisions remain open (tagging, on/off strategy for certain environments, and tooling adoption), but the path forward is clearly focused on standardization, cross-team collaboration, and measured progress in the coming weeks.