Updates on Alerts Review Document

  • Waseem requested an update on the alerts review document.

    • Michael reported that he had sent a customer-friendly summary of last Tuesday's cadence via email.

    • The actual customer agreement documentation is extensive (approximately 300 pages).

    • Michael created a summarized version that includes:

      • Flow charts explaining the alerting system

      • A breakdown of the documentation's contents

    • This document was sent in an email attached to the Tuesday night update.

    • Michael is open to walking through the document with Waseem if there are any questions.

Daily Report on Tickets

  • Michael returned to discussing the daily report on incident tickets.

    • A total of approximately 29 tickets was reported, categorized as follows (see the aggregation sketch at the end of this section):

      • Most tickets are P2 (priority 2), related to degraded systems

      • Many tickets are P3, concerning applications that are down

      • A few tickets are P4, which do not compromise production

    • Breakdown of the top alerting hosts by site over the past seven days:

      • Tokyo Data Center: highest alert count

      • Bangalore location: second highest

      • Madrid: third place

      • Grasberg: fourth place
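
As a minimal illustration of how this daily breakdown could be produced, the sketch below aggregates a ticket export by priority and by site over a seven-day window. The record fields (priority, site, opened_at) are assumptions for illustration, not the actual export schema.

```python
from collections import Counter
from datetime import datetime, timedelta

now = datetime.now()

# Hypothetical ticket records; field names are illustrative, not the real export schema.
tickets = [
    {"id": 5593, "priority": "P2", "site": "Provo", "opened_at": now - timedelta(days=1)},
    {"id": 1019, "priority": "P2", "site": "Aquila", "opened_at": now - timedelta(days=2)},
    {"id": 10860, "priority": "P3", "site": "Ascent", "opened_at": now - timedelta(days=3)},
    # ... remaining tickets from the export
]

def daily_breakdown(tickets, days=7):
    """Count tickets by priority and by site within the reporting window."""
    cutoff = now - timedelta(days=days)
    recent = [t for t in tickets if t["opened_at"] >= cutoff]
    by_priority = Counter(t["priority"] for t in recent)
    by_site = Counter(t["site"] for t in recent)
    return dict(by_priority), by_site.most_common()  # sites sorted by alert count

by_priority, top_sites = daily_breakdown(tickets)
print("Tickets by priority:", by_priority)
print("Top alerting sites:", top_sites)
```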

High-Priority Ticket Updates

  • Michael provided updates on specific high-priority tickets:

    • Incident 5593 (Provo location):

      • A replacement ION 9200 for fault THAL1 is en route.

      • Site status: stable and awaiting installation and return of the defective unit.

      • Tracking information should be in the relevant ticket; Michael will notify local contacts of the incoming unit.

    • Incident 1019 (Aquila location):

      • Teams live event issues stabilized following MTU adjustments.

      • Ongoing investigation into DF-flagged (Don't Fragment) packet drops; the vendor is still being consulted (see the MTU probe sketch after this list).

    • Incident 10972 (Palo Alto):

      • Ongoing investigation into route advertisement failures.

      • The engineering team is reviewing the configuration and several firmware options.

    • Incident 3644:

      • Traffic was dropped inside Prisma Access.

      • A joint troubleshooting session is scheduled to address the issue.

    • Incident 10860 (MAC flapping on Ascent):

      • The issue persists; affected ports remain disabled (see the log-scan sketch after this list).

      • Palo Alto is currently reviewing for a fix.
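
The MTU adjustments and DF-flagged drops in Incident 1019 can be checked with a simple path-MTU probe: pings with the Don't Fragment bit set at varying payload sizes reveal the largest packet that survives the path. A minimal sketch, assuming Linux iputils ping and a reachable target; the host name is a placeholder.

```python
import subprocess

def probe_path_mtu(host: str, low: int = 1200, high: int = 1472) -> int:
    """Binary-search the largest ICMP payload that passes with DF set.

    Assumes Linux iputils ping (-M do sets DF, -s sets payload size).
    Path MTU = payload + 28 bytes (20-byte IP header + 8-byte ICMP header).
    """
    best = 0
    while low <= high:
        size = (low + high) // 2
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(size), host],
            capture_output=True,
        )
        if result.returncode == 0:   # payload fit: try larger
            best = size
            low = size + 1
        else:                        # DF-flagged drop: try smaller
            high = size - 1
    return best + 28 if best else 0

# Example with a placeholder host:
# print(probe_path_mtu("teams-egress.example.net"))
```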
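
For Incident 10860, flapping MACs typically show up as repeated host-move messages in switch syslog. The sketch below is a generic scan that counts MAC-move events per port and flags ports over a threshold; the log-line regex is an assumption, since the actual switch platform's message format was not specified.

```python
import re
from collections import Counter

# Hypothetical syslog line format; real platforms vary in wording.
FLAP_RE = re.compile(r"MAC (?P<mac>[0-9a-f:]{17}) moved .* port (?P<port>\S+)", re.I)

def flapping_ports(log_lines, threshold=5):
    """Count MAC-move events per destination port; ports over the threshold are flap candidates."""
    moves = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m:
            moves[m.group("port")] += 1
    return {port: n for port, n in moves.items() if n >= threshold}

# Example with synthetic log lines:
logs = ["MAC 00:11:22:33:44:55 moved from port Gi1/0/1 to port Gi1/0/2"] * 6
print(flapping_ports(logs))  # {'Gi1/0/2': 6}
```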

Report on System Performance

  • Michael shared information on uptimes, packet loss, and pre/post Quality of Experience (QoE):

    • No significant dips were observed in pre-QoE metrics.

    • Post-QoE metrics show smooth performance; a previous dip related to Bengaluru site maintenance was discussed in Tuesday's meetings.

    • Most of the high data points come from data centers, which is the expected result.

    • Observations about data points (see the packet-loss sketch after this list):

      • Smaller branches and offices peak but stay within the acceptable bandwidth (50 Gbps).

      • Aquila has a higher user count (around 2,000), which affects its usage metrics.
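
As a minimal sketch of the kind of check behind these uptime, packet-loss, and QoE numbers, the snippet below computes per-interval packet loss and flags QoE dips. The sample structure and both thresholds are assumptions for illustration, not contractual values.

```python
# Hypothetical per-interval samples: (packets_sent, packets_received, qoe_score 0-100).
samples = [
    (10_000, 9_990, 97.5),
    (10_000, 9_998, 98.1),
    (10_000, 9_700, 82.0),  # a dip, e.g. during site maintenance
]

LOSS_THRESHOLD_PCT = 2.0  # illustrative threshold, not an SLA value
QOE_DIP = 90.0

for i, (sent, received, qoe) in enumerate(samples):
    loss_pct = 100.0 * (sent - received) / sent
    flags = []
    if loss_pct > LOSS_THRESHOLD_PCT:
        flags.append(f"packet loss {loss_pct:.2f}%")
    if qoe < QOE_DIP:
        flags.append(f"QoE dip ({qoe:.1f})")
    print(f"interval {i}:", "; ".join(flags) if flags else "ok")
```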

NetOps Escalations Q&A

  • Michael is active in NetOps escalations and invited questions, particularly about:

    • Incident 10860 (MAC flapping) where he will verify ongoing investigations with Palo Alto.

    • Confirmation of whether current deployment models are appropriate, especially regarding redundancy in configurations.

Updates on Ongoing Issues and Upgrades

  • Michael addressed some ongoing issues:

    • Inquiries about upgrades for the Tenton site related to CPU spiking and latency.

    • Discussion of a memory leak impacting CPU performance (see the monitoring sketch after this list).

    • Review of pending engineering updates from Palo Alto on firmware version 6.5.3, which addresses the ongoing issues.
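
A memory leak that degrades CPU tends to show up as a steadily rising memory baseline alongside CPU spikes. Below is a minimal host-side sketch of spotting that trend, assuming the third-party psutil package is available; the sample count and growth threshold are illustrative.

```python
import psutil  # third-party: pip install psutil

def sample(interval_s: float = 1.0, count: int = 5):
    """Collect (cpu_percent, memory_percent) pairs at a fixed interval."""
    readings = []
    for _ in range(count):
        cpu = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
        mem = psutil.virtual_memory().percent
        readings.append((cpu, mem))
    return readings

readings = sample()
mems = [m for _, m in readings]
# A monotonically rising memory baseline across samples hints at a leak.
if all(b >= a for a, b in zip(mems, mems[1:])) and mems[-1] - mems[0] > 0.5:
    print("memory climbing steadily; possible leak:", mems)
else:
    print("no sustained memory growth in this window:", mems)
```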

Future Meetings and Coordination

  • Discussion of upcoming scheduling and actions:

    • Scheduling of an upcoming meeting on March 8 for sites in Chandler and Ashburn.

    • The timeline is to be confirmed, including whether additional time needs to be allocated for the joint meeting.

    • Need for return shipping labels from the Lithia Springs location post-device cleanup and account separation.

Open Action Points

  • Action items include:

    • Follow up with Palo Alto regarding incident updates.

    • Get shipping labels prepared for returning devices.

    • Confirm a scheduled meeting for additional insights on device concerns, focused on configuration stability.

    • Notify local contacts about changes and updates in ticket resolutions.
