Updates on Alerts Review Document

  • Waseem requested an update on the alerts review document.

    • Michael reported that he had sent a customer-friendly summary of last Tuesday's cadence via email.

    • The actual customer agreement documentation is extensive (approximately 300 pages).

    • Michael created a summarized version that includes:

      • Flow charts explaining the alerting system

      • A breakdown of the documentation's contents

    • This document was sent in an email attached to the Tuesday night update.

    • Michael is open to walking through the document with Waseem if there are any questions.

Daily Report on Tickets

  • Michael returned to discussing the daily report on incident tickets.

    • A total of approximately 29 tickets was reported, categorized as follows (see the aggregation sketch at the end of this section):

      • Most tickets are P2 (priority 2), related to degraded systems

      • Many tickets are P3, concerning applications that are down

      • A few tickets are P4, which do not compromise production

    • Breakdown of the top alerting hosts by site over the past seven days:

      • Tokyo Data Center: highest alert count

      • Bangalore location: second highest

      • Madrid: third place

      • Grasberg: fourth place
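
As a minimal illustration of how this daily breakdown could be produced, the sketch below aggregates a ticket export by priority and by site over a seven-day window. The record fields (priority, site, opened_at) are assumptions for illustration, not the actual export schema.

```python
from collections import Counter
from datetime import datetime, timedelta

now = datetime.now()

# Hypothetical ticket records; field names are illustrative, not the real export schema.
tickets = [
    {"id": 5593, "priority": "P2", "site": "Provo", "opened_at": now - timedelta(days=1)},
    {"id": 1019, "priority": "P2", "site": "Aquila", "opened_at": now - timedelta(days=2)},
    {"id": 10860, "priority": "P3", "site": "Ascent", "opened_at": now - timedelta(days=3)},
    # ... remaining tickets from the export
]

def daily_breakdown(tickets, days=7):
    """Count tickets by priority and by site within the reporting window."""
    cutoff = now - timedelta(days=days)
    recent = [t for t in tickets if t["opened_at"] >= cutoff]
    by_priority = Counter(t["priority"] for t in recent)
    by_site = Counter(t["site"] for t in recent)
    return dict(by_priority), by_site.most_common()  # sites sorted by alert count

by_priority, top_sites = daily_breakdown(tickets)
print("Tickets by priority:", by_priority)
print("Top alerting sites:", top_sites)
```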

High-Priority Ticket Updates

  • Michael provided updates on specific high-priority tickets:

    • Incident 5593 (Provo location):

      • A replacement ION 9200 for fault THAL1 is en route.

      • Site status: stable and awaiting installation and return of the defective unit.

      • Tracking information should be in the relevant ticket; Michael will notify local contacts of the incoming unit.

    • Incident 1019 (Aquila location):

      • Teams live event issues stabilized following MTU adjustments.

      • Ongoing investigation into DF-flagged (Don't Fragment) packet drops; the vendor is still being consulted (see the MTU probe sketch after this list).

    • Incident 10972 (Palo Alto):

      • Ongoing investigation into route advertisement failures.

      • The engineering team is reviewing the configuration and several firmware options.

    • Incident 3644:

      • Traffic was dropped inside Prisma Access.

      • A joint troubleshooting session is scheduled to address the issue.

    • Incident 10860 (MAC flapping on Ascent):

      • The issue persists; affected ports remain disabled (see the log-scan sketch after this list).

      • Palo Alto is currently reviewing for a fix.
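
The MTU adjustments and DF-flagged drops in Incident 1019 can be checked with a simple path-MTU probe: pings with the Don't Fragment bit set at varying payload sizes reveal the largest packet that survives the path. A minimal sketch, assuming Linux iputils ping and a reachable target; the host name is a placeholder.

```python
import subprocess

def probe_path_mtu(host: str, low: int = 1200, high: int = 1472) -> int:
    """Binary-search the largest ICMP payload that passes with DF set.

    Assumes Linux iputils ping (-M do sets DF, -s sets payload size).
    Path MTU = payload + 28 bytes (20-byte IP header + 8-byte ICMP header).
    """
    best = 0
    while low <= high:
        size = (low + high) // 2
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(size), host],
            capture_output=True,
        )
        if result.returncode == 0:   # payload fit: try larger
            best = size
            low = size + 1
        else:                        # DF-flagged drop: try smaller
            high = size - 1
    return best + 28 if best else 0

# Example with a placeholder host:
# print(probe_path_mtu("teams-egress.example.net"))
```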
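
For Incident 10860, flapping MACs typically show up as repeated host-move messages in switch syslog. The sketch below is a generic scan that counts MAC-move events per port and flags ports over a threshold; the log-line regex is an assumption, since the actual switch platform's message format was not specified.

```python
import re
from collections import Counter

# Hypothetical syslog line format; real platforms vary in wording.
FLAP_RE = re.compile(r"MAC (?P<mac>[0-9a-f:]{17}) moved .* port (?P<port>\S+)", re.I)

def flapping_ports(log_lines, threshold=5):
    """Count MAC-move events per destination port; ports over the threshold are flap candidates."""
    moves = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m:
            moves[m.group("port")] += 1
    return {port: n for port, n in moves.items() if n >= threshold}

# Example with synthetic log lines:
logs = ["MAC 00:11:22:33:44:55 moved from port Gi1/0/1 to port Gi1/0/2"] * 6
print(flapping_ports(logs))  # {'Gi1/0/2': 6}
```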

Report on System Performance

  • Michael shared information on uptimes, packet loss, and pre/post Quality of Experience (QoE):

    • No significant dips were observed in pre-QoE metrics.

    • Post-QoE metrics show smooth performance; a previous dip related to Bengaluru site maintenance was discussed in Tuesday's meetings.

    • Most of the high data points come from data centers, which is the expected result.

    • Observations about data points (see the packet-loss sketch after this list):

      • Smaller branches and offices peak but stay within the acceptable bandwidth (50 Gbps).

      • Aquila has a higher user count (around 2,000), which affects its usage metrics.
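
As a minimal sketch of the kind of check behind these uptime, packet-loss, and QoE numbers, the snippet below computes per-interval packet loss and flags QoE dips. The sample structure and both thresholds are assumptions for illustration, not contractual values.

```python
# Hypothetical per-interval samples: (packets_sent, packets_received, qoe_score 0-100).
samples = [
    (10_000, 9_990, 97.5),
    (10_000, 9_998, 98.1),
    (10_000, 9_700, 82.0),  # a dip, e.g. during site maintenance
]

LOSS_THRESHOLD_PCT = 2.0  # illustrative threshold, not an SLA value
QOE_DIP = 90.0

for i, (sent, received, qoe) in enumerate(samples):
    loss_pct = 100.0 * (sent - received) / sent
    flags = []
    if loss_pct > LOSS_THRESHOLD_PCT:
        flags.append(f"packet loss {loss_pct:.2f}%")
    if qoe < QOE_DIP:
        flags.append(f"QoE dip ({qoe:.1f})")
    print(f"interval {i}:", "; ".join(flags) if flags else "ok")
```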

NetOps Escalations Q&A

  • Michael is active in NetOps escalations and invited questions, particularly about:

    • Incident 10860 (MAC flapping) where he will verify ongoing investigations with Palo Alto.

    • Confirmation of whether current deployment models are appropriate, especially regarding redundancy in configurations.

Updates on Ongoing Issues and Upgrades

  • Michael addressed some ongoing issues:

    • Inquiries about upgrades for the Tenton site related to CPU spiking and latency.

    • Discussion of a memory leak impacting CPU performance (see the monitoring sketch after this list).

    • Review of pending engineering updates from Palo Alto on firmware version 6.5.3, which addresses the ongoing issues.
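
A memory leak that degrades CPU tends to show up as a steadily rising memory baseline alongside CPU spikes. Below is a minimal host-side sketch of spotting that trend, assuming the third-party psutil package is available; the sample count and growth threshold are illustrative.

```python
import psutil  # third-party: pip install psutil

def sample(interval_s: float = 1.0, count: int = 5):
    """Collect (cpu_percent, memory_percent) pairs at a fixed interval."""
    readings = []
    for _ in range(count):
        cpu = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
        mem = psutil.virtual_memory().percent
        readings.append((cpu, mem))
    return readings

readings = sample()
mems = [m for _, m in readings]
# A monotonically rising memory baseline across samples hints at a leak.
if all(b >= a for a, b in zip(mems, mems[1:])) and mems[-1] - mems[0] > 0.5:
    print("memory climbing steadily; possible leak:", mems)
else:
    print("no sustained memory growth in this window:", mems)
```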

Future Meetings and Coordination

  • Discussion of upcoming scheduling and actions:

    • Scheduling of an upcoming meeting on March 8 for sites in Chandler and Ashburn.

    • The timeline is to be confirmed, including whether additional time needs to be allocated for the joint meeting.

    • Need for return shipping labels from the Lithia Springs location post-device cleanup and account separation.

Open Action Points

  • Action items include:

    • Follow up with Palo Alto regarding incident updates.

    • Get shipping labels prepared for returning devices.

    • Confirm a scheduled meeting for additional insights on device concerns, focused on configuration stability.

    • Notify local contacts about changes and updates in ticket resolutions.
