Updates on Alerts Review Document
Waseem requested an update on the alerts review document.
Michael reported that he had sent a customer-friendly summary of last Tuesday's cadence via email.
The actual customer agreement documentation is extensive (approximately 300 pages).
Michael created a summarized version that includes:
Flow charts explaining the alerting system
A breakdown of the documentation's contents
The document was sent as an attachment to the Tuesday night update email.
Michael is open to walking through the document with Waseem if there are any questions.
Daily Report on Tickets
Michael returned to discussing the daily report on incident tickets.
A total of approximately 29 tickets were reported, categorized as follows (a tally sketch follows the site breakdown below):
Most tickets are P2 (priority 2), relating to degraded systems.
Many are P3, concerning applications that are down.
A few are P4, which do not compromise production.
Breakdown of the top alerting hosts by site over the past seven days:
Tokyo Data Center: highest alert count
Bangalore location: second highest
Madrid: third place
Grasberg: fourth place
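As an illustration only, here is a minimal Python sketch of how such a daily tally could be produced. The ticket fields and sample records are assumptions for the example, not the actual ticketing schema.

```python
# Hypothetical daily tally of tickets by priority and by site.
# The "priority"/"site" field names and the sample records are
# illustrative assumptions, not the real ticketing data.
from collections import Counter

tickets = [
    {"id": 5593, "priority": "P2", "site": "Provo"},
    {"id": 1019, "priority": "P2", "site": "Aquila"},
    {"id": 10972, "priority": "P3", "site": "Palo Alto"},
    # ... remaining tickets ...
]

by_priority = Counter(t["priority"] for t in tickets)
by_site = Counter(t["site"] for t in tickets)

print("Tickets by priority:", dict(by_priority))
print("Top alerting sites:", by_site.most_common(4))
```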
High-Priority Ticket Updates
Michael provided updates on specific high-priority tickets:
Incident 5593 (Provo location):
A replacement ION 9200 for fault THAL1 is en route.
Site status: stable; awaiting installation of the replacement and return of the defective unit.
Tracking information should be in the relevant ticket. Michael will notify local contacts of the incoming unit.
Incident 1019 (Aquila location):
Teams live event issues stabilized following MTU adjustments.
Ongoing investigation into DF-flagged (don't-fragment) packet drops; the vendor is still being consulted (a path-MTU probe sketch follows below).
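As context for the DF drops, here is a minimal, Linux-only Python sketch of one common way to probe whether DF-flagged packets of a given size survive a path. The host, port, and sizes are placeholders, and this is a generic technique, not the team's actual procedure.

```python
# Probe whether a DF-flagged UDP datagram of a given payload size can
# be sent without exceeding the kernel's known path MTU (Linux only).
import socket

def probe_df(host: str, port: int, payload_size: int) -> bool:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Set the don't-fragment bit so oversized packets fail locally
    # (EMSGSIZE) instead of being fragmented en route.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER,
                 socket.IP_PMTUDISC_DO)
    try:
        s.sendto(b"\x00" * payload_size, (host, port))
        return True
    except OSError:  # EMSGSIZE: datagram larger than the path MTU
        return False
    finally:
        s.close()

# 1472 bytes of UDP payload + 28 bytes of IP/UDP headers = 1500 bytes.
print(probe_df("192.0.2.10", 33434, 1472))
```

Lowering the probe size until it succeeds brackets the effective path MTU, which is the same reasoning behind the MTU adjustments above.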
Incident 10972 (Palo Alto):
Ongoing investigation into route advertisement failures.
The engineering team is reviewing the configuration and several firmware options.
Incident 3644:
Traffic was dropped inside Prisma Access.
A joint troubleshooting session is scheduled to address the issue.
Incident 10860 (MAC flapping on Ascent):
The issue persists; affected ports remain disabled (a detection sketch follows below).
Palo Alto is reviewing the issue for a fix.
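As a hypothetical illustration of how such flapping can be spotted, the sketch below scans forwarding-table observations for a MAC address that keeps moving between ports. The event format, port names, and threshold are assumptions, not the actual switch logs.

```python
# Detect MAC addresses that repeatedly move between switch ports.
from collections import defaultdict

def find_flapping(events, min_moves=3):
    """events: iterable of (mac, port) pairs in time order.
    Returns MACs seen moving between ports at least min_moves times."""
    last_port = {}
    moves = defaultdict(int)
    for mac, port in events:
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1
        last_port[mac] = port
    return [mac for mac, count in moves.items() if count >= min_moves]

events = [
    ("aa:bb:cc:00:11:22", "port-1"),
    ("aa:bb:cc:00:11:22", "port-2"),
    ("aa:bb:cc:00:11:22", "port-1"),
    ("aa:bb:cc:00:11:22", "port-2"),
]
print(find_flapping(events))  # ['aa:bb:cc:00:11:22']
```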
Report on System Performance
Michael shared information on uptimes, packet loss, and pre/post Quality of Experience (QoE):
No significant dips observed in pre-QoE metrics.
Post-QoE metrics show smooth performance; a previous dip related to Bengaluru site maintenance was discussed in Tuesday's meeting (a dip-check sketch follows the observations below).
Most of the high data points come from data centers, which is expected.
Observations about data points:
Smaller branches and offices show peaks but stay within acceptable bandwidth (50 Gbps).
Aquila has a higher user count (around 2,000), which affects its usage metrics.
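To make the dip check concrete, here is an illustrative Python sketch that flags sample windows where a QoE score falls well below the series median. The threshold and sample values are assumptions for the example, not the actual monitoring data.

```python
# Flag indices where a QoE score dips more than `margin` below the median.
from statistics import median

def find_dips(scores, margin=0.1):
    floor = median(scores) * (1 - margin)
    return [i for i, s in enumerate(scores) if s < floor]

pre_qoe = [0.97, 0.96, 0.98, 0.97, 0.96]
post_qoe = [0.98, 0.97, 0.82, 0.98, 0.97]  # dip during site maintenance

print(find_dips(pre_qoe))   # [] -> no significant dips
print(find_dips(post_qoe))  # [2] -> the maintenance-related dip
```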
NetOps Escalations Q&A
Michael is active in NetOps escalations and invited questions, particularly about:
Incident 10860 (MAC flapping), for which he will verify the status of ongoing investigations with Palo Alto.
Confirmation on whether current deployment models are appropriate, especially concerning redundancy in configurations.
Updates on Ongoing Issues and Upgrades
Michael addressed some ongoing issues:
Inquiries about the Tenton site and upgrades related to CPU spiking and latency.
Discussion of a memory leak impacting CPU performance.
Review of pending engineering updates from Palo Alto regarding firmware version 6.5.3, which addresses the ongoing issues.
Future Meetings and Coordination
Discussion of upcoming scheduling and actions:
Scheduling of an upcoming meeting on March 8 for sites in Chandler and Ashburn.
The timeline is expected to be confirmed, along with whether additional time needs to be allocated for the joint meeting.
Return shipping labels are needed from the Lithia Springs location following device cleanup and account separation.
Open Action Points
Action items include:
Follow up with Palo Alto regarding incident updates.
Prepare shipping labels for returning devices.
Confirm a scheduled meeting for additional insights on device concerns, focused on configuration stability.
Notify local contacts about changes and updates in ticket resolutions.