Performance and Recovery Notes

Performance and Recovery

This module focuses on optimizing network performance, protecting it from faults and failures, and ensuring recovery from outages or disasters.

Collect Network Data

Network management involves assessing, monitoring, and maintaining all aspects of a network, including:

Controlling user access.
Monitoring performance baselines.
Checking for hardware faults.
Optimizing QoS for critical applications.
Maintaining records of network assets and software configurations.
Determining optimal times for hardware and software upgrades.

The goals are to enhance efficiency and performance while preventing downtime or loss, ideally by predicting problems before they occur. Before assessing network health, you must understand its logical and physical structure and how it functions under typical conditions. This requires collecting data about the network's state, devices, and traffic.

Environmental Monitoring

Maintaining the physical environment is crucial for reliable network function.

Key environmental factors to monitor include:

Device, rack, or room temperature.
Humidity, dew point, or barometric pressure.
Flooding (using liquid detectors).
Smoke or fire.
Airflow.
Vibration.
Motion (using security cameras).
Room lights (on or off).
Room or rack doors (open or closed).
Power (main or UPS voltage, battery level, outages, power consumption).

Sensors feed data to a physical device or software installed on a server, which presents the data in an administrative dashboard accessible over the network or Internet. This allows admins to check current data, adjust alarm thresholds, analyze historical data, and respond to alerts.
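
The threshold-and-alert behavior described above can be sketched in a few lines. This is only an illustration, not any monitoring product's API: the sensor names, the limits, and the check_readings helper are all hypothetical.

```python
# Hypothetical alarm thresholds: (low limit, high limit) per sensor.
THRESHOLDS = {
    "rack_temp_c": (10.0, 35.0),
    "humidity_pct": (20.0, 80.0),
    "ups_battery_pct": (30.0, 100.0),
}


def check_readings(readings, thresholds=THRESHOLDS):
    """Return alert strings for every reading outside its configured limits."""
    alerts = []
    for sensor, value in readings.items():
        if sensor not in thresholds:
            continue  # no threshold configured for this sensor
        low, high = thresholds[sensor]
        if value < low:
            alerts.append(f"{sensor} below threshold: {value} < {low}")
        elif value > high:
            alerts.append(f"{sensor} above threshold: {value} > {high}")
    return alerts
```

A real dashboard would forward these strings through email, SMS, or an SNMP trap rather than returning them to the caller.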

Products like PRTG Network Monitor by Paessler collect and organize information about monitored devices and sensors using protocols like ICMP, SNMP, WMI, and HTTPS. Alerts from monitoring software can be transmitted via email, SMS, phone calls, push notifications, audible alerts, or SNMP traps. Some solutions allow remote control of environmental factors.

Monitoring Device Health Using Task Manager

Environmental monitoring systems collect data indicating the health and status of individual devices, such as CPU usage, memory demand, storage space, network throughput, and uptime. This information can be viewed in Windows Task Manager.

To check Windows 10 performance statistics:

Press Ctrl + Alt + Del, then click Task Manager.
Click the Performance tab.
Note the changing numbers as you use the computer.

Task Manager provides real-time data on CPU utilization, uptime, running processes, available memory, memory usage, connected storage drives, drive read/write speeds, and network connections.

Traffic Monitoring Tools

These tools provide real-time analysis with alerts or log data for retroactive analysis to investigate network performance issues. Gaining access to the traffic itself is a primary challenge.

A network monitor continuously monitors network traffic and receives data from monitored devices. A protocol analyzer monitors traffic at a specific interface between a server/client and the network. The terms are often used interchangeably, but they differ in the kinds of data gathered.

Network Monitoring Software: Spiceworks can monitor multiple devices on a network.
Protocol Analyzer: Wireshark monitors traffic on the interface between a single device and the network.

Methods to track network traffic:

Wireless monitoring: Run monitoring software on a computer connected wirelessly. The network adapter must support promiscuous mode, where the NIC passes all wireless frames to the OS and monitoring software.
Port mirroring (SPAN): Configure a switch to copy all traffic sent to any port to a mirrored port, allowing monitoring of all traffic on the switch.
In-line monitoring: Install a network TAP (test access point) or packet sniffer in line with network traffic to mirror traffic to a computer running monitoring software in promiscuous mode.
Reporting: Configure devices to report their traffic and statistics to a network monitor using protocols like syslog and SNMP.

Network monitoring tools can:

Set the NIC to promiscuous mode.
Monitor network traffic continuously.
Capture network data transmitted.
Capture frames sent to/from a specific node.
Reproduce network conditions by transmitting selected data.
Generate statistics about network activity.

Some tools can also:

Discover all network nodes.
Establish a baseline.
Track utilization of network and device resources and present it in graphs/charts.
Store traffic data and generate reports.
Trigger alarms when traffic conditions meet thresholds.
Identify usage anomalies, such as top talkers or listeners.

Captured data helps solve networking problems by identifying the source of issues, such as excessive bad transmissions from a workstation or a compromised server generating a flood of data.

Traffic analysis examines network traffic flow for patterns and exceptions to identify bottlenecks. Protocol analysis digs into packet details to identify protocols, errors, and misconfigurations. Both approaches are useful, but focusing on the most relevant one speeds up problem resolution.

Types of data errors and transmission problems:

Runts: Packets smaller than the medium’s minimum size (e.g., < 64 bytes for Ethernet).
Giants: Packets exceeding the medium’s maximum size (e.g., > 1518 bytes for Ethernet).
Jabber: A device improperly handling electrical signals, causing constant retransmissions and halting the network, usually due to a bad NIC.
Ghosts: Aberrations caused by a device misinterpreting stray voltage, resulting in invalid frame patterns.
Packet loss: Packets lost due to unknown protocol, unrecognized port, network noise, or other anomaly, never reaching their destination.
Discarded packets (Discards): Packets arriving beyond their usable timeframe, discarded due to buffer overflow, latency, bottlenecks, or congestion.
Interface resets: Repeated connection resets, lowering utilization quality, typically caused by interface misconfiguration.
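
As a small illustration of the runt and giant definitions above, the following sketch classifies Ethernet frames by length using the 64-byte and 1518-byte limits from the list. The function names are hypothetical, and a real analyzer would read the lengths from captured frames rather than from a list.

```python
from collections import Counter

ETHERNET_MIN_FRAME = 64    # bytes; anything smaller is a runt
ETHERNET_MAX_FRAME = 1518  # bytes; anything larger is a giant


def classify_frame(length_bytes):
    """Label a frame as 'runt', 'giant', or 'ok' based on its length."""
    if length_bytes < ETHERNET_MIN_FRAME:
        return "runt"
    if length_bytes > ETHERNET_MAX_FRAME:
        return "giant"
    return "ok"


def summarize(lengths):
    """Count how many frames fall into each category."""
    return dict(Counter(classify_frame(n) for n in lengths))
```

Counts like these are the kind of error statistics that feed a performance baseline.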

To identify a process hogging network resources on Windows:

Open an elevated PowerShell or Command Prompt and enter netstat -o to display the PID (process identifier) for each network connection.
Identify process names for each PID in Task Manager (Ctrl + Alt + Del).
Alternatively, use netstat -b to resolve process names (takes longer).
Stop an unruly process using the Services console (services.msc) or the taskkill /f /pid #### command, where #### is the PID. If necessary, take ownership of the process’s program file using takeown /f <filename>.
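
To spot which PID holds the most connections, the netstat -o output can be tallied with a short script. The sketch below parses a sample of that output; the sample text, its addresses, and the connections_per_pid helper are illustrative assumptions, and only TCP rows are counted because UDP rows have no State column.

```python
from collections import Counter

# Illustrative sample in the layout that Windows "netstat -o" prints.
SAMPLE_NETSTAT_OUTPUT = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    192.168.1.10:49712     203.0.113.5:443        ESTABLISHED     4321
  TCP    192.168.1.10:49713     203.0.113.5:443        ESTABLISHED     4321
  TCP    192.168.1.10:49720     198.51.100.7:80        TIME_WAIT       0
  TCP    192.168.1.10:49725     203.0.113.9:443        ESTABLISHED     1880
"""


def connections_per_pid(netstat_text):
    """Return (pid, connection_count) pairs, busiest PID first."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP data rows have five fields: proto, local, foreign, state, PID.
        if len(fields) == 5 and fields[0] == "TCP" and fields[4].isdigit():
            counts[int(fields[4])] += 1
    return counts.most_common()
```

On a live system, the text could instead come from running netstat -o through Python's subprocess module. A high connection count is only a hint; Task Manager still shows which process is actually consuming bandwidth.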

Event Logging

Faults and conditions exceeding thresholds trigger alerts (messages indicating a threshold has been met), generating notifications to IT personnel via email, SMS, or support tickets. Alerts are also recorded in system and event logs kept by routers, switches, servers, and workstations. These logs can be accessed through Event Viewer in Windows.

Virtually every condition recognized by an OS can be recorded in an event log. Network admins can customize logs by defining conditions for new entries. In Windows, the Event Viewer application displays the event log, classifying events as Critical, Error, Warning, Information, Audit Success, or Audit Failure.

Syslog Messages

Event logs and additional information are routinely recorded and can be collected centrally via the syslog utility. Syslog is a standard for generating, storing, and processing messages about events on networked systems. It has three primary components:

Event message format: Messages must be organized and formatted in a specific manner.
Event message transmission: Messages are transported across the network on port 514 (or 6514 for TLS secured messages).
Event message handling: The syslog utility follows protocols for creating, handling, analyzing, and storing event messages.

Syslog defines two roles for devices:

Generator: The monitored device issuing event information.
Collector: The server gathering event messages.

Syslog assigns a severity level (logging level or priority value) to each event, from 0 (emergency) to 7 (debugging). A filter is configured on the device to send events from a specific level and above to the syslog server. Syslog messages can also be filtered by facility (machine process), such as kernel, users, or security/authorization.
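
The relationship between facility, severity, and filtering can be sketched as follows. The PRI formula (facility multiplied by 8, plus severity) comes from the syslog standard (RFC 5424) rather than from these notes, and the function names are hypothetical. Note that "a specific level and above" means more severe, which is numerically lower.

```python
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]  # list index = severity level 0..7


def encode_pri(facility, severity):
    """Combine facility and severity into the PRI value carried in a message."""
    return facility * 8 + severity


def decode_pri(pri):
    """Split a PRI value back into (facility, severity)."""
    return divmod(pri, 8)


def should_forward(severity, threshold):
    """True if an event is at least as severe as the configured threshold.

    Lower numbers are more severe, so threshold=4 (warning) forwards 0..4.
    """
    return severity <= threshold
```

For example, facility 4 (security/authorization) with severity 2 (critical) encodes to a PRI of 34, and a device filtered at level 4 would forward it to the collector.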

Tracking every movement of every user generates an audit log, or audit trail, which provides consistent and thorough data to retroactively prove compliance and defensibly prove user actions (for forensics investigations). Syslog doesn’t alert you to any problems, but it does keep a history of messages issued by the system for later review and analysis.

SNMP Communications

Organizations use enterprise-wide network management systems to perform real-time monitoring functions across an entire network. These systems rely on entities working together:

NMS (network management system) server: A network management console collects data from multiple managed devices at regular intervals (polling).
Managed device: Any network node monitored by the NMS, containing several managed objects (any monitored characteristic of the device).
OID (object identifier): Each managed object is assigned a unique identifier, standardized across all NMSs.
Network management agent: A software routine running on each managed device, collecting information about the device’s operation and providing it to the NMS.
MIB (Management Information Base): A database containing the list of objects managed by the NMS, their descriptions, and performance data. It is designed in a top-down, hierarchical tree structure.

Agents communicate via application layer protocols; most modern networks use SNMP (Simple Network Management Protocol), which runs over UDP ports 161 and 162 (or TCP ports 10161 and 10162). SNMP can reconfigure managed devices and be used for real-time network monitoring. There are three versions of SNMP:

SNMPv1: The original version, rarely used today.
SNMPv2: Improved performance and slightly better security.
SNMPv3: Similar to SNMPv2, but adds authentication, validation, and encryption.

SNMPv3 is the most secure version, but SNMPv2 is still widely used, despite its vulnerabilities. Additional security measures for older versions of SNMP include:

Disabling SNMP on unneeded devices.
Limiting approved sources of SNMP messages.
Requiring read-only mode.
Configuring strong passwords (community strings).
Using different community strings on different device types.

Key SNMP messages:

Get Request: NMS requests data from the agent.
Get Response: Agent responds with requested information.
Get Next: NMS requests the next row of data in the MIB database.
Walk: NMS issues a sequence of Get Next messages to walk through sequential rows in the MIB database.
Trap: Agent sends unsolicited data to the NMS when specified conditions are met.
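
The Get, Get Next, and Walk messages can be illustrated with a toy in-memory agent. This is a simulation only: it sends nothing over the network and uses no SNMP library. The OIDs are the standard MIB-2 identifiers for sysDescr, sysUpTime, sysName, and ifNumber, while the values and the ToyAgent class are hypothetical.

```python
import bisect


def parse_oid(text):
    """Turn '1.3.6.1' into (1, 3, 6, 1) so OIDs sort in MIB tree order."""
    return tuple(int(part) for part in text.split("."))


class ToyAgent:
    """In-memory stand-in for an agent answering Get and Get Next."""

    def __init__(self, objects):
        self._table = {parse_oid(oid): value for oid, value in objects.items()}
        self._oids = sorted(self._table)

    def get(self, oid):
        return self._table.get(parse_oid(oid))

    def get_next(self, oid):
        # First OID strictly after the requested one, in tree order.
        i = bisect.bisect_right(self._oids, parse_oid(oid))
        if i == len(self._oids):
            return None
        nxt = self._oids[i]
        return ".".join(map(str, nxt)), self._table[nxt]


def walk(agent, root):
    """Issue repeated Get Next requests until the result leaves the subtree."""
    root_t = parse_oid(root)
    results, current = [], root
    while True:
        item = agent.get_next(current)
        if item is None or parse_oid(item[0])[:len(root_t)] != root_t:
            return results
        results.append(item)
        current = item[0]


AGENT = ToyAgent({
    "1.3.6.1.2.1.1.1.0": "Example switch OS",  # sysDescr
    "1.3.6.1.2.1.1.3.0": 123456,               # sysUpTime
    "1.3.6.1.2.1.1.5.0": "core-sw-01",         # sysName
    "1.3.6.1.2.1.2.1.0": 24,                   # ifNumber
})
```

Walking 1.3.6.1.2.1.1 returns the three system objects and stops when Get Next reaches the interfaces subtree.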

After data collection, the network management application presents the data in various ways, such as line graphs or maps indicating device status with different colors (green for fully functional, yellow for partially functional, red for failed).

Because network management applications are so flexible, they are challenging to configure and fine-tune. The goal is to collect useful data without burying it in excessive routine information.

NetFlow

NetFlow is a proprietary traffic monitoring protocol from Cisco that tracks all IP traffic crossing any interface where NetFlow is enabled. It creates flow records that show relationships between various traffic types, focusing on how network bandwidth is utilized by identifying how communications from all devices are related to each other.

When NetFlow is enabled, each unique conversation is collected in a NetFlow cache as a flow record. Completed flow records are exported to a centralized NetFlow collector for analysis. A NetFlow analyzer collates flow records to provide insights into traffic patterns.
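
The idea of grouping packets into flow records can be sketched as below. This is a simplification, not the NetFlow export format: it assumes a conversation is identified by source address, destination address, source port, destination port, and protocol, and the packet tuples and function names are hypothetical.

```python
from collections import defaultdict


def build_flow_records(packets):
    """Group packets into flows keyed by (src, dst, sport, dport, proto)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


def top_talkers(flows, n=3):
    """Rank source addresses by total bytes sent across all their flows."""
    totals = defaultdict(int)
    for key, stats in flows.items():
        totals[key[0]] += stats["bytes"]  # key[0] is the source address
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical observed packets: (src, dst, sport, dport, proto, bytes).
PACKETS = [
    ("10.0.0.5", "10.0.0.20", 50000, 443, "TCP", 1200),
    ("10.0.0.5", "10.0.0.20", 50000, 443, "TCP", 800),
    ("10.0.0.7", "10.0.0.20", 50001, 53, "UDP", 90),
    ("10.0.0.7", "10.0.0.21", 50002, 443, "TCP", 400),
]
```

An analyzer works on summaries like these instead of full packet captures, which is why flow-based monitoring needs fewer resources.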

NetFlow exporters (routers and switches) monitor traffic and transfer flow records to the collector, which adds load and can negatively impact network performance. A balance must be struck between tracking all traffic and tracking only enough for sufficiently accurate traffic analysis. Even so, NetFlow can analyze very high volumes of traffic with fewer resources than options that capture entire packets.

sFlow is a similar technology that is compatible with many platforms. It relies on a dedicated hardware chip to avoid placing additional demand on a network device’s CPU and memory. While NetFlow is limited to capturing IP traffic, sFlow can sample traffic from layers 2 through 7.

Manage Network Traffic

After collecting data on network traffic patterns, you can monitor the network’s status and make changes to optimize performance. This includes:

Performance management: Monitoring how well links and devices are keeping up with demands.
Fault management: Detecting and signaling device, link, or component faults.

Effective network administration involves responding to errors and tweaking device/network configurations to optimize performance. This requires knowing your starting point.

Performance Baselines

To know when there’s a problem, one must first know what is normal for that network. A baseline is a report of the network’s normal state of operation and might include a range of acceptable measurements.

Network performance baselines are obtained by analyzing network traffic information, including:

Utilization rate for the network backbone.
Number of users logged on per day/hour.
Number of protocols running on the network.
Statistics about errors (runts, jabbers, giants).
Frequency of networked application use.
Information regarding which users take up the most bandwidth.

Baseline measurements serve as a basis of comparison for identifying changes or events on the network and for judging whether an upgrade delivered a benefit.

Network traffic patterns vary, so a baseline must account for:

Normal variations throughout the day, week, month, and different seasons.
Changes to the network and their unpredictable impact.

Several software applications can perform baselining, ranging from freeware to expensive, customizable products. Choosing a tool depends on the network size and needs. For example, iPerf establishes throughput between network hosts, and its data can form a baseline if one experiences traffic problems in the future.
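
One simple way to turn collected measurements into "a range of acceptable measurements" is to take the mean plus or minus a few standard deviations. That convention is an assumption of this sketch, not something these notes prescribe, and the function names and the multiplier k are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples, k=2.0):
    """Return (low, high): the mean plus or minus k sample standard deviations."""
    center = mean(samples)
    spread = stdev(samples)
    return center - k * spread, center + k * spread


def is_abnormal(value, baseline):
    """True if a new measurement falls outside the baseline range."""
    low, high = baseline
    return not (low <= value <= high)
```

With backbone utilization samples of 40, 42, 38, 41, and 39 percent, the range is roughly 36.8 to 43.2 percent, so a later reading of 75 percent would stand out. Separate baselines per time of day or day of week would handle the normal variations listed above.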

Common network performance KPIs (key performance indicators) include:

Device availability and performance: CPU/memory usage, temperature, network connection speed.
Interface statistics: Insights into network changes and potential issues.
Utilization: Actual throughput used as a percentage of available bandwidth.
Error rate: Percentage of damaged bits due to EMI or other interference.
Packet drops: Packets that are damaged, expired, or not allowed through an interface.
Jitter: Varying latency between successive packets, degrading user experience.
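
Three of these KPIs reduce to simple arithmetic, sketched below. The function names are hypothetical, and jitter is computed here as the mean absolute difference between successive latency samples, which is one common simplification rather than the only definition.

```python
def utilization_pct(throughput_bps, bandwidth_bps):
    """Actual throughput as a percentage of available bandwidth."""
    return 100.0 * throughput_bps / bandwidth_bps


def error_rate_pct(damaged_bits, total_bits):
    """Percentage of transmitted bits that arrived damaged."""
    return 100.0 * damaged_bits / total_bits


def mean_jitter_ms(latencies_ms):
    """Average change in latency between successive packets."""
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

For instance, 250 Mbps of traffic on a 1 Gbps link is 25 percent utilization, and latencies of 20, 22, 21, and 30 ms give a mean jitter of 4 ms.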

Bandwidth Management

Bandwidth management refers to strategies to optimize the volume of traffic a network can support. Techniques include:

Flow control: Configure interfaces and protocols to balance permitted traffic volume with device capability.
Congestion control: Adjust how network devices respond to traffic congestion.
QoS (quality of service): Prioritize some traffic over other traffic.

Flow Control

Flow control is a bandwidth management technique configured on a local connection between two devices to ensure the receiver isn’t overwhelmed. It can be managed at the data link, network, or transport layers. Rate-based flow control (higher layers) limits the amount of data sent but provides no feedback to the sender, which can result in traffic loss. Feedback-based flow control (data link layer) tells the sender when the transmission rate is exceeding the receiver’s capacity.

Common approaches are:

Stop-and-wait: Sender transmits a frame and waits for acknowledgment before transmitting the next frame. If an acknowledgment is not received, the sender retransmits the unacknowledged frame. Slow and accurate.
Go-back-n sliding window: Sender transmits multiple frames while considering the receiver’s capacity, waiting for acknowledgments. If an acknowledgment is missing, the sender goes back and retransmits every frame sent since the last acknowledged one, even if only one frame was lost.
Selective repeat sliding window: Only the unacknowledged frame is retransmitted. Receiver must be able to receive frames out of order and reorganize them.
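
The cost difference between these approaches shows up when counting transmissions. The sketch below uses a deliberately simplified model: frames are sent a window at a time, a frame can be lost only on its first transmission, and acknowledgments are never lost. The function names are hypothetical.

```python
def go_back_n_transmissions(n_frames, window, lost_first_try):
    """Count frames sent when a loss forces a resend from the lost frame on."""
    lost = set(lost_first_try)
    base = 0   # first frame not yet acknowledged
    sent = 0
    while base < n_frames:
        end = min(base + window, n_frames)
        sent += end - base  # transmit every frame in the current window
        first_lost = next((f for f in range(base, end) if f in lost), None)
        if first_lost is None:
            base = end  # whole window acknowledged
        else:
            # Frames from the lost one onward are discarded by the receiver
            # and resent; their retransmissions succeed in this model.
            lost.difference_update(range(first_lost, end))
            base = first_lost
    return sent


def selective_repeat_transmissions(n_frames, lost_first_try):
    """Count frames sent when only the lost frames are retransmitted."""
    return n_frames + len({f for f in lost_first_try if 0 <= f < n_frames})
```

Sending 10 frames with a window of 4 and frame 2 lost takes 12 transmissions with go-back-n but 11 with selective repeat. Setting the window to 1 reproduces stop-and-wait, which also needs 11 transmissions but sends only one frame per round trip.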

Congestion Control

Congestion control manages traffic volume throughout the network (compared to flow control, which is between two devices). It aims to prevent congestion before it occurs (open-loop congestion control) and remedy it after it starts (closed-loop congestion control).

Open-loop techniques include:

Retransmission policy: Retransmission timers help reduce increasing congestion caused by devices attempting to resend lost packets too quickly or too often.
Window policy: Senders might be required to use the selective repeat sliding window method to reduce the number of frames that must be resent when errors occur.
Acknowledgment policy: Receivers can be required to send a single ACK message for multiple received frames, thereby reducing acknowledgment traffic on the network.
Discarding policy: Less sensitive frames are discarded so important traffic can survive the congestion.
Admission policy: Routers and switches can temporarily reject new traffic that will contribute to or create congestion rather than admitting that new traffic onto the network.

Closed-loop techniques include:

Implicit signaling: Sender detects congestion after experiencing missed acknowledgments.
Explicit signaling: Congested devices alter data packets to indicate congestion to the sender (backward signaling) or receiver (forward signaling).
Choke packet: Router creates/sends a choke packet to the traffic source, informing it of congestion.
Backpressure: Downstream node stops accepting traffic, transferring pressure upstream.

QoS (Quality of Service) Assurance

Voice and video transmissions are delay-sensitive and loss-tolerant. Network administrators must manage QoS configurations, adjusting traffic priorities based on:

Application protocols
Bandwidth requirements

Variable delays of VoIP packets result in choppy voice quality, requiring prioritization. Optimized QoS translates into uninterrupted audio/visual reproduction.

Traffic Shaping

Traffic shaping (packet shaping) involves manipulating packet/data stream characteristics to manage traffic flow. Goals include timely delivery of important traffic while optimizing performance for all users. Approaches include:

Delaying less-important traffic (buffering).
Increasing priority of more important traffic.
Limiting the volume of traffic flowing into/out of an interface.
Limiting the momentary throughput rate for an interface.

The last two techniques represent traffic policing, resulting in dropped traffic rather than buffered traffic. ISPs might impose a maximum capacity, and traffic shaping can dynamically increase a busy user’s bandwidth without affecting others when a network is not at risk of congestion.
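
Traffic policing is often implemented with a token bucket, a technique these notes do not name but which illustrates how a rate limit leads to dropped rather than buffered traffic. In this sketch the class, its parameters, and the explicit now timestamps are hypothetical; a shaper would queue the rejected packet instead of dropping it.

```python
class TokenBucket:
    """Policer that admits a packet only if enough byte tokens are available."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # sustained rate limit
        self.capacity = burst_bytes       # largest permitted burst
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last = 0.0                   # time of the previous check, seconds

    def allow(self, packet_bytes, now):
        """Return True to forward the packet, False to drop it."""
        elapsed = now - self.last
        # Refill tokens for the elapsed time, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False
```

With a 1000 bytes/s rate and a 1500-byte burst, a 1500-byte packet at time 0 is forwarded, a 500-byte packet at 0.25 s is dropped because only 250 tokens have accumulated, and the same packet at 0.75 s is forwarded.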

Traffic can be prioritized based on:

Protocol
IP address
User group
DiffServ flag
VLAN tag
Service/Application

Different types of traffic can be assigned priority classes (high, normal, low, or slow) or rated on a prioritization scale from 0 (lowest) to 7 (highest).

DiffServ (Differentiated Services)

DiffServ prioritizes traffic at layer 3, considering all types of traffic. It places information in the DiffServ field of an IPv4 packet or the Traffic Class field of an IPv6 packet to indicate the forwarding preference routers should give the data stream. The first 6 bits of this 8-bit field are called the DSCP (Differentiated Services Code Point).

DiffServ defines two types of forwarding:

EF (Expedited Forwarding): A data stream is assigned a minimum departure rate from a given node.
AF (Assured Forwarding): Different levels of router resources can be assigned to data streams. AF prioritizes data handling but provides no guarantee that on a busy network, messages will arrive on time and in sequence.
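
Because the DSCP occupies the first 6 bits of the 8-bit field, it can be extracted with a shift. The sketch below assumes the remaining 2 bits are the ECN bits and uses the standard code points 46 for EF and 10 for AF11; those details come from the DiffServ and ECN standards rather than from these notes, and the function names are hypothetical.

```python
DSCP_EF = 46    # Expedited Forwarding code point
DSCP_AF11 = 10  # one of the Assured Forwarding code points


def split_ds_field(ds_byte):
    """Split the 8-bit field into (dscp, ecn)."""
    dscp = ds_byte >> 2   # upper 6 bits
    ecn = ds_byte & 0b11  # lower 2 bits
    return dscp, ecn


def make_ds_field(dscp, ecn=0):
    """Build the 8-bit field from a DSCP value and optional ECN bits."""
    return (dscp << 2) | ecn
```

A packet marked EF therefore carries the byte value 184 (0xB8) in this field, which is the number a protocol analyzer displays for the raw byte.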

CoS (Class of Service)

CoS refers only to layer 2 techniques on Ethernet frames, efficiently routing Ethernet traffic between VLANs. Frames tagged to specific VLANs contain a 3-bit PCP (Priority Code Point) field in the frame header. These bits are set to one of eight levels (0-7), indicating the message priority if the port receives more traffic than it can forward.
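
The 3-bit PCP field sits at the top of the 16-bit tag control information (TCI) in an 802.1Q VLAN tag, followed by a 1-bit drop-eligible indicator and the 12-bit VLAN ID. That layout comes from the 802.1Q standard rather than these notes, and parse_tci is a hypothetical helper name.

```python
def parse_tci(tci):
    """Split a 16-bit 802.1Q TCI value into (pcp, dei, vlan_id)."""
    pcp = (tci >> 13) & 0x7  # priority code point, 0-7
    dei = (tci >> 12) & 0x1  # drop eligible indicator
    vlan_id = tci & 0x0FFF   # 12-bit VLAN identifier
    return pcp, dei, vlan_id
```

For example, a TCI of 0xA064 decodes to priority 5 on VLAN 100.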

Plan Response and Recovery Strategies

Despite precautions, disasters and security breaches happen. Training and preparation are crucial.

Key terms:

Incident: Any event that has adverse effects on a network’s availability or resources.
Disaster: An extreme type of incident, involving a network outage that affects more than a single system or limited group of users.

Incident Response

An incident response plan defines events qualifying as formal incidents and the steps to follow; it aims to:

Keep people safe
Protect sensitive data
Ensure network availability and integrity
Collect data

The incident response is a six-stage process:

Preparation: Response team brainstorms possible incidents and plans procedures.
Detection and identification: Staff are educated about what qualifies as an incident and how to report potential problems.
Containment: Team limits the damage by isolating affected systems/areas.
Remediation: Team identifies the problem's cause and resolves it.
Recovery: Operations return to normal as affected systems are repaired.
Review: Team learns from the incident and adjusts for future prevention.

Team roles:

Dispatcher: Notifies the technical lead and manager upon detection, creating a record.
Technical support specialist: Solves the problem quickly, detailing what happened.
Manager: Coordinates resources, ensures policy is followed, communicates with public relations.
Public relations specialist: Acts as spokesperson, if necessary.

Data Preservation

During some incidents, data must be collected in such a way that it can be presented in a court of law to prosecute an instigator of illegal activity. Forensic data can be damaged if improperly handled, so it should be handled by first responders who are trained or certified in evidence handling. Every IT technician must know how to safeguard sensitive data and legal evidence until the response team arrives:

Secure the area: Prevent contamination of evidence by isolating and securing devices by disconnecting from the network (leave device running without closing files).
Document the scene: To create a defensible audit trail, document all actions with the time and reasoning for each. Maintain a list of everyone present with access.
Monitor evidence and data collection: Record all items collected, maintaining their original state (do not access any files).
Protect the chain of custody: Track all collected data, ensuring it remains in official hands (sign off on chain of custody documents).
Monitor transport of data and equipment: Document every item so that it can be replicated in the lab (a hot seizure and removal is possible).
Create a report: Report all activities you observed or participated in, writing it in full while the event is still fresh in your mind.

Disaster Recovery Planning

A BCP (business continuity plan) details the resources and protocols the business will use to continue providing service to its customers with little or no disruption during a disaster, covering prevention, damage limitation, and restoration. A disaster recovery plan details the processes for restoring critical functionality and data after an outage, accounting for worst-case scenarios. It includes contingencies for restoring or replacing computer systems, power, telephone systems, and paper-based files, as well as a contact list of emergency coordinators who will execute the plan. The disaster recovery plan should also be tested regularly.

A BCP takes a big-picture approach to preparations, such as identifying critical operations that require significant backups and ensuring core communications channels are available under a variety of possible circumstances. A BCP also includes ways to prevent disasters from affecting the company at all, ways to limit the damage if or when those disasters occur, and processes for restoring operations and limiting downtime.

Disaster Recovery Contingencies

Redundancy can be provided by multiple servers to run a website or multiple ISP connections. The same principle applies to disaster recovery planning.

An organization can choose from several options for recovering its network infrastructure from a disaster, commonly divided into:

Cold site: Computers, devices, and connectivity exist but are not configured/updated/connected. Restoring functionality could take a long time.
Warm site: Computers, devices, and connectivity exist, with some pieces configured/updated/connected. Restoring could take hours/days.
Hot site: Computers, devices, and connectivity are all configured/updated/connected, matching the network’s current state. Immediate return to service is expected.
DRaaS (disaster recovery as a service): A cloud site that provides a highly scalable, inexpensive DR option by establishing a cloud configuration that could take over many or most business processes in the event of a disaster. Resources can also be scripted using IaC (infrastructure as code) and created only when needed after a disaster occurs.

Power Management

Managing a network’s availability involves managing supporting facilities/infrastructure like power connections during outages/fluctuations. Power surges/fluctuations can damage sensitive equipment.

An electric circuit provides a medium for the transfer of electric power over a closed loop. If the loop is broken in any way, the circuit won’t conduct electricity. In a circuit, DC (direct current) flows at a steady rate in only one direction. By contrast, AC (alternating current) continually switches direction on the circuit. A flashlight, for example, uses DC. The batteries in a flashlight have positive and negative poles, and the current always flows at a steady rate in the same direction between those poles. AC, however, travels in compression waves, similar to the coils of a Slinky®, alternating direction on the power line back and forth between the source and destination.

Power Flaws

Power loss or less-than-optimal power cannot be tolerated by networks. Power flaws include:

Surge: Momentary voltage increase due to lightning, solar flares, or electrical problems. Use surge protectors.
Noise: Voltage fluctuation caused by other devices or EMI. Excessive noise can damage equipment. Reduce noise with an electrical filter, resulting in clean power.
Brownout: Momentary voltage decrease (sag). Can cause failures and data corruption.
Blackout: Complete power loss. Can cause extensive damage, requiring a backup power source for graceful shutdown.

Network Power Devices

Devices commonly encountered include surge protectors, PDUs (power distribution units), and UPSs (uninterruptible power supplies).

A surge protector can absorb excess energy from power lines to protect sensitive network equipment from power surges.

A PDU (power distribution unit) brings power from outlets, a generator, or a UPS closer to the devices on the rack.

A UPS (uninterruptible power supply) provides backup power during outages and protects against power fluctuations. UPSs come in two categories:

Standby UPS (SPS): Provides continuous voltage to a device by switching virtually instantaneously to the battery when it detects a loss of power from the wall outlet. When power is restored, the standby UPS switches the device back to AC power. The problem with standby UPSs is that, in the brief time it takes the UPS to discover that power from the wall outlet has faltered, a device may have already detected the power loss and shut down or restarted.
Online UPS: Uses the AC power from the wall outlet to continuously charge its battery while providing power to a network device through its battery.

A generator serves as a backup power source, typically combined with a large UPS to ensure clean power is always available. In the event of a power failure, the UPS supplies electricity until the generator starts and reaches its full capacity, typically no more than three minutes.

Backup Systems

Maintaining good backups is essential for providing fault tolerance and reliability.

Steps for designing, configuring, deploying, and maintaining a backup system:

Decide what to back up (user/app data, profile folders, configuration files).
Select backup methods (cloud backups vs. onsite backups) and verify backup hardware/software compatibility.
Decide what backup types will be made regularly (full, incremental, differential).
Decide how often backups are needed (for example, after every four hours of data entry or less).
Develop a backup schedule and governance (who is responsible, what information is kept in the backup logs).
Regularly verify backups are being performed (attempt to recover files from backup media/system).
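
The choice of backup type determines which backups are needed for a restore. The sketch below assumes the usual definitions, which these notes do not spell out: an incremental backup holds changes since the previous backup of any type, and a differential backup holds all changes since the last full backup. It also assumes a schedule that does not mix the two types after a given full backup; restore_chain is a hypothetical helper.

```python
def restore_chain(backups):
    """Return the backups needed to restore to the most recent state.

    backups is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential".
    """
    full_indexes = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup available to restore from")
    start = full_indexes[-1]
    chain = [backups[start]]
    later = backups[start + 1:]
    differentials = [b for b in later if b[1] == "differential"]
    if differentials:
        # The latest differential already contains every change since the full.
        chain.append(differentials[-1])
    else:
        # Every incremental since the full is required, in order.
        chain.extend(b for b in later if b[1] == "incremental")
    return chain
```

Incrementals are quicker to create but lengthen the restore chain, while differentials grow larger each day but restore from just two backups.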

The 3-2-1-1 Rule of data recovery defines the following backup principles.

3 - Keep at least three complete copies of the data.
2 - Save backups on at least two different media types.
1 - Store at least one backup copy offsite.
1 - Keep at least one backup copy offline for protection against ransomware.
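
The four parts of the rule can be checked mechanically against an inventory of backup copies. The BackupCopy record and check_3211 function below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    media: str     # for example "disk", "tape", or "cloud"
    offsite: bool  # stored away from the primary location
    offline: bool  # disconnected from the network


def check_3211(copies):
    """Report which parts of the 3-2-1-1 rule a set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_offline": any(c.offline for c in copies),
    }
```

A plan passes only when every value in the returned dictionary is True.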

Factors affecting decisions can be defined using the RTO (Recovery Time Objective) and RPO (Recovery Point Objective):

RTO: The maximum length of time your network can reasonably tolerate an outage.
RPO: The amount of historical data you’ll need to be able to restore from backup in response to an outage.
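
A rough way to test a backup plan against these objectives is sketched below. It assumes the worst-case data loss equals the interval between backups and that a restore-time estimate is available from recovery tests; the function and parameter names are hypothetical.

```python
def evaluate_plan(backup_interval_hours, estimated_restore_hours, rpo_hours, rto_hours):
    """Compare a backup plan against the RPO and RTO targets."""
    return {
        # Worst case: the outage hits just before the next scheduled backup.
        "meets_rpo": backup_interval_hours <= rpo_hours,
        "meets_rto": estimated_restore_hours <= rto_hours,
    }
```

For example, nightly backups cannot satisfy a 4-hour RPO no matter how quickly the restore itself completes.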