22 - Anomaly Detection
Definition: Anomaly detection is the process of identifying patterns or observations that do not conform to expected behavior in datasets.
Identifies rare items or events that deviate from the norm.
Anomalies raise suspicion by significantly differing from expected patterns.
Importance of Anomaly Detection
Why not use normal classification or clustering algorithms?
Traditional methods assume ample training examples of every class, but genuine anomalies are rare by definition.
Complexity: defining what counts as 'abnormal' is difficult, and labeled examples of anomalies are scarce.
Types of Anomalies
Natural Variation: Normal behavior variation among observations.
Errors: Incorrect data points that may indicate anomalies.
Outlier Definition: An observation that is markedly different from other observations, potentially generated by a different mechanism (Douglas Hawkins).
Real-World Applications
Financial Fraud Detection: Identifying credit card transactions outside normal spending patterns.
Example: Unusually large purchase made in an unfamiliar location.
Network Security: Recognizing unusual traffic patterns indicating potential cyber threats.
Example: Dramatic increase in server requests.
Manufacturing Quality Control: Monitoring production processes for anomalies.
Example: Microchips operating outside standard temperature range.
Contextual Anomalies
Data points considered anomalous in specific contexts
Example: Temperature of 25°C in winter vs summer.
Collective Anomalies
A group of related data points that deviate significantly from the whole dataset.
Key Issues to Consider
Attribute Definition of Anomalies: Anomalies may be defined by multiple attributes.
Global vs. Local Perspective:
Global Perspective: Assesses anomalies concerning the overall dataset.
Local Perspective: Focuses on local neighborhoods where data may imply different normal behavior.
Degree of Anomaly: The extent of deviation from normal patterns.
Detection Methods:
One at a time: Sequential detection for streaming data.
Many at once: Batch detection for static datasets, analyzing relationships comprehensively.
Masking and Swamping: masking occurs when clustered anomalies hide one another from detection; swamping occurs when normal points are misclassified as anomalies because nearby true anomalies distort the model.
Detection Techniques
Supervised Learning:
Uses labeled data for training.
Unsupervised Learning:
Identifies outliers without prior labels, relying on data patterns.
Semi-supervised Learning:
Trained on labeled normal data to identify deviations.
Statistical Anomaly Detection
Model-Based Strategies: Construct statistical models to evaluate data fit, where poor fit indicates potential anomalies.
Probabilistic Definitions: Outliers are determined by low probability under the fitted distribution from the model.
Probability Distributions for Anomaly Detection
Gaussian Distribution: the most common model; points lying several standard deviations from the mean are flagged as anomalies.
Bernoulli, Poisson, Binomial, Uniform: Each applicable to different scenarios of anomaly detection based on their characteristics and formulaic assumptions.
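The Gaussian case above can be sketched with the classic z-score rule. This is a minimal illustration using NumPy (assumed available) on synthetic data; the 3-sigma cutoff is a conventional choice, not a universal one:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=1000)  # mostly normal readings
data = np.append(data, [95.0, 2.0])                # two injected anomalies

# Flag points more than 3 standard deviations from the mean (z-score rule).
mean, std = data.mean(), data.std()
z_scores = np.abs(data - mean) / std
anomalies = data[z_scores > 3]
print(anomalies)
```

Under a true Gaussian, only about 0.27% of points exceed 3 standard deviations, so anything beyond that is suspicious.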
Density-Based Methods (DBSCAN)
Methodology: Identifies normal clusters, labeling outliers in low-density regions.
Challenges: May misclassify points in sparse regions.
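A short sketch of density-based detection with scikit-learn's DBSCAN (assumed available), on synthetic data: points that cannot be reached from any dense region receive the noise label -1. Note that scikit-learn's min_samples count includes the point itself:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster = rng.normal(0, 0.3, size=(100, 2))       # one dense normal cluster
outliers = np.array([[3.0, 3.0], [-3.0, 2.5]])    # isolated points far away
X = np.vstack([cluster, outliers])

# Points in low-density regions get the label -1 (noise), i.e. anomalies.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])
```

The eps and min_samples values here are tuned to this toy data; on sparse real data, poorly chosen values produce exactly the misclassification noted above.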
Mixture Models for Anomaly Detection
Assumes the data is generated by a mixture of several distributions (e.g., Gaussians); parameters and point assignments are refined iteratively (typically via expectation-maximization), and points with low likelihood under every component are flagged.
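A minimal mixture-model sketch, assuming scikit-learn's GaussianMixture and synthetic two-component data; the 1%-likelihood cutoff is an illustrative threshold, not a standard:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),    # component 1
    rng.normal(6, 1, size=(200, 2)),    # component 2
    [[20.0, 20.0]],                     # far-away anomaly
])

# Fit a two-component mixture; anomalies get very low log-likelihood.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
scores = gmm.score_samples(X)               # per-point log-likelihood
threshold = np.percentile(scores, 1)        # flag the lowest 1% of points
anomalies = np.where(scores < threshold)[0]
print(anomalies)
```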
Proximity-Based Detection
Based on spatial relationships, identifying anomalies as points distanced significantly from neighbors.
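One common proximity score is the average distance to a point's k nearest neighbors. A sketch on synthetic data, assuming scikit-learn's NearestNeighbors; k = 5 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), [[8.0, 8.0]]])

# Score each point by its mean distance to its k nearest neighbors;
# points far from all neighbors are the most anomalous.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
dists, _ = nn.kneighbors(X)
scores = dists[:, 1:].mean(axis=1)                # drop the self-distance in column 0
print(np.argmax(scores))                          # index of the most isolated point
```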
Clustering-Based Detection
Two-step Technique: Initial clustering followed by outlier removal and re-clustering to refine anomaly identification.
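The two-step technique can be sketched with k-means, assuming scikit-learn and synthetic data: cluster, score each point by distance to its assigned centroid, drop the worst-fitting points, then re-cluster on the rest. The 99th-percentile cutoff is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0, 0.5, size=(100, 2)),
    rng.normal(5, 0.5, size=(100, 2)),
    [[10.0, -5.0]],                     # point belonging to no cluster
])

# Step 1: cluster everything, score by distance to the assigned centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Step 2: drop the worst-fitting points and re-cluster on the remainder,
# so centroids are no longer pulled toward the outliers.
keep = dist < np.percentile(dist, 99)
km_refined = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[keep])
print(np.where(~keep)[0])              # indices flagged as outliers
```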
Isolation Forests
Principle: anomalies are few and different, so random partitioning isolates them in fewer splits than normal points.
Key steps: build many trees from random feature splits, then score each point by its average path length; short paths (quick isolation) indicate anomalies.
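The steps above map directly onto scikit-learn's IsolationForest (assumed available); the contamination value, which sets the expected anomaly fraction, is an illustrative choice for this synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

# Random axis-aligned splits isolate rare points in few partitions,
# giving them short average path lengths and a prediction of -1.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)                  # -1 = anomaly, 1 = normal
print(np.where(pred == -1)[0])
```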
One-Class SVMs
Utilized for novelty detection: trained on normal data only, the model learns a boundary enclosing the normal region and flags points falling outside it as anomalies.
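A minimal novelty-detection sketch, assuming scikit-learn's OneClassSVM and synthetic data; nu upper-bounds the fraction of training points treated as outliers and is an illustrative setting here:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0, 1, size=(200, 2))         # normal data only
X_test = np.array([[0.1, -0.2], [5.0, 5.0]])      # one typical point, one novel point

# Fit on normal data; predict returns 1 for inliers, -1 for anomalies.
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(oc.predict(X_test))
```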
Summary of Techniques
Model-Based: Deviations from statistical models signal anomalies.
Clustering: Points not conforming to clusters are flagged.
Support Vector: Detects anomalies based on hyperplane separation from normal data.
Proximity: Flags data distanced from a dataset majority as anomalies.
Performance Evaluation Metrics
Key metrics include Precision, Recall, F1 Score, and Area Under ROC Curve; considerations include False Positive Rate.
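These metrics can be computed with scikit-learn (assumed available); the labels and scores below are a made-up toy example. Note that Precision, Recall, and F1 depend on the chosen decision threshold, while AUC evaluates the score ranking independently of any threshold:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])    # 1 = true anomaly
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.2, 0.1, 0.6, 0.8, 0.9, 0.4])
y_pred = (scores >= 0.5).astype(int)                  # thresholded decisions

precision = precision_score(y_true, y_pred)   # flagged points that are real anomalies
recall = recall_score(y_true, y_pred)         # real anomalies that were caught
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two
auc = roc_auc_score(y_true, scores)           # threshold-free ranking quality
print(precision, recall, f1, auc)
```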
Implementation Challenges
Class Imbalance: Adjusting thresholds or employing techniques like SMOTE.
High Dimensionality: Feature selection or PCA to simplify data complexity.
Real-time Detection Needs: Implementation of streaming algorithms for immediate analysis.
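The high-dimensionality mitigation can be sketched with scikit-learn's PCA (assumed available) on synthetic data; the 90% explained-variance target is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 50))                 # 50 mostly-noise features
X[:, :3] += rng.normal(size=(500, 1)) * 5      # correlated signal in a few features

# Keep only the components explaining 90% of the variance, so downstream
# distance-based detectors are not diluted by irrelevant dimensions.
X_reduced = PCA(n_components=0.90).fit_transform(X)
print(X_reduced.shape)
```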
Applications of Anomaly Detection
From fraud detection and cybersecurity to industrial inspection and market surveillance, anomaly detection plays a crucial role across sectors, providing insights that guide decisions and strategy adaptation.
Pros of Anomaly Detection
Early Detection of Issues: Anomaly detection can identify problems before they escalate, leading to timely resolution.
Automation: It allows for automated monitoring of systems and datasets, reducing reliance on human analysis.
Improved Security: Particularly in cybersecurity, it helps in detecting threats that do not follow normal patterns.
Customization: Algorithms can be tailored to suit specific domains or operational needs, enhancing relevance and accuracy.
Cons of Anomaly Detection
False Positives: There is a risk of incorrectly flagging normal observations as anomalies, leading to unnecessary investigations.
Dependency on Quality Data: The effectiveness of anomaly detection methods heavily relies on the quality and quantity of historical data.
Complexity in Model Selection: Choosing the right anomaly detection technique for specific contexts can be challenging due to the variety of methods available.
Resource Intensive: Anomaly detection systems may require significant computational resources, particularly for real-time analysis.
When to Use Anomaly Detection
Financial Transactions: Use when monitoring for fraud in financial systems to capture unusual spending behavior.
Cybersecurity: Implement in networks to identify unauthorized access or unusual traffic patterns.
Manufacturing: Employ to maintain quality control by monitoring machine performance and detecting malfunctions early.
Healthcare: Apply in medical data analysis to detect atypical patient behavior or anomalies in vital signs.
Log Analysis: Utilize in IT logs to track unusual patterns that might indicate security breaches or system issues.