Recall the definition of statistics
Statistics is the science of learning from data. It is a broad field that touches every area of study, from physics to social sciences to finance, providing tools that help separate fact from fiction.
In technical analysis, statistical measures are core ingredients in many popular classical and modern technical indicators and trading strategies. Understanding statistical analysis is what separates professional technical analysts from hobbyists dabbling in trading.
Outline the difference between descriptive and inferential statistics
Descriptive statistics focuses on describing or summarizing data using quantitative or visual tools. This is why we often call them “summary statistics.” For example, when analyzing market data, descriptive statistics help us understand what happened in the past by organizing and presenting the data in a meaningful way.
On the other hand, inferential statistics builds on descriptive statistics to draw conclusions or make inferences about the broader population based on sample data. In technical analysis, this is particularly important because we often use sample data (like historical price movements) to make predictions or draw conclusions about future market behavior.
For instance, when we analyze a trading system’s returns, we use descriptive statistics to summarize its past performance. Then, we use inferential statistics to determine if those results suggest the strategy might be profitable in the future, or if the results could have occurred by random chance.
Define sample and population
A population is the complete collection of events, persons, things, or objects under study, which often cannot be observed directly. For example, in technical analysis, the population could be all future market outcomes, which are unknowable.
A sample is a portion of the population that we can actually study. The sample generates statistics (values or descriptors) that describe properties of the sample, which we hope will accurately describe the population and its parameters. For instance, when analyzing a trading system, we might use historical price data as our sample to make inferences about how the system might perform in the future (the population).
It’s important to note that while statistics measure sample data, parameters describe population data. Sometimes samples will accurately describe the population, but other times, due to natural sampling variability, they will not.
Explain the two main types of data
Quantitative data: Consists of measures and counts. This type of data answers questions such as how long, how many, and how far.
Categorical data: Also known as qualitative data, this type is grouped based on characteristics that are not quantitative, such as gender, color, or hometowns.
Describe the three most common measures of central tendency: arithmetic mean, median, and mode
Arithmetic Mean: The arithmetic mean is what most people refer to as “average.” It is calculated by summing all observations and dividing by the number of observations. While it’s the most common measure, it can be heavily impacted by outliers.
Median: The median is the middle number in a dataset when arranged in order. It divides the dataset in half and is less sensitive to outliers than the mean. For datasets with an even number of observations, take the mean of the middle two values.
Mode: The mode is the observation that occurs most frequently in a dataset. While it’s the least commonly used of the three measures, it’s particularly valuable for categorical data. A dataset can have no mode or multiple modes (a dataset with two modes is called bimodal).
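A minimal Python sketch of the three measures using the standard library’s statistics module; the return values are made-up illustrative figures, not data from the text:

```python
import statistics

# Hypothetical sample of daily returns (%) -- illustrative values only
returns = [1.2, -0.5, 0.8, 1.2, -2.3, 0.8, 1.2]

print(statistics.mean(returns))    # arithmetic mean: sum / count ~= 0.34
print(statistics.median(returns))  # middle value of the sorted data = 0.8
print(statistics.mode(returns))    # most frequent observation = 1.2
```

Note how the single outlier (-2.3) drags the mean well below the median, illustrating the mean’s sensitivity to extreme values.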
Discuss alternative methods of calculating means and their uses
There are several alternative methods for calculating means in statistics, each suited for different types of data and analysis requirements:
Arithmetic Mean: The arithmetic mean is the most common method, calculated by summing all observations and dividing by the number of observations. While simple to calculate, it can be heavily impacted by outliers and treats all observations equally.
Geometric Mean: The geometric mean is particularly valuable in finance as it helps calculate compound rates of return. It works with scaling factors rather than absolute values, making it ideal for portfolio analysis. This method better represents actual investment performance when dealing with time series data.
Weighted Arithmetic Mean: The weighted arithmetic mean addresses the limitation of equal observation weights in the simple arithmetic mean. It’s particularly useful in technical analysis for calculating moving averages, where recent data may need more weight than older data. The exponential moving average is an example where weighting factors decrease exponentially with data age.
Each method has its strengths and limitations, and selecting the appropriate mean calculation depends on your data characteristics and analysis goals.
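As a rough sketch of the weighted approaches described above, here is one way to compute a linearly weighted mean and a simple exponential moving average in Python. The prices and the alpha = 2/(n+1) smoothing convention are assumptions for illustration, not prescriptions from the text:

```python
# Weighted arithmetic mean: each observation x_i receives a weight w_i,
# result = sum(w_i * x_i) / sum(w_i)
def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical closing prices, oldest first
closes = [100.0, 101.5, 99.8, 102.2, 103.0]

# Linear weights: the most recent bar gets the largest weight
print(weighted_mean(closes, weights=[1, 2, 3, 4, 5]))

# Exponential moving average: weights decay exponentially with data age
def ema(values, n):
    alpha = 2 / (n + 1)        # a common smoothing convention
    avg = values[0]
    for x in values[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

print(ema(closes, n=5))
```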
State why the geometric mean is so important to investors
The geometric mean is critically important to investors because it solves a major problem with time series datasets that make arithmetic means misleading. When dealing with investment returns over time, the arithmetic mean can give a deceptive picture of performance.
For example, consider a stock that loses 50% in year one and gains 75% in year two. The arithmetic mean would suggest a positive 12.5% average return. However, a $100 investment would actually result in a loss, dropping to $50 after year one and only recovering to $87.50 after year two.
The geometric mean accounts for this compounding effect by treating the observations as scaling factors rather than absolute values. In the example above, it correctly shows a compound annual return of -6.46%, which accurately reflects the actual investment performance. This makes the geometric mean essential for understanding true investment returns over multiple time periods.
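A short Python check of the numbers in this example; the two-year return sequence comes straight from the text:

```python
# -50% in year one, +75% in year two, expressed as scaling factors
factors = [0.50, 1.75]

arithmetic = (-0.50 + 0.75) / 2
print(arithmetic)                       # 0.125 -> misleading +12.5% "average"

# Geometric mean of the scaling factors, minus 1
geometric = (factors[0] * factors[1]) ** (1 / len(factors)) - 1
print(geometric)                        # about -0.0646 -> -6.46% compounded

# Sanity check: $100 compounded through both years
print(100 * factors[0] * factors[1])    # 87.50
```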
Describe what is meant by “measures of dispersion”
Measures of dispersion help us understand how spread out or scattered data points are from their central tendency (like the mean or median). While measures of central tendency tell us about the typical or middle values in our data, measures of dispersion tell us about the variability or volatility in the data.
In financial markets and technical analysis, measures of dispersion are particularly important because they help quantify risk. They show us how much actual values tend to deviate from expected values. For example, when analyzing market returns, dispersion measures can tell us how volatile those returns are – whether they tend to stay close to the average or frequently swing to extremes.
There are several key measures of dispersion that analysts commonly use:
Variance: Measures the average squared deviation from the mean
Standard deviation: The square root of variance, giving us a measure in the same units as our original data
Z-scores: Tell us how many standard deviations an observation is from the mean
These measures help analysts understand not just what typical values look like, but also how reliable or predictable those typical values are.
Explain two measures of dispersion: standard deviation and variance
Variance and standard deviation are two fundamental measures of dispersion in statistics that help us understand how spread out our data is from the mean. Let’s explore each:
Variance: Sample variance is defined as the average squared deviation from the mean. It measures how far numbers in a dataset spread from their average value. The formula for variance is:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Where s² is the sample variance, ∑ indicates summation over all observations, xᵢ is each observation, x̄ is the sample’s mean, and n − 1 is the number of observations minus one (Bessel’s Correction).
Standard Deviation: The standard deviation is simply the square root of variance. We often prefer standard deviation because it gives us a measure in the same units as our original data. The formula is:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Using our Dow Jones example from 2007-2017, the calculated variance is 0.0335 (3.35%, in squared-return units) and the standard deviation is 18.3%. This tells us that approximately 68% of annual returns fall between -10.5% and 26.1% (the mean of 7.8% ± one standard deviation).
Both measures are sensitive to outliers in the data, just like the mean. In cases where outliers are a concern, analysts often prefer using the median with interquartile range as alternative measures.
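The following Python sketch computes both measures directly from the formulas above; the returns are hypothetical illustrative values, not the actual Dow Jones 2007-2017 series:

```python
import math

# Hypothetical annual returns (as decimals) -- illustrative values only
returns = [0.08, -0.34, 0.19, 0.11, 0.05, 0.07, 0.27, 0.08, -0.02, 0.13, 0.25]

n = len(returns)
mean = sum(returns) / n

# Sample variance: average squared deviation, using Bessel's correction (n - 1)
variance = sum((x - mean) ** 2 for x in returns) / (n - 1)

# Standard deviation: square root of the variance, same units as the data
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```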
State what z-scores measure and how they can be used
Z-scores measure how many standard deviations an observation is from the mean of a dataset. They are calculated by taking the difference between an observation and the mean, then dividing by the standard deviation. The formula is:
$$z_i = \frac{x_i - \bar{x}}{s}$$
Z-scores serve two important purposes in technical analysis:
They allow us to standardize measurements, making it possible to compare observations from different samples
They help quantify how extreme or unusual a particular observation is relative to the mean
For example, a return of -34% produces a z-score of -2.28, indicating it was 2.28 standard deviations below the mean return of 7.8%. A positive z-score indicates the observation is above the mean, while a negative z-score shows it’s below the mean.
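A quick Python sketch using the mean and standard deviation quoted above (7.8% and 18.3%):

```python
# z-score: how many standard deviations an observation sits from the mean
mean, std_dev = 0.078, 0.183

def z_score(x):
    return (x - mean) / std_dev

print(z_score(-0.34))   # about -2.28: 2.28 standard deviations below the mean
print(z_score(0.261))   # +1.0: exactly one standard deviation above the mean
```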
State the value of data visualization as a complement to descriptive statistics
Data visualization serves as an essential complement to descriptive statistics in several ways. While descriptive statistics can summarize datasets numerically, they may not reveal the complete picture. As demonstrated by Anscombe’s quartet, four distinctly different datasets can share identical summary statistics (means, variances, correlations, regression lines, and coefficients of determination), yet their visual representations show dramatically different patterns.
Visual methods provide an instant snapshot of what is really happening in the data. Through tools like histograms, box plots, and scatterplots, analysts can quickly identify patterns, distributions, and relationships that might not be apparent from numerical summaries alone. For example, histograms can show how data is distributed across variables, while box plots can effectively reveal outliers and data spread.
The combination of visual and statistical methods allows for more robust analysis and helps prevent misinterpretation of data that might occur when relying solely on numerical summaries. This is particularly valuable in technical analysis, where visual representations of market data have been used effectively for centuries.
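This can be verified numerically. The sketch below takes the first two datasets of Anscombe’s quartet and shows that their summary statistics agree to two or three decimals, even though a scatterplot of dataset I is roughly linear while dataset II follows a smooth curve (statistics.correlation requires Python 3.10+):

```python
import statistics

# First two datasets of Anscombe's quartet (Anscombe, 1973)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for y in (y1, y2):
    print(round(statistics.mean(y), 2),            # ~7.50 for both
          round(statistics.variance(y), 2),        # ~4.13 for both
          round(statistics.correlation(x, y), 3))  # ~0.816 for both
```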
Explain how to calculate outliers in a data set
There are several methods to calculate outliers in a dataset:
Using the Interquartile Range (IQR) method:
Calculate the interquartile range (IQR) by finding the difference between the third and first quartiles
Multiply the IQR by 1.5
Subtract this value from the first quartile and add it to the third quartile
Any values outside these boundaries are considered outliers
Using z-scores:
Calculate how far each value is from the mean in terms of standard deviations
Values with z-scores beyond +3 or -3 (three standard deviations) are considered outliers
This works because in normal distributions, 99.7% of data falls within three standard deviations
Visual inspection method:
Enter all values into a spreadsheet
Sort the list to identify values that are notably different from others
Visually scan for unusual values
Remember to always validate whether identified outliers are genuine data points or errors before taking any action.
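A minimal Python sketch of the first two methods; the sample values are made up for illustration:

```python
import statistics

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

def z_outliers(data, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean, sd = statistics.mean(data), statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

# Hypothetical sample with one obvious outlier
sample = [10, 12, 11, 13, 12, 11, 10, 12, 45]
print(iqr_outliers(sample))   # [45]
print(z_outliers(sample))     # may be empty: z rarely exceeds 3 in tiny samples
```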
Express what scatterplots are used for
Scatterplots are effective tools for analyzing two-variable quantitative data. They describe the shape, direction, and strength of the relation between two variables x and y, where each point (x, y) is a pair of measurements.
Identify the three features of a data set that scatterplots describe
Scatterplots describe three key features of a dataset:
Shape: Whether the relationship between variables is linear or nonlinear
Direction: Whether the y-value increases (positive relationship) or decreases (negative relationship) as the x-value changes
Strength: How closely the data points follow the trend line – points close to the line indicate a strong relationship, loosely scattered points indicate a weaker relationship, and randomly scattered points indicate no relationship
Define Pearson’s Correlation Coefficient r
Pearson’s Correlation Coefficient, r, is a numerical measure used to assess the linear relationship between two variables, specifically the direction and strength of the relationship.
The formula for calculating r is:
$$r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}$$
Where:
r is Pearson’s Correlation Coefficient
∑ indicates summation over all n paired observations (xᵢ, yᵢ)
x̄ and ȳ are the means of each variable
s_x and s_y are the standard deviations of each variable
The coefficient has two key aspects:
Direction: The positive or negative sign of r describes the direction of the linear relation between the two variables:
A positive value indicates a positive relation between x and y – they move in the same direction
A negative value indicates a negative relation between x and y – they move in opposite directions
Strength: The magnitude of r describes the strength of the linear relation:
Values range from -1 to +1 (-1 ≤ r ≤ 1)
1 or -1 indicates a perfect positive or negative correlation
As values move towards 0, the strength of the relationship decreases
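The sketch below implements the formula above directly; the paired returns are hypothetical illustrative values:

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sum of the products of the paired z-scores, divided by n - 1
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical index returns (x) vs. stock returns (y)
x = [0.01, -0.02, 0.015, 0.03, -0.01]
y = [0.012, -0.015, 0.02, 0.025, -0.008]
print(pearson_r(x, y))   # about 0.99: a strongly positive linear relation
```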
Differentiate between correlation and causation
The key distinction between correlation and causation is that just because two variables are correlated (move together or have a relationship) does not mean that one causes the other; a third variable may be influencing both. A classic example: seeing fire trucks is highly correlated with large house fires, but this does not mean that fire trucks cause large house fires.
State what a linear model can be used to determine
Having established that two variables are related to each other, a linear model can be used to determine how changes in one can allow for the prediction of changes in the other. When working with a linear model, a line of best fit is needed because the line’s equation will be used to estimate how the dependent variable (y-axis) will be influenced by the independent variable (x-axis).
For example, if analyzing stock market data, a linear model could help predict how changes in a major index might affect an individual stock’s price. The model provides a mathematical way to estimate these relationships and make predictions about future changes.
Recall the linear regression equation
The linear regression equation is:
$$Y = a + bX$$
Y is the dependent or response variable
X is the independent or predictor variable
a is the y-intercept. It is the value of Y for X = 0
b is the slope of the line
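A minimal least-squares fit in Python (statistics.linear_regression requires Python 3.10+; the paired returns are hypothetical illustrative values):

```python
import statistics

# Hypothetical paired returns: index (independent) vs. stock (dependent)
xs = [0.01, -0.02, 0.015, 0.03, -0.01]
ys = [0.012, -0.015, 0.02, 0.025, -0.008]

# Returns (slope, intercept) for the least-squares line Y = a + bX
b, a = statistics.linear_regression(xs, ys)
print(a, b)

# Use the fitted line to predict the stock's return for a 1.5% index move
print(a + b * 0.015)
```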
Examine the use of regression analysis in technical studies
Regression analysis in technical studies serves several important purposes:
It helps predict the movement of one financial instrument based on another, as shown in the Broadcom/Nasdaq-100 example where a 1.5% move in the index predicted a 1.73% move in Broadcom
The coefficient of determination (r²) measures how much of a security’s price movement can be explained by another variable – for example, 50% of Broadcom’s daily price movements were attributable to Nasdaq-100 movements
Multiple regression allows analysis of relationships between more than two variables, as demonstrated by the three-dimensional analysis of Apple, Amazon, and Alphabet returns
The least-squares regression line provides a mathematical model for these relationships by minimizing the sum of squared errors, making it a valuable tool for technical analysis and prediction.
Compare coefficients of correlation and determination
The coefficients of correlation (r) and determination (r²) are related but distinct measures of relationships between variables:
Pearson’s Correlation Coefficient (r):
Measures both direction and strength of linear relationships
Values range from -1 to +1
Sign indicates direction (positive or negative relationship)
Absolute value indicates strength (closer to 1 = stronger relationship)
Coefficient of Determination (r²):
Is literally r squared
Measures percentage of variation in Y-values attributable to variation in X-values
Values range from 0 to 1
Expressed as a percentage (e.g., r² = 0.5 means 50% of variation is explained)
For example, if r = 0.71 (a moderately strong positive correlation), then r² ≈ 0.50, meaning about 50% of the variation in one variable can be explained by changes in the other variable.
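Continuing with the same hypothetical paired returns used earlier, the relationship between the two coefficients is just a squaring step (statistics.correlation requires Python 3.10+):

```python
import statistics

# Hypothetical index returns (x) vs. stock returns (y)
x = [0.01, -0.02, 0.015, 0.03, -0.01]
y = [0.012, -0.015, 0.02, 0.025, -0.008]

r = statistics.correlation(x, y)
print(r)        # direction and strength of the linear relation
print(r ** 2)   # share of variation in y explained by variation in x
```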
Define probability
Probability measures the extent to which an event is likely to occur. Probabilities are measured on a scale from 0 to 1, where 0 indicates the event is impossible and 1 indicates the event is certain. For a given event (E) with N mutually exclusive and equally likely outcomes, the probability of E is classically defined as:
$$P(E) = \frac{N_E}{N}$$
This means the probability of E is the number of potential outcomes that result in the event divided by the number of total possible outcomes. For example, with a fair coin flip, the probability of getting heads is:
$$P(H) = \frac{\text{Heads}}{\text{Heads or Tails}} = \frac{1}{2}$$
Explain the impact of the law of large numbers on a series of outcomes
Based on the law of large numbers, while any individual outcome remains random, over many repetitions the average of the results will converge with the expected probability. For example, in coin flips, while each flip is random, the empirical probability (actual observed frequency) of getting heads will eventually converge with the theoretical probability of 50% as the number of flips increases.
This principle has important implications for trading strategies. Just as with coin flips, while any single trade outcome is random, a genuinely profitable strategy should see its observed win rate converge with its true win rate over a large number of trades. However, this also means that valid trading systems can still experience prolonged streaks of underperformance due to randomness, especially over smaller sample sizes.
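A simple coin-flip simulation makes the convergence visible; the seed and checkpoints are arbitrary choices for reproducibility:

```python
import random

random.seed(1)   # fixed seed so the run is reproducible

flips = 0
heads = 0
for checkpoint in (10, 100, 1_000, 10_000, 100_000):
    while flips < checkpoint:
        heads += random.random() < 0.5   # one fair coin flip
        flips += 1
    # empirical frequency drifts toward the theoretical 0.5 as n grows
    print(checkpoint, heads / flips)
```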
Define random variable and the phrase “independent and identically distributed”
A random variable is a result that is dictated by a random phenomenon. For example, with a coin flip, while we know it will result in either heads or tails, the specific result on any given flip is random.
The phrase “independent and identically distributed” (often abbreviated i.i.d.) refers to two key properties:
Independence means that the occurrence of one event does not affect the probability of the occurrence of another. For example, with coin flips, getting heads on one flip does not impact the probability of getting heads on the next flip.
Identically distributed means the variables have the same probability distribution. For instance, when flipping two identical coins, each has the same probability of resulting in heads. However, identically distributed does not necessarily mean outcomes must be equally probable – two coins that each have a 70% probability of heads would still be identically distributed.
It’s important to note that many statistical calculations rely on the i.i.d. assumption. However, research has shown that financial return series data is not an i.i.d. process.
Describe a normal probability distribution
The normal distribution (also called Gaussian distribution) is the best-known bell curve probability distribution. It has several key characteristics:
Normal distributions tend to be symmetric around a mean
The mean, median, and mode are close to equal
Normal distributions are denser in the center and less dense in their tails
A normal distribution is fully described by its mean and standard deviation; the standard normal distribution has a mean of 0 and a standard deviation of 1
Normal distributions follow the Empirical Rule
The normal distribution can be found in many applications, like human height and weight across populations. Understanding where data lies within a normal distribution is useful for evaluating strategy performance and the likelihood of surprising events.
State the Empirical Rule
The 68-95-99.7 rule, also known as the empirical rule, states that approximately 68% of the data values in a normal distribution are within one standard deviation of the mean, 95% are within two standard deviations of the mean, and 99.7% are within three standard deviations of the mean.
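A quick empirical check, assuming a large simulated normal sample rather than real market data:

```python
import random
import statistics

random.seed(0)

# Draw a large sample from a normal distribution (mean 0, std dev 1)
data = [random.gauss(0, 1) for _ in range(100_000)]
mean, sd = statistics.mean(data), statistics.stdev(data)

for k in (1, 2, 3):
    within = sum(abs(x - mean) <= k * sd for x in data) / len(data)
    print(k, round(within, 4))   # approximately 0.68, 0.95, 0.997
```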
Explain skew and kurtosis
Skewness measures the degree to which returns are asymmetric around the mean, describing the tilt of the distribution and the relative length of its tails. A symmetric normal distribution has a skewness of zero. Positive skew (right skew) means the right tail is longer with extreme positive outliers, while negative skew (left skew) means the left tail is longer with extreme negative outliers.
Kurtosis measures the degree to which returns show up in the tails of a distribution. It measures the combined weight or heaviness of the tails compared to the rest of the distribution. There are three categories:
Mesokurtic: Normal distribution with kurtosis value of 3 (excess kurtosis of 0)
Platykurtic: Kurtosis less than 3, with lower/broader peaks and lighter tails (fewer outliers)
Leptokurtic: Kurtosis greater than 3, with taller/sharper peaks and fatter tails (more outliers)
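Both statistics can be computed from the second, third, and fourth central moments of the data. The sketch below uses a simulated normal sample, so skewness should come out near 0 and kurtosis near 3; population moments (dividing by n) keep the sketch simple:

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]

n = len(data)
mean = statistics.mean(data)
m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment

skewness = m3 / m2 ** 1.5   # ~0 for a symmetric distribution
kurtosis = m4 / m2 ** 2     # ~3 for a normal (mesokurtic) distribution
print(skewness, kurtosis, kurtosis - 3)   # last value is the excess kurtosis
```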