Factor Analysis Notes

Factor Analysis

  • Factor analysis (FA) is a statistical technique used to identify coherent subsets of variables within a single set of variables.
  • It aims to discover which variables form relatively independent groups.
  • Variables that are highly correlated are combined into factors.
  • Factors reflect underlying processes that create correlations among variables.
  • Example: Combining personality measures and motivation scale variables to form an "independence factor".
  • A major application of FA in psychology is in the development of objective tests for personality and intelligence measurement.
  • The process involves:
    • Starting with a large number of items.
    • Administering the items to randomly selected participants.
    • Deriving factors.
    • Adding and deleting items based on the factor analysis results.
    • Repeating the process until a test with numerous items forming several factors is achieved.
    • Testing the validity of factors by making predictions about behavior differences based on factor scores.
  • Specific goals of FA:
    • Summarize correlation patterns among observed variables.
    • Reduce a large number of observed variables to a smaller number of factors.
    • Provide an operational definition for an underlying process using observed variables.
    • Test a theory about the nature of underlying processes.
  • FA reduces numerous variables to a few factors.
  • Mathematically, FA produces linear combinations of observed variables, with each linear combination representing a factor.
  • Factors summarize correlation patterns and can reproduce the observed correlation matrix (see the sketch at the end of this section).
  • Using factor analysis provides parsimony since the number of factors is usually less than the number of observed variables.
  • Factor scores, when estimated, are often more reliable than individual observed variable scores.
  • Steps in FA:
    • Selecting and measuring a set of variables.
    • Preparing the correlation matrix.
    • Extracting a set of factors from the correlation matrix.
    • Determining the number of factors.
    • Rotating the factors to increase interpretability.
    • Interpreting the results.
  • Interpretability is an important test of the analysis.
  • A good FA should be interpretable.
  • A factor is more easily interpreted when several observed variables correlate highly with it, and these variables do not correlate with other factors.
  • The final step is to verify the factor structure by establishing construct validity.
  • This involves demonstrating that scores on latent variables (factors) co-vary with other variables or change with experimental conditions as predicted by theory.
  • Problems with FA:
    • Lack of readily available criteria against which to test the solution.
    • The presence of an infinite number of rotations available after extraction.
    • The dependence on the researcher's assessment of interpretability and scientific utility.
    • The common use of FA to "save" poorly conceived research.
  • Two major types of FA:
    • Exploratory FA: Seeks to describe and summarize data by grouping correlated variables.
    • Confirmatory FA: Tests a theory about latent processes.
  • Basic terms and definitions in FA:
    • Observed correlation matrix: The correlation matrix produced by the observed variables.
    • Reproduced correlation matrix: The correlation matrix produced from factors.
    • Residual correlation matrix: The difference between observed and reproduced correlation matrices.
    • Rotation of factors: A process by which the solution is made more interpretable without changing its underlying mathematical properties.
      • Orthogonal rotation: Factors are uncorrelated with each other, producing a loading matrix.
      • Oblique rotation: Factors are correlated, producing a factor correlation matrix, a structure matrix, and a pattern matrix.
    • Factor-score coefficients matrix: A matrix of coefficients used to predict scores on factors from scores on observed variables.
  • Factors are thought to "cause" variables.
  • Exploratory FA is associated with theory development, and confirmatory FA is associated with theory testing.
    • Exploratory FA asks: “What are the underlying processes that could have produced correlations among these variables?”
    • Confirmatory FA asks: “Are the correlations among variables consistent with a hypothesized factor structure?”
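
A minimal numpy sketch of the mathematical points above (factors as linear combinations of observed variables; observed, reproduced, and residual correlation matrices). All data and loadings are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 300 respondents on 6 observed variables driven by 2
# uncorrelated latent factors (the "underlying processes").
n = 300
factors = rng.normal(size=(n, 2))
true_loadings = np.array([[.8, .0], [.7, .1], [.6, .0],   # factor 1 items
                          [.0, .8], [.1, .7], [.0, .6]])  # factor 2 items
unique = rng.normal(size=(n, 6)) * np.sqrt(1 - (true_loadings**2).sum(axis=1))
X = factors @ true_loadings.T + unique

R_observed = np.corrcoef(X, rowvar=False)        # observed correlation matrix

# Principal-component-style extraction of 2 factors via eigen-decomposition.
vals, vecs = np.linalg.eigh(R_observed)
order = np.argsort(vals)[::-1]
loadings = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])

R_reproduced = loadings @ loadings.T             # reproduced correlation matrix
R_residual = R_observed - R_reproduced           # residual correlation matrix
# Off-diagonal residuals near zero mean the factors summarize the
# correlation pattern well.
print(np.round(R_residual, 2))
```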

Factor Analysis Decision Tree

  • Research Problem
    • Is the analysis exploratory or confirmatory?
  • Confirmatory
    • Confirmatory Factor Analysis
    • Structural Equation Modelling
  • Exploratory
    • What is being grouped - variables or cases?
      • Cases
        • Q-type factor analysis or cluster analysis
      • Variables
        • R-type factor analysis
        • Selecting a factor method
          • Is total variance or only common variance analysed?
            • Total variance
              • Extract factors with components analysis
            • Common variance
              • Extract factors with common factor analysis
        • Selecting a rotational method
          • Should the factors be correlated (oblique) or uncorrelated (orthogonal)?
            • No
              • Orthogonal methods
                • VARIMAX
                • EQUIMAX
                • QUARTIMAX
            • Yes
              • Oblique methods
                • Oblimin
                • Promax
                • Orthoblique
  • Research Design
    • What variables are included?
    • How are the variables measured?
    • What is the desired sample size?
  • Assumptions
    • Statistical: normality, linearity & homoscedasticity
    • Homogeneity of sample
    • Conceptual linkages
  • Specifying the Factor Matrix
    • Determine the number of factors to be retained
  • Interpreting the Rotated Factor Matrix
    • Can significant loadings be found?
    • Can factors be named?
    • Are communalities sufficient?
      • No
        • Factor Model re-specification
          • Were any variables deleted?
          • Do you want to change the number of factors?
          • Do you want another type of rotation?
  • Validation of Factor Matrix
    • Split/multiple samples
    • Separate analysis for subgroups
    • Identify influential cases
    • Selection of surrogate variables
    • Computation of factor scores
    • Creation of summated scales
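
The orthogonal VARIMAX method listed in the tree can be sketched compactly. This follows the widely circulated SVD-based formulation of the algorithm; it is an illustrative sketch, not a production routine.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=50, tol=1e-6):
    """Rotate a (variables x factors) loading matrix toward simple structure."""
    p, k = loadings.shape
    R = np.eye(k)                      # accumulated orthogonal rotation
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion.
        grad = L**3 - (gamma / p) * L * (L**2).sum(axis=0)
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):      # stop when the criterion plateaus
            break
        d = d_new
    return loadings @ R                # rotated loadings; communalities unchanged
```

Because the rotation matrix R is orthogonal, the rotated solution reproduces the same correlations as the unrotated one, which is why rotation improves interpretability without changing the underlying mathematical properties. Oblique methods such as oblimin and promax are not sketched here.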

Exploratory Factor Analysis (EFA)

  • EFA is an interdependence technique used to define the underlying structure among variables.
  • It analyzes the interrelationships (correlations) among a large number of variables.
  • It defines sets of highly interrelated variables, known as factors.
  • These factors are assumed to represent dimensions within the data.
  • If the goal is only to reduce the number of variables, the dimensions can guide the creation of new composite measures.
  • If there is a conceptual basis, the dimensions may have meaning for what they collectively represent.
  • Example: Store atmosphere, defined by sensory components.
  • The general purpose of EFA is to condense information into a smaller set of new, composite dimensions or variates (factors) with minimal information loss.
  • It searches for and defines fundamental constructs or dimensions assumed to underlie the original variables. Meeting these objectives involves four issues:
    • Specifying the unit of analysis.
    • Achieving data summarization and/or data reduction.
    • Variable selection.
    • Using factor analysis results with other multivariate techniques.

Specifying the Unit of Analysis

  • EFA can identify the structure of relationships among variables or respondents.
  • If the objective is to summarise characteristics, EFA is applied to a correlation matrix of the variables (R factor analysis).
  • If the objective is to combine or condense people, EFA is applied to a correlation matrix of the individual respondents (Q factor analysis).
  • Q factor analysis is less frequently used due to computational difficulties.
  • Researchers typically use cluster analysis to group individual respondents.
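
The R-type versus Q-type distinction is simply a matter of which way the data matrix is correlated, as this small numpy sketch (with made-up dimensions) shows:

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(100, 8))  # 100 respondents x 8 variables

R_input = np.corrcoef(X, rowvar=False)  # 8 x 8: correlations among variables (R-type)
Q_input = np.corrcoef(X)                # 100 x 100: correlations among respondents (Q-type)
```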

Achieving Data Summarization Versus Data Reduction

  • EFA provides data summarization and data reduction.
  • In summarizing data, EFA derives underlying dimensions that describe the data in a smaller number of concepts.
  • Data reduction extends this process by deriving a factor score for each dimension and substituting this value for the original values.
Data Summarization
  • Data summarization involves the definition of structure.
  • The researcher can view the set of variables at various levels of generalization.
  • EFA differs from dependence techniques because all variables are simultaneously considered.
  • In EFA, variates (factors) are formed to maximize their explanation of the entire variable set.
  • The goal of data summarization is achieved by defining a small number of factors that adequately represent the original set of variables.
  • Structure is defined by the interrelatedness among variables.
Data Reduction
  • EFA can be used to achieve data reduction by either:
    • Identifying representative variables for use in subsequent multivariate analyses.
    • Creating an entirely new set of variables to replace the original set of variables.
  • The purpose is to retain the nature and character of the original variables but reduce their number.
  • EFA provides the empirical basis for assessing the structure of variables and the potential for creating composite measures.
  • Data summarization makes the identification of underlying dimensions an end in itself.
  • Data reduction relies on factor loadings and uses them as the basis for either identifying variables or making estimates of the factors themselves.
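
A short data-reduction sketch using scikit-learn's FactorAnalysis on simulated data (the two-factor choice is assumed purely for illustration): the estimated factor scores replace the original variables in subsequent analyses.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.default_rng(2).normal(size=(300, 10))  # 300 cases x 10 variables

fa = FactorAnalysis(n_components=2).fit(X)
scores = fa.transform(X)   # 300 x 2 factor scores: a reduced substitute
                           # for the original 300 x 10 data
```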

Variable Selection

  • The researcher should consider the conceptual underpinnings of the variables.
  • The researcher implicitly specifies the potential dimensions that can be identified.
  • EFA will always produce factors.
  • The quality and meaning of the derived factors reflect the conceptual underpinnings of the variables included in the analysis.

Using EFA with Other Multivariate Techniques

  • Factor analysis provides a clear understanding of which variables may act in concert.
  • Variables within a single factor affect stepwise procedures of multiple regression and multiple discriminant analysis.
  • Factor analysis provides the basis for creating a new set of variables that incorporate the character of the original variables.

Designing an EFA

  • Designing an EFA involves three basic decisions:
    • Calculation of the input data (a correlation matrix).
    • Design of the study in terms of number of variables and measurement properties.
    • The sample size necessary.

Correlations Among Variables or Respondents

  • The first decision is calculating the input data for the analysis.
  • Both R-type and Q-type EFA use a correlation matrix as the basic data input.
  • R-type EFA uses a traditional correlation matrix (correlations among variables).
  • Q-type EFA is derived from the correlations between the individual respondents.

Variable Selection and Measurement Issues

  • The primary requirement is that a correlation value can be calculated among all variables.
  • Metric variables are easily measured by several types of correlations.
  • Nonmetric variables are more problematic and are best avoided where possible.
  • If a nonmetric variable must be included, one approach is to define dummy variables (coded 0-1).
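
A hedged example of the dummy-variable approach with pandas (the variable and its categories are invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "south", "east"]})

# One 0-1 column per category; any one column can be dropped to avoid
# perfect multicollinearity among the dummies.
dummies = pd.get_dummies(df["region"], prefix="region", dtype=int)
```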

Number of Variables

  • It is important to minimize the number of variables included while maintaining a reasonable number of variables per factor.
  • If a study is being designed to assess a proposed structure, then it is important to include several variables (5+) that may represent each proposed factor.

Sample Size

  • A sample of fewer than 50 observations should not be factor analysed.
  • The sample size should be 100 or larger.
  • As a general rule, the minimum is to have at least 5 times as many observations as the number of variables to be analysed, and a more acceptable sample size would have a 10:1 ratio.
  • As the number of variables increases, the probability of spurious correlations increases.
  • It is important to obtain the highest cases-per-variable ratio to minimize the chances of overfitting the data.

Conceptual and Statistical Issues

Conceptual Issues

  • A basic assumption of EFA is that some underlying structure exists in the set of selected variables.
  • The researcher must ensure that the observed patterns are conceptually valid.
  • The researcher must also ensure that the sample is homogeneous with respect to the underlying factor structure.

Statistical Issues

  • Departures from normality, homoscedasticity, and linearity only apply to the extent that they diminish the observed correlations.
  • Some degree of multicollinearity is desirable.
Overall Measures of Intercorrelation
  • It is essential to ensure that the data matrix has sufficient correlations to justify the application of EFA.
    • If visual inspection reveals no substantial number of correlations with r > .30, then factor analysis is probably inappropriate.
    • The correlations among variables can also be analysed by computing the partial correlations among variables.
    • SPSS Factor Analysis provides the anti-image correlation matrix, which is the negative value of the partial correlation.
    • The Bartlett test of sphericity tests whether the correlation matrix contains statistically significant correlations among at least some of the variables.
    • The measure of sampling adequacy (MSA) index ranges from 0 to 1.
    • >.80 = meritorious; >.70 = middling; >.60 = mediocre; >.50 = miserable; and < .50 = unacceptable.
    • The MSA increases as:
      • the sample size increases,
      • the average correlations increase,
      • the number of variables increases, or
      • the number of factors decreases.
Variable-Specific Measures of Intercorrelation
  • The researcher should examine the MSA values for each variable and exclude those falling in the unacceptable range.
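
Both the overall and variable-specific measures can be hand-rolled from the correlation matrix. The sketch below follows the standard formulas for the Bartlett test of sphericity and the KMO/MSA index; it is illustrative and not validated against SPSS output.

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Chi-square test that the correlation matrix is not an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)          # statistic, p-value

def msa(X):
    """Overall KMO/MSA and the per-variable MSA values."""
    R = np.corrcoef(X, rowvar=False)
    S = np.linalg.inv(R)
    # Partial correlations (the anti-image matrix is their negative).
    partial = -S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
    np.fill_diagonal(R, 0)
    np.fill_diagonal(partial, 0)
    r2, p2 = (R**2).sum(axis=0), (partial**2).sum(axis=0)
    per_variable = r2 / (r2 + p2)                 # exclude variables below .50
    overall = r2.sum() / (r2.sum() + p2.sum())
    return overall, per_variable
```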

Deriving Factors and Assessing Overall Fit

  • The researcher must make decisions concerning:
    • The method of extracting the factors (common factor analysis versus components analysis).
    • The number of factors selected to represent the underlying structure in the data.

Selecting the Factor Extraction Method

  • This decision must take into account the objectives of the factor analysis along with knowledge about some basic characteristics of the relationships between variables.
Partitioning the Variance of a Variable
  • It is essential to first understand the variance for a variable and how it is divided or partitioned.
  • Variance is a value that represents the total amount of dispersion of values for a single variable about its mean.
  • A variable's communality is the estimate of its shared, or common, variance among the variables as represented by the derived factors.
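
With standardized variables, the partition is easy to see numerically: each variable's unit variance splits into the communality (summed squared loadings) and a unique-plus-error remainder. The loading values below are invented.

```python
import numpy as np

loadings = np.array([[.75, .10],   # illustrative (variables x factors) loadings
                     [.68, .05],
                     [.12, .81]])

communality = (loadings**2).sum(axis=1)  # common (shared) variance per variable
unique_and_error = 1 - communality       # the rest of each unit variance
```
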
Common Factor Analysis Versus Component Analysis
  • The selection of one method over the other is based on two criteria:
    • The objectives of the EFA.
    • The amount of prior knowledge about the variance in the variables.
  • Component analysis (principal components analysis) is used when the objective is to summarize most of the original information (variance) in a minimum number of factors for prediction purposes.
    • It considers the total variance and derives factors that contain small proportions of unique variance and, in some instances, error variance.
  • Common factor analysis is used primarily to identify underlying factors or dimensions that reflect what the variables share in common.
    • It considers only the common or shared variance, assuming that both the unique and error variance are not of interest in defining the structure of the variables.
  • Common factor analysis suffers from factor indeterminacy, meaning that for any individual respondent, several different factor scores can be calculated from a single factor model result.
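
The component-versus-common contrast can be seen directly in scikit-learn, where PCA decomposes total variance while FactorAnalysis models shared variance and leaves an explicit per-variable unique/error term (data simulated; a sketch only):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(3)
F = rng.normal(size=(500, 1))                     # one latent factor
X = F @ np.array([[.9, .8, .7, .6]]) + rng.normal(size=(500, 4)) * .5

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)              # share of *total* variance

fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)                             # loadings on the common factor
print(fa.noise_variance_)                         # unique (+ error) variance per variable
```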

Criteria for the Number of Factors to Extract

  • Both factor analysis methods are interested in the best linear combination of variables.

  • The first factor may be viewed as the single best summary of linear relationships displayed in the data.

  • The second factor is defined as the second-best linear combination of the variables, subject to the constraint that it is orthogonal to the first factor.

  • Latent root criterion

    • The rationale is that any individual factor should account for the variance of at least a single variable if it is to be retained for interpretation.
    • With component analysis, each variable contributes a value of 1 to the total eigenvalue.
    • Thus, only the factors having latent roots or eigenvalues > 1 are considered significant; all factors with latent roots < 1 are considered insignificant and are disregarded.
  • A priori criterion

    • The a priori criterion can be applied when the number of factors to be extracted is known before performing the EFA.
  • Percentage of variance criterion

    • This approach is based on achieving a specified cumulative percentage of total variance extracted by successive factors.
    • In the natural sciences, the factoring procedure usually should not be stopped until the extracted factors account for at least 95% of the variance or until the last factor accounts for only a small portion (less than 5%).

  • Scree test criterion

    • The scree test is used to identify the optimum number of factors that can be extracted before the amount of unique variance begins to dominate the common variance structure.
    • The scree test is derived by plotting the latent roots against the number of factors in their order of extraction, and the shape of the resulting curve is used to evaluate the cut-off point.
  • Heterogeneity of the respondents

    • Shared variance among variables is the basis for both common and component factor models.
    • An underlying assumption is that shared variance extends across the entire sample.
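
The latent root, percentage-of-variance, and scree criteria can all be read off the eigenvalues of the correlation matrix. A sketch with simulated data (so the resulting numbers are illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(4).normal(size=(250, 12))
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

n_latent_root = int((eigenvalues > 1).sum())            # eigenvalue > 1 rule
cumulative = eigenvalues.cumsum() / eigenvalues.sum()   # percentage-of-variance rule

plt.plot(np.arange(1, 13), eigenvalues, "o-")           # scree plot: look for the elbow
plt.axhline(1.0, linestyle="--")                        # latent root cut-off
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue (latent root)")
plt.show()
```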

Interpreting a Factor Matrix

  • The task of interpreting a factor-loading matrix involves a five-step process:
    • Examine the factor matrix of loadings
    • Identify significant loadings
    • Assess the communalities of the variables
    • Respecify the model
    • Label the factors

Examine the Factor Matrix of Loadings

  • The factor-loading matrix contains the factor loading of each variable on each factor.
  • If an oblique rotation has been used, two matrices of factor loadings are provided:
    • factor pattern matrix, which has loadings that represent the unique contribution of each variable to the factor
    • factor structure matrix, which has simple correlations between variables and factors, but these loadings contain both the unique variance between variables and factors and the correlation among factors.

Identify Significant Loadings

  • The interpretation should start with the first variable on the first factor and move horizontally from left to right, looking for the highest loading for that variable on any factor.
  • When the highest loading (largest absolute factor loading) is identified, it should be marked if it is significant.
  • This procedure should continue for each variable until all variables have been reviewed for their highest loading on a factor.
  • When a variable is found to have more than one significant loading, it is termed a cross-loading.
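
The scanning procedure can be expressed as a small loop over the loading matrix. The loadings and the .5 significance threshold below are assumed for illustration:

```python
import numpy as np

loadings = np.array([[.72, .21, .05],
                     [.65, .18, .10],
                     [.08, .77, .52],   # cross-loads on factors 2 and 3
                     [.11, .09, .69]])
threshold = 0.5                          # assumed cut-off for a significant loading

for i, row in enumerate(np.abs(loadings)):
    significant = np.flatnonzero(row >= threshold)
    flag = " (cross-loading)" if len(significant) > 1 else ""
    print(f"variable {i}: highest loading on factor {row.argmax()}{flag}")
```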

Assess the Communalities of the Variables

  • Once all the significant loadings have been identified, the next step is to identify any variables that are not adequately accounted for by the factor solution.
  • Identify any variable(s) lacking at least one significant loading.
  • Examine each variable's communality, representing the amount of variance accounted for by the factor solution for each variable.

Respecify the Model

  • Once all the significant loadings have been identified and the communalities examined, you may find any one of several problems:
    • A variable has no significant loadings.
    • Even with a significant loading, a variable's communality is deemed too low.
    • A variable has a cross-loading.
  • In this situation, the option is to take any combination of the following remedies:
    • Ignore those problematic variables and interpret the solution as is, which is appropriate if the objective is solely data reduction, but you must still note that the variables in question are poorly represented in the factor solution.
    • Evaluate each of those variables for possible deletion, depending on the variable's overall contribution to the research as well as its communality index.
    • Employ an alternative rotation method, particularly an oblique method if only orthogonal methods had been used.
    • Decrease/increase the number of factors retained to see whether a smaller/larger factor structure will represent those problematic variables.
    • Modify the type of factor model used (component versus common factor) to assess whether varying the type of variance considered affects the factor structure.

Label the Factors

  • When an acceptable factor solution has been obtained in which all variables have a significant loading on a factor, the researcher then attempts to assign some meaning to the pattern of factor loadings.
  • Variables with higher loadings are considered more important and have greater influence on the name or label selected to represent a factor.
  • The signs are interpreted as with other correlation coefficients:
    • positive signs (+) mean the variables are positively related
    • negative signs (-) mean the variables are negatively related

Validation of EFA

  • The final stage of undertaking an EFA is assessing the degree of generalisability of the results to the population and the potential influence of individual cases or respondents on the overall results.

  • Use of a Confirmatory Perspective

    • The most direct method of validating the results is to move to a confirmatory analysis and assess the replicability of the results, either with a split sample in the original data set or with a separate sample.
  • Assessing Factor Structure Stability

    • Factor stability is primarily dependent on the sample size and on the number of cases per variable.
    • Comparison of the two resulting factor matrices will provide an assessment of the robustness of the solution across the sample.
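
One way to sketch the split-sample comparison in Python (data simulated; in practice, factor order and signs may need aligning first) is Tucker's congruence coefficient, which approaches 1 for stable factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.default_rng(5).normal(size=(400, 8))
half1, half2 = X[:200], X[200:]

L1 = FactorAnalysis(n_components=2, rotation="varimax").fit(half1).components_.T
L2 = FactorAnalysis(n_components=2, rotation="varimax").fit(half2).components_.T

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    return (a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum())

for k in range(2):
    print(f"factor {k}: congruence = {congruence(L1[:, k], L2[:, k]):.2f}")
```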

Defining Individual Constructs

  • The process begins by listing the constructs that will comprise the measurement model.
  • As such, it is essential that the researcher consider not only the operational requirements but also establish the construct validity of the newly designed scale.

Developing the Overall Measurement Model

  • In this stage, the need is to consider how all of the individual constructs will come together to form an overall measurement model.

  • Unidimensionality

    • Unidimensional measures mean that a set of measured variables (indicators) can be explained by only one underlying construct.
    • In such a situation, each measured variable is hypothesized to relate to only a single construct.
  • Congeneric Measurement Model

    • A congeneric measurement model is constrained so that each measured item loads on only one construct and the error terms are uncorrelated with one another.
    • Congeneric measurement models are considered to be sufficiently constrained to represent good measurement properties.
  • Items per Construct

    • Good practice dictates a minimum of 3 items per factor, preferably 4, not only to provide minimum coverage of the construct's theoretical domain but also to provide adequate identification for the construct.
  • Reflective Versus Formative Constructs

    • The contrasting direction of causality leads to different measurement approaches: reflective versus formative measurement models.
    • Reflective measurement theory is based on the idea that latent constructs cause the measured variables and that the error results in an inability to fully explain these measured variables.
    • Formative measurement theory is modelled based on the assumption that the measured variables cause the construct.

Designing a Study to Produce Empirical Results

  • The third stage involves designing a study that will produce confirmatory results.
  • Initial data analysis procedures should first be performed to identify any problems in the data, including issues such as data input errors.
Measurement Scales in CFA
  • CFA models typically contain reflective indicators measured with an ordinal or better measurement scale (i.e., interval or ratio).
CFA and Sampling
  • CFA in most cases requires multiple samples: when the measurement model has been developed on one sample (e.g., via EFA), an additional sample or samples should be drawn to perform the CFA.
Specifying the Model
  • CFA, not EFA, should be used to test the measurement model.
  • The researcher specifies (frees for estimation) the indicators associated with each construct and the correlations between constructs.
  • In addition, the researcher does not specify cross-loadings, which fixes those loadings at zero.
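
As a hypothetical illustration, such a measurement model can be specified in Python with the semopy package, which uses lavaan-style syntax (the construct names, item names, and data file below are invented). Each item is assigned to exactly one construct, so cross-loadings are implicitly fixed at zero.

```python
import pandas as pd
from semopy import Model

# Two constructs, three indicators each; "=~" reads "is measured by".
desc = """
Quality =~ q1 + q2 + q3
Value   =~ v1 + v2 + v3
"""

df = pd.read_csv("survey.csv")   # assumed data set with columns q1..q3, v1..v3
model = Model(desc)
model.fit(df)
print(model.inspect())           # loadings, construct covariances, estimates
```
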
Issues in Identification
  • Overidentification is the desired state for CFA models in general.
  • Avoiding identification problems:
    • Meeting the Order and Rank Conditions
    • Three-indicator rule
  • Recognizing identification problems.
Problems in Estimation
  • Even with no identification problems, CFA models may result in the estimation of parameters that are logically impossible, such as:
    • Illogical standardized parameters (e.g., standardized loadings greater than 1).
    • Heywood cases (negative error variance estimates).

Assessing Measurement Model Validity

  • Once the measurement model is correctly specified, a CFA model is estimated to provide an empirical measure of the relationships among variables and constructs represented by the measurement theory.
Assessing Fit
  • Fit compares the observed covariance matrix with the covariance matrix implied by the estimated model; the usual guidelines for goodness of fit apply.
  • The result is that CFA enables us to test or confirm whether a theoretical measurement model is valid.
Path Estimates
  • One of the most fundamental assessments of construct validity involves the measurement relationships between items and constructs:
    • the size of the path estimates and their statistical significance.
  • Standardised loadings of at least .5 and ideally .7 or higher confirm that the indicators are strongly related to their associated constructs and are one indication of construct validity.
Construct Validity
  • Construct validity is the extent to which a set of measured items actually reflects the theoretical latent construct the items are designed to measure.

  • CFA eliminates the need to summate scales because latent construct scores are computed for each participant.

  • Convergent validity

    • The items that are indicators of a specific construct should converge or share a high proportion of variance in common.
    • Methods to estimate the relative amount of convergent validity among item measures in CFA include:
    • Factor loadings
      • In the case of high convergent validity, high loadings on a factor would indicate that they converge on a common point, the latent construct.
      • At a minimum, all factor loadings should be statistically significant, but because a significant loading could still be fairly weak in strength, a good rule of thumb is that standardized loading estimates should be .5 or higher, and ideally .7 or higher.
    • Average Variance Extracted.
      • AVE is calculated as the mean variance extracted for the items loading on a construct and is a summary indicator of convergence.
      • An AVE of >.5 suggests adequate convergence.
    • Reliability is also an indicator of convergent validity.
      • A construct reliability (CR) value is often used in conjunction with CFA models. The rule of thumb is that .7 or higher suggests good reliability, while values between .6 and .7 may be acceptable.
  • Discriminant validity

    • Discriminant validity is the extent to which a construct is truly distinct from other constructs.
    • CFA provides two ways of assessing discriminant validity:
      • The correlation between any two constructs can be specified (fixed) as equal to one; if the unconstrained two-construct model fits significantly better than this constrained model, discriminant validity is supported.
      • The average variance extracted for each construct can be compared with the squared correlation between the constructs; discriminant validity is supported when each AVE exceeds the squared correlation.
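
The convergent and discriminant checks above reduce to simple arithmetic on the standardized loadings. A sketch with invented numbers:

```python
import numpy as np

loadings = np.array([.72, .81, .77, .68])   # standardized loadings, one construct

ave = (loadings**2).mean()                  # average variance extracted
cr = loadings.sum()**2 / (loadings.sum()**2 + (1 - loadings**2).sum())

phi = 0.46                                  # assumed correlation between two constructs
print(ave > 0.5)                            # adequate convergence
print(cr > 0.7)                             # good construct reliability
print(ave > phi**2)                         # AVE-based discriminant validity check
```
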
Model Diagnostics
  • CFA's ultimate goal is to obtain an answer as to whether a given measurement model is valid.

  • Standardised residuals

    • The better the fit, the smaller the residuals.
    • Standardised residuals < 2.5 do not suggest a problem.
  • Modification indices

    • A modification index is calculated for every possible relationship that is not estimated in a model.
    • Modification indices of > 4.0 suggest that the fit could be improved significantly by freeing the corresponding path to be estimated.
  • Specification searches

    • A specification search is an empirical trial-and-error approach that uses model diagnostics to suggest changes in the model.
    • This process is based on freeing fixed (non-estimated) relationships with the largest modification index.