# Factor Analysis

When analysing data containing many measured variables, it may happen that some of the variables are correlated. This could be because they share an underlying influence or common factor. It would be useful to understand how these variables are correlated and seek an intuitive explanation about what's common among them. This will also simplify further analysis by reducing the dataset into fewer variables or factors. This is what factor analysis tries to achieve.

A good factor is intuitive, easy to interpret, has a simple structure and lacks complex loadings. Factor analysis is in some sense an art. It's been said,

> Factor analysis is not a purely statistical technique; there is always a certain amount of guesswork in it... Factor analysis is certainly a very treacherous tool in inexperienced hands.

## Discussion

• Could you provide an intuitive explanation of Factor Analysis?

Suppose a village survey is conducted and the questionnaire includes 500 questions. This survey therefore results in a large dataset of 500 variables. However, we may discover that many of the variables are correlated. We can probably put related variables into groups such as income, education, healthcare, cleanliness, etc. These are called factors. Now our analysis becomes easier from many variables to fewer factors.

Let's say we measure students' abilities in terms of four variables: vocabulary, grammar, arithmetic, and geometry. We can make a hypothesis that vocabulary and grammar abilities must be correlated. Likewise, arithmetic and geometry abilities must be correlated. We can therefore hypothesize two factors: language ability and math ability. Subsequent analysis on the data can either confirm or reject the presence of these factors and to what extent they relate to the variables. In fact, the factors themselves could be correlated with each other and we might identify a single factor that we can call academic ability.

What we call factors are in fact latent variables. These are variables that can't be measured but in fact influence variables that are measured.

With respect to a measured variable, the coefficients of the latent variables are called factor loadings. In other words, the extent to which a variable is associated with a factor is quantitatively expressed by its factor loading. It's possible that a measured variable is influenced by more than one factor.

To give an example, when income, education and occupation are correlated, the common factor could be "individual socioeconomic status" (F1). On the other hand, house value, neighbourhood crimes and amenities can point to another factor "neighbourhood socioeconomic status" (F2). Consider a loading of 0.65 between income and F1; and a loading of 0.48 between occupation and F1. This implies that F1 influences income more strongly than occupation.

An absolute value of 0.4 or higher can be considered a high loading.
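The loading example above can be sketched numerically. Below is a minimal, hypothetical illustration in Python using scikit-learn (the article itself discusses R tools later): we simulate one latent factor influencing two observed variables with true loadings of roughly 0.65 and 0.48, then recover the loadings from data. All numbers here are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 1000

# Simulate one latent factor F1 (e.g. "socioeconomic status")
# influencing two measured variables, each with unique noise.
f1 = rng.normal(size=n)                          # latent factor (unobservable)
income = 0.65 * f1 + 0.3 * rng.normal(size=n)     # true loading ~0.65
occupation = 0.48 * f1 + 0.5 * rng.normal(size=n) # true loading ~0.48
X = np.column_stack([income, occupation])

# Fit a one-factor model and inspect the estimated loadings.
fa = FactorAnalysis(n_components=1)
fa.fit(X)
print(fa.components_)   # row of loadings: one per measured variable
```

The sign of a factor solution is arbitrary, so only the magnitudes of the recovered loadings are meaningful: income should load more strongly on F1 than occupation does.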

• What are the main types of Factor Analysis?

The two main types of FA are:

• Exploratory Factor Analysis (EFA): This is used when we wish to summarize data efficiently, when we want to know how many factors are present and their associated factor loadings. EFA is about revealing patterns in the relationships among variables.
• Confirmatory Factor Analysis (CFA): This is used when a researcher starts with one or more hypotheses. Each hypothesis may state the presence of certain factors. Analysis of measured data must support or reject each hypothesis. A graphical representation of a hypothesis is called a path diagram. CFA produces fit statistics that are used to confirm whether the data fits a particular hypothesis.

Structural Equation Modelling (SEM) is similar to CFA but allows us to test complex hypotheses about the structure of variables. SEM may be seen as a method to do CFA.

• What are some methods of doing Factor Analysis?

All methods of factor analysis are looking for correlations among variables. FA is usually done in one of these ways: Principal Component Analysis (PCA), Principal Axis Factoring (PAF), Ordinary or Unweighted Least Squares (ULS), Generalized or Weighted Least Squares (WLS), Maximum Likelihood (ML). Other methods include Image Factoring (based on ULS) and Alpha Factoring.

PAF is considered the conventional technique. It uses eigenvalue decomposition of a correlation matrix. ULS is considered one of the better methods. It produces the Minimum Residual (MinRes) solution. One study that compared some of these methods found that ULS gave accurate results.
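To make the eigendecomposition idea behind PAF concrete, here is a rough NumPy sketch on simulated data (four test scores driven by two latent abilities, as in the earlier example). A full PAF would iterate with estimated communalities on the diagonal of the correlation matrix; this shows only the first pass, where loadings are eigenvectors scaled by the square roots of their eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Two independent latent abilities: language (f0) and math (f1).
f = rng.normal(size=(n, 2))
X = np.column_stack([
    0.8 * f[:, 0] + 0.4 * rng.normal(size=n),  # vocabulary
    0.7 * f[:, 0] + 0.5 * rng.normal(size=n),  # grammar
    0.8 * f[:, 1] + 0.4 * rng.normal(size=n),  # arithmetic
    0.7 * f[:, 1] + 0.5 * rng.normal(size=n),  # geometry
])

R = np.corrcoef(X, rowvar=False)          # correlation matrix of the 4 variables
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalue decomposition
order = np.argsort(eigvals)[::-1]         # largest eigenvalues first

k = 2                                     # retain two factors
# First-pass loadings: eigenvectors scaled by sqrt of eigenvalues.
loadings = eigvecs[:, order[:k]] * np.sqrt(eigvals[order[:k]])
print(np.round(loadings, 2))
```

With this structure, the two retained eigenvalues each exceed 1, the usual Kaiser criterion for keeping a factor.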

• What are some assumptions for Factor Analysis?

Variables have to be correlated, but there shouldn't be perfect multicollinearity among them; that is, no variable can be predicted exactly from the other variables. Data shouldn't have outliers. We assume interval data.

Relationships among variables are assumed to be linear, but non-linear variables can be transformed to linear ones before applying factor analysis. In fact, in the discipline of statistics, factor analysis is considered part of the General Linear Model (GLM).
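One quick way to screen for perfect multicollinearity is to inspect the determinant of the correlation matrix: a determinant at or essentially zero means at least one variable is an exact linear combination of the others. A minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2 * x1 - x2              # exactly predictable from x1 and x2

R_ok = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)
R_bad = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

# A (near-)zero determinant flags perfect multicollinearity,
# which violates a factor-analysis assumption.
print(np.linalg.det(R_ok))    # well away from zero
print(np.linalg.det(R_bad))   # essentially zero
```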

• Why do we do Factor Rotation?

Sometimes we will find that a variable has high factor loadings due to more than one factor. This makes it difficult to interpret the factors. Since factor models are not unique, factor rotation allows us to find another factor model that can perhaps be interpreted better.

There are two rotation types:

• Orthogonal: This uses the loading matrix that represents the correlation between variables and factors. In this type, the rotated factors remain orthogonal to one another.
• Oblique: This uses the factor correlation matrix, structure matrix, pattern matrix, and factor coefficient matrix. In this type, the rotated factors are allowed to become correlated. Orthogonal rotation may be seen as a special case of oblique rotation: if the data clusters are in fact uncorrelated, an oblique rotation will result in orthogonal factors.

There are different methods to perform these rotations. For orthogonal rotation, we have Quartimax, Varimax and Equamax. For oblique rotation, we have Oblimin and Promax.
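To make orthogonal rotation concrete, here is a compact NumPy implementation of the standard SVD-based varimax algorithm (the textbook algorithm, not code from any particular package), applied to a made-up loading matrix in which every variable cross-loads on both factors:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-6):
    """Varimax (orthogonal) rotation of a p x k loading matrix L."""
    p, k = L.shape
    R = np.eye(k)                     # rotation matrix, starts as identity
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # SVD-based update of the rotation (Kaiser's varimax criterion).
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt
        if s.sum() < d * (1 + tol):   # criterion stopped improving
            break
        d = s.sum()
    return L @ R

# Unrotated loadings: every variable loads on both factors.
L = np.array([[0.6, 0.6], [0.7, 0.5], [0.6, -0.6], [0.5, -0.7]])
Lr = varimax(L)
print(np.round(Lr, 2))   # simpler structure: one dominant loading per row
```

Because the rotation is orthogonal, each variable's communality (its row sum of squared loadings) is unchanged; the rotation only redistributes loadings so that each variable loads mainly on a single factor, which is easier to interpret.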

• Isn't Factor Analysis the same as Principal Component Analysis?

Both PCA and FA achieve dimensionality reduction while minimizing information loss. Both appear to use similar techniques of extraction, interpretation and rotation to reduce many variables to fewer components or factors. Yet, they are fundamentally different.

PCA extracts maximum variance into the first component, then the maximum remaining variance into the second component, and so on. Factors in FA have no such order. Factors identify common variance among variables; for this reason, FA is also called Common Factor Analysis (not to be confused with Confirmatory Factor Analysis, which shares the abbreviation CFA). FA doesn't capture error or unique variance, whereas PCA considers all the variance.

If variables are uncorrelated, PCA will still find suitable components but EFA will be unable to identify useful factors. It's been noted that as the number of variables increases (to about 40 or more), results from PCA and EFA tend to converge.
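The difference is easy to see in code. In this hedged scikit-learn sketch (simulated data with one common factor), PCA orders its components by explained variance and accounts for all of it, while FactorAnalysis splits each variable's variance into a common part (the loadings) and a unique part (its noise_variance_):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(3)
f = rng.normal(size=(500, 1))                       # one common factor
# Three measured variables: common part plus unique noise.
X = f @ np.array([[0.9, 0.8, 0.7]]) + 0.4 * rng.normal(size=(500, 3))

pca = PCA(n_components=3).fit(X)
fa = FactorAnalysis(n_components=1).fit(X)

# PCA: components ordered by variance explained; together they
# account for all of the variance in the data.
print(pca.explained_variance_ratio_)

# FA: common-factor loadings plus a unique variance per variable.
print(fa.components_)       # loadings on the single common factor
print(fa.noise_variance_)   # unique (error) variance, one per variable
```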

• Considering PCA and FA, how are variables related to components or factors?

Factors cause variables; components are aggregates of variables.

In PCA, components (C) are a linear combination of variables (Y). FA aims to identify latent variables or factors (F). Latent variables themselves can't be measured directly but are seen to cause or influence the measured variables. The extent of influence (b) is called factor loading. While components in PCA explain all of the variance in data, factors may not explain all the variance in a variable, thus resulting in a term that's unique (u) to each measured variable.

Mathematically,

$$FA:\ Y_1=b_1F+u_1;\quad Y_2=b_2F+u_2;\quad Y_3=b_3F+u_3;\quad Y_4=b_4F+u_4$$

$$PCA:\ C=w_1Y_1+w_2Y_2+w_3Y_3+w_4Y_4$$

• What tools are available to perform Factor Analysis?

In the R language (which is free and open source), factor analysis can be performed easily thanks to the psych package. EFA can be performed using your choice of method: MinRes, PAF, ULS, WLS or ML. The scree function can help in determining the number of significant factors. An alternative is parallel analysis, which can be done using the fa.parallel function.

An example of a commercial product is JMP of SAS. The Multivariate platform of JMP can do both PCA and FA. SPSS is another commercial product that can do both PCA and EFA.

## Milestones

1884

The genesis of factor analysis is in human personality psychology: to identify attributes and then categorize them into a structural model. Francis Galton uses a dictionary to identify terms that describe personality. However, he fails to come up with a model.

1901

Spearman gets interested in the work of Galton. In a paper published in 1904, he uses the terms factor and loadings, although he doesn't describe the methods he used for factorizing.

1934

L.L. Thurstone, considered the father of factor analysis, uses 60 terms across 1300 subjects to arrive at five broad factors. He doesn't pursue his analysis further. R.B. Cattell identifies at least a dozen factors in the 1940s but it's Donald Fiske who shows that they reduce to five factors.

1960

Harry Harman introduces Minimum Residuals (MinRes), an approach to factor analysis via least squares.

1966

To determine how many factors to retain, R.B. Cattell proposes a graphical method called the Scree Plot, a plot of eigenvalues vs. factors.

1969

The term Confirmatory Factor Analysis (CFA) is introduced. Prior to this, factor analysis was exploratory in nature but the "exploratory" prefix was not used.

1980

Following Fiske and other researchers after him, it's in the 1980s that the Big-Five factor structure to describe human personality finally takes shape. This goes to prove that finding the correct number of factors, and identifying those factors, is not a trivial problem.

## Cite As

Devopedia. 2019. "Factor Analysis." Version 5, February 5. Accessed 2021-09-09. https://devopedia.org/factor-analysis