Multivariate analysis in Excel with BESHStatNG

Multivariate analysis helps you understand patterns in datasets with many variables at the same time. Instead of looking at one variable after another, you can study how variables move together, how observations group naturally, and how well known groups can be separated.

BESHStatNG makes it possible to perform these workflows directly in Excel. This tutorial shows how to use BESHStatNG for practical multivariate analysis with two downloadable workbooks:

  • a numeric multivariate workbook for principal component analysis, factor analysis, clustering, and discriminant analysis
  • a categorical multivariate workbook for correspondence analysis and multiple correspondence analysis

Both workbooks are based on a real biomedical dataset derived from the UCI Wisconsin Diagnostic Breast Cancer data. For the tutorial, the numeric workbook uses a balanced 120-case sample with 10 continuous mean features, and the categorical workbook uses discretized versions of selected variables for CA and MCA.

The goal is not only to show where to click, but also to explain how to interpret multivariate output correctly and how to use workbook-based helper sheets and UDF sections as reusable templates.

In this tutorial, you will learn how to

  • reduce dimensionality with principal component analysis
  • identify latent structure with factor analysis
  • group observations with k-means and hierarchical clustering
  • classify known groups with discriminant analysis
  • analyse contingency structure with correspondence analysis
  • analyse multiple categorical variables jointly with multiple correspondence analysis
  • use workbook-based helper sheets and UDF sections to create reusable Excel templates

Which method should I use?

  • Principal component analysis: reduce many numeric variables to a smaller number of components
  • Factor analysis: identify latent dimensions behind correlated numeric variables
  • K-means clustering: partition observations into a fixed number of clusters
  • Hierarchical clustering: build a cluster tree and inspect nested grouping structure
  • Discriminant analysis: separate known groups and classify new observations
  • Correspondence analysis: map association structure in a contingency table
  • Multiple correspondence analysis: explore multiple categorical variables in one low-dimensional space

Download the tutorial workbooks

This tutorial uses two example workbooks.

[Download the numeric multivariate workbook]

Use this workbook for:

[Download the categorical multivariate workbook]

Use this workbook for:

Both workbooks are designed so you can follow the tutorial step by step, reproduce the analysis in Excel, and later adapt the same workbook structure to your own multivariate projects.

The example workbooks

This tutorial is split into two parts.

1. Numeric multivariate workbook

The numeric workbook contains a balanced 120-case biomedical dataset with 10 continuous predictor variables:

  • radius_mean
  • texture_mean
  • perimeter_mean
  • area_mean
  • smoothness_mean
  • compactness_mean
  • concavity_mean
  • concave_points_mean
  • symmetry_mean
  • fractal_dimension_mean

The workbook supports:

  • PCA with workbook-based variance summaries and retained components
  • factor analysis with extraction, rotation, and scoring output
  • k-means clustering with cluster centers and observation assignments
  • hierarchical clustering with Ward linkage
  • discriminant analysis with classification summaries
  • formula-driven UDF helper sheets
  • reference sheets for later R replication

2. Categorical multivariate workbook

The categorical workbook contains a matched 120-case dataset with discretized clinical variables derived from the same source data.

The main categorical variables are:

  • Diagnosis
  • RadiusBin
  • TextureBin
  • SmoothnessBin
  • CompactnessBin

This workbook supports:

  • simple correspondence analysis from a 3×3 contingency table
  • multiple correspondence analysis from multiple categorical variables
  • formula-driven UDF sheets
  • reference sheets for category coordinates and inertia summaries

Start from the BESH Stat NG ribbon

All multivariate methods used in this tutorial are available from the Multivariate Analysis section of the BESH Stat NG ribbon.

BESH Stat NG ribbon in Excel showing the Multivariate Analysis menu used to access PCA, factor analysis, clustering, discriminant analysis, correspondence analysis, and multiple correspondence analysis.
Start the tutorial from the Multivariate Analysis section of the BESH Stat NG ribbon.

For PCA, factor analysis, clustering, and discriminant analysis, start from the numeric workbook.
For CA and MCA, use the categorical workbook.

Part I — Numeric multivariate methods

Principal component analysis

Principal component analysis is useful when many numeric variables are correlated and you want a smaller number of summary dimensions.

In this workbook, the PCA analysis uses:

  • Input matrix: the 10 numeric mean features
  • Matrix type: correlation
  • Retention rule: cumulative explained variance
  • Retention threshold: 80%
BESHStatNG principal component analysis input dialog using the numeric breast-cancer workbook variables.
PCA input selection for the numeric tutorial workbook.

 

BESHStatNG principal component analysis options dialog showing correlation-based PCA and cumulative variance retention settings.
PCA options used in the tutorial workbook.

From the workbook results:

  • Rows analysed: 120
  • Variables: 10
  • Retained components: 2
  • PC1 eigenvalue: 5.3930
  • PC2 eigenvalue: 2.7085
  • Cumulative variance explained by PC1 and PC2: 81.0%

The strongest positive loadings on PC1 come from variables such as:

  • concave_points_mean
  • concavity_mean
  • perimeter_mean
  • compactness_mean
  • radius_mean
  • area_mean

This means the first component mainly reflects a size and tumour-structure severity dimension.

The second component is driven more by:

  • fractal_dimension_mean
  • smoothness_mean
  • symmetry_mean

Interpretation

A two-component PCA already captures most of the structure in this tutorial dataset. In practical terms, this means that the original 10 variables can be summarized reasonably well by two major dimensions: one dominated by tumor size and invasive structure, and another capturing shape and texture-related variation.

For full GUI based output results see this workbook.

Factor analysis

Factor analysis is useful when the goal is not only dimension reduction, but also understanding whether the variables reflect a smaller number of underlying latent constructs.

In this workbook, factor analysis uses:

  • Matrix type: correlation
  • Extraction method: principal axis
  • Retained factors: 2
  • Rotation: varimax
  • Score method: regression
  • Initial communality rule: SMC
BESHStatNG factor analysis input dialog using the numeric breast-cancer workbook variables.
Factor analysis input selection for the numeric tutorial workbook.

 

BESHStatNG factor analysis options dialog showing principal axis extraction, varimax rotation, and regression scoring.
Factor analysis options used in the tutorial workbook.

From the workbook results:

  • Rows analysed: 120
  • Variables: 10
  • Retained factors: 2
  • Overall KMO: 0.7902
  • Bartlett chi-square: 2382.9
  • Bartlett p-value: < 0.001
  • RMSR: 0.0310

Selected rotated loadings show this pattern:

  • Factor 1 loads strongly on radius_mean, perimeter_mean, area_mean, and concave_points_mean
  • Factor 2 loads strongly on smoothness_mean, compactness_mean, concavity_mean, symmetry_mean, and fractal_dimension_mean

Interpretation

The KMO value and Bartlett test suggest that factor analysis is appropriate for this dataset. The rotated two-factor solution is clinically interpretable: one factor mainly reflects size and extent, while the second reflects surface irregularity and local structure. The sign of a factor is arbitrary, so the emphasis should be on the relative grouping of variables, not on whether a loading is positive or negative.

For full GUI based output results see this workbook.

K-means clustering

K-means clustering is useful when you want to partition observations into a predefined number of groups based on their multivariate similarity.

In this workbook, k-means clustering uses:

  • Number of clusters: 2
  • Initialization: k-means++
  • Distance metric: squared Euclidean
  • Random starts: 20
  • Maximum iterations: 100
  • Standardization: z-scores
  • Missing-value policy: listwise deletion
  • Random seed: 42
BESHStatNG k-means clustering input dialog using the numeric breast-cancer workbook variables.
K-means clustering input selection for the numeric tutorial workbook.

 

BESHStatNG k-means clustering options dialog showing two clusters, k-means++ initialization, z-score standardization, and a fixed random seed.
K-means options used in the tutorial workbook.

From the workbook results:

  • Number of clusters: 2
  • Active observations: 120
  • Converged: Yes
  • Iterations: 6
  • Cluster sizes: 67 and 53

The cluster centers show a clear separation:

  • Cluster 1 has lower radius, perimeter, area, compactness, and concavity
  • Cluster 2 has larger values on those variables

When compared back to diagnosis in the workbook data:

  • one cluster contains all 53 malignant cases assigned to that cluster
  • the other cluster contains 60 benign and 7 malignant cases

Interpretation

This is a strong example of unsupervised structure emerging from the data. Even without using the diagnosis variable in the clustering step, the two-cluster solution separates lower-risk and higher-risk observations quite well. This makes k-means a useful exploratory tool before supervised modeling.

For full GUI based output results see this workbook.

Hierarchical clustering

Hierarchical clustering is useful when you want to explore nested grouping structure and inspect how clusters merge across different similarity levels.

In this workbook, hierarchical clustering uses:

  • Linkage: Ward
  • Distance metric: Euclidean
  • Minkowski power: 2
  • Standardization: z-scores
  • Missing-value policy: listwise deletion
  • Cluster view: cut at 2 clusters
BESHStatNG hierarchical clustering input dialog using the numeric breast-cancer workbook variables.
Hierarchical clustering input selection for the numeric tutorial workbook.

 

BESHStatNG hierarchical clustering options dialog showing Ward linkage, Euclidean distance, and z-score standardization.
Hierarchical clustering options used in the tutorial workbook.

From the workbook results:

  • Active observations: 120
  • Merge steps: 119
  • Final merge height: 442.28
  • Two-cluster cut gives groups of 61 and 59 observations

Compared back to diagnosis:

  • one cluster contains 57 benign and 4 malignant
  • the other contains 56 malignant and 3 benign

That means the two-cluster cut agrees with diagnosis for 113 of 120 cases, or about 94.2%.

Interpretation

Ward hierarchical clustering gives a very similar story to k-means, but with the added benefit of a dendrogram-based view of the grouping structure. It is especially useful when you want to inspect whether the data support a two-cluster solution or whether a finer nested structure is visible.

For full GUI based output results see this workbook.

BESHStatNG hierarchical clustering dendrogram output created using the numeric breast-cancer data.
Hierarchical clustering dendrogram output.

Discriminant analysis

Discriminant analysis is useful when the groups are already known and the goal is to find linear combinations of variables that best separate them.

In this workbook, discriminant analysis uses:

  • Grouping variable: Diagnosis
  • Predictors: the 10 numeric mean features
  • Method: linear
  • Standardization: z-scores
  • Missing-value policy: listwise deletion
  • Priors: equal
BESHStatNG discriminant analysis input dialog using diagnosis as the grouping variable and numeric breast-cancer features as predictors.
Discriminant analysis input selection for the numeric tutorial workbook.

 

BESHStatNG discriminant analysis options dialog showing linear discriminant analysis with z-score standardization and equal priors.
Discriminant analysis options used in the tutorial workbook.

From the workbook results:

  • Groups: 2
  • Rows analysed: 120
  • Canonical correlation: 0.8369
  • Wilks lambda: 0.2995
  • Training classification accuracy: 94.17%

Classification summary:

  • Malignant correctly classified: 54 of 60
  • Benign correctly classified: 59 of 60

Interpretation

Discriminant analysis provides a supervised counterpart to the clustering results. Here, the single discriminant function separates the two groups very well. The high canonical correlation and strong training accuracy show that the selected multivariate feature set contains substantial discriminatory information for diagnosis.

For full GUI based output results see this workbook.

Part II — Categorical multivariate methods

Correspondence analysis

Correspondence analysis is useful when the input is a contingency table and the goal is to visualize associations between row and column categories.

In this workbook, the CA example uses a 3×3 table:

  • Rows: RadiusBin
  • Columns: TextureBin
BESHStatNG correspondence analysis input dialog using the RadiusBin by TextureBin contingency table from the categorical workbook.
Correspondence analysis input based on a 3×3 contingency table.

From the workbook results:

  • Available axes: 2
  • Total inertia: 0.1288
  • Dimension 1 inertia: 0.1143
  • Percent inertia explained by Dimension 1: 88.8%

Selected coordinates show a clear gradient:

  • Small and Low lie on the negative side of Dimension 1
  • Large and High lie on the positive side of Dimension 1

Interpretation

The first correspondence axis captures almost all of the association structure in this example. The map mainly contrasts smaller / lower-texture cases against larger / higher-texture cases. This is exactly the kind of low-dimensional summary that makes CA useful for exploratory analysis of contingency tables.

Multiple correspondence analysis

Multiple correspondence analysis extends the same idea to several categorical variables at once.

In this workbook, MCA uses:

  • Diagnosis
  • RadiusBin
  • TextureBin
  • SmoothnessBin
  • CompactnessBin
BESHStatNG multiple correspondence analysis input dialog using Diagnosis, RadiusBin, TextureBin, SmoothnessBin, and CompactnessBin from the categorical workbook.
MCA input selection for the categorical tutorial workbook.

From the workbook results:

  • Observations: 120
  • Variables: 5
  • Category levels: 14
  • Dimension 1 inertia: 27.60%
  • Dimension 2 inertia: 16.47%
  • Cumulative inertia for the first two dimensions: 44.07%

Selected category coordinates show a clinically interpretable pattern:

  • Diagnosis = B, RadiusBin = Small, and TextureBin = Low appear on one side of the map
  • Diagnosis = M, RadiusBin = Large, and CompactnessBin = High appear on the opposite side

Interpretation

MCA makes it possible to inspect relationships among multiple categorical variables in one common space. In this tutorial dataset, the first two dimensions already reveal a meaningful contrast between lower-risk and higher-risk profiles, which is useful both for exploratory analysis and for teaching categorical multivariate methods.

For full GUI based output results see this workbook.

How to choose the right multivariate tool

A simple rule is to use:

  • PCA when you want a smaller number of components from many correlated numeric variables
  • factor analysis when you want interpretable latent dimensions
  • k-means when you want a fixed number of numeric clusters
  • hierarchical clustering when you want cluster structure and dendrogram-style merging
  • discriminant analysis when the groups are already known
  • CA when the input is a contingency table
  • MCA when the input contains multiple categorical variables

Results match the workbooks and are ready for later R replication

The companion workbooks are designed to mirror the GUI examples shown in this tutorial.

The numeric workbook includes:

  • PCA_UDFs
  • Factor_UDFs
  • KMeans_UDFs
  • HClust_UDFs
  • DA_UDFs
  • reference sheets for independent cross-checking

The categorical workbook includes:

  • CA_UDFs
  • MCA_UDFs
  • reference sheets for inertia and coordinates

This makes the workbooks useful for two purposes:

  • reproducing the dialog-based analyses shown on this page
  • building reusable multivariate templates for later work and later R replication

In other words, the GUI and workbook views are not separate workflows. They are two ways of working with the same multivariate logic inside Excel.

Final remarks

Multivariate methods are especially useful when one-variable-at-a-time summaries are no longer enough.

In practical work, they help you answer questions such as:

  • which variables move together?
  • can the dimensionality be reduced?
  • are there natural groups in the data?
  • how well do known groups separate?
  • how are categorical profiles related?

BESHStatNG makes these analyses accessible in a reproducible Excel workflow, with both ribbon-based dialogs and formula-based workbook structures.

For full method details, options, and formulas, see the online help and the multivariate UDF documentation.

What to do next

Once you have completed this tutorial, the next useful BESHStatNG workflows are: