Method comparison and agreement in Excel

Method comparison and agreement analysis is essential whenever two measurement methods, two devices, or two raters are used on the same subjects.

BESHStatNG makes it possible to perform these workflows directly in Excel, including:

  • Passing–Bablok regression
  • Deming regression
  • Bland–Altman analysis
  • Lin’s concordance correlation coefficient
  • Cohen’s kappa
  • Weighted kappa

This tutorial shows how to use these methods in a practical Excel workflow with two downloadable workbooks:

  • a quantitative agreement workbook for paired numeric measurements
  • a categorical agreement workbook for paired ratings

The goal is not only to show where to click, but also to explain how to interpret agreement results correctly and how to compare BESHStatNG output with workbook-based reference results.

In this tutorial, you will learn how to:

  • compare two quantitative measurement methods
  • separate association from agreement
  • interpret bias, limits of agreement, and concordance
  • analyse paired categorical ratings with unweighted and weighted kappa
  • use workbook-based helper sheets and UDF sections to create reusable Excel templates

Which method should I use?

  • Passing–Bablok: robust method-comparison regression
  • Deming: regression when both methods have error
  • Bland–Altman: bias and limits of agreement
  • Lin’s CCC: single-number concordance summary
  • Kappa / weighted kappa: categorical agreement

Download the tutorial workbooks

This tutorial uses two example workbooks.

[Download the quantitative agreement workbook]

Use this workbook for:

  • Passing–Bablok regression
  • Deming regression
  • Bland–Altman analysis
  • Lin’s concordance correlation coefficient

[Download the categorical agreement workbook]

Use this workbook for:

  • Cohen’s kappa
  • linearly weighted kappa
  • quadratically weighted kappa

Both workbooks are designed so you can follow the tutorial step by step, reproduce the analysis in Excel, and later adapt the same workbook structure to your own method-comparison or agreement studies.

The example workbooks

This tutorial is split into two parts.

1. Quantitative agreement workbook

The quantitative workbook contains 87 paired measurements from 53 patients, with repeated observations for some subjects.

The main analysis sheet includes:

  • PatientID
  • Reference_X
  • Test_Y
  • PairMean
  • Diff_Y_minus_X

This workbook supports:

  • grouped Passing–Bablok regression
  • Deming regression
  • Bland–Altman analysis in repeated-by-subject mode
  • Lin’s CCC
  • formula-driven helper calculations
  • a placeholder worksheet for future agreement UDF support

2. Categorical agreement workbook

The categorical workbook contains 51 paired ratings across 4 ordered categories.

The main analysis sheet includes:

  • Rater_1
  • Rater_2
  • Code_1
  • Code_2

This workbook supports:

  • unweighted Cohen’s kappa
  • linear weighted kappa
  • quadratic weighted kappa
  • formula-driven confusion matrix calculations
  • a placeholder worksheet for future agreement UDF support

Start from the BESH Stat NG ribbon

All agreement methods used in this tutorial are available from the Agreement section of the BESH Stat NG ribbon.

BESH Stat NG ribbon showing the Agreement menu with Passing–Bablok regression, Deming regression, Bland–Altman analysis, Lin’s CCC, Cohen’s / Weighted Kappa, and ICC.

For quantitative methods, start from the quantitative agreement workbook.
For categorical agreement, use the kappa workbook.

Part I — Quantitative method comparison and agreement

Passing–Bablok regression

Passing–Bablok regression is a robust method-comparison regression that is widely used when two quantitative methods are compared on the same subjects.

In this workbook, the analysis uses:

  • Group (optional): PatientID
  • Reference method (X): Reference_X
  • Test method (Y): Test_Y

Because the dataset contains repeated observations for some patients, the grouped / block version is the most appropriate summary for the tutorial.

Passing–Bablok regression input dialog using PatientID as the grouping variable and the paired measurement columns as X and Y.

From the workbook results:

  • Sample size: 87 paired observations
  • Number of groups: 53
  • Slope: 0.8543
  • 95% CI for slope: 0.6886 to 1.0592
  • Intercept: 0.1551
  • 95% CI for intercept: 0.0773 to 0.2144

The Passing–Bablok slope is close to 1, but its confidence interval includes values below and above 1. This means there is no strong evidence of clear proportional bias from this method alone. The intercept is clearly above 0, which suggests a positive systematic difference between the two methods across the observed range.

Deming regression

Deming regression is useful when both methods are measured with error, which is often a more realistic assumption than ordinary least squares in method-comparison studies.

In this tutorial, the Deming analysis uses:

  • Reference method (X): Reference_X
  • Test method (Y): Test_Y
  • Confidence interval type: Bootstrap percentile
  • Bootstrap replicates: 2000
  • alpha: 0.050
  • Variance model: Constant lambda
  • Error ratio: 1.000
  • Fit intercept: checked
Deming regression input dialog using the paired quantitative worksheet columns.

 

Deming regression options used in the tutorial workbook: bootstrap percentile CI, constant lambda, error ratio 1, and fitted intercept.

From the workbook results:

  • Slope: 0.7918
  • 95% CI for slope: 0.5904 to 1.1350
  • Intercept: 0.2025
  • 95% CI for intercept: 0.1087 to 0.2763

Deming regression tells a similar story to Passing–Bablok. The slope is below 1, but the confidence interval still includes 1, so strong proportional bias is not established. The positive intercept again suggests systematic offset between the two methods.

When Passing–Bablok and Deming point in the same direction, confidence in the overall interpretation increases.

Bland–Altman analysis

Bland–Altman analysis focuses on agreement on the measurement scale, not just regression or correlation.

In this workbook, the Bland–Altman analysis uses:

  • Subject ID: PatientID
  • Reference method (X): Reference_X
  • Test method (Y): Test_Y
  • Mode: Repeated by subject
  • Scale: Raw difference
  • X-axis: Mean of methods
  • Plot: All observations + subject means
  • CI method: Bootstrap percentile
  • Bootstrap replicates: 2000
  • alpha: 0.050
  • Check proportional bias: checked
  • Exclude singleton subjects: checked
  • Allow fallback to simple analysis: checked
Bland–Altman input dialog using PatientID plus the paired X and Y measurement columns.

 

Bland–Altman options in repeated-by-subject mode with bootstrap percentile confidence intervals.

From the workbook results:

  • Mean bias (Y − X): 0.1394
  • 95% CI for bias: 0.1052 to 0.1780
  • SD of differences: 0.1575
  • Lower limit of agreement: −0.1693
  • Upper limit of agreement: 0.4481
  • Proportional-bias slope: −0.1611

The positive mean bias shows that, on average, the test method tends to read higher than the reference method. The limits of agreement are fairly wide relative to the scale of the measurements, which means the two methods should not automatically be treated as interchangeable. The negative proportional-bias slope also suggests that the size of the difference may depend somewhat on the measurement level.

Bland–Altman analysis answers a different question than regression: not whether two methods are associated, but whether their differences are small enough to be acceptable in practice.

Lin’s concordance correlation coefficient

Lin’s CCC combines correlation and agreement into a single concordance measure.

In this tutorial, the Lin’s CCC analysis uses:

  • Reference method (X): Reference_X
  • Test method (Y): Test_Y
  • Confidence interval type: Bootstrap percentile
  • Bootstrap replicates: 2000
  • alpha: 0.050
  • Null concordance: 0.00
Lin’s concordance correlation coefficient input dialog using the paired quantitative worksheet columns.

 

Lin’s CCC options used in the tutorial workbook: bootstrap percentile CI and null concordance of 0.

From the workbook results:

  • Lin CCC: 0.3777
  • 95% CI: 0.2102 to 0.5344
  • Pearson r: 0.5237
  • Bias-correction factor Cb: 0.7211

The Pearson correlation is moderate, but Lin’s CCC is noticeably lower because concordance depends not only on correlation but also on how closely the data follow the 45-degree line of perfect agreement. The bias-correction factor below 1 confirms that the methods differ in level and/or scale, so agreement is weaker than correlation alone would suggest.

A method pair can be correlated without agreeing well. Lin’s CCC helps make that distinction explicit.

Interpreting the quantitative agreement workflow

Taken together, the quantitative workbook shows a consistent pattern:

  • Passing–Bablok and Deming both suggest a positive systematic difference between methods
  • Bland–Altman shows a positive average bias and fairly wide limits of agreement
  • Lin’s CCC indicates that overall concordance is only moderate

In this example, the test method is not simply a noisy version of the reference method. The results suggest both offset and imperfect agreement. That means the two methods may be related, but they should not automatically be treated as interchangeable without further clinical or practical justification.


Part II — Categorical agreement with kappa

Cohen’s kappa and weighted kappa

For categorical ratings, agreement is usually analysed through kappa statistics rather than regression-based methods.

This workbook uses 51 paired ratings across 4 categories and compares:

  • unweighted Cohen’s kappa
  • linear weighted kappa
  • quadratic weighted kappa

The workbook also includes:

  • the observed confusion matrix
  • row and column totals
  • linear and quadratic weight matrices
  • formula-driven agreement calculations
Agreement menu showing Cohen’s / Weighted Kappa in the BESH Stat NG ribbon.

 

Cohen’s / Weighted Kappa input dialog using the paired rating columns.

 

Kappa options used in the tutorial: unweighted Cohen’s kappa, bootstrap percentile CI, alpha 0.050, 2000 bootstrap replicates.

 

Global settings showing the default bootstrap seed used for the tutorial results.

Kappa results

The confusion matrix in the workbook is:

  • A/A = 10
  • A/B = 1
  • A/C = 2
  • A/D = 0
  • B/A = 1
  • B/B = 10
  • B/C = 2
  • B/D = 2
  • C/A = 0
  • C/B = 0
  • C/C = 6
  • C/D = 1
  • D/A = 0
  • D/B = 2
  • D/C = 0
  • D/D = 14

From the GUI results:

Unweighted Cohen’s kappa

  • Kappa: 0.7080
  • 95% CI: 0.5373 to 0.8425
  • Observed agreement: 0.7843
  • Expected agreement: 0.2614

Linear weighted kappa

  • Kappa: 0.7439
  • 95% CI: 0.5731 to 0.8783
  • Weighted observed agreement: 0.8889
  • Weighted expected agreement: 0.5662

Quadratic weighted kappa

  • Kappa: 0.7919
  • 95% CI: 0.6290 to 0.9113
  • Weighted observed agreement: 0.9368
  • Weighted expected agreement: 0.6964

The unweighted kappa already shows substantial agreement. When ordinal disagreement is weighted so that near-misses are penalized less than large disagreements, the kappa increases. Quadratic weighting gives the highest value because it treats small category disagreements as less serious than large ones.

Interpreting weighted kappa

This example is useful because it shows why weighting matters.

If the categories have a natural order, then a disagreement between adjacent categories is usually less serious than a disagreement between categories at opposite ends of the scale.

That is why:

  • unweighted kappa is the strictest nominal-agreement summary
  • linear weighted kappa gives partial credit for near agreement
  • quadratic weighted kappa gives even more credit when disagreements are small

In this workbook, the agreement conclusion becomes stronger as the weighting scheme better reflects the ordered nature of the categories.

Results match workbook references

Both tutorial workbooks include sheets that help you cross-check the analysis.

Quantitative workbook

Use:

  • Reference_Results
  • Excel_Helper
  • BESH_UDF_Placeholders

The key quantitative reference values are:

  • Passing–Bablok slope: 0.8543
  • Passing–Bablok intercept: 0.1551
  • Deming slope: 0.7918
  • Deming intercept: 0.2025
  • Bland–Altman bias: 0.1394
  • Lin CCC: 0.3777

Kappa workbook

Use:

  • Confusion_Matrix
  • GUI_results1
  • GUI_results2
  • GUI_results3
  • BESH_UDF_Placeholders

The key kappa reference values are:

  • Unweighted kappa: 0.7080
  • Linear weighted kappa: 0.7439
  • Quadratic weighted kappa: 0.7919

This makes the workbooks useful not only as tutorials but also as validation examples. You can compare formula-driven helper calculations, GUI-based BESHStatNG output, and agreement UDF results in the same Excel file.

R reference

You can cross-check the main workbook results in R with the script below.
It reads the two tutorial workbooks and reproduces the core analyses used on this page:
Passing–Bablok regression, Deming regression,
Bland–Altman summary statistics, Lin’s concordance correlation coefficient,
and Cohen’s / weighted kappa.

The quantitative workbook uses the Analysis_Data sheet with
PatientID, Reference_X, and Test_Y.
The categorical workbook uses the Analysis_Data sheet with
Rater_1 and Rater_2.

Note: BESHStatNG uses grouped Passing–Bablok and repeated-by-subject Bland–Altman settings in this tutorial, so exact confidence intervals may differ slightly depending on the R package and CI method used. The main coefficients and agreement summaries should still be very close.

# install.packages(c("readxl", "mcr", "DescTools", "irr"))
library(readxl)
library(mcr)
library(DescTools)
library(irr)

# Quantitative workbook
qdat <- read_excel("BESHStatNG_agreement_tutorial_quantitative.xlsx",  sheet = "Analysis_Data")
qdat <- as.data.frame(qdat)

x  <- qdat$Reference_X
y  <- qdat$Test_Y
id <- qdat$PatientID

# Passing–Bablok regression
fit_pb <- mcreg(x, y, method.reg = "PaBa")
summary(fit_pb)

# Deming regression (error ratio = 1)
fit_dem <- mcreg(x, y, method.reg = "Deming", error.ratio = 1)
summary(fit_dem)

# Bland–Altman summary
diff_xy <- y - x
mean_xy <- (x + y) / 2

data.frame(
  Mean_Bias = mean(diff_xy, na.rm = TRUE),
  SD_Diff = sd(diff_xy, na.rm = TRUE),
  LoA_Lower = mean(diff_xy, na.rm = TRUE) - 1.96 * sd(diff_xy, na.rm = TRUE),
  LoA_Upper = mean(diff_xy, na.rm = TRUE) + 1.96 * sd(diff_xy, na.rm = TRUE),
  Proportional_Bias_Slope = coef(lm(diff_xy ~ mean_xy))[2]
)

# Lin’s concordance correlation coefficient
CCC(x, y)

# Categorical workbook
kdat <- read_excel("BESHStatNG_agreement_tutorial_kappa.xlsx", sheet = "Analysis_Data")
kdat <- as.data.frame(kdat)

ratings <- kdat[, c("Rater_1", "Rater_2")]

# Unweighted Cohen’s kappa
kappa2(ratings, weight = "unweighted")

# Linear weighted kappa
kappa2(ratings, weight = "equal")

# Quadratic weighted kappa
kappa2(ratings, weight = "squared")

What should match

    • Passing–Bablok slope and intercept
    • Deming slope and intercept
    • Bland–Altman mean bias and limits of agreement
    • Lin’s CCC
    • Unweighted, linear weighted, and quadratic weighted kappa

Practical conclusion

This tutorial highlights an important general point:

  • correlation is not agreement
  • regression is not interchangeability
  • agreement depends on the measurement scale and the practical context
  • categorical agreement depends on whether disagreements are treated as nominal or ordinal

In the quantitative example, the methods are related, but agreement is not strong enough to assume interchangeability automatically.

In the categorical example, agreement is already substantial without weighting, and becomes stronger when the ordinal structure of the categories is taken into account.

That is exactly why BESHStatNG includes both regression-style comparison methods and agreement-specific methods in the same Excel workflow.

What to do next

Once you complete this tutorial, the next useful BESHStatNG workflows are:

If you want to reuse these workbooks for your own studies, keep the same structure:

  • one clean analysis sheet
  • one sheet for helper calculations
  • one sheet for GUI outputs
  • one sheet for static benchmark or reference results
  • one sheet for UDF-based outputs

That structure makes it much easier to build reusable, transparent, and well-documented Excel analysis templates.