Method comparison and agreement analysis is essential whenever two measurement methods, two devices, or two raters are used on the same subjects.
BESHStatNG makes it possible to perform these workflows directly in Excel, including:
- Passing–Bablok regression
- Deming regression
- Bland–Altman analysis
- Lin’s concordance correlation coefficient
- Cohen’s kappa
- Weighted kappa
This tutorial shows how to use these methods in a practical Excel workflow with two downloadable workbooks:
- a quantitative agreement workbook for paired numeric measurements
- a categorical agreement workbook for paired ratings
The goal is not only to show where to click, but also to explain how to interpret agreement results correctly and how to compare BESHStatNG output with workbook-based reference results.
In this tutorial, you will learn how to:
- compare two quantitative measurement methods
- separate association from agreement
- interpret bias, limits of agreement, and concordance
- analyse paired categorical ratings with unweighted and weighted kappa
- use workbook-based helper sheets and UDF sections to create reusable Excel templates
Which method should I use?
- Passing–Bablok: robust method-comparison regression
- Deming: regression when both methods have error
- Bland–Altman: bias and limits of agreement
- Lin’s CCC: single-number concordance summary
- Kappa / weighted kappa: categorical agreement
Download the tutorial workbooks
This tutorial uses two example workbooks.
[Download the quantitative agreement workbook]
Use this workbook for:
- Passing–Bablok regression
- Deming regression
- Bland–Altman analysis
- Lin’s concordance correlation coefficient
[Download the categorical agreement workbook]
Use this workbook for:
- Cohen’s kappa
- linearly weighted kappa
- quadratically weighted kappa
Both workbooks are designed so you can follow the tutorial step by step, reproduce the analysis in Excel, and later adapt the same workbook structure to your own method-comparison or agreement studies.
The example workbooks
This tutorial is split into two parts.
1. Quantitative agreement workbook
The quantitative workbook contains 87 paired measurements from 53 patients, with repeated observations for some subjects.
The main analysis sheet includes:
- PatientID
- Reference_X
- Test_Y
- PairMean
- Diff_Y_minus_X
This workbook supports:
- grouped Passing–Bablok regression
- Deming regression
- Bland–Altman analysis in repeated-by-subject mode
- Lin’s CCC
- formula-driven helper calculations
- a placeholder worksheet for future agreement UDF support
2. Categorical agreement workbook
The categorical workbook contains 51 paired ratings across 4 ordered categories.
The main analysis sheet includes:
- Rater_1
- Rater_2
- Code_1
- Code_2
This workbook supports:
- unweighted Cohen’s kappa
- linear weighted kappa
- quadratic weighted kappa
- formula-driven confusion matrix calculations
- a placeholder worksheet for future agreement UDF support
Start from the BESH Stat NG ribbon
All agreement methods used in this tutorial are available from the Agreement section of the BESH Stat NG ribbon.

For quantitative methods, start from the quantitative agreement workbook.
For categorical agreement, use the kappa workbook.
Part I — Quantitative method comparison and agreement
Passing–Bablok regression
Passing–Bablok regression is a robust method-comparison regression that is widely used when two quantitative methods are compared on the same subjects.
In this workbook, the analysis uses:
- Group (optional):
PatientID - Reference method (X):
Reference_X - Test method (Y):
Test_Y
Because the dataset contains repeated observations for some patients, the grouped / block version is the most appropriate summary for the tutorial.

From the workbook results:
- Sample size: 87 paired observations
- Number of groups: 53
- Slope: 0.8543
- 95% CI for slope: 0.6886 to 1.0592
- Intercept: 0.1551
- 95% CI for intercept: 0.0773 to 0.2144
The Passing–Bablok slope is close to 1, but its confidence interval includes values below and above 1. This means there is no strong evidence of clear proportional bias from this method alone. The intercept is clearly above 0, which suggests a positive systematic difference between the two methods across the observed range.
Deming regression
Deming regression is useful when both methods are measured with error, which is often a more realistic assumption than ordinary least squares in method-comparison studies.
In this tutorial, the Deming analysis uses:
- Reference method (X):
Reference_X - Test method (Y):
Test_Y - Confidence interval type: Bootstrap percentile
- Bootstrap replicates: 2000
- alpha: 0.050
- Variance model: Constant lambda
- Error ratio: 1.000
- Fit intercept: checked


From the workbook results:
- Slope: 0.7918
- 95% CI for slope: 0.5904 to 1.1350
- Intercept: 0.2025
- 95% CI for intercept: 0.1087 to 0.2763
Deming regression tells a similar story to Passing–Bablok. The slope is below 1, but the confidence interval still includes 1, so strong proportional bias is not established. The positive intercept again suggests systematic offset between the two methods.
When Passing–Bablok and Deming point in the same direction, confidence in the overall interpretation increases.
Bland–Altman analysis
Bland–Altman analysis focuses on agreement on the measurement scale, not just regression or correlation.
In this workbook, the Bland–Altman analysis uses:
- Subject ID:
PatientID - Reference method (X):
Reference_X - Test method (Y):
Test_Y - Mode: Repeated by subject
- Scale: Raw difference
- X-axis: Mean of methods
- Plot: All observations + subject means
- CI method: Bootstrap percentile
- Bootstrap replicates: 2000
- alpha: 0.050
- Check proportional bias: checked
- Exclude singleton subjects: checked
- Allow fallback to simple analysis: checked


From the workbook results:
- Mean bias (Y − X): 0.1394
- 95% CI for bias: 0.1052 to 0.1780
- SD of differences: 0.1575
- Lower limit of agreement: −0.1693
- Upper limit of agreement: 0.4481
- Proportional-bias slope: −0.1611
The positive mean bias shows that, on average, the test method tends to read higher than the reference method. The limits of agreement are fairly wide relative to the scale of the measurements, which means the two methods should not automatically be treated as interchangeable. The negative proportional-bias slope also suggests that the size of the difference may depend somewhat on the measurement level.
Bland–Altman analysis answers a different question than regression: not whether two methods are associated, but whether their differences are small enough to be acceptable in practice.
Lin’s concordance correlation coefficient
Lin’s CCC combines correlation and agreement into a single concordance measure.
In this tutorial, the Lin’s CCC analysis uses:
- Reference method (X):
Reference_X - Test method (Y):
Test_Y - Confidence interval type: Bootstrap percentile
- Bootstrap replicates: 2000
- alpha: 0.050
- Null concordance: 0.00


From the workbook results:
- Lin CCC: 0.3777
- 95% CI: 0.2102 to 0.5344
- Pearson r: 0.5237
- Bias-correction factor Cb: 0.7211
The Pearson correlation is moderate, but Lin’s CCC is noticeably lower because concordance depends not only on correlation but also on how closely the data follow the 45-degree line of perfect agreement. The bias-correction factor below 1 confirms that the methods differ in level and/or scale, so agreement is weaker than correlation alone would suggest.
A method pair can be correlated without agreeing well. Lin’s CCC helps make that distinction explicit.
Interpreting the quantitative agreement workflow
Taken together, the quantitative workbook shows a consistent pattern:
- Passing–Bablok and Deming both suggest a positive systematic difference between methods
- Bland–Altman shows a positive average bias and fairly wide limits of agreement
- Lin’s CCC indicates that overall concordance is only moderate
In this example, the test method is not simply a noisy version of the reference method. The results suggest both offset and imperfect agreement. That means the two methods may be related, but they should not automatically be treated as interchangeable without further clinical or practical justification.
Part II — Categorical agreement with kappa
Cohen’s kappa and weighted kappa
For categorical ratings, agreement is usually analysed through kappa statistics rather than regression-based methods.
This workbook uses 51 paired ratings across 4 categories and compares:
- unweighted Cohen’s kappa
- linear weighted kappa
- quadratic weighted kappa
The workbook also includes:
- the observed confusion matrix
- row and column totals
- linear and quadratic weight matrices
- formula-driven agreement calculations




Kappa results
The confusion matrix in the workbook is:
- A/A = 10
- A/B = 1
- A/C = 2
- A/D = 0
- B/A = 1
- B/B = 10
- B/C = 2
- B/D = 2
- C/A = 0
- C/B = 0
- C/C = 6
- C/D = 1
- D/A = 0
- D/B = 2
- D/C = 0
- D/D = 14
From the GUI results:
Unweighted Cohen’s kappa
- Kappa: 0.7080
- 95% CI: 0.5373 to 0.8425
- Observed agreement: 0.7843
- Expected agreement: 0.2614
Linear weighted kappa
- Kappa: 0.7439
- 95% CI: 0.5731 to 0.8783
- Weighted observed agreement: 0.8889
- Weighted expected agreement: 0.5662
Quadratic weighted kappa
- Kappa: 0.7919
- 95% CI: 0.6290 to 0.9113
- Weighted observed agreement: 0.9368
- Weighted expected agreement: 0.6964
The unweighted kappa already shows substantial agreement. When ordinal disagreement is weighted so that near-misses are penalized less than large disagreements, the kappa increases. Quadratic weighting gives the highest value because it treats small category disagreements as less serious than large ones.
Interpreting weighted kappa
This example is useful because it shows why weighting matters.
If the categories have a natural order, then a disagreement between adjacent categories is usually less serious than a disagreement between categories at opposite ends of the scale.
That is why:
- unweighted kappa is the strictest nominal-agreement summary
- linear weighted kappa gives partial credit for near agreement
- quadratic weighted kappa gives even more credit when disagreements are small
In this workbook, the agreement conclusion becomes stronger as the weighting scheme better reflects the ordered nature of the categories.
Results match workbook references
Both tutorial workbooks include sheets that help you cross-check the analysis.
Quantitative workbook
Use:
- Reference_Results
- Excel_Helper
- BESH_UDF_Placeholders
The key quantitative reference values are:
- Passing–Bablok slope: 0.8543
- Passing–Bablok intercept: 0.1551
- Deming slope: 0.7918
- Deming intercept: 0.2025
- Bland–Altman bias: 0.1394
- Lin CCC: 0.3777
Kappa workbook
Use:
- Confusion_Matrix
- GUI_results1
- GUI_results2
- GUI_results3
- BESH_UDF_Placeholders
The key kappa reference values are:
- Unweighted kappa: 0.7080
- Linear weighted kappa: 0.7439
- Quadratic weighted kappa: 0.7919
This makes the workbooks useful not only as tutorials but also as validation examples. You can compare formula-driven helper calculations, GUI-based BESHStatNG output, and agreement UDF results in the same Excel file.
R reference
You can cross-check the main workbook results in R with the script below.
It reads the two tutorial workbooks and reproduces the core analyses used on this page:
Passing–Bablok regression, Deming regression,
Bland–Altman summary statistics, Lin’s concordance correlation coefficient,
and Cohen’s / weighted kappa.
The quantitative workbook uses the Analysis_Data sheet with
PatientID, Reference_X, and Test_Y.
The categorical workbook uses the Analysis_Data sheet with
Rater_1 and Rater_2.
Note: BESHStatNG uses grouped Passing–Bablok and repeated-by-subject Bland–Altman settings in this tutorial, so exact confidence intervals may differ slightly depending on the R package and CI method used. The main coefficients and agreement summaries should still be very close.
# install.packages(c("readxl", "mcr", "DescTools", "irr"))
library(readxl)
library(mcr)
library(DescTools)
library(irr)
# Quantitative workbook
qdat <- read_excel("BESHStatNG_agreement_tutorial_quantitative.xlsx", sheet = "Analysis_Data")
qdat <- as.data.frame(qdat)
x <- qdat$Reference_X
y <- qdat$Test_Y
id <- qdat$PatientID
# Passing–Bablok regression
fit_pb <- mcreg(x, y, method.reg = "PaBa")
summary(fit_pb)
# Deming regression (error ratio = 1)
fit_dem <- mcreg(x, y, method.reg = "Deming", error.ratio = 1)
summary(fit_dem)
# Bland–Altman summary
diff_xy <- y - x
mean_xy <- (x + y) / 2
data.frame(
Mean_Bias = mean(diff_xy, na.rm = TRUE),
SD_Diff = sd(diff_xy, na.rm = TRUE),
LoA_Lower = mean(diff_xy, na.rm = TRUE) - 1.96 * sd(diff_xy, na.rm = TRUE),
LoA_Upper = mean(diff_xy, na.rm = TRUE) + 1.96 * sd(diff_xy, na.rm = TRUE),
Proportional_Bias_Slope = coef(lm(diff_xy ~ mean_xy))[2]
)
# Lin’s concordance correlation coefficient
CCC(x, y)
# Categorical workbook
kdat <- read_excel("BESHStatNG_agreement_tutorial_kappa.xlsx", sheet = "Analysis_Data")
kdat <- as.data.frame(kdat)
ratings <- kdat[, c("Rater_1", "Rater_2")]
# Unweighted Cohen’s kappa
kappa2(ratings, weight = "unweighted")
# Linear weighted kappa
kappa2(ratings, weight = "equal")
# Quadratic weighted kappa
kappa2(ratings, weight = "squared")
What should match
- Passing–Bablok slope and intercept
- Deming slope and intercept
- Bland–Altman mean bias and limits of agreement
- Lin’s CCC
- Unweighted, linear weighted, and quadratic weighted kappa
Practical conclusion
This tutorial highlights an important general point:
- correlation is not agreement
- regression is not interchangeability
- agreement depends on the measurement scale and the practical context
- categorical agreement depends on whether disagreements are treated as nominal or ordinal
In the quantitative example, the methods are related, but agreement is not strong enough to assume interchangeability automatically.
In the categorical example, agreement is already substantial without weighting, and becomes stronger when the ordinal structure of the categories is taken into account.
That is exactly why BESHStatNG includes both regression-style comparison methods and agreement-specific methods in the same Excel workflow.
What to do next
Once you complete this tutorial, the next useful BESHStatNG workflows are:
If you want to reuse these workbooks for your own studies, keep the same structure:
- one clean analysis sheet
- one sheet for helper calculations
- one sheet for GUI outputs
- one sheet for static benchmark or reference results
- one sheet for UDF-based outputs
That structure makes it much easier to build reusable, transparent, and well-documented Excel analysis templates.