Calibration and threshold selection in Excel with BESH Stat NG

A binary classifier is not finished once the model converges.

For students and junior researchers, this is one of the most important lessons in applied analysis. A fitted model can have sensible coefficients and still be poorly reported if we stop at the coefficient table.

In practice, a good reporting workflow asks:

  • how well does the model separate positives from negatives?
  • what happens when the classification threshold changes?
  • are the predicted probabilities believable?
  • does the model still perform well on new data?

This tutorial shows how to answer those questions in Excel with BESH Stat NG using one workbook built around a simple binomial GLM and a holdout evaluation workflow.

The page is written for:

  • students
  • junior researchers
  • early-career biomedical analysts
  • anyone who wants a practical way to think through calibration and threshold selection in Excel

The workbook is intentionally built around a simpler two-predictor model. That is useful for teaching because it avoids the earlier complete-separation problem and gives a more realistic example of threshold and calibration reporting.

Download the example workbook

The workbook contains these sheets:

  • Threshold_Calibration_Guide — a reading guide for the reporting workflow
  • README — workbook overview
  • Data_All — all cases with train/test split
  • Data_Train — training subset used for fitting
  • Data_Test — holdout subset used for evaluation
  • GLM_Model — training-set binomial GLM workflow
  • Holdout_Reporting — holdout probabilities plus classifier-report tables
  • Plot_Data — chart-ready ROC and calibration data
  • Sources — dataset and documentation links

What question are we trying to answer?

The scientific question in this workbook is practical:

Can a simple binary classifier separate malignant from benign cases, and how should we report that classifier in a way that is actually useful?

That question is broader than model fitting alone. A coefficient table answers one kind of question:

Which predictors are associated with the outcome?

A reporting workflow answers a different set of questions:

  • How good is classification on new data?
  • Is the default threshold of 0.50 appropriate?
  • How much do sensitivity and specificity change across thresholds?
  • Does the model rank cases well?
  • Are the predicted probabilities calibrated?

That is the flow of thought this tutorial follows.

The first thing to understand: training data and holdout data are not the same thing

A very common beginner mistake is to fit a model and then report only the training-set output. That can be misleading, because a model usually looks better on the same data used to fit it. This workbook keeps the process explicit by separating:

  • Data_Train for fitting
  • Data_Test for evaluation

Split used in the workbook

SubsetTotal casesMalignantBenign
Training398148250
Holdout / test17164107
All data569212357

This split makes the reporting logic much more honest.

A good teaching rule is:

fit on the training set, report performance on the holdout set.

Why this workbook is useful for students

This workbook keeps the reporting logic visible directly in the worksheet. Instead of hiding everything inside a one-time output sheet, the workbook shows:

  • the train/test split
  • the model control block
  • the fit handle
  • the training-set summary
  • the holdout confusion report
  • the threshold table
  • the calibration table
  • the chart-ready ROC and calibration spills

That makes it useful both as:

  • a practical analysis example
  • a teaching workbook that shows the whole reasoning process

Step 1. Fit a simple binary GLM on the training set

The sheet GLM_Model fits a binomial logit GLM using two predictors:

  • mean_radius
  • mean_texture

The model formula in the workbook is:

A + B
This simpler model is actually a good teaching choice. It still performs well, but it avoids the complete-separation warning that can make a logistic example less suitable for introductory reporting.

Main training-set UDFs

The training workflow is built around:

=BESH.REGR.GLM_FIT(...)
with downstream reporting from:
=BESH.REGR.GLM_SUMMARY(...)
=BESH.REGR.GLM_TESTS(...)
=BESH.REGR.GLM_CLASS(...)
=BESH.REGR.GLM_THRESH(...)
=BESH.REGR.GLM_CALIB(...)
=BESH.REGR.GLM_BRIER(...)
The current public GLM documentation explicitly describes this kind of binary-model reporting, including confusion matrices, threshold tables, calibration, Brier score, ROC tables, and ROC plots.

Training-set model summary

StatisticValue
Training observations398
FamilyBinomial
LinkLogit
Log-likelihood-106.569
Residual deviance213.138
AIC219.138
Pseudo(McFadden) R²0.5943
Iterations7

Training-set coefficient table

ParameterCoefSEZP-value95% CI lower95% CI upper
Intercept-18.79071.9680-9.54811.32E-21-22.6479-14.9335
mean_radius1.00100.11138.99352.39E-190.78281.2191
mean_texture0.20100.04304.67772.90E-060.11680.2852

How should we read this table?

A good first reading is:

  • both predictors have positive coefficients
  • both are strongly statistically significant
  • larger values of these features are associated with a higher probability of malignancy
  • the model converged without the separation warning, which makes it much easier to use as a teaching example

But a good reporting workflow does not stop here.

The next question is:

How does this model perform on new data?

Step 2. Move immediately to the holdout set

This is the point where the tutorial becomes more realistic. The sheet Holdout_Reporting scores the holdout cases using:

=BESH.REGR.GLM_PRED(...)
and then builds the reporting layer with generic classifier UDFs:
=BESH.CLASS.CONFUSION(...)
=BESH.CLASS.THRESH(...)
=BESH.CLASS.CALIB(...)
=BESH.CLASS.BRIER(...)
This is a very useful teaching pattern because it separates:
  • fitting a model
    from
  • reporting a classifier

That distinction is important. A model may fit well, but a classifier still has to be reported carefully.

Step 3. Read the confusion matrix first

At the default threshold of 0.50, the holdout confusion report is:

Observed \\ PredictedBenignMalignantRecall %
Benign104397.20
Malignant145078.13
Precision % / Overall88.1494.3490.06

Holdout summary at threshold 0.50

MetricValue (%)
Sensitivity78.13
Specificity97.20
Precision94.34
NPV88.14
Accuracy90.06
Balanced accuracy87.66
Youden J75.32
F1 score85.47

How should we interpret this?

A sensible reasoning process is:

  1. The model is very good at identifying benign cases:
    • specificity = 97.20%
  2. It is also reasonably good at identifying malignant cases:
    • sensitivity = 78.13%
  3. Precision is strong:
    • when the model predicts malignant, it is correct 94.34% of the time at this threshold
  4. Overall accuracy is high:
    • 90.06%
  5. Balanced accuracy is a useful summary when both classes matter:
    • 87.66%

This is already a good result, but it does not mean that 0.50 is automatically the best threshold. That is the next question to ask.

Step 4. Threshold selection: 0.50 is a starting point, not a rule

One of the most important teaching points in classifier reporting is this:

A default threshold of 0.50 is only one possible operating point.

Different thresholds produce different trade-offs between:

  • sensitivity
  • specificity
  • precision
  • negative predictive value
  • balanced accuracy
  • F1

The holdout threshold table in the workbook compares thresholds from 0.10 to 0.90.

Holdout threshold comparison

ThresholdSensitivity %Specificity %Precision %Accuracy %Balanced accuracy %Youden J %F1 %
0.1098.4471.0367.0281.2984.7369.4779.75
0.2093.7581.3175.0085.9687.5375.0683.33
0.3087.5089.7283.5888.8988.6177.2285.50
0.4082.8191.5985.4888.3087.2074.4084.13
0.5078.1397.2094.3490.0687.6675.3285.47
0.6071.8897.2093.8887.7284.5469.0781.42
0.7070.31100.00100.0088.8985.1670.3182.57
0.8062.50100.00100.0085.9681.2562.5076.92
0.9054.69100.00100.0083.0477.3454.6970.71

How should a junior researcher think through this table?

A good way to read it is:

  • At 0.10, sensitivity is extremely high, but specificity is much lower.
  • At 0.90, specificity and precision are perfect, but sensitivity falls a lot.
  • Around 0.30, the classifier has the best balanced overall compromise in this holdout grid.
  • At 0.50, the classifier becomes more conservative:
    • sensitivity drops
    • specificity rises
    • precision becomes very high

Best thresholds in the workbook

CriterionBest threshold / value
Best Youden threshold0.30
Best F1 threshold0.30
Best balanced accuracy88.61%
Best sensitivity98.44%
Best specificity100.00%

The main lesson is:

there is no universal best threshold.
The right threshold depends on the scientific or clinical goal.

For example:

  • if missing a malignant case is very costly, a lower threshold may be preferable
  • if false alarms are more costly, a higher threshold may be preferable

Step 5. ROC and AUC: how well does the model rank cases?

Threshold tables are helpful, but they still depend on choosing specific cut-offs.

ROC analysis answers a different question:

Across all possible thresholds, how well does the model rank positive cases above negative cases?

The workbook generates chart-ready ROC output using:

=BESH.PLOT.ROC_STATS(...)
and
=BESH.PLOT.ROC_POINTS(...)
The public plot-data documentation includes the ROC and calibration plot-point functions used for this kind of worksheet-based charting. The ROC figure below shows the holdout ROC curve for the fitted GLM.
ROC curve in Excel for holdout predictions from a binomial GLM, showing strong discrimination and a curve well above the diagonal reference line.
ROC curve for the holdout predictions from the binomial GLM, showing strong discrimination with AUC about 0.965.

Holdout ROC summary

MetricValue
Wilcoxon AUC0.96495
DeLong standard error0.01198
DeLong 95% CI0.94147 to 0.98844
Hanley–McNeil standard error0.01204
Hanley–McNeil 95% CI0.94136 to 0.98855
Two-sided p-value for AUC ≠ 0.52.90E-24

How should we interpret this?

A clear interpretation is:

  • the model has excellent discrimination
  • malignant cases tend to receive higher predicted probabilities than benign cases
  • the ROC curve stays close to the upper-left corner
  • the AUC is well above 0.90, which is strong evidence of useful ranking performance

But ROC is only part of the story. A model can rank cases very well and still produce poorly calibrated probabilities. That is why the next step matters.

Step 6. Calibration: are the probabilities believable?

Calibration asks a different question from ROC.

ROC asks:

Does the model rank cases well?

Calibration asks:

When the model predicts about 20% risk, do roughly 20% of those cases actually turn out positive?

That is why calibration is essential whenever predicted probabilities may be interpreted directly. The workbook reports holdout calibration with:

=BESH.CLASS.CALIB(...)
and chart-ready plot data with:
=BESH.PLOT.CALIB_POINTS(...)
Holdout calibration bins
BinNMean predicted probabilityObserved event rate95% CI lower95% CI upper
1180.00350.00000.00000.1759
2170.01270.00000.00000.1843
3170.02760.00000.00000.1843
4170.05080.05880.01050.2698
5170.10040.05880.01050.2698
6170.23000.29410.13280.5313
7170.41300.52940.30960.7383
8170.74400.82350.58970.9381
9170.97881.00000.81571.0000
10170.99791.00000.81571.0000

How should we interpret this table?

The calibration picture is reassuring:

  • very low predicted probabilities correspond to almost no observed events
  • very high predicted probabilities correspond to almost all observed events
  • the middle bins show a reasonable increase in observed event rate as mean predicted probability rises

This means the model is not only discriminative, but also reasonably well calibrated on the holdout set. That is an important teaching point:

good classifier reporting should discuss both discrimination and calibration.

Step 7. Brier score: a compact probability-quality summary

The Brier score is useful because it evaluates the full probability forecast, not just hard classifications at one threshold.

Holdout Brier score

MetricValue
Brier score0.03365

A lower Brier score is better.

In this workbook, the Brier score is low, which is consistent with:

  • good discrimination
  • and reasonably good calibration

This is why Brier score is such a useful teaching metric. It reminds us that predicted probabilities matter, not only thresholded class labels.

A practical flow of thought for junior researchers

If you are new to classifier reporting, a good order to follow is:

1. Start with the train/test split

Ask:

  • what data were used for fitting?
  • what data were held out for evaluation?

2. Fit the model

Check:

  • did it converge?
  • do the coefficient signs make sense?
  • is the model simple enough to explain?

3. Move to the holdout confusion matrix

This gives the first practical picture of performance.

4. Compare thresholds

Ask:

  • what happens if sensitivity matters more?
  • what happens if specificity matters more?
  • is 0.50 really the right operating point?

5. Look at ROC and AUC

This tells you how well the model ranks cases across all thresholds.

6. Look at calibration and Brier score

This tells you whether the predicted probabilities themselves are believable and useful. That sequence is exactly why the workbook is organized the way it is.

GUI or UDF workflow?

A very practical question for students is:

Should I use the GUI or the worksheet formulas?

The best answer is: both are useful, but for different reasons.

Use the GUI when you want:

  • a fast first fit
  • an easy way to explore options
  • a visual teaching workflow
  • quick output for review

Use UDFs when you want:

  • a reproducible workbook
  • a visible audit trail
  • reusable reporting blocks
  • chart-ready plot data in the sheet
  • a template that can be adapted later

A good learning path is:

  1. start with the GUI to understand the model
  2. then move to the UDF workbook
  3. use the formula-driven version as the long-term reporting template

That is one of the strengths of BESH Stat NG: it supports both approaches.

Compatibility note for older Excel versions

This workbook works best in modern versions of Excel that support dynamic arrays and spill ranges. Older Excel versions do not support spills. In those versions, users may need to:

  • preselect the full target output range
  • enter the formula across that range
  • confirm it as an array formula
  • in some versions, use Ctrl+Shift+Enter

So the same BESH Stat NG logic can still be used in older Excel, but the workbook may require more manual array-formula entry.

Learn more in the documentation

The following help pages are good companion links for this tutorial:

Validation and transparency

This workbook is designed to be transparent and reusable. It keeps:

  • the full dataset
  • the train/test split
  • the model control block
  • the fit handle
  • the training summary
  • the holdout confusion report
  • the threshold table
  • the calibration table
  • the chart-ready ROC and calibration data
  • the source links

That makes it useful not only as a tutorial workbook, but also as a reusable reporting template. For a broader overview of how BESH Stat NG results are checked, including reference datasets, automated tests, and comparison workflows, see the Validation page.

Summary

This tutorial shows that calibration and threshold selection in Excel should not be treated as an afterthought.

A good classifier report should ask:

  • how was the model trained?
  • how does it perform on holdout data?
  • what does the confusion matrix show?
  • how does performance change across thresholds?
  • how good is discrimination?
  • are the probabilities calibrated?

In this workbook, the answers are strong:

  • the holdout confusion matrix is good
  • the default threshold of 0.50 works well
  • the threshold 0.30 gives the best overall compromise on this holdout grid
  • the ROC AUC is 0.96495
  • calibration is sensible
  • the Brier score is low

That makes this workbook a strong teaching example for students and junior researchers learning how to report a binary classifier in Excel.

Frequently asked questions

Why is a holdout set important?

Because a model usually looks better on the same data used to fit it. A holdout set gives a more honest picture of predictive performance.

Why is 0.50 not always the best threshold?

Because the best threshold depends on the goal. Sometimes sensitivity matters most, sometimes specificity, and sometimes a balanced compromise is preferred.

What does AUC tell me?

AUC tells you how well the model ranks positive cases above negative cases across all thresholds.

Why do I need calibration if the ROC curve is excellent?

Because a model can rank cases well but still produce poorly calibrated probabilities. ROC and calibration answer different questions.

What does the Brier score add?

It evaluates the full probability forecast, not only thresholded class labels, and gives a compact summary of probability quality.

See also