A binary classifier is not finished once the model converges.
For students and junior researchers, this is one of the most important lessons in applied analysis. A fitted model can have sensible coefficients and still be poorly reported if we stop at the coefficient table.
In practice, a good reporting workflow asks:
- how well does the model separate positives from negatives?
- what happens when the classification threshold changes?
- are the predicted probabilities believable?
- does the model still perform well on new data?
This tutorial shows how to answer those questions in Excel with BESH Stat NG using one workbook built around a simple binomial GLM and a holdout evaluation workflow.
The page is written for:
- students
- junior researchers
- early-career biomedical analysts
- anyone who wants a practical way to think through calibration and threshold selection in Excel
The workbook is intentionally built around a simpler two-predictor model. That is useful for teaching because it avoids the earlier complete-separation problem and gives a more realistic example of threshold and calibration reporting.
Download the example workbook
The workbook contains these sheets:
- Threshold_Calibration_Guide — a reading guide for the reporting workflow
- README — workbook overview
- Data_All — all cases with train/test split
- Data_Train — training subset used for fitting
- Data_Test — holdout subset used for evaluation
- GLM_Model — training-set binomial GLM workflow
- Holdout_Reporting — holdout probabilities plus classifier-report tables
- Plot_Data — chart-ready ROC and calibration data
- Sources — dataset and documentation links
What question are we trying to answer?
The scientific question in this workbook is practical:
Can a simple binary classifier separate malignant from benign cases, and how should we report that classifier in a way that is actually useful?
That question is broader than model fitting alone. A coefficient table answers one kind of question:
Which predictors are associated with the outcome?
A reporting workflow answers a different set of questions:
- How good is classification on new data?
- Is the default threshold of 0.50 appropriate?
- How much do sensitivity and specificity change across thresholds?
- Does the model rank cases well?
- Are the predicted probabilities calibrated?
That is the flow of thought this tutorial follows.
The first thing to understand: training data and holdout data are not the same thing
A very common beginner mistake is to fit a model and then report only the training-set output. That can be misleading, because a model usually looks better on the same data used to fit it. This workbook keeps the process explicit by separating:
- Data_Train for fitting
- Data_Test for evaluation
Split used in the workbook
| Subset | Total cases | Malignant | Benign |
|---|---|---|---|
| Training | 398 | 148 | 250 |
| Holdout / test | 171 | 64 | 107 |
| All data | 569 | 212 | 357 |
This split makes the reporting logic much more honest.
A good teaching rule is:
fit on the training set, report performance on the holdout set.
Why this workbook is useful for students
This workbook keeps the reporting logic visible directly in the worksheet. Instead of hiding everything inside a one-time output sheet, the workbook shows:
- the train/test split
- the model control block
- the fit handle
- the training-set summary
- the holdout confusion report
- the threshold table
- the calibration table
- the chart-ready ROC and calibration spills
That makes it useful both as:
- a practical analysis example
- a teaching workbook that shows the whole reasoning process
Step 1. Fit a simple binary GLM on the training set
The sheet GLM_Model fits a binomial logit GLM using two predictors:
mean_radiusmean_texture
The model formula in the workbook is:
A + BMain training-set UDFs
The training workflow is built around:
=BESH.REGR.GLM_FIT(...)=BESH.REGR.GLM_SUMMARY(...)=BESH.REGR.GLM_TESTS(...)=BESH.REGR.GLM_CLASS(...)=BESH.REGR.GLM_THRESH(...)=BESH.REGR.GLM_CALIB(...)=BESH.REGR.GLM_BRIER(...)Training-set model summary
| Statistic | Value |
|---|---|
| Training observations | 398 |
| Family | Binomial |
| Link | Logit |
| Log-likelihood | -106.569 |
| Residual deviance | 213.138 |
| AIC | 219.138 |
| Pseudo(McFadden) R² | 0.5943 |
| Iterations | 7 |
Training-set coefficient table
| Parameter | Coef | SE | Z | P-value | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| Intercept | -18.7907 | 1.9680 | -9.5481 | 1.32E-21 | -22.6479 | -14.9335 |
| mean_radius | 1.0010 | 0.1113 | 8.9935 | 2.39E-19 | 0.7828 | 1.2191 |
| mean_texture | 0.2010 | 0.0430 | 4.6777 | 2.90E-06 | 0.1168 | 0.2852 |
How should we read this table?
A good first reading is:
- both predictors have positive coefficients
- both are strongly statistically significant
- larger values of these features are associated with a higher probability of malignancy
- the model converged without the separation warning, which makes it much easier to use as a teaching example
But a good reporting workflow does not stop here.
The next question is:
How does this model perform on new data?
Step 2. Move immediately to the holdout set
This is the point where the tutorial becomes more realistic. The sheet Holdout_Reporting scores the holdout cases using:
=BESH.REGR.GLM_PRED(...)=BESH.CLASS.CONFUSION(...)=BESH.CLASS.THRESH(...)=BESH.CLASS.CALIB(...)=BESH.CLASS.BRIER(...)- fitting a model
from - reporting a classifier
That distinction is important. A model may fit well, but a classifier still has to be reported carefully.
Step 3. Read the confusion matrix first
At the default threshold of 0.50, the holdout confusion report is:
| Observed \\ Predicted | Benign | Malignant | Recall % |
|---|---|---|---|
| Benign | 104 | 3 | 97.20 |
| Malignant | 14 | 50 | 78.13 |
| Precision % / Overall | 88.14 | 94.34 | 90.06 |
Holdout summary at threshold 0.50
| Metric | Value (%) |
|---|---|
| Sensitivity | 78.13 |
| Specificity | 97.20 |
| Precision | 94.34 |
| NPV | 88.14 |
| Accuracy | 90.06 |
| Balanced accuracy | 87.66 |
| Youden J | 75.32 |
| F1 score | 85.47 |
How should we interpret this?
A sensible reasoning process is:
- The model is very good at identifying benign cases:
- specificity = 97.20%
- It is also reasonably good at identifying malignant cases:
- sensitivity = 78.13%
- Precision is strong:
- when the model predicts malignant, it is correct 94.34% of the time at this threshold
- Overall accuracy is high:
- 90.06%
- Balanced accuracy is a useful summary when both classes matter:
- 87.66%
This is already a good result, but it does not mean that 0.50 is automatically the best threshold. That is the next question to ask.
Step 4. Threshold selection: 0.50 is a starting point, not a rule
One of the most important teaching points in classifier reporting is this:
A default threshold of 0.50 is only one possible operating point.
Different thresholds produce different trade-offs between:
- sensitivity
- specificity
- precision
- negative predictive value
- balanced accuracy
- F1
The holdout threshold table in the workbook compares thresholds from 0.10 to 0.90.
Holdout threshold comparison
| Threshold | Sensitivity % | Specificity % | Precision % | Accuracy % | Balanced accuracy % | Youden J % | F1 % |
|---|---|---|---|---|---|---|---|
| 0.10 | 98.44 | 71.03 | 67.02 | 81.29 | 84.73 | 69.47 | 79.75 |
| 0.20 | 93.75 | 81.31 | 75.00 | 85.96 | 87.53 | 75.06 | 83.33 |
| 0.30 | 87.50 | 89.72 | 83.58 | 88.89 | 88.61 | 77.22 | 85.50 |
| 0.40 | 82.81 | 91.59 | 85.48 | 88.30 | 87.20 | 74.40 | 84.13 |
| 0.50 | 78.13 | 97.20 | 94.34 | 90.06 | 87.66 | 75.32 | 85.47 |
| 0.60 | 71.88 | 97.20 | 93.88 | 87.72 | 84.54 | 69.07 | 81.42 |
| 0.70 | 70.31 | 100.00 | 100.00 | 88.89 | 85.16 | 70.31 | 82.57 |
| 0.80 | 62.50 | 100.00 | 100.00 | 85.96 | 81.25 | 62.50 | 76.92 |
| 0.90 | 54.69 | 100.00 | 100.00 | 83.04 | 77.34 | 54.69 | 70.71 |
How should a junior researcher think through this table?
A good way to read it is:
- At 0.10, sensitivity is extremely high, but specificity is much lower.
- At 0.90, specificity and precision are perfect, but sensitivity falls a lot.
- Around 0.30, the classifier has the best balanced overall compromise in this holdout grid.
- At 0.50, the classifier becomes more conservative:
- sensitivity drops
- specificity rises
- precision becomes very high
Best thresholds in the workbook
| Criterion | Best threshold / value |
|---|---|
| Best Youden threshold | 0.30 |
| Best F1 threshold | 0.30 |
| Best balanced accuracy | 88.61% |
| Best sensitivity | 98.44% |
| Best specificity | 100.00% |
The main lesson is:
there is no universal best threshold.
The right threshold depends on the scientific or clinical goal.
For example:
- if missing a malignant case is very costly, a lower threshold may be preferable
- if false alarms are more costly, a higher threshold may be preferable
Step 5. ROC and AUC: how well does the model rank cases?
Threshold tables are helpful, but they still depend on choosing specific cut-offs.
ROC analysis answers a different question:
Across all possible thresholds, how well does the model rank positive cases above negative cases?
The workbook generates chart-ready ROC output using:
=BESH.PLOT.ROC_STATS(...)=BESH.PLOT.ROC_POINTS(...)
Holdout ROC summary
| Metric | Value |
|---|---|
| Wilcoxon AUC | 0.96495 |
| DeLong standard error | 0.01198 |
| DeLong 95% CI | 0.94147 to 0.98844 |
| Hanley–McNeil standard error | 0.01204 |
| Hanley–McNeil 95% CI | 0.94136 to 0.98855 |
| Two-sided p-value for AUC ≠ 0.5 | 2.90E-24 |
How should we interpret this?
A clear interpretation is:
- the model has excellent discrimination
- malignant cases tend to receive higher predicted probabilities than benign cases
- the ROC curve stays close to the upper-left corner
- the AUC is well above 0.90, which is strong evidence of useful ranking performance
But ROC is only part of the story. A model can rank cases very well and still produce poorly calibrated probabilities. That is why the next step matters.
Step 6. Calibration: are the probabilities believable?
Calibration asks a different question from ROC.
ROC asks:
Does the model rank cases well?
Calibration asks:
When the model predicts about 20% risk, do roughly 20% of those cases actually turn out positive?
That is why calibration is essential whenever predicted probabilities may be interpreted directly. The workbook reports holdout calibration with:
=BESH.CLASS.CALIB(...)=BESH.PLOT.CALIB_POINTS(...)| Bin | N | Mean predicted probability | Observed event rate | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|
| 1 | 18 | 0.0035 | 0.0000 | 0.0000 | 0.1759 |
| 2 | 17 | 0.0127 | 0.0000 | 0.0000 | 0.1843 |
| 3 | 17 | 0.0276 | 0.0000 | 0.0000 | 0.1843 |
| 4 | 17 | 0.0508 | 0.0588 | 0.0105 | 0.2698 |
| 5 | 17 | 0.1004 | 0.0588 | 0.0105 | 0.2698 |
| 6 | 17 | 0.2300 | 0.2941 | 0.1328 | 0.5313 |
| 7 | 17 | 0.4130 | 0.5294 | 0.3096 | 0.7383 |
| 8 | 17 | 0.7440 | 0.8235 | 0.5897 | 0.9381 |
| 9 | 17 | 0.9788 | 1.0000 | 0.8157 | 1.0000 |
| 10 | 17 | 0.9979 | 1.0000 | 0.8157 | 1.0000 |
How should we interpret this table?
The calibration picture is reassuring:
- very low predicted probabilities correspond to almost no observed events
- very high predicted probabilities correspond to almost all observed events
- the middle bins show a reasonable increase in observed event rate as mean predicted probability rises
This means the model is not only discriminative, but also reasonably well calibrated on the holdout set. That is an important teaching point:
good classifier reporting should discuss both discrimination and calibration.
Step 7. Brier score: a compact probability-quality summary
The Brier score is useful because it evaluates the full probability forecast, not just hard classifications at one threshold.
Holdout Brier score
| Metric | Value |
|---|---|
| Brier score | 0.03365 |
A lower Brier score is better.
In this workbook, the Brier score is low, which is consistent with:
- good discrimination
- and reasonably good calibration
This is why Brier score is such a useful teaching metric. It reminds us that predicted probabilities matter, not only thresholded class labels.
A practical flow of thought for junior researchers
If you are new to classifier reporting, a good order to follow is:
1. Start with the train/test split
Ask:
- what data were used for fitting?
- what data were held out for evaluation?
2. Fit the model
Check:
- did it converge?
- do the coefficient signs make sense?
- is the model simple enough to explain?
3. Move to the holdout confusion matrix
This gives the first practical picture of performance.
4. Compare thresholds
Ask:
- what happens if sensitivity matters more?
- what happens if specificity matters more?
- is 0.50 really the right operating point?
5. Look at ROC and AUC
This tells you how well the model ranks cases across all thresholds.
6. Look at calibration and Brier score
This tells you whether the predicted probabilities themselves are believable and useful. That sequence is exactly why the workbook is organized the way it is.
GUI or UDF workflow?
A very practical question for students is:
Should I use the GUI or the worksheet formulas?
The best answer is: both are useful, but for different reasons.
Use the GUI when you want:
- a fast first fit
- an easy way to explore options
- a visual teaching workflow
- quick output for review
Use UDFs when you want:
- a reproducible workbook
- a visible audit trail
- reusable reporting blocks
- chart-ready plot data in the sheet
- a template that can be adapted later
A good learning path is:
- start with the GUI to understand the model
- then move to the UDF workbook
- use the formula-driven version as the long-term reporting template
That is one of the strengths of BESH Stat NG: it supports both approaches.
Compatibility note for older Excel versions
This workbook works best in modern versions of Excel that support dynamic arrays and spill ranges. Older Excel versions do not support spills. In those versions, users may need to:
- preselect the full target output range
- enter the formula across that range
- confirm it as an array formula
- in some versions, use Ctrl+Shift+Enter
So the same BESH Stat NG logic can still be used in older Excel, but the workbook may require more manual array-formula entry.
Learn more in the documentation
The following help pages are good companion links for this tutorial:
Validation and transparency
This workbook is designed to be transparent and reusable. It keeps:
- the full dataset
- the train/test split
- the model control block
- the fit handle
- the training summary
- the holdout confusion report
- the threshold table
- the calibration table
- the chart-ready ROC and calibration data
- the source links
That makes it useful not only as a tutorial workbook, but also as a reusable reporting template. For a broader overview of how BESH Stat NG results are checked, including reference datasets, automated tests, and comparison workflows, see the Validation page.
Summary
This tutorial shows that calibration and threshold selection in Excel should not be treated as an afterthought.
A good classifier report should ask:
- how was the model trained?
- how does it perform on holdout data?
- what does the confusion matrix show?
- how does performance change across thresholds?
- how good is discrimination?
- are the probabilities calibrated?
In this workbook, the answers are strong:
- the holdout confusion matrix is good
- the default threshold of 0.50 works well
- the threshold 0.30 gives the best overall compromise on this holdout grid
- the ROC AUC is 0.96495
- calibration is sensible
- the Brier score is low
That makes this workbook a strong teaching example for students and junior researchers learning how to report a binary classifier in Excel.
Frequently asked questions
Why is a holdout set important?
Because a model usually looks better on the same data used to fit it. A holdout set gives a more honest picture of predictive performance.
Why is 0.50 not always the best threshold?
Because the best threshold depends on the goal. Sometimes sensitivity matters most, sometimes specificity, and sometimes a balanced compromise is preferred.
What does AUC tell me?
AUC tells you how well the model ranks positive cases above negative cases across all thresholds.
Why do I need calibration if the ROC curve is excellent?
Because a model can rank cases well but still produce poorly calibrated probabilities. ROC and calibration answer different questions.
What does the Brier score add?
It evaluates the full probability forecast, not only thresholded class labels, and gives a compact summary of probability quality.