Binary classifiers are often introduced as simple “yes/no” prediction tools, but good reporting goes beyond fitting a model and checking whether it is statistically significant.
In real analysis, we usually want to know:
- how well the model separates positives from negatives
- whether the probabilities are well calibrated
- what happens when the classification threshold changes
- whether the model still performs well on new data, not only on the training set
This tutorial shows how to do binary classifier reporting in Excel with BESH Stat NG using one workbook built around a binomial GLM and a holdout evaluation workflow.
The workbook uses a real breast-cancer diagnostic dataset and walks through:
- fitting the classifier on a training set
- scoring a holdout set
- reading the confusion matrix
- comparing thresholds
- interpreting ROC and AUC
- checking calibration
- using the Brier score as a probability-quality summary
This page is written for students, junior researchers, and applied biomedical users. The goal is not only to show how to run the formulas, but also how to think through the results in a sensible order.
Download the example workbook
The workbook contains these sheets:
- README — workbook overview and reading order
- Data_All — all cases with train/test split
- Data_Train — training subset used for model fitting
- Data_Test — holdout subset used for external-style reporting
- GLM_Model — fit-handle workflow for a binomial GLM
- Holdout_Reporting — holdout probabilities plus generic classifier-report UDFs
- Plot_Data — ROC and calibration plot-data spills
- Sources — dataset and documentation links
What question are we trying to answer?
The scientific question in this workbook is straightforward:
Can a small set of numeric features help distinguish malignant from benign breast lesions, and how should we report the model’s predictive performance?
That question is more practical than “is any coefficient significant?” because binary classifier reporting is usually about decision support, not just model fitting.
A useful reporting workflow asks the questions in this order:
- What data were used for fitting and what data were held out?
- Does the fitted model show sensible directions of effect?
- How good is performance on the holdout set?
- Is the default threshold of 0.50 the best choice?
- Does the model rank cases well?
- Are the predicted probabilities calibrated?
- Can the whole process be made reproducible inside the workbook?
That is the logic this tutorial follows.
The example data and split
The workbook uses the Breast Cancer Wisconsin (Diagnostic) dataset referenced in the workbook’s Sources sheet. The dataset loader documents 569 cases, and the workbook applies a deterministic stratified train/test split so the class balance stays sensible in both subsets.
Split used in the workbook
| Subset | Total cases | Malignant | Benign |
|---|---|---|---|
| Training | 398 | 148 | 250 |
| Holdout / test | 171 | 64 | 107 |
| All data | 569 | 212 | 357 |
This split is important because it creates a clean separation between:
- model fitting
- model reporting on unseen cases
That is one of the main habits students should learn early: do not rely only on training-set performance when reporting a classifier.
Why this workbook is useful for teaching
This workbook is useful because it turns classifier reporting into a visible worksheet workflow. Instead of hiding the analysis inside one-off output sheets, the workbook keeps the logic visible:
- the raw data
- the train/test split
- the fit handle
- the summary and tests
- the holdout confusion report
- the threshold table
- the calibration table
- the plot-data formulas
That makes it useful both as:
- a practical biomedical example
- a teaching workbook for students learning how classifier reporting should be done
Step 1: Fit the binary GLM on the training set
The sheet GLM_Model fits a binomial logit GLM using four predictors:
mean_radiusmean_texturemean_smoothnessworst_concave_points
The workbook keeps the model settings visible in a control block, including:
- family
- link
- formula text
- alpha
- default threshold
- calibration settings
- iteration settings
Main worksheet functions in the training sheet
The model fit is built around:
=BESH.REGR.GLM_FIT(...)and downstream reporting uses:
=BESH.REGR.GLM_SUMMARY(...)=BESH.REGR.GLM_TESTS(...)=BESH.REGR.GLM_CLASS(...)=BESH.REGR.GLM_THRESH(...)=BESH.REGR.GLM_CALIB(...)=BESH.REGR.GLM_BRIER(...)Training-set model summary
| Statistic | Value |
|---|---|
| Training observations | 398 |
| Log-likelihood | -47.583 |
| Deviance | 95.167 |
| AIC | 105.167 |
| Iterations | 9 |
Training-set coefficient table
| Predictor | Coef | SE | Z | P-value |
|---|---|---|---|---|
| mean_radius | 1.129 | 0.212 | 5.331 | 9.79E-08 |
| mean_texture | 0.436 | 0.088 | 4.937 | 7.93E-07 |
| mean_smoothness | 87.585 | 30.955 | 2.829 | 0.00466 |
| worst_concave_points | 40.677 | 9.211 | 4.416 | 1.01E-05 |
How should a junior researcher read this table?
A good first reading is:
- all four predictors are positively associated with the probability of malignancy
- all four are statistically significant in this fit
- the signs are consistent with what we would expect from more abnormal lesion characteristics
- the exact coefficient magnitudes are harder to compare directly because the predictors are on different numerical scales
That last point matters.
Students often want to rank predictors by raw coefficient size, but for continuous predictors on different scales that can be misleading. The safer interpretation is usually:
- direction of effect
- statistical evidence
- and later, actual predictive performance on new data
That is why the tutorial moves next to holdout reporting.
Step 2: Do not stop at the training set
A model can look excellent on the training data and still report poorly on new cases.
So the next question is:
What happens when we score the holdout set?
The sheet Holdout_Reporting uses the fitted model to generate holdout probabilities with:
=BESH.REGR.GLM_PRED(...)and then applies the generic classifier-report functions:
=BESH.CLASS.CONFUSION(...)=BESH.CLASS.THRESH(...)=BESH.CLASS.CALIB(...)=BESH.CLASS.BRIER(...)This is a very important workbook pattern because it separates:
- model fitting on the training set
from - classifier reporting on the holdout set
That is much closer to how predictive analysis should be taught in practice.
Step 3: Read the holdout confusion matrix first
At the default threshold of 0.50, the holdout results are:
Holdout confusion matrix at threshold 0.50
| Observed \\ Predicted | Benign | Malignant |
|---|---|---|
| Benign | 103 | 4 |
| Malignant | 4 | 60 |
Holdout summary metrics at threshold 0.50
| Metric | Value |
|---|---|
| Sensitivity | 0.9375 |
| Specificity | 0.9626 |
| Precision | 0.9375 |
| Accuracy | 0.9532 |
| Balanced accuracy | 0.9501 |
| F1 score | 0.9375 |
| Brier score | 0.0337 |
How should we interpret this?
A good step-by-step interpretation is:
- Only 8 of 171 holdout cases are misclassified.
- The model identifies malignant cases well:
- sensitivity = 93.75%
- It also identifies benign cases well:
- specificity = 96.26%
- Precision is also strong:
- when the model predicts malignant, it is correct 93.75% of the time at this threshold
- The Brier score is low:
- the predicted probabilities are, on average, close to the true outcomes
This is exactly why confusion-matrix reporting should be reported, it translates the model into concrete classification performance.
Step 4: But is 0.50 really the right threshold?
This is one of the most important teaching points.
Students often assume the threshold must be 0.50, but that is not always the best choice.
The best threshold depends on the practical goal to maximize:
- sensitivity if missing a positive case is costly
- specificity if false alarms are costly
- balanced accuracy if both error types matter similarly
- F1 when precision and recall both matter
The holdout threshold table in the workbook evaluates thresholds from 0.10 to 0.90.
Selected holdout threshold results
| Threshold | Sensitivity | Specificity | Precision | Accuracy | Balanced accuracy | F1 |
|---|---|---|---|---|---|---|
| 0.30 | 0.9531 | 0.9533 | 0.9242 | 0.9532 | 0.9532 | 0.9385 |
| 0.40 | 0.9531 | 0.9626 | 0.9385 | 0.9591 | 0.9579 | 0.9457 |
| 0.50 | 0.9375 | 0.9626 | 0.9375 | 0.9532 | 0.9501 | 0.9375 |
| 0.80 | 0.8438 | 0.9907 | 0.9818 | 0.9357 | 0.9172 | 0.9076 |
| 0.90 | 0.8125 | 1.0000 | 1.0000 | 0.9298 | 0.9063 | 0.8966 |
What does this table teach?
A good reasoning process is at:
- 0.40, the model performs slightly better overall than at 0.50 on this holdout set.
- 0.10, sensitivity is highest, but specificity drops.
- 0.90, specificity and precision become perfect, but sensitivity falls.
So the right threshold depends on the use case.
For example:
- in a screening context, you may prefer a lower threshold and higher sensitivity
- in a confirmatory or costly-follow-up context, you may prefer a higher threshold and higher specificity
Best thresholds on this holdout grid
| Criterion | Best threshold |
|---|---|
| Youden index | 0.40 |
| F1 score | 0.40 |
| Balanced accuracy | 0.40 |
| Maximum sensitivity | 0.10 |
| Maximum specificity | 0.90 |
That is a very useful teaching result, because it shows that threshold choice is not arbitrary.
Step 5: ROC and AUC — how well does the model rank cases?
The workbook also uses the plot-data functions to produce chart-ready ROC output:
=BESH.PLOT.ROC_STATS(...)=BESH.PLOT.ROC_POINTS(...)BESH.PLOT.ROC_POINTS and BESH.PLOT.CALIB_POINTS, and the GLM help explicitly positions ROC/AUC as part of binary model reporting.The ROC figure below shows the holdout ROC curve for the fitted GLM.

ROC result
| Metric | Value |
|---|---|
| Holdout AUC | 0.9931 |
How should we interpret the ROC result?
A simple interpretation is:
- the model has excellent discrimination
- malignant cases tend to receive much higher predicted probabilities than benign cases
- the curve stays close to the upper-left corner
- the AUC is very close to 1.0
This is strong evidence that the model ranks cases well. But ROC alone is not enough. A model can discriminate well and still produce poorly calibrated probabilities. That is why the tutorial moves next to calibration.
Step 6: Calibration — are the probabilities believable?
Calibration asks a different question from ROC.
ROC asks:
Does the model rank cases well?
Calibration asks:
When the model says “about 60% risk,” do about 60% of those cases actually turn out positive?
That makes calibration very important whenever predicted probabilities might be interpreted directly. The workbook uses:
=BESH.CLASS.CALIB(...)=BESH.PLOT.CALIB_POINTS(...)Selected calibration bins from the holdout set
| Bin summary | Mean predicted probability | Observed event rate | N |
|---|---|---|---|
| Very low-risk bin | 0.00003 | 0.000 | 18 |
| Low-risk bin | 0.0124 | 0.000 | 17 |
| Middle bin | 0.0815 | 0.1176 | 17 |
| Higher-risk bin | 0.6122 | 0.6250 | 16 |
| Very high-risk bin | 0.9724 | 1.0000 | 18 |
What does this tell us?
The calibration picture is reassuring:
- very low predicted probabilities correspond to almost no observed events
- very high predicted probabilities correspond to almost all observed events
- the middle-to-high bin with mean predicted probability around 0.61 has an observed event rate of 0.625, which is very close
So in this example the model is not only discriminative, but also reasonably well calibrated on the holdout set.
Step 7: Why the Brier score matters
The Brier score is worth teaching explicitly because it is easy to overlook.
Unlike threshold-based classification metrics, the Brier score evaluates the full probability forecast rather than a single hard classification rule. The UDF help describes it as a threshold-free probability-error measure that combines aspects of calibration and discrimination.
Holdout Brier score
| Metric | Value |
|---|---|
| Brier score | 0.0337 |
A lower Brier score is better.
In this case, a Brier score around 0.034 is consistent with a model that is making very good probability forecasts on the holdout set.
This is why Brier score is useful:
- it reminds students that predicted probabilities matter, not only class labels
- it helps connect discrimination and calibration thinking
A practical flow of thought for real analysis
If you are a student or junior researcher, here is a sensible order to follow in a real binary-classifier project.
1. Start with the split
Ask:
- what is training data?
- what is holdout data?
- are the classes reasonably represented in both?
2. Fit the model
Check:
- which predictors are included?
- do the coefficient signs make scientific sense?
- does the model converge?
3. Move immediately to holdout results
Do not stop at training performance.
4. Read the confusion matrix at a default threshold
This gives a first practical sense of performance.
5. Compare thresholds
Ask:
- what happens if sensitivity matters more?
- what happens if specificity matters more?
- does 0.50 actually make sense?
6. Check ROC / AUC
This tells you how well the model ranks cases overall.
7. Check calibration and Brier
This tells you whether the predicted probabilities themselves are believable and useful. That sequence is exactly why this workbook is structured the way it is.
GUI or UDF workflow?
A natural question for students is:
Should I do this through the GUI, or through worksheet formulas?
The honest answer is:
both are useful, but for different reasons.
When the GUI is helpful
Use the GUI when you want:
- a fast first fit
- an easy way to explore options
- quick output for inspection
- a visual workflow in teaching
GUI drawbacks
The GUI is less ideal when you want:
- a reusable template
- visible control cells
- a formula-driven audit trail
- one workbook that documents the full workflow
When UDFs are helpful
Use the UDF workflow when you want:
- a reproducible workbook
- clear training/holdout separation
- reusable reporting blocks
- chart-ready plot data inside the sheet
- a teaching file where the logic stays visible
UDF drawbacks
The UDF approach is less beginner-friendly at first because:
- the workbook structure needs more planning
- dynamic spills matter
- formulas must be organized carefully
Best approach
For many users, the best learning path is:
- start with the GUI to understand the model
- then move to the UDF workbook
- use the formula-driven workbook as the long-term reporting template
Compatibility note for older Excel versions
This workbook works best in modern versions of Excel that support dynamic arrays and spill ranges.
Older versions of Excel do not support spills. In those versions, users may need to:
- preselect the full target output range
- enter the formula across that range
- confirm it as an array formula
- in some versions, use Ctrl+Shift+Enter
So the same BESH Stat NG logic can still be used in older Excel, but the workbook may require more manual array-formula entry.
Learn more in the documentation
The following BESH Stat NG help pages are good companion links for this tutorial:
Validation and transparency
This workbook is designed to be transparent and reproducible.
It keeps:
- the full dataset
- the train/test split
- the model control block
- the fit handle
- the training-set outputs
- the holdout reporting tables
- the chart-ready plot data
- the source links
That makes it useful not only as a tutorial workbook, but also as a reusable reporting template.
For a broader overview of how BESH Stat NG results are checked, including reference datasets, automated tests, and comparison workflows, see the Validation page.
Summary
This tutorial shows that binary classifier reporting in Excel should not stop at fitting a model and reading the coefficient table.
A good reporting workflow asks:
- how was the model trained?
- how does it perform on holdout data?
- what does the confusion matrix show?
- how does performance change across thresholds?
- how good is the ranking performance?
- are the probabilities calibrated?
In this workbook, the answers are strong:
- the holdout confusion matrix is very good
- the default 0.50 threshold works well
- a threshold of 0.40 performs slightly better overall on this holdout set
- the ROC AUC is 0.9931
- calibration is good
- the Brier score is low
That makes this a strong example for teaching classifier reporting and for building reusable Excel-based reporting templates.
Frequently asked questions
Why is a holdout set important?
Because training-set performance can look better than real-world predictive performance. A holdout set gives a more honest evaluation.
Why is 0.50 not always the best threshold?
Because the best threshold depends on the practical goal. Sometimes sensitivity matters most, sometimes specificity, and sometimes a balanced compromise is preferred.
What does AUC tell me?
AUC tells you how well the model ranks positive cases above negative cases across all possible thresholds.
Why do I need calibration if the ROC curve is excellent?
Because a model can rank cases well but still output poorly calibrated probabilities. ROC and calibration answer different questions.
What does the Brier score add?
It evaluates the full probability forecast, not just threshold-based classification, and is useful when you care about how good the predicted probabilities are.