How BESHStatNG results were validated

BESHStatNG is an open-source statistical add-in for Microsoft Excel. This page explains how its results are checked, what kinds of evidence are available publicly, and what users should expect when comparing outputs with other software.

The goal of this page is transparency, not marketing. It is meant to help users understand how the software is checked, where they can inspect the evidence, and what kinds of differences may still appear across implementations.

BESHStatNG is open-source software published under the Apache License 2.0 and is provided without warranty. That means users are free to inspect the code, documentation, example workbooks, and test materials, but they should still evaluate the software appropriately for their own use case.

Validation at a glance

BESHStatNG results are checked through a combination of documented formulas, tutorial workbooks, external reference comparisons, certified benchmark datasets, automated tests, and public issue tracking. In the current public test suite, this includes a broad automated validation layer together with NIST Statistical Reference Dataset (StRD) benchmarks for linear least-squares regression and one-way balanced ANOVA. Because the software is open-source and published under the Apache License 2.0, users can inspect the code, tests, reference datasets, and documentation directly.

This overview shows what kinds of validation evidence are publicly available, which method families have the strongest numerical benchmark support, and where users should still expect small implementation-level differences across software.

The latest public validation evidence for the current public release is available directly on GitHub alongside the source code and test project.

These artifacts are provided for transparency. They complement the method-coverage matrix below, but they should not be read as a claim that every feature is benchmarked to the same degree. The strongest certified benchmark evidence remains the NIST suites for linear least-squares regression and one-way balanced ANOVA.

What “validated” means here

In BESHStatNG, validation does not mean that every method is guaranteed to match every other statistical package digit-for-digit in every situation.

Instead, validation means that results are checked through a combination of:

  • standard statistical formulas and published method definitions
  • comparison with reference implementations and external software, especially R
  • tutorial workbooks and reproducible example datasets
  • automated tests for selected numerical components and model families
  • explicit documentation of known limitations, assumptions, and expected discrepancies

This approach is intended to make the software inspectable, reproducible, and improvable rather than opaque.

How results are checked

1. Method implementation from documented formulas

Each statistical method is implemented from standard definitions, model equations, or established algorithms. Where appropriate, the method documentation explains the formulas being used, the available options, and how those choices affect the output.

This is especially important for methods where different software packages may use different defaults, conventions, or approximations.

2. Comparison with external reference outputs

For many methods, the documentation includes a section showing how the analysis can be reproduced in R or compared against a reference implementation.

These sections are important because they let users see:

  • how the same dataset can be analyzed outside Excel
  • which defaults or conventions need to be matched
  • what kinds of differences are expected when the underlying method is the same but the implementation details differ

In many cases, the documentation includes R reference code or “Relationship to R / how to reproduce” sections so users can verify the workflow themselves.

3. Tutorial workbooks and reproducible examples

BESHStatNG is accompanied by tutorial pages and downloadable workbooks. These are not only educational materials; they also serve as practical validation artifacts.

Tutorial workbooks help in several ways. They:

  • show the exact input layout used to produce an example
  • allow users to reproduce the same result on their own machine
  • provide a public reference for screenshots and reported outputs
  • help detect regressions when a workflow is changed later

This is particularly valuable for Excel-based software, where practical workflow reproducibility matters as much as the underlying formulas.

4. Automated unit tests and regression-style checks

The project also includes an automated test project with dedicated test modules, reference datasets, and scripted comparisons for a growing part of the code base.

The current automated test project includes:

  • dedicated test modules across core numerical/statistical components
  • model-family tests for areas such as linear models, generalized linear models, GEE, ordinal and multinomial models, survival analysis, ROC analysis, zero-inflated models, clustering, PCA, factor analysis, correspondence analysis, discriminant analysis, and UDFs
  • reference CSV datasets used to verify outputs
  • a separate R_referenceScripts folder used for external cross-checking of selected workflows

This is not yet the same as claiming full automated coverage of every feature, but it does provide a growing body of machine-checkable evidence.
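A regression-style check against a reference file can be sketched as follows. This is a minimal Python illustration, not the project's actual test code: the CSV layout, the statistic names, the tolerance, and the `compute_statistics` stand-in are all assumptions made for the example.

```python
import csv
import io
import math
import statistics

# Hypothetical reference content; a real test would read a CSV file from disk.
REFERENCE_CSV = """statistic,expected
mean,5.0
sample_sd,2.1380899353
"""

def load_reference(text):
    """Parse 'statistic,expected' rows into a dict of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["statistic"]: float(row["expected"]) for row in reader}

def compute_statistics(values):
    """Stand-in for the routine under test (assumption for this sketch)."""
    return {
        "mean": statistics.fmean(values),
        "sample_sd": statistics.stdev(values),
    }

def check_against_reference(computed, reference, rel_tol=1e-9):
    """Return the names of statistics that differ beyond the tolerance."""
    return [
        name
        for name, expected in reference.items()
        if not math.isclose(computed[name], expected, rel_tol=rel_tol)
    ]

data = [2, 4, 4, 4, 5, 5, 7, 9]
reference = load_reference(REFERENCE_CSV)
mismatches = check_against_reference(compute_statistics(data), reference)
print("mismatches:", mismatches)
```

Keeping the expected values in a separate file means the same numbers can be regenerated from an external script and compared without touching the test logic.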

5. Input validation and edge-case behavior

Validation is not only about matching final numbers. It also includes checking how the software behaves when inputs are incomplete, malformed, or numerically difficult.

Examples include:

  • missing values
  • degenerate inputs such as constant or singular variables
  • incompatible options
  • convergence limits
  • bootstrap reproducibility
  • split reproducibility for validation workflows
  • invalid worksheet ranges or UDF input shapes

Good statistical software should fail clearly when inputs are not appropriate, and this is treated as part of validation too.
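The idea of failing clearly can be illustrated with a small Python sketch. The function name, the error class, and the specific rules below are hypothetical and chosen only for illustration; they do not describe how BESHStatNG itself validates input.

```python
import math

class InputValidationError(ValueError):
    """Raised when inputs are unsuitable for the requested analysis."""

def validate_predictor(name, values, min_n=3):
    """Check one numeric column and raise a descriptive error on failure."""
    if len(values) < min_n:
        raise InputValidationError(
            f"{name}: need at least {min_n} observations, got {len(values)}"
        )
    for i, v in enumerate(values):
        # Treat both None and NaN as missing values.
        if v is None or (isinstance(v, float) and math.isnan(v)):
            raise InputValidationError(f"{name}: missing value at row {i + 1}")
    if max(values) == min(values):
        raise InputValidationError(
            f"{name}: column is constant, so its effect cannot be estimated"
        )
    return True

columns = [
    ("dose", [1.0, 2.0, 3.5]),
    ("flag", [1, 1, 1]),
    ("age", [30.0, float("nan"), 41.0]),
]
for label, column in columns:
    try:
        validate_predictor(label, column)
        print(label, "ok")
    except InputValidationError as err:
        print("rejected:", err)
```

A test for this kind of behavior asserts on the error itself, so a silent fallback or a misleading numeric result counts as a failure.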

Public validation evidence currently available

Documentation with reproducible reference sections

Many method pages in the online help include sections such as:

  • R code (reference)
  • Relationship to R (how to reproduce)
  • R code to reproduce the analysis
  • Reference R code

These sections help users inspect how the same example can be reproduced in another environment.

For some methods, the documentation also includes explicit notes about:

  • expected discrepancies
  • limitations
  • implementation-specific defaults
  • sources of small numerical differences

This is important because a trustworthy validation story should explain not only where results agree, but also where they may differ and why.

Tutorial pages and downloadable workbooks

The tutorials section provides worked examples with downloadable files and practical workflows. These serve as public, user-facing validation material because they make the steps, data structure, and expected outputs visible.

Tutorials are especially useful for:

  • survival analysis workflows
  • multivariate analysis workflows
  • regression and UDF workflows
  • sample-size examples
  • workbook-based reproducibility

Automated tests and reference data

Current automated validation snapshot

The current automated test suite provides a method-focused validation layer rather than a simple pass/fail claim for the add-in as a whole. At a high level, the public test project currently includes:

  • a broad automated suite across numerical utilities, statistical helpers, model families, worksheet/UDF workflows, and edge-case behavior
  • NIST benchmark coverage for linear least-squares regression
  • NIST benchmark coverage for one-way balanced ANOVA
  • reference CSV datasets and reproducible scripted comparisons for selected workflows

For methods that fall within the NIST benchmark scope, these certified benchmark datasets are the strongest public numerical evidence currently available in the project. For methods outside that scope, validation is based on a combination of analytical checks, reference-output comparisons, workflow tests, and edge-case testing.

NIST benchmark coverage

BESHStatNG includes public automated checks against NIST Statistical Reference Datasets (StRD) for:

  • linear least-squares regression: 11 datasets with certified values
  • one-way balanced ANOVA: 11 datasets with certified values

These benchmarks are especially valuable because NIST designed them specifically to assess the numerical accuracy of statistical software, including difficult datasets that stress cancellation, accumulation, and ill-conditioning. Because the linear-model solver is also reused by downstream model families in parts of the codebase, the NIST linear-regression benchmarks provide not only direct evidence for LinearModel, but also added confidence in higher-level models that depend on the same regression core (GLM, GLM_NB, Zero-Inflated Poisson).
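A common way to summarize agreement with a certified value is the log relative error (LRE), which roughly counts correct significant digits. The Python sketch below shows the measure and why cancellation matters. The data are a small synthetic example, not a NIST dataset, and the 15-digit cap and function names are assumptions; the source does not state which acceptance criterion the project's tests use.

```python
import math

def log_relative_error(computed, certified, max_digits=15.0):
    """Approximate number of correct significant digits, capped at max_digits."""
    if computed == certified:
        return max_digits
    if certified == 0:
        # Fall back to the log absolute error when the certified value is zero.
        return min(max_digits, -math.log10(abs(computed)))
    return min(max_digits, -math.log10(abs(computed - certified) / abs(certified)))

def variance_one_pass(xs):
    """Textbook sum-of-squares shortcut; prone to catastrophic cancellation."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def variance_two_pass(xs):
    """Subtract the mean first, then sum the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Synthetic data with a large offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
exact = 30.0

lre_one_pass = log_relative_error(variance_one_pass(data), exact)
lre_two_pass = log_relative_error(variance_two_pass(data), exact)
print("one-pass LRE:", lre_one_pass)
print("two-pass LRE:", lre_two_pass)
```

Both functions implement the same formula on paper, yet in double precision the one-pass version loses essentially all significant digits on this input while the two-pass version recovers the exact value. Benchmarks with certified values are designed to expose this kind of gap.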

Open source code and issue tracking

Because the software is open-source, users can inspect:

  • source code
  • test code
  • reference datasets
  • documentation sources
  • issue history and bug reports

This makes it easier to track how discrepancies are investigated and corrected over time.

Inspect the evidence directly

Users who want to inspect the validation materials directly can find them in the project's public GitHub repository, alongside the source code and test project.

The .trx file is the native Visual Studio / VSTest test-results artifact produced by the automated test project. It provides a machine-readable record of the public test run and should be read together with the source code, test modules, reference datasets, and method-coverage notes on this page.
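Because a .trx file is XML, it can be summarized with a few lines of script. The Python sketch below assumes the usual TRX layout (a `TestRun` root in the Visual Studio TeamTest 2010 namespace containing `UnitTestResult` elements with `testName` and `outcome` attributes). The sample content and test names are hypothetical and do not come from the project's actual results file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by TRX files (assumed standard VSTest layout).
TRX_NS = {"t": "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}

# Hypothetical, heavily trimmed sample; a real file holds many more details.
SAMPLE_TRX = """<TestRun xmlns="http://microsoft.com/schemas/VisualStudio/TeamTest/2010">
  <Results>
    <UnitTestResult testName="Hypothetical_Test_A" outcome="Passed" />
    <UnitTestResult testName="Hypothetical_Test_B" outcome="Passed" />
    <UnitTestResult testName="Hypothetical_Test_C" outcome="Failed" />
  </Results>
</TestRun>
"""

def summarize_trx(xml_text):
    """Return (outcome counts, names of tests that did not pass)."""
    root = ET.fromstring(xml_text)
    results = root.findall(".//t:UnitTestResult", TRX_NS)
    outcomes = Counter(r.get("outcome", "Unknown") for r in results)
    not_passed = [r.get("testName") for r in results if r.get("outcome") != "Passed"]
    return outcomes, not_passed

outcomes, not_passed = summarize_trx(SAMPLE_TRX)
print(dict(outcomes))
print("not passed:", not_passed)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(...)`; the rest of the function stays the same.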

Validation evidence by analysis family

Method-coverage matrix

The table below is intended as a method-evidence matrix rather than a code-coverage report. It summarizes which analysis families are supported by certified benchmarks, by external or analytical reference checks, and by workflow/UDF tests.

Method family | Highest-value public evidence | What that means in practice
--- | --- | ---
Linear least-squares core | NIST linear regression benchmark suite | This is the strongest public numerical evidence for the core least-squares path used by the linear model implementation.
One-way balanced ANOVA | NIST ANOVA benchmark suite | This provides certified external benchmark coverage for the core one-way balanced ANOVA calculation.
Generalized and count regression families | Dedicated model-family tests plus shared dependence on validated numerical core routines | Methods such as GLM, negative binomial, zero-inflated Poisson, and related workflows are checked through dedicated automated tests. Where higher-level models reuse the same least-squares or shared numerical machinery internally, the linear-regression benchmark layer also adds confidence in those underlying components.
Survival and Cox models | Dedicated reference-output and workflow tests | Validation focuses on matched-settings comparisons, residual checks, baseline-function outputs, and worksheet-facing workflows.
Classical parametric and nonparametric methods | Reference calculations, analytical checks, and dedicated unit tests | These methods are generally easier to cross-check directly because their formulas are well established, although defaults and interval conventions can still differ across packages.
Multivariate methods | Reference-output tests, invariants, and workflow checks | PCA, factor analysis, clustering, correspondence analysis, and discriminant analysis are supported by growing automated coverage plus user-facing examples and documentation.
Excel UDF and reporting layer | Dedicated worksheet/UDF lifecycle and output-shape tests | This checks not only numerical correctness, but also whether worksheet-facing functions return the expected tables, selectors, handles, and chart-ready outputs.

In short, the NIST benchmark suites should be read as the highest-value validation layer for the method families they directly cover, while the broader automated suite provides additional evidence across the rest of the add-in.

Core numerical and utility code

Core numerical routines and shared statistical helpers are validated through dedicated test modules and reference calculations. These routines matter because they are reused across many higher-level analyses.

Classical parametric and nonparametric methods

Many standard procedures are documented with formula descriptions and reference comparisons. These are usually easier to compare across software because the underlying methods are well established, although small differences can still arise from rounding, continuity corrections, tie handling, or confidence-interval conventions.
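The effect of a continuity correction is easy to see on a 2x2 chi-square test. The Python sketch below computes the Pearson statistic with and without the Yates correction for an illustrative table. It is a hand-written example, not BESHStatNG code, and the counts are made up for the demonstration.

```python
def chi_square_2x2(table, continuity_correction):
    """Pearson chi-square statistic for a 2x2 table, with optional Yates correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            deviation = abs(observed - expected)
            if continuity_correction:
                # Yates: shrink each absolute deviation by 0.5, never below zero.
                deviation = max(0.0, deviation - 0.5)
            statistic += deviation ** 2 / expected
    return statistic

table = [[10, 20], [20, 10]]
uncorrected = chi_square_2x2(table, continuity_correction=False)
corrected = chi_square_2x2(table, continuity_correction=True)
print(round(uncorrected, 4))  # about 6.6667
print(round(corrected, 4))    # about 5.4
```

Both numbers are correct results for the same named test. R's `chisq.test` applies the correction to 2x2 tables by default (`correct = TRUE`), so a comparison is only meaningful once this option is set the same way on both sides.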

Regression and generalized models

Regression families are among the areas with the strongest automated validation evidence in the project. The linear least-squares core is benchmarked against certified NIST linear-regression datasets, including numerically difficult cases that are widely used to stress software accuracy. This gives particularly strong public evidence for the core linear-model implementation.

Higher-level regression families such as generalized linear models, negative binomial models, zero-inflated count models, and related workflows are additionally covered by dedicated automated tests. These models still depend on family-specific likelihoods, links, optimizers, offsets, starting values, and convergence behavior, so they should not be presented as “validated by NIST” in the same direct sense. However, the benchmarked linear-regression layer does strengthen confidence in shared lower-level numerical components used across the regression stack.

This is why the regression validation story in BESHStatNG should be read in layers: certified benchmark evidence for the least-squares core, plus dedicated automated model-family tests for the broader regression ecosystem.

Survival analysis

Survival workflows are supported by both tutorial materials and dedicated tests. These methods may still show small differences across software when weighting conventions, tie handling, or default options differ, so validation is best understood as comparison under matched settings rather than blind expectation of identical output.

Multivariate analysis

The newer multivariate components are supported by:

  • tutorial-style documentation
  • example datasets
  • explicit output descriptions
  • growing test coverage for PCA, factor analysis, clustering, correspondence analysis, and discriminant analysis

For some multivariate methods, matching another software package exactly may require matching preprocessing, priors, starting values, normalization choices, or axis/sign conventions.
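Sign conventions are a typical case: a principal axis and its negation describe the same direction, so two packages can report loadings that differ only in sign. A minimal Python sketch of a sign-aware comparison follows. The function names, tolerance, and loading values are illustrative assumptions.

```python
import math

def align_sign(vector, reference):
    """Flip `vector` if it points opposite to `reference` (negative dot product)."""
    dot = sum(v * r for v, r in zip(vector, reference))
    return [-v for v in vector] if dot < 0 else list(vector)

def loadings_match(vector, reference, abs_tol=1e-8):
    """Compare two loading vectors after removing the arbitrary overall sign."""
    aligned = align_sign(vector, reference)
    return all(
        math.isclose(a, r, abs_tol=abs_tol) for a, r in zip(aligned, reference)
    )

ours = [0.6, -0.8]      # illustrative loadings from one package
theirs = [-0.6, 0.8]    # same axis reported with the opposite sign
print(loadings_match(ours, theirs))        # True: same direction up to sign
print(loadings_match([0.6, 0.8], theirs))  # False: a genuinely different axis
```

Only the overall sign of a vector is treated as arbitrary here; a mismatch in individual components still counts as a real difference.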

UDF workflows

Excel UDFs are also validated through:

  • direct formula testing
  • workbook examples
  • comparison with GUI workflows where relevant
  • chart-ready output checks for plot-data functions such as ROC, Kaplan–Meier, and histogram tables

This matters because worksheet functions must be correct both numerically and structurally.
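A structural check on a chart-ready table might look like the Python sketch below. The column names `FPR` and `TPR`, the table layout, and the rules are hypothetical; the actual output shapes of the BESHStatNG plot-data functions are described in their own documentation.

```python
def check_roc_table(table):
    """Return a list of structural problems found in a header-plus-rows table."""
    problems = []
    header, rows = table[0], table[1:]
    if header != ["FPR", "TPR"]:
        problems.append(f"unexpected header: {header}")
    if any(len(row) != len(header) for row in rows):
        # Stop early: later checks assume rectangular data.
        problems.append("ragged rows")
        return problems
    if any(not 0.0 <= value <= 1.0 for row in rows for value in row):
        problems.append("rate outside [0, 1]")
    fprs = [row[0] for row in rows]
    if fprs != sorted(fprs):
        problems.append("FPR column is not sorted ascending")
    return problems

good = [["FPR", "TPR"], [0.0, 0.0], [0.25, 0.75], [1.0, 1.0]]
bad = [["FPR", "TPR"], [0.5, 0.9], [0.2, 1.2]]
print(check_roc_table(good))
print(check_roc_table(bad))
```

Checks of this kind say nothing about whether the rates are statistically right; they only confirm that the table can be charted without manual cleanup.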

Why small differences can still occur

Even when two programs implement the same named method, small numerical or presentational differences can still occur.

Common reasons include:

  • default settings differences
  • different factor/reference coding rules
  • different missing-value handling
  • tie handling differences
  • choice of continuity corrections
  • optimizer tolerances or convergence criteria
  • bootstrap seeds and replicate counts
  • split-generation rules for validation workflows
  • matrix decomposition details
  • display rounding versus stored precision

For that reason, validation should usually be interpreted as:

  • agreement in method and workflow,
  • agreement in the main conclusions,
  • and close numerical agreement when the same settings are matched,

rather than assuming that every output will always be byte-for-byte identical across platforms.
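The difference between identical, numerically close, and identical only as displayed can be made concrete with a short Python sketch. The tolerance, the number of display digits, and the example values are arbitrary choices for illustration.

```python
import math

def compare_outputs(ours, theirs, rel_tol=1e-8, display_digits=4):
    """Classify agreement between two reported values at three levels."""
    return {
        "identical": ours == theirs,
        "close": math.isclose(ours, theirs, rel_tol=rel_tol),
        "same_when_displayed": round(ours, display_digits) == round(theirs, display_digits),
    }

# Tiny difference, e.g. from optimizer tolerances: close but not identical.
print(compare_outputs(0.0312345678, 0.0312345679))

# Larger difference hidden by display rounding: looks equal, is not close.
print(compare_outputs(0.03121, 0.03124))
```

The second case shows why comparisons should use stored values rather than rounded screen output: two results can look the same at four decimals and still differ by far more than a numerical tolerance would allow.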

What users can inspect themselves

Users who want to verify results independently can do the following:

  • read the method documentation and formula explanations
  • use the R reference sections where available
  • download tutorial workbooks and reproduce the examples
  • inspect the automated test project and reference datasets
  • inspect the GitHub source code
  • compare GUI workflows with UDF workflows
  • report suspected discrepancies through GitHub Issues

This is one of the advantages of developing the project openly: validation evidence is not limited to internal claims.

Current limitations

BESHStatNG is actively developed, and validation depth is not yet identical for every method or every workflow.

In particular:

  • some methods have more extensive automated coverage than others
  • some method pages have richer external reference material than others
  • newly added methods may initially rely more on example-based and regression-style validation before broader coverage is added
  • some tutorial families are more complete than others
  • exact cross-software matching may depend on carefully aligning defaults and options

This page should therefore be read as a description of the current validation approach and evidence, not as a claim that validation is finished once and for all. The method-evidence matrix above is also not the same thing as full code coverage: some families currently have certified benchmark support, while others are validated mainly through reference-output tests, analytical checks, workflow checks, and edge-case behavior.

Open-source license and warranty

BESHStatNG is published as open-source software under the Apache License 2.0.

That means the source code can be inspected, reused, and redistributed under the terms of that license. It also means the software is provided on an “as is” basis, without warranties or conditions of any kind, including implied warranties of merchantability, fitness for a particular purpose, or noninfringement.

In practice, that means:

  • the software is developed transparently
  • validation materials are made public where possible
  • users are encouraged to verify results for their own context
  • bug reports and discrepancies are welcome and help improve the project

Help improve validation

If you think a result is incorrect, unclear, or unexpectedly different from another package, please report it.

A good report should include:

  • the BESHStatNG version
  • Excel version and Office bitness
  • Windows version
  • the exact steps to reproduce
  • the dataset or a simplified workbook if possible
  • the expected result
  • the actual result
  • the external reference used for comparison, if any

This helps turn isolated observations into reproducible fixes and better documentation.