Propensity Score Matching in Excel with BESH Stat NG

This tutorial shows how to run a propensity score matching (PSM) analysis in Microsoft Excel using BESH Stat NG. The example uses a RAND Health Insurance Experiment subset and compares a cost-sharing insurance plan group with a no-cost-sharing group on the number of outpatient visits to a medical doctor.

The goal is not simply to fit a good treatment-assignment model. The goal is to create a more credible treated-versus-control comparison by improving balance in measured pre-treatment covariates, then to inspect whether the adjusted comparison is believable enough to interpret.

What is propensity score matching?

Propensity score methods are used in observational data when treatment assignment was not randomized. The propensity score is the estimated probability of receiving treatment given measured baseline covariates. Subjects with similar propensity scores have similar measured covariate profiles, at least with respect to the variables included in the score model.

In a matching workflow, treated subjects are matched to control subjects with similar propensity scores or covariate profiles. The matched sample is then used to estimate a treatment effect, such as the average treatment effect among the treated (ATT).

Important limitation: propensity-score methods adjust only for measured pre-treatment covariates. They do not remove bias due to unmeasured confounding, post-treatment variables, poor timing, or lack of overlap.

Example data

download example workbook

The input data contain 320 observations: 160 treated and 160 controls. In this tutorial, treatment is defined as 1 = cost-sharing plan and 0 = no-cost-sharing plan. The outcome is outcome_mdvis, the number of outpatient visits to a medical doctor. The propensity score model uses seven pre-treatment covariates: lpi, idp, physlm, disea, hlthg, hlthf, and hlthp.

VariableDescription
idSynthetic row ID used for output labels and audit tables.
treatmentTreatment indicator: 1 = cost-sharing plan, 0 = no-cost-sharing plan.
outcome_mdvisNumber of outpatient visits to a medical doctor.
lpiLog of annual participation incentive payment.
idpIndicator for individual deductible plan.
physlmIndicator for physical limitation.
diseaNumber of chronic diseases.
hlthg / hlthf / hlthpSelf-rated health indicators for good, fair, and poor health; excellent health is the omitted category.
pscorePreviously estimated propensity score column, useful for supplied-score sensitivity and external comparisons.
Variables used in the RAND HIE propensity-score example.

Open the PSM dialog

On the Excel ribbon, go to BESH Stat NG → Analyse → Causal Inference → Propensity Score Matching.

Excel ribbon showing BESH Stat NG Analyse menu with Causal Inference and Propensity Score Matching selected.
Open the PSM dialog from BESH Stat NG → Analyse → Causal Inference → Propensity Score Matching.

GUI workflow

1. Select the data

Use the Data tab to assign worksheet columns to analysis roles. Treatment, outcome, and covariates are required. ID is optional but recommended because it makes matched-pair and audit tables easier to read.

PSM Data tab showing treatment, outcome, ID, covariates, and active worksheet selections for the RAND HIE example.
Data tab with treatment, outcome, ID, and covariate columns selected from the Input_Data sheet.
  1. Set Active Worksheet to Input_Data.
  2. Move treatment to Treatment (0/1).
  3. outcome_mdvis to Outcome.
  4. id to ID.
  5. Leave Supplied score empty for the main logistic-score tutorial run.
  6. Move lpi, idp, physlm, disea, hlthg, hlthf, and hlthp to Covariates.

2. Specify the propensity model

In the Propensity model tab, keep the tutorial model simple: main effects plus an intercept. This makes the example easy to reproduce and keeps the model auditable.

PSM Propensity model tab showing selected main-effect covariates and intercept for the logistic propensity-score model.
Propensity model tab using main-effect covariates and an intercept.

Do not include the outcome in the propensity-score model. Only include variables measured before treatment assignment that could affect both treatment and outcome.

3. Choose matching and scoring options

The attached result workbook was created with the default PSM options shown below: logistic-regression score estimation, ATT estimand, 1:1 nearest-neighbor matching, propensity-score distance, no replacement, no caliper, no common-support restriction, and standardized covariates for the propensity model.

PSM Options tab showing logistic score method, nearest-neighbor ATT matching, propensity-score distance, ratio 1, no replacement, and no caliper.
Default options used in this tutorial: logistic score, ATT, 1:1 nearest-neighbor matching, no replacement, no caliper.

 

OptionValue
Run methodStandardNearestNeighbor
Score methodLogisticRegression
EstimandATT
Distance metricPropensityScore
Matching ratio1
With replacementNo
Caliper scaleNone
Matching orderPropensityDescending
Common supportNone
Standardize covariatesYes
Logistic ridge penalty1e-07
SMD threshold0.1000
Normalize weights to sample sizeYes
Include doubly robust AIPWYes
Include overlap diagnosticsYes
Include weight diagnosticsYes
Include Love-plot rowsYes
Overlap bin count20
Love-plot threshold0.1000
Extreme weight cutoff10
Key PSM options used for the tutorial run.

4. Select diagnostics and outputs

For teaching and applied analysis, keep the diagnostic outputs enabled. The matched-pair table provides auditability; balance diagnostics show whether adjustment helped; overlap and weight diagnostics show whether the analysis depends on extreme extrapolation; and Love-plot output makes balance easier to inspect visually.

PSM Diagnostics and Outputs tab with AIPW estimate, overlap diagnostics, weight diagnostics, Love-plot data, matched-pair table, and diagnostics tables enabled.
Diagnostics and output settings used to write matched-pair, balance, overlap, weight, AIPW, and Love-plot data.
  • Include doubly robust AIPW estimate: useful as a model-assisted sensitivity estimate.
  • Overlap diagnostics: useful for detecting weak treated/control overlap.
  • Weight diagnostics: important when interpreting weighting and AIPW results.
  • Love-plot data: recommended for every PSM tutorial and report.
  • Write matched-pair table: recommended for matching methods so users can audit each match.
  • Write diagnostics tables: recommended for checking balance, overlap, weights, and sensitivity output.

Results from the tutorial run

Download full results excel workbook generate by BESHstatNG.

Run summary

ItemValue
Run MethodStandardNearestNeighbor
Score MethodLogisticRegression
Score Model ConvergedYes
Score Model Iterations4
Total Rows320
Treated Rows160
Control Rows160
Matched Sets160
Dropped by Common Support0
Dropped by Trimming0
Warnings0
Run summary extracted from the BESH Stat NG result workbook.

Sample-size summary

MetricValue
Total rows320
Treated rows160
Control rows160
Eligible treated rows160
Eligible control rows160
Matched treated rows160
Matched control rows160
Matched sets160
Unmatched treated rows0
Unmatched control rows0
Dropped by common support0
Dropped by trimming0
Sample-size summary from the PSM output.

Treatment-effect estimates

The matched-pair ATT estimate is negative, meaning that treated subjects had fewer MD visits on average than their matched controls. However, the matched-pair 95% confidence interval crosses zero. In this default run, the matched comparison alone does not provide strong evidence of a non-zero effect.

MethodEstimandEstimateStd. ErrorLower 95%Upper 95%Treated MeanControl MeanEff. Treated NEff. Control N
Matched mean differenceATT-0.58750.5358-1.6380.46262.9383.525160160
Propensity-score weightingATT-2.3570.9893-4.296-0.41782.9385.29416065.71
Effect estimates extracted from the BESH Stat NG result workbook.

The weighting and AIPW estimates are more negative and their confidence intervals do not cross zero. These estimates should be interpreted cautiously because the diagnostics show residual imbalance and limited effective sample size among weighted controls.

Effect sensitivity summary

MethodEstimandEstimateStd. Errorzp-valueLower 95%Upper 95%Crosses zeroWarning
Matched mean differenceATT-0.58750.5358-1.0970.2728-1.6380.4626Yes 
Propensity-score weightingATT-2.3570.9893-2.3820.0172-4.296-0.4178No 
Doubly robust AIPWATT-2.0421.009-2.0240.0429-4.020-0.0651No 
Sensitivity summary for matched, weighting, and AIPW estimates.

A useful teaching point is that matching, weighting, and AIPW do not have to produce identical estimates. Differences between them are often a signal to inspect overlap, balance, covariate specification, and target estimand rather than to choose the most favorable estimate.

Propensity-score model

TermEstimateStd. ErrorMethod
Intercept-0.05270.1315LogisticRegression
lpi1.0560.1529LogisticRegression
idp-0.85350.1378LogisticRegression
physlm-0.15650.1517LogisticRegression
disea0.36340.1541LogisticRegression
hlthg0.11460.1371LogisticRegression
hlthf0.01050.1434LogisticRegression
hlthp0.09650.2018LogisticRegression
Logistic propensity-score model coefficients. Coefficients are from the standardized-covariate score model used by the GUI run.

Balance diagnostics and Love plot

Balance diagnostics are the most important part of a propensity-score analysis. The standardized mean difference (SMD) compares covariate means between treated and control groups on a standardized scale. A common rule of thumb is to review covariates with absolute SMD greater than 0.1.

Love plot showing absolute standardized mean differences before matching, after matching, and after weighting with a 0.1 threshold line.
Love plot for the RAND HIE example. Several covariates remain above the 0.1 threshold, so balance should be reviewed before interpreting effects.

 

Plot RowVariable|SMD| before|SMD| after matching|SMD| after weightingThresholdFlag
1lpi0.79080.79080.24180.1000Review
2idp0.60110.60110.01400.1000Review
3disea0.16390.16390.13310.1000Review
4hlthg0.07840.07840.10130.1000Review
5hlthp000.09650.1000OK
6physlm0.07890.07890.08330.1000OK
7hlthf0.07240.07240.05330.1000OK
Love-plot data extracted from the BESH Stat NG output.

In this tutorial run, lpi, idp, and disea remain above the 0.1 SMD threshold after matching. The reason is visible in the run summary: there are 160 treated rows and 160 control rows, and 1:1 matching without replacement uses all controls. Because all rows are retained, the matched sample has the same covariate means as the original sample. This is a useful diagnostic lesson: a completed match is not automatically a balanced match.

Weight and overlap diagnostics

SampleGroupNNon-zero NSum WMean WMin WMax WCVESSESS/NExtreme W NFlag
StandardNearestNeighborAll320320320.01.0000.01648.2430.8484186.30.58220OK
StandardNearestNeighborTreated160160160111016010OK
StandardNearestNeighborControl160160160.01.00000.01648.2431.20265.710.41070Low ESS
Weight diagnostics from the BESH Stat NG output.

The treated group has an effective sample size of 160, while the weighted control effective sample size is about 65.7. This indicates that the weighted comparison relies on unequal control weights. That does not invalidate the analysis, but it means the weighting results should be interpreted together with balance and overlap diagnostics.

GroupNMinQ1MedianQ3MaxMeanSDBelow overlapAbove overlapExtreme PSFlag
Treated1600.05510.57550.72690.77810.95110.63690.2135041Outside support,Extreme PS
Control1600.01520.16850.25610.59010.88590.36310.2342606Outside support,Extreme PS
Propensity-score overlap summary.

The overlap summary flags observations outside support and extreme propensity scores in both groups. This supports a cautious interpretation and motivates sensitivity analyses with calipers, common-support restrictions, or overlap-focused estimands.

Matched pairs

The full BESH Stat NG output contains all 160 matched sets. The first 10 matched pairs are shown below as an audit excerpt.

SetTreated RowTreated IDControl RowControl IDDistancePS DistanceMahalanobisExact Group
1216R0216144R01440.06520.0652#N/A 
2296R0296275R02750.11110.1111#N/A 
3210R021059R00590.12780.1278#N/A 
4237R023785R00850.12020.1202#N/A 
5117R0117121R01210.10170.1017#N/A 
6290R029096R00960.10180.1018#N/A 
7220R0220218R02180.09740.0974#N/A 
8300R030094R00940.09410.0941#N/A 
9129R0129224R02240.07660.0766#N/A 
10263R0263307R03070.08720.0872#N/A 
First 10 matched pairs from the BESH Stat NG matched-pair table.

Doubly robust AIPW estimate

MethodEstimandEstimateStd. ErrorLower 95%Upper 95%Treated meanControl meanTreated NControl N
Doubly robust AIPWATT-2.0421.009-4.020-0.06512.9383.525160160
Doubly robust AIPW estimate from the diagnostics output.

The AIPW result is useful as a sensitivity estimate because it combines propensity-score adjustment with an outcome-model component. It should not be treated as a substitute for balance checking. If covariate balance or overlap is poor, the AIPW estimate can still depend heavily on modeling assumptions.

Rosenbaum-style matched-pair sensitivity

MetricValue
MetricValue
AlternativeTwoSided
Informative pairs130
Positive differences51
Negative differences79
Tied differences30
Mean difference-0.5875
Median difference0
Matched-pair sensitivity summary.

The Rosenbaum-style sensitivity section provides additional information for matched-pair analyses. It is most useful when the matched comparison is the primary estimand and the matched pairs show acceptable balance.

UDF workflow

The same analysis can be driven from worksheet formulas. This is useful for templates, reproducible teaching examples, and dashboards where users want to refresh outputs after changing inputs or options.

First fit the propensity-score analysis and store the returned handle in a cell, for example A1 on a worksheet named UDF_Workflow.

=BESH.PS.FIT(Input_Data!A2:A321, Input_Data!B2:B321, Input_Data!C2:C321,
             Input_Data!D2:J321, Input_Data!D1:J1,
             "matching", "ATT", "logit", , ,
             "lpi + idp + physlm + disea + hlthg + hlthf + hlthp",
             "ratio=1; replacement=false; distance=ps; order=descending; ridge=1e-7; maxIter=100; tol=1e-7",
             "smd=0.1; overlapBins=20; lovePlot=true")

Then use the handle to return output tables:

=BESH.PS.SUMMARY(A1)
=BESH.PSM.MATCHES(A1)
=BESH.PS.SCORES(A1)
=BESH.PS.WEIGHTS(A1)
=BESH.PS.BALANCE(A1)
=BESH.PS.EFFECT(A1)
=BESH.PS.LOVEPLOT_DATA(A1)
=BESH.PS.CLEANUP(A1)

For an exact benchmark using an already-computed propensity score column, use scoreMethod = supplied and pass the existing score column:

=BESH.PS.FIT(Input_Data!A2:A321, Input_Data!B2:B321, Input_Data!C2:C321,
             Input_Data!D2:J321, Input_Data!D1:J1,
             "matching", "ATT", "supplied", Input_Data!K2:K321, ,
             "",
             "ratio=1; replacement=false; distance=ps; order=descending",
             "smd=0.1; overlapBins=20; lovePlot=true")

When to use each PSM option

OptionWhen to use itWhy it matters
Logistic score methodDefault for most analyses when no existing propensity score is available.Fits the treatment model directly from selected pre-treatment covariates.
Supplied scoreUse for validation, external benchmark comparisons, or when scores were estimated elsewhere.Makes matching deterministic relative to the supplied score column.
ATTUse when the question is the effect among treated subjects.This is the most common target for nearest-neighbor matching.
ATEUse when the target is the full eligible sample.Usually more natural for weighting than simple matching.
ATOUse when overlap is limited and the clinically relevant target is the region of equipoise.Can improve stability when extreme scores are present.
Without replacementUse for a simple auditable matched-pair design.Each control appears at most once, but balance may be worse.
With replacementUse when there are few good controls for high-score treated subjects.Can improve match quality but may reduce effective control sample size.
CaliperUse when poor matches are possible.Prevents distant matches, but may leave treated or control rows unmatched.
Common support / trimmingUse when score overlap is poor.Removes observations where comparison requires extrapolation.
Exact groupsUse for variables that must not be crossed, such as site, sex, country, or risk stratum.Protects the design but can reduce the number of possible matches.
Polynomial / interaction termsUse when diagnostics show residual imbalance after the main-effect score model.Improves score-model flexibility, but should be guided by balance, not outcome fishing.
Practical guide to common PSM options.

How to report the example

A compact report might read as follows:

We fitted a logistic propensity-score model for cost-sharing plan assignment using lpi, idp, physlm, disea, and health-status indicators. The default 1:1 nearest-neighbor ATT analysis matched all 160 treated observations to 160 controls. The matched mean difference in MD visits was -0.588 (95% CI -1.638 to 0.463), so the matched-pair estimate was not statistically distinguishable from zero. However, balance diagnostics showed residual imbalance for lpi, idp, and disea, and overlap diagnostics flagged observations outside support. Weighted and AIPW sensitivity estimates were more negative, but they should be interpreted cautiously because residual imbalance and control effective sample size indicate that the adjusted comparison remains model-dependent.

Sensitivity analyses with alternative matching options

Download results workbooks: psm_results2.xlsx, psm_results3.xlsx

The default run is useful, but it also shows a common pitfall: a completed 1:1 match is not automatically a well-balanced matched design. Because the default run matches all 160 treated observations to all 160 controls, the after-matching covariate means are the same as the original treated/control comparison. For that reason, is good to include at least one sensitivity analysis that changes the design.

The two additional runs below use the same RAND HIE input data and the same logistic propensity-score model, but they change the matching rules. The goal is not to select the run with the smallest p-value. The goal is to show how sample size, balance, overlap, and effect estimates change when the matching design is made stricter or more flexible.

RunMain optionsMatched setsMatched treatedMatched controls / unique controlsUnmatched treatedUnmatched controlsDropped by support
Default nearest-neighbor ATTLogistic score; 1:1 nearest-neighbor ATT; propensity-score distance; no replacement; no caliper; no common-support restriction.160160160000
Caliper sensitivitySame as default, but with 0.2 caliper on the standardized logit propensity score.87878773730
Mahalanobis + replacement + common supportLogistic score; ATT; Mahalanobis distance; matching with replacement; drop observations outside overlap support.15615647411310
Comparison of the default PSM run and two sensitivity runs from the BESH Stat NG result workbooks.

The caliper run keeps the same nearest-neighbor ATT design but adds a 0.2 caliper on the standardized logit propensity score. This prevents many distant matches, so the matched sample drops from 160 pairs to 87 pairs. The Mahalanobis/common-support run keeps almost all eligible treated observations, but it allows replacement and therefore uses only 47 unique controls for 156 matched treated observations.

Treatment-effect estimates across runs

RunMatched ATTStd. Error95% CIp-valueCI crosses zero?Max |SMD| after matchingVariables still above 0.1
Default nearest-neighbor ATT-0.5880.536-1.638 to 0.4630.2728Yes0.791lpi, idp, disea
Caliper sensitivity-1.4140.804-2.990 to 0.1620.0788Yes0.191lpi, physlm, hlthg
Mahalanobis + replacement + common support-1.2820.507-2.275 to -0.2890.0114No0.153lpi
Matched ATT estimates and balance summary across the three PSM runs.

The default matched estimate is -0.588 MD visits and its confidence interval crosses zero. The caliper run gives a more negative matched estimate (-1.414), but the smaller matched sample makes the interval wider and it still crosses zero. The Mahalanobis/common-support run gives a negative estimate with a confidence interval below zero, but it achieves this design by reusing controls: only 47 unique controls are used for 156 matched treated observations.

The weighting and AIPW estimates are unchanged across these three runs because the same propensity-score model and weighting definitions are used. They are useful as sensitivity estimates, but they do not replace balance and overlap diagnostics for the matched design.

Balance comparison

Covariate|SMD| beforeDefault after matchingCaliper after matchingMahalanobis/replacement/support after matchingAfter weighting
lpi0.7910.7910.1910.1530.242
idp0.6010.6010.0560.0900.014
disea0.1640.1640.0040.0270.133
hlthg0.0780.0780.1180.0260.101
hlthp000.097
physlm0.0790.0790.15100.083
hlthf0.0720.0720.08400.053
Absolute standardized mean differences from the Love-plot data. Values above 0.1 usually deserve review.

The default match does not improve balance because all treated and all control observations are used. The caliper run improves balance for idp and disea, but lpi, physlm, and hlthg still exceed the 0.1 threshold. The Mahalanobis/common-support run gives the best after-matching balance in this example, leaving only lpi above 0.1, but the repeated use of controls should be reported as part of the design.

How to interpret the sensitivity runs

  • Default run: It is deliberately diagnostic: matching all rows leaves the main imbalances unchanged.
  • Caliper run: best for showing the cost of stricter matching. Balance improves for some covariates, but 73 treated rows are left unmatched and the matched estimate becomes less precise.
  • Mahalanobis/common-support run: best for showing a more balanced matched design. It improves most SMDs, but it relies on matching with replacement and therefore reuses controls.
  • Weighted and AIPW estimates: useful sensitivity estimates, especially when matching and weighting point in the same direction, but they remain model-dependent and should be read together with overlap and effective-sample-size diagnostics.

Recommended wording for the tutorial interpretation

In this RAND HIE example, the default nearest-neighbor ATT run matched all 160 treated observations to all 160 controls, but covariate balance did not improve because all controls were used. Adding a standardized-logit caliper produced a smaller matched sample of 87 pairs and improved balance for several covariates, although the matched ATT confidence interval still crossed zero. A Mahalanobis-distance run with replacement and common-support restriction produced the best balance among the three runs and a negative matched ATT estimate with a confidence interval below zero, but the design reused controls heavily, with 47 unique controls matched to 156 treated observations. The practical conclusion is that PSM should be reported as a design-and-diagnostics workflow: treatment-effect estimates should be interpreted only after balance, overlap, matched sample size, and control reuse have been reviewed.

References and source material

See also