This tutorial shows how to run a propensity score matching (PSM) analysis in Microsoft Excel using BESH Stat NG. The example uses a RAND Health Insurance Experiment subset and compares a cost-sharing insurance plan group with a no-cost-sharing group on the number of outpatient visits to a medical doctor.
The goal is not simply to fit a good treatment-assignment model. The goal is to create a more credible treated-versus-control comparison by improving balance in measured pre-treatment covariates, then to inspect whether the adjusted comparison is believable enough to interpret.
What is propensity score matching?
Propensity score methods are used in observational data when treatment assignment was not randomized. The propensity score is the estimated probability of receiving treatment given measured baseline covariates. Among subjects with similar propensity scores, treated and control groups are expected to have similar distributions of the measured covariates, at least for the variables included in the score model; two individual subjects with the same score can still differ on specific covariates.
In a matching workflow, treated subjects are matched to control subjects with similar propensity scores or covariate profiles. The matched sample is then used to estimate a treatment effect, such as the average treatment effect among the treated (ATT).
Important limitation: propensity-score methods adjust only for measured pre-treatment covariates. They do not remove bias due to unmeasured confounding, post-treatment variables, poor timing, or lack of overlap.
Example data
The input data contain 320 observations: 160 treated and 160 controls. In this tutorial, treatment is defined as 1 = cost-sharing plan and 0 = no-cost-sharing plan. The outcome is outcome_mdvis, the number of outpatient visits to a medical doctor. The propensity score model uses seven pre-treatment covariates: lpi, idp, physlm, disea, hlthg, hlthf, and hlthp.
| Variable | Description |
|---|---|
| id | Synthetic row ID used for output labels and audit tables. |
| treatment | Treatment indicator: 1 = cost-sharing plan, 0 = no-cost-sharing plan. |
| outcome_mdvis | Number of outpatient visits to a medical doctor. |
| lpi | Log of annual participation incentive payment. |
| idp | Indicator for individual deductible plan. |
| physlm | Indicator for physical limitation. |
| disea | Number of chronic diseases. |
| hlthg / hlthf / hlthp | Self-rated health indicators for good, fair, and poor health; excellent health is the omitted category. |
| pscore | Previously estimated propensity score column, useful for supplied-score sensitivity and external comparisons. |
Open the PSM dialog
On the Excel ribbon, go to BESH Stat NG → Analyse → Causal Inference → Propensity Score Matching.

GUI workflow
1. Select the data
Use the Data tab to assign worksheet columns to analysis roles. Treatment, outcome, and covariates are required. ID is optional but recommended because it makes matched-pair and audit tables easier to read.

- Set Active Worksheet to Input_Data.
- Move treatment to Treatment (0/1).
- Move outcome_mdvis to Outcome.
- Move id to ID.
- Leave Supplied score empty for the main logistic-score tutorial run.
- Move lpi, idp, physlm, disea, hlthg, hlthf, and hlthp to Covariates.
2. Specify the propensity model
In the Propensity model tab, keep the tutorial model simple: main effects plus an intercept. This makes the example easy to reproduce and keeps the model auditable.

Do not include the outcome in the propensity-score model. Only include variables measured before treatment assignment that could affect both treatment and outcome.
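To make the score model less of a black box, here is a minimal Python sketch of a logistic propensity model fitted by Newton-Raphson. It is an illustration under stated assumptions, not the BESH Stat NG implementation: it uses a single standardized covariate plus an intercept, the data are hypothetical, and applying the small ridge penalty to both coefficients is an assumption (the tutorial only lists a ridge value of 1e-07).

```python
import math

def standardize(values):
    """Center and scale a covariate to mean 0 and sample SD 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def fit_logistic(x, treatment, ridge=1e-7, max_iter=100, tol=1e-7):
    """Newton-Raphson fit of P(treatment = 1) = sigmoid(b0 + b1 * x).

    Assumption: the ridge penalty is applied to both coefficients.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, treatment):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += ti - p            # gradient, intercept
            g1 += (ti - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        g0 -= ridge * b0
        g1 -= ridge * b1
        h00 += ridge
        h11 += ridge
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1

# Hypothetical data: one pre-treatment covariate and a 0/1 treatment flag.
raw_x = [1, 2, 3, 4, 5, 6, 7, 8]
treatment = [0, 0, 1, 0, 1, 0, 1, 1]

x = standardize(raw_x)
b0, b1 = fit_logistic(x, treatment)
scores = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
```

Note that the outcome never enters the fit: only the treatment flag and the pre-treatment covariate are used.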
3. Choose matching and scoring options
The attached result workbook was created with the default PSM options shown below: logistic-regression score estimation, ATT estimand, 1:1 nearest-neighbor matching, propensity-score distance, no replacement, no caliper, no common-support restriction, and standardized covariates for the propensity model.

| Option | Value |
|---|---|
| Run method | StandardNearestNeighbor |
| Score method | LogisticRegression |
| Estimand | ATT |
| Distance metric | PropensityScore |
| Matching ratio | 1 |
| With replacement | No |
| Caliper scale | None |
| Matching order | PropensityDescending |
| Common support | None |
| Standardize covariates | Yes |
| Logistic ridge penalty | 1e-07 |
| SMD threshold | 0.1000 |
| Normalize weights to sample size | Yes |
| Include doubly robust AIPW | Yes |
| Include overlap diagnostics | Yes |
| Include weight diagnostics | Yes |
| Include Love-plot rows | Yes |
| Overlap bin count | 20 |
| Love-plot threshold | 0.1000 |
| Extreme weight cutoff | 10 |
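The matching options in the table above (ratio 1, no replacement, propensity-score distance, descending order, no caliper) correspond to a greedy nearest-neighbor procedure. The Python sketch below illustrates that idea with hypothetical scores; the function name and the tie-breaking rule (lowest control index wins) are assumptions, and BESH Stat NG may resolve ties differently.

```python
def greedy_match(treated_ps, control_ps):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Treated units are processed from highest to lowest propensity score.
    Returns a list of (treated_index, control_index, distance) tuples.
    """
    order = sorted(range(len(treated_ps)),
                   key=lambda i: treated_ps[i], reverse=True)
    available = set(range(len(control_ps)))
    pairs = []
    for t in order:
        if not available:
            break  # no controls left; remaining treated stay unmatched
        # Closest remaining control; ties broken by lowest index (assumption).
        c = min(available,
                key=lambda j: (abs(treated_ps[t] - control_ps[j]), j))
        pairs.append((t, c, abs(treated_ps[t] - control_ps[c])))
        available.remove(c)
    return pairs

# Hypothetical propensity scores.
treated_scores = [0.80, 0.60, 0.30]
control_scores = [0.25, 0.55, 0.70, 0.10]

pairs = greedy_match(treated_scores, control_scores)
matched_indices = [(t, c) for t, c, _ in pairs]
```

Because no caliper is applied, every treated unit receives a match as long as a control is still available, however distant that control is.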
4. Select diagnostics and outputs
For teaching and applied analysis, keep the diagnostic outputs enabled. The matched-pair table provides auditability; balance diagnostics show whether adjustment helped; overlap and weight diagnostics show whether the analysis depends on extreme extrapolation; and Love-plot output makes balance easier to inspect visually.

- Include doubly robust AIPW estimate: useful as a model-assisted sensitivity estimate.
- Overlap diagnostics: useful for detecting weak treated/control overlap.
- Weight diagnostics: important when interpreting weighting and AIPW results.
- Love-plot data: recommended for every PSM tutorial and report.
- Write matched-pair table: recommended for matching methods so users can audit each match.
- Write diagnostics tables: recommended for checking balance, overlap, weights, and sensitivity output.
Results from the tutorial run
Download the full results Excel workbook generated by BESH Stat NG.
Run summary
| Item | Value |
|---|---|
| Run Method | StandardNearestNeighbor |
| Score Method | LogisticRegression |
| Score Model Converged | Yes |
| Score Model Iterations | 4 |
| Total Rows | 320 |
| Treated Rows | 160 |
| Control Rows | 160 |
| Matched Sets | 160 |
| Dropped by Common Support | 0 |
| Dropped by Trimming | 0 |
| Warnings | 0 |
Sample-size summary
| Metric | Value |
|---|---|
| Total rows | 320 |
| Treated rows | 160 |
| Control rows | 160 |
| Eligible treated rows | 160 |
| Eligible control rows | 160 |
| Matched treated rows | 160 |
| Matched control rows | 160 |
| Matched sets | 160 |
| Unmatched treated rows | 0 |
| Unmatched control rows | 0 |
| Dropped by common support | 0 |
| Dropped by trimming | 0 |
Treatment-effect estimates
The matched-pair ATT estimate is negative, meaning that treated subjects had fewer MD visits on average than their matched controls. However, the matched-pair 95% confidence interval crosses zero. In this default run, the matched comparison alone does not provide strong evidence of a non-zero effect.
| Method | Estimand | Estimate | Std. Error | Lower 95% | Upper 95% | Treated Mean | Control Mean | Eff. Treated N | Eff. Control N |
|---|---|---|---|---|---|---|---|---|---|
| Matched mean difference | ATT | -0.5875 | 0.5358 | -1.638 | 0.4626 | 2.938 | 3.525 | 160 | 160 |
| Propensity-score weighting | ATT | -2.357 | 0.9893 | -4.296 | -0.4178 | 2.938 | 5.294 | 160 | 65.71 |
The weighting estimate above and the AIPW estimate reported in the effect sensitivity summary are more negative than the matched estimate, and their confidence intervals do not cross zero. These estimates should be interpreted cautiously because the diagnostics show residual imbalance and limited effective sample size among weighted controls.
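For readers who want to see what ATT weighting does, the following Python sketch uses the usual odds weights: treated rows get weight 1 and control rows get ps / (1 - ps), rescaled so that control weights sum to the number of controls. This is an assumption about the weighting scheme rather than a statement of how BESH Stat NG computes it, and the data are hypothetical.

```python
def att_weights(treatment, pscore):
    """ATT odds weights: 1 for treated, ps / (1 - ps) for controls.

    Control weights are rescaled to sum to the number of controls
    (an assumed reading of 'Normalize weights to sample size').
    """
    raw = [1.0 if t == 1 else p / (1.0 - p)
           for t, p in zip(treatment, pscore)]
    n_control = sum(1 for t in treatment if t == 0)
    control_sum = sum(w for w, t in zip(raw, treatment) if t == 0)
    scale = n_control / control_sum
    return [w if t == 1 else w * scale for w, t in zip(raw, treatment)]

def weighted_att(treatment, outcome, weights):
    """Treated mean minus weighted control mean."""
    treated_y = [y for y, t in zip(outcome, treatment) if t == 1]
    treated_mean = sum(treated_y) / len(treated_y)
    num = sum(w * y for w, y, t in zip(weights, outcome, treatment) if t == 0)
    den = sum(w for w, t in zip(weights, treatment) if t == 0)
    return treated_mean - num / den

# Hypothetical rows: treatment flag, propensity score, outcome.
treatment = [1, 1, 0, 0, 0]
pscore = [0.80, 0.60, 0.50, 0.20, 0.75]
outcome = [2.0, 3.0, 4.0, 6.0, 1.0]

weights = att_weights(treatment, pscore)
att = weighted_att(treatment, outcome, weights)
```

Controls with high propensity scores receive large weights, which is why a handful of controls can dominate the weighted control mean.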
Effect sensitivity summary
| Method | Estimand | Estimate | Std. Error | z | p-value | Lower 95% | Upper 95% | Crosses zero | Warning |
|---|---|---|---|---|---|---|---|---|---|
| Matched mean difference | ATT | -0.5875 | 0.5358 | -1.097 | 0.2728 | -1.638 | 0.4626 | Yes | |
| Propensity-score weighting | ATT | -2.357 | 0.9893 | -2.382 | 0.0172 | -4.296 | -0.4178 | No | |
| Doubly robust AIPW | ATT | -2.042 | 1.009 | -2.024 | 0.0429 | -4.020 | -0.0651 | No | |
A useful teaching point is that matching, weighting, and AIPW do not have to produce identical estimates. Differences between them are often a signal to inspect overlap, balance, covariate specification, and target estimand rather than to choose the most favorable estimate.
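The z, p-value, and confidence-interval columns in the sensitivity summary follow from the estimate and its standard error under a normal approximation. The sketch below reproduces the matched-pair row from the reported estimate (-0.5875) and standard error (0.5358); the function name is hypothetical, and the use of a two-sided normal test is inferred from the z column rather than documented.

```python
import math

def wald_summary(estimate, std_error, z_crit=1.959964):
    """Return (z, two-sided p-value, lower 95%, upper 95%)."""
    z = estimate / std_error
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return z, p_value, lower, upper

# Matched mean difference row from the table above.
z, p_value, lower, upper = wald_summary(-0.5875, 0.5358)
crosses_zero = lower < 0.0 < upper
```

Running the same function on the weighting row (-2.357, 0.9893) gives an interval that excludes zero, which is the contrast the table highlights.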
Propensity-score model
| Term | Estimate | Std. Error | Method |
|---|---|---|---|
| Intercept | -0.0527 | 0.1315 | LogisticRegression |
| lpi | 1.056 | 0.1529 | LogisticRegression |
| idp | -0.8535 | 0.1378 | LogisticRegression |
| physlm | -0.1565 | 0.1517 | LogisticRegression |
| disea | 0.3634 | 0.1541 | LogisticRegression |
| hlthg | 0.1146 | 0.1371 | LogisticRegression |
| hlthf | 0.0105 | 0.1434 | LogisticRegression |
| hlthp | 0.0965 | 0.2018 | LogisticRegression |
Balance diagnostics and Love plot
Balance diagnostics are the most important part of a propensity-score analysis. The standardized mean difference (SMD) compares covariate means between treated and control groups on a standardized scale. A common rule of thumb is to review covariates with absolute SMD greater than 0.1.

| Plot Row | Variable | |SMD| before | |SMD| after matching | |SMD| after weighting | Threshold | Flag |
|---|---|---|---|---|---|---|
| 1 | lpi | 0.7908 | 0.7908 | 0.2418 | 0.1000 | Review |
| 2 | idp | 0.6011 | 0.6011 | 0.0140 | 0.1000 | Review |
| 3 | disea | 0.1639 | 0.1639 | 0.1331 | 0.1000 | Review |
| 4 | hlthg | 0.0784 | 0.0784 | 0.1013 | 0.1000 | Review |
| 5 | hlthp | 0 | 0 | 0.0965 | 0.1000 | OK |
| 6 | physlm | 0.0789 | 0.0789 | 0.0833 | 0.1000 | OK |
| 7 | hlthf | 0.0724 | 0.0724 | 0.0533 | 0.1000 | OK |
In this tutorial run, lpi, idp, and disea remain above the 0.1 SMD threshold after matching. The reason is visible in the run summary: there are 160 treated rows and 160 control rows, and 1:1 matching without replacement uses all controls. Because all rows are retained, the matched sample has the same covariate means as the original sample. This is a useful diagnostic lesson: a completed match is not automatically a balanced match.
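The point that a complete match leaves balance unchanged can be checked with a few lines of Python. The sketch below computes an SMD with a pooled standard deviation and then "matches" every treated row to some control. The pooled-SD denominator is an assumption: packages differ on this choice (some use the treated-group SD for ATT), and the tutorial does not state which one BESH Stat NG uses. The data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_var(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def smd(treated, control):
    """Standardized mean difference with a pooled SD (assumed denominator)."""
    pooled_sd = math.sqrt((sample_var(treated) + sample_var(control)) / 2.0)
    if pooled_sd == 0.0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical covariate values, equal group sizes.
treated_x = [3.0, 4.0, 5.0, 6.0]
control_x = [1.0, 2.0, 3.0, 4.0]

smd_before = smd(treated_x, control_x)

# 1:1 matching without replacement that uses every control only
# reorders the control column, so the group means and SDs are unchanged.
matched_control_x = [control_x[j] for j in (2, 3, 1, 0)]
smd_after = smd(treated_x, matched_control_x)
```

Balance can only change when matching drops or reweights rows, which is what calipers, common-support restrictions, and matching with replacement do.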
Weight and overlap diagnostics
| Sample | Group | N | Non-zero N | Sum W | Mean W | Min W | Max W | CV | ESS | ESS/N | Extreme W N | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StandardNearestNeighbor | All | 320 | 320 | 320.0 | 1.000 | 0.0164 | 8.243 | 0.8484 | 186.3 | 0.5822 | 0 | OK |
| StandardNearestNeighbor | Treated | 160 | 160 | 160 | 1 | 1 | 1 | 0 | 160 | 1 | 0 | OK |
| StandardNearestNeighbor | Control | 160 | 160 | 160.0 | 1.0000 | 0.0164 | 8.243 | 1.202 | 65.71 | 0.4107 | 0 | Low ESS |
The treated group has an effective sample size of 160, while the weighted control effective sample size is about 65.7. This indicates that the weighted comparison relies on unequal control weights. That does not invalidate the analysis, but it means the weighting results should be interpreted together with balance and overlap diagnostics.
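The effective sample size (ESS) in this table behaves like the Kish formula, (sum of weights)^2 / (sum of squared weights). That formula is an assumption here, but it is consistent with the reported figures: it is approximately N / (1 + CV^2), and 160 / (1 + 1.202^2) is about 65.4, close to the reported 65.71. A minimal sketch with hypothetical weights:

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        return 0.0
    return total * total / total_sq

# Hypothetical weights: equal weights keep the full sample size...
equal_weights = [1.0, 1.0, 1.0, 1.0]
ess_equal = effective_sample_size(equal_weights)

# ...while one dominant weight shrinks it sharply.
unequal_weights = [0.1, 0.1, 0.1, 3.7]
ess_unequal = effective_sample_size(unequal_weights)
```

A low ESS/N ratio means that the weighted control mean is effectively driven by far fewer observations than the nominal group size suggests.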
| Group | N | Min | Q1 | Median | Q3 | Max | Mean | SD | Below overlap | Above overlap | Extreme PS | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Treated | 160 | 0.0551 | 0.5755 | 0.7269 | 0.7781 | 0.9511 | 0.6369 | 0.2135 | 0 | 4 | 1 | Outside support,Extreme PS |
| Control | 160 | 0.0152 | 0.1685 | 0.2561 | 0.5901 | 0.8859 | 0.3631 | 0.2342 | 6 | 0 | 6 | Outside support,Extreme PS |
The overlap summary flags observations outside support and extreme propensity scores in both groups. This supports a cautious interpretation and motivates sensitivity analyses with calipers, common-support restrictions, or overlap-focused estimands.
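One common way to define the support region is the min-max rule: the overlap interval runs from the larger of the two group minima to the smaller of the two group maxima. Whether BESH Stat NG uses exactly this rule is an assumption, but it agrees with the pattern in the table (only controls fall below, only treated fall above). The sketch below uses the reported group minima and maxima together with a few hypothetical interior scores.

```python
def common_support(treated_ps, control_ps):
    """Min-max overlap interval shared by both groups."""
    lower = max(min(treated_ps), min(control_ps))
    upper = min(max(treated_ps), max(control_ps))
    return lower, upper

def count_outside(scores, lower, upper):
    """Return (number below the interval, number above the interval)."""
    below = sum(1 for s in scores if s < lower)
    above = sum(1 for s in scores if s > upper)
    return below, above

# Extremes taken from the overlap table; interior values are hypothetical.
treated_ps = [0.0551, 0.50, 0.90, 0.9511]
control_ps = [0.0152, 0.03, 0.40, 0.8859]

lower, upper = common_support(treated_ps, control_ps)
treated_outside = count_outside(treated_ps, lower, upper)
control_outside = count_outside(control_ps, lower, upper)
```

With the reported extremes, the interval is 0.0551 to 0.8859, so treated rows can only fall above it and control rows only below it.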
Matched pairs
The full BESH Stat NG output contains all 160 matched sets. The first 10 matched pairs are shown below as an audit excerpt.
| Set | Treated Row | Treated ID | Control Row | Control ID | Distance | PS Distance | Mahalanobis | Exact Group |
|---|---|---|---|---|---|---|---|---|
| 1 | 216 | R0216 | 144 | R0144 | 0.0652 | 0.0652 | #N/A | |
| 2 | 296 | R0296 | 275 | R0275 | 0.1111 | 0.1111 | #N/A | |
| 3 | 210 | R0210 | 59 | R0059 | 0.1278 | 0.1278 | #N/A | |
| 4 | 237 | R0237 | 85 | R0085 | 0.1202 | 0.1202 | #N/A | |
| 5 | 117 | R0117 | 121 | R0121 | 0.1017 | 0.1017 | #N/A | |
| 6 | 290 | R0290 | 96 | R0096 | 0.1018 | 0.1018 | #N/A | |
| 7 | 220 | R0220 | 218 | R0218 | 0.0974 | 0.0974 | #N/A | |
| 8 | 300 | R0300 | 94 | R0094 | 0.0941 | 0.0941 | #N/A | |
| 9 | 129 | R0129 | 224 | R0224 | 0.0766 | 0.0766 | #N/A | |
| 10 | 263 | R0263 | 307 | R0307 | 0.0872 | 0.0872 | #N/A | |
Doubly robust AIPW estimate
| Method | Estimand | Estimate | Std. Error | Lower 95% | Upper 95% | Treated mean | Control mean | Treated N | Control N |
|---|---|---|---|---|---|---|---|---|---|
| Doubly robust AIPW | ATT | -2.042 | 1.009 | -4.020 | -0.0651 | 2.938 | 3.525 | 160 | 160 |
The AIPW result is useful as a sensitivity estimate because it combines propensity-score adjustment with an outcome-model component. It should not be treated as a substitute for balance checking. If covariate balance or overlap is poor, the AIPW estimate can still depend heavily on modeling assumptions.
Rosenbaum-style matched-pair sensitivity
| Metric | Value |
|---|---|
| Alternative | TwoSided |
| Informative pairs | 130 |
| Positive differences | 51 |
| Negative differences | 79 |
| Tied differences | 30 |
| Mean difference | -0.5875 |
| Median difference | 0 |
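The Rosenbaum-style sensitivity section provides additional information for matched-pair analyses. It is most useful when the matched comparison is the primary analysis and the matched pairs show acceptable balance. In this run, 79 of the 130 informative pairs have a negative difference, 51 have a positive difference, and 30 pairs are tied.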
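To show how such pair counts feed a sensitivity analysis, the sketch below computes a Rosenbaum-type upper bound on the one-sided sign-test p-value. Under a hidden-bias parameter gamma, the chance that an informative pair shows a negative difference is at most gamma / (1 + gamma), so the worst-case p-value is a binomial tail at that probability. This is a textbook sign-test bound offered as an illustration; the tutorial does not say which test statistic BESH Stat NG uses, and the table reports a two-sided alternative, whereas this sketch is one-sided.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k, n + 1))

def sign_test_upper_bound(n_informative, n_negative, gamma):
    """Worst-case one-sided p-value under hidden bias of magnitude gamma."""
    p_max = gamma / (1.0 + gamma)
    return binom_upper_tail(n_negative, n_informative, p_max)

# Pair counts from the table above: 130 informative pairs, 79 negative.
p_no_bias = sign_test_upper_bound(130, 79, gamma=1.0)   # ordinary sign test
p_gamma_15 = sign_test_upper_bound(130, 79, gamma=1.5)  # moderate hidden bias
```

The bound grows with gamma; the smallest gamma at which it exceeds the chosen significance level indicates how much unmeasured confounding would be needed to explain the result away.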
UDF workflow
The same analysis can be driven from worksheet formulas. This is useful for templates, reproducible teaching examples, and dashboards where users want to refresh outputs after changing inputs or options.
First fit the propensity-score analysis and store the returned handle in a cell, for example A1 on a worksheet named UDF_Workflow.
=BESH.PS.FIT(Input_Data!A2:A321, Input_Data!B2:B321, Input_Data!C2:C321,
    Input_Data!D2:J321, Input_Data!D1:J1,
    "matching", "ATT", "logit", , ,
    "lpi + idp + physlm + disea + hlthg + hlthf + hlthp",
    "ratio=1; replacement=false; distance=ps; order=descending; ridge=1e-7; maxIter=100; tol=1e-7",
    "smd=0.1; overlapBins=20; lovePlot=true")
Then use the handle to return output tables:
=BESH.PS.SUMMARY(A1)
=BESH.PSM.MATCHES(A1)
=BESH.PS.SCORES(A1)
=BESH.PS.WEIGHTS(A1)
=BESH.PS.BALANCE(A1)
=BESH.PS.EFFECT(A1)
=BESH.PS.LOVEPLOT_DATA(A1)
=BESH.PS.CLEANUP(A1)
For an exact benchmark using an already-computed propensity score column, use scoreMethod = supplied and pass the existing score column:
=BESH.PS.FIT(Input_Data!A2:A321, Input_Data!B2:B321, Input_Data!C2:C321,
    Input_Data!D2:J321, Input_Data!D1:J1,
    "matching", "ATT", "supplied", Input_Data!K2:K321, ,
    "",
    "ratio=1; replacement=false; distance=ps; order=descending",
    "smd=0.1; overlapBins=20; lovePlot=true")
When to use each PSM option
| Option | When to use it | Why it matters |
|---|---|---|
| Logistic score method | Default for most analyses when no existing propensity score is available. | Fits the treatment model directly from selected pre-treatment covariates. |
| Supplied score | Use for validation, external benchmark comparisons, or when scores were estimated elsewhere. | Makes matching deterministic relative to the supplied score column. |
| ATT | Use when the question is the effect among treated subjects. | This is the most common target for nearest-neighbor matching. |
| ATE | Use when the target is the full eligible sample. | Usually more natural for weighting than simple matching. |
| ATO | Use when overlap is limited and the clinically relevant target is the region of equipoise. | Can improve stability when extreme scores are present. |
| Without replacement | Use for a simple auditable matched-pair design. | Each control appears at most once, but balance may be worse. |
| With replacement | Use when there are few good controls for high-score treated subjects. | Can improve match quality but may reduce effective control sample size. |
| Caliper | Use when poor matches are possible. | Prevents distant matches, but may leave treated or control rows unmatched. |
| Common support / trimming | Use when score overlap is poor. | Removes observations where comparison requires extrapolation. |
| Exact groups | Use for variables that must not be crossed, such as site, sex, country, or risk stratum. | Protects the design but can reduce the number of possible matches. |
| Polynomial / interaction terms | Use when diagnostics show residual imbalance after the main-effect score model. | Improves score-model flexibility, but should be guided by balance, not outcome fishing. |
How to report the example
A compact report might read as follows:
We fitted a logistic propensity-score model for cost-sharing plan assignment using lpi, idp, physlm, disea, and health-status indicators. The default 1:1 nearest-neighbor ATT analysis matched all 160 treated observations to 160 controls. The matched mean difference in MD visits was -0.588 (95% CI -1.638 to 0.463), so the matched-pair estimate was not statistically distinguishable from zero. However, balance diagnostics showed residual imbalance for lpi, idp, and disea, and overlap diagnostics flagged observations outside support. Weighted and AIPW sensitivity estimates were more negative, but they should be interpreted cautiously because residual imbalance and control effective sample size indicate that the adjusted comparison remains model-dependent.
Sensitivity analyses with alternative matching options
Download results workbooks: psm_results2.xlsx, psm_results3.xlsx
The default run is useful, but it also shows a common pitfall: a completed 1:1 match is not automatically a well-balanced matched design. Because the default run matches all 160 treated observations to all 160 controls, the after-matching covariate means are the same as in the original treated/control comparison. For that reason, it is good practice to include at least one sensitivity analysis that changes the design.
The two additional runs below use the same RAND HIE input data and the same logistic propensity-score model, but they change the matching rules. The goal is not to select the run with the smallest p-value. The goal is to show how sample size, balance, overlap, and effect estimates change when the matching design is made stricter or more flexible.
| Run | Main options | Matched sets | Matched treated | Matched controls / unique controls | Unmatched treated | Unmatched controls | Dropped by support |
|---|---|---|---|---|---|---|---|
| Default nearest-neighbor ATT | Logistic score; 1:1 nearest-neighbor ATT; propensity-score distance; no replacement; no caliper; no common-support restriction. | 160 | 160 | 160 | 0 | 0 | 0 |
| Caliper sensitivity | Same as default, but with 0.2 caliper on the standardized logit propensity score. | 87 | 87 | 87 | 73 | 73 | 0 |
| Mahalanobis + replacement + common support | Logistic score; ATT; Mahalanobis distance; matching with replacement; drop observations outside overlap support. | 156 | 156 | 47 | 4 | 113 | 10 |
The caliper run keeps the same nearest-neighbor ATT design but adds a 0.2 caliper on the standardized logit propensity score. This prevents many distant matches, so the matched sample drops from 160 pairs to 87 pairs. The Mahalanobis/common-support run keeps almost all eligible treated observations, but it allows replacement and therefore uses only 47 unique controls for 156 matched treated observations.
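A caliper of 0.2 on the standardized logit propensity score is usually read as "reject any pair whose logit scores differ by more than 0.2 standard deviations of the logit score". The sketch below follows that reading; computing the standard deviation over all rows pooled together is an assumption, and the scores are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def caliper_width(pscores, multiplier=0.2):
    """Caliper width = multiplier * sample SD of the logit scores."""
    logits = [logit(p) for p in pscores]
    m = sum(logits) / len(logits)
    sd = math.sqrt(sum((v - m) ** 2 for v in logits) / (len(logits) - 1))
    return multiplier * sd

def within_caliper(treated_p, control_p, width):
    """True if the pair is close enough on the logit scale."""
    return abs(logit(treated_p) - logit(control_p)) <= width

# Hypothetical propensity scores for all rows (treated and control pooled).
all_scores = [0.2, 0.4, 0.5, 0.6, 0.8]
width = caliper_width(all_scores)

close_pair_ok = within_caliper(0.50, 0.52, width)   # small logit gap
distant_pair_ok = within_caliper(0.50, 0.60, width)  # gap exceeds caliper
```

Treated rows with no control inside the caliper stay unmatched, which is why the matched sample shrinks when the caliper is added.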
Treatment-effect estimates across runs
| Run | Matched ATT | Std. Error | 95% CI | p-value | CI crosses zero? | Max |SMD| after matching | Variables still above 0.1 |
|---|---|---|---|---|---|---|---|
| Default nearest-neighbor ATT | -0.588 | 0.536 | -1.638 to 0.463 | 0.2728 | Yes | 0.791 | lpi, idp, disea |
| Caliper sensitivity | -1.414 | 0.804 | -2.990 to 0.162 | 0.0788 | Yes | 0.191 | lpi, physlm, hlthg |
| Mahalanobis + replacement + common support | -1.282 | 0.507 | -2.275 to -0.289 | 0.0114 | No | 0.153 | lpi |
The default matched estimate is -0.588 MD visits and its confidence interval crosses zero. The caliper run gives a more negative matched estimate (-1.414), but the smaller matched sample makes the interval wider and it still crosses zero. The Mahalanobis/common-support run gives a negative estimate with a confidence interval below zero, but it achieves this design by reusing controls: only 47 unique controls are used for 156 matched treated observations.
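Control reuse is easy to quantify from a matched-pair table. With 156 matched treated observations and 47 unique controls, each used control serves about 156 / 47 ≈ 3.3 treated observations on average. The sketch below computes the same kind of summary from a list of pairs; the function name and the pair list are hypothetical.

```python
from collections import Counter

def control_reuse(pairs):
    """Summarize how often each control is reused.

    pairs: list of (treated_id, control_id) tuples.
    """
    counts = Counter(control_id for _, control_id in pairs)
    return {
        "matched_treated": len(pairs),
        "unique_controls": len(counts),
        "max_reuse": max(counts.values()),
        "mean_reuse": len(pairs) / len(counts),
    }

# Hypothetical matched pairs from a with-replacement design.
pairs = [("T1", "C1"), ("T2", "C1"), ("T3", "C2"), ("T4", "C1")]
reuse = control_reuse(pairs)
```

Reporting the maximum reuse alongside the number of unique controls shows whether a few controls carry most of the comparison.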
The weighting and AIPW estimates are unchanged across these three runs because the same propensity-score model and weighting definitions are used. They are useful as sensitivity estimates, but they do not replace balance and overlap diagnostics for the matched design.
Balance comparison
| Covariate | |SMD| before | Default after matching | Caliper after matching | Mahalanobis/replacement/support after matching | After weighting |
|---|---|---|---|---|---|
| lpi | 0.791 | 0.791 | 0.191 | 0.153 | 0.242 |
| idp | 0.601 | 0.601 | 0.056 | 0.090 | 0.014 |
| disea | 0.164 | 0.164 | 0.004 | 0.027 | 0.133 |
| hlthg | 0.078 | 0.078 | 0.118 | 0.026 | 0.101 |
| hlthp | 0 | 0 | — | — | 0.097 |
| physlm | 0.079 | 0.079 | 0.151 | 0 | 0.083 |
| hlthf | 0.072 | 0.072 | 0.084 | 0 | 0.053 |
The default match does not improve balance because all treated and all control observations are used. The caliper run improves balance for lpi, idp, and disea, but lpi (0.191), physlm (0.151), and hlthg (0.118) still exceed the 0.1 threshold, and physlm and hlthg are less balanced than before matching. The Mahalanobis/common-support run gives the best after-matching balance in this example, leaving only lpi above 0.1, but the repeated use of controls should be reported as part of the design.
How to interpret the sensitivity runs
- Default run: best read as a diagnostic baseline. Matching all rows leaves the main imbalances unchanged.
- Caliper run: best for showing the cost of stricter matching. Balance improves for some covariates, but 73 treated rows are left unmatched and the matched estimate becomes less precise.
- Mahalanobis/common-support run: best for showing a more balanced matched design. It improves most SMDs, but it relies on matching with replacement and therefore reuses controls.
- Weighted and AIPW estimates: useful sensitivity estimates, especially when matching and weighting point in the same direction, but they remain model-dependent and should be read together with overlap and effective-sample-size diagnostics.
Recommended wording for the tutorial interpretation
In this RAND HIE example, the default nearest-neighbor ATT run matched all 160 treated observations to all 160 controls, but covariate balance did not improve because all controls were used. Adding a standardized-logit caliper produced a smaller matched sample of 87 pairs and improved balance for several covariates, although the matched ATT confidence interval still crossed zero. A Mahalanobis-distance run with replacement and common-support restriction produced the best balance among the three runs and a negative matched ATT estimate with a confidence interval below zero, but the design reused controls heavily, with 47 unique controls matched to 156 treated observations. The practical conclusion is that PSM should be reported as a design-and-diagnostics workflow: treatment-effect estimates should be interpreted only after balance, overlap, matched sample size, and control reuse have been reviewed.
References and source material
- RAND HIE / statsmodels dataset
- MatchIt R package
- MatchIt nearest-neighbor documentation
- cobalt balance diagnostics package
- BESH Stat NG documentation