Great available resources from the web regarding basic statistical testing
import excel "-----ADDRESS on COMPUTER-----", sheet("NAME OF SHEET") firstrow clear
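set more off = stops Stata from pausing with "--more--" after each screen of output, so the full analysis is displayed at once
local variables .... .... .... = store the list of variables you are interested in; separate the names with spaces, not commas. A local only exists while the do-file that defines it is running, so run this line together with the commands that use it.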
summarize VARIABLE, detail = summary including the median and the 25th and 75th percentiles (IQR) for the whole population
summarize VARIABLE = summary (number of observations, mean, standard deviation, minimum, maximum) without the median and IQR, for the whole population
bysort GROUP: summarize VARIABLE, detail = summary including the median and IQR, separately for GROUP=1 and GROUP=0 (or for each level of GROUP, if the group is a category)
swilk VARIABLE = Shapiro-Wilk test (to see whether the variable is normally distributed). A p-value of 0.05 or more means the test found no evidence against normality; a p-value below 0.05 suggests the variable is not normally distributed.
ttest VARIABLE, by(NAME of GROUP VARIABLE) = Student's t-test comparing two groups, for a normally distributed continuous variable.
ranksum VARIABLE, by(NAME of GROUP VARIABLE) = Mann–Whitney U test (Wilcoxon rank-sum test) comparing two groups, for a continuous variable that is not normally distributed; the p-value is based on ranks.
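As an illustration of the three commands above, here is a minimal sketch using Stata's built-in example dataset (loaded with sysuse auto). The dataset and the variables mpg and foreign are assumptions chosen for the example, not part of the original data. It checks normality within each group and then runs both tests, so the two results can be compared.

```stata
* Load Stata's built-in example dataset (an assumption for this sketch)
sysuse auto, clear

* Shapiro-Wilk test of mpg within each group of foreign (0 or 1)
swilk mpg if foreign == 0
swilk mpg if foreign == 1

* Parametric comparison of mpg between the two groups
ttest mpg, by(foreign)

* Non-parametric comparison of mpg between the two groups
ranksum mpg, by(foreign)
```

With your own data, replace mpg with the continuous variable and foreign with the two-level group variable.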
generate NAME_of_NEW_VARIABLE = logical expression of interest (the new variable is 1 where the expression is true and 0 where it is false).
Example:
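generate large_weight = birth_weight > 4000 = every baby weighing more than 4000 grams is coded 1 and every baby weighing 4000 grams or less is coded 0. Caution: Stata treats a missing value as larger than any number, so a baby with a missing birth_weight is also coded 1 unless you add "if !missing(birth_weight)" to the end of the command.
generate large_weight = birth_weight >= 4000 = alternative cutoff: every baby weighing 4000 grams or more is coded 1 and every baby weighing less than 4000 grams is coded 0. Since generate cannot overwrite an existing variable, use this line instead of the one above, not after it.
generate low_gestation_or_low_birthweight = gestational_age_at_birth_w <=26 | birth_weight <=750 = babies born at 26 weeks or less OR weighing 750 grams or less are coded 1; all others are coded 0
generate low_gestation_or_low_birthweit2 = gestational_age_at_birth_w <=26 & birth_weight <=750 = babies born at 26 weeks or less AND weighing 750 grams or less are coded 1; all others are coded 0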
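The examples above can be tried on Stata's built-in example dataset. This is only a sketch: the dataset (sysuse auto) and the variable names rep78, good_repair and good_repair_naive are assumptions for illustration. It shows how adding "if !missing()" keeps observations with a missing value from being coded 1.

```stata
* Load Stata's built-in example dataset (an assumption for this sketch)
sysuse auto, clear

* Naive version: observations with a missing rep78 are coded 1,
* because Stata treats missing as larger than any number
generate good_repair_naive = rep78 >= 4

* Safer version: observations with a missing rep78 stay missing
generate good_repair = rep78 >= 4 if !missing(rep78)

* Compare the two; the missing option shows missing values as their own row
tab good_repair_naive, missing
tab good_repair, missing
```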
tab VAR1 VAR2, exact col = Fisher's exact test, plus the cross-tabulation of categorical variable 1 by categorical variable 2 with column percentages
tab VAR1 = frequency distribution of VAR1 in the whole population
logit BINARY VAR1 VAR2..., or = odds ratios from the multiple logistic regression of the outcome (BINARY) on variables 1, 2, 3....
logit death rv_endo_gls, or = "or" requests odds ratios; logit runs the logistic regression; the outcome is death (1 or 0, where 1 is yes); rv_endo_gls is the variable of interest, which here is the RV strain value
reg birth_weight rv_endo_gls i.sex = multiple linear regression of a continuous outcome (here birth weight) on the RV strain, adjusted for sex. Sex is a categorical predictor, so it is marked with the "i." prefix, and it must be coded numerically (for example 1=male, 0=female).
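For practice, the same two regression patterns can be run on Stata's built-in example dataset. This is a sketch only: the dataset (sysuse auto) and the variables foreign, mpg, weight and price are assumptions standing in for your own outcome and predictors.

```stata
* Load Stata's built-in example dataset (an assumption for this sketch)
sysuse auto, clear

* Logistic regression: binary outcome foreign (0/1) on mpg and weight,
* reported as odds ratios
logit foreign mpg weight, or

* Linear regression: continuous outcome price on mpg,
* adjusted for the categorical predictor foreign
regress price mpg i.foreign
```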
Loops:
local variables var1 var2 var3 var4....
foreach var of local variables {
    display "Running summary for variable: `var'"
    summarize `var', detail
}
The loop above will run the summary for all the variables listed in local variables and provide the detail.
foreach var of local variables {
    display "Running TTEST for variable: `var'"
    ttest `var', by(GROUP)
}
The loop above will run the TTEST for all the variables listed in local variables by the GROUP.
foreach var of local variables {
    display "Running details by group for variable: `var'"
    bysort GROUP: summarize `var', detail
}
The loop above will run the summary for all the variables listed in local variables and provide the detail, but this time for each group of interest.
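The loops above can also be combined with the normality check, so that each variable gets the test that matches its distribution. The following is a rough sketch, not a definitive workflow. It assumes Stata's built-in example dataset (sysuse auto) with the group variable foreign, applies the Shapiro-Wilk test to the whole sample rather than to each group, and uses 0.05 as the cutoff. The rank-sum p-value is computed from the stored z statistic, r(z).

```stata
* Load Stata's built-in example dataset (an assumption for this sketch)
sysuse auto, clear

* Variables to compare between the two groups of foreign
local variables price mpg weight

foreach var of local variables {
    * Shapiro-Wilk test on the whole sample; its p-value is stored in r(p)
    quietly swilk `var'
    if r(p) >= 0.05 {
        * No evidence against normality: Student's t-test
        quietly ttest `var', by(foreign)
        display "`var': t-test p = " %6.4f r(p)
    }
    else {
        * Evidence against normality: rank-sum test,
        * two-sided p-value from the normal approximation
        quietly ranksum `var', by(foreign)
        display "`var': rank-sum p = " %6.4f 2*normal(-abs(r(z)))
    }
}
```

Run the whole block at once from a do-file, so that the local list is still defined when the loop starts.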
Prepared by Alexie Fonta Holder - August 19, 2025