API Reference
The package is broadly divided into:
Mixin Classes:
Mixins that handle specific assumption checks:
- NormalityChecker class: Checks normality using the Shapiro-Wilk test, skewness, and kurtosis.
- HomoscedasticityChecker class: Checks homoscedasticity using Levene's test.
- MonotonicityChecker class: Checks monotonic relationships using Spearman correlation.
- PairedDataChecker class: Checks that data are properly paired.
- CategoricalDataChecker class: Checks that variables are categorical.
- MultivariateNormalityChecker class: Checks multivariate normality using Mardia's test.
- SurvivalDataChecker class: Checks basic requirements for survival data.
- ProportionalHazardsChecker class: Checks proportional hazards using Schoenfeld residuals.
- PairedDifferenceChecker class: Checks properties of paired differences.
Specific Checkers:
Specific checkers for various statistical tests, each inheriting from AssumptionChecker and the relevant mixins:
- IndependentTTestChecker class: For the Independent Samples t-test
- RepeatedMeasuresANOVAChecker class: For Repeated-Measures ANOVA
- LogisticRegressionChecker class: For Logistic Regression
- PearsonCorrelationChecker class: For Pearson Correlation
- PairedTTestChecker class: For the Paired t-test
- ChiSquareIndependenceChecker class: For the Chi-square test of independence
- MultipleRegressionChecker class: For Multiple Linear Regression
- TwoWayANOVAChecker class: For Two-way ANOVA
- KaplanMeierChecker class: For Kaplan-Meier analysis
- CoxPHChecker class: For Cox Proportional Hazards
- PoissonRegressionChecker class: For Poisson Regression
- SpearmanCorrelationChecker class: For Spearman Correlation
- WilcoxonSignedRankChecker class: For the Wilcoxon Signed-Rank Test
- MANOVAChecker class: For MANOVA
- OneWayANOVAChecker class: For One-way ANOVA
- FactorialANOVAChecker class: For Factorial ANOVA
Main Class:
StatisticalTestAssumptions is the main class that manages the assumption checkers. It checks assumptions for a specified test type and provides recommendations based on the results.
Included test assumptions
Independent Samples t-test t_test_ind
- Normality: The dependent variable should be (approximately) normally distributed within each group.
- Homogeneity of Variances: The population variances in the two groups should be equal (often checked using Levene’s test).
- Independence of Observations: Each observation in one group is independent of any observation in the other group.
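The checks above can be sketched with SciPy on synthetic data; the 0.05 cutoff and the Welch fallback are illustrative choices, not part of this package's API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)

# Normality within each group (Shapiro-Wilk)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Homogeneity of variances (Levene's test)
_, p_levene = stats.levene(group_a, group_b)

# If Levene's test rejects equal variances, fall back to Welch's t-test
equal_var = p_levene > 0.05
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
```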
Repeated-Measures ANOVA repeated_anova
- Normality: The dependent variable should be (approximately) normally distributed at each time point or for each repeated measure.
- Sphericity: The variances of the differences between all possible pairs of repeated-measure conditions should be equal (often checked by Mauchly’s test).
- Independence of Observations: Observations from different subjects are assumed to be independent, although repeated measures on the same subject are inherently correlated.
Logistic Regression logistic
- Dependent Variable: Binary (two categories, e.g., “success/fail” or “disease/no disease”).
- Independence of Errors: Residuals (errors) should be independent across observations.
- Lack of Multicollinearity: Predictor variables should not be too highly correlated with each other.
- Linearity in the Logit: Although logistic regression is for categorical outcomes, continuous predictors should have a linear relationship with the log odds of the outcome.
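The multicollinearity check can be screened with a plain predictor correlation matrix. A minimal sketch on synthetic predictors, where x2 is deliberately made collinear with x1 (the 0.9 coefficient and any cutoff you apply are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 * 0.9 + rng.normal(scale=0.2, size=100)  # deliberately collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)       # 3x3 predictor correlation matrix
off_diag = corr[~np.eye(3, dtype=bool)]
max_abs_corr = np.abs(off_diag).max()     # flag predictor pairs with very high |r|
```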
Factorial ANOVA factorial_anova
- Normality: The dependent variable in each cell of the design should be (approximately) normally distributed.
- Homogeneity of Variances: The variance across all cells (factorial combinations of levels) should be equal.
- Independence of Observations: Observations in each cell are independent of those in other cells and within cells.
One-Way ANOVA one_way_anova
- Normality: The dependent variable should be (approximately) normally distributed in each group.
- Homogeneity of Variances: The variances in each of the groups are assumed to be equal.
- Independence of Observations: Observations in one group should be independent from those in other groups.
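A minimal SciPy sketch of these checks on synthetic groups (group means and sizes are illustrative), including the Kruskal-Wallis test as the usual non-parametric fallback:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, scale=1.0, size=25) for m in (4.0, 4.5, 6.0)]

# Homogeneity of variances across the three groups
_, p_levene = stats.levene(*groups)

# One-way ANOVA, with Kruskal-Wallis as the non-parametric fallback
f_stat, p_anova = stats.f_oneway(*groups)
h_stat, p_kruskal = stats.kruskal(*groups)
```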
Pearson Correlation pearson_correlation
- Linearity: The relationship between the two variables should be linear.
- Normality (for Significance Testing): Each variable should be (approximately) normally distributed if you want to use significance tests and confidence intervals for r.
- Interval or Ratio Scale: Both variables are typically continuous and measured on an interval or ratio scale.
- Independence of Observations: Each pair of observations comes from independent subjects or units.
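A minimal sketch with SciPy, using synthetic data built to have an approximately linear relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)  # roughly linear by construction

r, p_value = stats.pearsonr(x, y)
```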
Paired t-test paired_ttest
- Normality of Difference Scores: The differences between the paired observations should be (approximately) normally distributed.
- Dependence of Observations: Each pair is taken from the same subject or matched subjects (hence, “paired”).
- No Significant Outliers: Extreme outliers in the difference scores can affect the test.
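A minimal SciPy sketch on synthetic paired measurements; note that the normality check is applied to the difference scores, not the raw values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(loc=10.0, scale=2.0, size=30)
after = before + rng.normal(loc=0.5, scale=1.0, size=30)

# Normality of the difference scores, not the raw measurements
diffs = after - before
_, p_norm = stats.shapiro(diffs)

t_stat, p_value = stats.ttest_rel(after, before)
```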
Chi-Square Test of Independence chi_square_independence
- Independence of Observations: Each subject or unit should be counted only once in the contingency table.
- Expected Cell Frequency: Each cell in the contingency table should have an expected count of at least 5 (rule of thumb for validity of p-values).
- Categorical Variables: Both variables should be categorical (nominal or ordinal).
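The expected-count rule of thumb is easy to verify, since SciPy returns the expected frequencies alongside the test result. A sketch on an illustrative 2x2 table:

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 contingency table of observed counts
table = np.array([[20, 30],
                  [25, 25]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Rule-of-thumb validity check: every expected count should be at least 5
all_cells_ok = bool((expected >= 5).all())
```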
Multiple Regression multiple_regression
- Linearity: The relationship between each predictor and the outcome (dependent variable) is assumed to be linear in the parameters.
- Independence of Errors: Residuals should be independent (often checked by plotting residuals vs. predicted values).
- Homoscedasticity: The variance of residuals is constant across all levels of the predictors (also checked by residual plots).
- Normality of Residuals: The residuals should be (approximately) normally distributed (checked with Q-Q plots).
- Lack of Multicollinearity: Predictors should not be too highly correlated with each other.
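The residual-based checks can be sketched with an ordinary least-squares fit on synthetic data (coefficients and noise scale are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit, then inspect the residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Normality of residuals (the numeric counterpart of a Q-Q plot)
_, p_norm = stats.shapiro(residuals)
```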
Two-Way ANOVA two_way_anova
- Normality: The dependent variable in each group (combination of two independent factors) should be (approximately) normally distributed.
- Homogeneity of Variances: The variances across all factor-level combinations should be equal (Levene’s test is common).
- Independence of Observations: Observations in one factor-level combination are independent from other factor-level combinations and within each combination.
Kaplan-Meier Analysis kaplan_meier
- Random Censoring: Assumes that censoring is non-informative (the reason an individual leaves the study or is censored is independent of their underlying risk).
- Independence of Survival Times: Each subject’s survival time is independent of others.
- Time-to-Event Data: Typically used when the outcome is the time until an event (e.g., death, relapse).
Cox Proportional Hazards Model cox_ph
- Proportional Hazards: The hazard functions for different groups (or at different levels of a covariate) are proportional over time (i.e., hazard ratios remain constant over time).
- Random/Non-Informative Censoring: Similar to Kaplan-Meier, censoring should not be related to the outcome.
- Linearity (for Continuous Covariates): Often assumed that continuous covariates have a log-linear relationship with the hazard.
- Independence of Observations: Each subject’s time-to-event is independent (unless modeling random effects or frailty for clustering).
Poisson Regression poisson
- Count Outcome Variable: The dependent variable is a count (e.g., number of doctor visits).
- Mean-Variance Relationship: The Poisson model assumes the mean and variance are equal. (If variance > mean significantly, a Negative Binomial model might be preferred.)
- Independence of Observations: Each count is assumed to be independent of the others.
- Linearity in the Log Link: The log of the expected count is assumed to be a linear combination of predictors.
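The mean-variance assumption can be screened with a simple dispersion ratio; a sketch on synthetic counts (the rate of 3.0 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
counts = rng.poisson(lam=3.0, size=500)

mean = counts.mean()
variance = counts.var(ddof=1)

# Near 1.0 for Poisson data; values well above 1 suggest overdispersion
# and point toward a Negative Binomial model instead
dispersion = variance / mean
```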
Spearman Correlation spearman
- Monotonic Relationship: The relationship between the two variables should be monotonic (does not have to be linear).
- Ordinal or Interval/Ratio Data: Although often used for ordinal data, Spearman correlation can also handle interval/ratio data that fail Pearson’s normality assumption.
- Independence of Observations: Each pair of observations is assumed independent.
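A minimal SciPy sketch on a relationship that is monotonic but clearly non-linear, exactly the case where Spearman is preferred over Pearson:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0.1, 5.0, size=60)
y = np.exp(x) + rng.normal(scale=1.0, size=60)  # monotonic but non-linear

rho, p_value = stats.spearmanr(x, y)
```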
Wilcoxon Signed-Rank Test wilcoxon_signed_rank
- Paired or Matched Samples: The same subjects measured twice, or matched subjects.
- Ordinal or Continuous Data: Used when data are ordinal or not normally distributed, but we assume differences can be meaningfully ranked.
- Symmetry of Distribution of Differences (Ideal): While not as strict as the normality assumption, it is often assumed that the distribution of differences is symmetrical around the median.
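A minimal SciPy sketch on synthetic paired measurements (locations and scales are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
before = rng.normal(loc=50.0, scale=5.0, size=25)
after = before + rng.normal(loc=2.0, scale=3.0, size=25)

# Signed-rank test on the paired differences
stat, p_value = stats.wilcoxon(after, before)
```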
MANOVA (Multivariate Analysis of Variance) manova
- Multivariate Normality: The combination of dependent variables follows a multivariate normal distribution within each group.
- Homogeneity of Variance-Covariance Matrices: The variance-covariance matrices for the dependent variables are the same in each group (Box’s M test).
- Independence of Observations: Observations across groups (and within groups) are independent.
- No Multicollinearity Among Dependent Variables: If the dependent variables are very highly correlated, MANOVA might not be the best approach.
Suggested test alternatives and explanations
Independent T-Test:
- Suggested Alternatives: Mann-Whitney U test, Welch's t-test
- Explanation:
  - Mann-Whitney U test is a non-parametric alternative that does not assume normality.
  - Welch's t-test is used when the assumption of equal variances is violated.
Repeated Measures ANOVA:
- Suggested Alternatives: Friedman test, Mixed-effects model
- Explanation:
  - Friedman test is a non-parametric alternative for repeated measures.
  - Mixed-effects model can handle violations of sphericity and other complex data structures.
Logistic Regression:
- Suggested Alternatives: Penalized regression (Ridge, Lasso), Decision trees
- Explanation:
  - Penalized regression methods like Ridge and Lasso can handle multicollinearity.
  - Decision trees do not assume linearity or independence of errors.
Pearson Correlation:
- Suggested Alternatives: Spearman rank correlation, Kendall rank correlation, Robust correlation methods
- Explanation:
  - Spearman rank correlation and Kendall rank correlation are non-parametric and do not assume normality.
  - Robust correlation methods can handle outliers.
Paired T-Test:
- Suggested Alternatives: Wilcoxon signed-rank test, Sign test, Randomization test
- Explanation:
  - Wilcoxon signed-rank test is a non-parametric alternative for paired data.
  - Sign test is another non-parametric alternative.
  - Randomization test can be used when assumptions are severely violated.
Chi-Square Test of Independence:
- Suggested Alternatives: Fisher's exact test, G-test of independence, Freeman-Halton test, Log-linear analysis
- Explanation:
  - Fisher's exact test is used for small sample sizes.
  - G-test is an alternative to the chi-square test.
  - Freeman-Halton test extends Fisher's test to larger tables.
  - Log-linear analysis is used for more complex categorical data.
Multiple Linear Regression:
- Suggested Alternatives: Ridge Regression, Lasso Regression, Robust Regression, Quantile Regression, Non-linear regression models
- Explanation:
  - Ridge and Lasso Regression address multicollinearity.
  - Robust Regression handles outliers.
  - Quantile Regression does not assume homoscedasticity.
  - Non-linear regression models are used when linearity is violated.
Two-Way ANOVA:
- Suggested Alternatives: Non-parametric factorial analysis, Robust two-way ANOVA, Aligned Rank Transform ANOVA, Separate non-parametric tests with correction, Mixed-effects model
- Explanation:
  - Non-parametric factorial analysis is used when assumptions are violated.
  - Robust ANOVA methods handle violations of assumptions.
  - Aligned Rank Transform ANOVA is a non-parametric alternative.
  - Mixed-effects model can handle complex designs.
Kaplan-Meier Survival Analysis:
- Suggested Alternatives: Cox Proportional Hazards model, Parametric survival models, Competing risks analysis, Time-varying coefficient models
- Explanation:
  - Cox Proportional Hazards model is more flexible.
  - Parametric survival models assume specific distributions.
  - Competing risks analysis is used when there are competing events.
  - Time-varying coefficient models handle time-dependent covariates.
Cox Proportional Hazards Regression:
- Suggested Alternatives: Stratified Cox model, Time-varying coefficient Cox model, Parametric survival models, Additive hazards models
- Explanation:
  - Stratified Cox model handles non-proportional hazards.
  - Time-varying coefficient models address time-dependent effects.
  - Parametric survival models assume specific distributions.
  - Additive hazards models are an alternative to proportional hazards.
Poisson Regression:
- Suggested Alternatives: Negative Binomial Regression, Zero-inflated Poisson Regression, Zero-inflated Negative Binomial Regression, Quasi-Poisson Regression
- Explanation:
  - Negative Binomial Regression is suitable when there is overdispersion (variance greater than the mean) in count data.
  - Zero-inflated Poisson Regression is used when there are more zeros in the data than expected under a standard Poisson model.
  - Zero-inflated Negative Binomial Regression combines the handling of excess zeros and overdispersion.
  - Quasi-Poisson Regression is another approach to handle overdispersion by adjusting the variance function.
Spearman's Rank Correlation:
- Suggested Alternatives: Kendall's tau, Kendall's tau-b (for ties), Pearson correlation (if relationship is linear), Distance correlation (for non-monotonic relationships)
- Explanation:
  - Kendall's tau is a non-parametric measure of correlation that is less sensitive to ties than Spearman's.
  - Kendall's tau-b is specifically designed to handle ties in the data.
  - Pearson correlation can be used if the relationship is linear and assumptions of normality are met.
  - Distance correlation is a more general measure that can detect both linear and non-linear associations.
Wilcoxon Signed-Rank Test:
- Suggested Alternatives: Sign test (for asymmetric differences), Paired t-test (if differences are normal), Permutation test, Bootstrap methods
- Explanation:
  - Sign test is a simpler non-parametric test that can be used if the differences are not symmetrically distributed.
  - Paired t-test is appropriate if the differences are normally distributed.
  - Permutation test is a non-parametric method that does not rely on distributional assumptions.
  - Bootstrap methods provide a flexible approach to estimate the sampling distribution of the test statistic.
MANOVA (Multivariate Analysis of Variance):
- Suggested Alternatives: Separate univariate ANOVAs with Bonferroni correction, Robust MANOVA, Permutation MANOVA, Non-parametric multivariate tests (e.g., NPMANOVA), Linear Discriminant Analysis
- Explanation:
  - Separate univariate ANOVAs with Bonferroni correction control for Type I error across multiple tests.
  - Robust MANOVA methods handle violations of assumptions such as multivariate normality.
  - Permutation MANOVA is a non-parametric alternative that does not assume normality.
  - Non-parametric multivariate tests like NPMANOVA are used when assumptions are violated.
  - Linear Discriminant Analysis can be used for classification purposes when MANOVA assumptions are not met.
One-Way ANOVA:
- Suggested Alternatives: Kruskal-Wallis H-test, Welch's ANOVA, Brown-Forsythe test
- Explanation:
  - Kruskal-Wallis H-test is a non-parametric alternative that does not assume normality.
  - Welch's ANOVA is used when the assumption of equal variances is violated.
  - Brown-Forsythe test is another alternative for testing equality of means when variances are unequal.
Factorial ANOVA:
- Suggested Alternatives: Non-parametric factorial analysis, Mixed-effects model, Robust ANOVA
- Explanation:
  - Non-parametric factorial analysis is used when assumptions of normality and homoscedasticity are violated.
  - Mixed-effects model can handle complex designs and violations of sphericity.