Mohsen Askar

API Reference

Detailed Function Documentation

Function: codebook

Generates a detailed codebook for an entire DataFrame, or for a single column of it, providing descriptive statistics and data quality checks.

Parameters:

  • df (pandas.DataFrame): The DataFrame to analyze.
  • column (str, optional): If specified, only this column will be analyzed. Defaults to None.
  • advanced (bool, optional): If True, includes additional statistics like standard deviation, confidence intervals, and normality tests. Defaults to False.
  • decimal_places (int, optional): The number of decimal places to round numerical results. Defaults to 3.

Returns:

  • pandas.DataFrame: A DataFrame containing the codebook with descriptive statistics and data quality checks.

Example Usage:

# Generate an advanced codebook for a specific column
codebook(df, column='age', advanced=True, decimal_places=2)

Output (a single row with index 0, shown here transposed for readability):

  Variable                    age
  Type                        float64
  Unique values               4
  Missing values              1
  Blank issues                Not applicable
  Range                       (25.0, 40.0)
  25th percentile             28.75
  50th percentile (Median)    32.5
  75th percentile             36.25
  Mean                        32.5
  Examples                    [35.0, 25.0, 30.0]
  Top categories              -
  SD                          6.45
  95% CI                      (26.18, 38.82)
  Normality test              Shapiro-Wilk
  p-value (normality)         0.97
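
To make the numbers in the example output above easier to check, here is a minimal sketch that recomputes several of them with plain pandas. The input data is hypothetical (the reference does not show the original DataFrame); it was chosen only because it is consistent with the reported values. The confidence interval and normality test are left out, since the exact formulas used by codebook are not stated here.

```python
import pandas as pd

# Hypothetical data, chosen to be consistent with the example output:
# four distinct ages plus one missing value.
df = pd.DataFrame({"age": [25.0, 30.0, 35.0, 40.0, None]})
age = df["age"]

summary = {
    "Unique values": int(age.nunique()),          # distinct non-null values
    "Missing values": int(age.isna().sum()),      # null count
    "Range": (float(age.min()), float(age.max())),
    "25th percentile": float(age.quantile(0.25)),
    "50th percentile (Median)": float(age.median()),
    "75th percentile": float(age.quantile(0.75)),
    "Mean": float(age.mean()),
    "SD": round(float(age.std()), 2),             # sample SD (ddof=1)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

With this input the sketch gives 4 unique values, 1 missing value, a range of (25.0, 40.0), quartiles of 28.75, 32.5, and 36.25, a mean of 32.5, and an SD of 6.45, matching the corresponding columns of the example.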

Notes

If a column contains all missing values, the function will skip detailed analysis for that column and indicate that it is entirely missing. The function automatically handles mixed data types by converting the column to an object type and issuing a warning.
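
The two checks described in these notes can be approximated as follows. This is only a sketch of the idea, not the function's actual implementation: classify_column is a hypothetical helper, and the warning text is invented for illustration.

```python
import warnings

import pandas as pd


def classify_column(series: pd.Series) -> str:
    """Hypothetical helper: label a column 'all missing', 'mixed types', or 'ok'."""
    non_null = series.dropna()
    if non_null.empty:
        # Nothing to analyze: every value is missing.
        return "all missing"
    # More than one Python type among the non-null values means mixed data.
    if len({type(value) for value in non_null}) > 1:
        warnings.warn(f"Column {series.name!r} has mixed types; treating it as object.")
        return "mixed types"
    return "ok"


df = pd.DataFrame({
    "empty": [None, None, None],
    "mixed": [1, "x", 2.5],
    "clean": [1, 2, 3],
})
report = {col: classify_column(df[col]) for col in df.columns}
print(report)
```

Running a check like this before calling codebook shows in advance which columns would be skipped and which would trigger the mixed-type warning.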

Output Explanation:

  • Variable: The name of the variable.
  • Type: The data type of the variable.
  • Unique values: The number of unique non-null values.
  • Missing values: The number of missing (null) values.
  • Blank issues: Any detected issues with leading, trailing, or embedded blanks in string variables.
  • Range: The minimum and maximum values for numeric variables.
  • 25th, 50th, 75th percentile: The respective percentiles for numeric variables.
  • Mean: The mean of numeric variables.
  • SD: The standard deviation for numeric variables (advanced mode).
  • 95% CI: The 95% confidence interval for numeric variables (advanced mode).
  • Normality test: The normality test applied: Shapiro-Wilk for datasets with 5000 or fewer observations, Kolmogorov-Smirnov for larger datasets.
  • p-value (normality): The p-value from the normality test.
  • Top categories: The most frequent categories for categorical variables.
  • Top category proportion: The proportion of the top category for categorical variables (advanced mode).
  • 95% CI (top category): The 95% confidence interval for the top category proportion (advanced mode).
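
Two of the fields above involve a small rule or formula, sketched below in plain Python. The size threshold for the normality test comes from the description above. The interval for the top category proportion uses a Wald (normal approximation) formula, which is an assumption: this reference does not state which interval codebook computes. Both function names are hypothetical.

```python
import math
from collections import Counter


def choose_normality_test(n_obs: int) -> str:
    """Apply the documented size rule for picking a normality test."""
    return "Shapiro-Wilk" if n_obs <= 5000 else "Kolmogorov-Smirnov"


def top_category_ci(values, z: float = 1.96):
    """Hypothetical helper: top category, its proportion, and a Wald 95% CI.

    The Wald interval is an assumption; the method used by codebook
    is not documented in this reference.
    """
    counts = Counter(v for v in values if v is not None)  # ignore missing values
    n = sum(counts.values())
    category, count = counts.most_common(1)[0]
    p = count / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip the interval to the valid range for a proportion.
    return category, p, (max(0.0, p - half_width), min(1.0, p + half_width))


print(choose_normality_test(5000))
print(choose_normality_test(5001))

category, proportion, ci = top_category_ci(["a", "a", "b", "a", None, "c"])
print(category, proportion, ci)
```

For small samples the Wald interval is wide and may hit the 0 or 1 bound, so the values reported by codebook can differ if it uses another method.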

FAQ/Troubleshooting

Q1: The codebook function isn't working for my DataFrame with mixed data types. What should I do?

A: The codebook function automatically detects and converts columns with mixed data types to object (string) type. If you see a warning about mixed types, ensure your data is clean and consistently typed, or allow the function to handle it automatically.

Q2: Why does the function skip some columns?

A: The function skips detailed analysis for any column that contains only missing values (NaN). The output indicates that such a column is entirely missing.

Q3: How can I adjust the number of decimal places for numerical results?

A: You can adjust the decimal precision by setting the decimal_places parameter when calling the codebook function:

codebook(df, advanced=True, decimal_places=2)