Logistic regression#

Introduction#

We’ve spent a good amount of time exploring the intricacies of correlation, simple linear regression, and multiple regression using statsmodels. Now, let’s shift our focus to another powerful technique: logistic regression.

Logistic regression comes into play when our outcome variable (the dependent variable) has two possible categories - think “yes” or “no,” “success” or “failure,” or “present” or “absent”. Why can’t we just use linear regression in these cases? Well, in simple linear regression, we assume the error term follows a normal distribution, which can take on any value from negative to positive infinity. This doesn’t align with a binary outcome.

While we often use logistic regression with multiple independent variables (making it technically “multiple logistic regression”), the core concepts remain the same. Here’s a quick comparison of regression types:

| Type of Regression | Dependent Variable (Y) | Examples |
|:---|:---|:---|
| Linear | Continuous (interval, ratio) | Enzyme activity, renal function, weight |
| Logistic | Binary (dichotomous) | Death during surgery, graduation, recurrence of cancer |
| Proportional Hazards | Elapsed time to event | Months until death, time to graduation |

Definitions#

The maths#

At the heart of logistic regression lies the concept of odds and odds ratios. As we have seen in the chapter on the comparison of proportions, the odds of an event occurring are calculated as the probability of that event happening divided by the probability of it not happening:

\[\text{Odds} = \frac{\text{Probability of event}}{\text{Probability of no event}} = \frac{p}{1-p}\]

The logistic regression model calculates the odds of our outcome based on two key components:

\[\begin{split} \begin{aligned} \text{Odds} &= (\text{Baseline odds}) \times \text{OR}_1 \times \text{OR}_2 \times \dots \times \text{OR}_n\\ &= (\text{Baseline odds}) \times \prod_{i=1}^n \text{OR}_i \end{aligned} \end{split}\]
  1. Baseline odds: these represent the odds of the outcome when all of our independent variables are set to zero. To make this more meaningful, we can center our variables (e.g., by using “Age - 20” instead of just “Age”). This way, our baseline odds would reflect the odds for a 20-year-old individual.

  2. Odds ratios (OR): each independent variable in our model has an associated odds ratio, which tells us how the odds of the outcome change with a one-unit increase in that variable:

    • For continuous variables, the odds ratio tells us how much the odds change for a one-unit increase in that variable.

    • For binary variables, it tells us how much the odds change when the binary variable switches from 0 to 1.

We can interpret the OR as follows:

  • An OR of 1 means no relationship between the variable and the outcome.

  • An OR greater than 1 indicates that the odds increase as the variable increases (or when it switches from 0 to 1 for binary variables).

  • An OR less than 1 indicates that the odds decrease as the variable increases (or when it switches from 0 to 1 for binary variables).

Let’s imagine we’re predicting the likelihood of a certain species of bird nesting in a particular habitat. Our model includes an odds ratio (OR) of 1.05 associated with tree density. This means that for every additional tree per hectare, the odds of a bird nesting in that habitat increase by 5%.

Now, let’s consider another example where we’re predicting the risk of a plant disease. Our model shows an OR of 0.90 for soil pH. This indicates that for every one-unit increase in soil pH, the odds of the plant disease decrease by 10%.

To illustrate how odds ratios work over a range of values, let’s say we have an OR of 0.98 for temperature in predicting insect abundance. We want to compare the odds of insect abundance at 30°C versus 20°C. We can use the following formula:

\[\begin{split} \begin{aligned} \text{OR}_\text{temperature} &= \text{OR}^\text{temperature difference} \\ &= 0.98 \times 0.98 \times \dots \times 0.98 \\ &= 0.98^{30 - 20} \\ &= 0.98^{10} \\ &\approx 0.82 \end{aligned} \end{split}\]

This means the odds of insect abundance at 30°C are about 82% of the odds at 20°C. In other words, the odds are about 18% lower at the higher temperature.
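As a quick numerical sanity check of this calculation (a minimal sketch using the illustrative OR of 0.98 from the example above):

or_per_degree = 0.98              # illustrative odds ratio per 1°C increase
temperature_difference = 30 - 20  # comparing 30°C to 20°C

# Multiplying the per-unit OR over the whole range is the same as exponentiation
or_over_range = or_per_degree ** temperature_difference

print(f"OR for a 10°C increase: {or_over_range:.2f}")  # ≈ 0.82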

Logistic regression#

As we’ve discussed, linear regression assumes the dependent variable (\(Y_i\)) can take any value. But what if \(Y_i\) is binary—for example, representing the presence (\(1\)) or absence (\(0\)) of a disease? This is where logistic regression comes in.

In logistic regression, \(Y_i\) is binary (e.g., 0 or 1). This creates two main issues for linear regression:

  1. Non-linearity: the relationship between our predictors and a binary outcome is usually S-shaped (sigmoidal), not linear.

  2. Unbounded predictions: linear regression can predict values outside the 0 to 1 range, which doesn’t make sense for probabilities.

To address these issues, we introduce the logit transformation. This transformation takes a probability (\(p\)) and maps it from the open interval \((0, 1)\) to a value that can range from negative to positive infinity \((-\infty, +\infty)\). The logit function is:

\[\mathrm{logit}(p_i) = \ln \left(\frac{p_i}{1 - p_i} \right)\]

Where:

  • \(p_i\) is the probability that the event occurs (i.e., that \(Y_i = 1\)) for the \(i\)-th participant, e.g., the probability of a patient having a certain disease.

  • \(\ln\) is the natural logarithm, not the common logarithm (\(\log_{10}\)).

Now, instead of predicting \(Y_i\) directly, we predict the logit of \(p_i\). This allows us to use a linear equation, similar to multiple regression:

\[\mathrm{logit}(p_i) = \ln \left(\frac{p_i}{1 - p_i} \right) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n}\]

Where:

  • \(Y_i\) is the binary dependent variable (what we’re trying to predict), and \(p_i\) is the probability that \(Y_i = 1\).

  • \(X_{i,1}, X_{i,2}, \dots, X_{i,n}\) are the independent variables (our predictors).

  • \(\beta_0\) is the intercept (the log-odds of the outcome when all \(X\) are 0).

  • \(\beta_1, \beta_2, \dots, \beta_n\) are the coefficients (how much the log-odds change for a one-unit change in each \(X\)).

The predicted value, \(\mathrm{logit}(p_i)\), is the natural log of the odds, which can be back-transformed to a probability. Since the probability itself embodies the uncertainty of the binary outcome, there is no need to explicitly add a random error term to the model.

Maximum likelihood estimation#

Maximum likelihood estimation (MLE) is used to estimate the \(\beta\) coefficients in logistic regression. As discussed briefly at the end of the linear regression chapter, MLE is a different approach from the OLS method we used in linear regression:

  • OLS finds the coefficients that minimize the sum of squared errors between the observed and predicted values. It’s a good method when the dependent variable is continuous and the errors are normally distributed.

  • MLE finds the coefficients that maximize the likelihood of observing the data given the model. It’s a more general method that can be used for various types of data and models, including logistic regression where the dependent variable is binary.

In essence, MLE is a more suitable method for estimating the coefficients in logistic regression due to the nature of the outcome variable and the model’s assumptions.

We then apply back-transformation to retrieve the odds and probability:

  • Odds: \(\frac{p_i}{1 - p_i} = e^{Y_i} = e^{\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n}}\)

  • Probability: \(p_i = \frac{1}{1 + e^{-Y_i}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n})}}\)
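As a minimal numerical sketch of the logit transformation and these back-transformations (the probability of 0.3 is an arbitrary value chosen purely for illustration):

import numpy as np

p = 0.3                                  # an arbitrary probability, for illustration only
log_odds = np.log(p / (1 - p))           # logit transformation
odds = np.exp(log_odds)                  # back-transformation to odds
p_back = 1 / (1 + np.exp(-log_odds))     # back-transformation to probability

print(f"logit(p) = {log_odds:.3f}, odds = {odds:.3f}, p = {p_back:.3f}")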

Example#

Let’s say we want to predict the likelihood of a student passing an exam (\(Y = 1\) for pass, \(0\) for fail) based on whether they attended a study session (\(X_1 = 1\) for attended, \(0\) for did not attend) and whether they completed the practice problems (\(X_2 = 1\) for completed, \(0\) for did not complete).

Our logistic regression model might look like this:

\[\ln \left( \frac{p_\text{pass=1}}{1-p_\text{pass=1}} \right) = -1.5 + 0.8 \times \text{Study Session} + 1.2 \times \text{Practice Problems}\]

Here’s how we can interpret the coefficients:

  • Intercept (\(\beta_0 = -1.5\)): this represents the log-odds of passing the exam for a student who did not attend the study session and did not complete the practice problems.

  • Study Session coefficient (\(\beta_1 = 0.8\)): attending the study session increases the log-odds of passing by 0.8.

  • Practice Problems coefficient (\(\beta_2 = 1.2\)): completing the practice problems increases the log-odds of passing by 1.2.

To get the actual probabilities, we would need to apply the transformations discussed earlier (exponentiate to get odds, then use the logistic function to get probability):

\[p = \frac{\text{odds}}{1 + \text{odds}}\]
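For instance, here is a small sketch applying the made-up coefficients above to a student who attended the study session and completed the practice problems:

import numpy as np

# Made-up coefficients from the example above
beta_0, beta_study, beta_practice = -1.5, 0.8, 1.2

# Student who attended the study session (1) and completed the practice problems (1)
log_odds = beta_0 + beta_study * 1 + beta_practice * 1  # = 0.5
odds = np.exp(log_odds)                                 # ≈ 1.65
probability = odds / (1 + odds)                         # ≈ 0.62

print(f"Log-odds: {log_odds:.2f}, odds: {odds:.2f}, probability: {probability:.2f}")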

Assumptions#

Just as with multiple linear regression, logistic regression relies on several assumptions. Violating these can lead to unreliable estimates, invalid P values, and incorrect conclusions.

Linearity#

  • In multiple regression, we assumed linearity between each independent variable and the dependent variable (after considering other predictors).

  • In logistic regression, we assume linearity between each independent variable and the log-odds of the outcome.

To assess this assumption, we examine residual plots (plotted against predicted log-odds) and consider component-component plus residual (CCPR) plots. If we find evidence of non-linearity, we can try transformations of the independent variables, including polynomial terms. In some cases, we might need to consider alternative models if the relationship is fundamentally non-linear.

Independence of errors#

  • In both multiple and logistic regressions, errors (residuals) should be independent of each other. This is often violated in data with time series, repeated measures, or clustered structures.

To assess this, we consider the study design and examine residual plots for patterns. If we suspect non-independence, we need to use appropriate models that account for the data structure, such as time series models, mixed-effects models, or generalized estimating equations (GEEs).

Homoscedasticity#

  • In multiple regression, we assumed constant variance of errors across all levels of the independent variables.

  • In logistic regression, this assumption is not as critical. Due to the binary nature of the outcome, some degree of heteroscedasticity is expected.

We assess this primarily through visual inspection of residual plots. If we detect severe heteroscedasticity, we can consider transformations or robust standard errors.

Normality of errors#

  • In multiple regression, we assumed normally distributed errors, which is especially important for small sample sizes.

  • In logistic regression, this assumption is less critical due to the Central Limit Theorem and the asymptotic properties of maximum likelihood estimation.

We can assess normality using histograms and Q-Q plots of the residuals. If substantial deviations from normality are observed, we might consider transformations or bootstrapping.

No perfect multicollinearity#

No independent variable should be a perfect linear combination of the others. This assumption applies to both multiple and logistic regression.

We can assess multicollinearity by examining correlation matrices and variance inflation factors (VIFs). To address it, we might remove redundant variables, combine variables, or use regularization techniques.
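As an illustration of how VIFs can be computed with statsmodels, here is a minimal sketch; it assumes a design matrix `X` (a DataFrame of predictors that already includes a constant column, as built later in this chapter with `sm.add_constant`):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Return the variance inflation factor for each column of the design matrix X."""
    return pd.DataFrame({
        "variable": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# Example usage (assuming X is a DataFrame of predictors with a constant column):
# print(vif_table(X))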

Real-world example#

Getting the data#

In the previous chapter on multiple regression, we analyzed a study examining the prevalence of mental disorders among male prisoners in France.

The dataset, coming from the study “Prevalence of mental disorders in French prisons for men” by Falissard et al. (2006), is available through a MOOC (Massive Open Online Course) on biostatistics using R offered by Bruno Falissard (one of the study’s authors).

import pandas as pd

# Load the dataset from the URL of the MOOC
data = pd.read_csv(
    "https://lms.fun-mooc.fr/c4x/Paris11/15001/asset/smp2.csv",
    delimiter=';',  # Specify the delimiter as ';'
)

# Display the first 5 rows of the DataFrame
data.head()
age prof duree discip n.enfant n.fratrie ecole separation juge.enfant place ... subst.cons scz.cons char rs ed dr suicide.s suicide.hr suicide.past dur.interv
0 31.0 autre 4.0 0.0 2.0 4 1.0 0.0 0.0 0.0 ... 0 0 1.0 2.0 1.0 1.0 0.0 0.0 0.0 NaN
1 49.0 NaN NaN 0.0 7.0 3 2.0 1.0 0.0 0.0 ... 0 0 1.0 2.0 2.0 1.0 0.0 0.0 0.0 70.0
2 50.0 prof.intermédiaire 5.0 0.0 2.0 2 2.0 0.0 0.0 0.0 ... 0 0 1.0 2.0 3.0 2.0 0.0 0.0 0.0 NaN
3 47.0 ouvrier NaN 0.0 0.0 6 1.0 1.0 0.0 1.0 ... 0 0 1.0 2.0 2.0 2.0 1.0 0.0 0.0 105.0
4 23.0 sans emploi 4.0 1.0 1.0 6 1.0 1.0 NaN 1.0 ... 0 0 1.0 2.0 2.0 2.0 0.0 0.0 1.0 NaN

5 rows × 26 columns

As seen in the previous chapter on multiple regression, the variable names in the dataset are in French, though a comprehensive PDF document explains each variable in detail, as summarized below.

| Variable | Meaning | Units |
|:---|:---|:---|
| age | Age | Years |
| prof | Profession (agriculteur, artisan, cadre, profession intermédiaire, employé, ouvrier, autre, sans emploi) | Categorical |
| dep.cons | Presence of depressive disorder (diagnosed by consensus of two clinicians) | Binary (0: No, 1: Yes) |
| scz.cons | Presence of schizophrenia (diagnosed by consensus of two clinicians) | Binary (0: No, 1: Yes) |
| grav.cons | Severity of the inmate’s psychopathology (1: normal, 2: borderline, 3: mild, 4: moderate, 5: manifest, 6: severe, 7: among the most ill) | Ordinal (1-7) |
| n.enfant | Number of children | Count |
| rs | Novelty seeking (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| ed | Harm avoidance (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| dr | Reward dependence (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| duree | Duration of incarceration (1: Less than 1 month, 2: 1 to 6 months, 3: 6 months to 1 year, 4: 1 to 5 years, 5: 5 years or more) | Ordinal (1-5) |
| discip | Disciplinary action since incarceration | Binary (0: No, 1: Yes) |
| n.fratrie | Number of siblings | Count |
| ecole | Education level (1: no diploma, 2: middle school, 3: CAP, BEP, 4: high school, 5: university) | Ordinal (1-5) |
| separation | Separation from parents for at least 6 months during childhood | Binary (0: No, 1: Yes) |
| juge.enfant | Followed by a children’s judge before age 18 | Binary (0: No, 1: Yes) |
| place | Placement in a home or foster care before age 18 | Binary (0: No, 1: Yes) |
| abus | History of childhood abuse (physical, psychological, or sexual) | Binary (0: No, 1: Yes) |
| ago.cons | Presence of agoraphobia | Binary (0: No, 1: Yes) |
| ptsd.cons | Presence of post-traumatic stress disorder | Binary (0: No, 1: Yes) |
| alc.cons | Presence of alcohol abuse | Binary (0: No, 1: Yes) |
| subst.cons | Presence of substance abuse | Binary (0: No, 1: Yes) |
| char | Intensity of personality disorder (1: absent, 2: mild, 3: moderate, 4: severe) | Ordinal (1-4) |
| suicide.s | Suicide risk score | Score (1-6) |
| suicide.hr | High suicide risk | Binary (0: No, 1: Yes) |
| suicide.past | History of suicide attempt | Binary (0: No, 1: Yes) |
| dur.interv | Duration of the interview | Minutes |

Building the logistic regression model#

We’ll construct a simple logistic regression model to predict the probability of high suicide risk (‘suicide.hr’) based on the presence of childhood abuse (‘abus’). Our model can be expressed as:

\[\text{logit}(p) = \ln \left( \frac{p(\text{suicide.hr = 1})}{1 - p(\text{suicide.hr = 1})} \right) = \beta_0 + \beta_\mathrm{abus} X_\mathrm{abus}\]

where:

  • \(p\) represents the probability of having a high suicide risk.

  • \(\text{logit}(p)\) is the log-odds of having a high suicide risk; modeling the log-odds (rather than \(p\) directly) ensures that the back-transformed probability always falls between 0 and 1.

  • \(\beta_0\) is the intercept, representing the log-odds of having a high suicide risk when there is no history of childhood abuse.

  • \(\beta_\mathrm{abus}\) is the regression coefficient for ‘abus’, representing the change in the log-odds of having a high suicide risk associated with having a history of childhood abuse.

import statsmodels.api as sm

# Drop rows with missing values for `suicide.hr` and `abus`
data_abus = data[['suicide.hr', 'abus']].copy().dropna()

# Define the dependent variable (y) and independent variable (X)
y = data_abus['suicide.hr']
X = data_abus[['abus']]

# Add a constant term to the independent variable matrix
X = sm.add_constant(X)

# Fit the logistic regression model
model_abus = sm.Logit(y, X)  # Using Logit for logistic regression
result_abus = model_abus.fit()     # This uses Maximum Likelihood Estimation (MLE) by default

# Print the model summary
print(result_abus.summary())
Optimization terminated successfully.
         Current function value: 0.494196
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:             suicide.hr   No. Observations:                  753
Model:                          Logit   Df Residuals:                      751
Method:                           MLE   Df Model:                            1
Date:                Tue, 18 Feb 2025   Pseudo R-squ.:                 0.02098
Time:                        08:20:27   Log-Likelihood:                -372.13
converged:                       True   LL-Null:                       -380.11
Covariance Type:            nonrobust   LLR p-value:                 6.494e-05
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.6161      0.115    -14.003      0.000      -1.842      -1.390
abus           0.7688      0.190      4.052      0.000       0.397       1.141
==============================================================================

To make these coefficients more interpretable, we need to convert them to odds ratios.

import numpy as np

# Calculate the odds and probability for the intercept
intercept_odds = np.exp(result_abus.params.loc['const'])
intercept_prob = intercept_odds / (1 + intercept_odds)

# Calculate the odds ratio and confidence interval for 'abus'
abus_coeff_or = np.exp(result_abus.params.loc['abus'])
abus_coeff_ci = np.round(np.exp(result_abus.conf_int().loc['abus']), 2)

# Print the odds and probability for the intercept
print(f"Odds for intercept (no abuse): {intercept_odds:.3f}")
print(f"Probability for intercept (no abuse): {intercept_prob:.3f}")

# Print the odds ratio, confidence interval, and p-value for 'abus'
print(f"\nOdds ratio for 'abus': {abus_coeff_or:3.3f}")
print(f"CI of the OR for 'abus': [{abus_coeff_ci[0]}, {abus_coeff_ci[1]}]")
print(f"P value for 'abus': {result_abus.pvalues.loc['abus']:3.4f}")
Odds for intercept (no abuse): 0.199
Probability for intercept (no abuse): 0.166

Odds ratio for 'abus': 2.157
CI of the OR for 'abus': [1.49, 3.13]
P value for 'abus': 0.0001

Let’s interpret the coefficients and odds ratios:

  • Intercept: the intercept, or constant ‘const’ (-1.616), represents the log-odds of ‘suicide.hr = 1’ (high suicide risk) when there is no history of childhood abuse (‘abus = 0’). Exponentiating this value gives us the odds, which are approximately 0.20:1. This means that when there’s no history of abuse, the odds of having a high suicide risk are about 1 in 5, corresponding to a probability of \(0.199 / (1 + 0.199) \approx 16.6\%\).

  • ‘abus’ coefficient: the coefficient for ‘abus’ (0.7688) represents the change in the log-odds of ‘suicide.hr’ associated with having a history of childhood abuse. Exponentiating this value gives us the odds ratio, which is approximately 2.16. This means that prisoners with a history of childhood abuse have about 2.16 times the odds of having a high suicide risk compared to those without a history of abuse.

  • Confidence intervals: the 95% confidence interval for the ‘abus’ coefficient (0.397 to 1.141) does not include 0, and equivalently the confidence interval of the odds ratio (1.49 to 3.13) does not include 1, confirming that the association between ‘abus’ and ‘suicide.hr’ is statistically significant at a significance level of 0.05.

  • P value: the P value for ‘abus’ is very small (less than 0.0001), which also provides strong evidence to reject the null hypothesis that there is no association between childhood abuse and suicide risk.

Our logistic regression analysis indicates a statistically significant positive association between a history of childhood abuse and high suicide risk among male prisoners. Prisoners with a history of abuse have more than twice the odds of being classified as high suicide risk compared to those without such a history. The table below shows the probability of ‘suicide.hr’ for both values of ‘abus’:

| abus | Log-odds | Odds | Ratio | Proba |
|:---|:---|:---|:---|:---|
| 0 | -1.616 | 0.20 | 1:5 | 0.166 |
| 1 | -0.847 | 0.43 | 1:2.3 | 0.300 |

where \(\text{logit}(p_{\text{abus}=1}) = \beta_0 + \beta_\mathrm{abus} \times 1 = -1.616 + 0.769 = -0.847\).
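These probabilities can also be obtained directly from the fitted model with its `predict()` method; a minimal sketch, where each row of the design matrix is [constant, abus]:

# Predicted probabilities for abus = 0 and abus = 1 (each row is [constant, abus])
probabilities = result_abus.predict([[1, 0], [1, 1]])

print(f"P(suicide.hr = 1 | abus = 0): {probabilities[0]:.3f}")  # ≈ 0.166
print(f"P(suicide.hr = 1 | abus = 1): {probabilities[1]:.3f}")  # ≈ 0.300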

Contingency table#

In this particular scenario, where we analyze two binary variables, we can create a contingency table using pd.crosstab. This will help us visualize the distribution of ‘suicide.hr’ across the different levels of ‘abus’.

# Create the contingency table
contingency_table = pd.crosstab(data_abus['abus'], data_abus['suicide.hr'])

# Print the contingency table
print("Contingency table:")
print(contingency_table.to_markdown(numalign="left", stralign="left"))
Contingency table:
| abus   | 0.0   | 1.0   |
|:-------|:------|:------|
| 0      | 453   | 90    |
| 1      | 147   | 63    |

This table provides a clear overview of the number of individuals in each category of ‘abus’ and ‘suicide.hr’. Next, let’s use the Table2x2 class from statsmodels to calculate and display the odds ratio and its confidence interval.

# Calculate the odds ratio and confidence interval
sm.stats.Table2x2(contingency_table).summary()
|                | Estimate | SE    | LCB   | UCB   | p-value |
|:---------------|:---------|:------|:------|:------|:--------|
| Odds ratio     | 2.157    |       | 1.487 | 3.129 | 0.000   |
| Log odds ratio | 0.769    | 0.190 | 0.397 | 1.141 | 0.000   |
| Risk ratio     | 1.192    |       | 1.083 | 1.312 | 0.000   |
| Log risk ratio | 0.175    | 0.049 | 0.079 | 0.272 | 0.000   |

The odds ratio obtained from the analysis of the contingency table matches the odds ratio computed using the logistic regression model. This is because both methods estimate the same quantity, just with slightly different approaches; see the chapter dealing with the comparison of proportions for more details.
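As a quick sanity check, the odds ratio can also be computed by hand from the counts in the contingency table above:

# Odds of high suicide risk in each group, from the contingency table counts
odds_no_abuse = 90 / 453    # abus = 0: 90 high-risk vs 453 not high-risk
odds_abuse = 63 / 147       # abus = 1: 63 high-risk vs 147 not high-risk

# The ratio of these odds matches the OR estimated by the logistic regression model
odds_ratio_manual = odds_abuse / odds_no_abuse

print(f"Odds ratio from the contingency table: {odds_ratio_manual:.3f}")  # ≈ 2.157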

Logistic regression model with a continuous variable#

Let’s build a simple logistic regression model to predict the probability of high suicide risk (‘suicide.hr’) based on the continuous variable ‘age’.

Our model can be expressed as:

\[\text{logit}(p) = \beta_0 + \beta_\mathrm{age} X_\mathrm{age}\]

where:

  • \(p\) represents the probability of having a high suicide risk.

  • \(\text{logit}(p)\) is the log-odds of having a high suicide risk.

  • \(\beta_0\) is the intercept, representing the log-odds of having a high suicide risk when age is 0 (which may not have practical meaning).

  • \(\beta_\mathrm{age}\) is the regression coefficient for ‘age’, representing the change in the log-odds of having a high suicide risk associated with a one-unit increase in age.

# Drop rows with missing values for `suicide.hr` and `age`
data_age = data[['suicide.hr', 'age']].copy().dropna()

# Define the dependent variable (y) and independent variable (X)
y = data_age['suicide.hr']
X = data_age[['age']]

# Add a constant term to the independent variable matrix
X = sm.add_constant(X)

# Fit the logistic regression model
model_age = sm.Logit(y, X)  # Using Logit for logistic regression
result_age = model_age.fit()     # This uses Maximum Likelihood Estimation (MLE) by default

# Print the model summary
print(result_age.summary())
Optimization terminated successfully.
         Current function value: 0.496980
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:             suicide.hr   No. Observations:                  758
Model:                          Logit   Df Residuals:                      756
Method:                           MLE   Df Model:                            1
Date:                Tue, 18 Feb 2025   Pseudo R-squ.:                 0.01187
Time:                        08:20:27   Log-Likelihood:                -376.71
converged:                       True   LL-Null:                       -381.24
Covariance Type:            nonrobust   LLR p-value:                  0.002621
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.5527      0.288     -1.919      0.055      -1.117       0.012
age           -0.0216      0.007     -2.923      0.003      -0.036      -0.007
==============================================================================
# Calculate the odds ratios for all coefficients
odds_ratios = np.exp(result_age.params)

# Print the odds ratios
print("Odds ratios:")
print(odds_ratios.to_markdown(numalign="left", stralign="left"))
Odds ratios:
|       | 0        |
|:------|:---------|
| const | 0.575416 |
| age   | 0.978666 |

The intercept (-0.5527) represents the log-odds of a prisoner having a high suicide risk when their age is 0. While we can exponentiate this to get an odds value, it doesn’t have a practical interpretation in this context, as age cannot be 0. The intercept mainly serves as a mathematical starting point for the model.

The coefficient for ‘age’ (-0.0216) tells us how the log-odds of high suicide risk change with each one-year increase in age. More importantly, the odds ratio for age (0.979) indicates that for each year older a prisoner is, their odds of being a high suicide risk decrease by about 2% (1 - 0.979 = 0.021). However, the association is relatively weak and the confidence interval is somewhat wide, so we should interpret these findings with caution.
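Following the same logic as the temperature example earlier in this chapter, we can sketch how the fitted coefficient translates into an odds ratio over a 10-year age difference:

# Odds ratio for a 10-year increase in age, derived from the fitted coefficient
or_per_year = np.exp(result_age.params.loc['age'])
or_per_decade = or_per_year ** 10

print(f"OR per additional year of age: {or_per_year:.3f}")    # ≈ 0.979
print(f"OR per additional 10 years:    {or_per_decade:.3f}")  # ≈ 0.81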

Fitting a more complex logistic regression model#

Let’s fit a more complex logistic regression model using a Patsy formula and the statsmodels library. This approach allows us to include multiple independent variables and their interactions; here, we model ‘suicide.hr’ as a function of ‘duree’, ‘discip’, and their interaction.

\[\text{logit}(p) = \beta_0 + \beta_\mathrm{duree} X_\mathrm{duree} + \beta_\mathrm{discip} X_\mathrm{discip} + \beta_\mathrm{discip:duree} X_\mathrm{discip}X_\mathrm{duree}\]

where:

  • \(p\) represents the probability of the outcome.

  • \(\text{logit}(p)\) is the log-odds of the outcome.

  • \(\beta_0\) is the intercept.

  • \(\beta_\mathrm{duree}\) and \(\beta_\mathrm{discip}\) are the coefficients for the individual variables (‘duree’ and ‘discip’).

  • \(\beta_\mathrm{discip:duree}\) is the coefficient for the interaction term between ‘discip’ and ‘duree’.

The interaction term allows us to capture the combined effect of ‘discip’ and ‘duree’ on the outcome.

import statsmodels.formula.api as smf

# Drop rows with missing values for relevant variables
data_interaction = data[['suicide.hr', 'abus', 'discip', 'duree']].copy().dropna()

# Create the Patsy formula using the `C()` function to specify categorical variables
formula_interaction = "Q('suicide.hr') ~ duree + discip*duree"

# Fit the logistic regression model using the Patsy formula
model_interaction = smf.logit(formula_interaction, data=data_interaction)
result_interaction = model_interaction.fit()

# Print the model summary
print(result_interaction.summary2())
Optimization terminated successfully.
         Current function value: 0.491424
         Iterations 6
                         Results: Logit
================================================================
Model:              Logit            Method:           MLE      
Dependent Variable: Q('suicide.hr')  Pseudo R-squared: 0.028    
Date:               2025-02-18 08:20 AIC:              548.5668 
No. Observations:   550              BIC:              565.8065 
Df Model:           3                Log-Likelihood:   -270.28  
Df Residuals:       546              LL-Null:          -277.97  
Converged:          1.0000           LLR p-value:      0.0015221
No. Iterations:     6.0000           Scale:            1.0000   
----------------------------------------------------------------
                  Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
----------------------------------------------------------------
Intercept         0.0737   0.5469  0.1347 0.8928 -0.9983  1.1457
duree            -0.3784   0.1301 -2.9084 0.0036 -0.6334 -0.1234
discip            0.6338   1.2578  0.5039 0.6143 -1.8314  3.0990
discip:duree     -0.0101   0.2895 -0.0350 0.9721 -0.5775  0.5572
================================================================

In the previous formula, the dependent variable (’suicide.hr’) is enclosed in Q() to quote the variable name, as it contains the special character ‘.’.

Moreover, the ‘discip*duree’ term specifies the interaction between ‘discip’ and ‘duree’. The * operator expands to the main effects plus their interaction, so ‘discip*duree’ is equivalent to ‘discip + duree + discip:duree’. In the previous chapter, we used the : operator to specify interaction terms in a similar way; recall that the colon only includes the interaction term itself, not the main effects of the variables involved.

By combining all the approaches we’ve seen in multiple regression - using categories, removing the intercept, and including quadratic polynomial terms - we can create more flexible and complex logistic regression models that capture non-linear relationships and interactions between variables.
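For illustration only, a hypothetical formula combining these ideas might look like the sketch below; the choice of predictors is arbitrary, `C()` declares a categorical variable, `I(age**2)` adds a quadratic term, and `- 1` removes the intercept:

# A hypothetical, more flexible model (sketch only, not fitted in this chapter)
formula_complex = "Q('suicide.hr') ~ C(prof) + age + I(age**2) + abus + discip*duree - 1"

# model_complex = smf.logit(formula_complex, data=data.dropna(subset=['suicide.hr', 'prof', 'age', 'abus', 'discip', 'duree']))
# result_complex = model_complex.fit()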

Conclusion#

In this chapter, we delved into logistic regression, a powerful technique for modeling the relationship between a binary outcome and one or more predictor variables. We built upon our knowledge of simple and multiple linear regression, demonstrating how the logit transformation allows us to adapt linear modeling principles to analyze binary outcomes.

We explored the interpretation of logistic regression coefficients, emphasizing the importance of odds ratios in understanding the effects of predictors on the outcome. We also saw how to use the statsmodels library in Python to fit and interpret logistic regression models, leveraging the flexibility of Patsy formulas to incorporate interactions and categorical variables.

Importantly, we highlighted the close connection between logistic regression and the analysis of contingency tables, showing how both approaches can be used, in some scenarios, to estimate odds ratios and assess associations between variables.

Beyond its applications in traditional statistical analysis, logistic regression plays a crucial role in machine learning, particularly in classification tasks. The predicted probabilities from a logistic regression model can be used to classify observations into different categories based on a chosen threshold. This makes logistic regression a valuable tool for a wide range of applications, from medical diagnosis to image recognition.

With this chapter, we conclude our exploration of fitting models to data. In the next section, we’ll shift our focus to another fundamental statistical technique: Analysis of Variance (ANOVA). ANOVA allows us to compare means across multiple groups and understand the sources of variation in our data.

Cheat sheet#

Building the logistic regression model#

import statsmodels.formula.api as smf

# Create a new DataFrame with only the relevant variables
# ('y', 'X1', and 'X2' are placeholders for the actual column names)
analysis_data = data[['y', 'X1', 'X2']].copy()

# Drop rows with missing values in any of the selected columns
analysis_data = analysis_data.dropna()

# Define the model formula using Patsy syntax (use X1*X2 to include an interaction term)
formula = "y ~ X1 + X2"

# Quote variable names that contain special characters (e.g., periods `.` or hyphens `-`)
# formula = "Q('dur.interv') ~ age + Q('dep.cons') + Q('subst.cons')"

# Define the logistic regression model using the formula and data
model = smf.logit(formula, data=analysis_data)

# Fit the model using maximum likelihood estimation (MLE, the default)
results = model.fit()

# Print the regression results summary
print(results.summary())
# print(results.summary2())

Session information#

The output below details all packages and versions necessary to reproduce the results in this report.

!python --version
print("-------------")

from importlib.metadata import version

# List of packages we want to check the version
packages = ['numpy', 'pandas', 'statsmodels']

# Initialize an empty list to store the versions
versions = []

# Loop over the packages
for package in packages:
    try:
        # Get the version of the package
        package_version = version(package)
        # Append the version to the list
        versions.append(package_version)
    except Exception:  # Use a more general exception for broader compatibility
        versions.append('Not installed')

# Print the versions
for package, version in zip(packages, versions):
    print(f'{package}: {version}')
Python 3.12.9
-------------
numpy: 1.26.4
pandas: 2.2.2
statsmodels: 0.14.2