Logistic regression#
Introduction#
We’ve spent a good amount of time exploring the intricacies of correlation, simple linear regression, and multiple linear regression using statsmodels. Now, let’s shift our focus to a powerful technique called logistic regression.
Logistic regression comes into play when our outcome variable (the dependent variable) has two possible categories - think “yes” or “no,” “success” or “failure,” or “present” or “absent”. Why can’t we just use linear regression in these cases? Well, in simple linear regression, we assume the error term follows a normal distribution, which can take on any value from negative to positive infinity. This doesn’t align with a binary outcome.
While we often use logistic regression with multiple independent variables (making it technically “multiple logistic regression”), the core concepts remain the same. Here’s a quick comparison of regression types:
| Type of Regression | Dependent Variable (Y) | Examples |
|---|---|---|
| Linear | Continuous (interval, ratio) | Enzyme activity, renal function, weight |
| Logistic | Binary (dichotomous) | Death during surgery, graduation, recurrence of cancer |
| Proportional Hazards | Elapsed time to event | Months until death, time to graduation |
Definitions#
The maths#
At the heart of logistic regression lies the concept of odds and odds ratios. As we have seen in the chapter on the comparison of proportions, the odds of an event occurring are calculated as the probability of that event happening divided by the probability of it not happening:
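\[
\text{odds} = \frac{p}{1 - p}
\]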
The logistic regression model calculates the odds of our outcome based on two key components:
Baseline odds: these represent the odds of the outcome when all of our independent variables are set to zero. To make this more meaningful, we can center our variables (e.g., by using “Age - 20” instead of just “Age”). This way, our baseline odds would reflect the odds for a 20-year-old individual.
Odds ratios (OR): each independent variable in our model has an associated odds ratio, which tells us how the odds of the outcome change with a one-unit increase in that variable:
For continuous variables, the odds ratio tells us how much the odds change for a one-unit increase in that variable.
For binary variables, it tells us how much the odds change when the binary variable switches from 0 to 1.
We can interpret the OR as follows:
An OR of 1 means no relationship between the variable and the outcome.
An OR greater than 1 indicates that the odds increase as the variable increases (or when it switches from 0 to 1 for binary variables).
An OR less than 1 indicates that the odds decrease as the variable increases (or when it switches from 0 to 1 for binary variables).
Let’s imagine we’re predicting the likelihood of a certain species of bird nesting in a particular habitat. Our model includes an odds ratio (OR) of 1.05 associated with tree density. This means that for every additional tree per hectare, the odds of a bird nesting in that habitat increase by 5%.
Now, let’s consider another example where we’re predicting the risk of a plant disease. Our model shows an OR of 0.90 for soil pH. This indicates that for every one-unit increase in soil pH, the odds of the plant disease decrease by 10%.
To illustrate how odds ratios work over a range of values, let’s say we have an OR of 0.98 for temperature in predicting insect abundance. We want to compare the odds of insect abundance at 30°C versus 20°C. We can use the following formula:
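\[
\text{OR}_{30 \text{ vs. } 20} = \text{OR}^{(30 - 20)} = 0.98^{10} \approx 0.82
\]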
This means the odds of insect abundance at 30°C are about 82% of the odds at 20°C. In other words, the odds are about 18% lower at the higher temperature.
Logistic regression#
As we’ve discussed, linear regression assumes the dependent variable (\(Y_i\)) can take any value. But what if \(Y_i\) is binary—for example, representing the presence (\(1\)) or absence (\(0\)) of a disease? This is where logistic regression comes in.
In logistic regression, \(Y_i\) is binary (e.g., 0 or 1). This creates two main issues for linear regression:
Non-linearity: the relationship between our predictors and a binary outcome is usually S-shaped (sigmoidal), not linear.
Unbounded predictions: linear regression can predict values outside the 0 to 1 range, which doesn’t make sense for probabilities.
To address these issues, we introduce the logit transformation. This transformation takes a probability (\(p\)) strictly between 0 and 1 and maps it to a value that can range from negative to positive infinity \((-\infty, +\infty)\). The logit function is:
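\[
\text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right)
\]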
Where:
\(p_i\) is the probability that the event occurs (e.g., that a patient has a certain disease), i.e., that \(Y_i = 1\), for the \(i\)-th participant.
\(\ln\) is the natural logarithm, not the common logarithm (\(\log_{10}\))
Now, instead of predicting \(Y_i\) directly, we predict the logit of \(p_i\). This allows us to use a linear equation, similar to multiple regression:
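\[
Y_i = \text{logit}(p_i) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_n X_{i,n}
\]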
Where:
\(Y_i\) is the dependent variable (what we’re trying to predict).
\(X_{i,1}, X_{i,2}, \dots, X_{i,n}\) are the independent variables (our predictors).
\(\beta_0\) is the intercept (the value of \(Y\) when all \(X\) are 0).
\(\beta_1, \beta_2, \dots, \beta_n\) are the coefficients (how much \(Y\) changes for a one-unit change in each \(X\)).
The \(Y_i\) value is the natural log of the odds, which can be back-transformed to a probability. Because the model describes the probability of the outcome directly, the randomness of the binary response is built in, and there is no need to add an explicit error term.
Maximum likelihood estimation#
Maximum likelihood estimation (MLE) is used to estimate the \(\beta\) coefficients in logistic regression. As discussed briefly at the end of the linear regression chapter, MLE is a different approach from the ordinary least squares (OLS) method we used in linear regression:
OLS finds the coefficients that minimize the sum of squared errors between the observed and predicted values. It’s a good method when the dependent variable is continuous and the errors are normally distributed.
MLE finds the coefficients that maximize the likelihood of observing the data given the model. It’s a more general method that can be used for various types of data and models, including logistic regression where the dependent variable is binary.
In essence, MLE is a more suitable method for estimating the coefficients in logistic regression due to the nature of the outcome variable and the model’s assumptions.
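For reference, in the binary case the quantity that MLE maximizes is the log-likelihood of the observed outcomes given the predicted probabilities:
\[
\ln L(\beta_0, \dots, \beta_n) = \sum_{i=1}^{N} \left[ Y_i \ln(p_i) + (1 - Y_i) \ln(1 - p_i) \right]
\]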
We then apply back-transformation to retrieve the odds and probability:
Odds: \(\frac{p_i}{1 - p_i} = e^{Y_i} = e^{\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n}}\)
Probability: \(p_i = \frac{1}{1 + e^{-Y_i}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n})}}\)
Example#
Let’s say we want to predict the likelihood of a student passing an exam (\(Y = 1\) for pass, \(0\) for fail) based on whether they attended a study session (\(X_1 = 1\) for attended, \(0\) for did not attend) and whether they completed the practice problems (\(X_2 = 1\) for completed, \(0\) for did not complete).
Our logistic regression model might look like this:
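\[
\text{logit}(p) = -1.5 + 0.8 \, X_1 + 1.2 \, X_2
\]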
Here’s how we can interpret the coefficients:
Intercept (\(\beta_0 = -1.5\)): this represents the log-odds of passing the exam for a student who did not attend the study session and did not complete the practice problems.
Study Session coefficient (\(\beta_1 = 0.8\)): attending the study session increases the log-odds of passing by 0.8.
Practice Problems coefficient (\(\beta_2 = 1.2\)): completing the practice problems increases the log-odds of passing by 1.2.
To get the actual probabilities, we would need to apply the transformations discussed earlier (exponentiate to get odds, then use the logistic function to get probability):
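For instance, for a student who both attended the study session and completed the practice problems (\(X_1 = X_2 = 1\)):

\[
\text{logit}(p) = -1.5 + 0.8 + 1.2 = 0.5, \qquad
\text{odds} = e^{0.5} \approx 1.65, \qquad
p = \frac{1.65}{1 + 1.65} \approx 0.62
\]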
Assumptions#
Just as with multiple linear regression, logistic regression relies on several assumptions. Violating these can lead to unreliable estimates, invalid P values, and incorrect conclusions.
Linearity#
In multiple regression, we assumed linearity between each independent variable and the dependent variable (after considering other predictors).
In logistic regression, we assume linearity between each independent variable and the log-odds of the outcome.
To assess this assumption, we examine residual plots (plotted against predicted log-odds) and consider component-component plus residual (CCPR) plots. If we find evidence of non-linearity, we can try transformations of the independent variables, including polynomial terms. In some cases, we might need to consider alternative models if the relationship is fundamentally non-linear.
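As a minimal sketch (assuming a fitted statsmodels Logit results object named result, and matplotlib for plotting), a residual plot against the predicted log-odds could be produced as follows:

import matplotlib.pyplot as plt
# Pearson residuals plotted against the linear predictor (predicted log-odds)
plt.scatter(result.fittedvalues, result.resid_pearson, alpha=0.5)
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel("Predicted log-odds")
plt.ylabel("Pearson residual")
plt.show()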
Independence of errors#
In both multiple and logistic regressions, errors (residuals) should be independent of each other. This is often violated in data with time series, repeated measures, or clustered structures.
To assess this, we consider the study design and examine residual plots for patterns. If we suspect non-independence, we need to use appropriate models that account for the data structure, such as time series models, mixed-effects models, or generalized estimating equations (GEEs).
Homoscedasticity#
In multiple regression, we assumed constant variance of errors across all levels of the independent variables.
In logistic regression, this assumption is much less of a concern: the variance of a binary outcome depends on its probability (\(p(1-p)\)), so some degree of heteroscedasticity is expected by construction.
We assess this primarily through visual inspection of residual plots. If we detect severe heteroscedasticity, we can consider transformations or robust standard errors.
Normality of errors#
In multiple regression, we assumed normally distributed errors, which is especially important for small sample sizes.
In logistic regression, this assumption is less critical due to the Central Limit Theorem and the asymptotic properties of maximum likelihood estimation.
We can assess normality using histograms and Q-Q plots of the residuals. If substantial deviations from normality are observed, we might consider transformations or bootstrapping.
No perfect multicollinearity#
No independent variable should be a perfect linear combination of the others. This assumption applies to both multiple and logistic regression.
We can assess multicollinearity by examining correlation matrices and variance inflation factors (VIFs). To address it, we might remove redundant variables, combine variables, or use regularization techniques.
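As an illustrative sketch (assuming a DataFrame X of numeric predictors, as constructed later in this chapter), VIFs can be computed with statsmodels as follows:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Add the constant so the VIFs are computed on the same design matrix as the model
X_vif = sm.add_constant(X)
# One VIF per column; values above roughly 5-10 suggest problematic collinearity
vifs = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vifs)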
Real-world example#
Getting the data#
In the previous chapter on multiple regression, we analyzed a study examining the prevalence of mental disorders among male prisoners in France.
The dataset, coming from the study “Prevalence of mental disorders in French prisons for men” by Falissard et al. (2006), is available through a MOOC (Massive Open Online Course) on biostatistics using R offered by Bruno Falissard (one of the study’s authors).
import pandas as pd
# Load the dataset from the URL of the MOOC
data = pd.read_csv(
"https://lms.fun-mooc.fr/c4x/Paris11/15001/asset/smp2.csv",
delimiter=';', # Specify the delimiter as ';'
)
# Display the first 5 rows of the DataFrame
data.head()
| | age | prof | duree | discip | n.enfant | n.fratrie | ecole | separation | juge.enfant | place | ... | subst.cons | scz.cons | char | rs | ed | dr | suicide.s | suicide.hr | suicide.past | dur.interv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31.0 | autre | 4.0 | 0.0 | 2.0 | 4 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN |
| 1 | 49.0 | NaN | NaN | 0.0 | 7.0 | 3 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 70.0 |
| 2 | 50.0 | prof.intermédiaire | 5.0 | 0.0 | 2.0 | 2 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1.0 | 2.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | NaN |
| 3 | 47.0 | ouvrier | NaN | 0.0 | 0.0 | 6 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0 | 0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 105.0 |
| 4 | 23.0 | sans emploi | 4.0 | 1.0 | 1.0 | 6 | 1.0 | 1.0 | NaN | 1.0 | ... | 0 | 0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | NaN |

5 rows × 26 columns
As seen in the previous chapter on multiple regression, the variable names in the dataset are in French, though a comprehensive PDF document explains each variable in detail, as summarized below.
| Variable | Meaning | Units |
|---|---|---|
| age | Age | Years |
| prof | Profession (agriculteur, artisan, cadre, profession intermédiaire, employé, ouvrier, autre, sans emploi) | Categorical |
| dep.cons | Presence of depressive disorder (diagnosed by consensus of two clinicians) | Binary (0: No, 1: Yes) |
| scz.cons | Presence of schizophrenia (diagnosed by consensus of two clinicians) | Binary (0: No, 1: Yes) |
| grav.cons | Severity of the inmate’s psychopathology (1: normal, 2: borderline, 3: mild, 4: moderate, 5: manifest, 6: severe, 7: among the most ill) | Ordinal (1-7) |
| n.enfant | Number of children | Count |
| rs | Novelty seeking (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| ed | Harm avoidance (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| dr | Reward dependence (1: low, 2: moderate, 3: high) | Ordinal (1-3) |
| duree | Duration of incarceration (1: less than 1 month, 2: 1 to 6 months, 3: 6 months to 1 year, 4: 1 to 5 years, 5: 5 years or more) | Ordinal (1-5) |
| discip | Disciplinary action since incarceration | Binary (0: No, 1: Yes) |
| n.fratrie | Number of siblings | Count |
| ecole | Education level (1: no diploma, 2: middle school, 3: CAP, BEP, 4: high school, 5: university) | Ordinal (1-5) |
| separation | Separation from parents for at least 6 months during childhood | Binary (0: No, 1: Yes) |
| juge.enfant | Followed by a children’s judge before age 18 | Binary (0: No, 1: Yes) |
| place | Placement in a home or foster care before age 18 | Binary (0: No, 1: Yes) |
| abus | History of childhood abuse (physical, psychological, or sexual) | Binary (0: No, 1: Yes) |
| ago.cons | Presence of agoraphobia | Binary (0: No, 1: Yes) |
| ptsd.cons | Presence of post-traumatic stress disorder | Binary (0: No, 1: Yes) |
| alc.cons | Presence of alcohol abuse | Binary (0: No, 1: Yes) |
| subst.cons | Presence of substance abuse | Binary (0: No, 1: Yes) |
| char | Intensity of personality disorder (1: absent, 2: mild, 3: moderate, 4: severe) | Ordinal (1-4) |
| suicide.s | Suicide risk score | Score (1-6) |
| suicide.hr | High suicide risk | Binary (0: No, 1: Yes) |
| suicide.past | History of suicide attempt | Binary (0: No, 1: Yes) |
| dur.interv | Duration of the interview | Minutes |
Building the logistic regression model#
We’ll construct a simple logistic regression model to predict the probability of high suicide risk (‘suicide.hr’) based on the presence of childhood abuse (‘abus’). Our model can be expressed as:
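\[
\text{logit}(p) = \beta_0 + \beta_\mathrm{abus} \times \text{abus}
\]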
where:
\(p\) represents the probability of having a high suicide risk.
\(\text{logit}(p)\) is the log-odds of having a high suicide risk; modelling on the logit scale ensures the predicted probability stays between 0 and 1.
\(\beta_0\) is the intercept, representing the log-odds of having a high suicide risk when there is no history of childhood abuse.
\(\beta_\mathrm{abus}\) is the regression coefficient for ‘abus’, representing the change in the log-odds of having a high suicide risk associated with having a history of childhood abuse.
import statsmodels.api as sm
# Drop rows with missing values for `suicide.hr` and `abus`
data_abus = data[['suicide.hr', 'abus']].copy().dropna()
# Define the dependent variable (y) and independent variable (X)
y = data_abus['suicide.hr']
X = data_abus[['abus']]
# Add a constant term to the independent variable matrix
X = sm.add_constant(X)
# Fit the logistic regression model
model_abus = sm.Logit(y, X) # Using Logit for logistic regression
result_abus = model_abus.fit() # This uses Maximum Likelihood Estimation (MLE) by default
# Print the model summary
print(result_abus.summary())
Optimization terminated successfully.
Current function value: 0.494196
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: suicide.hr No. Observations: 753
Model: Logit Df Residuals: 751
Method: MLE Df Model: 1
Date: Tue, 18 Feb 2025 Pseudo R-squ.: 0.02098
Time: 08:20:27 Log-Likelihood: -372.13
converged: True LL-Null: -380.11
Covariance Type: nonrobust LLR p-value: 6.494e-05
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.6161 0.115 -14.003 0.000 -1.842 -1.390
abus 0.7688 0.190 4.052 0.000 0.397 1.141
==============================================================================
To make these coefficients more interpretable, we need to convert them to odds ratios.
import numpy as np
# Calculate the odds and probability for the intercept
intercept_odds = np.exp(result_abus.params.loc['const'])
intercept_prob = intercept_odds / (1 + intercept_odds)
# Calculate the odds ratio and confidence interval for 'abus'
abus_coeff_or = np.exp(result_abus.params.loc['abus'])
abus_coeff_ci = np.round(np.exp(result_abus.conf_int().loc['abus']), 2)
# Print the odds and probability for the intercept
print(f"Odds for intercept (no abuse): {intercept_odds:.3f}")
print(f"Probability for intercept (no abuse): {intercept_prob:.3f}")
# Print the odds ratio, confidence interval, and p-value for 'abus'
print(f"\nOdds ratio for 'abus': {abus_coeff_or:3.3f}")
print(f"CI of the OR for 'abus': [{abus_coeff_ci[0]}, {abus_coeff_ci[1]}]")
print(f"P value for 'abus': {result_abus.pvalues.loc['abus']:3.4f}")
Odds for intercept (no abuse): 0.199
Probability for intercept (no abuse): 0.166
Odds ratio for 'abus': 2.157
CI of the OR for 'abus': [1.49, 3.13]
P value for 'abus': 0.0001
Let’s interpret the coefficients and odds ratios:
Intercept: the intercept, or constant ‘const’ (-1.616) represents the log-odds of ‘suicide.hr = 1’ (high suicide risk) when there is no history of childhood abuse (‘abus = 0’). Exponentiating this value gives us the odds, which is approximately 0.20:1. This means that when there’s no history of abuse, the odds of having a high suicide risk are about 1 in 5, or a probability of \(0.20 / (1 + 0.20) = 16.6\%\).
‘abus’ coefficient: the coefficient for ‘abus’ (0.7688) represents the change in the log-odds of ‘suicide.hr’ associated with having a history of childhood abuse. Exponentiating this value gives us the odds ratio, which is approximately 2.16. This means that prisoners with a history of childhood abuse have about 2.16 times the odds of having a high suicide risk compared to those without a history of abuse.
Confidence intervals: the 95% confidence interval for the ‘abus’ coefficient (0.397 to 1.141) does not include 0, and the corresponding confidence interval for the odds ratio (1.49 to 3.13) does not include 1. This confirms that the association between ‘abus’ and ‘suicide.hr’ is statistically significant at a significance level of 0.05.
P value: the P value for ‘abus’ is very small (less than 0.0001), which also provides strong evidence to reject the null hypothesis that there is no association between childhood abuse and suicide risk.
Our logistic regression analysis indicates a statistically significant positive association between a history of childhood abuse and high suicide risk among male prisoners. Prisoners with a history of abuse have more than twice the odds of being classified as high suicide risk compared to those without such a history. The table below shows the probability of ‘suicide.hr’ for both values of ‘abus’:
| abus | Log-odds | Odds | Ratio | Probability |
|---|---|---|---|---|
| 0 | -1.616 | 0.20 | 1:5 | 0.166 |
| 1 | -0.847 | 0.43 | 1:2.3 | 0.300 |
where \(\text{logit}(p_{\text{abus}=1}) = \beta_0 + \beta_\mathrm{abus} \times 1 = -1.616 + 0.769 = -0.847\).
Contingency table#
In that particular scenario where we analyze 2 binary variables, we can create a contingency table using pd.crosstab
. This will help us visualize the distribution of ‘suicide.hr’ across the different levels of ‘abus’.
# Create the contingency table
contingency_table = pd.crosstab(data_abus['abus'], data_abus['suicide.hr'])
# Print the contingency table
print("Contingency table:")
print(contingency_table.to_markdown(numalign="left", stralign="left"))
Contingency table:
| abus | 0.0 | 1.0 |
|:-------|:------|:------|
| 0 | 453 | 90 |
| 1 | 147 | 63 |
This table provides a clear overview of the number of individuals in each category of ‘abus’ and ‘suicide.hr’. Next, let’s use the Table2x2 class from statsmodels to calculate and display the odds ratio and its confidence interval.
# Calculate the odds ratio and confidence interval
sm.stats.Table2x2(contingency_table).summary()
| | Estimate | SE | LCB | UCB | p-value |
|---|---|---|---|---|---|
| Odds ratio | 2.157 | | 1.487 | 3.129 | 0.000 |
| Log odds ratio | 0.769 | 0.190 | 0.397 | 1.141 | 0.000 |
| Risk ratio | 1.192 | | 1.083 | 1.312 | 0.000 |
| Log risk ratio | 0.175 | 0.049 | 0.079 | 0.272 | 0.000 |
The odds ratio obtained from the contingency table analysis matches the odds ratio computed with the logistic regression model: both methods estimate the same quantity, even though they arrive at it through different computations. Refer to the chapter on the comparison of proportions for more details.
Logistic regression model with a continuous variable#
Let’s build a simple logistic regression model to predict the probability of high suicide risk (‘suicide.hr’) based on the continuous variable ‘age’.
Our model can be expressed as:
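\[
\text{logit}(p) = \beta_0 + \beta_\mathrm{age} \times \text{age}
\]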
where:
\(p\) represents the probability of having a high suicide risk.
\(\text{logit}(p)\) is the log-odds of having a high suicide risk.
\(\beta_0\) is the intercept, representing the log-odds of having a high suicide risk when age is 0 (which may not have practical meaning).
\(\beta_\mathrm{age}\) is the regression coefficient for ‘age’, representing the change in the log-odds of having a high suicide risk associated with a one-unit increase in age.
# Drop rows with missing values for `suicide.hr` and `age`
data_age = data[['suicide.hr', 'age']].copy().dropna()
# Define the dependent variable (y) and independent variable (X)
y = data_age['suicide.hr']
X = data_age[['age']]
# Add a constant term to the independent variable matrix
X = sm.add_constant(X)
# Fit the logistic regression model
model_age = sm.Logit(y, X) # Using Logit for logistic regression
result_age = model_age.fit() # This uses Maximum Likelihood Estimation (MLE) by default
# Print the model summary
print(result_age.summary())
Optimization terminated successfully.
Current function value: 0.496980
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: suicide.hr No. Observations: 758
Model: Logit Df Residuals: 756
Method: MLE Df Model: 1
Date: Tue, 18 Feb 2025 Pseudo R-squ.: 0.01187
Time: 08:20:27 Log-Likelihood: -376.71
converged: True LL-Null: -381.24
Covariance Type: nonrobust LLR p-value: 0.002621
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.5527 0.288 -1.919 0.055 -1.117 0.012
age -0.0216 0.007 -2.923 0.003 -0.036 -0.007
==============================================================================
# Calculate the odds ratios for all coefficients
odds_ratios = np.exp(result_age.params)
# Print the odds ratios
print("Odds ratios:")
print(odds_ratios.to_markdown(numalign="left", stralign="left"))
Odds ratios:
| | 0 |
|:------|:---------|
| const | 0.575416 |
| age | 0.978666 |
The intercept (-0.5527) represents the log-odds of a prisoner having a high suicide risk when their age is 0. While we can exponentiate this to get an odds value, it doesn’t have a practical interpretation in this context, as age cannot be 0. The intercept mainly serves as a mathematical starting point for the model.
The coefficient for ‘age’ (-0.0216) tells us how the log-odds of high suicide risk change with each one-year increase in age. More importantly, the odds ratio for age (0.979) indicates that for each year older a prisoner is, their odds of being a high suicide risk decrease by about 2% (1 - 0.979 = 0.021). However, the association is relatively weak and the confidence interval is somewhat wide, so we should interpret these findings with caution.
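As with the temperature example earlier in this chapter, we can apply the odds ratio over a wider range; for a 10-year age difference, for instance:

\[
\text{OR}_{10\ \text{years}} = 0.979^{10} \approx 0.81
\]

so the odds of high suicide risk are roughly 19% lower for a prisoner who is 10 years older.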
Fitting a more complex logistic regression model#
Let’s fit a more complex logistic regression model using the Patsy formula and the statsmodels
library. This approach allows us to include multiple independent variables and their interactions.
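\[
\text{logit}(p) = \beta_0 + \beta_\mathrm{duree} \times \text{duree} + \beta_\mathrm{discip} \times \text{discip} + \beta_\mathrm{discip:duree} \times (\text{discip} \times \text{duree})
\]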
where:
\(p\) represents the probability of the outcome.
\(\text{logit}(p)\) is the log-odds of the outcome.
\(\beta_0\) is the intercept.
\(\beta_\mathrm{duree}\) and \(\beta_\mathrm{discip}\) are the coefficients for the individual variables (‘duree’ and ‘discip’).
\(\beta_\mathrm{discip:duree}\) is the coefficient for the interaction term between ‘discip’ and ‘duree’.
The interaction term allows us to capture the combined effect of ‘discip’ and ‘duree’ on the outcome.
import statsmodels.formula.api as smf
# Drop rows with missing values for relevant variables
data_interaction = data[['suicide.hr', 'abus', 'discip', 'duree']].copy().dropna()
# Create the Patsy formula; Q() quotes variable names that contain special characters
formula_interaction = "Q('suicide.hr') ~ duree + discip*duree"
# Fit the logistic regression model using the Patsy formula
model_interaction = smf.logit(formula_interaction, data=data_interaction)
result_interaction = model_interaction.fit()
# Print the model summary
print(result_interaction.summary2())
Optimization terminated successfully.
Current function value: 0.491424
Iterations 6
Results: Logit
================================================================
Model: Logit Method: MLE
Dependent Variable: Q('suicide.hr') Pseudo R-squared: 0.028
Date: 2025-02-18 08:20 AIC: 548.5668
No. Observations: 550 BIC: 565.8065
Df Model: 3 Log-Likelihood: -270.28
Df Residuals: 546 LL-Null: -277.97
Converged: 1.0000 LLR p-value: 0.0015221
No. Iterations: 6.0000 Scale: 1.0000
----------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
----------------------------------------------------------------
Intercept 0.0737 0.5469 0.1347 0.8928 -0.9983 1.1457
duree -0.3784 0.1301 -2.9084 0.0036 -0.6334 -0.1234
discip 0.6338 1.2578 0.5039 0.6143 -1.8314 3.0990
discip:duree -0.0101 0.2895 -0.0350 0.9721 -0.5775 0.5572
================================================================
In the previous formula, the dependent variable (‘suicide.hr’) is enclosed in Q() to quote the variable name, as it contains the special character ‘.’.
Moreover, the ‘discip*duree’ term specifies the interaction between ‘discip’ and ‘duree’. The * operator expands to the main effects of the variables plus their interaction. In the previous chapter, we used the : operator to specify interaction terms in a similar way; recall that the colon includes only the interaction term itself, not the main effects of the variables involved.
By combining all the approaches we’ve seen in multiple regression - using categories, removing the intercept, and including quadratic polynomial terms - we can create more flexible and complex logistic regression models that capture non-linear relationships and interactions between variables.
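For illustration only, a hypothetical formula (not fitted in this chapter) combining these ideas might look like the following, where C() encodes ‘prof’ as categorical, I(age**2) adds a quadratic term, * expands to main effects plus their interaction, and - 1 removes the intercept:

# Hypothetical example formula combining categories, a quadratic term, an interaction,
# and removal of the intercept (not fitted here)
formula_complex = "Q('suicide.hr') ~ C(prof) + age + I(age**2) + Q('abus')*duree - 1"
# model_complex = smf.logit(formula_complex, data=data)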
Conclusion#
In this chapter, we delved into logistic regression, a powerful technique for modeling the relationship between a binary outcome and one or more predictor variables. We built upon our knowledge of simple and multiple linear regression, demonstrating how the logit transformation allows us to adapt linear modeling principles to analyze binary outcomes.
We explored the interpretation of logistic regression coefficients, emphasizing the importance of odds ratios in understanding the effects of predictors on the outcome. We also saw how to use the statsmodels
library in Python to fit and interpret logistic regression models, leveraging the flexibility of Patsy formulas to incorporate interactions and categorical variables.
Importantly, we highlighted the close connection between logistic regression and the analysis of contingency tables, showing how both approaches can be used, in some scenarios, to estimate odds ratios and assess associations between variables.
Beyond its applications in traditional statistical analysis, logistic regression plays a crucial role in machine learning, particularly in classification tasks. The predicted probabilities from a logistic regression model can be used to classify observations into different categories based on a chosen threshold. This makes logistic regression a valuable tool for a wide range of applications, from medical diagnosis to image recognition.
With this chapter, we conclude our exploration of fitting models to data. In the next section, we’ll shift our focus to another fundamental statistical technique: Analysis of Variance (ANOVA). ANOVA allows us to compare means across multiple groups and understand the sources of variation in our data.
Cheat sheet#
Building the logistic regression model#
import statsmodels.formula.api as smf
# Create a new DataFrame with only the relevant variables
analysis_data = data[['y', 'X1', 'X2']].copy()
# Drop rows with missing values in any of the selected columns
analysis_data = analysis_data.dropna()
# Define the model formula using Patsy syntax (use `X1*X2` to add an interaction term)
formula = "y ~ X1 + X2"
# Quote variable names that contain special characters (e.g., periods `.` or hyphens `-`)
# formula = "Q('suicide.hr') ~ age + Q('dep.cons') + Q('subst.cons')"
# Build the logistic regression model (fitted via MLE)
model = smf.logit(formula, data=analysis_data)
# Fit the model
results = model.fit()
# Print the regression results summary
print(results.summary())
# print(results.summary2())
Session information#
The output below details all packages and versions necessary to reproduce the results in this report.
!python --version
print("-------------")
from importlib.metadata import version
# List of packages we want to check the version
packages = ['numpy', 'pandas', 'statsmodels']
# Initialize an empty list to store the versions
versions = []
# Loop over the packages
# Loop over the packages
for package in packages:
    try:
        # Get the version of the package
        package_version = version(package)
        # Append the version to the list
        versions.append(package_version)
    except Exception:  # Use a more general exception for broader compatibility
        versions.append('Not installed')
# Print the versions
for package, version in zip(packages, versions):
    print(f'{package}: {version}')
Python 3.12.9
-------------
numpy: 1.26.4
pandas: 2.2.2
statsmodels: 0.14.2