Why conditional logistic regression

The regression is fitted by maximising the natural logarithm of the conditional likelihood function using Newton-Raphson iteration, as described by Krailo et al. The example uses artificially matched data from a study of the risk factors associated with low birth weight in Massachusetts. To analyse these data using StatsDirect, first open the test workbook using the file open function of the file menu.

Then select Conditional Logistic from the Regression and Correlation section of the analysis menu. Then select "LBWT" when asked for the case-control indicator.

You may infer from the results above that hypertension, smoking status and previous pre-term delivery are convincing predictors of low birth weight in the population studied.

Note that selecting predictors for regression models such as this can be complex and is best done with the help of a statistician. The optimal selection of predictors depends not only upon their numerical performance in the model, with or without appropriate transformations or study of interactions, but also upon their biophysical importance in the study.
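The fitting step described above, maximising the log conditional likelihood by Newton-Raphson iteration, can be illustrated with a minimal sketch. It assumes a single predictor and one case per matched set. The function name `fit_clogit_1d` and the data are hypothetical, and this is not the StatsDirect implementation or the exact procedure of Krailo et al.

```python
import math

def fit_clogit_1d(strata, tol=1e-10, max_iter=50):
    """Newton-Raphson fit of a one-predictor conditional logistic model.

    strata: list of (x_case, [x_control, ...]), one case per matched set.
    Returns the estimated log odds ratio for a one-unit increase in x.
    """
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0   # first derivative of the log conditional likelihood
        info = 0.0   # minus the second derivative (observed information)
        for x_case, x_controls in strata:
            xs = [x_case] + list(x_controls)
            shift = max(beta * x for x in xs)          # for numerical stability
            weights = [math.exp(beta * x - shift) for x in xs]
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / total
            mean_sq = sum(w * x * x for w, x in zip(weights, xs)) / total
            grad += x_case - mean
            info += mean_sq - mean * mean
        if info == 0.0:
            break  # no informative (discordant) sets: estimate is undefined
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical 1:1 matched pairs with a binary exposure (illustration only):
# 6 pairs with only the case exposed, 3 pairs with only the control exposed,
# plus concordant pairs, which do not contribute to the estimate.
strata = [(1, [0])] * 6 + [(0, [1])] * 3 + [(1, [1])] * 2 + [(0, [0])] * 4
beta = fit_clogit_1d(strata)
odds_ratio = math.exp(beta)
```

For 1:1 matched pairs with a binary exposure, the conditional maximum likelihood odds ratio equals the ratio of the two kinds of discordant pairs, here 6/3 = 2, which gives a quick check on the iteration.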

References

Reid S, Tibshirani R: Regularization paths for conditional logistic regression: The clogitl1 package. Journal of Statistical Software.
Greenland S: Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators.
Agresti A, Min Y: Effects and non-effects of paired identical observations in comparing proportions with binary matched-pairs data.
Bartolucci F: On the conditional logistic estimator in two-arm experimental studies with non-compliance and before-after binary outcomes.
Heinze G, Puhr R: Bias-reduced and separation-proof conditional logistic regression with small or sparse data sets.
Lokhorst J: The lasso and generalised linear models. Honors project.
Rosset S, Zhu J: Piecewise linear regularized solution paths.
Park MY, Hastie T: l1-regularization path algorithm for generalized linear models.
Avalos M, Grandvalet Y, Pouyes H, Orriols L, Lagarde E: High-dimensional sparse matched case-control and case-crossover data: A review of recent works, description of an R tool and an illustration of the use in epidemiological studies. Computational Intelligence Methods for Bioinformatics and Biostatistics. Edited by: Formenti, E.
Gui J, Li H: Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data.
Engler D, Li Y: Survival analysis with high-dimensional covariates: An application in microarray studies. Statistical Applications in Genetics and Molecular Biology.
Goeman J: l1 penalized estimation in the Cox proportional hazards model. Biometrical Journal.
Engeland A, Skurtveit S, Morland J: Risk of road traffic accidents associated with the prescription of drugs: A registry-based cohort study. Annals of Epidemiology.
Karjalainen K, Blencowe T, Lillsunde P: Substance use and social, health and safety-related factors among fatally injured drivers. Accid Anal Prev.

Correspondence to Marta Avalos.

Authors' contributions: MA developed the algorithms and revised the R code, performed the analysis on the datasets and wrote the manuscript. HP developed the R code. LO helped collect the data, performed the analysis on the datasets, interpreted the results of the analysis and conducted the epidemiological literature review. EL designed and supervised the epidemiological research, collected the data, and interpreted the results of the analysis. All authors read and approved the final manuscript.

This article is published under license to BioMed Central Ltd.

Avalos, M. Sparse conditional logistic regression for analyzing large-scale matched data from epidemiological studies: a simple algorithm. BMC Bioinformatics 16, S1. Published: 17 April

Volume 16 Supplement 6.

Background

Epidemiological case-control studies are used to identify factors that may contribute to a health event by comparing a group of cases, that is, people with the health event under investigation, with a group of controls who do not have the health event but who are believed to be similar in other respects.

The same algorithm was independently proposed in two other high-dimensional matched case-control studies: one to identify associations between Crohn's disease and genetic markers in family-based designs such as case-sibling and case-parent [18], and one to identify associations between specific brain regions of acute infarction and hospital-acquired pneumonia in stroke patients [19].

This approach was used in a matched case-control study to identify associations between DNA methylation levels and hepatocellular carcinoma in tumor-adjacent non-tumor tissues [21], and in the case-crossover study of prescription drugs and driving for the whole population [22].

Publicly available implementation

Several algorithms have been proposed for solving the Lasso for the Cox model [45, 42, 46-50].

Table 1. Main publicly available R packages that solve the Lasso and other sparse penalties for the Cox, logistic or conditional logistic models (surveyed October 1st).

Results

Medicinal drugs have a potential effect on the skills needed for driving, a task that involves a wide range of cognitive, perceptual and psychomotor activities.

Data sources and designs

Information on drug prescriptions and road traffic accidents was obtained from the following anonymized population-based registries: the national health care insurance database (which covers the whole French population and includes data on reimbursed prescription drugs), police reports, and the national police database of injurious road traffic crashes.

Individually matched case-control study

The epidemiological question is: "What is different about at-fault drivers, if they are highly comparable to not at-fault drivers on external factors that may influence a road crash such as weather or road conditions?"

Figure 1. Flowchart of the inclusion procedure.

Figure 2.

Table 2. Odds ratio (OR) by study design.

Conclusion

We have developed a simple algorithm for the adaptation of the Lasso and related methods to the conditional logistic regression model.
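As a rough illustration of what adapting the Lasso to conditional logistic regression means, the objective can be written in its usual textbook form. The notation is an assumption introduced here, not taken from the article: N matched sets, each with one case (index j = 0) and M controls, covariate vectors x_{ij} of length p, and a tuning parameter lambda >= 0.

```latex
\hat{\beta}(\lambda) = \arg\max_{\beta}\;
\sum_{i=1}^{N}\left[ x_{i0}^{\top}\beta
  - \log\sum_{j=0}^{M}\exp\!\left(x_{ij}^{\top}\beta\right)\right]
  - \lambda \sum_{k=1}^{p} \lvert\beta_k\rvert
```

The bracketed term is the log conditional likelihood of one matched set. The L1 penalty shrinks coefficients and sets some of them exactly to zero, which is what makes the fitted model sparse.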

Competing interests

The authors declare that they have no competing interests.

When the odds ratio associated with a year increase in age is 3, the power decreases as the matching range of age widens.

This is not observed until the confounding effect is large. We let the odds ratio associated with the exposure be 1 under the null hypothesis and 1. When the null hypothesis is true, i. The estimation results assuming the null hypothesis is true are presented in Tables 3 and 4. In Table 3, the bias is consistently around 0 regardless of confounding effect and age matching range.

It remains similar between the unconditional and conditional models until the mean age difference reaches 20, at which point the unconditional model has a shorter interval than the conditional model. The reduction in the SE leads to the difference of 0.

Table 3. Biases of unconditional and conditional logistic regression models under the null hypothesis.

Table 4.

The findings are consistent when the mean age difference is

Table 5.

Table 6.

In conclusion, unconditional and conditional logistic regression models perform similarly in testing and estimation except when the age distributions of exposed and unexposed subjects are 20 years apart.

When the two age distributions are 20 years apart, the unconditional model consistently gives a type I error below the acceptable range and is slightly less powerful than the conditional model under the alternative hypothesis. When the alternative hypothesis is true, the unconditional model significantly underestimates the effect of exposure while the conditional model consistently produces an unbiased estimate.

When the mean age of exposed subjects is 20 years older than that of unexposed subjects, cases are more likely to be matched to controls with the same exposure status and the association is diminished accordingly. The unconditional method ignores matching but adjusts for confounding in the framework of regression.

In general, the Mantel-Haenszel estimator and the logit-based estimator are similar when the data within strata, here age groups, are not too sparse. Without loss of generality, assume that age is grouped into a few age groups.

Let a, b, c, and d denote the four cell counts in an age group: the numbers of exposed cases, exposed controls, unexposed cases, and unexposed controls, respectively. The Mantel-Haenszel odds ratio is given by

$$\mathrm{OR}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i},$$

where i indexes the age groups and n_i = a_i + b_i + c_i + d_i is the total count in age group i. In the top and bottom age groups in particular, the ratio of the number of cases to the number of controls, given the exposure status, is close to the case-control matching ratio. The contributions from such an age group to the numerator and the denominator then tend to be similar, which drives the association toward the null value.
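The Mantel-Haenszel calculation can be sketched in a few lines. The estimator pools a*d/n over age groups in the numerator and b*c/n in the denominator, where n is the total count in the group. The function name `mantel_haenszel_or` and the counts below are hypothetical and serve only as an illustration. They are not data from the study.

```python
def mantel_haenszel_or(strata):
    """Pooled Mantel-Haenszel odds ratio over strata (here, age groups).

    strata: iterable of (a, b, c, d) counts per stratum, where
      a = exposed cases,   b = exposed controls,
      c = unexposed cases, d = unexposed controls.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("Mantel-Haenszel odds ratio is undefined for these counts")
    return numerator / denominator

# Hypothetical counts for two age groups (illustration only).
age_groups = [(10, 5, 20, 40), (8, 4, 16, 32)]
pooled_or = mantel_haenszel_or(age_groups)
```

In this made-up example each age group has a stratum-specific odds ratio of 4, so the pooled estimate is also 4. A stratum whose a*d and b*c terms are nearly equal contributes almost equally to the numerator and the denominator and therefore pulls the pooled estimate toward 1.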

In the simulations, we assumed well-powered studies in which every case can be matched to a control. This is reasonable because the question we attempt to address is whether matched case-control data need to be analyzed by a conditional logistic regression model. For a sufficiently large sample size, our conclusions generalize to other values of disease prevalence and exposure frequency.

Again, the objective of this article is to compare the two methods given matched case-control data, rather than to compare unmatched and matched data from different study designs, where matched data tend to have a smaller sample size because some cases cannot be matched. Our findings suggest that when cases and controls are matched on age only, the data are essentially loose-matching data, and unconditional logistic regression is a proper method when the age distributions of exposed and unexposed subjects are not far apart.

Previous literature has provided in-depth discussion of the advantages of the unconditional regression model over its conditional alternative, such as convenience, ease of access, straightforward interpretation, and the potential to preserve unmatched controls. We argue that matched case-control studies have been underappreciated because of the misconception that matched case-control data can be analyzed only by matched methods. A paper reviewed the statistical methods of 37 published matched case-control studies. Among these studies, the majority performed matching on demographic variables only, namely age and sex.

The authors stated that their conclusion followed the book of Breslow et al. Based on our findings, matched methods are not necessary for loose-matching data. While we believe that it is realistically rare to observe age distributions that are 20 years apart for exposed and unexposed subjects, this setting gives an example of how the matching distortion (matched cases and controls tend to share the same exposure status) causes the unconditional logistic regression model to fail.

In contrast, the matching distortion was corrected by including the matching variables in the conditional logistic regression model (12). Although we only considered a single matching variable, i. With an increasing number of matching variables, loose matching is less likely to hold in the data, e.

However, the strength of loose matching is not always reflected in the number of matching variables. Matching on neighborhood or matching based on relationships implicitly matches on numerous unmeasured variables, including unmeasurable ones. Such studies generate genuinely matched data that need to be analyzed by matched methods. It should be cautioned that our findings are for matched case-control data and cannot be generalized to propensity score (PS) matched data.

The PS method was developed to facilitate causal inference in the spirit of clinical trials. Matching in the PS method is performed on the probability of a treatment assignment, which is determined by a selection of variables including confounders. After controlling for these variables, it is assumed that the outcome is independent of treatment status. The study is typically a cohort study, and the purpose of PS matching is to ensure that the treatment groups are balanced with respect to the variables (conditional independence).

In contrast, case-control studies are retrospective studies, and the exposure status is observed. While there is a debate about whether treated and untreated samples should be regarded as independent, which will inform the choice of statistical methods (17), it is different from the question that we have tried to address in terms of study design and matching scheme.

The scope of this study is limited to case-control studies that perform matching on a few demographic variables and consider methods of unconditional and conditional logistic regression models. In addition, the simulation settings assume absolute matching success, no model misspecification, and no interaction between exposure and matching variables.

However, these assumptions can be relaxed and will require further investigation. The results from a linear regression model (unmatched method) and a linear mixed effects model assuming random effects for matching sets (matched method) were quite similar in terms of the regression coefficient and P value associated with the case-control status, which supports our finding that case-control data matched on a few demographic variables can be properly analyzed by unmatched methods.

To conclude, it has been known that matched methods, e. Matched methods additionally are robust to the matching distortion. Unmatched methods, e. When the study design involves other complex features such as censoring and repeated measures, matching on a few demographic variables can be ignored if the confounding effect is not very large. Standard methods such as Cox regression and generalized estimating equations can then be readily applied. Unmatched methods are also appealing for saving computational time when the same analysis needs to be repeated extensively, e.

In addition to matching, other factors such as study design and practical feasibility also need to be considered when choosing a statistical method.

All of the authors contributed significantly to study design, result interpretation, and manuscript preparation. The data simulations were conducted by C-LK.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


