Comparison of dimension reduction-based logistic regression models for case-control genome-wide association study: principal components analysis vs. partial least squares
1. Department of Epidemiology and Biostatistics, School of Public Health
2. Department of Public Service Management, School of KangDa
3. Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center
4. Institute of Occupational Medicine and Ministry of Education, Key Laboratory for Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
5. State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, Jiangsu 211166, China
6. State Key Laboratory of Molecular Oncology and Department of Etiology and Carcinogenesis, Cancer Institute and Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
Funds: the National Natural Science Foundation of China (81202283, 81473070, 81373102 and 81202267), Key Grant of Natural Science Foundation of the Jiangsu Higher Education Institutions of China (10KJA330034 and 11KJA330001)
With recent advances in biotechnology, the genome-wide association study (GWAS) has been widely used to identify genetic variants that underlie human complex diseases and traits. In case-control GWAS, the typical statistical strategy is traditional logistic regression (LR) based on single-locus analysis. However, such a single-locus analysis leads to the well-known multiplicity problem, with a risk of inflating type I error and reducing power. Dimension reduction-based techniques, such as principal component-based logistic regression (PC-LR) and partial least squares-based logistic regression (PLS-LR), have recently gained much attention in the analysis of high dimensional genomic data. However, the performance of these methods is still not clear, especially in GWAS. We conducted simulations and a real data application to compare the type I error and power of PC-LR, PLS-LR and LR applicable to GWAS within a defined single nucleotide polymorphism (SNP) set region. We found that PC-LR and PLS-LR can reasonably control type I error under the null hypothesis. In contrast, LR, which is corrected by the Bonferroni method, was more conservative in all simulation settings. In particular, we found that PC-LR and PLS-LR had comparable power and they both outperformed LR, especially when the causal SNP was in high linkage disequilibrium with genotyped ones and had a small effect size in simulation. Based on SNP set analysis, we applied all three methods to analyze non-small cell lung cancer GWAS data.
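
To make the contrast between a SNP-set test and Bonferroni-corrected single-locus LR concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the toy genotype simulator, the sample size, the 80% variance-explained rule for choosing the number of principal components, the likelihood ratio test, and all function names (`pc_lr_pvalue`, `bonferroni_lr_pvalue`, and so on). PLS-LR is not shown.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed-form series)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    chi, total = math.sqrt(x), 0.0
    term = chi
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total
    return math.erfc(chi / math.sqrt(2.0)) + tail

def loglik_logistic(X, y, n_iter=50, tol=1e-8):
    """Maximized log-likelihood of a logistic model fitted by Newton-Raphson."""
    X = np.column_stack([np.ones(len(y)), X])  # add intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        w = p * (1.0 - p)
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = np.clip(X @ beta, -30, 30)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def null_loglik(y):
    """Log-likelihood of the intercept-only model."""
    p0 = y.mean()
    return float(len(y) * (p0 * math.log(p0) + (1 - p0) * math.log(1 - p0)))

def pc_lr_pvalue(G, y, var_explained=0.8):
    """Set-level test: regress y on the leading PCs of genotype matrix G."""
    Gc = G - G.mean(axis=0)
    U, s, _ = np.linalg.svd(Gc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Assumed rule: keep the fewest PCs reaching the variance threshold.
    k = min(int(np.searchsorted(ratio, var_explained)) + 1, len(s))
    scores = U[:, :k] * math.sqrt(G.shape[0])  # rescaled PC scores
    stat = 2.0 * (loglik_logistic(scores, y) - null_loglik(y))
    return chi2_sf(stat, k), k

def bonferroni_lr_pvalue(G, y):
    """Single-locus LR per SNP; smallest p-value, Bonferroni-adjusted."""
    ll0, m = null_loglik(y), G.shape[1]
    pvals = [chi2_sf(2.0 * (loglik_logistic(G[:, [j]], y) - ll0), 1)
             for j in range(m)]
    return min(1.0, m * min(pvals))

# Toy data (all values assumed): correlated SNPs from thresholded normals.
rng = np.random.default_rng(0)
n, m, maf, rho, beta = 1000, 10, 0.3, 0.8, 0.4  # rho: latent correlation

def haplotype():
    shared = math.sqrt(rho) * rng.standard_normal((n, 1))
    z = shared + math.sqrt(1 - rho) * rng.standard_normal((n, m))
    return (z < np.quantile(z, maf, axis=0)).astype(float)

G_all = haplotype() + haplotype()               # genotypes coded 0/1/2
eta = -0.25 + beta * (G_all[:, 0] - G_all[:, 0].mean())
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
G = G_all[:, 1:]                                # causal SNP left ungenotyped

p_pc, k = pc_lr_pvalue(G, y)
p_single = bonferroni_lr_pvalue(G, y)
print(f"PC-LR: p = {p_pc:.3g} using {k} PCs")
print(f"Bonferroni-corrected single-locus LR: p = {p_single:.3g}")
```

The set-level test spends its degrees of freedom on a few components that summarize correlated SNPs, whereas the Bonferroni adjustment treats the correlated single-SNP tests as if they were independent, which is one intuition for its conservativeness. A single simulated data set such as this one says nothing about type I error or power; estimating those would require repeating the simulation many times under the null and alternative settings.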