ICSA Book Series in Statistics Series Editors: Jiahua Chen · Ding-Geng (Din) Chen
Ding-Geng (Din) Chen John Dean Chen Editors
Monte-Carlo Simulation-Based Statistical Modeling
ICSA Book Series in Statistics
Series editors
Jiahua Chen, Department of Statistics, University of British Columbia, Vancouver, Canada
Ding-Geng (Din) Chen, University of North Carolina, Chapel Hill, NC, USA
More information about this series at http://www.springer.com/series/13402
Ding-Geng (Din) Chen · John Dean Chen
Editors

Monte-Carlo Simulation-Based Statistical Modeling
Editors
Ding-Geng (Din) Chen University of North Carolina Chapel Hill, NC USA
John Dean Chen Risk Management Credit Suisse New York, NY USA
and University of Pretoria Pretoria, South Africa
ISSN 2199-0980 (print)    ISSN 2199-0999 (electronic)
ICSA Book Series in Statistics
ISBN 978-981-10-3306-3    ISBN 978-981-10-3307-0 (eBook)
DOI 10.1007/978-981-10-3307-0
Library of Congress Control Number: 2016960187

© Springer Nature Singapore Pte Ltd. 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #22-06/08 Gateway East, Singapore 189721, Singapore
Preface
Over the last two decades, advancements in computer technology have enabled accelerated research and development of Monte-Carlo computational methods. This book is a compilation of invited papers from some of the most forward-thinking statistical researchers. These authors present new developments in Monte-Carlo simulation-based statistical modeling, thereby creating an opportunity for the exchange of ideas among researchers and users of statistical computing.

Our aim in creating this book is to provide a venue for the timely dissemination of research in Monte-Carlo simulation-based statistical modeling and to promote further research and collaborative work in this area. In the era of big data science, this collection of innovative research not only has remarkable potential to have a substantial impact on the development of advanced Monte-Carlo methods across the spectrum of statistical data analyses, but also holds great promise for fostering new research and collaborations addressing the ever-changing challenges and opportunities of statistics and data science. The authors have made their data and computer programs publicly available, making it possible for readers to replicate the model development and data analysis presented in each chapter and to readily apply these new methods in their own research.

The 18 chapters are organized into three parts. Part I includes six chapters that present and discuss general Monte-Carlo techniques. Part II comprises six chapters with a common focus on Monte-Carlo methods used in missing data analyses, an area of growing importance in public health and social sciences. Part III is composed of six chapters that address Monte-Carlo statistical modeling and its applications.
Part I: Monte-Carlo Techniques (Chapters “Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations” – “Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations”)

Chapter “Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations” presents a unified framework for concurrently generating data that include the four major types of distributions (i.e., binary, ordinal, count, and normal) with specified marginal and association structures. In this discussion of an important supplement to existing methods, Hakan Demirtas unifies the Monte-Carlo methods for specified types of data and presents his systematic and comprehensive investigation of mixed data generation. The proposed framework can then be readily used to simulate multivariate data of mixed types for the development of more sophisticated simulation, computation, and data analysis techniques.

In Chapter “Improving the Efficiency of the Monte-Carlo Methods Using Ranked Simulated Approach”, Hani Samawi provides an overview of his development of ranked simulated sampling, a key approach for improving the efficiency of general Monte-Carlo methods. Samawi then demonstrates the capacity of this approach to provide unbiased estimation.

In Chapter “Normal and Non-normal Data Simulations for the Evaluation of Two-Sample Location Tests”, Jessica Hoag and Chia-Ling Kuo discuss Monte-Carlo simulation of normal and non-normal data to evaluate two-sample location tests (i.e., statistical tests that compare means or medians of two independent populations).
Chapter “Anatomy of Correlational Magnitude Transformations in Latency and Discretization Contexts in Monte-Carlo Studies” proposes a general assessment of correlational magnitude changes in the latency and discretization contexts of Monte-Carlo studies. Further, authors Hakan Demirtas and Ceren Vardar-Acar provide a conceptual framework and computational algorithms for modeling the correlation transitions under specified distributional assumptions within the realm of discretization in the context of latency and the threshold concept. The authors illustrate the proposed algorithms with several examples and include a simulation study that demonstrates the feasibility and performance of the methods.

Chapter “Monte-Carlo Simulation of Correlated Binary Responses” discusses the Monte-Carlo simulation of correlated binary responses. Simulation studies are a well-known, highly valuable tool that allows researchers to obtain powerful conclusions for correlated or longitudinal response data. In cases where logistic modeling is used, the researcher must have appropriate methods for simulating correlated binary data along with associated predictors. In this chapter, author Trent Lalonde presents an overview of existing methods for simulating correlated binary response data and compares those methods using R software.
Chapter “Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations” provides a general framework for quantifying the sensitivity and uncertainty that result from the misspecification of model parameters in optimal experimental schemes. In designing life-testing experiments, it is widely accepted that the optimal experimental scheme depends on unknown model parameters, and that misspecified parameters can lead to a substantial loss of efficiency in the statistical analysis. To quantify this effect, Tony Ng, Yu-Jau Lin, Tzong-Ru Tsai, Y.L. Lio, and Nan Jiang use Monte-Carlo simulations to evaluate the robustness of optimal experimental schemes.
Part II: Monte-Carlo Methods for Missing Data (Chapters “Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability Assumptions” – “Application of Markov Chain Monte-Carlo Multiple Imputation Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity Analysis for a Longitudinal Clinical Trial”)
Chapter “Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability Assumptions” presents a fully Bayesian method for using the Markov chain Monte-Carlo technique for missing data to sample the full conditional distribution of the missing data given observed data and the other parameters. In this chapter, Haresh Rochani and Daniel Linder show how to apply these methods to real datasets with missing responses as well as missing covariates. Additionally, the authors provide simulation settings to illustrate this method.

In Chapter “A Multiple Imputation Framework for Massive Multivariate Data of Different Variable Types: A Monte-Carlo Technique”, Hakan Demirtas discusses multiple imputation for massive multivariate data of different variable types from planned missingness designs, with the purpose of building the theoretical, algorithmic, and implementation-based components of a unified, general-purpose multiple imputation framework. The planned missingness designs are highly useful and will likely increase in popularity in the future. For this reason, the proposed multiple imputation framework represents an important refinement of existing methods.

Chapter “Hybrid Monte-Carlo in Multiple Missing Data Imputations with Application to a Bone Fracture Data” introduces the Hybrid Monte-Carlo method as an efficient approach for sampling complex posterior distributions of several correlated parameters from a semi-parametric missing data model. In this chapter, Hui Xie describes a modeling approach for missing values that does not require assuming specific distributional forms. To demonstrate the method, the author provides an R program for analyzing missing data from a bone fracture study.

Chapter “Statistical Methodologies for Dealing with Incomplete Longitudinal Outcomes Due to Dropout Missing at Random” considers key methods for handling longitudinal data that are incomplete due to missing at random dropout. In this
chapter, Ali Satty, Henry Mwambi, and Geert Molenberghs provide readers with an overview of the issues and the different methodologies for handling missing data in longitudinal datasets that result from dropout (e.g., study attrition, loss of follow-up). The authors examine the potential strengths and weaknesses of the various methods through two examples of applying these methods.

In Chapter “Applications of Simulation for Missing Data Issues in Longitudinal Clinical Trials”, Frank Liu and James Kost present simulation-based approaches for addressing missing data issues in longitudinal clinical trials, such as control-based imputation, tipping-point analysis, and a Bayesian Markov chain Monte-Carlo method. Computation programs for these methods are implemented and available in SAS.

In Chapter “Application of Markov Chain Monte-Carlo Multiple Imputation Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity Analysis for a Longitudinal Clinical Trial”, Wei Sun discusses the application of Markov chain Monte-Carlo multiple imputation for data that is missing not at random in longitudinal datasets from clinical trials. This chapter compares the patterns of missing data between study subjects who received treatment and study subjects who received a placebo.
Part III: Monte-Carlo in Statistical Modellings and Applications (Chapters “Monte-Carlo Simulation in Modeling for Hierarchical Generalized Linear Mixed Models” – “Bootstrap-Based LASSO-Type Selection to Build Generalized Additive Partially Linear Models for High-Dimensional Data”)
Chapter “Monte-Carlo Simulation in Modeling for Hierarchical Generalized Linear Mixed Models” adds a discussion of Monte-Carlo simulation-based hierarchical models, taking into account the variability at each level of the hierarchy. In this chapter, Kyle Irimata and Jeffrey Wilson discuss Monte-Carlo simulations for hierarchical linear mixed-effects models, fitting hierarchical logistic regression models with random intercepts (both random intercepts and random slopes) to multilevel data.

Chapter “Monte-Carlo Methods in Financial Modeling” demonstrates the use of Monte-Carlo methods in financial modeling. In this chapter, Chuanshu Ji, Tao Wang, and Leicheng Yin discuss two areas: market microstructure modeling and option pricing using Monte-Carlo dimension reduction techniques. This approach uses Bayesian Markov chain Monte-Carlo inference based on the trade and quote database from Wharton Research Data Services.

Chapter “Simulation Studies on the Effects of the Censoring Distribution Assumption in the Analysis of Interval-Censored Failure Time Data” discusses using Monte-Carlo simulations to evaluate the effect of the censoring distribution assumption for interval-censored survival data. In this chapter, Tyler Cook and
Jianguo Sun investigate the effectiveness and flexibility of two methods for regression analysis of informative case I and case II interval-censored data. The authors present extensive Monte-Carlo simulation studies that provide readers with guidelines regarding dependence of the censoring distribution.

Chapter “Robust Bayesian Hierarchical Model Using Monte-Carlo Simulation” uses Monte-Carlo simulation to demonstrate a robust Bayesian multilevel item response model. In this chapter, Geng Chen uses data from patients with Parkinson’s disease, a chronic progressive disease with multidimensional impairments. Using these data, Chen illustrates applying the multilevel item response model to not only deal with the multidimensional nature of the disease but also simultaneously estimate measurement-specific parameters, covariate effects, and patient-specific characteristics of disease progression.

In Chapter “A Comparison of Bootstrap Confidence Intervals for Multi-level Longitudinal Data Using Monte-Carlo Simulation”, Mark Reiser, Lanlan Yao, and Xiao Wang present a comparison of bootstrap confidence intervals for multilevel longitudinal data using Monte-Carlo simulations. Their results indicate that if the sample size at the lower level is small, then the parametric bootstrap and cluster bootstrap perform better at the higher level than the two-stage bootstrap. The authors then apply the bootstrap methods to a longitudinal study of preschool children nested within classrooms.
Chapter “Bootstrap-Based LASSO-Type Selection to Build Generalized Additive Partially Linear Models for High-Dimensional Data” presents an approach to using bootstrap-based LASSO-type selection to build generalized additive partially linear models for high-dimensional data. In this chapter, Xiang Liu, Tian Chen, Yuanzhang Li, and Hua Liang first propose a bootstrap-based procedure to select variables with penalized regression and then apply their procedure to analyze data from a breast cancer study and an HIV study. The two examples demonstrate the procedure’s flexibility and utility in practice. In addition, the authors present a simulation study that shows that, when compared with the penalized regression approach, their variable selection procedure performs better.

As a general note, the references for each chapter are included immediately following the chapter text. We have organized the chapters as self-contained units so readers can more easily and readily refer to the cited sources for each chapter. To facilitate readers’ understanding of the methods presented in this book, corresponding data and computing programs can be requested from the first editor by email at DrDG.Chen@gmail.com.

The editors are deeply grateful to many who have supported the creation of this book. We thank the authors of each chapter for their contributions and their generous sharing of their knowledge, time, and expertise. Second, our sincere gratitude goes to Ms. Diane C. Wyant from the School of Social Work, University of North Carolina at Chapel Hill for her expert editing of and comments on this book, which substantially uplifted its quality. We gratefully acknowledge the professional support of Hannah Qiu (Springer/ICSA Book Series coordinator) and Wei Zhao (associate editor) from Springer Beijing that made publishing this book with Springer a reality.
We welcome readers’ comments, including notes on typos or other errors, and look forward to receiving suggestions for improvements to future editions of this book. Please send comments and suggestions to any of the editors listed below.
October 2016
Ding-Geng (Din) Chen
University of North Carolina, Chapel Hill, USA
University of Pretoria, South Africa

John Dean Chen
Credit Suisse, New York, NY, USA
About the Book
This book brings together expert researchers engaged in Monte-Carlo simulation-based statistical modeling, offering them a forum to present and discuss recent issues in methodological development as well as public health applications. It is divided into three parts, with the first providing an overview of Monte-Carlo techniques, the second focusing on missing data Monte-Carlo methods, and the third addressing Bayesian and general statistical modeling using Monte-Carlo simulations. The data and computer programs used here will also be made publicly available, allowing readers to replicate the model development and data analysis presented in each chapter, and to readily apply them in their own research. Featuring highly topical content, the book has the potential to impact model development and data analyses across a wide spectrum of fields, and to spark further research in this direction.
Contents
Part I  Monte-Carlo Techniques
Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations . . . . 3
Hakan Demirtas, Rawan Allozi, Yiran Hu, Gul Inan and Levent Ozbek

Improving the Efficiency of the Monte-Carlo Methods Using Ranked Simulated Approach . . . . 17
Hani Michel Samawi

Normal and Non-normal Data Simulations for the Evaluation of Two-Sample Location Tests . . . . 41
Jessica R. Hoag and Chia-Ling Kuo

Anatomy of Correlational Magnitude Transformations in Latency and Discretization Contexts in Monte-Carlo Studies . . . . 59
Hakan Demirtas and Ceren Vardar-Acar

Monte-Carlo Simulation of Correlated Binary Responses . . . . 85
Trent L. Lalonde

Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations . . . . 107
H.K.T. Ng, Y.-J. Lin, T.-R. Tsai, Y.L. Lio and N. Jiang

Part II  Monte-Carlo Methods in Missing Data
Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability Assumptions . . . . 129
Haresh Rochani and Daniel F. Linder

A Multiple Imputation Framework for Massive Multivariate Data of Different Variable Types: A Monte-Carlo Technique . . . . 143
Hakan Demirtas
Hybrid Monte-Carlo in Multiple Missing Data Imputations with Application to a Bone Fracture Data . . . . 163
Hui Xie

Statistical Methodologies for Dealing with Incomplete Longitudinal Outcomes Due to Dropout Missing at Random . . . . 179
A. Satty, H. Mwambi and G. Molenberghs

Applications of Simulation for Missing Data Issues in Longitudinal Clinical Trials . . . . 211
G. Frank Liu and James Kost

Application of Markov Chain Monte-Carlo Multiple Imputation Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity Analysis for a Longitudinal Clinical Trial . . . . 233
Wei Sun

Part III  Monte-Carlo in Statistical Modellings and Applications
Monte-Carlo Simulation in Modeling for Hierarchical Generalized Linear Mixed Models . . . . 255
Kyle M. Irimata and Jeffrey R. Wilson

Monte-Carlo Methods in Financial Modeling . . . . 285
Chuanshu Ji, Tao Wang and Leicheng Yin

Simulation Studies on the Effects of the Censoring Distribution Assumption in the Analysis of Interval-Censored Failure Time Data . . . . 319
Tyler Cook, Zhigang Zhang and Jianguo Sun

Robust Bayesian Hierarchical Model Using Monte-Carlo Simulation . . . . 347
Geng Chen and Sheng Luo

A Comparison of Bootstrap Confidence Intervals for Multi-level Longitudinal Data Using Monte-Carlo Simulation . . . . 367
Mark Reiser, Lanlan Yao, Xiao Wang, Jeanne Wilcox and Shelley Gray

Bootstrap-Based LASSO-Type Selection to Build Generalized Additive Partially Linear Models for High-Dimensional Data . . . . 405
Xiang Liu, Tian Chen, Yuanzhang Li and Hua Liang

Index . . . . 425
Editors and Contributors
About the Editors

Prof. Ding-Geng Chen is a fellow of the American Statistical Association and currently the Wallace Kuralt distinguished professor at the University of North Carolina at Chapel Hill. He was a professor at the University of Rochester and the Karl E. Peace endowed eminent scholar chair in biostatistics at Georgia Southern University. He is also a senior consultant for biopharmaceuticals and government agencies with extensive expertise in clinical trial biostatistics and public health statistics. Professor Chen has written more than 150 refereed publications and co-authored/co-edited ten books on clinical trial methodology, meta-analysis, causal inference, and public health statistics.
Mr. John Dean Chen specializes in Monte-Carlo simulations in modelling financial market risk. He is currently a Vice President at Credit Suisse specializing in regulatory stress testing with Monte-Carlo simulations. He began his career on Wall Street working in commodities trading market risk before moving to trading structured notes within the Exotics Interest Rate Derivatives desk at Barclays Capital. He transitioned back to risk at Mitsubishi UFJ working in its model risk group. During his career in the financial industry, he witnessed in person the unfolding of the financial crisis and the immediate aftermath consuming much of the financial industry. He graduated from the University of Washington with a dual Bachelor of Science in Applied Mathematics and Economics.
Contributors

Rawan Allozi Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA

Geng Chen Clinical Statistics, GlaxoSmithKline, Collegeville, PA, USA

Tian Chen Department of Mathematics and Statistics, University of Toledo, Toledo, OH, USA

Tyler Cook University of Central Oklahoma, Edmond, OK, USA

Hakan Demirtas Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA

Shelley Gray Speech and Hearing Science, Arizona State University, Tempe, AZ, USA

Jessica R. Hoag Department of Community Medicine and Health Care, Connecticut Institute for Clinical and Translational Science, University of Connecticut Health Center, Farmington, USA

Yiran Hu Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, Chicago, IL, USA

Gul Inan Department of Statistics, Middle East Technical University, Ankara, Turkey

Kyle M. Irimata School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ, USA

Chuanshu Ji Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, USA

N. Jiang Department of Mathematical Sciences, University of South Dakota, Vermillion, SD, USA

James Kost Merck & Co. Inc., North Wales, PA, USA

Chia-Ling Kuo Department of Community Medicine and Health Care, Connecticut Institute for Clinical and Translational Science, University of Connecticut Health Center, Farmington, USA

Trent L. Lalonde Department of Applied Statistics and Research Methods, University of Northern Colorado, Greeley, CO, USA

Yuanzhang Li Division of Preventive Medicine, Walter Reed Army Institute of Research, Silver Spring, MD, USA

Hua Liang Department of Statistics, George Washington University, Washington, DC, USA
Y.-J. Lin Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li District, Taoyuan City, Taiwan

Daniel F. Linder Department of Biostatistics and Epidemiology, Medical College of Georgia, Augusta University, Augusta, GA, USA

Y.L. Lio Department of Mathematical Sciences, University of South Dakota, Vermillion, SD, USA

G. Frank Liu Merck & Co. Inc., North Wales, PA, USA

Xiang Liu Health Informatics Institute, University of South Florida, Tampa, FL, USA

Sheng Luo Department of Biostatistics, The University of Texas Health Science Center at Houston, Houston, TX, USA

G. Molenberghs I-BioStat, Universiteit Hasselt & KU Leuven, Hasselt, Belgium

H. Mwambi Faculty of Mathematical Sciences and Statistics, Alneelain University, Khartoum, Sudan

H.K.T. Ng Department of Statistical Science, Southern Methodist University, Dallas, TX, USA

Levent Ozbek Department of Statistics, Ankara University, Ankara, Turkey

Mark Reiser School of Mathematical and Statistical Science, Arizona State University, Tempe, AZ, USA

Haresh Rochani Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA, USA

Hani Michel Samawi Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA, USA

A. Satty School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

Jianguo Sun University of Missouri, Columbia, MO, USA

Wei Sun Manager Biostatistician at Otsuka America, New York, USA

T.-R. Tsai Department of Statistics, Tamkang University, Tamsui District, New Taipei City, Taiwan

Ceren Vardar-Acar Department of Statistics, Middle East Technical University, Ankara, Turkey

Tao Wang Bank of America Merrill Lynch, New York, NY, USA

Xiao Wang Statistics and Data Corporation, Tempe, AZ, USA

Jeanne Wilcox Division of Educational Leadership and Innovation, Arizona State University, Tempe, AZ, USA
Jeffrey R. Wilson W.P. Carey School of Business, Arizona State University, Tempe, AZ, USA
Hui Xie Simon Fraser University, Burnaby, Canada; The University of Illinois at Chicago, Chicago, USA
Lanlan Yao School of Mathematical and Statistical Science, Arizona State University, Tempe, AZ, USA
Leicheng Yin Exelon Business Services Company, Enterprise Risk Management, Chicago, IL, USA
Zhigang Zhang Memorial Sloan Kettering Cancer Center, New York, NY, USA
Part I
Monte-Carlo Techniques
Joint Generation of Binary, Ordinal, Count, and Normal Data with Specified Marginal and Association Structures in Monte-Carlo Simulations Hakan Demirtas, Rawan Allozi, Yiran Hu, Gul Inan and Levent Ozbek
Abstract This chapter is concerned with building a unified framework for concurrently generating data sets that include all four major kinds of variables (i.e., binary, ordinal, count, and normal) when the marginal distributions and a feasible association structure are specified for simulation purposes. The simulation paradigm has been commonly employed in a wide spectrum of research fields including the physical, medical, social, and managerial sciences. A central aspect of every simulation study is the quantification of the model components and parameters that jointly define a scientific process. When this quantification cannot be performed via deterministic tools, researchers resort to random number generation (RNG) in finding simulation-based answers to address the stochastic nature of the problem. Although many RNG algorithms have appeared in the literature, a major limitation is that they were not designed to concurrently accommodate all variable types mentioned above. Thus, these algorithms provide only an incomplete solution, as real data sets include variables of different kinds. This work represents an important augmentation of the existing methods as it is a systematic attempt and comprehensive investigation for mixed data generation. We provide an algorithm that is designed for generating data of mixed marginals; illustrate its logistical, operational, and computational details; and present ideas on how it can be extended to span more complicated distributional settings in terms of a broader range of marginals and associational quantities.
H. Demirtas (B) · R. Allozi · Y. Hu
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: [email protected]

G. Inan
Department of Statistics, Middle East Technical University, Ankara, Turkey

L. Ozbek
Department of Statistics, Ankara University, Ankara, Turkey

© Springer Nature Singapore Pte Ltd. 2017
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical Modeling,
ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_1
1 Introduction
Stochastic simulation is an indispensable part and major focus of scientific inquiry. Model building, estimation, and testing typically require verification via simulation to assess the validity, reliability, and plausibility of inferential techniques, to evaluate how well the implemented models capture the specified true population values, and how reasonably these models respond to departures from underlying assumptions, among other things. Describing a real notion through mirror images and imperfect proxies of the perceived underlying truth, and iteratively refining and occasionally redefining the empirical truth to decipher the mechanism by which the process under consideration is assumed to operate, allows researchers to study the performance of their methods through simulated data replicates that mimic the real data characteristics of interest in any given setting. Accuracy and precision measures regarding the parameters under consideration signal whether the procedure works properly, and may suggest remedial action to minimize the discrepancies between expectation and reality.
Simulation studies have been commonly employed in a broad range of disciplines in order to better comprehend and solve today's increasingly sophisticated issues. A core component of every simulation study is the quantification of the model components and parameters that jointly define a scientific phenomenon.
Deterministic tools are typically inadequate to quantify complex situations, leading researchers to utilize RNG techniques in finding simulation-based solutions to address the stochastic behavior of problems that generally involve variables of many different types on a structural level; i.e., causal and correlational interdependencies are a function of a mixture of binary, ordinal, count, and continuous variables, which act simultaneously to characterize the mechanisms that collectively delineate a paradigm. In modern times, we are unequivocally moving from mechanistic to empirical thinking, from small data to big data, from mathematical perfection to reasonable approximation to reality, and from exact solutions to simulation-driven solutions. The ideas presented herein are important in the sense that the basic mixed-data generation setup can be augmented to handle a large spectrum of situations that can be encountered in many areas. This work is concerned with building the basics of a unified framework for concurrently generating data sets that include all four major kinds of variables (i.e., binary, ordinal, count, and normal) when the marginal distributions and a feasible association structure in the form of Pearson correlations are specified for simulation purposes. Although many RNG algorithms have appeared in the literature, a fundamental restriction is that they were not designed for a mix of all prominent types of data. The current chapter is a systematic attempt and compendious investigation for mixed data generation; it represents a substantial augmentation of the existing methods, and it has potential to advance scientific research and knowledge in a meaningful way.
The broader impact of this framework is that it can assist data analysts, practitioners, theoreticians, and methodologists across many disciplines to simulate mixed data with relative ease. The proposed algorithm constitutes a comprehensive set of
computational tools that offer promising potential for building enhanced computing infrastructure for research and education. We propose an RNG algorithm that encompasses all four major variable types, building upon our previous work on the generation of multivariate ordinal data (Demirtas 2006), joint generation of binary and normal data (Demirtas and Doganay 2012), ordinal and normal data (Demirtas and Yavuz 2015), and count and normal data (Amatya and Demirtas 2015) with the specification of marginal and associational parameters, along with other related work (Emrich and Piedmonte 1991; Demirtas and Hedeker 2011, 2016; Demirtas et al. 2016a; Ferrari and Barbiero 2012; Yahav and Shmueli 2012). Equally importantly, we discuss extensions to nonnormal continuous data via power polynomials that would handle the overwhelming majority of continuous shapes (Fleishman 1978; Vale and Maurelli 1983; Headrick 2010; Demirtas et al. 2012; Demirtas 2017a), count data that are prone to over- and underdispersion via the generalized Poisson distribution (Demirtas 2017b), broader measures of association such as Spearman's rank correlations and L-correlations (Serfling and Xiao 2007), and the specification of higher order product moments. Conceptual, algorithmic, operational, and procedural details will be communicated throughout the chapter.
The organization of the chapter is as follows: In Sect. 2, the algorithm for simultaneous generation of binary, ordinal, count, and normal data is given. The essence of the algorithm is finding the correlation structure of underlying multivariate normal (MVN) data that form a basis for the subsequent discretization in the binary and ordinal cases, and correlation mapping using inverse cumulative distribution functions (cdfs) in the count data case, where modeling the correlation transitions for different distributional pairs is discussed in detail. Section 3 presents some logistical details and an illustrative example through an R package that implements the algorithm, demonstrating how well the proposed technique works. Section 4 includes discussion on limitations, future directions, extensions, and concluding remarks.
2 Algorithm
The algorithm is designed for concurrently generating binary, ordinal, count, and continuous data. The count and continuous parts are assumed to follow Poisson and normal distributions, respectively. While binary is a special case of ordinal, the steps are presented separately for the purpose of exposition. Skipped patterns are allowed for ordinal variables. The marginal characteristics (the proportions for the binary and ordinal parts, the rate parameters for the count part, and the means and variances for the normal part) and a feasible Pearson correlation matrix need to be specified by the users. The algorithmic skeleton establishes the basic foundation; extensions to more general and complicated situations will be discussed in Sect. 4.
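The required inputs and the most elementary feasibility checks (cf. Steps 1 and 3 below) can be sketched in a few lines. The chapter's actual implementation is the PoisBinOrdNor R package discussed in Sect. 3; the following Python fragment is an illustrative sketch only, and all names and example values in it are hypothetical:

```python
import numpy as np

# Hypothetical specification: one binary, one ordinal, one count, one normal variable.
p_bin = [0.4]                      # binary proportions
p_ord = [[0.3, 0.3, 0.4]]          # ordinal marginal probabilities
lam = [3.0]                        # Poisson rates
mu_sigma2 = [(2.0, 1.0)]           # (mean, variance) pairs for the normal part

Sigma = np.array([[1.00, 0.25, 0.30, 0.35],
                  [0.25, 1.00, 0.40, 0.45],
                  [0.30, 0.40, 1.00, 0.50],
                  [0.35, 0.45, 0.50, 1.00]])

def validate_specs(p_bin, p_ord, lam, mu_sigma2, Sigma):
    """Basic logical checks before any correlation transition is attempted."""
    assert all(0 < p < 1 for p in p_bin), "binary proportions must lie in (0, 1)"
    assert all(abs(sum(p) - 1) < 1e-12 for p in p_ord), "ordinal probabilities must sum to 1"
    assert all(l > 0 for l in lam), "Poisson rates must be positive"
    assert all(s2 > 0 for _, s2 in mu_sigma2), "normal variances must be positive"
    d = len(p_bin) + len(p_ord) + len(lam) + len(mu_sigma2)
    assert Sigma.shape == (d, d), "correlation matrix dimension mismatch"
    assert np.allclose(Sigma, Sigma.T) and np.allclose(np.diag(Sigma), 1)
    # positive definiteness via the smallest eigenvalue
    assert np.linalg.eigvalsh(Sigma).min() > 0, "Sigma is not positive definite"

validate_specs(p_bin, p_ord, lam, mu_sigma2, Sigma)
```

These checks correspond to the validation performed before any of the correlation-transition steps below are run; a failure here means the specification itself, not the algorithm, is at fault.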
The operational engine of the algorithm hinges upon computing the correlation matrix of underlying MVN data that serve as an intermediate tool in the sense that binary and ordinal variables are obtained via dichotomization and ordinalization,
respectively, through the threshold concept, and count variables are retrieved by correlation mapping using inverse cdf matching. The procedure entails modeling the correlation transformations that result from discretization and mapping. In what follows, let B, O, C, and N denote binary, ordinal, count, and normal variables, respectively. Let Σ be the specified Pearson correlation matrix, which comprises ten submatrices that correspond to all possible variable-type combinations. Required parameter values are p's for binary and ordinal variables, λ's for count variables, (µ, σ²) pairs for normal variables, and the entries of the correlation matrix Σ. These quantities are either specified or estimated from a real data set that is to be mimicked.

1. Check if Σ is positive definite.
2. Find the upper and lower correlation bounds for all pairs by the sorting method of Demirtas and Hedeker (2011). It is well known that correlations are not bounded between −1 and 1 in most bivariate settings, as different upper and/or lower bounds may be imposed by the marginal distributions (Hoeffding 1940; Fréchet 1951). These restrictions apply to discrete variables as well as continuous ones. Let Π(F, G) be the set of cdfs H on R² having marginal cdfs F and G. Hoeffding (1940) and Fréchet (1951) proved that in Π(F, G) there exist cdfs H_L and H_U, called the lower and upper bounds, having minimum and maximum correlation. For all (x, y) ∈ R², H_L(x, y) = max[F(x) + G(y) − 1, 0] and H_U(x, y) = min[F(x), G(y)]. For any H ∈ Π(F, G) and all (x, y) ∈ R², H_L(x, y) ≤ H(x, y) ≤ H_U(x, y). If δ_L, δ_U, and δ denote the Pearson correlation coefficients for H_L, H_U, and H, respectively, then δ_L ≤ δ ≤ δ_U. One can infer that if V is uniform in [0, 1], then F^{-1}(V) and G^{-1}(V) are maximally correlated, and F^{-1}(V) and G^{-1}(1 − V) are maximally anticorrelated. In practical terms, generating X and Y independently with a large number of data points before sorting them in the same and opposite directions gives the approximate upper and lower correlation bounds, respectively. Make sure all elements of Σ are within the plausible range.
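The sorting approximation in Step 2 is straightforward to sketch. The snippet below is an illustrative Python rendering, not the authors' implementation; the function name and the Poisson/Bernoulli margins are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(12345)

def correlation_bounds(sample_x, sample_y):
    """Approximate the Frechet-Hoeffding correlation bounds by sorting:
    same-direction sorting gives the upper bound, opposite gives the lower."""
    x = np.sort(sample_x)
    upper = np.corrcoef(x, np.sort(sample_y))[0, 1]
    lower = np.corrcoef(x, np.sort(sample_y)[::-1])[0, 1]
    return lower, upper

# Example: a Poisson(1) and a Bernoulli(0.2) variable cannot reach correlation +/-1.
n = 100_000
x = rng.poisson(1.0, n)
y = (rng.uniform(size=n) < 0.2).astype(float)
lo, hi = correlation_bounds(x, y)
# lo stays well above -1 and hi well below 1 for this pair
```

Any entry of Σ that falls outside the bounds computed this way is infeasible regardless of the generation algorithm used, which is why the check is performed first.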
3. Perform logical checks, such as: binary proportions are between 0 and 1; probabilities add up to 1 for ordinal variables; the Poisson rates are positive for count variables; variances for normal variables are positive; the mean, variance, proportion, and rate vectors are consistent with the number of variables; and Σ is symmetric with its diagonal entries equal to 1, to prevent obvious misspecification errors.
4. For B-B combinations, find the tetrachoric (pre-dichotomization) correlation given the specified phi coefficient (post-dichotomization correlation). Let X1, X2 represent binary variables such that E[Xj] = pj and Cor(X1, X2) = δ12, where pj (j = 1, 2) and δ12 (phi coefficient) are given. Let Φ[t1, t2, ρ12] be the cdf for a standard bivariate normal random variable with correlation coefficient ρ12 (tetrachoric correlation). Naturally,

Φ[t1, t2, ρ12] = ∫_{−∞}^{t1} ∫_{−∞}^{t2} f(z1, z2, ρ12) dz1 dz2,

where f(z1, z2, ρ12) = [2π(1 − ρ12²)^{1/2}]^{−1} exp[−(z1² − 2ρ12 z1 z2 + z2²)/(2(1 − ρ12²))]. The connection between δ12 and ρ12 is reflected in the equation

Φ[z(p1), z(p2), ρ12] = δ12 (p1 q1 p2 q2)^{1/2} + p1 p2.

Solve for ρ12, where z(pj) denotes the pj-th quantile of the standard normal distribution and qj = 1 − pj. Repeat this process for all B-B pairs.
5. For B-O and O-O combinations, implement an iterative procedure that finds
the polychoric (pre-discretization) correlation given the ordinal phi coefficient (post-discretization correlation). Suppose Z = (Z1, Z2) ∼ N(0, ∆_{Z1Z2}), where Z denotes the bivariate standard normal distribution with correlation matrix ∆_{Z1Z2} whose off-diagonal entry is δ_{Z1Z2}. Let X = (X1, X2) be the bivariate ordinal data where the underlying Z is discretized based on corresponding normal quantiles given the marginal proportions. If we need to sample from a random vector (X1, X2) whose marginal cdfs are F1 and F2, tied together via a Gaussian copula, we generate a sample (z1, z2) from Z ∼ N(0, ∆_{Z1Z2}), then set x = (x1, x2) = (F1^{-1}(u1), F2^{-1}(u2)), where u = (u1, u2) = (Φ(z1), Φ(z2)) and Φ is the cdf of the standard normal distribution. The correlation matrix of X, denoted by ∆_{X1X2} (with an off-diagonal entry δ_{X1X2}), obviously differs from ∆_{Z1Z2} due to discretization. More specifically, |δ_{X1X2}| < |δ_{Z1Z2}| in large samples. The relationship between δ_{X1X2} and δ_{Z1Z2} can be established via the following algorithm (Ferrari and Barbiero 2012):

a. Generate standard bivariate normal data with the correlation δ_{Z1Z2}^0, where δ_{Z1Z2}^0 = δ_{X1X2} (here, δ_{Z1Z2}^0 is the initial polychoric correlation).
b. Discretize Z1 and Z2, based on the cumulative probabilities of the marginal distributions F1 and F2, to obtain X1 and X2, respectively.
c. Compute δ_{X1X2}^1 through X1 and X2 (here, δ_{X1X2}^1 is the ordinal phi coefficient after the first iteration).
d. Execute the following loop as long as |δ_{X1X2}^v − δ_{X1X2}| > ε and 1 ≤ v ≤ v_max (v_max and ε are the maximum number of iterations and the maximum tolerated absolute error, respectively; both quantities are set by the users):
   (a) Update δ_{Z1Z2}^v by δ_{Z1Z2}^v = δ_{Z1Z2}^{v−1} g(v), where g(v) = δ_{X1X2}/δ_{X1X2}^v. Here, g(v) serves as a correction coefficient, which ultimately converges to 1.
   (b) Generate bivariate normal data with δ_{Z1Z2}^v and compute δ_{X1X2}^{v+1} after discretization.
Again, one should repeat this process for each B-O (and O-O) pair.
6. For C-C combinations, compute the corresponding normal-normal correlations (pre-mapping) given the specified count-count correlations (post-mapping) via the inverse cdf method in Yahav and Shmueli (2012) that was proposed in the context of correlated count data generation. Their method utilizes a slightly modified version of the NORTA (Normal to Anything) approach (Nelsen 2006), which involves generation of MVN variates with given univariate marginals and the correlation structure (R_N), and then transforming them into any desired distribution using the inverse cdf. In the Poisson case, NORTA can be implemented
by the following steps:
a. Generate a k-dimensional normal vector Z_N from a MVN distribution with mean vector 0 and a correlation matrix R_N.
b. Transform Z_N to a Poisson vector X_C as follows:
   i. For each element z_i of Z_N, calculate the normal cdf, Φ(z_i).
   ii. For each value of Φ(z_i), calculate the Poisson inverse cdf with a desired corresponding marginal rate λ_i, Ψ_{λi}^{-1}(Φ(z_i)), where Ψ_{λi}(x) = Σ_{j=0}^{⌊x⌋} e^{−λi} λi^j / j!.
c. X_C = [Ψ_{λ1}^{-1}(Φ(z1)), ..., Ψ_{λk}^{-1}(Φ(zk))]^T is a draw from the desired multivariate count data with correlation matrix R_POIS.
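The NORTA steps above can be rendered in a few lines. The chapter's implementation uses R; this Python fragment is an illustrative sketch with made-up rates and normal-scale correlation:

```python
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(2017)

def norta_poisson(n, R_N, lam):
    """Generate correlated Poisson vectors via NORTA:
    MVN draws -> normal cdf -> Poisson inverse cdf (quantile) per margin."""
    k = len(lam)
    A = np.linalg.cholesky(R_N)
    Z = rng.standard_normal((n, k)) @ A.T      # MVN(0, R_N) draws
    U = norm.cdf(Z)                            # uniform margins
    return np.column_stack([poisson.ppf(U[:, i], lam[i]) for i in range(k)])

R_N = np.array([[1.0, 0.7],
                [0.7, 1.0]])
lam = [3.0, 5.0]
X = norta_poisson(100_000, R_N, lam)
# marginal means approximate lam; the realized Poisson-scale correlation
# is close to, but slightly below, the normal-scale 0.7
```

The gap between the input R_N and the realized count-scale correlation is exactly the attenuation that Step 6 must invert when the user specifies the correlation on the count scale.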
An exact theoretical connection between R_N and R_POIS has not been established to date. However, it has been shown that the feasible range of correlation between a pair of Poisson variables after the inverse cdf transformation is within [ρ_min, ρ_max] = [Cor(Ψ_{λi}^{-1}(U), Ψ_{λj}^{-1}(1 − U)), Cor(Ψ_{λi}^{-1}(U), Ψ_{λj}^{-1}(U))], where λi and λj are the marginal rates, and U ∼ Uniform(0, 1). Yahav and Shmueli (2012) proposed a conceptually simple method to approximate the relationship between the two correlations. They have demonstrated that R_POIS can be approximated
as an exponential function of R_N, where the coefficients are functions of ρ_min and ρ_max.
7. For B-N/O-N combinations, find the biserial/polyserial correlation (before discretization of one of the variables) given the point-biserial/point-polyserial correlation (after discretization) by the linearity and constancy arguments proposed by Demirtas and Hedeker (2016). Suppose that X and Y follow a bivariate normal distribution with a correlation of δ_XY. Without loss of generality, we may assume that both X and Y are standardized to have a mean of 0 and a variance of 1. Let X_D be the binary variable resulting from a split on X, X_D = I(X ≥ k). Thus, E[X_D] = p and V[X_D] = pq, where q = 1 − p. The correlation between X_D and X, δ_{X_D X}, can be obtained in a simple way, namely, δ_{X_D X} = Cov[X_D, X]/√(V[X_D]V[X]) = (E[X_D X] − E[X_D]E[X])/√(pq) = p E[X | X ≥ k]/√(pq). We can also express the relationship between X and Y via the following linear regression model:

Y = δ_XY X + ε,   (1)

where ε is independent of X and follows N(0, 1 − δ_XY²). When we generalize this to nonnormal X and/or Y (both centered and scaled), the same relationship can be assumed to hold, with the exception that ε follows a nonnormal distribution. As long as Eq. 1 is valid,

Cov[X_D, Y] = Cov[X_D, δ_XY X + ε] = δ_XY Cov[X_D, X] + Cov[X_D, ε].   (2)

Since ε is independent of X, it will also be independent of any deterministic function of X, such as X_D, and thus Cov[X_D, ε] will be 0. As E[X] = E[Y] = 0, V[X] = V[Y] = 1, Cov[X_D, Y] = δ_{X_D Y}√(pq), and Cov[X, Y] = δ_XY, Eq. 2 reduces to

δ_{X_D Y} = δ_XY δ_{X_D X}.   (3)

In the bivariate normal case, δ_{X_D X} = h/√(pq), where h is the ordinate of the normal curve at the point of dichotomization. Equation 3 indicates that the linear association between X_D and Y is assumed to be fully explained by their mutual association with X (Demirtas and Hedeker 2016). The ratio δ_{X_D Y}/δ_XY is equal to δ_{X_D X} = E[X_D X]/√(pq) = p E[X | X ≥ k]/√(pq). It is a constant given p and the distribution of (X, Y). These correlations are invariant to location shifts and scaling; X and Y do not have to be centered and scaled, and their means and variances can take any finite values. Once the ratio (δ_{X_D X}) is found, one can compute the biserial correlation when the point-biserial correlation is specified. When X is ordinalized to obtain X_O, the fundamental ideas remain unchanged. If the assumptions of Eqs. 1 and 3 are met, the method is equally applicable to the ordinal case in the context of the relationship between the polyserial (before ordinalization) and point-polyserial (after ordinalization) correlations. The easiest way of computing δ_{X_O X} is to generate X with a large number of data points, then ordinalize it to obtain X_O, and then compute the sample correlation between X_O and X. X could follow any continuous univariate distribution; however, here X is assumed to be a part of MVN data before discretization.
8. For C-N combinations, use the count version of Eq. 3, which is δ_{X_C Y} = δ_XY δ_{X_C X}. The only difference is that we use the inverse cdf method rather than discretization via thresholds as in the binary and ordinal cases.
9. For B-C and O-C combinations, suppose that there are two identical standard normal variables; one underlies the binary/ordinal variable before discretization, the other underlies the count variable before inverse cdf matching. One can find Cor(O, N) by the method of Demirtas and Hedeker (2016). Then, assume Cor(C, O) = Cor(C, N) Cor(O, N). Cor(C, O) is specified and Cor(O, N) is calculated. Solve for Cor(C, N). Then, find the underlying N-N correlation by Step 8 above (Amatya and Demirtas 2015; Demirtas and Hedeker 2016).
10. Construct an overall, intermediate correlation matrix, Σ*, using the results from Steps 4 through 9, in conjunction with the N-N part that remains untouched when we compute Σ* from Σ.
11. Check if Σ* is positive definite. If it is not, find the nearest positive definite correlation matrix by the method of Higham (2002).
12. Generate multivariate normal data with a mean vector of (0, ..., 0) and correlation matrix Σ*, which can easily be done by using the Cholesky decomposition of Σ* and a vector of univariate normal draws. The Cholesky decomposition of Σ* produces a lower-triangular matrix A for which A A^T = Σ*. If z = (z1, ..., zd) are d independent standard normal random variables, then Z = Az is a random draw from this distribution.
13. Dichotomize binary, ordinalize ordinal by respective quantiles, and go from normal to count by inverse cdf matching.
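The final generation phase (Steps 12 and 13) can be sketched as follows. This Python fragment assumes the intermediate matrix Σ* has already been computed by the earlier steps; the three-variable Σ* and the marginal parameters below are made-up values for illustration, not from the chapter:

```python
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(42)

# Assumed inputs: an intermediate (positive definite) correlation matrix Sigma_star
# for one binary (p = 0.4), one count (lambda = 3), and one normal N(2, 1) variable.
Sigma_star = np.array([[1.00, 0.50, 0.30],
                       [0.50, 1.00, 0.40],
                       [0.30, 0.40, 1.00]])

n = 50_000
A = np.linalg.cholesky(Sigma_star)          # A @ A.T == Sigma_star
Z = rng.standard_normal((n, 3)) @ A.T       # Step 12: MVN(0, Sigma_star) draws

p, lam, mu, sigma = 0.4, 3.0, 2.0, 1.0
Y_bin = (Z[:, 0] > norm.ppf(1 - p)).astype(int)   # Step 13: dichotomize at quantile
Y_cnt = poisson.ppf(norm.cdf(Z[:, 1]), lam)       # Step 13: inverse cdf matching
Y_nor = mu + sigma * Z[:, 2]                      # Step 13: rescale the normal part
```

The marginal properties (binary proportion, Poisson rate, normal mean and variance) are recovered exactly by construction, while the pairwise correlations on the observed scale are governed by the transitions built into Σ*.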
3 Some Operational Details and an Illustrative Example
The software implementation of the algorithm has been done in the PoisBinOrdNor package (Demirtas et al. 2016b) within the R environment (R Development Core Team 2016). The package has functions for each variable-type pair that are collectively capable of modeling the correlation transitions. More specifically, the corr.nn4bb function finds the tetrachoric correlation in Step 4, the corr.nn4bn and corr.nn4on functions compute the biserial and polyserial correlations for binary and ordinal variables, respectively, in Step 7, the corr.nn4pbo function is used to handle B-C and O-C pairs in Step 9, the corr.nn4pn function is designed for C-N combinations in Step 8, and the corr.nn4pp function calculates the pre-mapping correlations in Step 6. In addition, polychoric correlations in Step 5 are computed by the ordcont function in the GenOrd package (Barbiero and Ferrari 2015). The correlation bound check (Step 2) as well as the validation of the specified quantities (Step 3), assembling all the intermediate correlation entries into Σ* (Step 10), and generating mixed data (Steps 12 and 13) are performed by the validation.specs, intermat, and genPBONdata functions, respectively, in the PoisBinOrdNor package. Positive definiteness checks in Steps 1 and 11 are done by the is.positive.definite function in the corpcor package (Schaefer et al. 2015), finding the nearest Σ* is implemented by the nearPD function in the Matrix package (Bates and Maechler 2016), and MVN data are generated by the rmvnorm function in the mvtnorm package (Genz et al. 2016).
For illustration, suppose we have two variables of each type. Operationally, the PoisBinOrdNor package assumes that the variables are specified in a certain order.
Let Y1 ∼ Poisson(3), Y2 ∼ Poisson(5), Y3 ∼ Bernoulli(0.4), Y4 ∼ Bernoulli(0.6), let Y5 and Y6 be ordinal with P(Yj = i) = pi, where (p0, p1, p2) = (0.3, 0.3, 0.4) and (0.5, 0.1, 0.4) for i = 0, 1, 2 and j = 5 and 6, respectively, and let Y7 ∼ N(2, 1) and Y8 ∼ N(5, 9). The correlation matrix Σ is specified as follows under the assumption that columns (and rows) represent the order above:

Σ =
  1     0.70  0.66  0.25  0.41  0.63  0.22  0.51
  0.70  1     0.59  0.22  0.37  0.57  0.20  0.46
  0.66  0.59  1     0.21  0.34  0.53  0.19  0.43
  0.25  0.22  0.21  1     0.13  0.20  0.07  0.16
  0.41  0.37  0.34  0.13  1     0.33  0.12  0.27
  0.63  0.57  0.53  0.20  0.33  1     0.18  0.42
  0.22  0.20  0.19  0.07  0.12  0.18  1     0.15
  0.51  0.46  0.43  0.16  0.27  0.42  0.15  1
The intermediate correlation matrix Σ*, after validating the feasibility of the marginal and correlational specifications and applying all the relevant correlation transition steps, turns out to be (rounded to three digits after the decimal)
Σ* =
  1      0.720  0.857  0.325  0.477  0.776  0.226  0.523
  0.720  1      0.757  0.282  0.424  0.693  0.202  0.466
  0.857  0.757  1      0.336  0.470  0.741  0.241  0.545
  0.325  0.282  0.336  1      0.186  0.299  0.089  0.203
  0.477  0.424  0.470  0.186  1      0.438  0.135  0.305
  0.776  0.693  0.741  0.299  0.438  1      0.216  0.504
  0.226  0.202  0.241  0.089  0.135  0.216  1      0.150
  0.523  0.466  0.545  0.203  0.305  0.504  0.150  1
Generating N = 10,000 rows of data based on this eight-variable system yields the following empirical correlation matrix (rounded to five digits after the decimal):

  1        0.69823  0.67277  0.24561  0.40985  0.63891  0.22537  0.50361
  0.69823  1        0.59816  0.21041  0.36802  0.57839  0.21367  0.45772
  0.67277  0.59816  1        0.20570  0.32448  0.55564  0.20343  0.42192
  0.24561  0.21041  0.20570  1        0.12467  0.20304  0.06836  0.17047
  0.40985  0.36802  0.32448  0.12467  1        0.32007  0.12397  0.26377
  0.63891  0.57839  0.55564  0.20304  0.32007  1        0.17733  0.41562
  0.22537  0.21367  0.20343  0.06836  0.12397  0.17733  1        0.15319
  0.50361  0.45772  0.42192  0.17047  0.26377  0.41562  0.15319  1
The discrepancies between the specified and empirically computed correlations are indiscernibly small, and the deviations are within an acceptable range that can be expected in any stochastic process. If we had repeated the experiment many times in a full-blown simulation study, the average differences would be even more negligible. We have observed similar trends in the behavior of the marginal parameters (not reported for brevity), which lend further support to the presented methodology. The assessment of the algorithm performance in terms of commonly accepted accuracy and precision measures in RNG and imputation settings, as well as in other simulated environments, can be carried out through the evaluation metrics developed in Demirtas (2004a, b, 2005, 2007a, b, 2008, 2009, 2010), Demirtas and Hedeker (2007, 2008a, b, c), Demirtas and Schafer (2003), Demirtas et al. (2007, 2008), and Yucel and Demirtas (2010).
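As a quick numerical check of this claim, one can transcribe the two matrices above and compute the largest absolute deviation between them (Python is used here purely for the arithmetic):

```python
import numpy as np

# Specified correlation matrix Sigma from the illustration above (symmetric).
spec = np.array([
    [1.00, 0.70, 0.66, 0.25, 0.41, 0.63, 0.22, 0.51],
    [0.70, 1.00, 0.59, 0.22, 0.37, 0.57, 0.20, 0.46],
    [0.66, 0.59, 1.00, 0.21, 0.34, 0.53, 0.19, 0.43],
    [0.25, 0.22, 0.21, 1.00, 0.13, 0.20, 0.07, 0.16],
    [0.41, 0.37, 0.34, 0.13, 1.00, 0.33, 0.12, 0.27],
    [0.63, 0.57, 0.53, 0.20, 0.33, 1.00, 0.18, 0.42],
    [0.22, 0.20, 0.19, 0.07, 0.12, 0.18, 1.00, 0.15],
    [0.51, 0.46, 0.43, 0.16, 0.27, 0.42, 0.15, 1.00]])

# Empirical correlation matrix from the N = 10,000 run reported above.
emp = np.array([
    [1.00000, 0.69823, 0.67277, 0.24561, 0.40985, 0.63891, 0.22537, 0.50361],
    [0.69823, 1.00000, 0.59816, 0.21041, 0.36802, 0.57839, 0.21367, 0.45772],
    [0.67277, 0.59816, 1.00000, 0.20570, 0.32448, 0.55564, 0.20343, 0.42192],
    [0.24561, 0.21041, 0.20570, 1.00000, 0.12467, 0.20304, 0.06836, 0.17047],
    [0.40985, 0.36802, 0.32448, 0.12467, 1.00000, 0.32007, 0.12397, 0.26377],
    [0.63891, 0.57839, 0.55564, 0.20304, 0.32007, 1.00000, 0.17733, 0.41562],
    [0.22537, 0.21367, 0.20343, 0.06836, 0.12397, 0.17733, 1.00000, 0.15319],
    [0.50361, 0.45772, 0.42192, 0.17047, 0.26377, 0.41562, 0.15319, 1.00000]])

max_abs_dev = np.max(np.abs(spec - emp))   # largest deviation, about 0.026
```

Every entry agrees with its target to within roughly 0.026, i.e., well inside ordinary Monte-Carlo sampling variability at N = 10,000.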
4 Future Directions
The significance of the current study stems from three major reasons: First, data analysts, practitioners, theoreticians, and methodologists across many different disciplines in the medical, managerial, social, biobehavioral, and physical sciences will be able to simulate multivariate data of mixed types with relative ease. Second, the proposed work can serve as a milestone for the development of more sophisticated simulation, computation, and data analysis techniques in the digital information, massive data era. The capability of generating many variables of different distributional
H. Demirtas et al.
types, nature, and dependence structures may be a contributing factor for better grasping the operational characteristics of today's intensive data trends (e.g., satellite data, internet traffic data, genetics data, ecological momentary assessment data). Third, these ideas can help to promote higher education and accordingly be instrumental in training graduate students. Overall, this work will provide a comprehensive and useful set of computational tools whose generality and flexibility offer promising potential for building an enhanced statistical computing infrastructure for research and education. While this work represents a decent step forward in mixed data generation, it may not be sufficiently complex for real-life applications, in the sense that real count and continuous data are typically more complicated than what the Poisson and normal distributions accommodate, and it is likely that specification of parameters that control the first two moments and the second order product moment is inadequate. To address these concerns, we plan on building a more inclusive structural umbrella, whose ingredients are as follows: First, the continuous part will be extended to encompass nonnormal continuous variables through the operational utility of third order power polynomials.
This approach is a moment-matching procedure in which any given continuous variable in the system is expressed as a sum of linear combinations of powers of a standard normal variate (Fleishman 1978; Vale and Maurelli 1983; Demirtas et al. 2012), which requires the specification of the first four moments. A more elaborate version in the form of the fifth order system will be implemented (Headrick 2010) in an attempt to control for higher order moments, to cover a larger area in the skewness-elongation plane, and to provide a better approximation to the probability density functions of the continuous variables; and the count data part will be augmented through the generalized Poisson distribution (Demirtas 2017b), which allows under- and over-dispersion, usually encountered in most applications, via an additional dispersion parameter. Second, although the Pearson correlation may not be the best association quantity in every situation, all correlations mentioned in this chapter are special cases of the Pearson correlation; it is the most widespread measure of association; and the generality of the methods proposed herein with different kinds of variables requires the broadest possible framework.
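The third-order moment-matching idea can be sketched concretely. The following is a minimal illustration of the Fleishman (1978) power method, assuming NumPy and SciPy are available; the target skewness/kurtosis values, starting point, and function names are our own choices, not values from the chapter:

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_eqs(p, skew, exkurt):
    # Y = a + b*Z + c*Z**2 + d*Z**3 with a = -c, so that E[Y] = 0.
    # The three equations match Var(Y)=1, skewness, and excess kurtosis.
    b, c, d = p
    var_ = b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1.0
    skew_ = 2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew
    kurt_ = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
                + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - exkurt
    return [var_, skew_, kurt_]

skew, exkurt = 0.5, 0.5              # illustrative target moments
b, c, d = fsolve(fleishman_eqs, x0=[1.0, 0.1, 0.05], args=(skew, exkurt))
a = -c

rng = np.random.default_rng(7)
z = rng.standard_normal(100_000)
y = a + b*z + c*z**2 + d*z**3        # nonnormal variable with target moments
print(np.mean(y), np.var(y))
```

The fifth-order extension mentioned in the text adds two more polynomial terms and two more moment equations to the same scheme.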
To further broaden the scale, scope, and applicability of the ideas presented in this chapter, the proposed RNG technique will be extended to allow the specification of Spearman's rho, which is more popular for discrete and heavily skewed continuous distributions; it will be incorporated into the algorithm for concurrently generating all four major types of variables. For continuous-continuous pairs, the connection between the Pearson and Spearman correlations is given in Headrick (2010) through the power coefficients, and these two correlations are known to be equal for binary-binary pairs. The relationship will be derived for all other variable type combinations. Inclusion of Spearman's rho as an option will allow us to specify nonlinear associations whose monotonic components are reflected in the rank correlation. Third, the expanded fifth order polynomial system will be further augmented to accommodate L-moments and L-correlations (Hosking 1990; Serfling and Xiao 2007), which are based on expectations of certain linear combinations of order statistics. The marginal and product L-moments are known to be more robust to outliers than their conventional counterparts, in the sense that they suffer less from the effects of sampling variability, and
Joint Generation of Binary, Ordinal, Count, and Normal Data …
they enable more secure inferences to be made from small samples about an underlying probability distribution. On a related note, further expansions can be designed to handle more complex associations that involve higher order product moments. The salient advantages of the proposed algorithm and its augmented versions are as follows: (1) Individual components are well-established. (2) Given their computational simplicity, generality, and flexibility, these methods are likely to be widely used by researchers, methodologists, and practitioners in a wide spectrum of scientific disciplines, especially in the big data era. (3) They could be very useful in graduate-level teaching of statistics courses that involve computation and simulation, and in training graduate students. (4) A specific set of moments for each variable is fairly rare in practice, but a specific distribution that would lead to these moments is very common; so access to these methods is needed by potentially a large group of people. (5) Simulated variables can be treated as outcomes or predictors in subsequent statistical analyses, as the variables are generated jointly. (6) Required quantities can either be specified or estimated from a real data set. (7) The final product after all these extensions will allow the specification of two prominent types of correlations (Pearson and Spearman correlations) and one emerging type (L-correlations), provided that they are within the limits imposed by the marginal distributions. This makes it feasible to generate linear and a broad range of nonlinear associations. (8) The continuous part can include virtually any shape (skewness, low or high peakedness, mode at the boundary, multimodality, etc.) that is spanned by power polynomials; the count data part can be under- or over-dispersed.
(9) The ability to jointly generate different types of data may facilitate comparisons among existing data analysis and computation methods in assessing the extent of conditions under which available methods work properly, and foster the development of new tools, especially in contexts where correlations play a significant role (e.g., longitudinal, clustered, and other multilevel settings). (10) The approaches presented here can be regarded as a variant of multivariate Gaussian copula-based methods, as (a) the binary and ordinal variables are assumed to have a latent normal distribution before discretization; (b) the count variables go through a correlation mapping procedure via the normal-to-anything approach; and (c) the continuous variables consist of polynomial terms involving normals. To the best of our knowledge, existing multivariate copulas are not designed to have the generality of encompassing all these variable types simultaneously. (11) As the mixed data generation routine involves latent variables that are subsequently discretized, it should be possible to see how the correlation structure changes when some variables in a multivariate continuous setting are dichotomized/ordinalized (Demirtas 2016; Demirtas and Hedeker 2016; Demirtas et al. 2016a). An important by-product of this research will be a better understanding of the nature of discretization, which may have significant implications for interpreting the coefficients in regression-type models when some predictors are discretized. On a related note, this could be useful in meta-analysis when some studies discretize variables and some do not. (12) The availability of a general mixed data generation algorithm can markedly facilitate simulated power and sample size calculations for a broad range of statistical models.
References

Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data with Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85, 3129–3139.

Barbiero, A., & Ferrari, P. A. (2015). Simulation of ordinal and discrete variables with given correlation matrix and marginal distributions. R package GenOrd. https://cran.r-project.org/web/packages/GenOrd

Bates, D., & Maechler, M. (2016). Sparse and dense matrix classes and methods. R package Matrix. http://www.cran.r-project.org/web/packages/Matrix

Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica, 58, 466–482.

Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized estimating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical Statistics, 14, 1085–1098.

Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.

Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation, 76, 1017–1025.

Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate interest centers on dichotomized outcomes through pre-specified thresholds. Communications in Statistics-Simulation and Computation, 36, 871–889.

Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine, 26, 3818–3821.

Demirtas, H. (2008).
On imputing continuous data when the eventual interest pertains to ordinalized outcomes via threshold concept. Computational Statistics and Data Analysis, 52, 2261–2271.

Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal, 51, 677–688.

Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal of Applied Statistics, 37, 489–500.

Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.

Demirtas, H. (2017a). Concurrent generation of binary and nonnormal continuous data through fifth order power polynomials. Communications in Statistics-Simulation and Computation, 46, 489–357.

Demirtas, H. (2017b). On accurate and precise generation of generalized Poisson variates. Communications in Statistics-Simulation and Computation, 46, 489–499.

Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric correlations: Modeling the change in correlations before and after discretization. Computational Statistics, 31, 1385–1401.

Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-reduction techniques for variance estimation in approximate Bayesian bootstrap imputation. Computational Statistics and Data Analysis, 51, 4064–4068.

Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal of Statistical Computation and Simulation, 78, 69–84.

Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strategies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.

Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communications in Statistics-Simulation and Computation, 37, 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distributions. Statistica Neerlandica, 62, 193–205.

Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal data. Statistics in Medicine, 27, 4086–4093.

Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper correlation bounds. The American Statistician, 65, 104–109.

Demirtas, H., & Hedeker, D. (2016). Computing the point-biserial correlation under any underlying continuous distribution. Communications in Statistics-Simulation and Computation, 45, 2744–2751.

Demirtas, H., Hedeker, D., & Mermelstein, J. M. (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31, 3337–3346.

Demirtas, H., Hu, Y., & Allozi, R. (2016b). Data generation with Poisson, binary, ordinal and normal components. R package PoisBinOrdNor. https://cran.r-project.org/web/packages/PoisBinOrdNor

Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 22, 2553–2575.

Demirtas, H., & Yavuz, Y. (2015). Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics, 25, 635–650.

Emrich, J. L., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician, 45, 302–304.

Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research, 47, 566–589.

Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–532.

Fréchet, M. (1951). Sur les tableaux de corrélation dont les marges sont données. Annales de l'Université de Lyon Section A, 14, 53–77.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., et al. (2016). Multivariate normal and t distributions. R package mvtnorm. https://cran.r-project.org/web/packages/mvtnorm

Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations. Boca Raton, FL: Chapman and Hall/CRC.

Higham, N. J. (2002). Computing the nearest correlation matrix—a problem from finance. IMA Journal of Numerical Analysis, 22, 329–343.

Hoeffding, W. (1994). Scale-invariant correlation theory. In N. I. Fisher & P. K. Sen (Eds.), The collected works of Wassily Hoeffding (original publication year 1940) (pp. 57–107). New York: Springer.

Hosking, J. R. M. (1990). L-moments: Analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society, Series B, 52, 105–124.

Nelsen, R. B. (2006). An introduction to copulas. Berlin, Germany: Springer.

R Development Core Team (2016). R: A language and environment for statistical computing. http://www.cran.r-project.org

Schaefer, J., Opgen-Rhein, R., Zuber, V., Ahdesmaki, M., Silva, A. D., & Strimmer, K. (2015). Efficient estimation of covariance and (partial) correlation. R package corpcor. https://cran.r-project.org/web/packages/corpcor

Serfling, R., & Xiao, P. (2007). A contribution to multivariate L-moments: L-comoment matrices. Journal of Multivariate Analysis, 98, 1765–1781.

Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471.
Yahav, I., & Shmueli, G. (2012). On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry, 28, 91–102.

Yucel, R. M., & Demirtas, H. (2010). Impact of non-normal random effects on inference by multiple imputation: A simulation assessment. Computational Statistics and Data Analysis, 54, 790–801.
Improving the Efficiency of the Monte-Carlo Methods Using Ranked Simulated Approach

Hani Michel Samawi
Abstract This chapter explores the concept of using the ranked simulated sampling approach (RSIS) to improve the well-known Monte-Carlo methods, introduced by Samawi (1999) and extended to steady-state ranked simulated sampling (SRSIS) by Al-Saleh and Samawi (2000). Both simulation sampling approaches are then extended to multivariate ranked simulated sampling (MVRSIS) and the multivariate steady-state ranked simulated sampling approach (MVSRSIS) by Samawi and Al-Saleh (2007) and Samawi and Vogel (2013). These approaches have been demonstrated to provide unbiased estimators and to improve the performance of some of the Monte-Carlo methods of single and multiple integral approximation. Additionally, the MVSRSIS approach has been shown to improve the performance and efficiency of Gibbs sampling (Samawi et al. 2012). Samawi and colleagues showed that their approach resulted in a large savings in the cost and time needed to attain a specified level of accuracy.
1 Introduction
The term Monte-Carlo refers to techniques that use random processes to approximate a non-stochastic k-dimensional integral of the form

  θ = ∫_{R^k} g(u) du,    (1.1)
(Hammersley and Handscomb 1964). The literature presents many approximation techniques, including Monte-Carlo methods. However, as the dimension of the integrals rises, the difficulty of the integration problem increases even for relatively low dimensions (see Evans and Swartz 1995). Given such complications, many researchers are confused about which method

H.M. Samawi, Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, Georgia 30460. e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2017. D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_2
to use; however, the advantages and disadvantages of each method are not the primary concern of this chapter. The focus of this chapter is the use of Monte-Carlo methods in multiple integration approximation.

The motivation for this research is based on the concepts of ranked set sampling (RSS), introduced by McIntyre (1952). The motivation rests on the fact that the i-th quantified unit of RSS is simply an observation from f_(i), where f_(i) is the density function of the i-th order statistic of a random sample of size n. When the underlying density is the uniform distribution on (0, 1), f_(i) follows a beta distribution with parameters (i, n − i + 1). Samawi (1999) was the first to explore the idea of RSS (Beta sampler) for integral approximation. He demonstrated that the procedure can improve the simulation efficiency based on the ratio of the variances. Samawi's ranked simulated sampling procedure, denoted by RSIS, generates an independent random sample U_(1), U_(2), …, U_(n), where U_(i) ∼ β(i, n − i + 1), i = 1, 2, …, n, and β(·, ·) denotes the beta distribution. The RSIS procedure constitutes an RSS based on random samples from the uniform distribution U(0, 1). The idea is to use this RSIS to compute (1.1) with k = 1, instead of using an SRS of size n from U(0, 1), when the range of the integral in (1.1) is (0, 1). In the case of an arbitrary range (a, b) of the integral in (1.1), Samawi (1999) used the sample X_(1), X_(2), …, X_(n) and the importance sampling technique to evaluate (1.1), where X_(i) = F_X^{−1}(U_(i)) and F_X(·) is the distribution function of a continuous random variable.
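The Beta sampler for k = 1 on (0, 1) can be sketched in a few lines. This is our own minimal illustration (function names, seed, and the test integrand are ours, assuming NumPy): each U_(i) is drawn from β(i, n − i + 1), and the sample mean of g over these draws estimates the integral, just as with a uniform sampler, but with smaller variance:

```python
import numpy as np

rng = np.random.default_rng(2024)

def rsis_estimate(g, n):
    # RSIS (Beta sampler): one draw from each order-statistic density,
    # U_(i) ~ Beta(i, n - i + 1); the average of their densities is U(0,1),
    # so the estimator remains unbiased.
    i = np.arange(1, n + 1)
    u = rng.beta(i, n - i + 1)
    return g(u).mean()

def uss_estimate(g, n):
    # Traditional uniform sampler (SRS from U(0,1)).
    return g(rng.uniform(size=n)).mean()

g = lambda u: u**2                     # target integral: 1/3 on (0, 1)
reps = 2000
rsis = np.array([rsis_estimate(g, 20) for _ in range(reps)])
uss = np.array([uss_estimate(g, 20) for _ in range(reps)])
print(rsis.mean(), uss.mean())         # both near 1/3
print(rsis.var(), uss.var())           # RSIS variance is smaller
```

The variance reduction follows because the uniform sampler's variance decomposes into the within-stratum variance used by RSIS plus a nonnegative between-stratum term.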
He showed theoretically and through simulation studies that using the RSIS sampler for evaluating (1.1) substantially improved the efficiency when compared with the traditional uniform sampler (USS).

Al-Saleh and Zheng (2002) introduced the idea of bivariate ranked set sampling (BVRSS) and showed through theory and simulation that BVRSS outperforms the bivariate simple random sample for estimating the population means. The BVRSS is as follows: Suppose (X, Y) is a bivariate random vector with the joint probability density function f_{X,Y}(x, y). Then,

1. A random sample of size n⁴ is identified from the population and randomly allocated into n² pools, each of size n², so that each pool is a square matrix with n rows and n columns.
2. In the first pool, identify the minimum value by judgment with respect to the first characteristic X, for each of the n rows.
3. For the n minima obtained in Step 2, the actual quantification is done on the pair that corresponds to the minimum value of the second characteristic, Y, identified by judgment. This pair, given the label (1, 1), is the first element of the BVRSS sample.
4. Repeat Steps 2 and 3 for the second pool, but in Step 3, the pair corresponding to the second minimum value with respect to the second characteristic, Y, is chosen for actual quantification. This pair is given the label (1, 2).
5. The process continues until the label (n, n) is ascertained from the n²-th (last) pool.
The procedure described above produces a BVRSS of size n². Let (X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, …, n and j = 1, 2, …, n, denote the BVRSS sample from f_{X,Y}(x, y), where f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) is the joint probability density function of (X_{[i](j)}, Y_{(i)[j]}). From Al-Saleh and Zheng (2002),

  f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = f_{Y_{(i)[j]}}(y) · f_{X_{(j)}}(x) f_{Y|X}(y|x) / f_{Y_{[j]}}(y),    (1.2)

where f_{X_{(j)}} is the density of the j-th order statistic for an SRS sample of size n from the marginal density of X, and f_{Y_{[j]}}(y) is the density of the corresponding Y value, given by

  f_{Y_{[j]}}(y) = ∫_{−∞}^{∞} f_{Y|X}(y|x) f_{X_{(j)}}(x) dx,

while f_{Y_{(i)[j]}}(y) is the density of the i-th order statistic of an iid sample from f_{Y_{[j]}}(y), i.e.,

  f_{Y_{(i)[j]}}(y) = c · (F_{Y_{[j]}}(y))^{i−1} (1 − F_{Y_{[j]}}(y))^{n−i} f_{Y_{[j]}}(y),

where F_{Y_{[j]}}(y) = ∫_{−∞}^{y} ∫_{−∞}^{∞} f_{Y|X}(w|x) f_{X_{(j)}}(x) dx dw.

Combining these results, Eq. (1.2) can be written as

  f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = c₁ (F_{Y_{[j]}}(y))^{i−1} (1 − F_{Y_{[j]}}(y))^{n−i} (F_X(x))^{j−1} (1 − F_X(x))^{n−j} f(x, y),    (1.3)

where

  c₁ = [n! / ((i − 1)!(n − i)!)] · [n! / ((j − 1)!(n − j)!)].

Furthermore, Al-Saleh and Zheng (2002) showed that

  (1/n²) Σ_{j=1}^{n} Σ_{i=1}^{n} f_{X_{[i](j)}, Y_{(i)[j]}}(x, y) = f(x, y).    (1.4)
For a variety of choices of f(u, v), one can have (U, V) bivariate uniform with a probability density function f(u, v), 0 < u, v < 1, such that U ∼ U(0, 1) and V ∼ U(0, 1) (see Johnson 1987). In that case, (U_{[i](j)}, V_{(i)[j]}), i = 1, 2, …, n and j = 1, 2, …, n, should have a bivariate probability density function given by

  f_{(j),(i)}(u, v) = [n! / ((j − 1)!(n − j)!)] · [n! / ((i − 1)!(n − i)!)] [F_{Y[j]}(v)]^{i−1} [1 − F_{Y[j]}(v)]^{n−i} u^{j−1} [1 − u]^{n−j} f(u, v).    (1.5)
Samawi and Al-Saleh (2007) extended the work of Samawi (1999) and Al-Saleh and Zheng (2002) to the Monte-Carlo multiple integration approximation of (1.1) when k = 2.
Moreover, to further improve some of the Monte-Carlo methods of integration, Al-Saleh and Samawi (2000) used steady-state ranked set simulated sampling (SRSIS) as introduced by Al-Saleh and Al-Omari (1999). SRSIS has been shown to be simpler and more efficient than Samawi's (1999) method. In Samawi and Vogel's (2013) work, the SRSIS algorithm introduced by Al-Saleh and Samawi (2000) was extended to the multivariate case for the approximation of multiple integrals using Monte-Carlo methods. However, to simplify the algorithms, we introduce only the bivariate integration problem; with this foundation, multiple integral problems are a simple extension.
2 Steady-State Ranked Simulated Sampling (SRSIS)
Al-Saleh and Al-Omari (1999) introduced the idea of multistage ranked set sampling (MRSS). To promote the use of MRSS in simulation and Monte-Carlo methods, let {X_i^(s); i = 1, 2, …, n} be an MRSS of size n at stage s. Assume that X_i^(s) has probability density function f_i^(s) and cumulative distribution function F_i^(s). Al-Saleh and Al-Omari demonstrated the following properties of MRSS:

1. f(x) = (1/n) Σ_{i=1}^{n} f_i^(s)(x),    (2.1)

2. If s → ∞, then

   F_i^(s)(x) → F_i^(∞)(x) =
     0,               if x < Q_{(i−1)/n},
     nF(x) − (i − 1),  if Q_{(i−1)/n} ≤ x < Q_{(i)/n},    (2.2)
     1,               if x ≥ Q_{(i)/n},

   for i = 1, 2, …, n, where Q_α is the 100α-th percentile of F(x).

3. If X ∼ U(0, 1), then for i = 1, 2, …, n, we have

   F_i^(∞)(x) =
     0,             if x < (i − 1)/n,
     nx − (i − 1),  if (i − 1)/n ≤ x < i/n,
     1,             if x ≥ i/n,

   and f_i^(∞)(x) = n for (i − 1)/n ≤ x < i/n, and 0 otherwise.

The SRSIS algorithm is extended to the multivariate case k ≥ 2 as follows:

1. For each (i_l, l = 1, 2, …, k), i_l = 1, 2, …, n, generate U_{i_l(i_s)} independently from U((i_l − 1)/n, i_l/n), for l, s = 1, 2, …, k and i_l, i_s = 1, 2, …, n.
2. Generate X_{i_l(i_s)} = F_{X_{i_l}}^{−1}(U_{i_l(i_s)}), for l, s = 1, 2, …, k and i_l, i_s = 1, 2, …, n, from F_{X_{i_l}}(x), l = 1, 2, …, k, respectively.
3. Then, generate the multivariate version of the steady-state simulated sample by using any technique for conditional random number generation.
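The steps above can be sketched for the simplest case. For uniform marginals, property 3 makes the steady-state draw straightforward: the i-th unit comes from U((i − 1)/n, i/n), and Step 2 maps it through an inverse CDF. The following is our own minimal sketch (function names ours; the Exp(1) marginal is chosen only as an example), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(99)

def srsis_uniform(n):
    # Steady-state sample for U(0,1): the i-th unit is drawn from
    # U((i-1)/n, i/n), matching property 3 (f_i^(inf) is uniform on the
    # i-th subinterval).
    i = np.arange(n)
    return rng.uniform(i / n, (i + 1) / n)

def srsis_marginal(F_inv, n):
    # Step 2: transform to an arbitrary marginal via the inverse CDF.
    return F_inv(srsis_uniform(n))

# Example: a steady-state sample from Exp(1) via F^{-1}(u) = -log(1 - u).
x = srsis_marginal(lambda u: -np.log1p(-u), 10)
print(np.sort(x))
```

Because each draw is confined to its own subinterval, the resulting sample is a stratified version of an ordinary SRS, which is the source of the efficiency gains discussed below.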
3 Monte-Carlo Methods for Multiple Integration Problems
Very good descriptions of the basics of the various Monte-Carlo methods have been provided by Hammersley and Handscomb (1964), Liu (2001), Morgan (1984), Robert and Casella (2004), and Shreider (1966). The Monte-Carlo methods described include crude, antithetic, importance, control variate, and stratified sampling approaches. However, when variables are related, Monte-Carlo methods cannot be used directly (i.e., in a manner similar to the way these methods are used in univariate integration problems), because when using the bivariate uniform probability density function f(u, v) as a sampler to evaluate Eq. (1.1) with k = 2, f(u, v) is not consistent. However, in this context it is reasonable to use the importance sampling method, and therefore it follows that other Monte-Carlo techniques can be used in conjunction with importance sampling. Thus, our primary concern is importance sampling.
3.1 Importance Sampling Method
In general, suppose that f is a density function on R^k such that the closure of the set of points where g(·) is non-zero is contained in the closure of the set of points where f(·) is non-zero. Let {U_i; i = 1, 2, …, n} be a sample from f(·). Then, because

  θ = ∫ [g(u)/f(u)] f(u) du,

Equation (1.1) can be estimated by

  θ̂ = (1/n) Σ_{i=1}^{n} g(u_i)/f(u_i).    (3.1)
Equation (3.1) is an unbiased estimator for (1.1), with variance given by

  Var(θ̂) = (1/n) ( ∫_{R^k} [g(u)²/f(u)] du − θ² ).
In addition, from the point of view of the strong law of large numbers, it is clear that θ̂ → θ almost surely as n → ∞. A limited number of distributional families exist in a multidimensional context and are commonly used as importance samplers. For example, the multivariate Student's family is used extensively in the literature as an importance sampler. Evans and Swartz (1995) indicated a need for developing families of multivariate distributions that exhibit a wide variety of shapes. In addition, statisticians want distributional families to have efficient algorithms for random variable generation and the capacity to be easily fitted to a specific integrand. This paper provides a new way of generating a bivariate sample based on bivariate steady-state sampling (BVSRSIS) that has the potential to extend the existing sampling methods. We also provide a means for introducing new samplers and for substantially improving the efficiency of the integration approximation based on those samplers.
3.2
Using Bivariate Steady-State Sampling (BVSRSIS)
Let

θ = ∫∫ g(x, y) dx dy.  (3.2)

To estimate θ, generate a bivariate sample of size n² from f(x, y), which mimics g(x, y) and has the same range, say [(X_ij, Y_ij), i = 1, 2, ..., n and j = 1, 2, ..., n]. Then

θ̂ = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_ij, y_ij)/f(x_ij, y_ij).  (3.3)
Equation (3.3) is an unbiased estimate for (3.2) with variance

Var(θ̂) = (1/n²) ( ∫∫ g²(x, y)/f(x, y) dx dy − θ² ).  (3.4)
To estimate (3.2) using BVSRSIS, generate a bivariate sample of size n², as described above, say [(X_{[i](j)}, Y_{(i)[j]}), i = 1, 2, ..., n and j = 1, 2, ..., n]. Then

θ̂_BVSRSIS = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_{[i](j)}, y_{(i)[j]})/f(x_{[i](j)}, y_{(i)[j]}).  (3.5)
Equation (3.5) is also an unbiased estimate for (3.2) using (2.5). Also, by using (2.5), the variance of (3.5) can be expressed as
24
H.M. Samawi
Var(θ̂_BVSRSIS) = Var(θ̂) − (1/n⁴) Σ_{i=1}^{n} Σ_{j=1}^{n} (θ_{g/f}^{(i,j)} − θ_{g/f})²,  (3.6)

where θ_{g/f}^{(i,j)} = E[g(X_{[i](j)}, Y_{(i)[j]})/f(X_{[i](j)}, Y_{(i)[j]})] and θ_{g/f} = E[g(X, Y)/f(X, Y)] = θ. The variance in (3.6) is therefore less than the variance in (3.4).
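The judgment-ranked structure behind the indices [i](j) and (i)[j] can be sketched with one cycle of bivariate ranked set sampling in the style of Samawi and Al-Saleh; the steady-state variant (BVSRSIS) refines this construction further. The uniform sampler and the test integrand (the one evaluated in Sect. 3.3.1) are our own illustrative choices:

```python
import numpy as np

def bvrss(sample_pair, n, rng):
    """One cycle of bivariate ranked set sampling (a sketch).

    For each (i, j): draw n sets of n pairs; from each set keep the pair
    whose X is the i-th smallest; among those n pairs keep the one whose
    Y is the j-th smallest.  BVSRSIS iterates this kind of construction."""
    out = np.empty((n, n, 2))
    for i in range(n):
        for j in range(n):
            sets = sample_pair(n * n, rng).reshape(n, n, 2)
            ix = np.argsort(sets[:, :, 0], axis=1)[:, i]   # i-th smallest X per set
            keep = sets[np.arange(n), ix]                  # n candidate pairs
            out[i, j] = keep[np.argsort(keep[:, 1])[j]]    # j-th smallest Y among them
    return out.reshape(n * n, 2)

# Importance-sampling estimate in the style of (3.5), with f(u, v) = 1 on the
# unit square, applied to the integrand (1 + v) * exp(u * (1 + v)).
rng = np.random.default_rng(5)
uniform_pair = lambda size, rng: rng.random((size, 2))
s = bvrss(uniform_pair, 10, rng)
theta_hat = ((1 + s[:, 1]) * np.exp(s[:, 0] * (1 + s[:, 1]))).mean()
```

Even with only n² = 100 retained pairs, the stratification over both ranks makes the estimate markedly more stable than plain bivariate uniform sampling, which is the effect quantified in Table 1 below.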
3.3
Simulation Study
This section presents the results of a simulation study that compares the performance of the importance sampling method described above using BVSRSIS schemes with the performance of the bivariate simple random sample (BVUSS) and the BVRSS schemes of Samawi and Al-Saleh (2007), as introduced by Samawi and Vogel (2013).
3.3.1
Illustration for Importance Sampling Method When Integral's Limits Are (0, 1) × (0, 1)
As in Samawi and Al-Saleh (2007), the impact of BVSRSIS on importance sampling is illustrated by evaluating the following integral:

θ = ∫₀¹ ∫₀¹ (1 + v) exp(u(1 + v)) du dv = 3.671.  (3.7)
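The value 3.671 in (3.7) can be checked directly: the integral has the closed form e² − e − 1, and even a crude estimate with the flat sampler f ≡ 1 (a sketch of the BVUSS baseline; sample size is our choice) recovers it:

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.random(200_000)
v = rng.random(200_000)
g = (1 + v) * np.exp(u * (1 + v))   # integrand of Eq. (3.7)
theta_hat = g.mean()                # crude Monte-Carlo estimate with f(u, v) = 1

exact = np.e ** 2 - np.e - 1        # closed form of (3.7)
```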
This example uses four bivariate sample sizes: n = 20, 30, 40, and 50. To estimate the variances using the simulation method, we use 2,000 simulated samples from BVUSS and BVSRSIS. Many choices of bivariate and multivariate distributions with uniform marginals on [0, 1] are available (Johnson 1987). However, for this simulation, we chose Plackett's uniform distribution (Plackett 1965), which is given by

f(u, v) = ψ[(ψ − 1)(u + v − 2uv) + 1] / {[1 + (u + v)(ψ − 1)]² − 4ψ(ψ − 1)uv}^{3/2},  0 < u, v < 1, ψ > 0.  (3.8)

The parameter ψ governs the dependence between the components (U, V) distributed according to f. Three cases explicitly indicate the role of ψ (Johnson 1987): as ψ → 0, U = 1 − V; when ψ = 1, U and V are independent; and as ψ → ∞, U = V.
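A quick numerical sanity check of (3.8) confirms that the density integrates to 1 over the unit square; the midpoint grid and the choice ψ = 2 are ours:

```python
import numpy as np

def plackett_density(u, v, psi):
    """Plackett's uniform (copula) density, Eq. (3.8)."""
    num = psi * ((psi - 1) * (u + v - 2 * u * v) + 1)
    den = ((1 + (u + v) * (psi - 1)) ** 2 - 4 * psi * (psi - 1) * u * v) ** 1.5
    return num / den

m = 400
mid = (np.arange(m) + 0.5) / m               # midpoint grid on (0, 1)
U, V = np.meshgrid(mid, mid)
total = plackett_density(U, V, 2.0).mean()   # midpoint rule for the double integral
```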
Table 1 Efficiency of estimating (3.7) using BVSRSIS relative to BVUSS and BVRSS

n \ ψ     1                   2
20        289.92 (8.28)       273.68 (9.71)
30        649.00 (12.94)      631.31 (13.06)
40        1165.31 (16.91)     1086.46 (18.60)
50        1725.25 (21.67)     1687.72 (23.03)

Note: Values shown in bold were extracted from Samawi and Al-Saleh (2007)
Table 1 presents the relative efficiencies of our estimators using BVSRSIS in comparison with BVUSS, together with BVRSIS relative to BVUSS, for estimating (3.7). As illustrated in Table 1, BVSRSIS is clearly more efficient than either BVUSS or BVRSIS when used for estimation.
3.3.2
Illustration When the Integral's Limits Are an Arbitrary Subset of R²
Recent work by Samawi and Al-Saleh (2007) and Samawi and Vogel (2013) used an identical example in which the range of the integral was not (0, 1); the authors evaluated the bivariate normal distribution. For integrations with high dimensions and a requirement of low relative error, the evaluation of the multivariate normal distribution function remains one of the unsolved problems in simulation (e.g., Evans and Swartz 1995). To demonstrate how BVSRSIS increases the precision of evaluating the multivariate normal distribution, we illustrate the method by evaluating the bivariate normal distribution as follows:

θ = ∫_{−∞}^{z₁} ∫_{−∞}^{z₂} g(x, y) dx dy,  (3.9)

where g(x, y) is the N₂(0, 0, 1, 1, ρ) density. Given the similar shapes of the marginal of the normal and the marginal of the logistic probability density functions, it is natural to attempt to approximate the bivariate normal cumulative distribution function by the bivariate logistic cumulative
distribution function. For the multivariate logistic distribution and its properties, see Johnson and Kotz (1972). The density of the bivariate logistic (Johnson and Kotz 1972) is chosen to be
f(x, y) = [2π² e^{−π(x+y)/√3} (1 + e^{−πz₁/√3} + e^{−πz₂/√3})] / [3 (1 + e^{−πx/√3} + e^{−πy/√3})³],  −∞ < x < z₁, −∞ < y < z₂.  (3.10)
It can be shown that the marginal of X is given by
f(x) = [π e^{−πx/√3} (1 + e^{−πz₁/√3} + e^{−πz₂/√3})] / [√3 (1 + e^{−πx/√3} + e^{−πz₂/√3})²],  −∞ < x < z₁.  (3.11)
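Since (3.11) is obtained by integrating (3.10) over y ∈ (−∞, z₂), the two formulas can be cross-checked numerically; the truncation points z₁ = z₂ = 1 and the midpoint grid are our own choices for the sketch:

```python
import numpy as np

SQ3 = np.sqrt(3.0)
z1, z2 = 1.0, 1.0          # arbitrary truncation points for the check
C = 1 + np.exp(-np.pi * z1 / SQ3) + np.exp(-np.pi * z2 / SQ3)

def f_joint(x, y):
    """Truncated bivariate logistic density, Eq. (3.10)."""
    num = 2 * np.pi ** 2 * np.exp(-np.pi * (x + y) / SQ3) * C
    den = 3 * (1 + np.exp(-np.pi * x / SQ3) + np.exp(-np.pi * y / SQ3)) ** 3
    return num / den

def f_marg(x):
    """Marginal density of X, Eq. (3.11)."""
    num = np.pi * np.exp(-np.pi * x / SQ3) * C
    den = SQ3 * (1 + np.exp(-np.pi * x / SQ3) + np.exp(-np.pi * z2 / SQ3)) ** 2
    return num / den

# Midpoint rule: integrating (3.10) over y in (-inf, z2) should recover (3.11).
edges = np.linspace(-30.0, z2, 200_001)
mid = (edges[:-1] + edges[1:]) / 2
h = edges[1] - edges[0]
val = f_joint(0.0, mid).sum() * h
```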
Now let W = Y + (√3/π) ln(1 + e^{−πX/√3} + e^{−πz₂/√3}). Then it can be shown that

f(w | x) = [2π e^{−πw/√3}] / { √3 [ (1 + e^{−πx/√3})/(1 + e^{−πx/√3} + e^{−πz₂/√3}) + e^{−πw/√3} ]³ },  −∞ < w < z₂ + (√3/π) ln(1 + e^{−πx/√3} + e^{−πz₂/√3}).
Given the k-th permuted data set, x_k and y_k, where x_k = {{x_j, j: z_kj = 1, j = 1, 2, ..., n₁}, {y_j, j: z_k(n₁+j) = 1, j = 1, 2, ..., n₂}} and y_k = {{x_j, j: z_kj = 2, j = 1, 2, ..., n₁}, {y_j, j: z_k(n₁+j) = 2, j = 1, 2, ..., n₂}}, Welch's t-test and the Wilcoxon rank-sum test are applied and the test statistics and p-values are calculated. Denote by s_k^t and s_k^w the test statistics and p_k^t and p_k^w the p-values at the k-th permutation associated with Welch's t-test and the Wilcoxon rank-sum test, respectively. Similarly, denote by s_o^t and s_o^w the observed test statistics and p_o^t and p_o^w the observed p-values, calculated using the observed data, x and y. Define m_k(p_k^t, p_k^w) as the minimum of p_k^t and p_k^w. The permutation p-values of Welch's t-test and the Wilcoxon rank-sum test are given by
p_p.welch = (2/B) min( Σ_{k=1}^{B} I(s_k^t ≤ s_o^t), Σ_{k=1}^{B} I(s_k^t > s_o^t) ),  (6)

p_p.wilcox = (2/B) min( Σ_{k=1}^{B} I(s_k^w ≤ s_o^w), Σ_{k=1}^{B} I(s_k^w > s_o^w) ),  (7)
where I(·) = 1 if the condition in the parentheses is true; otherwise, I(·) = 0. Similarly, the p-value associated with the minimum p-value is given by

p_minp = (1/B) Σ_{k=1}^{B} I( m_k(p_k^t, p_k^w) ≤ m_o(p_o^t, p_o^w) ).  (8)
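Equations (6)-(8) can be implemented directly; the sketch below uses scipy's Welch t and Wilcoxon rank-sum statistics, and the function and variable names are our own:

```python
import numpy as np
from scipy import stats

def perm_pvalues(x, y, B=2000, seed=0):
    """Permutation p-values of Eqs. (6)-(8): two-sided permutation p-values
    for Welch's t and the Wilcoxon rank-sum statistics, plus the p-value of
    the minimum-p combination."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n1 = len(x)
    t_o = stats.ttest_ind(x, y, equal_var=False)   # observed Welch's t-test
    w_o = stats.ranksums(x, y)                     # observed rank-sum test
    st, sw, pt, pw = (np.empty(B) for _ in range(4))
    for k in range(B):
        perm = rng.permutation(pooled)             # k-th permuted data set
        t_k = stats.ttest_ind(perm[:n1], perm[n1:], equal_var=False)
        w_k = stats.ranksums(perm[:n1], perm[n1:])
        st[k], pt[k] = t_k.statistic, t_k.pvalue
        sw[k], pw[k] = w_k.statistic, w_k.pvalue
    # Eqs. (6) and (7): two-sided permutation p-values from the statistics
    p_welch = 2 / B * min((st <= t_o.statistic).sum(), (st > t_o.statistic).sum())
    p_wilcox = 2 / B * min((sw <= w_o.statistic).sum(), (sw > w_o.statistic).sum())
    # Eq. (8): compare the per-permutation minimum p-value with the observed one
    p_minp = np.mean(np.minimum(pt, pw) <= min(t_o.pvalue, w_o.pvalue))
    return p_welch, p_wilcox, p_minp
```

Note that (6) and (7) are Monte-Carlo permutation p-values, so with B permutations they are resolved only to multiples of 2/B.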
46
J.R. Hoag and C.-L. Kuo
Overall, 10 two-sample location tests are compared, including t-tests, Wilcoxon rank-sum tests, two-stage tests, and permutation tests. For convenience, when presenting results, each test is referenced by an abbreviated notation, shown below in parentheses.

• t-tests: pooled t-test (pooled), Welch's t-test (welch), permutation Welch's t-test (p.welch), robust t-test (robust.t; F-test for the equality of variances followed by pooled t-test or Welch's t-test)
• Wilcoxon rank-sum tests: Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox)
• two-stage tests: two-stage test with the first-stage α level at 0.5%, 2.5%, and 5% (2stage0.5, 2stage2.5, 2stage5)
• minimum p-value: minimum p-value of permutation Welch's t-test and Wilcoxon rank-sum test (minp)
3
Simulations
Simulated data were used to test the null hypothesis H₀: µ₁ = µ₂ versus the alternative hypothesis H₁: µ₁ ≠ µ₂. Assume x = {x₁, ..., x_{n₁}} was simulated from N(µ₁, σ₁²) and y = {y₁, ..., y_{n₂}} was simulated from N(µ₂, σ₂²). Without loss of generality, let σ₁ and σ₂ be set equal to 1. Let µ₁ be zero, and let µ₂ be 0 under the null hypothesis and 0.5 under the alternative hypothesis, the conventional value suggested by Cohen (2013) for a medium effect size. n₁ is set equal to or double the size of n₂. n₁ and n₂ were chosen to detect the difference between µ₁ and µ₂ with 20, 40, 60, or 80% power at the 5% significance level. When n₁ = n₂, the sample size required per group was 11, 25, 41, and 64, respectively. When n₁ = 2n₂, n₁ = 18, 38, 62, 96 and n₂ was half of n₁ in each setting. The same sample sizes were used for the null, non-normal, and heteroscedastic settings. Power analysis was conducted using G*Power software (Faul et al. 2007). Fleishman's power method to simulate normal and non-normal data is based on a polynomial function given by

x = f(ω) = a + bω + cω² + dω³,  (9)
where ω is a random value from the standard normal with mean 0 and standard deviation 1. The coefficients a, b, c, and d are determined by the first four moments of X, with the first two moments set to 0 and 1. For distributions with mean and standard deviation different from 0 and 1, the data can be shifted and/or rescaled after being simulated. Let γ₃ denote the skewness and γ₄ denote the kurtosis. γ₃ and γ₄ are both set to 0 if X is normally distributed. The distribution is left-skewed if γ₃ < 0 and right-skewed if γ₃ > 0. γ₄ is smaller than 0 for a platykurtic distribution and greater than 0 for a leptokurtic distribution. By the moments of the standard
Normal and Non-normal Data Simulations for the Evaluation …
47
normal distribution, we can derive the equations below and solve a , b, c, and d via the Newton-Raphson method or any other non-linear root-finding method,
a = −c  (10)

b² + 6bd + 2c² + 15d² − 1 = 0  (11)

2c(b² + 24bd + 105d² + 2) − γ₃ = 0  (12)

24[bd + c²(1 + b² + 28bd) + d²(12 + 48bd + 141c² + 225d²)] − γ₄ = 0  (13)
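The system (10)-(13) can be solved numerically in a few lines; this sketch mirrors what a routine like the chapter's "Fleishman.coef.NN" accomplishes, but the solver (scipy's fsolve), the starting values, and the target γ₃ = 0.8, γ₄ = 0 are our own choices:

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coeffs(g3, g4):
    """Solve Eqs. (11)-(13) for (b, c, d); a = -c by Eq. (10).
    g3 is the target skewness, g4 the target (excess) kurtosis."""
    def eqs(p):
        b, c, d = p
        return [b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                       # (11)
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - g3,                   # (12)
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - g4]      # (13)
    b, c, d = fsolve(eqs, [1.0, 0.1, 0.0])
    return -c, b, c, d

a, b, c, d = fleishman_coeffs(0.8, 0.0)   # a skewed setting used in the chapter

# simulate via Eq. (9) and check the moments empirically
rng = np.random.default_rng(0)
w = rng.standard_normal(200_000)
x = a + b * w + c * w**2 + d * w**3
```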
One limitation of Fleishman's power method is that it does not cover the entire domain of skewness and kurtosis. Given γ₃, the relationship between γ₃ and γ₄ is described by the inequality γ₄ ≥ γ₃² − 2 (Devroye 1986). Precisely, the empirical lower bound of γ₄ given γ₃ = 0 is −1.151320 (Headrick and Sawilowsky 2000). By Fleishman's power method, three conditions were investigated: (1) heterogeneous variance, (2) skewness, and (3) kurtosis. Each was investigated using equal and unequal sample sizes to achieve a broad range of power. µ₁ = µ₂ = 0 under the null hypothesis. µ₁ = 0 and µ₂ = 0.5 when the alternative hypothesis is true. Other parameters were manipulated differently. For (1), normal data was simulated with equal and unequal variances by letting σ₁ = 1 and σ₂ = 0.5, 1, 1.5. For (2), skewed data was simulated assuming equal variance at 1, equal kurtosis at 0, and equal skewness at 0, 0.4, 0.8. Similarly, for (3), kurtotic data was simulated assuming equal variance at 1, equal skewness at 0, and equal kurtosis at 0, 5, 10. To visualize the distributions from which the data were simulated, 10⁶ data points were simulated from the null distributions and used to create density plots (see Fig. 1). To best visualize the distributions, the data range was truncated at −5 and 5. The left panel
Fig. 1 Distributions used to simulate heteroscedastic, skewed, and kurtotic data when the null hypothesis of equal population means is true (left: normal distributions with means at 0 and standard deviations at 0.5, 1, and 1.5; middle: distributions with means at 0, standard deviations at 1, skewness at 0, 0.4, and 0.8, and kurtosis at 0; right: distributions with means at 0, standard deviations at 1, skewness at 0, and kurtosis at 0, 5, and 10)
demonstrates the distributions for the investigation of heterogeneous variance. One sample is simulated from the standard normal (green). The other sample is simulated from the normal with mean 0 and standard deviation 0.5 (red) or 1.5 (blue). For skewness and kurtosis, the two distributions were assumed to be exactly the same under the null hypothesis. Although these distributions are for null simulations, the distributions for power simulations are the same for one sample and shifted from 0 to 0.5 for the second sample. The number of simulation replicates was 10,000 for null simulations and 1,000 for power simulations. The number of permutation replicates for the permutation tests at each simulation replicate was 2,000, i.e. B = 2000. The significance level for the two-sample location comparison was set to 0.05. All the simulations were carried out in R 3.1.2 (Team 2014). The Fleishman coefficients (a, b, c, and d) given the first four moments (the first two moments 0 and 1) were derived via the R function "Fleishman.coef.NN" (Demirtas et al. 2012).
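The per-group sample sizes quoted earlier in this section (11, 25, 41, and 64 for n₁ = n₂ at 20-80% power) can be reproduced without G*Power via the noncentral t distribution; this sketch uses scipy in place of the chapter's G*Power workflow:

```python
import math
from scipy import stats

def t_power(n_per_group, d=0.5, alpha=0.05):
    """Exact power of the two-sided pooled two-sample t-test with equal
    group sizes and standardized effect size d (noncentral t)."""
    df = 2 * n_per_group - 2
    nc = d * math.sqrt(n_per_group / 2)        # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(tcrit, df, nc)) + stats.nct.cdf(-tcrit, df, nc)

n = 2
while t_power(n) < 0.80:
    n += 1   # smallest per-group n giving at least 80% power at d = 0.5
```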
4
Results
The simulation results are presented in figures. In each figure, the type I error or power is presented on the y-axis and the effective sample size per group to achieve 20%, 40%, 60%, and 80% power is presented on the x-axis, assuming the pooled t-test is appropriate. The effective sample size per group for two-sample mean comparison is defined as

n_e = 2n₁n₂/(n₁ + n₂).  (14)
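Equation (14) ties the unequal-allocation settings to the balanced ones; e.g., the setting n₁ = 96, n₂ = 48 has the same effective per-group size as the balanced n₁ = n₂ = 64 (a trivial sketch):

```python
def effective_n(n1, n2):
    """Effective per-group sample size, Eq. (14)."""
    return 2 * n1 * n2 / (n1 + n2)
```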
Each test is represented by a colored symbol. Given an effective sample size, the results of

1. t-tests (pooled t-test, theoretical and permutation Welch's t-tests, robust t-test),
2. Wilcoxon rank-sum tests (theoretical and permutation Wilcoxon rank-sum tests) and the minimum p-value (minimum p-value of permutation Welch's t-test and Wilcoxon rank-sum test), and
3. two-stage tests (normality test with the α level at 0.5, 2.5, and 5% for both samples, followed by Welch's t-test or Wilcoxon rank-sum test)

are aligned in three columns from left to right. The results for n₁ = n₂ are in the left panels and those for n₁ = 2n₂ are in the right panels. In the figures that present type I error results, two horizontal lines y = 0.0457 and y = 0.0543 are added to judge whether the type I error is correct. The type I error is considered correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543). We use valid to describe tests that maintain a correct type I error, and liberal and conservative for tests whose type I error is above or under the nominal level, respectively.
Fig. 2 Null simulation results for normal data with equal or unequal variances. y = 0.0457 and y = 0.0543 are added to judge whether the type I error is correct. The type I error is considered correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543). The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
4.1
Heterogeneous Variance
The null simulation results are presented in Fig. 2. All tests maintain a correct type I error when the sample sizes are equal. Wilcoxon rank-sum tests and the minimum p-value have a slightly inflated type I error when the variances are unequal and relatively small (σ₁ = 1 and σ₂ = 0.5). When the sample sizes are unequal, all tests are valid as long as the variances are equal. When the variances are unequal, however, neither the pooled t-test nor the Wilcoxon rank-sum test maintains a correct type I
Fig. 3 Power simulation results for normal data with equal or unequal variances. The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
error. Both exhibit a conservative type I error when σ₁ = 1 and σ₂ = 0.5 and a liberal type I error when σ₁ = 1 and σ₂ = 1.5. The pooled t-test in particular is more strongly influenced than the Wilcoxon rank-sum test. The minimum p-value involves the Wilcoxon rank-sum test and is thus similarly affected, but not as severely as the Wilcoxon rank-sum test alone. The robust t-test and permutation Welch's t-test behave similarly when the sample size is small. Welch's t-test and the two-stage tests are the only tests with type I error protected for heterogeneous variance regardless of sample sizes. The power simulation results are presented in Fig. 3. Unsurprisingly, the settings of σ₁ = 1, σ₂ = 0.5 result in higher power than σ₁ = 1, σ₂ = 1, and than σ₁ = 1, σ₂ =
Fig. 4 Null simulation results for skewed data. y = 0.0457 and y = 0.0543 are added to judge whether the type I error is correct. The type I error is considered correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543). The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
1.5. All the valid tests have similar power. Tests with an inflated type I error tend to be slightly more powerful. In contrast, tests that have a conservative type I error tend to be slightly less powerful.
Fig. 5 Power simulation results for skewed data. The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
4.2
Skewness
The two sample size ratios result in similar patterns of type I error and power. All tests maintain a correct type I error across the spectrum of sample size and skewness except 2stage0.5, which has an inflated type I error when the sample size is insufficiently large and the skewness is 0.8 (Fig. 4). A similar pattern for power is observed within groups (three groups in three columns). With an increase in skewness, the power of the t-tests (pooled t-test, theoretical and permutation Welch's t-tests, robust t-test) remains at the level as planned and is slightly lower than that of the other tests when the skewness reaches 0.8 (Fig. 5).
Fig. 6 Null simulation results for kurtotic data. y = 0.0457 and y = 0.0543 are added to judge whether the type I error is correct. The type I error is considered correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543). The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
4.3
Kurtosis
Based on Fig. 6, all tests maintain a correct type I error when the effective sample size is equal to or greater than 25. When the effective sample size is small, e.g. n_e = 11, and the kurtosis is 10, t-tests (with the exception of permutation Welch's t-test) exhibit a conservative type I error; the type I error of permutation Welch's t-test is protected, with consistently similar power to the other t-tests. Figure 7 displays tests within groups
Fig. 7 Power simulation results for kurtotic data. The tests are grouped into three groups from left to right for result presentation: (1) t-tests: pooled t-test (pooled), theoretical and permutation Welch's t-tests (welch and p.welch), robust t-test (robust.t); (2) Wilcoxon rank-sum test (wilcox), permutation Wilcoxon rank-sum test (p.wilcox), and the minimum p-value (minp); (3) two-stage tests: two-stage tests with the first-stage α level at 0.5, 2.5, and 5% (2stage0.5, 2stage2.5, 2stage5)
(three groups in three columns) with similar power. While not apparent when the sample size is small, t-tests are not as compelling as other tests when the kurtosis is 5 or greater. Wilcoxon rank-sum tests are the best performing tests in this setting, but the power gain is not significant when compared to the minimum p-value and the two-stage tests.
5
Discussion
The goal of this chapter was to conduct an evaluation of a variety of tests for two-sample location comparison using Monte Carlo simulation techniques. In conclusion, heterogeneous variance is not a problem for any of the tests when the sample sizes are equal. For unequal sample sizes, Welch's t-test maintains robustness against heterogeneous variance. The interaction between heterogeneous variance and sample size is consistent with the statement of Kohr and Games (1974) for general two-sample location comparison that a test is conservative when the larger sample has the larger population variance and is liberal when the smaller sample has the larger population variance. When the normality assumption is violated by moderately skewed or kurtotic data, Welch's t-test and the other t-tests maintain a correct type I error so long as the sample size is sufficiently large. The t-tests, however, are not as powerful as the Wilcoxon rank-sum test and others. As long as the effective sample size is 11 or greater, the distributions of Welch's t-test and the Wilcoxon rank-sum test statistics are well approximated by their theoretical distributions. There is no need to apply a permutation test for a small sample size of that level if a theoretical test is appropriately applied. A permutation test is not a solution to heterogeneous variance, but it may protect type I error when normality is violated and the sample size is small. An extra test for equality of variance or for normality may fail to protect type I error, e.g. robust.t for heterogeneous variance and 2stage0.5 for skewed data when the sample size is insufficiently large.
While t-tests are not sensitive to non-normal data for the protection of type I error, a conservative normality test with α as low as 0.5% can lead to a biased type I error. In simulations, the two-stage tests perform well against heterogeneous variance, but this is only the case because the simulation settings assume both distributions are normal, and the two-stage tests in fact are Welch's t-test most of the time. The two-stage tests are meant to optimize the test based on whether the normality assumption is met. These tests do not handle heterogeneous variance. Under circumstances of non-normal distributions and heterogeneous variance, it has been shown by Rasch et al. that two-stage tests may lead to an incorrect type I error and lose power (Rasch et al. 2011). The literature and these simulation results reveal that the optimal test procedure is a sensible switch between Welch's t-test and the Wilcoxon rank-sum test. Multi-stage tests such as the robust t-test and the two-stage tests are criticized for their biased type I error. The minimum p-value simply takes the best result of Welch's t and Wilcoxon rank-sum tests and is not designed specifically for heterogeneous variance or non-normal data distributions. While it shares the flaw of multi-stage tests with no protection of type I error and secondary power, the minimum p-value is potentially robust to heterogeneous variance and non-normal distributions. A simple way to protect type I error and power against heterogeneous variance is to plan a study with two samples of equal size. In this setting, the Wilcoxon rank-sum test maintains robustness to non-normal data and is competitive with Welch's t-test when the normality assumption is met. If the sample sizes are unequal, the
minimum p-value is a robust option but computationally more expensive. Otherwise, heterogeneous variance and deviation from a normal distribution must be weighed in selecting between Welch's t-test and the Wilcoxon rank-sum test. Alternatively, Zimmerman and Zumbo recommended that Welch's t-test performs better than the Wilcoxon rank-sum test under conditions of both equal and unequal variance on data that has been pre-ranked (Zimmerman and Zumbo 1993). Although moderately non-normal data has been the focus of this chapter, the conclusions are still useful as a general guideline. When data is extremely non-normal, none of the tests will be appropriate. Data transformation may be applied to meet the normality assumption (Osborne 2005, 2010). Not every distribution, however, can be transformed to normality (e.g. L-shaped distributions). In this scenario, data dichotomization may be applied to simplify statistical analysis if information loss is not a concern (Altman and Royston 2006). Additional concerns include multimodal distributions, which may be a result of subgroups or data contamination (Marrero 1985); a few outliers can distort the distribution completely. All of these considerations serve as a reminder that, prior to selecting a two-sample location test, the first key to success is full practical and/or clinical understanding of the data being assessed.
References

Altman, D. G., & Royston, P. (2006). The cost of dichotomising continuous variables. BMJ, 332(7549), 1080.
Beasley, T. M., Erickson, S., & Allison, D. B. (2009). Rank-based inverse normal transformations are increasingly used, but are they merited? Behavior Genetics, 39(5), 580–595.
Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40(1), 26–42.
Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic Press.
de Winter, J. C. (2013). Using the Student's t-test with extremely small sample sizes. Practical Assessment, Research & Evaluation, 18(10), 1–12.
Demirtas, H., Hedeker, D., & Mermelstein, R. J. (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27), 3337–3346.
Devroye, L. (1986). Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pp. 260–265. ACM.
Ernst, M. D., et al. (2004). Permutation methods: A basis for exact inference. Statistical Science, 19(4), 676–685.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191.
Headrick, T. C., & Sawilowsky, S. S. (2000). Weighted simplex procedures for determining boundary points and constants for the univariate and multivariate power methods. Journal of Educational and Behavioral Statistics, 25(4), 417–436.
Kohr, R. L., & Games, P. A. (1974). Robustness of the analysis of variance, the Welch procedure and a Box procedure to heterogeneous variances. The Journal of Experimental Education, 43(1), 61–69.
Marrero, O. (1985). Robustness of statistical tests in the two-sample location problem. Biometrical Journal, 27(3), 299–316.
Osborne, J. (2005). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42–50.
Osborne, J. W. (2010). Improving your data transformations: Applying the Box-Cox transformation. Practical Assessment, Research & Evaluation, 15(12), 1–9.
Rasch, D., Kubinger, K. D., & Moder, K. (2011). The two-sample t test: Pre-testing its assumptions does not pay off. Statistical Papers, 52(1), 219–231.
Rochon, J., Gondan, M., & Kieser, M. (2012). To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Medical Research Methodology, 12(1), 81.
Royston, J. (1982). An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics, 115–124.
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behavioral Ecology, 17(4), 688–690.
Sawilowsky, S. S. (2005). Misconceptions leading to choosing the t test over the Wilcoxon Mann-Whitney test for shift in location parameter.
Schucany, W. R., & Tony Ng, H. (2006). Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t. Communications in Statistics—Theory and Methods, 35(12), 2275–2286.
Team, R. C. (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3–4), 350–362.
Zimmerman, D. W. (1996). A note on homogeneity of variance of scores and ranks. The Journal of Experimental Education, 64(4), 351–362.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. The Journal of Experimental Education, 67(1), 55–68.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57(1), 173–181.
Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 47(3), 523.
Anatomy of Correlational Magnitude Transformations in Latency and Discretization Contexts in Monte-Carlo Studies

Hakan Demirtas and Ceren Vardar-Acar
Abstract This chapter is concerned with the assessment of correlational magnitude changes when a subset of the continuous variables that may marginally or jointly follow nearly any distribution in a multivariate setting is dichotomized or ordinalized. Statisticians generally regard discretization as a bad idea on the grounds of power, information, and effect size loss. Despite this undeniable disadvantage and legitimate criticism, its widespread use in social, behavioral, and medical sciences stems from the fact that discretization could yield simpler, more interpretable, and understandable conclusions, especially when large audiences are targeted for the dissemination of the research outcomes. We do not intend to attach any negative or positive connotations to discretization, nor do we take a position of advocacy for or against it. The purpose of the current chapter is providing a conceptual framework and computational algorithms for modeling the correlation transitions under specified distributional assumptions within the realm of discretization in the context of the latency and threshold concepts. Both directions (identification of the pre-discretization correlation value in order to attain a specified post-discretization magnitude, and the other way around) are discussed. The ideas are developed for bivariate settings; a natural extension to the multivariate case is straightforward by assembling the individual correlation entries. The paradigm under consideration has important implications and broad applicability in the stochastic simulation and random number generation worlds. The proposed algorithms are illustrated by several examples; feasibility and performance of the methods are demonstrated by a simulation study.
H. Demirtas (B)
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago, 1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: [email protected]

C. Vardar-Acar
Department of Statistics, Middle East Technical University, Ankara, Turkey
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2017
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_4
1 Introduction
Unlike natural (true) dichotomies such as male versus female, conductor versus insulator, vertebrate versus invertebrate, and in-patient versus out-patient, some binary variables are derived through dichotomization of underlying continuous measurements. Such artificial dichotomies often arise across many scientific disciplines. Examples include obesity status (obese versus non-obese) based on body mass index, preterm versus term babies given the gestation period, high versus low need of social interaction, small versus large tumor size, early versus late response time in surveys, and young versus old age, among many others. In the ordinal case, discretization is equally commonly encountered in practice. Derived polytomous variables such as young-middle-old age, low-medium-high income, cold-cool-average-hot temperature, and no-mild-moderate-severe depression are obtained based on nominal age, income, temperature, and depression score, respectively. While binary is a special case of ordinal, for the purpose of illustration, integrity, and clarity, separate arguments are presented throughout the chapter. On a terminological note, we use the words binary/dichotomous and ordinal/polytomous interchangeably to simultaneously reflect the preferences of statisticians/psychometricians. Obviously, polytomous variables can be ordered (ordinal) or unordered (nominal). For the remainder of the chapter, the term "polytomous" is assumed to correspond to ordered variables. Discretization is typically shunned by statisticians for valid reasons, the most prominent of which is the power and information loss. In most cases, it leads to a
diminished effect size as well as reduced reliability and strength of association. However, simplicity, better interpretability and comprehension of the effects of interest, and superiority of some categorical data measures such as the odds ratio have been argued by proponents of discretization. Those who are against it assert that the regression paradigm is general enough to account for interactive effects, outliers, skewed distributions, and nonlinear relationships. In practice, especially substantive researchers and practitioners employ discretization in their works. For conflicting views on relative perils and merits of discretization, see MacCallum et al. (2002) and Farrington and Loeber (2000), respectively. We take a neutral position; although it is not a recommended approach from the statistical theory standpoint, it frequently occurs in practice, mostly driven by improved understandability-based arguments. Instead of engaging in fruitless philosophical discussions, we feel that a more productive effort can be directed towards finding answers when the discretization is performed, which motivates the formation of this chapter's major goals: (1) The determination of correlational magnitude changes when some of the continuous variables that may marginally or jointly follow almost any distribution in a multivariate setting are dichotomized or ordinalized. (2) The presentation of a conceptual and computational framework for modeling the correlational transformations before and after discretization.
A correlation between two continuous variables is usually computed as the common Pearson correlation. If one or both variables is/are dichotomized/ordinalized by a threshold concept of underlying continuous variables, different naming conventions are assigned to the correlations. A correlation between a continuous and a dichotomized/ordinalized variable is a biserial/polyserial and point-biserial/point-polyserial correlation before and after discretization, respectively. When both variables are dichotomized/ordinalized, the correlation between the two latent continuous variables is known as the tetrachoric/polychoric correlation. The phi coefficient is the correlation between two discretized variables; in fact, the term phi coefficient is reserved for dichotomized variables, but for lack of a better term we call it the "ordinal phi coefficient" for ordinalized variables. All of these correlations are special cases of the Pearson correlation.

Correlations are naturally altered in magnitude by discretization. In the binary case, there is a closed form, double numerical integration formula that connects the correlations before and after dichotomization under the normality assumption when both variables are dichotomized (Emrich and Piedmonte 1991). Demirtas and Doganay (2012) enhanced this Gaussian copula approach, along with the algebraic relationship between the biserial and point-biserial correlations, in the context of joint generation of binary and normal variables. Demirtas and Hedeker (2016) further extended this correlational connection to nonnormal variables via linearity and constancy arguments when only one variable in a bivariate setting is dichotomized. Going back to the scenario where the dichotomization is performed on both variables, Demirtas (2016) proposed algorithms that find the phi coefficient when the tetrachoric correlation is specified (and the other way around), under any distributional assumption for continuous variables through added operational utility of the power polynomials (Fleishman 1978; Vale and Maurelli 1983). In the ordinal case, modeling the correlation transition can only be performed iteratively under the normality assumption (Ferrari and Barbiero 2012). Demirtas et al. (2016a) augmented the idea of computing the correlation before or after discretization when the other one is specified, and vice versa, to an ordinal setting.

The primary purpose of this chapter is providing several algorithms that are designed to connect pre- and post-discretization correlations under specified distributional assumptions in simulated environments. More specifically, the following relationships are established: (a) tetrachoric correlation/phi coefficient, (b) biserial/point-biserial correlations, (c) polychoric correlation/ordinal phi coefficient, and (d) polyserial/point-polyserial correlations, where (a)–(b) and (c)–(d) are relevant to binary and ordinal data, respectively; (b)–(d) and (a)–(c) pertain to situations where only one or both variables is/are discretized, respectively. In all of these cases, the marginal distributions that are needed for finding skewness (symmetry) and elongation (peakedness) values for the underlying continuous variables, proportions for binary and ordinal variables, and associational quantities in the form of the Pearson correlation are assumed to be specified.
This work is important and of interest for the following reasons: (1) The link between these types of correlations has been studied only under the normality assumption; however, the presented level of generality that encompasses a comprehensive range of distributional setups is a necessary progress in computational statistics. (2) As simulation studies are typically based on replication of real data characteristics and/or specified hypothetical data trends, having access to the latent data as well as the eventual binary/ordinal data may be consequential for exploring a richer spectrum of feasible models that are applicable for a given data-analytic problem involving correlated binary/ordinal data. (3) The set of techniques sheds light on how correlations are related before and after discretization; and has potential to broaden our horizon on its relative advantages and drawbacks. (4) The algorithms work for a very broad class of underlying bivariate latent densities and binary/ordinal data distributions, allowing skip patterns for the latter, without requiring the identical distribution assumption on either type of variables. (5) The required software tools for the implementation are minimal; one only needs a numerical double integration solver for the binary-binary case, a computational platform with univariate random number generation (RNG) capabilities for the binary/ordinal-continuous case, an iterative scheme that connects the polychoric correlations and the ordinal phi coefficients under the normality assumption for the ordinal-ordinal case, and a polynomial root-finder and a nonlinear equations set solver to handle nonnormal continuous variables. (6) The algorithmic steps are formulated for the bivariate case by the nature of the problem, but handling the multivariate case is straightforward by assembling the correlation matrix entries.
(7) The collection of the algorithms can be regarded as an operational machinery for developing viable RNG mechanisms to generate multivariate latent variables as well as subsequent binary/ordinal variables given their marginal shape characteristics and associational structure in simulated settings, potentially expediting the evolution of the mixed data generation routines. (8) The methods could be useful in advancing research in meta-analysis domains where variables are discretized in some studies and remained continuous in some others.

The organization of the chapter is as follows. In Sect. 2, necessary background information is given for the development of the proposed algorithms. In particular, how the correlation transformation works for discretized data through numerical integration and an iterative scheme for the binary and ordinal cases, respectively, under the normality assumption, is outlined; a general nonnormal continuous data generation technique that forms a basis for the proposed approaches is described (when both variables are discretized), and an identity that connects correlations before and after discretization via their mutual associations is elaborated upon (when only one variable is discretized). In Sect. 3, several algorithms for finding one quantity given the other are provided under various combinations of cases (binary versus ordinal, directionality in terms of specified versus computed correlation with respect to pre- versus post-discretization, and whether discretization is applied on one versus both variables) and some illustrative examples, representing a broad range of distributional shapes that can be encountered in real applications, are presented for the purpose of exposition. In Sect. 4, a simulation study for evaluating the method's performance in a multivariate setup by commonly accepted accuracy (unbiasedness) measures in both directions is discussed. Section 5 includes concluding remarks, future research directions, limitations, and extensions.
2 Building Blocks
This section gives necessary background information for the development of the proposed algorithms in modeling correlation transitions. In what follows, correlation type and related notation depend on three factors: (a) before or after discretization, (b) only one or both variables is/are discretized, and (c) the discretized variable is dichotomous or polytomous. To establish the notational convention, for the remainder of the chapter, let $Y_1$ and $Y_2$ be the continuous variables where either $Y_1$ only or both are discretized to yield $X_1$ and $X_2$ depending on the correlation type under consideration (when $Y$'s are normal, they are denoted as $Z$, which is relevant for the normal-based results and for the nonnormal extension via power polynomials). To distinguish between binary and ordinal variables, the symbols $B$ and $O$ appear in the subscripts. Furthermore, for avoiding any confusion, the symbols $BS$, $TET$, $PS$, and $POLY$ are made a part of $\delta_{Y_1Y_2}$ and $\delta_{Z_1Z_2}$ to differentiate among the biserial, tetrachoric, polyserial, and polychoric correlations, respectively. For easier readability, we include Table 1, which shows the specific correlation types and associated notational symbols based on the three above-mentioned factors.
2.1 Dichotomous Case: Normality
Borrowing ideas from the RNG literature, if the underlying distribution before dichotomization is bivariate normal, the relationship between the phi coefficient and the tetrachoric correlation is known (Emrich and Piedmonte 1991; Demirtas and Doganay 2012; Demirtas 2016). Let $X_{1B}$ and $X_{2B}$ represent binary variables such that $E[X_{jB}] = p_j = 1 - q_j$ for $j = 1, 2$, and $Cor[X_{1B}, X_{2B}] = \delta_{X_{1B}X_{2B}}$, where $p_1$, $p_2$, and $\delta_{X_{1B}X_{2B}}$ are given. Let $Z_1 = Y_1$ and $Z_2 = Y_2$ be the corresponding standard normal variables, and let $\Phi$ be the cumulative distribution function for a standard bivariate normal random variable with correlation coefficient $\delta_{Z_1Z_2TET}$.

Table 1 Terminological and notational convention for different correlation types depending on the three self-explanatory factors

| Discrete data type | When | Discretized | Name | Symbol |
| Dichotomous | Before | $Y_1$ only | Biserial correlation | $\delta_{Y_1Y_2BS}$ |
| Dichotomous | After | $Y_1$ only | Point-biserial correlation | $\delta_{X_{1B}Y_2}$ |
| Dichotomous | Before | Both $Y_1$ and $Y_2$ | Tetrachoric correlation | $\delta_{Y_1Y_2TET}$ |
| Dichotomous | After | Both $Y_1$ and $Y_2$ | Phi coefficient | $\delta_{X_{1B}X_{2B}}$ |
| Polytomous | Before | $Y_1$ only | Polyserial correlation | $\delta_{Y_1Y_2PS}$ |
| Polytomous | After | $Y_1$ only | Point-polyserial correlation | $\delta_{X_{1O}Y_2}$ |
| Polytomous | Before | Both $Y_1$ and $Y_2$ | Polychoric correlation | $\delta_{Y_1Y_2POLY}$ |
| Polytomous | After | Both $Y_1$ and $Y_2$ | Ordinal phi coefficient | $\delta_{X_{1O}X_{2O}}$ |

Obviously, $\Phi[z_1, z_2, \delta_{Z_1Z_2TET}] = \int_{-\infty}^{z_1}\int_{-\infty}^{z_2} f(z_1, z_2, \delta_{Z_1Z_2TET})\,dz_1\,dz_2$, where

$$f(z_1, z_2, \delta_{Z_1Z_2TET}) = \left[2\pi(1 - \delta_{Z_1Z_2TET}^2)^{1/2}\right]^{-1} \times \exp\left[-(z_1^2 - 2\delta_{Z_1Z_2TET}\,z_1 z_2 + z_2^2)/(2(1 - \delta_{Z_1Z_2TET}^2))\right].$$

The phi coefficient ($\delta_{X_{1B}X_{2B}}$) and the tetrachoric correlation ($\delta_{Z_1Z_2TET}$) are linked via the equation

$$\Phi[z(p_1), z(p_2), \delta_{Z_1Z_2TET}] = \delta_{X_{1B}X_{2B}}(p_1 q_1 p_2 q_2)^{1/2} + p_1 p_2 \qquad (1)$$

where $z(p_j)$ denotes the $p_j$th quantile of the standard normal distribution for $j = 1, 2$. As long as $\delta_{X_{1B}X_{2B}}$ is within the feasible correlation range (Hoeffding 1940; Fréchet 1951; Demirtas and Hedeker 2011), the solution is unique. Once $Z_1$ and $Z_2$ are generated, the binary variables are derived by setting $X_{jB} = 1$ if $Z_j \geq z(1 - p_j)$ and 0 otherwise for $j = 1, 2$. While RNG is not our ultimate interest in this work, we can use Eq. 1 for bridging the phi coefficient and the tetrachoric correlation. When only one of the normal variables ($Z_1$) is dichotomized, i.e., $X_{1B} = I(Z_1 \geq z(1 - p_1))$, it is relatively easy to show that $\delta_{X_{1B}Z_2}/\delta_{Z_1Z_2} = \delta_{X_{1B}Z_1} = h/\sqrt{p_1 q_1}$, where $h$ is the ordinate of the normal curve at the point of dichotomization (Demirtas and Hedeker 2016).

Real data often do not conform to the assumption of normality; hence most simulation studies should take nonnormality into consideration. The next section investigates the situation where one or both continuous variables is/are nonnormal.
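Equation 1 lends itself to a direct numerical treatment: a bivariate normal cdf routine plus a univariate root-finder recovers either correlation from the other. The following is a minimal sketch of that idea, not the authors' implementation; the function names are ours.

```python
# Sketch of Eq. 1: bridge the phi coefficient and the tetrachoric correlation
# under bivariate normality. Function names are illustrative, not the chapter's.
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def phi_from_tetrachoric(rho, p1, p2):
    """Phi coefficient implied by a tetrachoric correlation rho (Eq. 1)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    # Phi[z(p1), z(p2), rho]: standard bivariate normal cdf at the quantiles
    cum = multivariate_normal(mean=[0.0, 0.0],
                              cov=[[1.0, rho], [rho, 1.0]]).cdf(
        [norm.ppf(p1), norm.ppf(p2)])
    # Eq. 1 rearranged: delta_X = (Phi[...] - p1*p2) / sqrt(p1*q1*p2*q2)
    return (cum - p1 * p2) / (p1 * q1 * p2 * q2) ** 0.5

def tetrachoric_from_phi(delta_x, p1, p2):
    """Invert Eq. 1 by root-finding; the solution is unique in the feasible range."""
    return brentq(lambda rho: phi_from_tetrachoric(rho, p1, p2) - delta_x,
                  -0.999, 0.999)
```

For instance, with $p_1 = p_2 = 0.5$ a phi coefficient of 0.3 corresponds to a tetrachoric correlation of $\sin(0.3\pi/2) \approx 0.454$ by the arcsine law for median splits, which the root-finder reproduces.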
2.2
Dichotomous Case: Beyond Normality
Extending the limited and restrictive normality-based results to a broad range of distributional setups requires the employment of two frameworks (an RNG routine for multivariate continuous data and a derived linear relationship in the presence of discretization), which we outline below.

We first tackle the relationship between the tetrachoric correlation $\delta_{Y_1Y_2TET}$ and the phi coefficient $\delta_{X_{1B}X_{2B}}$ under nonnormality via the use of the power polynomials (Fleishman 1978), which is a moment-matching procedure that simulates nonnormal distributions often used in Monte-Carlo studies, based on the premise that real-life distributions of variables are typically characterized by their first four moments. It hinges upon the polynomial transformation, $Y = a + bZ + cZ^2 + dZ^3$, where $Z$ follows a standard normal distribution, and $Y$ is standardized (zero mean and unit variance).¹ The distribution of $Y$ depends on the constants $a$, $b$, $c$, and $d$, that can be computed for specified or estimated values of skewness ($\nu_1 = E[Y^3]$) and excess kurtosis ($\nu_2 = E[Y^4] - 3$). The procedure of expressing any given variable by the sum of linear combinations of powers of a standard normal variate is capable of

¹We drop the subscript in $Y$ as we start with the univariate case.
covering a wide area in the skewness-elongation plane whose bounds are given by the general expression $\nu_2 \geq \nu_1^2 - 2$.² Assuming that $E[Y] = 0$ and $E[Y^2] = 1$, by utilizing the moments of the standard normal distribution, the following set of equations can be derived:

$$a = -c \qquad (2)$$

$$b^2 + 6bd + 2c^2 + 15d^2 - 1 = 0 \qquad (3)$$

$$2c(b^2 + 24bd + 105d^2 + 2) - \nu_1 = 0 \qquad (4)$$

$$24[bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)] - \nu_2 = 0 \qquad (5)$$
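A minimal sketch of solving this system in Python, with SciPy's hybrid solver in place of a hand-rolled Newton-Raphson iteration; the function name, starting values, and example inputs are ours:

```python
# Sketch: solve Eqs. 2-5 for the Fleishman constants (a, b, c, d) given
# skewness nu1 and excess kurtosis nu2. Names and starting values are ours.
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(nu1, nu2):
    def system(coef):
        b, c, d = coef
        return [b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                    # Eq. 3
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - nu1,               # Eq. 4
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - nu2]  # Eq. 5
    b, c, d = fsolve(system, x0=[1.0, 0.0, 0.0])
    return -c, b, c, d                                                  # Eq. 2: a = -c

# Example: an exponential distribution has nu1 = 2 and nu2 = 6.
a, b, c, d = fleishman_coefficients(2.0, 6.0)
```

The solution is only meaningful inside the feasible $(\nu_1, \nu_2)$ region, so the residuals of Eqs. 3–5 should always be checked after solving.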
These equations can be solved by the Newton-Raphson method, or any other plausible root-finding or nonlinear optimization routine. More details on the Newton-Raphson algorithm for this particular setting are given by Demirtas et al. (2012). The polynomial coefficients are estimated for centered and scaled variables; the resulting data set should be back-transformed to the original scale by multiplying every data point by the standard deviation and adding the mean. Centering-scaling and the reverse of this operation are linear transformations, so they do not change the values of skewness, kurtosis, and correlations. Of note, we use the words symmetry and skewness interchangeably. Similarly, kurtosis, elongation, and peakedness are meant to convey the same meaning.

The multivariate extension of Fleishman's power method (Vale and Maurelli 1983) plays a central role for the remainder of this chapter. The procedure for generating multivariate continuous data begins with computation of the constants given in Eqs. 2–5, independently for each variable. The bivariate case can be formulated in matrix notation as shown below. First, let $Z_1$ and $Z_2$ be variables drawn from standard normal populations; let $z_j$ be the vector of normal powers 0 through 3, $z_j = [1, Z_j, Z_j^2, Z_j^3]'$; and let $w_j$ be the weight vector that contains the power function weights $a$, $b$, $c$, and $d$, $w_j = [a_j, b_j, c_j, d_j]$ for $j = 1, 2$. The nonnormal variable $Y_j$ is then defined as the product of these two vectors, $Y_j = w_j' z_j$. Let $\delta_{Y_1Y_2}$ be the correlation between two nonnormal variables $Y_1$ and $Y_2$ that correspond to the normal variables $Z_1$ and $Z_2$, respectively.³ As the variables are standardized, meaning $E(Y_1) = E(Y_2) = 0$, $\delta_{Y_1Y_2} = E(Y_1 Y_2) = E(w_1' z_1 z_2' w_2) = w_1' R w_2$, where $R$ is the expected matrix product of $z_1$ and $z_2$:

$$R = E(z_1 z_2') = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & \delta_{Z_1Z_2} & 0 & 3\delta_{Z_1Z_2} \\ 1 & 0 & 2\delta_{Z_1Z_2}^2 + 1 & 0 \\ 0 & 3\delta_{Z_1Z_2} & 0 & 6\delta_{Z_1Z_2}^3 + 9\delta_{Z_1Z_2} \end{bmatrix},$$

²In fact, equality is not possible for continuous distributions.
³$\delta_{Y_1Y_2}$ is the same as $\delta_{Y_1Y_2TET}$ or $\delta_{Y_1Y_2POLY}$ depending on if discretized variables are binary or ordinal, respectively. For the general presentation of the power polynomials, we do not make that distinction.
where $\delta_{Z_1Z_2}$ is the correlation between $Z_1$ and $Z_2$. After algebraic operations, the following relationship between $\delta_{Y_1Y_2}$ and $\delta_{Z_1Z_2}$ in terms of polynomial coefficients ensues:

$$\delta_{Y_1Y_2} = \delta_{Z_1Z_2}(b_1 b_2 + 3b_1 d_2 + 3d_1 b_2 + 9d_1 d_2) + \delta_{Z_1Z_2}^2(2c_1 c_2) + \delta_{Z_1Z_2}^3(6d_1 d_2) \qquad (6)$$
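Since Eq. 6 is cubic in $\delta_{Z_1Z_2}$, any polynomial root-finder suffices in practice; a minimal sketch (function and variable names are ours) that keeps the admissible real root in $[-1, 1]$:

```python
# Sketch of Eq. 6: recover the intermediate normal correlation delta_Z that
# produces a target post-transformation correlation delta_Y. Names are ours.
import numpy as np

def intermediate_correlation(delta_y, coef1, coef2):
    """coef_j = (a_j, b_j, c_j, d_j): Fleishman constants of variable j."""
    _, b1, c1, d1 = coef1
    _, b2, c2, d2 = coef2
    # Eq. 6: delta_Y = k1*delta_Z + k2*delta_Z**2 + k3*delta_Z**3
    k1 = b1*b2 + 3*b1*d2 + 3*d1*b2 + 9*d1*d2
    k2 = 2*c1*c2
    k3 = 6*d1*d2
    roots = np.roots([k3, k2, k1, -delta_y])
    # keep the (typically unique) real root in the admissible range
    real = [r.real for r in roots if abs(r.imag) < 1e-8 and abs(r.real) <= 1]
    return real[0]

# Normal variables (b = 1, c = d = 0): delta_Z equals delta_Y, as expected.
normal = (0.0, 1.0, 0.0, 0.0)
```

For two normal variables the identity $\delta_{Z_1Z_2} = \delta_{Y_1Y_2}$ is recovered; for nonnormal Fleishman constants the intermediate correlation generally differs from the target.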
Solving this cubic equation for $\delta_{Z_1Z_2}$ gives the intermediate correlation between the two standard normal variables that is required for the desired post-transformation correlation $\delta_{Y_1Y_2}$. Clearly, correlations for each pair of variables should be assembled into a matrix of intercorrelations in the multivariate case. For a comprehensive source and detailed account on the power polynomials, see Headrick (2010).

In the dichotomization context, the connection between the underlying nonnormal ($\delta_{Y_1Y_2}$) and normal correlations ($\delta_{Z_1Z_2}$) in Eq. 6, along with the relationship between the tetrachoric correlation ($\delta_{Z_1Z_2TET}$) and the phi coefficient ($\delta_{X_{1B}X_{2B}}$) conveyed in Eq. 1, is instrumental in Algorithms-1a and -1b in Sect. 3.

To address the situation where only one variable ($Y_1$) is dichotomized, we now move to the relationship of biserial ($\delta_{Y_1Y_2BS}$) and point-biserial ($\delta_{X_{1B}Y_2}$) correlations in the absence of the normality assumption, which merely functions as a starting point below. Suppose that $Y_1$ and $Y_2$ jointly follow a bivariate normal distribution with a correlation of $\delta_{Y_1Y_2BS}$. Without loss of generality, we may assume that both $Y_1$ and $Y_2$ are standardized to have a mean of 0 and a variance of 1. Let $X_{1B}$ be the binary variable resulting from a split on $Y_1$, $X_{1B} = I(Y_1 \geq k)$, where $k$ is the point of dichotomization. Thus, $E[X_{1B}] = p_1 = 1 - q_1$ and $V[X_{1B}] = p_1 q_1$. The correlation between $X_{1B}$ and $Y_1$, $\delta_{X_{1B}Y_1}$, can be obtained in a simple way, namely, $\delta_{X_{1B}Y_1} = Cov[X_{1B}, Y_1]/\sqrt{V[X_{1B}]V[Y_1]} = E[X_{1B}Y_1]/\sqrt{p_1 q_1} = E[Y_1 \mid Y_1 \geq k]\,p_1/\sqrt{p_1 q_1}$. We can also express the relationship between $Y_1$ and $Y_2$ via the following linear regression model:

$$Y_2 = \delta_{Y_1Y_2BS} Y_1 + \varepsilon \qquad (7)$$

where $\varepsilon$ is independent of $Y_1$ and follows $N(0, 1 - \delta_{Y_1Y_2BS}^2)$. When we generalize this to nonnormal $Y_1$ and/or $Y_2$ (both centered and scaled), the same relationship can be assumed to hold with the exception that the distribution of $\varepsilon$ follows a nonnormal distribution. As long as Eq. 7 is valid,

$$\begin{aligned} Cov[X_{1B}, Y_2] &= Cov[X_{1B}, \delta_{Y_1Y_2BS} Y_1 + \varepsilon] \\ &= Cov[X_{1B}, \delta_{Y_1Y_2BS} Y_1] + Cov[X_{1B}, \varepsilon] \\ &= \delta_{Y_1Y_2BS} Cov[X_{1B}, Y_1] + Cov[X_{1B}, \varepsilon]. \end{aligned} \qquad (8)$$

Since $\varepsilon$ is independent of $Y_1$, it will also be independent of any deterministic function of $Y_1$ such as $X_{1B}$, and thus $Cov[X_{1B}, \varepsilon]$ will be 0. As $E[Y_1] = E[Y_2] = 0$, $V[Y_1] = V[Y_2] = 1$, $Cov[X_{1B}, Y_1] = \delta_{X_{1B}Y_1}\sqrt{p_1 q_1}$ and $Cov[Y_1, Y_2] = \delta_{Y_1Y_2BS}$, Eq. 8 reduces to

$$\delta_{X_{1B}Y_2} = \delta_{Y_1Y_2BS}\,\delta_{X_{1B}Y_1}. \qquad (9)$$
In the bivariate normal case, δ_{X1B Y1} = h/√(p_1 q_1), where h is the ordinate of the normal curve at the point of dichotomization. Equation 9 indicates that the linear association between X_{1B} and Y_2 is assumed to be fully explained by their mutual association with Y_1 (Demirtas and Hedeker 2016). The ratio δ_{X1B Y2}/δ_{Y1Y2}^{BS} is equal to δ_{X1B Y1} = E[X_{1B} Y_1]/√(p_1 q_1) = p_1 E[Y_1 | Y_1 ≥ k]/√(p_1 q_1), which is a constant given p_1 and the distribution of (Y_1, Y_2). Since these correlations are invariant to location shifts and scaling, Y_1 and Y_2 do not have to be centered and scaled; their means and variances can take any finite values. Once the ratio (δ_{X1B Y1}) is found (it could simply be done by generating Y_1 and dichotomizing it to yield X_{1B}), one can compute the point-biserial (δ_{X1B Y2}) or biserial (δ_{Y1Y2}^{BS}) correlation when the other one is specified. This linearity-constancy argument that jointly emanates from Eqs. 7 and 9 will be the crux of Algorithm-2, given in Sect. 3.
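The identity δ_{X1B Y2} = δ_{Y1Y2}^{BS} δ_{X1B Y1} (Eq. 9) is easy to check by simulation under the single-factor structure of Eq. 7. The sketch below uses a centered and scaled exponential Y_1 and a normal ε purely as illustrative choices; none of these particular settings come from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta_bs, p1 = 500_000, 0.6, 0.3

# Centered and scaled (nonnormal) Y1, and Y2 built as in Eq. 7:
y1 = rng.exponential(1.0, n) - 1.0                  # mean 0, variance 1
eps = rng.normal(0.0, np.sqrt(1 - delta_bs**2), n)  # independent error
y2 = delta_bs * y1 + eps                            # corr(Y1, Y2) = delta_bs

# Dichotomize Y1 so that P(X1B = 1) = p1:
x1b = (y1 >= np.quantile(y1, 1 - p1)).astype(float)

lhs = np.corrcoef(x1b, y2)[0, 1]                    # point-biserial correlation
rhs = delta_bs * np.corrcoef(x1b, y1)[0, 1]         # delta_BS times the ratio
```

The two quantities agree up to Monte Carlo error, which is exactly the content of Eq. 9.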
2.3 Polytomous Case: Normality
In the ordinal case, although the relationship between the polychoric correlation (δ_{Y1Y2}^{POLY}) and the ordinal phi correlation (δ_{X1O X2O}) can be written in closed form, as explained below, the solution needs to be obtained iteratively even under the normality assumption, since no nice recipe such as Eq. 1 is available. In the context of
correlated ordinal data generation, Ferrari and Barbiero (2012) proposed an iterative procedure based on a Gaussian copula, in which point-scale ordinal data are generated when the marginal proportions and correlations are specified. For the purposes of this chapter, one can utilize their method to find the corresponding polychoric correlation or the ordinal phi coefficient when one of them is given under normality. The algorithm in Ferrari and Barbiero (2012) serves as an intermediate step in formulating the connection between the two correlations under any distributional assumption on the underlying continuous variables.

Concentrating on the bivariate case, suppose Z = (Z_1, Z_2) ~ N(0, Δ_{Z1Z2}), where Z denotes the bivariate standard normal distribution with correlation matrix Δ_{Z1Z2} whose off-diagonal entry is δ_{Z1Z2}^{POLY}. Let X = (X_{1O}, X_{2O}) be the bivariate ordinal data where the underlying Z is discretized based on the corresponding normal quantiles given the marginal proportions. If we need to sample from a random vector (X_{1O}, X_{2O}) whose marginal cumulative distribution functions (cdfs) are F_1 and F_2, tied together via a Gaussian copula, we generate a sample (z_1, z_2) from Z ~ N(0, Δ_{Z1Z2}), then set x = (x_{1O}, x_{2O}) = (F_1^{−1}(u_1), F_2^{−1}(u_2)), where u = (u_1, u_2) = (Φ(z_1), Φ(z_2)) and Φ is the cdf of the standard normal distribution. The correlation matrix of X, denoted by Δ_{X1O X2O} (with an off-diagonal entry δ_{X1O X2O}), obviously differs from Δ_{Z1Z2} due to discretization. More specifically, |δ_{X1O X2O}| < |δ_{Z1Z2}^{POLY}| in large samples. The relationship between Δ_{X1O X2O} and Δ_{Z1Z2} is established by resorting to the following formula (Cario and Nelson 1997):
E[X_{1O} X_{2O}] = E[F_1^{−1}(Φ(Z_1)) F_2^{−1}(Φ(Z_2))] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} F_1^{−1}(Φ(z_1)) F_2^{−1}(Φ(z_2)) f(z_1, z_2) dz_1 dz_2   (10)
68
H. Demirtas and C. Vardar-Acar
where f(z_1, z_2) is the bivariate standard normal probability density function (pdf) with correlation matrix Δ_{Z1Z2}, which implies that Δ_{X1O X2O} is a function of Δ_{Z1Z2}. If X_1 and X_2 are ordinal random variables with cardinality k_1 and k_2, respectively, Eq. 10 reduces to a sum of k_1 × k_2 integrals of f(z_1, z_2) over rectangles, i.e., k_1 × k_2 differences of the bivariate cdf computed at two distinct points in R^2, as articulated by Ferrari and Barbiero (2012). The relevant part of the algorithm is as follows:
1. Generate standard bivariate normal data with the correlation δ_{Z1Z2}^{POLY(0)}, where δ_{Z1Z2}^{POLY(0)} = δ_{X1O X2O} (here, δ_{Z1Z2}^{POLY(0)} is the initial polychoric correlation).
2. Discretize Z_1 and Z_2, based on the cumulative probabilities of the marginal distributions F_1 and F_2, to obtain X_{1O} and X_{2O}, respectively.
3. Compute δ_{X1O X2O}^{(1)} through X_{1O} and X_{2O} (here, δ_{X1O X2O}^{(1)} is the ordinal phi coefficient after the first iteration).
4. Execute the following loop as long as |δ_{X1O X2O}^{(v)} − δ_{X1O X2O}| > ε and 1 ≤ v ≤ v_max (v_max and ε are the maximum number of iterations and the maximum tolerated absolute error, respectively; both quantities are set by the users):
   (a) Update δ_{Z1Z2}^{POLY(v)} by δ_{Z1Z2}^{POLY(v)} = δ_{Z1Z2}^{POLY(v−1)} g(v), where g(v) = δ_{X1O X2O}/δ_{X1O X2O}^{(v)}. Here, g(v) serves as a correction coefficient, which ultimately converges to 1.
   (b) Generate bivariate normal data with δ_{Z1Z2}^{POLY(v)} and compute δ_{X1O X2O}^{(v+1)} after discretization.

Again, our focus in the current work is not RNG per se, but the core idea in Ferrari and Barbiero (2012) is a helpful tool that links δ_{Y1Y2}^{POLY} and δ_{X1O X2O} for ordinalized data through the intermediary role of normal data between ordinal and nonnormal continuous data (Sect. 2.4).

When only one of the normal variables (Z_1) is ordinalized, no nice formulas such as the one in Sect. 2.1 given in the binary data context are available. The good news is that a much more general procedure that accommodates any distributional assumption on the underlying continuous variables is available. The process that relates the polyserial (δ_{Z1Z2}^{PS}) and point-polyserial (δ_{X1O Z2}) correlations is available by extending the arguments substantiated in Sect. 2.2 to the ordinal data case.
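The correction loop above is easy to prototype. The sketch below is our own simulation-based approximation of the scheme (function names and default settings are ours; the published algorithm can work with exact discretization probabilities where we regenerate data):

```python
import numpy as np
from scipy.stats import norm

def ordinalize(z, cum_probs):
    """Discretize standard-normal draws into ordinal scores 1..k using
    the normal quantiles of the cumulative marginal probabilities."""
    cuts = norm.ppf(cum_probs[:-1])          # interior cut points
    return np.searchsorted(cuts, z) + 1      # categories start at 1

def intermediate_corr(target_phi, p1, p2, n=100_000, eps=0.005, vmax=50, seed=1):
    """Find the normal correlation whose discretized version reproduces
    the target ordinal phi coefficient, via the correction factor g(v)."""
    rng = np.random.default_rng(seed)
    c1, c2 = np.cumsum(p1), np.cumsum(p2)
    rho = target_phi                          # Step 1: initialize at the target
    for _ in range(vmax):
        z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
        phi_v = np.corrcoef(ordinalize(z[:, 0], c1), ordinalize(z[:, 1], c2))[0, 1]
        if abs(phi_v - target_phi) <= eps:    # Step 4 stopping rule
            break
        rho = np.clip(rho * target_phi / phi_v, -0.999, 0.999)  # g(v) update
    return rho
```

Because discretization attenuates correlation, the returned normal correlation always exceeds the target phi coefficient in absolute value.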
2.4 Polytomous Case: Beyond Normality
When both variables are ordinalized, the connection between the polychoric correlation (δ_{Y1Y2}^{POLY}) and the ordinal phi coefficient (δ_{X1O X2O}) can be established by a two-stage scheme, in which we compute the normal, intermediate correlation (δ_{Z1Z2}^{POLY}) from the ordinal phi coefficient by the method in Ferrari and Barbiero (2012) (presented in Sect. 2.3) before we find the nonnormal polychoric correlation via the power polynomials (Eq. 6). The other direction (computing δ_{X1O X2O} from δ_{Y1Y2}^{POLY}) can be implemented by executing the same steps in the reverse order. The associated computational routines are presented in Sect. 3 (Algorithms-3a and -3b).
The correlational identity given in Eq. 9 holds for ordinalized data as well when only one variable is ordinalized (Demirtas and Hedeker 2016); the ordinal version of the equation can be written as δ_{X1O Y2} = δ_{Y1Y2}^{PS} δ_{X1O Y1}. The same linearity and constancy-of-ratio arguments equally apply in terms of the connection between the polyserial (δ_{Y1Y2}^{PS}) and point-polyserial (δ_{X1O Y2}) correlations; the fundamental utility and operational characteristics are parallel to the binary case. Once the ratio (δ_{X1O Y1}) is found by generating Y_1 and discretizing it to obtain X_{1O}, one can easily compute either of these quantities given the other. This will be pertinent in Algorithm-4 below. The next section puts all these concepts together from an algorithmic point of view with numerical illustrations.
3 Algorithms and Illustrative Examples
We work with eight distributions to reflect some common shapes that can be encountered in real-life applications. The illustrative examples come from bivariate data with Weibull and Normal mixture marginals. In what follows, W and NM stand for Weibull and Normal mixture, respectively. The W density is f(y | γ, δ) = (δ/γ^δ) y^{δ−1} exp(−(y/γ)^δ) for y > 0, where γ > 0 and δ > 0 are the scale and shape parameters, respectively. The NM density is f(y | π, μ_1, σ_1, μ_2, σ_2) = (π/(σ_1 √(2π))) exp(−(1/2)((y − μ_1)/σ_1)^2) + ((1 − π)/(σ_2 √(2π))) exp(−(1/2)((y − μ_2)/σ_2)^2), where 0 < π < 1 is the mixing parameter. Since it is a mixture, it can be unimodal or bimodal. Depending on the choice of parameters, both distributions can take a variety of shapes. We use four sets of parameter specifications for each of these distributions: For the W distribution, (γ, δ) pairs are chosen to be (1, 1), (1, 1.2), (1, 3.6), and (1, 25), corresponding to mode at the boundary, positively skewed, nearly symmetric, and negatively skewed shapes, respectively. For the NM distribution, the parameter set (π, μ_1, σ_1, μ_2, σ_2) is set to (0.5, 0, 1, 3, 1), (0.6, 0, 1, 3, 1), (0.3, 0, 1, 2, 1), and (0.5, 0, 1, 2, 1), whose shapes are bimodal-symmetric, bimodal-asymmetric, unimodal-negatively skewed, and unimodal-symmetric, respectively. These four variations of the W and NM densities are plotted in Fig. 1 (W/NM: the first/second columns) in the above order of parameter values, moving from top to bottom. Finally, as before, p_1 and p_2 represent the binary/ordinal proportions. In the binary case, they are single numbers. In the ordinal case, the marginal proportions are denoted as P(X_i = j) = p_{ij} for i = 1, 2 and j = 1, 2, ..., k_i, and p_i = (p_{i1}, p_{i2}, ..., p_{ik_i}), in which skip patterns are allowed. Furthermore, if the users wish to start the ordinal categories from 0 or any integer other than 1, the associational implications remain unchanged, as correlations are invariant to location shifts. Of note, the number of significant digits reported throughout the chapter varies by the computational sensitivity of the quantities.
Algorithm-1a: Computing the tetrachoric correlation (δ_{Y1Y2}^{TET}) from the phi coefficient (δ_{X1B X2B}): The algorithm for computing δ_{Y1Y2}^{TET} when δ_{X1B X2B}, p_1, p_2, and the key distributional characteristics of Y_1 and Y_2 (ν_1 and ν_2) are specified, is as follows:
Fig. 1 Density functions of Weibull (first column) and Normal Mixture (second column) distributions for chosen parameter values that appear in the text
1. Solve Eq. 1 for δ_{Z1Z2}^{TET}.
2. Compute the power coefficients (a, b, c, d) for Y_1 and Y_2 by Eqs. 2–5.
3. Plug all quantities obtained in Steps 1–2 into Eq. 6, and solve for δ_{Y1Y2}^{TET}.
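Step 1 inverts the standard bivariate-normal relationship between the phi coefficient and the underlying normal correlation (Eq. 1 itself appears earlier in the chapter). A minimal sketch of this inversion, assuming the common convention that X_{iB} = 1 when the latent normal exceeds the threshold Φ^{−1}(1 − p_i):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric_from_phi(phi, p1, p2):
    """Solve for the underlying-normal (tetrachoric) correlation that
    reproduces a given phi coefficient for dichotomized margins p1, p2."""
    k1, k2 = norm.ppf(1 - p1), norm.ppf(1 - p2)  # latent thresholds
    denom = np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

    def phi_of_rho(rho):
        # P(Z1 >= k1, Z2 >= k2) via the bivariate normal lower-orthant cdf
        joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
        p11 = 1 - norm.cdf(k1) - norm.cdf(k2) + joint.cdf([k1, k2])
        return (p11 - p1 * p2) / denom

    return brentq(lambda r: phi_of_rho(r) - phi, -0.999, 0.999, xtol=1e-6)
```

With the worked example below, tetrachoric_from_phi(0.1, 0.85, 0.15) lands near 0.277, matching the value reported for Step 1.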
Suppose Y_1 ~ W(1, 1), Y_2 ~ NM(0.5, 0, 1, 3, 1), (p_1, p_2) = (0.85, 0.15), and δ_{X1B X2B} = 0.1. Solving for δ_{Z1Z2}^{TET} in Eq. 1 (Step 1) yields 0.277. The power coefficients (a, b, c, d) in Eqs. 2–5 (Step 2) turn out to be (−0.31375, 0.82632, 0.31375, 0.02271) and (0.00004, 1.20301, −0.00004, −0.07305) for Y_1 and Y_2, respectively. Substituting these into Eq. 6 (Step 3) gives δ_{Y1Y2}^{TET} = 0.243. Similarly, for (p_1, p_2) = (0.10, 0.30) and δ_{X1B X2B} = 0.5, δ_{Z1Z2}^{TET} = 0.919 and δ_{Y1Y2}^{TET} = 0.801. The upper half of Table 2 includes a few more combinations.

Algorithm-1b: Computing the phi coefficient (δ_{X1B X2B}) from the tetrachoric correlation (δ_{Y1Y2}^{TET}): The quantities that need to be specified are the same as in Algorithm-1a, and the steps are as follows:

1. Compute the power coefficients (a, b, c, d) for Y_1 and Y_2 by Eqs. 2–5.
2. Solve Eq. 6 for δ_{Z1Z2}^{TET}.
3. Plug δ_{Z1Z2}^{TET} into Eq. 1, and solve for δ_{X1B X2B}.
With the same pair of distributions, where Y_1 ~ W(1, 1) and Y_2 ~ NM(0.5, 0, 1, 3, 1), suppose (p_1, p_2) = (0.85, 0.15) and δ_{Y1Y2}^{TET} = −0.4. After solving for the
Table 2 Computed values of the tetrachoric correlation (δ_{Y1Y2}^{TET}) or the phi coefficient (δ_{X1B X2B}) when one of them is specified, with two sets of proportions for Y_1 ~ W(1, 1) and Y_2 ~ NM(0.5, 0, 1, 3, 1)

p1     p2     δ_{X1B X2B}   δ_{Z1Z2}^{TET}   δ_{Y1Y2}^{TET}
0.85   0.15   −0.6          −0.849           −0.755
0.85   0.15   −0.3          −0.533           −0.472
0.85   0.15    0.1           0.277            0.243
0.10   0.30   −0.2          −0.616           −0.540
0.10   0.30    0.3           0.572            0.502
0.10   0.30    0.5           0.919            0.801

p1     p2     δ_{Y1Y2}^{TET}   δ_{Z1Z2}^{TET}   δ_{X1B X2B}
0.85   0.15   −0.4             −0.456           −0.246
0.85   0.15   −0.2             −0.227           −0.085
0.85   0.15    0.6              0.685            0.173
0.10   0.30   −0.5             −0.570           −0.192
0.10   0.30    0.1              0.114            0.052
0.10   0.30    0.7              0.801            0.441
Fig. 2 δ_{X1B X2B} versus δ_{Y1Y2}^{TET} for Y_1 ~ W(1, 1) and Y_2 ~ NM(0.5, 0, 1, 3, 1), where solid, dashed, and dotted curves represent (p_1, p_2) = (0.85, 0.15), (0.10, 0.30), and (0.50, 0.50), respectively; the range differences are due to the Fréchet-Hoeffding bounds
power coefficients (Step 1), Steps 2 and 3 yield δ_{Z1Z2}^{TET} = −0.456 and δ_{X1B X2B} = −0.246, respectively. Similarly, when (p_1, p_2) = (0.10, 0.30) and δ_{Y1Y2}^{TET} = 0.7, δ_{Z1Z2}^{TET} = 0.801 and δ_{X1B X2B} = 0.441. The lower half of Table 2 includes a few more combinations. More comprehensively, Fig. 2 shows the comparative behavior of δ_{X1B X2B} and δ_{Y1Y2}^{TET} when the proportion pairs take three different values, with the addition of (p_1, p_2) = (0.50, 0.50) to the two pairs above, for this particular distributional setup.
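The reverse mapping in Step 3 of Algorithm-1b needs no root-finding: once the underlying normal correlation is in hand, the phi coefficient follows from a single evaluation of the bivariate normal cdf. A sketch, assuming the convention that X_{iB} = 1 when the latent normal exceeds Φ^{−1}(1 − p_i):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def phi_from_normal_corr(rho, p1, p2):
    """Phi coefficient of the binary pair obtained by thresholding a
    standard bivariate normal with correlation rho at proportions p1, p2."""
    k1, k2 = norm.ppf(1 - p1), norm.ppf(1 - p2)
    joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    p11 = 1 - norm.cdf(k1) - norm.cdf(k2) + joint.cdf([k1, k2])
    return (p11 - p1 * p2) / np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
```

Plugging the intermediate correlations from the example above back in recovers the specified phi coefficients, which is a quick internal consistency check for Eq. 1.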
Table 3 Computed values of ĉ_1 = δ̂_{X1B Y1} and ĉ_2 = δ̂_{X2B Y2} that connect the biserial (δ_{Y1Y2}^{BS}) and point-biserial (δ_{X1B Y2} or δ_{X2B Y1}) correlations, where Y_1 ~ W(1, 1.2) and Y_2 ~ NM(0.6, 0, 1, 3, 1)

p1 or p2   ĉ_1     ĉ_2
0.05       0.646   0.447
0.15       0.785   0.665
0.25       0.809   0.782
0.35       0.795   0.848
0.45       0.760   0.858
0.55       0.710   0.829
0.65       0.640   0.772
0.75       0.554   0.692
0.85       0.436   0.585
0.95       0.261   0.392
Algorithm-2: Computing the biserial correlation (δ_{Y1Y2}^{BS}) from the point-biserial correlation (δ_{X1B Y2}) and the other way around: One only needs to specify the distributional form of Y_1 (the variable that is to be dichotomized) and the proportion (p_1) for this algorithm (see Sect. 2.2). The steps are as follows:

1. Generate Y_1 with a large number of data points (e.g., N = 100,000).
2. Dichotomize Y_1 to obtain X_{1B} through the specified value of p_1, and compute the sample correlation, δ̂_{X1B Y1} = ĉ_1.
3. Find δ_{X1B Y2} or δ_{Y1Y2}^{BS} by δ_{X1B Y2}/δ_{Y1Y2}^{BS} = ĉ_1 by Eq. 9.

In this illustration, we assume that Y_1 ~ W(1, 1.2), Y_2 ~ NM(0.6, 0, 1, 3, 1), and δ_{Y1Y2}^{BS} = 0.60. Y_1 is dichotomized to obtain X_{1B}, where E(X_{1B}) = p_1 = 0.55. After following Steps 1 and 2, ĉ_1 turns out to be 0.710, and accordingly δ_{X1B Y2} = ĉ_1 δ_{Y1Y2}^{BS} = 0.426. Similarly, if the specified value of δ_{X1B Y2} is 0.25, then δ_{Y1Y2}^{BS} = δ_{X1B Y2}/ĉ_1 = 0.352. The fundamental ideas remain the same if Y_2 is dichotomized (with a proportion p_2) rather than Y_1. In that case, with a slight notational difference, the new equations would be δ̂_{X2B Y2} = ĉ_2 and δ_{X2B Y1} = δ_{Y1Y2}^{BS} ĉ_2. Table 3 shows ĉ_1 and ĉ_2 values when p_1 or p_2 ranges between 0.05 and 0.95 with an increment of 0.10. We further generated bivariate continuous data with the above marginals and biserial correlations between −0.85 and 0.90 with an increment of 0.05. We then dichotomized Y_1, where p_1 is 0.15 or 0.95, and computed the empirical point-biserial correlation.
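Steps 1–2 of Algorithm-2 amount to a few lines of simulation. The sketch below is our own minimal translation, with scipy's weibull_min standing in for the W density and dichotomization at the empirical quantile so that the proportion of ones equals p_1:

```python
import numpy as np
from scipy.stats import weibull_min

def c1_ratio(p1, n=300_000, seed=42):
    """Monte Carlo estimate of c1 = corr(X_1B, Y_1) for Y_1 ~ W(1, 1.2),
    where X_1B = 1 when Y_1 exceeds its empirical (1 - p1) quantile."""
    rng = np.random.default_rng(seed)
    y1 = weibull_min.rvs(c=1.2, scale=1.0, size=n, random_state=rng)
    x1b = (y1 >= np.quantile(y1, 1 - p1)).astype(float)
    return np.corrcoef(x1b, y1)[0, 1]
```

With p_1 = 0.55 this lands near the ĉ_1 = 0.710 of Table 3; multiplying a specified biserial correlation by ĉ_1 then gives the corresponding point-biserial value, per Eq. 9.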
The upper-left and upper-right graphs in Fig. 3 show the plot of the algorithmic value of ĉ_1 in Step 2 and δ_{Y1Y2}^{BS} versus δ_{X1B Y2}, respectively, where the former is a theoretical and the latter is an empirical quantity. As expected, the ĉ_1 values are the same as the slopes of the lines of δ_{Y1Y2}^{BS} versus δ_{X1B Y2}, lending support for how plausibly Algorithm-2 is working. The procedure is repeated under the assumption that Y_2 is dichotomized rather than Y_1 (lower graphs in Fig. 3).

Fig. 3 Plots of ĉ_1 (upper-left), ĉ_2 (lower-left), empirical δ_{X1B Y2} versus δ_{Y1Y2}^{BS} (upper-right), and δ_{X2B Y1} versus δ_{Y1Y2}^{BS} (lower-right) for p_1 or p_2 equal to 0.15 (shown by o) or 0.95 (shown by *), where Y_1 ~ W(1, 1.2) and Y_2 ~ NM(0.6, 0, 1, 3, 1)

Algorithm-3a: Computing the polychoric correlation (δ_{Y1Y2}^{POLY}) from the ordinal phi coefficient (δ_{X1O X2O}): The algorithm for computing δ_{Y1Y2}^{POLY} when δ_{X1O X2O}, p_1,
p_2, and the key distributional characteristics of Y_1 and Y_2 (ν_1 and ν_2) are specified, is as follows:
1. Use the method in Ferrari and Barbiero (2012), outlined in Sect. 2.3, for finding δ_{Z1Z2}^{POLY}.
2. Compute the power coefficients (a, b, c, d) for Y_1 and Y_2 by Eqs. 2–5.
3. Plug all quantities obtained in Steps 1–2 into Eq. 6, and solve for δ_{Y1Y2}^{POLY}.

Suppose Y_1 ~ W(1, 3.6), Y_2 ~ NM(0.3, 0, 1, 2, 1), (p_1, p_2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), and δ_{X1O X2O} = −0.7. Solving for δ_{Z1Z2}^{POLY} in Step 1 yields −0.816. The power coefficients (a, b, c, d) in Eqs. 2–5 (Step 2) turn out to be (0.00010, 1.03934, −0.00010, −0.01268) for Y_1 and (0.05069, 1.04806, −0.05069, −0.02626) for Y_2. Substituting these into Eq. 6 (Step 3) gives δ_{Y1Y2}^{POLY} = −0.813. Similarly, for (p_1, p_2) = ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)) and δ_{X1O X2O} = 0.2, δ_{Z1Z2}^{POLY} = 0.441 and δ_{Y1Y2}^{POLY} = 0.439. The upper half of Table 4 includes a few more combinations.

Algorithm-3b: Computing the ordinal phi coefficient (δ_{X1O X2O}) from the polychoric correlation (δ_{Y1Y2}^{POLY}): The required quantities that need specification are the same as in Algorithm-3a, and the steps are as follows:
Table 4 Computed values of the polychoric correlation (δ_{Y1Y2}^{POLY}) or the ordinal phi coefficient (δ_{X1O X2O}) given the other, with two sets of proportions for Y_1 ~ W(1, 3.6) and Y_2 ~ NM(0.3, 0, 1, 2, 1)

p1                    p2               δ_{X1O X2O}   δ_{Z1Z2}^{POLY}   δ_{Y1Y2}^{POLY}
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)  −0.7          −0.816            −0.813
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)  −0.3          −0.380            −0.378
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)   0.4           0.546             0.544
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)   0.6           0.828             0.826
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)  −0.6          −0.770            −0.767
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)  −0.4          −0.568            −0.566
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)  −0.1          −0.167            −0.166
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)   0.2           0.441             0.439

p1                    p2               δ_{Y1Y2}^{POLY}   δ_{Z1Z2}^{POLY}   δ_{X1O X2O}
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)  −0.8              −0.802            −0.686
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)  −0.2              −0.201            −0.155
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)   0.5               0.502             0.368
(0.4, 0.3, 0.2, 0.1)  (0.2, 0.2, 0.6)   0.7               0.702             0.511
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)  −0.8              −0.802            −0.638
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)  −0.2              −0.201            −0.122
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)   0.5               0.502             0.219
(0.1, 0.1, 0.1, 0.7)  (0.8, 0.1, 0.1)   0.7               0.702             0.263
1. Compute the power coefficients (a, b, c, d) for Y_1 and Y_2 by Eqs. 2–5.
2. Solve Eq. 6 for δ_{Z1Z2}^{POLY}.
3. Solve for δ_{X1O X2O} given δ_{Z1Z2}^{POLY} by the method in Ferrari and Barbiero (2012).
With the same set of specifications, namely Y_1 ~ W(1, 3.6), Y_2 ~ NM(0.3, 0, 1, 2, 1), and (p_1, p_2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), suppose δ_{Y1Y2}^{POLY} = 0.5. After solving for the power coefficients (Step 1), Steps 2 and 3 yield δ_{Z1Z2}^{POLY} = 0.502 and δ_{X1O X2O} = 0.368. Similarly, when (p_1, p_2) = ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)) and δ_{Y1Y2}^{POLY} = 0.7, δ_{Z1Z2}^{POLY} = 0.702 and δ_{X1O X2O} = 0.263. The lower half of Table 4 includes a few more combinations. A more inclusive set of results is given in Fig. 4, which shows the relative trajectories of δ_{X1O X2O} and δ_{Y1Y2}^{POLY} when the proportion sets take three different values, with the addition of (p_1, p_2) = ((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9)) to the two sets above.

Algorithm-4: Computing the polyserial correlation (δ_{Y1Y2}^{PS}) from the point-polyserial correlation (δ_{X1O Y2}) and the other way around: The following steps enable us to calculate either one of these correlations when the distribution of Y_1 (the variable that is subsequently ordinalized) and the ordinal proportions (p_1) are specified (see Sect. 2.4):
Fig. 4 δ_{X1O X2O} versus δ_{Y1Y2}^{POLY} for Y_1 ~ W(1, 3.6) and Y_2 ~ NM(0.3, 0, 1, 2, 1), where solid, dashed, and dotted curves represent (p_1, p_2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6)), ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1)), and ((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9)), respectively; the range differences are due to the Fréchet-Hoeffding bounds
1. Generate Y_1 with a large number of data points (e.g., N = 100,000).
2. Ordinalize Y_1 to obtain X_{1O} through the specified value of p_1, and compute the sample correlation, δ̂_{X1O Y1} = ĉ_1.
3. Find δ_{X1O Y2} or δ_{Y1Y2}^{PS} by δ_{X1O Y2}/δ_{Y1Y2}^{PS} = ĉ_1 by Eq. 9.
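A quick sketch of Steps 1–2 for the W(1, 25) illustration that follows; this is our own translation, with scipy's weibull_min standing in for the W density and ordinal scores 1, ..., k assigned by the cumulative proportions:

```python
import numpy as np
from scipy.stats import weibull_min

def ordinal_c1(p, shape, n=300_000, seed=11):
    """Monte Carlo estimate of c1 = corr(X_1O, Y_1) when Y_1 ~ W(1, shape)
    is ordinalized into categories 1..k with marginal proportions p."""
    rng = np.random.default_rng(seed)
    y1 = weibull_min.rvs(c=shape, scale=1.0, size=n, random_state=rng)
    cuts = np.quantile(y1, np.cumsum(p)[:-1])  # empirical cut points
    x1o = np.searchsorted(cuts, y1) + 1        # ordinal scores 1..k
    return np.corrcoef(x1o, y1)[0, 1]
```

For p_1 = (0.4, 0.3, 0.2, 0.1) and shape 25, this lands near the ĉ_1 = 0.837 used in the illustration; a specified polyserial correlation times ĉ_1 then gives the point-polyserial value by Eq. 9.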
For illustrative purposes, we assume that Y_1 ~ W(1, 25), Y_2 ~ NM(0.5, 0, 1, 2, 1), and δ_{Y1Y2}^{PS} = 0.6. Y_1 is ordinalized to obtain X_{1O}, where p_1 = (0.4, 0.3, 0.2, 0.1). After following Steps 1 and 2, ĉ_1 turns out to be 0.837, and accordingly δ_{X1O Y2} = ĉ_1 δ_{Y1Y2}^{PS} = 0.502. Similarly, if the specified value of δ_{X1O Y2} is 0.3, then δ_{Y1Y2}^{PS} = δ_{X1O Y2}/ĉ_1 = 0.358. The core ideas remain unchanged if Y_2 is ordinalized (with proportions p_2) instead of Y_1, in which case the new equations become δ̂_{X2O Y2} = ĉ_2 and δ_{X2O Y1} = δ_{Y1Y2}^{PS} ĉ_2. Several ĉ_1 and ĉ_2 values are tabulated for corresponding p_1 or p_2 specifications in Table 5. Figure 5 provides a comparison between the theoretical (suggested by Algorithm-4) and empirical point-polyserial correlations (δ_{X1O Y2}) for specified polyserial correlation (δ_{Y1Y2}^{PS}) values in the range of −0.95 and 0.95 (upper graph), along with the scatter plot of the differences between the two quantities (lower graph). We first generated bivariate continuous data using the above distributional assumptions, then ordinalized Y_1, computed the empirical post-discretization correlations, and made a comparison that is shown in the two graphs in Fig. 5, which collectively suggest that the procedure is working properly.

Some operational remarks: All computing work has been done in the R software (R Development Core Team 2016). In the algorithms that involve the power polynomials, a stand-alone computer code in Demirtas and Hedeker (2008a) was used to solve the system of equations. More sophisticated programming implementations such as the fleishman.coef function in the BinNonNor package (Inan and Demirtas 2016), the Param.fleishman function in the PoisNonNor package (Demirtas et al. 2016b), and the Fleishman.coef.NN function in the BinOrdNonNor package (Demirtas et al.
Table 5 Computed values of ĉ_1 = δ̂_{X1O Y1} and ĉ_2 = δ̂_{X2O Y2} that connect the polyserial (δ_{Y1Y2}^{PS}) and point-polyserial (δ_{X1O Y2} or δ_{X2O Y1}) correlations, where Y_1 ~ W(1, 25) and Y_2 ~ NM(0.5, 0, 1, 2, 1)

p1                    p2               ĉ_1     ĉ_2
(0.4, 0.3, 0.2, 0.1)  –                0.837   –
(0.1, 0.2, 0.3, 0.4)  –                0.927   –
(0.1, 0.4, 0.4, 0.1)  –                0.907   –
(0.4, 0.1, 0.1, 0.4)  –                0.828   –
(0.7, 0.1, 0.1, 0.1)  –                0.678   –
–                     (0.3, 0.4, 0.3)  –       0.914
–                     (0.6, 0.2, 0.2)  –       0.847
–                     (0.1, 0.8, 0.1)  –       0.759
–                     (0.1, 0.1, 0.8)  –       0.700
–                     (0.4, 0.4, 0.2)  –       0.906
Fig. 5 The plot of the theoretical (x axis) versus empirical (y axis) point-polyserial correlations (δ_{X1O Y2}) given the specified polyserial correlation (δ_{Y1Y2}^{PS}) values (upper graph) and the scatter plot of the differences between the two quantities (lower graph), where Y_1 ~ W(1, 25) and Y_2 ~ NM(0.5, 0, 1, 2, 1)
2016c) can also be employed. The root of the third-order polynomials in Eq. 6 was found by the polyroot function in the base package. The tetrachoric correlation and the phi coefficient in Eq. 1 were computed by the phi2tetra function in the psych package (Revelle 2016) and the pmvnorm function in the mvtnorm package (Genz et al. 2016), respectively. Finding the polychoric correlation given the ordinal phi coefficient and the opposite direction were performed by the ordcont and contord functions in the GenOrd package (Barbiero and Ferrari 2015), respectively.
4 Simulations in a Multivariate Setting
By the problem definition and design, all development has been presented in bivariate settings. For assessing how the algorithms work in a broader multivariate context and for highlighting the generality of our approach, we present two simulation studies that involve the specification of either pre- or post-discretization correlations. Simulation work is devised around five continuous variables, and four of these are subsequently dichotomized or ordinalized. Referring to the Weibull and Normal mixture densities in the illustrative examples, the distributional forms are as follows: Y_1 ~ W(1, 3.6), Y_2 ~ W(1, 1.2), Y_3 ~ NM(0.3, 0, 1, 2, 1), Y_4 ~ NM(0.5, 0, 1, 2, 1), and Y_5 ~ NM(0.5, 0, 1, 3, 1). Y_1, ..., Y_4 are to be discretized with proportions p_1 = 0.6, p_2 = 0.3, p_3 = (0.4, 0.2, 0.2, 0.2), and p_4 = (0.1, 0.6, 0.3), respectively. Two dichotomized (Y_1 and Y_2), two ordinalized (Y_3 and Y_4), and one continuous (Y_5) variables form a sufficient environment, in which all types of correlations mentioned in this work are covered. For simplicity, we only indicate if the correlations are pre- or post-discretization quantities, without distinguishing between different types in terms of naming and notation in this section. We investigate both directions: (1) The
pre-discretization correlation matrix is specified; the theoretical (algorithmic) post-discretization quantities were computed; data were generated, discretized with the prescription guided by the proportions, and empirical correlations were found via n = 1,000 simulation replicates to see how closely the algorithmic and empirical values are aligned on average. (2) The post-discretization matrix is specified; the correlation among latent variables was computed via the algorithms; data were generated with this correlation matrix; then the data were dichotomized or ordinalized to gauge if we obtain the specified post-discretization correlations on average. In Simulation 1, the pre-discretization correlation matrix (Σ_pre), representing the correlation structure among the continuous variables, is defined as
Σ_pre =
[  1.00   0.14  −0.32   0.56   0.54
   0.14   1.00  −0.10   0.17   0.17
  −0.32  −0.10   1.00  −0.40  −0.38
   0.56   0.17  −0.40   1.00   0.67
   0.54   0.17  −0.38   0.67   1.00 ],
where the variables follow the order of (Y_1, ..., Y_5). Let Σ[i, j] denote the correlation between variables i and j, where i, j = 1, ..., 5. The theoretical post-discretization values (under the assumption that the algorithms function properly and yield the true values) were computed. More specifically, Σ_post[1, 2] was found by Algorithm-1b; Σ_post[1, 5] and Σ_post[2, 5] by Algorithm-2; Σ_post[1, 3], Σ_post[1, 4], Σ_post[2, 3], Σ_post[2, 4], and Σ_post[3, 4] by Algorithm-3b; and Σ_post[3, 5] and Σ_post[4, 5] by Algorithm-4. These values collectively form a post-discretization correlation matrix (Σ_post), which serves as the True Value (TV). The empirical post-discretization correlation estimates were calculated after generating N = 1,000 rows
Table 6 Results of Simulation 1 (the reported quantities are defined in the text)

Parameter   Σ_pre    TV (Σ_post)   AE         RB        PB     SB
Σ[1, 2]      0.14     0.08934      0.08942    0.00008   0.09   0.03
Σ[1, 3]     −0.32    −0.22936     −0.23385    0.00449   1.96   1.40
Σ[1, 4]      0.56     0.38946      0.39128    0.00182   0.47   0.72
Σ[1, 5]      0.54     0.43215      0.43425    0.00210   0.49   0.84
Σ[2, 3]     −0.10    −0.07420     −0.07319    0.00101   1.36   0.32
Σ[2, 4]      0.17     0.12175      0.12073    0.00102   0.84   0.33
Σ[2, 5]      0.17     0.13681      0.13883    0.00202   1.48   0.65
Σ[3, 4]     −0.40    −0.31908     −0.31922    0.00014   0.04   0.05
Σ[3, 5]     −0.38    −0.34082     −0.34671    0.00589   1.73   2.15
Σ[4, 5]      0.67     0.58528      0.58773    0.00245   0.42   1.28
of multivariate latent, continuous data (Y_1, ..., Y_5) with the specified Σ_pre, followed by
discretization of (Y_1, ..., Y_4). The whole process was repeated n = 1,000 times. We evaluated the quality of estimates by three commonly accepted accuracy measures: (a) Raw Bias (RB), (b) Percentage Bias (PB), and (c) Standardized Bias (SB) (Demirtas 2004a, 2007a, b, 2008). They all are functions of the average estimate (AE), and their definitions are well-established: When the parameter of interest is δ, RB = |E[δ̂ − δ]| (absolute average deviation), PB = 100 · |E[δ̂ − δ]|/|δ| (absolute average deviation as a percentage of the true value), and SB = 100 · |E[δ̂ − δ]|/V[δ̂]^{1/2} (absolute average deviation with respect to the overall uncertainty in the system). A procedure is typically regarded as working properly if PB < 5 and SB < 50 (Demirtas et al. 2007). In Table 6, we tabulate Σ_pre, TV (Σ_post), AE, RB, PB, and SB. All three accuracy quantities demonstrate negligibly small deviations from the true values; they are within acceptable limits, suggesting that the set of algorithms provides unbiased estimates. In Simulation 2, we take the reverse route by specifying the post-discretization correlation matrix (Σ_post), which serves as the True Value (TV), in the following way:
             1.00   0.24   0.18   0.10   0.38
             0.24   1.00   0.20  −0.11   0.42
    Σ_post = 0.18   0.20   1.00  −0.07   0.29
             0.10  −0.11  −0.07   1.00  −0.16
             0.38   0.42   0.29  −0.16   1.00
The corresponding pre-discretization matrix was found via the algorithms. The theoretical Σ_pre[1,2] was computed by Algorithm-1a, Σ_pre[1,5] and Σ_pre[2,5] by Algorithm-2, Σ_pre[1,3], Σ_pre[1,4], Σ_pre[2,3], Σ_pre[2,4], and Σ_pre[3,4] by Algorithm-3a, and Σ_pre[3,5] and Σ_pre[4,5] by Algorithm-4. These values jointly form a pre-discretization correlation matrix (Σ_pre). The empirical post-discretization
Anatomy of Correlational Magnitude Transformations in Latency …
Table 7 Results of Simulation 2 (the reported quantities are defined in the text)

Parameter   TV(Σ_post)   Σ_pre     AE         RB        PB     SB
Σ[1,2]       0.24         0.40551   0.24000    0.00000   0.00   0.00
Σ[1,3]       0.18         0.24916   0.17727    0.00273   1.52   0.90
Σ[1,4]       0.10         0.14551   0.10041    0.00041   0.41   0.13
Σ[1,5]       0.38         0.47484   0.38022    0.00022   0.06   0.08
Σ[2,3]       0.20         0.29496   0.20729    0.00729   3.64   2.44
Σ[2,4]      −0.11        −0.15954  −0.10601    0.00399   3.63   1.27
Σ[2,5]       0.42         0.52188   0.43347    0.01347   3.21   5.50
Σ[3,4]      −0.07        −0.08884  −0.07001    0.00001   0.01   0.01
Σ[3,5]       0.29         0.32334   0.29443    0.00443   1.53   1.58
Σ[4,5]      −0.16        −0.18316  −0.16004    0.00004   0.02   0.01
correlation estimates were calculated after generating N = 1,000 rows of multivariate latent, continuous data (Y_1, ..., Y_5) by the computed Σ_pre before discretization of (Y_1, ..., Y_4). As before, this process was repeated n = 1,000 times. In Table 7, we tabulate TV(Σ_post), Σ_pre, AE, RB, PB, and SB. Again, the discrepancies between the expected and empirical quantities are minimal by the three accuracy criteria, providing substantial support for the proposed method.
These results indicate compelling and promising evidence in favor of the algorithms herein. Our evaluation is based on accuracy (unbiasedness) measures. Precision is another important criterion in terms of the quality and performance of the estimates (Demirtas et al. 2008; Demirtas and Hedeker 2008b; Yucel and Demirtas 2010). We address the precision issues by plotting the correlation estimates across all simulation replicates in both scenarios (Fig. 6). The estimates closely match the true values shown in Tables 6 and 7, with a healthy amount of variation that is within the limits of Monte-Carlo simulation error. On a cautious note, however, there seems to be slightly more variation in Simulation 2, which is natural since there are two layers of randomness (an additional source of variability).
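A single cell of this kind of accuracy assessment can be mimicked in miniature. The sketch below is written in Python rather than the chapter's R tooling; the latent correlation of 0.56, the 40th-percentile split, the replication counts, and all function names are illustrative assumptions, not the chapter's actual design. It repeatedly generates N = 1,000 rows of bivariate normal data, dichotomizes the first variable, and summarizes the replicate correlations by AE, RB, PB, and SB, using the normal-theory point-biserial identity as the true value.

```python
import math
from statistics import NormalDist
import numpy as np

nd = NormalDist()

def accuracy_measures(estimates, delta):
    """RB, PB, and SB of simulated estimates relative to the true value delta."""
    est = np.asarray(estimates, dtype=float)
    ae = est.mean()                       # average estimate (AE)
    rb = abs(ae - delta)                  # RB = |E[delta_hat - delta]|
    pb = 100.0 * rb / abs(delta)          # PB: RB as a percentage of delta
    sb = 100.0 * rb / est.std(ddof=1)     # SB: RB relative to overall spread
    return ae, rb, pb, sb

def point_biserial_tv(rho, p):
    """Normal-theory true value: dichotomizing a standard normal at its p-th
    quantile attenuates a latent correlation rho by pdf(t) / sqrt(p (1 - p))."""
    t = nd.inv_cdf(p)
    return rho * nd.pdf(t) / math.sqrt(p * (1.0 - p))

rng = np.random.default_rng(2017)
rho_pre, p_split = 0.56, 0.4              # illustrative, not from the chapter
tv = point_biserial_tv(rho_pre, p_split)

cov = np.array([[1.0, rho_pre], [rho_pre, 1.0]])
estimates = []
for _ in range(200):                      # simulation replicates
    y = rng.multivariate_normal([0.0, 0.0], cov, size=1000)  # N = 1,000 rows
    y1 = (y[:, 0] > nd.inv_cdf(p_split)).astype(float)       # discretize Y1
    estimates.append(np.corrcoef(y1, y[:, 1])[0, 1])

ae, rb, pb, sb = accuracy_measures(estimates, tv)
```

With these settings the three measures come out comfortably inside the working thresholds quoted in the text, mirroring the pattern reported in Tables 6 and 7.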
5 Discussion
If the discretization thresholds and underlying continuous measurements are available, one can easily compute all types of correlations that appear herein. The scale of the work in this chapter is far broader, as it is motivated by and oriented towards computing the different correlational magnitudes before and after discretization when one of these quantities is specified in the context of simulation and RNG, in both directions. The set of proposed techniques is driven by the idea of augmenting the normal-based results concerning different types of correlations to any bivariate continuous
Fig. 6 The trace plot of correlation estimates for Simulation 1 (left graph) and Simulation 2 (right graph) across n = 1,000 replicates; they closely match the true values shown in Tables 6 and 7
setting. Nonnormality is handled by the power polynomials that map the normal and nonnormal correlations. The approach works as long as the marginal characteristics (skewness and elongation parameters for continuous data and proportion values for binary/ordinal data) and the degree of linear association between the two variables are legitimately defined, regardless of the shape of the underlying bivariate continuous density. When the above-mentioned quantities are specified, one can connect correlations before and after discretization in a relatively simple manner.
One potential limitation is that power polynomials cover most of the feasible symmetry-peakedness plane (ν_2 ≥ ν_1² − 2), but not entirely. In an attempt to span a larger space, one can utilize the fifth-order polynomial systems (Demirtas 2017; Headrick 2002), although it may not constitute an ultimate solution. In addition, a minor concern could be that unlike binary data, the marginal proportions and the second-order product moment (correlation) do not fully define the joint distribution for ordinal data. In other words, odds ratios and correlations do not uniquely determine each other. However, in the overwhelming majority of applications, the specification of the first and second order moments suffices for practical purposes; and given the scope of this work, which is modeling the transition between different pairs of correlations, this complication is largely irrelevant. Finally, the reasons we base our algorithms on the Pearson correlation (rather than the Spearman correlation) are that it is much more common in the RNG context and in practice; and the differences between the two
are negligibly small in most cases. Extending this method to encompass the Spearman correlation will be taken up in future work, resorting to a variation of the sorting idea that appeared in Demirtas and Hedeker (2011), allowing us to capture any monotonic relationship in addition to the linear relationships. On a related note, further expansions can be imagined to accommodate more complex associations that involve higher order moments.
The positive characteristics and salient advantages of these algorithms are as follows:
• They work for an extensive class of underlying bivariate latent distributions whose components are allowed to be non-identically distributed. Nearly all continuous shapes and skip patterns for ordinal variables are permissible.
• The required software tools for the implementation are rather basic; users merely need a computational platform with a numerical double-integration solver for the binary-binary case, univariate RNG capabilities for the binary/ordinal-continuous case, an iterative scheme that connects the polychoric correlations and the ordinal phi coefficients under the normality assumption for the ordinal-ordinal case, and a polynomial root-finder and a nonlinear equation set solver to handle nonnormal continuous variables.
• The description of the connection between the two correlations is naturally given for the bivariate case. The multivariate extension is easily manageable by assembling the individual correlation entries. The way the techniques work is independent of the number of variables; the curse of dimensionality is not an issue.
• The algorithms could be conveniently used in meta-analysis domains where some studies discretize variables and some others do not.
• Assessing the magnitude of change in correlations before and after ordinalization is likely to be contributory in simulation studies where we replicate the specified trends, especially when simultaneous access to the latent data and the eventual binary/ordinal data is desirable.
• One can more rigorously fathom the nature of discretization in the sense of knowing how the correlation structure is transformed after dichotomization or ordinalization.
• The proposed procedures can be regarded as a part of sensible RNG mechanisms to generate multivariate latent variables as well as subsequent binary/ordinal variables given their marginal shape characteristics and associational structure in simulated environments, potentially expediting the development of novel mixed data generation routines, especially when an RNG routine is structurally involved with generating multivariate continuous data as an intermediate step.
• In conjunction with the published works on joint binary/normal (Demirtas and Doganay 2012), binary/nonnormal continuous (Demirtas et al. 2012), ordinal/normal (Demirtas and Yavuz 2015), count/normal (Amatya and Demirtas 2015), and multivariate ordinal data generation (Demirtas 2006), the ideas presented in this chapter might serve as a milestone for concurrent mixed data generation schemes that span binary, ordinal, count, and nonnormal continuous data.
• These algorithms may be instrumental in developing multiple imputation strategies for mixed longitudinal or clustered data as a generalization of the incomplete data methods published in Demirtas and Schafer (2003), Demirtas (2004b, 2005), and Demirtas and Hedeker (2007, 2008c). Concomitantly, they can be helpful in improving rounding techniques in multiple imputation (Demirtas 2009, 2010).
Wrapping it up, this work is inspired by the development of algorithms that are designed to model the magnitude of change in correlations when discretization is employed. In this regard, the proposed algorithms could be of working functionality in identifying the relationships between different types of correlations before and after discretization, and have noteworthy advantages for simulation purposes. As a final note, a software implementation of the algorithms can be accessed through the recent R package CorrToolBox (Allozi and Demirtas 2016).
References
Allozi, R., & Demirtas, H. (2016). Modeling correlational magnitude transformations in discretization contexts, R package CorrToolBox. https://cran.r-project.org/web/packages/CorrToolBox.
Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data with Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85, 3129–3139.
Barbiero, A., & Ferrari, P. A. (2015). Simulation of ordinal and discrete variables with given correlation matrix and marginal distributions, R package GenOrd. https://cran.r-project.org/web/packages/GenOrd.
Cario, M. C., & Nelson, B. R. (1997). Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix (Technical Report). Department of Industrial Engineering and Management Services: Northwestern University, Evanston, IL, USA.
Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica, 58, 466–482.
Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized estimating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical Statistics, 14, 1085–1098.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation, 76, 1017–1025.
Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate interest centers on dichotomized outcomes through pre-specified thresholds. Communications in Statistics-Simulation and Computation, 36, 871–889.
Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine, 26, 3818–3821.
Demirtas, H. (2008). On imputing continuous data when the eventual interest pertains to ordinalized outcomes via threshold concept. Computational Statistics and Data Analysis, 52, 2261–2271.
Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal, 51, 677–688.
Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal of Applied Statistics, 37, 489–500.
Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.
Demirtas, H. (2017). Concurrent generation of binary and nonnormal continuous data through fifth-order power polynomials. Communications in Statistics-Simulation and Computation, 46, 344–357.
Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric correlations: Modeling the change in correlations before and after discretization. Computational Statistics, 31, 1385–1401.
Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-reduction techniques for variance estimation in approximate Bayesian bootstrap imputation. Computational Statistics and Data Analysis, 51, 4064–4068.
Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal of Statistical Computation and Simulation, 78, 69–84.
Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strategies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.
Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communications in Statistics-Simulation and Computation, 37, 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distributions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal data. Statistics in Medicine, 27, 4086–4093.
Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper correlation bounds. The American Statistician, 65, 104–109.
Demirtas, H., & Hedeker, D. (2016). Computing the point-biserial correlation under any underlying continuous distribution. Communications in Statistics-Simulation and Computation, 45, 2744–2751.
Demirtas, H., Hedeker, D., & Mermelstein, J. M. (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31, 3337–3346.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 22, 2553–2575.
Demirtas, H., Shi, Y., & Allozi, R. (2016b). Simultaneous generation of count and continuous data, R package PoisNonNor. https://cran.r-project.org/web/packages/PoisNonNor.
Demirtas, H., Wang, Y., & Allozi, R. (2016c). Concurrent generation of binary, ordinal and continuous data, R package BinOrdNonNor. https://cran.r-project.org/web/packages/BinOrdNonNor.
Demirtas, H., & Yavuz, Y. (2015). Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics, 25, 635–650.
Emrich, J. L., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician, 45, 302–304.
Farrington, D. P., & Loeber, R. (2000). Some benefits of dichotomization in psychiatric and criminological research. Criminal Behaviour and Mental Health, 10, 100–122.
Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research, 47, 566–589.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–532.
Fréchet, M. (1951). Sur les tableaux de corrélation dont les marges sont données. Annales de l'Université de Lyon Section A, 14, 53–77.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Bornkamp, B., Maechler, M., & Hothorn, T. (2016). Multivariate normal and t distributions, R package mvtnorm. https://cran.r-project.org/web/packages/mvtnorm.
Headrick, T. C. (2002). Fast fifth-order polynomial transforms for generating univariate and multivariate nonnormal distributions. Computational Statistics and Data Analysis, 40, 685–711.
Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations. Boca Raton, FL: Chapman and Hall/CRC.
Hoeffding, W. (1994). Scale-invariant correlation theory. In N. I. Fisher & P. K. Sen (Eds.), The collected works of Wassily Hoeffding (the original publication year is 1940) (pp. 57–107). New York: Springer.
Inan, G., & Demirtas, H. (2016). Data generation with binary and continuous non-normal components, R package BinNonNor. https://cran.r-project.org/web/packages/BinNonNor.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.
R Development Core Team. (2016). R: A language and environment for statistical computing. http://www.cran.r-project.org.
Revelle, W. (2016). Procedures for psychological, psychometric, and personality research, R package psych. https://cran.r-project.org/web/packages/psych.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471.
Yucel, R. M., & Demirtas, H. (2010). Impact of non-normal random effects on inference by multiple imputation: A simulation assessment. Computational Statistics and Data Analysis, 54, 790–801.
Monte-Carlo Simulation of Correlated Binary Responses Trent L. Lalonde
Abstract Simulation studies can provide powerful conclusions for correlated or longitudinal response data, particularly for relatively small samples for which asymptotic theory does not apply. For the case of logistic modeling, it is necessary to have appropriate methods for simulating correlated binary data along with associated predictors. This chapter presents a discussion of existing methods for simulating correlated binary response data, including comparisons of various methods for different data types, such as longitudinal versus clustered binary data generation. The purposes and issues associated with generating binary responses are discussed. Simulation methods are divided into four main approaches: using a marginally specified joint probability distribution, using mixture distributions, dichotomizing non-binary random variables, and using a conditionally specified distribution. Approaches using a completely specified joint probability distribution tend to be more computationally intensive and require determination of distributional properties. Mixture methods can involve mixtures of discrete variables only, mixtures of continuous variables only, and mixtures involving both continuous and discrete variables. Methods that involve discretizing non-binary variables most commonly use normal or uniform variables, but some use count variables such as Poisson random variables. Approaches using a conditional specification of the response distribution are the most general, and allow for the greatest range of autocorrelation to be simulated. The chapter concludes with a discussion of implementations available using R software.
1 Introduction
Correlated binary data occur frequently in practice, across disciplines such as health policy analysis, clinical biostatistics, econometric analyses, and education research. For example, health policy researchers may record whether or not members of a household have health insurance; econometricians may be interested in whether small
T.L. Lalonde (B)
Department of Applied Statistics and Research Methods, University of Northern Colorado, Greeley, CO, USA
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2017
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_5
T.L. Lalonde
businesses within various urban districts have applied for financial assistance; higher education researchers might study the probabilities of college attendance for students from a number of high schools. In all of these cases the response of interest can be represented as a binary outcome, with a reasonable expectation of autocorrelation among those responses. Correspondingly, analysis of correlated binary outcomes has received considerable and long-lasting attention in the literature (Stiratelli et al. 1984; Zeger and Liang 1986; Prentice 1988; Lee and Nelder 1996; Molenberghs and Verbeke 2006). The most common models fall under the class of correlated binary logistic regression modeling.
While appropriately developed logistic regression models include asymptotic estimator properties, Monte-Carlo simulation can be used to augment the theoretical results of such large-sample distributional properties. Monte-Carlo simulation can be used to confirm such properties, and perhaps more importantly, simulation methods can be used to complement large-sample distributional properties with small-sample results. Therefore it is important to be able to simulate binary responses with specified autocorrelation so that correlated binary data models can benefit from simulation studies.
Throughout the chapter, the interest will be in simulating correlated binary outcomes, Y_ij, where i indicates a cluster of correlated responses and j enumerates the responses. It will be assumed that the simulated outcomes have specified marginal probabilities, π_ij, and pairwise autocorrelation, ρ_ij,ik. The term "cluster" will be used to refer to a homogenous group of responses known to have autocorrelation, such as an individual in a longitudinal study or a group in a correlated study. For many of the algorithms presented in this discussion, a single cluster will be considered for simplicity, in which case the first subscript of the Y_ij will be omitted and Y_i, π_i, and ρ_ij will be used instead. Some methods will additionally require the specification of joint probabilities, higher-order correlations, or predictors.
1.1 Binary Data Issues

There are a number of important factors to consider when developing or evaluating a correlated data simulation technique. Common issues include computational feasibility or simplicity, the incorporation of predictors or covariates, variation of parameters between and within clusters, and the ability to effectively control expectations, variation, and autocorrelation. For binary data in particular, there are a number of additional concerns that are not relevant to the more traditional normal, continuous data generation.
Simulation of binary responses is typically based on a probability of interest, or the expectation of the Bernoulli distribution. Many authors have noted that the pairwise joint probabilities for binary data, π_{i,j} = P(Y_i = 1, Y_j = 1), are restricted by the marginal probabilities, according to,

    max(0, π_i + π_j − 1) ≤ π_{i,j} ≤ min(π_i, π_j),
which imposes restrictions on joint distributions according to the desired marginal probabilities. In addition, it is necessary to properly account for the inherent mean-variance relationship associated with Bernoulli data, Var(Y_i) = π_i(1 − π_i), imposing further restrictions on higher-order moments based on the desired marginal probabilities.
Many methods of simulating correlated binary data suffer from restricted ranges of produced autocorrelation, which should typically span (−1, 1). For correlated binary outcomes Y_1, ..., Y_N with marginal expectations π_1, ..., π_N, Prentice (1988) argued that the pairwise correlation between any two responses Y_i and Y_j must lie within the range (l, u), where

    l = max{ −[π_i π_j / ((1 − π_i)(1 − π_j))]^{1/2}, −[(1 − π_i)(1 − π_j) / (π_i π_j)]^{1/2} },
    u = min{ [π_i (1 − π_j) / (π_j (1 − π_i))]^{1/2}, [π_j (1 − π_i) / (π_i (1 − π_j))]^{1/2} },    (1)
to satisfy the requirements for the joint distribution. This implies that, depending on the desired marginal probabilities, using a fully specified joint probability distribution can lead to simulated values with restricted ranges of pairwise correlation. The restrictions of Eq. 1 can be counterintuitive to researchers who are used to the unconstrained correlation values of normal variables.
Some methods of simulating correlated binary outcomes struggle to control changes in probabilities across clusters of correlated data. Due to the typically nonlinear nature of the relationships between response probabilities and predictors, many methods also fail to incorporate predictors into data simulation. These issues, among others, must be considered when developing and selecting a method for simulating correlated binary data.
This chapter presents a thorough discussion of existing methods for simulating correlated binary data. The methods are broadly categorized into four groups: correlated binary outcomes produced directly from a fully specified joint probability distribution, from mixtures of discrete or continuous variables, from dichotomized continuous or count variables, and from conditional probability distributions. The oldest literature is available for fully specified joint probability distributions, but often these methods are computationally intensive and require specification of higher-order probabilities or correlations at the outset. Mixture approaches have been used to combine products of binary variables to induce the desired autocorrelation, while mixtures involving continuous variables require dichotomization of the resulting values. Dichotomizing normal or uniform variables is a widely implemented approach to producing binary data, although less well-known approaches have also been pursued, such as dichotomizing counts. Conditional specification of a binary distribution typically makes use of "prior" binary outcomes or predictors of interest, and tends to lead to the greatest range of simulated autocorrelation. Each of these methods for generating correlated binary data will be discussed through a chronological perspective, with some detail provided and some detail left to the original publications. The chapter concludes with general recommendations for the most effective binary data simulation methods.
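Both feasibility constraints described above are easy to check numerically. The sketch below is a minimal Python illustration (the function names are mine, not from the chapter) of the joint-probability restriction and the Prentice (1988) correlation bounds of Eq. 1:

```python
import math

def joint_prob_bounds(pi_i, pi_j):
    """Range for the pairwise joint probability pi_ij = P(Y_i = 1, Y_j = 1)
    implied by the two marginal success probabilities."""
    return max(0.0, pi_i + pi_j - 1.0), min(pi_i, pi_j)

def prentice_bounds(pi_i, pi_j):
    """Feasible correlation range (l, u) between two binary variables with
    success probabilities pi_i and pi_j (Prentice 1988, Eq. 1)."""
    qi, qj = 1.0 - pi_i, 1.0 - pi_j
    l = max(-math.sqrt((pi_i * pi_j) / (qi * qj)),
            -math.sqrt((qi * qj) / (pi_i * pi_j)))
    u = min(math.sqrt((pi_i * qj) / (pi_j * qi)),
            math.sqrt((pi_j * qi) / (pi_i * qj)))
    return l, u
```

Equal marginals of 0.5 recover the full (−1, 1) range, while unequal marginals such as π_i = 0.2 and π_j = 0.7 shrink it to roughly (−0.76, 0.33), illustrating how far the feasible range can fall short of the unconstrained normal case.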
2 Fully Specified Joint Probability Distributions
The method for simulating correlated binary outcomes with the longest-tenured presence in the literature is full specification of a joint probability distribution for correlated binary variates. The joint pdf can either be written explicitly in closed form, or derived by exhaustive listing of possible outcomes and associated probabilities. In all cases the generation of data relies on the method of Devroye (1986).
2.1 Simulating Binary Data with a Joint PDF

Given a joint probability density function for any type of binary data, correlated or independent, Devroye (1986) described an effective method for generating appropriate sequences of binary outcomes. Assume a joint pdf has been fully specified for the binary random vector Y = (Y_1, . . . , Y_N)^T, such that marginal probabilities can be calculated for any combination of binary outcomes. For finite N, probabilities can be calculated for all 2^N possible outcome vectors, denoted p_0, . . . , p_{2^N − 1}. Realizations for the random vector Y can be generated according to the following algorithm.
Generating Binary Values with a Joint PDF
1. Order the probabilities from smallest to largest, p_(0), . . . , p_(2^N − 1).
2. Define cumulative values z_j according to: z_0 = 0, z_j = z_{j−1} + p_{(j)}, z_{2^N} = 1.
3. Generate a standard uniform variate U on (0, 1).
4. Select j such that z_j ≤ U < z_{j+1}.
5. The random sequence is the binary representation of the integer j.
The method is relatively simple, relying on calculating probabilities associated with all possible N-dimensional binary sequences, and generating a single random uniform variable to produce an entire vector of binary outcomes. Devroye (1986) has argued that this method produces appropriate binary sequences according to the pdf
provided, and the method is relied on extensively in situations in which a joint pdf can be constructed. In fact, relying on this algorithm, many authors have pursued methods of constructing a joint pdf as a means to generate vectors of binary responses.
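The five steps above can be sketched in a few lines. The following is a minimal Python version, assuming the joint pmf is supplied as an array whose j-th entry is the probability of the binary representation of the integer j (the function name and this indexing convention are my own):

```python
import numpy as np

def devroye_binary_vector(joint_pmf, rng=None):
    """Draw one length-N binary vector from a fully specified joint pmf.
    joint_pmf[j] is taken to be the probability of the binary
    representation of the integer j (an assumption of this sketch)."""
    rng = np.random.default_rng(rng)
    p = np.asarray(joint_pmf, dtype=float)
    n = int(np.log2(len(p)))
    # Steps 1-2: order the probabilities and form the cumulative values z_j.
    order = np.argsort(p)
    z = np.cumsum(p[order])
    # Steps 3-4: a single uniform draw selects the interval containing it.
    u = rng.uniform()
    j = int(order[min(np.searchsorted(z, u, side="right"), len(p) - 1)])
    # Step 5: the outcome vector is the binary representation of j.
    return [(j >> k) & 1 for k in reversed(range(n))]
```

Note that one uniform draw produces the entire outcome vector, which is what makes the method attractive whenever the full table of 2^N probabilities is available.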
2.2 Explicit Specification of the Joint PDF

Simulation of correlated binary outcomes is generally thought to begin with Bahadur (1961). Bahadur (1961) presented a joint pdf for N correlated binary random variables as follows. Let y_i indicate realizations of binary variables, let π = P(Y_i = 1) represent the constant expectation of the binary variables, all equivalent, and let ρ_ij be the autocorrelation between two variables Y_i and Y_j. Then the joint pdf can be written,

f(y_1, . . . , y_N) = [ ∏_i π^{y_i} (1 − π)^{1 − y_i} ] [ 1 + Σ_{i<j} ρ_ij (−1)^{y_i + y_j} π^{2 − y_i − y_j} (1 − π)^{y_i + y_j} / (π(1 − π)) ].
While Bahadur (1961) expressed the joint pdf in terms of lagged correlations for an autoregressive time series, the idea expands generally to clustered binary data. In practice it is necessary to estimate range restrictions for the autocorrelation to ensure f(y_1, . . . , y_N) ∈ (0, 1). These restrictions depend on values of π, and are typically determined empirically (Farrell and Sutradhar 2006). Using the algorithm of Devroye (1986), the model of Bahadur (1961) can be used to simulate values for a single cluster or longitudinal subject, then repeated for additional clusters. This allows the probability π to vary across clusters; however, π is assumed constant within clusters, reducing the ability to incorporate effects of covariates. Most crucially, the pdf given by Bahadur (1961) will become computationally burdensome for high-dimensional data simulations (Lunn and Davies 1998; Farrell and Sutradhar 2006).
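As a concrete illustration, the Bahadur pdf can be tabulated for a small cluster and then passed to Devroye's algorithm. The sketch below assumes a common mean and exchangeable correlation (ρ_ij = ρ), a simplification of my own; checking that every cell is nonnegative is exactly the empirical range check described above:

```python
import itertools
import math

def bahadur_pmf(pi, rho, n):
    """Tabulate Bahadur's joint pmf for n binary variables with common
    mean pi and exchangeable correlation rho. A sketch: the function name
    and the exchangeable simplification (rho_ij = rho) are my own."""
    pmf = {}
    for y in itertools.product([0, 1], repeat=n):
        # Independence part: prod_i pi^y_i (1 - pi)^(1 - y_i).
        base = math.prod(pi if yi == 1 else 1 - pi for yi in y)
        # Correlation adjustment: 1 + rho * sum_{i<j} z_i z_j with
        # z_i = (y_i - pi) / sqrt(pi * (1 - pi)), equivalent to the
        # (-1)^(...) pi^(...) (1 - pi)^(...) form of the displayed pdf.
        z = [(yi - pi) / math.sqrt(pi * (1 - pi)) for yi in y]
        adj = 1 + rho * sum(z[i] * z[j] for i in range(n) for j in range(i + 1, n))
        pmf[y] = base * adj
    return pmf
```

If any cell of the table comes out negative, the chosen (π, ρ) pair lies outside Bahadur's admissible range and must be adjusted before simulation.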
2.3 Derivation of the Joint PDF

Instead of relying on the joint pdf of Bahadur (1961), authors have pursued the construction of joint probability distributions not according to a single pdf formula, but instead by using desired properties of the simulated data to directly calculate all possible probabilities associated with N-dimensional sequences of binary outcomes. Often these methods involve iterative processes, solutions to linear or nonlinear systems of equations, and matrix and function inversions. They are computationally complex but provide complete information about the distribution of probabilities without using a closed-form pdf.

Lee (1993) introduced a method that relies on a specific copula distribution, with binary variables defined according to a relationship between the copula parameters
and the desired probabilities and correlation for the simulated binary responses. A copula (Genest and MacKay 1986a, b) is a multivariate distribution for random variables (X_1, . . . , X_N) on the N-dimensional unit space, such that each marginal distribution for X_i is uniform on the domain (0, 1). The variables X_1, . . . , X_N can be used to generate binary variables Y_1, . . . , Y_N by dichotomizing according to Y_i = I(X_i > π_i), where π_i is the desired expectation of the binary variable Y_i and I(·) takes the value of 1 for a true argument and 0 otherwise. However, Lee (1993) did not use a simple dichotomization of continuous variables. In order to extend this idea to allow for autocorrelation, Lee (1993) proposed using the copula with exchangeable correlation, the Archimedean copula proposed by Genest and MacKay (1986a). Use of such a copula will induce Pearson correlation between any two random binary variables Y_i and Y_j given by,

ρ_ij = (π_{0,0} − π_i π_j) / √(π_i (1 − π_i) π_j (1 − π_j)),    (2)

where π_{0,0} is the joint probability that both variables Y_i and Y_j take the value 0. Because of the restriction that π_{0,0} ≤ min(π_i, π_j), it follows that π_{0,0} > π_i π_j and therefore the induced Pearson correlation will always be positive when this method is applied. The method of Lee (1993) requires the correlation of Eq. 2 to be constant within clusters.
Constructing the Joint Distribution by Archimedean Copula
1. For a single cluster, determine marginal probabilities π_1, . . . , π_N for correlated binary variables Y_1, . . . , Y_N with desired constant autocorrelation ρ.
2. A value for the Archimedean copula distribution parameter can be calculated based on the values of π_i and ρ. The parameter takes values on (0, 1], where a value of 1 indicates independence, while as the parameter approaches 0 the Kendall correlation converges to 1.
3. Using the Archimedean copula distribution, solve linearly for the joint probabilities of Y_1, . . . , Y_N with respect to the binary representation of each integer j, where 0 ≤ j ≤ 2^N − 1,

P_j = P(Y_1 = j_1, . . . , Y_N = j_N),    (3)

where j_1 . . . j_N is the binary representation of j. Calculation of this probability using the Archimedean copula has closed form but requires computation of the inverse of a CDF-like function.
4. Given the probabilities P_0, . . . , P_{2^N − 1}, simulate a binary sequence according to the algorithm of Devroye (1986).
5. Repeat for additional clusters.
The method of Lee (1993) allows inclusion of predictors when determining the marginal π_i, which clearly allows these probabilities to vary within clusters. However, the method is restricted to positive, exchangeable autocorrelation that cannot vary within groups, and requires the solution to numerous systems of equations and the inverse of a CDF-like function. Kang and Jung (2001) discussed an alternative to this method such that the probabilities P_j can be obtained by solving a nonlinear system of equations relating P_j to the first two desired moments, π_i and the correlation Corr(Y_i, Y_j) = ρ_ij. However, the necessity of solving a system of nonlinear equations does not decrease the computational complexity of the algorithm as compared to the copula approach.

The method of Gange (1995) simplifies the method of Lee (1993) but can add computational complexity. Suppose that, in addition to specifying the desired marginal probabilities π_i associated with the correlated binary variables Y_i, pairwise and higher-order joint probabilities are also specified. That is, joint probabilities π_{i,j} = P(Y_i = y_i, Y_j = y_j) and higher-order joint probabilities π_{i_1,...,i_k} = P(Y_{i_1} = y_{i_1}, . . . , Y_{i_k} = y_{i_k}) can be specified, up to order k. Given such probabilities, a full joint probability density function is constructed through the Iterative Proportional Fitting algorithm.

The main idea of the Iterative Proportional Fitting algorithm is to equate derivation of the joint pdf to fitting a log-linear model to a contingency table of order 2^N, including interactions up to order k. Each pairwise or higher-order joint probability is treated as a constraint on the implicit contingency table, and has an associated nonlinear equation. Solving this system of nonlinear equations corresponding to the higher-order joint probabilities is equated to the standard likelihood-based solution to log-linear model fitting. Predicted values from the log-linear model give the probabilities associated with all possible outcome combinations.
Constructing the Joint Distribution using the Iterative Proportional Fitting Algorithm
1. For a single cluster, specify the desired marginal probabilities π_i, pairwise probabilities π_{i,j}, and up to k-order joint probabilities π_{i_1,...,i_k}.
2. Construct a log-linear model with interactions up to order k corresponding to the constraints specified by the probabilities up to order k.
3. Fit the log-linear model to obtain estimated probabilities corresponding to the joint marginal pdf.
4. Simulate binary values according to the algorithm of Devroye (1986).
5. Repeat for additional clusters.
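The fitting steps above can be sketched without an explicit log-linear model by rescaling the implicit 2^N table directly, which is the classical cell-rescaling form of Iterative Proportional Fitting. In the sketch below (names and the uniform starting table are my own), each marginal and pairwise constraint is imposed in turn until the table converges:

```python
import itertools
import numpy as np

def ipf_joint_table(marginals, pairwise, n_sweeps=200):
    """Fit a full 2**N table by Iterative Proportional Fitting.
    marginals[i] is the desired P(Y_i = 1); pairwise[(i, j)] is the
    desired P(Y_i = 1, Y_j = 1). Direct cell rescaling stands in for
    the explicit log-linear model fit; names are my own."""
    N = len(marginals)
    outcomes = list(itertools.product([0, 1], repeat=N))
    p = np.full(len(outcomes), 1.0 / len(outcomes))  # start from the uniform table
    for _ in range(n_sweeps):
        # Impose each one-way margin: scale the Y_i = 1 cells to the target
        # total and the complementary cells to 1 - target.
        for i, target in enumerate(marginals):
            mask = np.array([y[i] == 1 for y in outcomes])
            p[mask] *= target / p[mask].sum()
            p[~mask] *= (1 - target) / p[~mask].sum()
        # Impose each two-way (1, 1) constraint the same way.
        for (i, j), target in pairwise.items():
            mask = np.array([y[i] == 1 and y[j] == 1 for y in outcomes])
            p[mask] *= target / p[mask].sum()
            p[~mask] *= (1 - target) / p[~mask].sum()
    return dict(zip(outcomes, p))
```

The resulting table can then be passed to the algorithm of Devroye (1986) exactly as in step 4 above.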
The method of Gange (1995) allows for covariates to be included in determining initial probabilities and pairwise or higher-order joint probabilities, and Gange (1995) describes how the pairwise probabilities can be connected to the working correlation structure of the Generalized Estimating Equations, allowing for specific correlation structures to be derived. However, specific correlation structures are not explicitly exemplified by Gange (1995). Further, this method requires an iterative procedure to solve a system of nonlinear equations, which can become computationally burdensome with increased dimension of each cluster.
3 Specification by Mixture Distributions
Many approaches to simulating correlated binary variables rely on the properties of random variables defined as mixtures of other random variables with various distributions. Authors have pursued mixtures of discrete distributions, continuous
distributions, and combinations of discrete and continuous distributions. Methods relying solely on mixtures of continuous distributions tend to struggle to represent correlated binary responses.
3.1 Mixtures Involving Discrete Distributions

Many attempts to circumvent the computational burden of the methods of Bahadur (1961), Lee (1993), and Gange (1995) have turned to generating binary responses through mixtures of discrete random variables. Kanter (1975) presented a method that directly uses autoregression to simulate properly correlated binary outcomes. Suppose it is of interest to simulate a series of binary values Y_i with constant expectation π. Kanter (1975) proposed to use the model,

Y_i = U_i (Y_{i−1} ⊕ W_i) + (1 − U_i) W_i,

where ⊕ indicates addition modulo 2, U_i can be taken to be Bernoulli with probability π_U, and W_i can be taken to be Bernoulli with probability π_W = π(1 − π_U)/(1 − 2π π_U). Assuming Y_{i−1}, U_i, and W_i to be independent, it can be shown that all Y_i have expectation π and that the autocorrelation between any two outcomes is

Corr(Y_i, Y_j) = [ π_U (1 − 2π) / (1 − 2π π_U) ]^{|i − j|}.
The method of Kanter (1975) requires π_U ∈ (0, min((1 − π)/π, 1)), and includes restrictions on the autocorrelation in the simulated data based on the probabilities used. In particular, Farrell and Sutradhar (2006) showed that no negative autocorrelation can be generated with any probability π chosen to be less than or equal to 0.50. In addition, this method does not allow for easy variation of probabilities within clusters or series. Use of a mixture of binary random variables was updated by Lunn and Davies (1998), who proposed a simple method for generating binary data for multiple clusters simultaneously, and for various types of autocorrelation structure within groups.
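Kanter's recursion is short enough to state directly in code. A sketch (the function name is my own), where the lag-1 correlation implied by the formula above is π_U(1 − 2π)/(1 − 2π π_U):

```python
import numpy as np

def kanter_series(n, pi, pi_u, rng=None):
    """Binary series via Kanter's mixture
    Y_i = U_i (Y_{i-1} XOR W_i) + (1 - U_i) W_i,
    with stationary mean pi. Function name is my own."""
    assert 0 < pi_u < min((1 - pi) / pi, 1), "pi_u outside the allowed range"
    pi_w = pi * (1 - pi_u) / (1 - 2 * pi * pi_u)
    rng = np.random.default_rng(rng)
    y = np.empty(n, dtype=int)
    y[0] = rng.binomial(1, pi)               # start the chain at its stationary mean
    for i in range(1, n):
        u = rng.binomial(1, pi_u)
        w = rng.binomial(1, pi_w)
        # U_i selects between (Y_{i-1} XOR W_i) and a fresh W_i.
        y[i] = u * (y[i - 1] ^ w) + (1 - u) * w
    return y

y = kanter_series(50_000, pi=0.3, pi_u=0.5, rng=1)
```

With π = 0.3 and π_U = 0.5 the implied lag-1 correlation is 0.2/0.7 ≈ 0.286, which a long simulated series reproduces closely.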
Suppose the intention is to simulate random binary variables Y_ij such that the expectation of each is cluster-dependent, π_i, and the autocorrelation can be specified. First assume a positive, constant correlation ρ_i is desired within clusters. Then simulate binary values according to the equation,

Y_ij = (1 − U_ij) W_ij + U_ij Z_i,    (4)

where U_ij can be taken to be Bernoulli with probability π_U = √ρ_i, and both W_ij and Z_i can be Bernoulli with success probability π_i, all independent. Then it can be shown that E[Y_ij] = π_i and Corr(Y_ij, Y_ik) = ρ_i. Lunn and Davies (1998) explain how to adjust their method to allow for additional correlation structures within clusters, by changing the particular mixture from Eq. 4,

Y_ij = (1 − U_ij) W_ij + U_ij Z_i    (Exchangeable),
Y_ij = (1 − U_ij) W_ij + U_ij Y_{i,j−1}    (Autoregressive),
Y_ij = (1 − U_ij) W_ij + U_ij W_{i,j−1}    (M-Dependent).
This algorithm requires constant outcome probabilities within clusters. In order to accommodate varying probabilities within clusters, Lunn and Davies (1998) proposed a simple transformation of generated binary responses,

Ỹ_ij = A_ij Y_ij,

where A_ij is taken to be Bernoulli with success probability α_ij, independent of all other simulated variables. Then the Ỹ_ij satisfy the previous distributional requirements with E[Ỹ_ij] = α_ij max(π_i). Lunn and Davies (1998) acknowledge that this multiplicative transformation imposes an additional multiplicative correction to the correlation between generated binary values. While this transformation allows the probabilities to vary within clusters, it does not easily incorporate predictors into this variation, instead accommodating known success probabilities that may differ across responses within clusters.

Generating Binary Values using a Binary Distribution Mixture
1. For a single cluster i, determine the desired probability π_i and autocorrelation structure, along with cluster-dependent correlation ρ_i.
2. Depending on the autocorrelation structure, generate U_ij, W_ij, and possibly Z_i.
3. Calculate Y_ij using the appropriate mixture.
4. If necessary, transform Y_ij to Ỹ_ij to accommodate varying probabilities within-cluster.
5. Repeat for other clusters.
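For the exchangeable case, Eq. 4 vectorizes over many clusters at once. A sketch (the function name and the vectorized layout are my own):

```python
import numpy as np

def lunn_davies_exchangeable(n_clusters, cluster_size, pi, rho, rng=None):
    """Exchangeable correlated binary clusters via the Lunn-Davies mixture
    Y_ij = (1 - U_ij) W_ij + U_ij Z_i with P(U_ij = 1) = sqrt(rho).
    Function name and vectorization are my own."""
    rng = np.random.default_rng(rng)
    u = rng.binomial(1, np.sqrt(rho), size=(n_clusters, cluster_size))
    w = rng.binomial(1, pi, size=(n_clusters, cluster_size))
    z = rng.binomial(1, pi, size=(n_clusters, 1))   # one shared draw per cluster
    return (1 - u) * w + u * z

y = lunn_davies_exchangeable(20_000, 4, pi=0.4, rho=0.25, rng=2)
```

Because each Y_ij copies the shared Z_i exactly when both mixing indicators equal one, which happens with probability (√ρ)² = ρ, the within-cluster correlation comes out to ρ as claimed.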
The issue of allowing within-cluster variation of success probabilities was addressed by Oman and Zucker (2001) in a method that is truly a combination of mixture variable simulation and dichotomization of continuous values. Oman and Zucker (2001) argued that the cause of further restricted ranges of autocorrelation in simulated binary data, beyond the limits from Eq. 1 given by Prentice (1988), is the combination of varying probabilities within clusters and the inherent mean-variance relationship of binary data. Assume the interest is in simulating binary variables Y_ij with probabilities π_ij, varying both between and within clusters. To define the joint probability distribution of any two binary outcomes Y_ij and Y_ik, define

P(Y_ij Y_ik = 1) = (1 − ν_jk) π_ij π_ik + ν_jk min(π_ij, π_ik),    (5)
where ν_jk is chosen to reflect the desired correlation structure within groups, as follows. The joint distribution specified by Eq. 5 allows the correlation between any two responses within a cluster, denoted ρ_jk, to be written ρ_jk = ν_jk × max(ρ_st), where the maximum is taken over all values of correlation within cluster i. The process of generating binary values is constructed to accommodate this joint distribution as follows. Define responses according to

Y_ij = I(Z_ij ≤ θ_ij),

where θ_ij = F^{−1}(π_ij) for any continuous CDF F, and Z_ij is defined as an appropriate mixture. Similarly to the method of Lunn and Davies (1998), Oman and Zucker (2001) provide mixtures for common correlation structures,

Z_ij = U_ij X_i0 + (1 − U_ij) X_ij    (Exchangeable),
Z_ij = U_ij X_{i,j−1} + (1 − U_ij) X_ij    (Moving Average),
Z_ij = U_ij Z_{i,j−1} + (1 − U_ij) X_ij    (Autoregressive),

where U_ij can be taken to be Bernoulli with probability π_U = √ν_jk, and all X_ij are independently distributed according to the continuous distribution F. The different correlation structures are induced both by adjusting the mixture as given above and by defining ν_jk accordingly. For example, constant ν across all i, j with the first mixture will produce exchangeable correlation among the binary outcomes, while choosing ν_{j1,j2} = γ^{|j1 − j2|} with the third mixture will produce autoregressive correlation.
Generating Binary Values using Binary and Continuous Distribution Mixture
1. For a single cluster i, determine the marginal probabilities π_ij.
2. Decide the target correlation structure.
3. Based on the target correlation structure, select ν_jk and the mixture function accordingly.
4. Generate U_ij as Bernoulli, X_ij according to continuous F, and calculate Z_ij and θ_ij.
5. Define each outcome as Y_ij = I(Z_ij ≤ θ_ij).
6. Repeat for additional clusters.
Oman and Zucker (2001) noted that covariates can be incorporated by defining x_ij^T β as the systematic component of a generalized linear model and taking θ_ij = F^{−1}(x_ij^T β), with F^{−1} the associated link function. It is an interesting idea to use the inverse link function from a generalized linear model to help connect predictors to the determination of whether the binary realization will be 0 or 1. However, the method described continues to suffer from restricted ranges of autocorrelation, most notably that the correlations between binary responses must all be positive.
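With F taken to be the standard normal CDF, the exchangeable version of this scheme is a short program. A sketch (the function name, the choice of a normal F, and the vectorization are my own; each Z_ij remains standard normal regardless of the mixing, so the marginal probabilities are exactly the π_ij):

```python
import numpy as np
from statistics import NormalDist

def oman_zucker_exchangeable(pis, n_clusters, nu, rng=None):
    """Exchangeable Oman-Zucker generation with standard-normal F:
    Z_ij = U_ij X_i0 + (1 - U_ij) X_ij, Y_ij = I(Z_ij <= theta_ij),
    theta_ij = F^{-1}(pi_ij), P(U_ij = 1) = sqrt(nu).
    (A sketch; names and vectorization are my own.)"""
    rng = np.random.default_rng(rng)
    m = len(pis)
    theta = np.array([NormalDist().inv_cdf(p) for p in pis])
    u = rng.binomial(1, np.sqrt(nu), size=(n_clusters, m))
    x0 = rng.standard_normal((n_clusters, 1))   # shared within each cluster
    x = rng.standard_normal((n_clusters, m))
    z = u * x0 + (1 - u) * x
    return (z <= theta).astype(int)

y = oman_zucker_exchangeable([0.2, 0.5, 0.8], 50_000, nu=0.3, rng=4)
```

The probabilities vary freely within a cluster, while the shared component X_i0 induces positive within-cluster correlation, illustrating both the flexibility and the positivity restriction noted above.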
3.2 Mixtures Involving Continuous Distributions

Instead of using proper binary distributions, non-normal values can be simulated using linear combinations of standard normal variables to represent the known moment-based properties of the desired data distribution. Many such methods are based on the work of Fleishman (1978), and later extended (Headrick 2002a, 2010, 2011). In general, the idea is to simulate any non-normal distribution as a polynomial mixture of normal variables,

Y = Σ_{i=1}^{m} c_i Z^{(i−1)},

where Y indicates the desired random variable, Z is a standard normal random variable, and the c_i represent coefficients chosen according to the desired distribution of Y. Fleishman (1978) derived a system of nonlinear equations that, given the target distribution mean, variance, skewness, and kurtosis, could be solved for coefficients c_1, c_2, c_3, and c_4 to produce the third-order polynomial approximation to the desired distribution. The intention is to use an expected probability π to estimate those first four moments for the Bernoulli distribution, and approximate accordingly using standard normals. This process has been extended numerous times, including to multivariate data (Vale and Maurelli 1983), to higher-order polynomials (Headrick 2002a), to using percentiles in place of standard moments to reflect instead the median, the inter-decile range, the left-right tail-weight ratio, and the tail-weight factor (Karian and Dudewicz 1999), and to control autocorrelation in multivariate data (Koran et al. 2015).

The Fleishman process and similar methods derived from it suffer from some consistent issues. First, the systems of nonlinear equations have a limited range of solutions for the necessary coefficients, and consequently can only be used to represent limited ranges of values for moments or percentile statistics. Therefore the ranges of probabilities and correlation values in the simulated data will be limited.
Secondly, the resulting mixture random variable is a power function of standard normal variables, which generally will not reflect the true mean-variance relationship necessary for binary data. While the values of mean and variance using the Fleishman method can in some cases reflect those of an appropriate sample, the dynamic relationship between changing mean and variation will not be captured in general. Thus as the mean changes, the variance generally will not show a corresponding change. Finally, the method does not readily account for the effects of predictors in simulating responses. In short, such methods are poorly equipped to handle independent binary data, let alone correlated binary outcomes.
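For reference, the four moments that a Fleishman-type matching procedure must target for a Bernoulli(π) outcome follow directly from π. The formulas below are the standard Bernoulli moments; the helper name is my own:

```python
import math

def bernoulli_moments(pi):
    """Mean, variance, skewness, and excess kurtosis of a Bernoulli(pi)
    variable: the four targets a Fleishman-style fit would need to match.
    (Standard formulas; helper name is my own.)"""
    var = pi * (1 - pi)
    skew = (1 - 2 * pi) / math.sqrt(var)
    ex_kurt = (1 - 6 * var) / var
    return pi, var, skew, ex_kurt
```

The extreme skewness and kurtosis produced by small or large π (both diverge as π approaches 0 or 1) are exactly what pushes the Fleishman system outside its solvable range, consistent with the limitations described above.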
4 Simulation by Dichotomizing Variates
Perhaps the most commonly implemented methods for simulating correlated binary outcomes are those that involve dichotomization of other types of variables. The most frequent choice is to dichotomize normal variables, although defining thresholds for uniform variables is also prevalent.
4.1 Dichotomizing Normal Variables

Many methods of dichotomizing normal variables have been proposed. The method of Emrich and Piedmonte (1991), one of the most popular, controls the probabilities and pairwise correlations of resulting binary variates. Assume it is of interest to simulate binary variables Y_i with associated probabilities π_i and pairwise correlations given by Corr(Y_i, Y_j) = ρ_ij. Begin by solving the following equation for the normal pairwise correlation, δ_ij, using the bivariate normal CDF Φ,

Φ(z(π_i), z(π_j), δ_ij) = ρ_ij √(π_i (1 − π_i) π_j (1 − π_j)) + π_i π_j,

where z(·) indicates the standard normal quantile function. Next generate one N-dimensional multivariate normal variable Z with mean 0 and correlation matrix with components δ_ij. Define the correlated binary realizations using

Y_i = I(Z_i ≤ z(π_i)).

Emrich and Piedmonte (1991) showed that the sequence Y_1, . . . , Y_N has the appropriate desired probabilities π_i and correlation values ρ_ij.
Generating Binary Values by Dichotomizing Normals
1. For a single cluster, determine the marginal probabilities π_i and autocorrelation values ρ_ij.
2. Using the bivariate normal CDF, solve for normal pairwise correlation values δ_ij.
3. Generate one N-dimensional normal variable Z with correlation components given by δ_ij.
4. Define the binary values as Y_i = I(Z_i ≤ z(π_i)), where z(·) is the standard normal quantile function.
5. Repeat for additional clusters.
This method is straightforward and allows probabilities to vary both within and between clusters. A notable disadvantage of this method is the necessity of solving a system of nonlinear equations involving the normal CDF, which increases computational burden with large-dimensional data generation.
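The nonlinear step 2 reduces, for one pair, to a one-dimensional root-finding problem in δ_ij, since the bivariate normal CDF is increasing in its correlation parameter. A self-contained sketch using bisection (the function names, the numerical integration for Φ, and the bisection bounds are my own):

```python
import numpy as np
from statistics import NormalDist

def bvn_cdf(a, b, delta, n=4000):
    """P(Z1 <= a, Z2 <= b) under a standard bivariate normal with
    correlation delta, via 1-D numerical integration (a sketch only)."""
    nd = NormalDist()
    z = np.linspace(-8.0, a, n)
    phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)       # standard normal density
    inner = np.array([nd.cdf(v) for v in (b - delta * z) / np.sqrt(1 - delta ** 2)])
    f = phi * inner
    return float(np.sum(f[1:] + f[:-1]) * (z[1] - z[0]) / 2)  # trapezoid rule

def ep_delta(pi_i, pi_j, rho):
    """Solve the Emrich-Piedmonte equation for the latent normal
    correlation delta_ij by bisection (function names are my own)."""
    nd = NormalDist()
    target = rho * np.sqrt(pi_i * (1 - pi_i) * pi_j * (1 - pi_j)) + pi_i * pi_j
    a, b = nd.inv_cdf(pi_i), nd.inv_cdf(pi_j)
    lo, hi = -0.999, 0.999
    for _ in range(40):
        mid = (lo + hi) / 2
        if bvn_cdf(a, b, mid) < target:   # Phi is increasing in delta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For π_i = π_j = 0.5 the relation Φ(0, 0, δ) = 1/4 + arcsin(δ)/(2π) gives a closed-form check: a target binary correlation of 0.5 corresponds to δ = sin(π/4) ≈ 0.707, illustrating that the latent normal correlation must exceed the desired binary correlation.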
4.2 Iterated Dichotomization

Headrick (2002b) proposed a method of simulating multiple clusters of correlated binary data using JMASM3, an iterative method of dichotomizing two sets of binary variables in two steps. This method is unique in that it allows for autocorrelation both within and between clusters of binary data.

Assume it is of interest to simulate N clusters of binary data with correlation within clusters denoted by ρ_{Y_ij, Y_ik}, and correlation between clusters by ρ_{Y_ij, Y_kl}. Given probabilities π_1, . . . , π_N, correlated binary variables X_1, . . . , X_N can be defined using random uniform variables U_1, . . . , U_N as follows. Define X_1 = I(U_1 < π_1). Define successive X_i as

X_i = X_1, if U_i < π_i;
X_i = X_1 + 1, if U_i > π_i and X_1 = 0;
X_i = 1 − X_1, if U_i > π_i and X_1 = 1.
This generates a cluster of binary values, each correlated with X_1. Next simulate binary values Y_ij, where i indicates the cluster and j indicates individual outcomes, as follows, where the U_ij are independent uniform variables on (0, 1),

Y_ij = X_i, if U_ij < π_ij;
Y_ij = X_i + 1, if U_ij > π_ij and X_i = 0;
Y_ij = 1 − X_i, if U_ij > π_ij and X_i = 1.
Headrick (2002b) shows that the threshold values π_ij can be obtained by solving a nonlinear system in terms of the specified correlation values ρ_{Y_ij, Y_ik} and ρ_{Y_ij, Y_kl}. The order of the nonlinear system corresponds to the number of correlation values specified at the outset. Headrick (2002b) also provides expressions for the marginal probabilities P(Y_ij = 1) in terms of the first-stage probabilities π_i and the second-stage thresholds π_ij; however, it is not clear that these marginal probabilities can be controlled from the start.
Generating Binary Values by Iterated Dichotomization
1. Determine the autocorrelation desired within and between N clusters of binary responses, and select first-stage probabilities π_1, . . . , π_N.
2. Simulate X_1 as I(U_1 < π_1), where U_1 is a random uniform realization on (0, 1).
3. Generate the remaining first-stage binary outcomes X_i according to X_1 and corresponding random uniform variables U_i.
4. Solve for the second-stage thresholds, π_ij, given the desired within and between correlation values, ρ_{Y_ij, Y_ik} and ρ_{Y_ij, Y_kl}, respectively.
5. Generate the second-stage binary outcomes Y_ij according to X_i and corresponding random uniform variables U_ij.
The iterated dichotomization algorithm allows for control of autocorrelation both within and between clusters, and does not require complete specification of the joint probability distribution. However, it does not clearly accommodate predictors or reduce in complexity for common correlation structures; it does not allow easy specification of the marginal binary outcome probabilities; and it requires the solution of a potentially high-dimensional system of nonlinear equations.
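The first-stage construction above (steps 2-3) condenses to a single comparison per variable, since X_i copies X_1 when U_i < π_i and flips it otherwise. A sketch (the function name is my own):

```python
import numpy as np

def first_stage(pis, rng=None):
    """First stage of iterated dichotomization: cluster-level binary
    values X_i, each tied to X_1 through the uniform draws.
    (A sketch; the function name is my own.)"""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=len(pis))
    x1 = int(u[0] < pis[0])
    # X_i equals X_1 when U_i < pi_i, and flips it otherwise; this covers
    # both the X_1 + 1 (X_1 = 0) and 1 - X_1 (X_1 = 1) cases above.
    return [x1] + [x1 if ui < pi else 1 - x1 for ui, pi in zip(u[1:], pis[1:])]
```

The second stage applies the same copy-or-flip rule to each X_i with thresholds π_ij, which is where the nonlinear system for the thresholds enters.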
4.3 Dichotomizing Non-normal Variables

The method of Park et al. (1996) simulates correlated binary values using a dichotomization of counts, and in the process avoids the necessity of solving any system of nonlinear equations. Assume an interest in generating N correlated binary variables Y_1, . . . , Y_N with probabilities π_1, . . . , π_N and associated pairwise correlations ρ_ij. Begin by generating N counts Z_1, . . . , Z_N using a collection of M Poisson random variables X_1(λ_1), . . . , X_M(λ_M), as linear combinations,

Z_1 = Σ_{i∈S_1} X_i(λ_i),
Z_2 = Σ_{i∈S_2} X_i(λ_i),
. . .
Z_N = Σ_{i∈S_N} X_i(λ_i).
Notice that each count Z_i is a combination of a specific set of the Poisson random variables, denoted by S_i. The number of Poisson variables, M, the associated means, λ_i, and the sets used in the sums, S_i, are all determined algorithmically based on the desired probabilities and correlations. Each binary value is then defined by dichotomizing, Y_i = I(Z_i = 0).

Park et al. (1996) describe the determination of M, λ_i, and S_i as follows. The Poisson means λ_i can be constructed as linear combinations of parameters α_ij, 1 ≤ i, j ≤ N. The α_ij can be calculated based on the desired probabilities and pairwise correlations,

α_ij = ln[ 1 + ρ_ij √((1 − π_i)(1 − π_j)/(π_i π_j)) ].

Given all of the α_ij, define λ_k to be the smallest positive α_ij, i, j ≥ k, until the first mean λ_L matches the magnitude of the largest positive α_ij. Then set M = L, let each mean λ_i remain as determined by the α_ij, and define each summation set S_i as those means composed of positive α_ij (Park et al. 1996).
Generating Binary Values by Dichotomizing Linear Poisson Mixtures

1. For a single cluster, determine the individual probabilities π_i and pairwise correlations ρ_ij.
2. Using the probabilities and correlation values, calculate the parameters α_ij.
3. Using the parameters α_ij, determine the number of Poisson variables, M, the Poisson means, λ_i, and the summation sets, S_i, for each count variable Z_i.
4. Calculate Z_1, …, Z_N, and define the binary responses as Y_i = I(Z_i = 0).
5. Repeat for additional clusters.
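The steps above can be sketched for the simplest case, a cluster of two binary variables, where a single shared Poisson component plays the role of the summation sets S_i. This is an illustrative sketch only: the function names are mine, and the full algorithm of Park et al. (1996) determines M, λ_i, and S_i from all of the α_ij rather than assuming N = 2.

```python
import math
import random

def _poisson(lam, rng):
    # Knuth's inversion method; adequate for the small means used here
    if lam == 0.0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def park_bivariate(pi1, pi2, rho, n, rng):
    """Generate n pairs of correlated binary (Y1, Y2) by dichotomizing
    Poisson mixtures: a shared Poisson component induces the correlation,
    and Y_i = I(Z_i = 0)."""
    # alpha_12 = ln(1 + rho * sqrt((1-pi1)(1-pi2)/(pi1*pi2)))
    a12 = math.log(1.0 + rho * math.sqrt((1 - pi1) * (1 - pi2) / (pi1 * pi2)))
    lam1 = -math.log(pi1) - a12   # individual components must be >= 0,
    lam2 = -math.log(pi2) - a12   # which restricts the feasible rho range
    assert lam1 >= 0 and lam2 >= 0, "correlation outside feasible range"
    out = []
    for _ in range(n):
        shared = _poisson(a12, rng)
        z1 = _poisson(lam1, rng) + shared
        z2 = _poisson(lam2, rng) + shared
        out.append((int(z1 == 0), int(z2 == 0)))
    return out
```

Since P(Z_1 = 0) = exp(−λ_1 − α_12) = π_1, the marginal probabilities are exact, and the shared count delivers correlation ρ between the two dichotomized values.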
The method of Park et al. (1996) is computationally efficient and does not require solving systems of nonlinear equations. It allows for varying probabilities and correlation values within and between clusters, and can be adjusted to incorporate covariates into the probabilities. However, there is still a restriction on the range of correlations available through this algorithm.
T.L. Lalonde

5 Conditionally Specified Distributions
Recent attention has been paid to conditionally specifying the distribution of correlated binary variables for the purposes of simulation. While the mixture distributions can be viewed as conditional specifications, in the cases discussed in Sect. 3 the mixtures were defined so that the marginal distributions of the resulting binary variables were completely specified. In this section the discussion focuses on situations without full specification of the marginal outcome distribution. Instead, the distributions are defined using predictor values or prior outcome values.
5.1 The Linear Conditional Probability Model

Qaqish (2003) introduced a method of simulating binary variates using autoregressive-type relationships to simulate autocorrelation. Each outcome value is conditioned on prior outcomes, a relationship referred to as the conditional linear family. The conditional linear family is defined by parameter values that are so-called reproducible in the following algorithm, or those that result in conditional means within the allowable range (0, 1). Suppose the interest is in simulating correlated binary variables Y_i with associated probabilities π_i, and a variance-covariance structure defined for each response by its covariation with all previous responses, s_i = Cov([Y_1, …, Y_{i−1}]^T, Y_i). Qaqish (2003) argued that the expectation of the conditional distribution of any response Y_i, given all previous responses, can be expressed in the form,
E[Y_i | [Y_1, …, Y_{i−1}]^T] = π_i + κ_i^T ([Y_1, …, Y_{i−1}]^T − [π_1, …, π_{i−1}]^T)
                             = π_i + Σ_{j=1}^{i−1} κ_ij (Y_j − π_j),    (6)

where the components of κ_i are selected corresponding to the desired variance-covariance structure according to

κ_i = [Cov([Y_1, …, Y_{i−1}]^T)]^{−1} s_i.
The correlated binary variables are then generated such that Y_1 is Bernoulli with probability π_1, and all subsequent variables are random Bernoulli with probability given by the conditional mean in Eq. 6. It is straightforward to show that such a sequence will have the desired expectation [π_1, …, π_N]^T and autocorrelation defined by the variance-covariance s_i. Qaqish (2003) provides simple expressions for κ_ij to produce exchangeable, auto-regressive, and moving average correlation structures, as follows,
κ_ij = [ρ/(1 + (i − 2)ρ)] (V_ii/V_jj)^{1/2}    (Exchangeable),
λ_i = π_i + ρ (V_ii/V_{i−1,i−1})^{1/2} (y_{i−1} − π_{i−1})    (Autoregressive),
κ_ij = [(β^j − β^{−j})/(β^i − β^{−i})] (V_ii/V_jj)^{1/2}    (Moving Average),

where V_ii represents the diagonal elements of the response variance-covariance matrix, λ_i = E[Y_i | [Y_1, …, Y_{i−1}]^T] represents the conditional expectation, and β = [1 − (1 − 4ρ²)^{1/2}]/(2ρ), with ρ the decaying correlation for autoregressive models and the single time-lag correlation for moving average models.
Generating Binary Values by the Linear Conditional Probability Model
1. For a single cluster, determine the individual probabilities π_i.
2. Select the desired correlation structure, and the corresponding constants κ_i and conditional means λ_i.
3. Generate the first response, Y_1, as a random Bernoulli with success probability π_1.
4. Generate subsequent responses according to the appropriate conditional probability.
5. Repeat for additional clusters.

An interesting property of the method presented by Qaqish (2003) is the nature of including prior binary outcomes. The terms κ_ij (Y_j − π_j) show that a binary response variable is included relative to its expectation and transformed according to a constant related to the desired autocorrelation. While this method does not explicitly include predictors in the simulation algorithm, predictors could be included as part of each expected value π_i. The method clearly allows for both positive and negative values of autocorrelation, unlike many other proposed methods, but restrictions on the values of the autocorrelation remain, as discussed by Qaqish (2003).
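The steps above can be sketched concretely for the autoregressive structure, where κ_i reduces to a single term in the immediately previous response. The function name is mine, and this is a minimal illustration of the conditional linear family, not a full implementation of all the structures above.

```python
import math
import random

def qaqish_ar1(pis, rho, rng):
    """One cluster from the conditional linear family with an AR(1)
    correlation structure.  pis: the marginal probabilities pi_i."""
    v = [p * (1 - p) for p in pis]          # V_ii = pi_i (1 - pi_i)
    y = [1 if rng.random() < pis[0] else 0]  # Y_1 ~ Bernoulli(pi_1)
    for i in range(1, len(pis)):
        # conditional mean: lambda_i = pi_i + rho*(V_ii/V_{i-1,i-1})^{1/2}*(y_{i-1} - pi_{i-1})
        lam = pis[i] + rho * math.sqrt(v[i] / v[i - 1]) * (y[-1] - pis[i - 1])
        # reproducibility: the conditional mean must stay inside (0, 1)
        assert 0.0 < lam < 1.0, "parameters not reproducible"
        y.append(1 if rng.random() < lam else 0)
    return y
```

With constant π_i = π the conditional mean alternates between π ± ρ(1 − π) and π ∓ ρπ, so the sequence has marginal mean π and lag-1 correlation ρ.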
5.2 Non-linear Dynamic Conditional Probability Model

The most general method in this discussion is based on the work of Farrell and Sutradhar (2006), in which a nonlinear version of the linear conditional probability model proposed by Qaqish (2003) is constructed. The model of Farrell and Sutradhar (2006) is conditioned not only on prior binary outcomes in an autoregressive-type sequence, but also on possible predictors to be considered in data generation. This approach allows for the inclusion of covariates in the conditional mean, allows for the probabilities to vary both between and within clusters, and allows for the
greatest range of both positive and negative values of autocorrelation. However, the nonlinear conditional probability approach of Farrell and Sutradhar (2006) does not explicitly provide methods for controlling the probabilities and correlation structure at the outset of data simulation. Assume an interest in simulating correlated binary variables, Y_i, where each outcome is to be associated with a vector of predictors, x_i, through a vector of parameters, β. Farrell and Sutradhar (2006) proposed using the non-linear conditional model,
E[Y_i | Y_1, …, Y_{i−1}, x_i] = exp(x_i^T β + Σ_{k=1}^{i−1} γ_k Y_k) / [1 + exp(x_i^T β + Σ_{k=1}^{i−1} γ_k Y_k)].    (7)
Instead of beginning with desired probabilities and pairwise correlations or an associated correlation structure, the method of Farrell and Sutradhar (2006) focuses on the model relating predictors to the conditional response probability. This allows the greatest flexibility in producing values of correlation at the expense of control of the correlation structure. Farrell and Sutradhar (2006) show that, for a simple auto-regression including only the immediately previous response Y_{i−1}, the marginal expectation and correlation can be calculated based on the nonlinear dynamic model,
μ_i = E[Y_i] = P(Y_i = 1 | Y_{i−1} = 0) + E[Y_{i−1}][P(Y_i = 1 | Y_{i−1} = 1) − P(Y_i = 1 | Y_{i−1} = 0)],

Corr(Y_i, Y_j) = [μ_i(1 − μ_i)/(μ_j(1 − μ_j))]^{1/2} Π_{k∈(i,j]} [P(Y_k = 1 | Y_{k−1} = 1) − P(Y_k = 1 | Y_{k−1} = 0)].

Because the conditional probabilities P(Y_k = 1 | Y_{k−1} = 1) and P(Y_k = 1 | Y_{k−1} = 0) can vary within (0, 1), Farrell and Sutradhar (2006) argue that the marginal correlation can vary unrestricted between −1 and 1.
Generating Binary Values by the Nonlinear Dynamic Conditional Probability Model

1. For a single cluster of correlated binary data, select predictors x_i, coefficients β, and autoregressive coefficients γ_k.
2. Simulate Y_1 as Bernoulli with probability π_1 = exp(x_1^T β)/(1 + exp(x_1^T β)).
3. Simulate subsequent Y_i according to the conditional probability E[Y_i | Y_1, …, Y_{i−1}, x_i].
4. Repeat for additional clusters.
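The steps above can be sketched as follows. The helper names are mine; predictors enter through x_i^T β and prior outcomes through the γ_k coefficients, as in Eq. 7.

```python
import math
import random

def inv_logit(u):
    # maps any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-u))

def farrell_sutradhar(x, beta, gammas, rng):
    """One cluster from the nonlinear dynamic conditional probability
    model: logit of the conditional mean is x_i'beta plus gamma-weighted
    prior outcomes.  x: list of predictor vectors; beta: coefficients;
    gammas: autoregressive coefficients gamma_k."""
    y = []
    for xi in x:
        eta = sum(b * v for b, v in zip(beta, xi))
        eta += sum(g * yk for g, yk in zip(gammas, y))  # gamma_k * Y_k, k < i
        y.append(1 if rng.random() < inv_logit(eta) else 0)
    return y
```

With all γ_k = 0 the outcomes are independent with probabilities inv_logit(x_i^T β); positive γ_k induce positive serial dependence.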
The intuition behind such an approach is that the predictors x_i and also the previous outcome variables Y_1, …, Y_{i−1} are combined linearly but related to the conditional
mean through the inverse logit function, as in Eq. 7. The inverse logit function will map any real values to the range (0, 1), thus avoiding the concern of reproducibility discussed by Qaqish (2003).
6 Software Discussion
Few of the methods discussed are readily available in software. The R packages bindata and BinNor utilize discretizations of normal random variables, but not according to Emrich and Piedmonte (1991). The method of Emrich and Piedmonte (1991) is implemented in the generate.binary() function within the MultiOrd package, and also within the functions of the mvtBinaryEP package. The rbin() function in the SimCorMultRes package explicitly uses threshold values to transform continuous values into binary values (Touloumis 2016). Touloumis (2016) proposed defining the marginal distribution of each binary value according to a CDF applied to the systematic component associated with a model,

P(Y_ij = 1) = F(x_ij^T β),
where F is a cumulative distribution function of a continuous random variable, and x_ij^T β is a linear combination of predictors and parameter values. Each binary outcome is then defined according to Y_ij = I(e_ij ≤ x_ij^T β), where e_ij ∼ F and is independent across clusters. While the method of Touloumis (2016) does clearly accommodate predictors and allow probabilities to vary within clusters, it is not clear what the limitations on the range of autocorrelation are, nor can the autocorrelation be easily controlled, as with other methods. The package binarySimCLF implements the method of Qaqish (2003). Authors have individually implemented the methods of Park et al. (1996) and Fleishman (1978), and publications include code for various software, such as with Lee (1993) and Headrick (2002b). However, there is a clear bias in available software: users prefer dichotomization of normal or uniform random variables to produce binary data. Table 1 provides a brief summary of the simulation methods considered, along with some common advantages and disadvantages of each class of algorithms. Methods requiring the full specification of the joint probability distribution, while allowing complete control of the simulated data properties, tend to be complicated and computationally expensive. Alternatively, the mixture methods tend to have simpler algorithms and computational burden, and generally allow the user to specify correlation structures at the outset. The method of Oman and Zucker (2001) would seem to be the ideal such approach.
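The CDF-threshold construction can be sketched as follows, using a logistic F so that P(Y = 1) is the inverse logit of x^T β. This is a generic illustration of the dichotomization idea, not the actual rbin() implementation, and the function name is mine.

```python
import math
import random

def threshold_binary(xb, rng):
    """Threshold construction: Y = I(e <= x'beta) with e ~ standard
    logistic, so that P(Y = 1) = F(x'beta) = inverse logit of x'beta."""
    u = rng.random()
    while u == 0.0:              # guard against log(0)
        u = rng.random()
    e = math.log(u / (1.0 - u))  # standard logistic draw by inversion
    return 1 if e <= xb else 0
```

Averaging many draws at a fixed linear predictor recovers the intended marginal probability F(x^T β).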
Dichotomizing normal and uniform variables remains the most commonly implemented method for simulating correlated binary outcomes, but most such approaches require computations involving systems of nonlinear equations. The approach of Emrich and Piedmonte (1991) remains prevalent and, accepting the computational burden, is an understandable method that allows easy control of correlation structures. Methods involving conditionally defined distributions are more
Table 1  Advantages and disadvantages of methods of correlated binary outcome simulation

Fully specified joint distribution
  Prominent example: Lee (1993), using the Archimedean copula
  Advantages: Control of probabilities, correlation, and higher-order moments
  Disadvantages: Computationally expensive, nonlinear systems, function/matrix inversions

Mixture distributions
  Prominent example: Oman and Zucker (2001), using a mixture of binary and continuous variables
  Advantages: Simple algorithms, controlled correlation structures
  Disadvantages: Constant probabilities within clusters, no predictors

Dichotomizing variables
  Prominent example: Emrich and Piedmonte (1991), using dichotomized multivariate normals
  Advantages: Short algorithms, probabilities vary within clusters, correlation between clusters
  Disadvantages: Nonlinear systems, computationally expensive

Conditional distributions
  Prominent example: Qaqish (2003), using the linear conditional probability model
  Advantages: Widest range of correlations, controlled correlation structures, predictors
  Disadvantages: Complicated algorithms, requiring conditional means and covariances
recent, and allow for the greatest range of correlations to be simulated. The algorithm of Qaqish (2003) is an ideal example, with few disadvantages other than a slightly limited range of correlation values, but allowing the inclusion of predictors, prior outcomes, and easily specified common correlation structures.
References Bahadur, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items. Stanford Mathematical Studies in the Social Sciences, 6 , 158–168. Devroye, L. (1986). Non-uniform random variate generation (1st ed.). Springer, New York. Emrich, L. J., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician: Statistical Computing, 45 (4), 302–304.
Farrell, P. J., & Sutradhar, B. C. (2006). A non-linear conditional probability model for generating correlated binary data. Statistics & Probability Letters, 76 , 353–361. Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–532. Gange, S. J. (1995). Generating multivariate categorical variates using the iterative proportional fitting algorithm. The American Statistician, 49 (2), 134–138. Genest, C., & MacKay, R. J. (1986a). Copules archimediennes et familles de lois bidimenionnelles dont les marges sont donnees. Canadian Journal of Statistics , 14, 280–283.
Genest, C., & MacKay, R. J. (1986b). The joy of copulas: Bivariate distributions with uniform marginals. The American Statistician, 40, 549–556.
Headrick, T. C. (2002a). Fast fifth-order polynomial transforms for generating univariate and multivariate nonnormal distributions. Computational Statistics & Data Analysis, 40, 685–711.
Headrick, T. C. (2002b). Jmasm3: A method for simulating systems of correlated binary data. Journal of Modern Applied Statistical Methods, 1, 195–201.
Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations (1st ed.). Chapman & Hall/CRC, New York.
Headrick, T. C. (2011). A characterization of power method transformations through l-moments. Journal of Probability and Statistics, 2011.
Kang, S. H., & Jung, S. H. (2001). Generating correlated binary variables with complete specification of the joint distribution. Biometrical Journal, 43(3), 263–269.
Kanter, M. (1975). Autoregression for discrete processes mod 2. Journal of Applied Probability, 12, 371–375.
Karian, Z. A., & Dudewicz, E. J. (1999). Fitting the generalized lambda distribution to data: A method based on percentiles. Communications in Statistics: Simulation and Computation, 28, 793–819.
Koran, J., Headrick, T. C., & Kuo, T. C. (2015). Simulating univariate and multivariate nonnormal distributions through the method of percentiles. Multivariate Behavioral Research, 50, 216–232.
Lee, A. J. (1993). Generating random binary deviates having fixed marginal distributions and specified degrees of association. The American Statistician: Statistical Computing, 47(3), 209–215.
Lee, Y., & Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society, Series B (Methodological), 58(4), 619–678.
Lunn, A. D., & Davies, S. J. (1998). A note on generating correlated binary variables. Biometrika, 85(2), 487–490.
Molenberghs, G., & Verbeke, G. (2006). Models for discrete longitudinal data (1st ed.). Springer.
Oman, S. D., & Zucker, D. M. (2001). Modelling and generating correlated binary variables. Biometrika, 88(1), 287–290.
Park, C. G., Park, T., & Shin, D. W. (1996). A simple method for generating correlated binary variates. The American Statistician, 50(4), 306–310.
Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048.
Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika, 90(2), 455–463.
Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random-effects models for serial observations with binary response. Biometrics, 40, 961–971.
Touloumis, A. (2016). Simulating correlated binary and multinomial responses with SimCorMultRes. The Comprehensive R Archive Network, 1–5.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471.
Zeger, S. L., & Liang, K. Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121–130.
Quantifying the Uncertainty in Optimal Experiment Schemes via Monte-Carlo Simulations
H.K.T. Ng, Y.-J. Lin, T.-R. Tsai, Y.L. Lio and N. Jiang
Abstract In the process of designing life-testing experiments, experimenters always establish the optimal experiment scheme based on a particular parametric lifetime model. In most applications, the true lifetime model is unknown and needs to be specified for the determination of optimal experiment schemes. Misspecification of the lifetime model may lead to a substantial loss of efficiency in the statistical analysis. Moreover, the determination of the optimal experiment scheme always relies on asymptotic statistical theory. Therefore, the optimal experiment scheme may not be optimal for finite sample cases. This chapter aims to provide a general framework to quantify the sensitivity and uncertainty of the optimal experiment scheme due to misspecification of the lifetime model. For the illustration of the methodology developed here, analytical and Monte-Carlo methods are employed to evaluate the robustness of the optimal experiment scheme for the progressive Type-II censored experiment under the location-scale family of distributions.
H.K.T. Ng (B) Department of Statistical Science, Southern Methodist University, Dallas, TX 75275, USA e-mail:
[email protected] Y.-J. Lin Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li District, Taoyuan city 32023, Taiwan e-mail:
[email protected]
T.-R. Tsai Department of Statistics, Tamkang University, Tamsui District, New Taipei City, Taiwan e-mail:
[email protected]
Y.L. Lio · N. Jiang
Department of Mathematical Sciences, University of South Dakota, Vermillion, SD 57069, USA e-mail:
[email protected] N. Jiang e-mail:
[email protected] © Springer Nature Singapore Pte Ltd. 2017
D.-G. Chen and J.D. Chen (eds.), Monte-Carlo Simulation-Based Statistical Modeling, ICSA Book Series in Statistics, DOI 10.1007/978-981-10-3307-0_6
1 Introduction
In designing life-testing experiments for industrial and medical settings, experimenters always assume a parametric model for the lifetime distribution and then determine the optimal experiment scheme by optimizing a specific objective function based on the assumed model. In most applications, the optimal experimental
design is model dependent because the optimality criterion is usually a function of the information measures. Therefore, prior information about the unknown model derived from physical/chemical theory, engineering pre-test results, or past experience with similar experimental units is needed to determine the optimal experiment scheme in practice. However, this prior information may not be accurate and hence the optimal experiment scheme may not perform as well as one expected. In other words, the objective function may not be optimized when the optimal experiment scheme based on inaccurate prior information is used. In addition, to determine the optimal experiment scheme, we rely on asymptotic statistical theory in most cases. For instance, A-optimality is developed to minimize the variances of the estimators of the model parameters, while these variances are always replaced by the asymptotic ones during the process of optimization. There is no guarantee that the optimal experiment scheme obtained based on asymptotic theory can be more efficient than a non-optimal experiment scheme in finite sample situations. For these reasons, it is important to have a systematic procedure to quantify the sensitivity and uncertainty of the optimal experiment scheme due to model misspecification. It will be useful to see whether a proposed experiment scheme is robust to model misspecification. If a design is indeed robust, it would then assure the practitioners that misspecification in the model would not result in an unacceptable change in the precision of the estimates of model parameters. In this chapter, we discuss the analytical and Monte-Carlo methods for quantifying the sensitivity and uncertainty of the optimal experiment scheme and evaluate the robustness of the optimal experiment scheme. Let θ be the parameter vector of the lifetime distribution of test items.
The commonly used procedures for the determination of the optimal experiment scheme are described as follows. A-optimality, which minimizes the trace of the variance-covariance matrix of the maximum likelihood estimators (MLEs) of the elements of θ, provides an overall measure of variability from the marginal variabilities. It is particularly useful when the correlation between the MLEs of the parameters is low, and it is also pertinent for the construction of marginal confidence intervals for the parameters in θ. D-optimality, which minimizes the determinant of the variance-covariance matrix of the MLEs of the components of θ, provides an overall measure of variability by taking into account the correlation between the estimates. It is particularly useful when the estimates are highly correlated, and it is also pertinent for the construction of joint confidence regions for the parameters in θ. V-optimality minimizes the variance of the estimator of a lifetime distribution percentile. In Sect. 2, we present the notation and general methods for quantifying the uncertainty in the optimal experiment scheme with respect to changes in the model.
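For a two-parameter model, the A- and D-objectives can be computed directly from the expected Fisher information matrix, as in the following sketch (the function name is mine; V-optimality would additionally require the variance of the percentile estimator).

```python
def a_and_d_objectives(info):
    """A- and D-optimality objectives for a 2x2 expected Fisher
    information matrix I = [[I11, I12], [I12, I22]]: the trace and the
    determinant of the asymptotic variance-covariance V = I^{-1}."""
    (a, b), (c, d) = info
    det_i = a * d - b * c
    # closed-form inverse of a 2x2 matrix gives the diagonal of V
    v11, v22 = d / det_i, a / det_i
    trace_v = v11 + v22      # A-optimality: minimize tr(V)
    det_v = 1.0 / det_i      # D-optimality: minimize det(V) = 1/det(I)
    return trace_v, det_v
```

For example, the information matrix [[4, 1], [1, 2]] has determinant 7, so the A-objective is (2 + 4)/7 and the D-objective is 1/7.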
Then, in Sect. 3, we focus on progressive Type-II censoring with the location-scale family of distributions. The procedure for determining the optimal scheme with progressive censoring and some commonly used optimality criteria are presented in Sects. 3.1 and 3.2, respectively. In Sect. 3.3, some numerical illustrations via analytical and simulation approaches based on extreme value (Gumbel), logistic, and normal distributions are discussed. Discussions based on the numerical results are provided in Sect. 3.4. A numerical example is presented in Sect. 4. Finally, some concluding remarks are given in Sect. 5.
2 Quantifying the Uncertainty in the Optimal Experiment Scheme
In a life-testing experiment, let the lifetimes of test items follow a family of statistical models M. We are interested in determining the optimal experiment scheme that optimizes the objective function Q(S, M_0), where S denotes an experiment scheme and M_0 denotes the true model. In many situations, the determination of the optimal experiment scheme requires a specification of the unknown statistical model M, and hence the optimal experiment scheme depends on the specified model M. For instance, in the experimental design of multi-level stress testing, Ka et al. (2011) and Chan et al. (2016) considered the extreme value regression model and derived the expected Fisher information matrix. Consequently, the optimal experiment schemes obtained in Ka et al. (2011) and Chan et al. (2016) are specifically for the extreme value regression model, and may not be optimal for other regression models. Here, we denote the optimal experiment scheme based on a specified model M as S*(M). In the ideal situation, the model specified for the optimal experimental scheme is the true model, i.e.,

Q(S*(M_0), M_0) = inf_S Q(S(M_0), M_0) and S*(M_0) = arg inf_S Q(S(M_0), M_0).
On the other hand, in determining the optimal experiment scheme, the experimenter always relies on asymptotic results, which are derived as the sample size goes to infinity. Nevertheless, in practice, the number of experimental units that can be used in an experiment is finite, and thus the use of asymptotic theory may not be appropriate. For instance, for A-optimality, the aim is to minimize the variances of the estimated model parameters. This is always attained through minimizing the trace of the inverse of the Fisher information matrix or, equivalently, the trace of the asymptotic variance-covariance matrix of the MLEs. However, the asymptotic variance-covariance matrix may not correctly reflect the true variations of the estimators when the sample size is finite, and hence the optimal experiment scheme may not be optimal or as efficient as expected in finite sample situations. Therefore, large-scale Monte-Carlo simulations can be used to estimate the objective functions and evaluate the performance of the optimal experiment scheme. For quantifying the sensitivity and uncertainty of the optimal experiment scheme S*(M), we describe two possible approaches, comparing experimental schemes and comparing objective functions, in the following subsections.
2.1 Comparing Experimental Schemes
Let the specified model for obtaining the optimal experiment scheme be M ∗ , then the optimal experiment scheme is
S*(M*) = arg inf_S Q(S(M*), M*).
To quantify the sensitivity of the optimal experiment scheme S*(M*), we consider a different model M and establish the optimal experiment scheme based on M as

S*(M) = arg inf_S Q(S(M), M).
Comparing the experimental schemes S*(M*) and S*(M) will provide some insight into the sensitivity of the optimal experiment scheme S*(M*). If S*(M*) is insensitive to the change in the model M, then S*(M*) and S*(M) will be similar to each other. Depending on the nature of the life-testing experiments, different measures of similarity between two experimental schemes can be considered to quantify the sensitivity of the optimal experiment scheme S*(M*). When applying this approach, evaluation of the optimal experiment scheme for different models is needed.
2.2 Comparing Values of Objective Functions
To quantify the sensitivity of the optimal experiment scheme S*(M*), another approach is to compare the objective function of the optimal experiment scheme S*(M*) under the model M which is believed to be the true model. Specifically, we can compute the objective function when the experiment scheme S*(M*) is adopted but the model is M, i.e., compute Q(S*(M*), M). If the optimal experiment scheme S*(M*) is insensitive to the change in the model M, then Q(S*(M*), M*) will be similar to Q(S*(M*), M) or Q(S*(M), M). When applying this approach, evaluation of the objective function Q(S*(M*), M) is needed.
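The comparison of objective values described above can be organized as in the following sketch (the function and argument names are mine; Q stands for any objective function such as the A- or D-criterion).

```python
def sensitivity_report(Q, s_star_mstar, s_star_m, m_star, m):
    """Compare objective values: Q(S*(M*), M*) is the nominal optimum,
    Q(S*(M*), M) is what the scheme chosen under M* achieves when M is
    the true model, and Q(S*(M), M) is the optimum under M itself.
    A small gap q_mis - q_oracle indicates robustness to specifying M*."""
    q_nominal = Q(s_star_mstar, m_star)
    q_mis = Q(s_star_mstar, m)
    q_oracle = Q(s_star_m, m)
    return q_nominal, q_mis, q_oracle, q_mis - q_oracle
```

For a toy objective Q(s, m) = (s − m)² with S*(m) = m, a scheme optimized for m* = 1 but evaluated under the true m = 2 loses exactly 1 relative to the oracle.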
3 Progressive Censoring with Location-Scale Family of Distributions
In this section, we illustrate the proposed methodology through optimal progressive Type-II censoring schemes (see, for example, Balakrishnan and Aggarwala 2000; Balakrishnan 2007; Balakrishnan and Cramer 2014). We consider that the underlying statistical model, M, used for this purpose is a member of the log-location-scale family of distributions. Specifically, the log-lifetimes of the units on test have a
Table 1  Examples of functional forms of g(·) and G(·)

Extreme value (EV):  g(z) = exp[z − exp(z)],  G(z) = 1 − exp[−exp(z)]
Logistic (LOGIS):  g(z) = exp(−z)/[1 + exp(−z)]²,  G(z) = 1/[1 + exp(−z)]
Normal (NORM):  g(z) = (1/√(2π)) exp(−z²/2),  G(z) = ∫_{−∞}^{z} (1/√(2π)) exp(−x²/2) dx
location-scale distribution with probability density function (p.d.f.)

f_X(x; μ, σ) = (1/σ) g((x − μ)/σ),   (1)

and cumulative distribution function (c.d.f.)

F_X(x; μ, σ) = G((x − μ)/σ),   (2)

where g(·) is the standard form of the p.d.f. f_X(x; μ, σ) and G(·) is the standard form of the c.d.f. F_X(x; μ, σ) when μ = 0 and σ = 1. The functional forms g and G are completely specified and parameter-free, but the location and scale parameters of f_X(x; μ, σ) and F_X(x; μ, σ), −∞ < μ < ∞ and σ > 0, are unknown. Many well-known properties of the location-scale family of distributions have been established in the literature (e.g., Johnson et al. 1994). This is a rich family of distributions that includes the normal, extreme value and logistic models as special cases. The functional forms of g(·) and G(·) for the extreme value, logistic and normal distributions are summarized in Table 1.

A progressively Type-II censored life-testing experiment is described as follows. Let n independent units be placed on a life-test with corresponding lifetimes T_1, T_2, …, T_n that are independent and identically distributed (i.i.d.) with p.d.f. f_T(t; θ) and c.d.f. F_T(t; θ), where θ denotes the vector of unknown parameters. Prior to the experiment, the number of completely observed failures m < n and the censoring scheme (R_1, R_2, …, R_m), where R_j ≥ 0 and Σ_{j=1}^{m} R_j + m = n, are pre-fixed. During the experiment, R_j functioning items are removed (or censored) randomly from the test when the j-th failure is observed. Note that in the analysis of lifetime data, instead of working with the parametric model for T_i, it is often more convenient to work with the equivalent model for the log-lifetimes X_i = log T_i, for i = 1, 2, …, n. The random variables X_i, i = 1, 2, …, n, are i.i.d. with p.d.f. f_X(x; μ, σ) and c.d.f. F_X(x; μ, σ).
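The standard forms g and G for the three members of this family discussed in the text can be sketched as follows. This is an illustrative Python sketch using the textbook standard forms for the extreme value (log-Weibull), logistic and normal distributions; it is not code from the chapter (whose own examples are in R).

```python
import math

# Standard forms g (p.d.f.) and G (c.d.f.), plus the location-scale
# transformations of Eqs. (1) and (2).

def g_ev(z):     # extreme value (smallest extreme value) standard p.d.f.
    return math.exp(z - math.exp(z))

def G_ev(z):     # extreme value standard c.d.f.
    return 1.0 - math.exp(-math.exp(z))

def g_logis(z):  # logistic standard p.d.f.
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

def G_logis(z):  # logistic standard c.d.f.
    return 1.0 / (1.0 + math.exp(-z))

def g_norm(z):   # normal standard p.d.f.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def G_norm(z):   # normal standard c.d.f.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pdf(x, mu, sigma, g):
    """Eq. (1): f_X(x; mu, sigma) = (1/sigma) g((x - mu)/sigma)."""
    return g((x - mu) / sigma) / sigma

def cdf(x, mu, sigma, G):
    """Eq. (2): F_X(x; mu, sigma) = G((x - mu)/sigma)."""
    return G((x - mu) / sigma)
```

For example, the log-lifetimes of Weibull-distributed lifetimes follow the extreme value form above with appropriate μ and σ.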
H.K.T. Ng et al.

3.1 Maximum Likelihood Estimation
Let the m completely observed (ordered) log-lifetimes from a progressively Type-II censored experiment be X_{i:m:n}, i = 1, 2, …, m, and their observed values be x_{i:m:n}, i = 1, 2, …, m. The likelihood function based on x_{i:m:n} (1 ≤ i ≤ m) is

L(μ, σ) = c Π_{i=1}^{m} f_X(x_{i:m:n}; μ, σ) [1 − F_X(x_{i:m:n}; μ, σ)]^{R_i},
x_{1:m:n} < x_{2:m:n} < ··· < x_{m:m:n},   (3)

where c is the normalizing constant given by

c = n(n − R_1 − 1)(n − R_1 − R_2 − 2) ··· (n − R_1 − R_2 − ··· − R_{m−1} − m + 1).
The MLEs of μ and σ are the values of μ and σ which maximize (3). For the location-scale family of distributions described in Eqs. (1) and (2), the log-likelihood function can be expressed as

ℓ(μ, σ) = ln L(μ, σ) = ln c − m ln σ + Σ_{i=1}^{m} ln g((x_{i:m:n} − μ)/σ) + Σ_{i=1}^{m} R_i ln[1 − G((x_{i:m:n} − μ)/σ)].

We denote the MLEs of the parameters μ and σ by μ̂ and σ̂, respectively. Computational algorithms for obtaining the MLEs of the parameters of some commonly used location-scale distributions are available in many statistical software packages such as R (R Core Team 2016), SAS and JMP. The expected Fisher information matrix of the MLEs can be obtained as

I(μ, σ) = ( E[−∂²ℓ(μ,σ)/∂μ∂μ]   E[−∂²ℓ(μ,σ)/∂μ∂σ]
            E[−∂²ℓ(μ,σ)/∂μ∂σ]   E[−∂²ℓ(μ,σ)/∂σ∂σ] )
        = ( I_μμ  I_μσ
            I_μσ  I_σσ ).   (4)

Then, the asymptotic variance-covariance matrix of the MLEs can be obtained by inverting the expected Fisher information matrix in Eq. (4) as

V(μ, σ) = I^{−1}(μ, σ) = ( Var(μ̂)      Cov(μ̂, σ̂)
                           Cov(μ̂, σ̂)   Var(σ̂) )
        = σ² ( V_11  V_12
               V_12  V_22 ).   (5)
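For concreteness, the log-likelihood above can be coded directly for the extreme value model, where ln g(z) = z − e^z and ln[1 − G(z)] = −e^z. The following Python sketch (an illustration under these assumptions, not the chapter's R code) evaluates ℓ(μ, σ) up to the additive constant ln c and locates the MLEs by a crude grid search; production software such as R, SAS or JMP would use Newton-type optimizers instead.

```python
import math

def ev_loglik(mu, sigma, x, R):
    """Log-likelihood of Eq. (3) up to the constant ln c, for the extreme
    value model: ln g(z) = z - exp(z) and ln(1 - G(z)) = -exp(z)."""
    ll = -len(x) * math.log(sigma)
    for xi, ri in zip(x, R):
        z = (xi - mu) / sigma
        ll += (z - math.exp(z)) - ri * math.exp(z)
    return ll

def ev_mle_grid(x, R, steps=80):
    """Crude grid search for the maximizing (mu, sigma); illustrative only.
    The grid ranges are ad hoc choices for this sketch."""
    best = (-math.inf, None, None)
    for i in range(steps + 1):
        mu = min(x) - 1.0 + 3.0 * i / steps
        for j in range(1, steps + 1):
            sigma = 0.05 + 2.0 * j / steps
            ll = ev_loglik(mu, sigma, x, R)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]
```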
Quantifying the Uncertainty in Optimal Experiment Schemes …
The computational formulas for the elements of the Fisher information matrix in (4) can be found in Balakrishnan et al. (2003), Ng et al. (2004) and Dahmen et al. (2012). For fixed values of n and m and a specific progressive censoring scheme (R_1, R_2, …, R_m), we can compute the expected Fisher information matrix and the asymptotic variance-covariance matrix of the MLEs from Eqs. (4) and (5).
3.2 Optimal Criteria
To determine the optimal scheme under progressive Type-II censoring, we consider the following optimality criteria:

[1] D-optimality
For D-optimality, we search for the censoring scheme that maximizes the determinant of the Fisher information matrix, det(I(μ, σ)). For a given censoring scheme S = (R_1, R_2, …, R_m) with a specific model M, the objective function is

Q_D(S, M) = I_μμ I_σσ − I²_μσ.   (6)

We denote the optimal experiment scheme for D-optimality with a specified model M as S*_D(M).

[2] A-optimality
For A-optimality, we aim to minimize the variances of the estimators of the model parameters. This can be achieved by designing an experiment that minimizes the trace of the asymptotic variance-covariance matrix, tr[V(μ, σ)]. For a given experiment scheme S = (R_1, R_2, …, R_m) with a specific model M, the objective function is

Q_A(S, M) = V_11 + V_22.   (7)

We denote the optimal experiment scheme for A-optimality with a specified model M as S*_A(M).

[3] V-optimality
For V-optimality, we aim to minimize the variance of the estimator of the 100δ-th percentile of the log-lifetime distribution, 0 < δ < 1, i.e.,

q̂_δ = μ̂ + σ̂ G^{−1}(δ).

For a given censoring scheme S = (R_1, R_2, …, R_m) with a specific model M, the objective function is

Q_{Vδ}(S, M) = Var(q̂_δ) = Var(μ̂ + σ̂ G^{−1}(δ)) = V_11 + [G^{−1}(δ)]² V_22 + 2 G^{−1}(δ) V_12,   (8)
where G^{−1}(·) is the inverse c.d.f. of the standard location-scale distribution. We denote the optimal experimental scheme for V-optimality with a specified model M as S*_{Vδ}(M).
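Given the entries of the 2×2 expected Fisher information matrix in (4), the three objective functions (6), (7) and (8) are simple algebraic expressions. A minimal Python sketch follows; the variable names are our own, and the σ² factor of Eq. (5) is absorbed into the V_ij here.

```python
def q_D(Imm, Iss, Ims):
    """Eq. (6): determinant of the Fisher information matrix (maximized)."""
    return Imm * Iss - Ims ** 2

def vcov(Imm, Iss, Ims):
    """Eq. (5): invert the symmetric 2x2 Fisher information matrix.
    Returns (Var(mu_hat), Var(sigma_hat), Cov(mu_hat, sigma_hat))."""
    det = q_D(Imm, Iss, Ims)
    return Iss / det, Imm / det, -Ims / det

def q_A(Imm, Iss, Ims):
    """Eq. (7): trace of the asymptotic variance-covariance matrix (minimized)."""
    v11, v22, _ = vcov(Imm, Iss, Ims)
    return v11 + v22

def q_V(Imm, Iss, Ims, Ginv_delta):
    """Eq. (8): asymptotic variance of the 100*delta-th percentile estimator,
    where Ginv_delta = G^{-1}(delta) for the standard distribution."""
    v11, v22, v12 = vcov(Imm, Iss, Ims)
    return v11 + Ginv_delta ** 2 * v22 + 2.0 * Ginv_delta * v12
```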
When the values of n and m are chosen in advance, depending on the availability of units, experimental facilities and cost considerations, we can determine the optimal censoring scheme (R_1, R_2, …, R_m). In the finite-sample situation, we can list all possible censoring schemes, compute the corresponding objective functions, and then determine the respective optimal censoring schemes through an exhaustive search.
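The exhaustive search runs over all censoring schemes (R_1, …, R_m) with R_j ≥ 0 and Σ R_j = n − m, of which there are C(n−1, m−1). A small Python sketch of the enumeration (a standard stars-and-bars construction, not the authors' code):

```python
from itertools import combinations

def censoring_schemes(n, m):
    """Yield every (R1, ..., Rm) with Rj >= 0 and sum Rj = n - m.
    Each choice of m-1 cut positions among n-1 slots encodes one scheme."""
    total = n - m
    for cuts in combinations(range(total + m - 1), m - 1):
        prev, scheme = -1, []
        for c in cuts:
            scheme.append(c - prev - 1)   # gap between consecutive cuts
            prev = c
        scheme.append(total + m - 2 - prev)  # remainder after the last cut
        yield tuple(scheme)
```

For n = 10 and m = 5 this yields C(9, 4) = 126 candidate schemes, so enumeration is feasible for the sample sizes considered in this section.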
3.3 Numerical Illustrations
For illustrative purposes, we consider the true underlying lifetime model of the test units to be Weibull (i.e., the log-lifetimes follow an extreme value distribution, M_0 = EV) and we are interested in investigating the effect of misspecifying the underlying lifetime model as log-logistic (i.e., the log-lifetimes follow a logistic distribution, M* = LOGIS). We also consider the case in which the true underlying lifetime model of the test units is lognormal (i.e., the log-lifetimes follow a normal distribution, M_0 = NOR) and we are interested in investigating the effect of misspecifying the underlying lifetime model as Weibull (M* = EV).

3.3.1 Analytical Approach
In this subsection, we evaluate the sensitivities of the optimal progressive Type-II censoring scheme analytically, based on the expected Fisher information matrix and the asymptotic variance-covariance matrix of the MLEs. For the specified model M*, we determine the optimal progressive Type-II censoring schemes under the different optimality criteria, S*_D(M*), S*_A(M*), S*_{V.95}(M*) and S*_{V.05}(M*), from Eqs. (4) to (8). For the true model M_0, we also determine the optimal progressive Type-II censoring schemes under the different optimality criteria, S*_D(M_0), S*_A(M_0), S*_{V.95}(M_0) and S*_{V.05}(M_0). Then, we can compare the experimental schemes S*(M*) and S*(M_0). In addition, we compute the objective functions based on the optimal censoring scheme under the specified model M* while the true underlying model is M_0, i.e., we compute Q_D(S*_D(M*), M_0), Q_A(S*_A(M*), M_0), and Q_{Vδ}(S*_{Vδ}(M*), M_0) for δ = 0.05 and 0.95. The results for n = 10, m = 5(1)9 with (M* = LOGIS, M_0 = EV) and (M* = EV, M_0 = NOR) are presented in Tables 2 and 3, respectively.
3.3.2 Simulation Approach
In this subsection, we use Monte-Carlo simulation to evaluate the performance of the optimal censoring schemes by comparing the objective functions based on the asymptotic variance-covariance matrix with the simulated values. Moreover, we can also evaluate the sensitivities of the optimal censoring schemes under model misspecification based on Monte-Carlo simulation. In our simulation study, 20,000 sets of progressively Type-II censored data are generated from the true model M_0 = EV and the MLEs of the parameters μ and σ are obtained. The simulated values of the objective functions for the progressive censoring schemes (n = 10, m = 5(1)9) in Tables 2 and 3 are presented in Table 4.
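The data-generation step of such a simulation study can be sketched as follows: simulate n standard extreme value log-lifetimes, record each failure in turn, and remove R_j surviving units at random after the j-th failure. This is a plain Python illustration of progressive Type-II censoring under our assumptions (the chapter's own code is in R):

```python
import random, math

def prog_censored_sample(n, m, R, rng=random):
    """Simulate one progressively Type-II censored sample of log-lifetimes
    from the standard extreme value model, whose inverse c.d.f. is
    G^{-1}(u) = ln(-ln(1 - u)).  After the j-th observed failure, R[j] of
    the surviving units are withdrawn at random."""
    assert len(R) == m and sum(R) + m == n
    alive = [math.log(-math.log(1.0 - rng.random())) for _ in range(n)]
    observed = []
    for j in range(m):
        x = min(alive)            # next observed failure time
        observed.append(x)
        alive.remove(x)
        for _ in range(R[j]):     # randomly censor R[j] surviving units
            alive.pop(rng.randrange(len(alive)))
    return observed
```

Repeating this 20,000 times and computing the MLEs for each replicate gives the simulated variances reported in Table 4.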
3.4 Discussions

3.4.1 Comparing Experimental Schemes
From Tables 2 and 3, we can compare the optimal censoring schemes under the models M* = LOGIS and M_0 = EV. Comparing the optimal censoring schemes of the same optimality criterion under these two models, we can see that the two optimal censoring schemes are different except in the case n = 10, m = 9 for V-optimality with δ = 0.95. In some cases, the two optimal censoring schemes can be very different from each other. For example, when n = 10, m = 7 for A-optimality, the optimal censoring scheme under model M* = LOGIS is (0, 0, 0, 0, 0, 0, 3), while the optimal censoring scheme under model M_0 = EV is (0, 0, 3, 0, 0, 0, 0). These results indicate that when the model is misspecified, the optimal censoring scheme based on the logistic distribution may not be optimal under the true model (i.e., the extreme value distribution).

Nevertheless, the analytical approach proposed here will be useful for practitioners in choosing an appropriate censoring scheme which is robust with respect to model misspecification. For instance, from Table 2 with n = 10, m = 5 for D-optimality, if one believes that the model is M* = LOGIS but also suspects that the model might be M_0 = EV, then it may not be the best option to use the optimal censoring scheme (0, 0, 0, 0, 5) from Table 2 with optimal objective function det(I) = 45.05, because there is a substantial loss in efficiency if the underlying model is M_0 = EV (det(I) = 32.89, compared to the optimal value 54.52 under M_0 = EV in Table 3). In this situation, it may be better to use a non-optimal censoring scheme such as (0, 3, 0, 0, 2), which gives objective function det(I) = 44.64 under M* = LOGIS and det(I) = 47.09 under M_0 = EV.
[Table 2 Optimal progressive censoring schemes and values of the objective functions for the different optimality criteria with M* = LOGIS and M_0 = EV (n = 10, m = 5(1)9); tabulated values omitted]
[Table 3 Optimal progressive censoring schemes and values of the objective functions for the different optimality criteria with M* = EV and M_0 = NOR (n = 10, m = 5(1)9); tabulated values omitted]
[Table 4 Simulated values of the objective functions under M_0 = EV for the progressive censoring schemes (n = 10, m = 5(1)9) presented in Tables 2 and 3; tabulated values omitted]
3.4.2 Comparing Values of Objective Functions
By comparing the values of the objective functions Q(S*(M*), M_0) with Q(S*(M*), M*) and Q(S*(M_0), M_0), we can observe that model misspecification has a more substantial effect for V-optimality and a relatively minor effect for A-optimality. For instance, we compare Q(S*(M*), M_0) and Q(S*(M_0), M_0) by considering n = 10 and m = 5: the optimal censoring scheme for V-optimality with δ = 0.95 under M* = LOGIS is (4, 0, 0, 0, 1) with Q_{V.95}(S*(LOGIS), EV) = 0.3211 (Table 2), while the optimal censoring scheme under M_0 = EV is (0, 0, 0, 0, 5) with Q_{V.95}(S*(EV), EV) = 0.2585, which amounts to a 19.50% loss of efficiency. In contrast, for n = 10 and m = 5, the optimal censoring scheme for A-optimality under M* = LOGIS is (0, 3, 0, 0, 2) with Q_A(S*(LOGIS), EV) = 0.3071 (Table 2), while the optimal censoring scheme under M_0 = EV is (0, 5, 0, 0, 0) with Q_A(S*(EV), EV) = 0.2918 (Table 3), which amounts to a 4.98% loss of efficiency. We have a similar observation when we compare the objective functions Q(S*(M*), M*) and Q(S*(M*), M_0). For example, in Table 3, when n = 10, m = 6, the optimal censoring scheme for V-optimality with δ = 0.95 under M* = EV is (0, 0, 0, 0, 0, 4) with Q_{V.95}(S*(EV), EV) = 0.2546. If the censoring scheme (0, 0, 0, 0, 0, 4) is applied when the true model is normal (M_0 = NOR), the asymptotic variance of the estimator of the 95-th percentile is Var(q̂_0.95) = 0.4633, which is clearly not the minimum variance that can be obtained, because the censoring scheme (4, 0, 0, 0, 0, 0) yields Var(q̂_0.95) = 0.3684. Based on the results of our simulation studies, one should be cautious when the quantity of interest is an extreme percentile (e.g., the 1-st, 5-th, 95-th or 99-th percentile), because the optimal censoring schemes can be sensitive to a change of the model.

Based on the simulation approach, we observe that the optimal censoring schemes determined from the asymptotic theory of the MLEs may not be optimal even when the underlying model is correctly specified. Since Monte-Carlo simulation numerically mimics the real data analysis procedure used in practice, these results show that when the analytical value of the objective function of the optimal censoring scheme is close to the values of the objective functions of other censoring schemes, those non-optimal censoring schemes may well perform better than the optimal censoring scheme. We would suggest that practitioners use Monte-Carlo simulation to compare candidate progressive censoring schemes and choose the best one. However, since the number of possible censoring schemes can be very large when n and m are large, it is not feasible to compare all possible censoring schemes by Monte-Carlo simulation. Therefore, in practice, we can use the analytical approach to identify the optimal censoring scheme and some near-optimal censoring schemes; Monte-Carlo simulation can then be used to choose the best censoring scheme among those candidates. This approach is illustrated in the example presented in the next section.
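The two-stage strategy just described (analytical shortlist, then Monte-Carlo comparison) can be expressed generically. In the Python sketch below, `analytic_obj` and `simulate_obj` are hypothetical user-supplied callables returning, for a given scheme, an objective value to be minimized; they stand in for Eqs. (6) to (8) and their simulated counterparts (for D-optimality, which maximizes det(I), one would pass the negated determinant).

```python
def pick_best_scheme(schemes, analytic_obj, simulate_obj, k=10):
    """Shortlist the k schemes with the smallest analytical objective values,
    then return the shortlisted scheme with the smallest simulated objective.
    Only k schemes are ever simulated, keeping the Monte-Carlo cost bounded."""
    shortlist = sorted(schemes, key=analytic_obj)[:k]
    return min(shortlist, key=simulate_obj)
```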
4 Illustrative Example
Viveros and Balakrishnan (1994) presented a progressively Type-II censored sample based on the breakdown data on insulating fluids tested at 34 kV from Nelson (1982). The progressively censored data have n = 19 and m = 8 with censoring scheme (0, 0, 3, 0, 3, 0, 0, 5). Suppose that we want to re-run the same experiment with n = 19 and m = 8, and we are interested in using the optimal censoring scheme that minimizes the variances of the parameter estimators (i.e., A-optimality) or minimizes the variance of the estimator of the 95-th percentile of the lifetime distribution (i.e., V-optimality with δ = 0.95). We can first identify the top k optimal censoring schemes based on the asymptotic variances and then use Monte-Carlo simulation to evaluate the performance of those censoring schemes.

Since Viveros and Balakrishnan (1994) discussed linear inference under progressive Type-II censoring when the lifetime distribution is Weibull and used the breakdown data on insulating fluids as a numerical example, we assume here the underlying lifetime distribution to be Weibull and determine the top ten censoring schemes subject to A-optimality and to V-optimality with δ = 0.95. To study the effect of model misspecification on the optimal censoring schemes, we compute the objective functions for these censoring schemes when the true underlying lifetime distribution is lognormal. Then, we also use Monte-Carlo simulation to evaluate the performance of these censoring schemes. To reduce the effect of Monte-Carlo simulation errors, we used 100,000 simulations to obtain the simulated variances. These results are presented in Tables 5 and 6 for A-optimality and for V-optimality with δ = 0.95, respectively.

As we illustrated in the previous section, the optimal censoring scheme under a particular model may not be optimal if the model is misspecified. The same observation is obtained in this numerical example. For A-optimality, from Table 5, instead of choosing the optimal censoring scheme (0, 11, 0, 0, 0, 0, 0, 0) based on asymptotic variances, the censoring scheme (1, 9, 1, 0, 0, 0, 0, 0) can be a better option based on the simulation results. For V-optimality with δ = 0.95, from Table 6, instead of choosing the optimal censoring scheme (0, 0, 0, 0, 0, 0, 11, 0) based on asymptotic variances, one may adopt (0, 0, 0, 0, 2, 0, 9, 0) as the censoring scheme because it gives a smaller simulated Var(q̂_0.95), and the performance of this censoring scheme under M_0 = NOR is better than that of (0, 0, 0, 0, 0, 0, 11, 0).
5 Concluding Remarks

In this chapter, we propose analytical and simulation approaches to quantify the uncertainty in optimal experiment schemes systematically. The robustness of the optimal progressive Type-II censoring scheme with respect to changes of the model is studied. We have shown that the optimal censoring schemes are sensitive to misspecification of the model, especially when the V-optimality criterion is under consideration. In practice, we would recommend the use of Monte-Carlo simulation to verify whether the
[Table 5 Top ten censoring schemes (n = 19, m = 8) for A-optimality based on asymptotic variances, with the corresponding asymptotic and simulated values of the objective functions under M* = EV and M_0 = NOR; tabulated values omitted]

[Table 6 Top ten censoring schemes (n = 19, m = 8) for V-optimality with δ = 0.95 based on asymptotic variances, with the corresponding asymptotic and simulated values of the objective functions under M* = EV and M_0 = NOR; tabulated values omitted]
optimal censoring schemes are delivering significantly superior results compared to other censoring schemes.

The current study is limited to progressive Type-II censoring; it will be of interest to apply the proposed approaches to other optimal experiment design problems and to study the effect of model misspecification in those settings. Moreover, the methodologies and illustrations presented in this chapter are mainly focused on misspecification of the underlying statistical model. In the case that the determination of the optimal experiment scheme depends on the specified values of the parameters, the methodologies developed here can be applied as well. For instance, in the experiment design of multi-level stress testing with extreme value regression under censoring (Ka et al. 2011 and Chan et al. 2016), the expected Fisher information matrix depends on the proportions of observed failures, which are functions of the unknown parameters, and consequently the optimal experimental scheme also depends on the unknown parameters. Quantifying the uncertainty of the optimal experimental scheme due to parameter misspecification can proceed in a similar manner as presented in this chapter.

R function to compute the objective functions for a specific progressive Type-II censoring scheme for the extreme value distribution:

##############################################
## Function to compute the objective       ##
## functions for a specific progressive    ##
## Type-II censoring scheme                ##
##############################################
#######################################
# Input values:
#   nn: Sample size
#   mm: Effective sample size
#   ir: Censoring scheme (length = mm)
#######################################
###################################################
# Output values:
#   dfi:  Determinant of Fisher information matrix
#   tvar: Trace of variance-covariance matrix
#   vq95: Variance of the MLE of 95-th percentile
#   vq05: Variance of the MLE of 5-th percentile
###################################################
objpcs