Historically assessing the validity of molecular interaction prediction algorithms
Rajan M. Lukose*, Eytan Adar*, and Mai-San Chan

The complexity of biological knowledge, expressed in a vast scientific literature as well as a flood of data from high-throughput and genome-wide experimental methods, has led some to argue that molecular biology should increase its emphasis on hypothesis-driven research relative to data collection [1-3]. Past examples have shown that good hypotheses can be generated by examining the scientific literature of distant fields of research [4-7]. In addition, certain generalized properties of molecular interaction relationships and networks can be expected to hold in cases where data are not yet available [8-11]. Here, we take advantage of existing data on protein-protein interactions, annotated with dates of discovery, to assess historically the effectiveness of three simple prediction algorithms. We unwind the history of discovery to test how automated hypothesis generation algorithms would have performed had they been applied systematically to what was known at the time. We find that these simple algorithms are effective at automatically identifying good novel hypotheses, suggesting that they may be useful in guiding future research. The historical validation framework we propose can be applied in a wide variety of situations.
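
The historical validation idea described above can be illustrated with a minimal sketch. It assumes interactions are given as (protein, protein, year) records, and it uses a shared-neighbor count as a stand-in predictor. That predictor, the function name `historical_validation`, and the parameters `cutoff_year` and `top_k` are assumptions made for illustration and are not taken from the three algorithms evaluated in this work.

```python
from itertools import combinations


def historical_validation(dated_edges, cutoff_year, top_k):
    """Predict from pre-cutoff data, then score against later discoveries.

    dated_edges: iterable of (protein_a, protein_b, year_of_discovery).
    Returns (predictions, hits, precision).
    """
    # Drop self-interactions so every edge has exactly two endpoints.
    edges = [(a, b, y) for a, b, y in dated_edges if a != b]

    # Split the record into what was known at the cutoff and what came later.
    known = {frozenset((a, b)) for a, b, y in edges if y <= cutoff_year}
    later = {frozenset((a, b)) for a, b, y in edges if y > cutoff_year} - known

    # Build the adjacency map from the known network only.
    neighbors = {}
    for a, b in (tuple(edge) for edge in known):
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Illustrative predictor (an assumption): rank pairs that were not yet
    # known to interact by how many interaction partners they share.
    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if frozenset((a, b)) in known:
            continue
        shared = len(neighbors[a] & neighbors[b])
        if shared > 0:
            scored.append((shared, a, b))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    predictions = [frozenset((a, b)) for _, a, b in scored[:top_k]]

    # A prediction counts as a hit if the pair was reported after the cutoff.
    hits = [p for p in predictions if p in later]
    precision = len(hits) / len(predictions) if predictions else 0.0
    return predictions, hits, precision
```

Repeating the call over a range of cutoff years would trace how such a predictor might have performed at each point in the history of discovery. A pair that never appears in the later record is counted as a miss here, although it may simply not have been tested yet, so the precision should be read as a lower bound.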

*These authors contributed equally to the work. Submitted for publication.