The effects of applying cell-suppression and perturbation to aggregated genetic data
Constantinou, Andreas I.
Kyriacou, Kyriacos C.
Loizidou, Maria A.
Demetriou, Christiana A.
Pattichis, Constantinos S.
Source: IEEE 12th International Conference on BioInformatics and BioEngineering, BIBE 2012
The key test for confidence in any association discovered within the medical domain is replication, that is, the ability of the association to be detected in independent populations. At the same time, increasing the likelihood of discovering statistically significant associations requires increasing the statistical power of a study. A key way to increase statistical power is to include as many subjects as possible who match the study's inclusion criteria. Many have therefore attempted to merge data from multiple independent sources, sites, or studies that share the same inclusion criteria, creating a much larger study with significantly more statistical power. For these approaches to work, however, data from multiple sites must be made available to a single analysis. This practice is significantly limited by the need to respect legal and ethical requirements that are often complicated, ambiguous, and inconsistent across countries. The common approach to merging data is to share aggregated data rather than subjects' personal data. Aggregated data may nevertheless be reverse engineered in some cases. Traditionally, therefore, cells with small values within the aggregated data are suppressed, and some or all of the aggregated data are perturbed to add noise that inhibits attempts to identify personal information about a specific person or subgroup in the original data. In this paper we study the effects of cell-suppression and perturbation on the results of the data analysis. Each approach is examined on its own and in combination, using the typical settings documented in the literature. The tests are based on a real dataset used to look for associations between phenotypes and genetic markers.
This work is part of the Linked2Safety project that aims to dynamically interconnect distributed patients' data to better enable medical research efforts, whilst respecting patients' anonymity, as well as European and national legislation. © 2012 IEEE.
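
As an illustration of the two techniques named in the abstract, the following is a minimal sketch and not the paper's implementation. The suppression threshold of 5, the uniform integer noise in [-2, 2], the genotype counts, and all function names are assumptions chosen for the example; the abstract does not state the settings or data actually used. The sketch suppresses small cells in an aggregated case/control table, perturbs the counts, and compares the Pearson chi-square statistic before and after perturbation.

```python
import random


def suppress_small_cells(table, threshold=5):
    """Replace every count below `threshold` with None (suppressed).

    The threshold of 5 is an assumed setting, not taken from the paper.
    """
    return [[c if c >= threshold else None for c in row] for row in table]


def perturb(table, max_noise=2, seed=0):
    """Add integer noise drawn uniformly from [-max_noise, max_noise].

    Counts are clamped at zero; suppressed (None) cells stay suppressed.
    The noise model is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    return [
        [None if c is None else max(0, c + rng.randint(-max_noise, max_noise))
         for c in row]
        for row in table
    ]


def chi_square(table):
    """Pearson chi-square statistic for a complete table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical aggregated counts: rows are cases and controls,
# columns are the genotypes AA, Aa, aa.
counts = [[120, 60, 3], [140, 45, 9]]

suppressed = suppress_small_cells(counts)
noisy = perturb(counts)

print("suppressed:", suppressed)
print("chi-square, original: ", chi_square(counts))
print("chi-square, perturbed:", chi_square(noisy))
```

The difference between the two printed statistics gives a rough sense of how added noise can shift an association test. A suppressed table cannot be passed to `chi_square` directly, since the missing cells would first have to be imputed or the affected categories merged.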