Introduction
Measuring pairwise association is an important operation when searching for meaningful insights within a dataset, because it exposes potentially interesting relationships between the variables of the dataset. In bioinformatics, one typical application is mining gene co-expression relationships from gene expression data, which can be realized by query-based gene expression database search or by gene co-expression network analysis. Pearson's product-moment correlation coefficient, Spearman's rank correlation coefficient, Kendall's rank correlation coefficient, distance correlation and mutual information are widely used correlation/dependence measures. However, all-pairs pairwise correlation computation (PCC) is computationally demanding for a large number of variables, especially when coupled with permutation tests for statistical inference. This motivates accelerating it with high-performance computing.
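To make the cost of all-pairs computation concrete, the following is a minimal single-threaded sketch of Pearson's correlation coefficient applied to every unordered pair of variables. The function names `pearson` and `allPairsPearson` are hypothetical and chosen for illustration only; this is not LightPCC's API or its optimized implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson's r for two equally sized vectors.
// Illustrative helper only, not part of LightPCC.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        meanX += x[k];
        meanY += y[k];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double dx = x[k] - meanX;
        const double dy = y[k] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    // The result is NaN if either vector is constant (zero variance).
    return sxy / std::sqrt(sxx * syy);
}

// Correlate every unordered pair (i, j) with i < j.
// For n variables of length l this performs n(n-1)/2 calls of O(l) work each.
std::vector<double> allPairsPearson(const std::vector<std::vector<double>>& vars) {
    const std::size_t n = vars.size();
    std::vector<double> r;
    r.reserve(n * (n - 1) / 2);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            r.push_back(pearson(vars[i], vars[j]));
        }
    }
    return r;
}
```

With 16000 variables of 5000 elements each, this loop visits 127,992,000 pairs, roughly 6.4 × 10^11 element updates in total, and a permutation test multiplies that by the number of permutations.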
LightPCC is the first parallel and distributed library for pairwise correlation/dependence computation on Intel Xeon Phi clusters. The library is written as C++ template classes and achieves high speed by exploiting SIMD-instruction-level and thread-level parallelism within each Xeon Phi, as well as accelerator-level parallelism across multiple Xeon Phis. To facilitate balanced workload distribution, we have proposed, for the first time, a generic framework for symmetric all-pairs computation that builds provably bijective functions between job identifiers and the coordinate space. As of today, LightPCC implements the following widely used correlation/dependence measures: Pearson's product-moment correlation coefficient, Spearman's rank correlation coefficient, Kendall's tau correlation coefficient, distance correlation and mutual information. We will keep updating it actively!
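The idea of a bijection between job identifiers and coordinates can be sketched as follows. This is one possible mapping, assuming jobs enumerate the upper triangle of the n × n result matrix row by row with the diagonal included; the mapping actually used by LightPCC may differ. All names (`rowOffset`, `pairToJob`, `jobToPair`, `jobRange`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

using i64 = std::int64_t;

// Number of jobs that precede row i: sum of (n - r) for r = 0 .. i-1.
inline i64 rowOffset(i64 i, i64 n) {
    return i * (2 * n - i + 1) / 2;
}

// Coordinate (i, j) with 0 <= i <= j < n  ->  job identifier.
inline i64 pairToJob(i64 i, i64 j, i64 n) {
    assert(0 <= i && i <= j && j < n);
    return rowOffset(i, n) + (j - i);
}

// Job identifier k in [0, n(n+1)/2)  ->  coordinate (i, j).
inline std::pair<i64, i64> jobToPair(i64 k, i64 n) {
    assert(n > 0 && 0 <= k && k < rowOffset(n, n));
    // Row i is the largest i with rowOffset(i, n) <= k, i.e. the floor of
    // the smaller root of i^2 - (2n + 1) i + 2k = 0.
    const double b = 2.0 * static_cast<double>(n) + 1.0;
    const double root = (b - std::sqrt(b * b - 8.0 * static_cast<double>(k))) / 2.0;
    i64 i = static_cast<i64>(root);
    if (i < 0) i = 0;
    if (i > n - 1) i = n - 1;
    // Guard against floating-point rounding in the square root.
    while (i > 0 && rowOffset(i, n) > k) --i;
    while (i + 1 < n && rowOffset(i + 1, n) <= k) ++i;
    return {i, i + (k - rowOffset(i, n))};
}

// Contiguous, near-equal share [begin, end) of job identifiers for one worker.
inline std::pair<i64, i64> jobRange(i64 total, i64 rank, i64 workers) {
    assert(workers > 0 && 0 <= rank && rank < workers);
    return {total * rank / workers, total * (rank + 1) / workers};
}
```

Because every worker can turn its range of job identifiers back into coordinates on its own, each worker receives almost the same number of pairs without any shared job queue. For n = 4 there are 10 jobs, and job 4 maps to coordinate (1, 1).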
Downloads
- Latest release (v1.0.15)
More details about the changes are available at ChangeLog.
- Sample gene expression datasets
The sample gene expression datasets are from Affymetrix whole human genome expression array (GeneChip Human Genome U133 Plus 2.0 Array).
Citation
- Yongchao Liu, Tony Pan, Srinivas Aluru: "Parallel pairwise correlation computation on Intel Xeon Phi clusters". 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2016), 2016, pp. 141-149.
- Yongchao Liu, Tony Pan, Oded Green and Srinivas Aluru: "Parallelized Kendall's tau coefficient computation via SIMD vectorized sorting on many-integrated-core processors". Journal of Parallel and Distributed Computing, 2017, submitted [arXiv]
Other related papers
- Yongchao Liu and Bertil Schmidt: "LightSpMV: faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs". 26th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2015), 2015, pp. 82-89
- Yongchao Liu and Bertil Schmidt: "LightSpMV: faster CUDA-compatible sparse matrix-vector multiplication using compressed sparse rows". Journal of Signal Processing Systems, 2017, doi:10.1007/s11265-016-1216-4.
- Yongchao Liu and Srinivas Aluru: "LightScan: faster scan primitive on CUDA compatible manycore processors". arXiv:1604.04815, 2016.
Parameters
LightPCC currently implements Pearson's correlation coefficient, Spearman's rank correlation coefficient, Kendall's tau correlation coefficient, distance correlation and mutual information. These correlation/dependence measures are implemented as C++ template classes. The library comes in a non-MPI-based version, LightPCC, and an MPI-based one, mpiLightPCC. For version 1.0.14 and higher, the input file format is the same as the one used by ARACNE.
For benchmarking purposes, we have also implemented a subprogram for each correlation measure based on the corresponding template class. These subprograms share the set of parameters listed below.
LightPCC
Usage: LightPCC cmd [options] -m exe_mode
- Command:
  - pearson: Pearson's correlation coefficient
  - spearman: Spearman's rank correlation coefficient
  - kendall: Kendall's tau correlation coefficient
  - distance: Distance correlation
  - miadaptive: Mutual information based on adaptive partitioning
- Options (may vary subject to specific commands):
  - -i <str> (input EXP-formatted file [random data if not given])
  - -d <int> (use double precision, default = 1)
  - -n <int> (number of vectors, default = 0 [random data])
  - -l <int> (vector size, default = 0 [random data])
  - -t <int> (number of CPU threads, default = 0 [0 means auto])
  - -p <int> (number of Xeon Phi threads, default = 0 [0 means auto])
  - -m <int> (execution mode, default = -1 [-1 means invalid])
    - 0: single-threaded on the CPU
    - 1: multi-threaded on the CPU
    - 2: single Xeon Phi
  - -x <int> (Xeon Phi index [single Xeon Phi mode], default = 0)
  - -h (print out options)
mpiLightPCC
Usage: mpiLightPCC cmd [options] -m exe_mode
- Command:
  - pearson: Pearson's correlation coefficient
  - spearman: Spearman's rank correlation coefficient
  - kendall: Kendall's tau correlation coefficient
  - distance: Distance correlation
  - miadaptive: Mutual information based on adaptive partitioning
- Options (may vary subject to specific commands):
  - -i <str> (input EXP-formatted file [random data if not given])
  - -d <int> (use double precision, default = 1)
  - -n <int> (number of vectors, default = 0 [random data])
  - -l <int> (vector size, default = 0 [random data])
  - -t <int> (number of CPU threads, default = 0 [0 means auto])
  - -p <int> (number of Xeon Phi threads, default = 0 [0 means auto])
  - -m <int> (execution mode, default = -1 [-1 means invalid])
    - 3: MPI for CPU clusters
    - 4: MPI for Xeon Phi clusters
  - -x <int> (Xeon Phi index [single Xeon Phi mode], default = 4)
  - -h (print out options)
Installation and Usage
Prerequisites
- Intel C/C++ compiler or any other C/C++ compiler that supports Xeon Phi coprocessors.
- A C/C++ MPI library (e.g. OpenMPI, MPICH, Intel MPI) that is compiled by the aforementioned C/C++ compiler.
Input File Format
Starting from version 1.0.14, the input file format is the same as the one used by ARACNE.
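As a rough illustration of reading such a file, the sketch below assumes the layout commonly described for ARACNE expression files: a tab-delimited text file with one header line, followed by one line per variable holding an identifier, an annotation field and then one value per sample. This layout is an assumption and should be checked against the ARACNE documentation and the sample datasets. The `ExpMatrix` type and the `readExp` function are hypothetical and are not part of LightPCC.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container: one identifier and one row of values per variable.
struct ExpMatrix {
    std::vector<std::string> ids;
    std::vector<std::vector<double>> rows;
};

// Parse a tab-delimited expression matrix under the ASSUMED layout:
//   header line, then: <id> TAB <annotation> TAB <v1> TAB <v2> ...
ExpMatrix readExp(std::istream& in) {
    ExpMatrix m;
    std::string line;
    if (!std::getline(in, line)) return m;  // empty input: nothing to read
    // The header line (column names) is skipped.
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string id, annotation, cell;
        if (!std::getline(fields, id, '\t') ||
            !std::getline(fields, annotation, '\t')) {
            throw std::runtime_error("expected identifier and annotation columns");
        }
        std::vector<double> values;
        while (std::getline(fields, cell, '\t')) {
            values.push_back(std::stod(cell));  // throws on non-numeric cells
        }
        if (!m.rows.empty() && values.size() != m.rows.front().size()) {
            throw std::runtime_error("rows have different numbers of samples");
        }
        m.ids.push_back(id);
        m.rows.push_back(values);
    }
    assert(m.ids.size() == m.rows.size());
    return m;
}
```

In this sketch, each row corresponds to one vector variable (the count given by -n for random data) and the number of values per row corresponds to the vector size (-l).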
Download and Compile
Before compiling, please modify the corresponding Makefile to point to the correct compilers and libraries.
- If the subdirectory "apps" exists, please enter the subdirectory "apps/lightpcc" and compile there.
- Otherwise, type the command "make" to compile both the non-MPI-based version (named LightPCC) and the MPI-based one (named mpiLightPCC).
Typical Usage
- Pearson's correlation coefficient: "pearson" command
  - LightPCC pearson -n 16000 -l 5000 -m 2
    Randomly generate 16000 vector variables of 5000 elements each and use a single Xeon Phi.
  - mpirun -np 2 mpiLightPCC pearson -n 16000 -l 5000 -m 4
    Randomly generate 16000 vector variables of 5000 elements each and use two Xeon Phis in a cluster.
- For the other measures, use the corresponding command (spearman, kendall, distance or miadaptive) in the same way as for Pearson's correlation coefficient.
Important Notices
- For the Spearman and Kendall rank correlation coefficients, our implementation sorts each vector variable X and then computes the rank of each element in X. This means that users do not need to sort and rank the elements of each variable with third-party software beforehand.
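The sort-then-rank step can be sketched as follows. This is a minimal scalar version that assumes tied values receive the average of the positions they occupy, and that the input contains no NaN values. How LightPCC itself treats ties is not stated here, and the function name `averageRanks` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the 1-based rank of every element of x.
// ASSUMPTION: ties share the average of the positions they span.
std::vector<double> averageRanks(const std::vector<double>& x) {
    const std::size_t n = x.size();
    // Sort indices instead of values so ranks can be written back in place.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    std::vector<double> rank(n);
    std::size_t start = 0;
    while (start < n) {
        // Find the run [start, end) of equal values in sorted order.
        std::size_t end = start + 1;
        while (end < n && x[order[end]] == x[order[start]]) ++end;
        // 1-based positions start+1 .. end; their mean is (start + 1 + end) / 2.
        const double avg = 0.5 * static_cast<double>(start + 1 + end);
        for (std::size_t k = start; k < end; ++k) rank[order[k]] = avg;
        start = end;
    }
    assert(rank.size() == x.size());
    return rank;
}
```

For the input {10, 30, 20, 20} this yields the ranks {1, 4, 2.5, 2.5}. Spearman's coefficient is then Pearson's coefficient computed on the ranks of the two variables.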
Change Log
- Feb. 28, 2017 (v1.0.15)
- We further optimized the SIMD vectorized code for the Kendall's tau correlation coefficient.
- Feb. 18, 2017 (v1.0.14)
- We significantly improved the speed of Kendall's tau correlation coefficient.
- We implemented mutual information based on adaptive partitioning.
- Starting from this version, we switched to an input file format that is the same as the one used by ARACNE.
- We released some human whole-genome gene expression datasets in tab-based matrix (plain text) and ARACNE formats.
- June 20, 2016 (v1.0.9)
- We added a License file to the source code tarball. This software is distributed under the Apache License, Version 2.0.
- A readme file is also added to the code tarball.
- June 10, 2016 (v1.0.9)
- First release of LightPCC v1.0.9.
Contact
If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.