Title: | Adjacency-Constrained Clustering of a Block-Diagonal Similarity Matrix |
---|---|
Description: | Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated to a position, and only adjacent clusters can be merged. Typical application fields in bioinformatics include Genome-Wide Association Studies or Hi-C data analysis, where the similarity between items is a decreasing function of their genomic distance. Taking advantage of this feature, the implemented algorithm is time and memory efficient. This algorithm is described in Ambroise et al (2019) <doi:10.1186/s13015-019-0157-4>. |
Authors: | Christophe Ambroise [aut], Shubham Chaturvedi [aut], Alia Dehman [aut], Pierre Neuvial [aut, cre] , Guillem Rigaill [aut], Nathalie Vialaneix [aut] , Gabriel Hoffman [aut] |
Maintainer: | Pierre Neuvial <[email protected]> |
License: | GPL-3 |
Version: | 0.6.10 |
Built: | 2024-11-06 05:44:36 UTC |
Source: | https://github.com/pneuvial/adjclust |
Adjacency-constrained hierarchical agglomerative clustering
adjClust( mat, type = c("similarity", "dissimilarity"), h = ncol(mat) - 1, strictCheck = TRUE, nthreads = 1L )
adjClust( mat, type = c("similarity", "dissimilarity"), h = ncol(mat) - 1, strictCheck = TRUE, nthreads = 1L )
mat |
A similarity matrix or a dist object. Most sparse formats from
|
type |
Type of matrix : similarity or dissimilarity. Defaults to
|
h |
band width. It is assumed that the similarity between two items is 0
when these items are at a distance of more than band width h. Default value
is |
strictCheck |
Logical (default to |
nthreads |
Integer (default to |
Adjacency-constrained hierarchical agglomerative clustering (HAC) is HAC in which each observation is associated to a position, and the clustering is constrained so as only adjacent clusters are merged. These methods are useful in various application fields, including ecology (Quaternary data) and bioinformatics (e.g., in Genome-Wide Association Studies (GWAS)).
This function is a fast implementation of the method that takes advantage of
sparse similarity matrices (i.e., that have 0 entries outside of a diagonal
band of width h
). The method is fully described in (Dehman, 2015) and
based on a kernel version of the algorithm. The different options for the
implementation are available in the package vignette entitled
"Notes on CHAC implementation in adjclust".
An object of class chac
which describes the tree
produced by the clustering process. The object is a list with the same
elements as an object of class hclust
(merge
,
height
, order
, labels
, call
, method
,
dist.method
), and two extra elements:
mat |
: (the data on which the clustering has been performed, possibly after the pre-transformations described in the vignette entitled "Notes on CHAC implementation in adjclust" |
.
correction |
: the value of the correction for non positive
definite similarity matrices (also described in the same vignette). If
|
When performed on a distance matrix with the option
type = "dissimilarity"
, adjclust
is identical to using the
option "ward.D"
on in the function
hclust
when the ordering of the (unconstrained)
clustering (in hclust
) is compatible with the natural
ordering of objects used as a constraint. It is also equivalent (under the
same assumption or orderings) to the option "ward.D2"
performed on the
distance matrix itself, except for the final heights of the merges
that are equal to the square of the heights obtained with
"ward.D2"
in
hclust
. See the
vignette on implementation
and (Murtagh and Legendre, 2014) for further details.
Murtagh F., and Legendre P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274-295. DOI: doi:10.1007/s00357-014-9161-z.
Dehman A. (2015). Spatial Clustering of Linkage Disequilibrium Blocks for Genome-Wide Association Studies, PhD thesis, Universite Paris Saclay, France.
Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. Algorithms for Molecular Biology, 14(22). DOI: doi:10.1007/s11222-018-9806-6.
Randriamihamison N., Vialaneix N., and Neuvial P. (2020). Applicability and interpretability of Ward's hierarchical agglomerative clustering with or without contiguity constraints. Journal of Classification, 38, 1-27. DOI: doi:10.1007/s00357-020-09377-y.
snpClust
to cluster SNPs based on linkage
disequilibrium
hicClust
to cluster Hi-C data
sim <- matrix( c(1.0, 0.1, 0.2, 0.3, 0.1, 1.0 ,0.4 ,0.5, 0.2, 0.4, 1.0, 0.6, 0.3, 0.5, 0.6, 1.0), nrow = 4) ## similarity, full width fit1 <- adjClust(sim, "similarity") plot(fit1) ## similarity, h < p-1 fit2 <- adjClust(sim, "similarity", h = 2) plot(fit2) ## dissimilarity dist <- as.dist(sqrt(2-(2*sim))) ## dissimilarity, full width fit3 <- adjClust(dist, "dissimilarity") plot(fit3) ## dissimilarity, h < p-1 fit4 <- adjClust(dist, "dissimilarity", h = 2) plot(fit4)
sim <- matrix( c(1.0, 0.1, 0.2, 0.3, 0.1, 1.0 ,0.4 ,0.5, 0.2, 0.4, 1.0, 0.6, 0.3, 0.5, 0.6, 1.0), nrow = 4) ## similarity, full width fit1 <- adjClust(sim, "similarity") plot(fit1) ## similarity, h < p-1 fit2 <- adjClust(sim, "similarity", h = 2) plot(fit2) ## dissimilarity dist <- as.dist(sqrt(2-(2*sim))) ## dissimilarity, full width fit3 <- adjClust(dist, "dissimilarity") plot(fit3) ## dissimilarity, h < p-1 fit4 <- adjClust(dist, "dissimilarity", h = 2) plot(fit4)
S3 class for Constrained Hierarchical Agglomerative Clustering results
## S3 method for class 'chac' as.hclust(x, ...) ## S3 method for class 'chac' print(x, ...) ## S3 method for class 'chac' head(x, ...) ## S3 method for class 'chac' summary(object, ...) ## S3 method for class 'chac' plot( x, y, ..., mode = c("standard", "corrected", "total-disp", "within-disp", "average-disp"), nodeLabel = FALSE ) diagnose(x, graph = TRUE, verbose = TRUE) correct(x) cutree_chac(tree, k = NULL, h = NULL)
## S3 method for class 'chac' as.hclust(x, ...) ## S3 method for class 'chac' print(x, ...) ## S3 method for class 'chac' head(x, ...) ## S3 method for class 'chac' summary(object, ...) ## S3 method for class 'chac' plot( x, y, ..., mode = c("standard", "corrected", "total-disp", "within-disp", "average-disp"), nodeLabel = FALSE ) diagnose(x, graph = TRUE, verbose = TRUE) correct(x) cutree_chac(tree, k = NULL, h = NULL)
x , object , tree
|
an object of class 'chac' |
... |
for |
y |
not used |
mode |
type of dendrogram to plot (see Details). Default to
|
nodeLabel |
(logical) whether the order of merging has to be displayed
or not. |
graph |
(logical) whether the diagnostic plot has to be displayed or
not. Default to |
verbose |
(logical) whether to print a summary of the result or not.
Default to |
k |
an integer scalar or vector with the desired number of groups |
h |
numeric scalar or vector with heights where the tree should be cut. Only available when the heights are increasing |
Methods for class 'chac'
When plot.chac
is called with
mode = "standard"
, the standard dendrogram is plotted, even though,
due to contingency constrains, some branches are reversed (decreasing
merges). When plot.chac
is called with
mode = "corrected"
, a correction is applied to original heights so as
to have only non decreasing merges). It does not change the result of the
clustering, only the look of the dendrogram for easier interpretation.
Other modes are provided that correspond to different alternatives
described in Grimm (1987):
in mode = "within-disp"
, heights correspond to within-cluster
dispersion, i.e., for a corresponding cluster, its height is
where is the dissimilarity
used to cluster objects and
is the center of gravity of cluster
. In this case, heights are always non decreasing;
in mode = "total-disp"
, heights correspond to the total
within-cluster dispersion. It is obtained from mode = "standard"
by
the cumulative sum of its heights. In this case, heights are always
non decreasing;
in mode = "average-disp"
, heights correspond to the
within-cluster dispersion divided by the cluster size. In this case, there
is no guaranty that the heights are non decreasing. When reversals are
detected, a warning is printed to advice the user to change the mode of the
representation.
Grimm (1987) indicates that heights as provided by
mode = "within-disp"
are highly dependent on cluster sizes and that
the most advisable representation is the one provided by
mode = "total-disp"
. Further details are provided in the vignette
"Notes on CHAC implementation in adjclust".
The function plot.chac
displays the dendrogram and
additionally invisibly returns an object of class
dendrogram
with heights as specified by the user through
the option mode
.
diagnose
invisibly exports a data frame with the
numbers of decreasing merges described by the labels of the clusters being
merged at this step and at the previous one, as well as the corresponding
merge heights.
The function correct
returns a chac
objects with
modified heights so as they are increasing. The new heights are calculated in
an way identical to the option mode = "corrected"
of the function
plot.chac
(see Details). In addition, the chac
object has its
field method
modified from adjClust
to
adjClust-modified
.
The function cutree_chac
returns the clustering with
k
groups or with the groups obtained by cutting the tree at height
h
. If the heights are not increasing, the cutting of the tree is based
on the corrected heights as provided by the function correct
.
Grimm, E.C. (1987) CONISS: a fortran 77 program for stratigraphically constrained analysis by the method of incremental sum of squares. Computer & Geosciences, 13(1), 13-35.
Adjacency-constrained hierarchical agglomerative clustering of Hi-C contact maps
hicClust(x, h = NULL, log = FALSE, ...)
hicClust(x, h = NULL, log = FALSE, ...)
x |
either: 1. A pxp contact sparse or dense matrix (classes matrix, Matrix, dscMatrix, dgTMatrix, dgCMatrix, dgeMatrix). Its entries are the number of counts of physical interactions observed between all pairs of loci. 2. An object of class HiTC::HTCexp. The corresponding Hi-C data is stored as a Matrix::dsCMatrix object in the intdata slot. 3. A text file path with one line per pair of loci for which an interaction has been observed (in the format: locus1<tab>locus2<tab>signal) or a matrix or data frame with similar data (3 columns). |
h |
band width. If not provided, |
log |
logical. Whether to log-transform the count data. Default to
|
... |
further arguments to be passed to |
Adjacency-constrained hierarchical agglomerative clustering (HAC) is HAC in which each observation is associated to a position, and the clustering is constrained so as only adjacent clusters are merged. Genomic regions (loci) are clustered according to information provided by high-throughput conformation capture data (Hi-C).
An object of class chac
.
Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics, Algorithms for Molecular Biology 14(22)"
Servant N. et al (2012). HiTC : Exploration of High-Throughput 'C' experiments. Bioinformatics.
# input as HiTC::HTCexp object ## Not run: if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res1 <- hicClust(hic_imr90_40_XX) } ## End(Not run) # input as Matrix::dsCMatrix contact map ## Not run: mat <- HiTC::intdata(hic_imr90_40_XX) res2 <- hicClust(mat) ## End(Not run) # input as text file res3 <- hicClust(system.file("extdata", "sample.txt", package = "adjclust"))
# input as HiTC::HTCexp object ## Not run: if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res1 <- hicClust(hic_imr90_40_XX) } ## End(Not run) # input as Matrix::dsCMatrix contact map ## Not run: mat <- HiTC::intdata(hic_imr90_40_XX) res2 <- hicClust(mat) ## End(Not run) # input as text file res3 <- hicClust(system.file("extdata", "sample.txt", package = "adjclust"))
Heatmap of the (dis)similarity matrix
plotSim( mat, type = c("similarity", "dissimilarity"), clustering = NULL, dendro = NULL, k = NULL, log = TRUE, legendName = "intensity", main = NULL, priorCount = 0.5, stats = c("R.squared", "D.prime"), h = NULL, axis = FALSE, naxis = min(10, nrow(mat)), axistext = NULL, xlab = "objects", cluster_col = "darkred", mode = c("standard", "corrected", "total-disp", "within-disp", "average-disp") )
plotSim( mat, type = c("similarity", "dissimilarity"), clustering = NULL, dendro = NULL, k = NULL, log = TRUE, legendName = "intensity", main = NULL, priorCount = 0.5, stats = c("R.squared", "D.prime"), h = NULL, axis = FALSE, naxis = min(10, nrow(mat)), axistext = NULL, xlab = "objects", cluster_col = "darkred", mode = c("standard", "corrected", "total-disp", "within-disp", "average-disp") )
mat |
matrix to plot. It can be of class |
type |
input matrix type. Can be either |
clustering |
vector of clusters to display on the matrix (if not
|
dendro |
dendrogram provided as an |
k |
number of clusters to display. Used only when |
log |
logical. Should the breaks be based on log-scaled values of the
matrix entries. Default to |
legendName |
character. Title of the legend. Default to
|
main |
character. Title of the plot. Default to |
priorCount |
numeric. Average count to be added to each entry of the
matrix to avoid taking log of zero. Used only if |
stats |
input SNP correlation type. Used when |
h |
positive integer. Threshold distance for SNP correlation
computation. Used when |
axis |
logical. Should x-axis be displayed on the plot? Default to
|
naxis |
integer. If |
axistext |
character vector. If |
xlab |
character. If |
cluster_col |
colour for the cluster line if |
mode |
type of dendrogram to plot (see |
This function produces a heatmap for the used (dis)similarity matrix that can be used as a diagnostic plot to check the consistency between the obtained clustering and the original (dis)similarity
## Not run: clustering <- rep(1:3, each = 50) dist_data <- as.matrix(dist(iris[, 1:4])) dendro_iris <- adjClust(dist_data, type = "dissimilarity") plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, axis = TRUE) plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, clustering = clustering) plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, axis = TRUE, k = 3) plotSim(dist_data, type = "dissimilarity", legendName = "IF", axis = TRUE, clustering = clustering) p <- plotSim(dist(iris[, 1:4]), type = "dissimilarity", log = FALSE, clustering = clustering, cluster_col = "blue") # custom palette p + scale_fill_gradient(low = "yellow", high = "red") # dsCMatrix m <- Matrix(c(0, 0, 2, 0, 3, 0, 2, 0, 0), ncol = 3) res <- adjClust(m) plotSim(m, axis = TRUE) plotSim(m, dendro = res) # dgCMatrix m <- as(m, "generalMatrix") plotSim(m) m <- as.dist(m) if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res <- hicClust(hic_imr90_40_XX, log = TRUE) plotSim(hic_imr90_40_XX, axis = TRUE) } if (requireNamespace("snpStats", quietly = TRUE)) { data(testdata, package = "snpStats") plotSim(Autosomes[1:200, 1:5], h = 3, stats = "R.squared", axis = TRUE, axistext = c("A", "B", "C", "D", "E")) } ## End(Not run)
## Not run: clustering <- rep(1:3, each = 50) dist_data <- as.matrix(dist(iris[, 1:4])) dendro_iris <- adjClust(dist_data, type = "dissimilarity") plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, axis = TRUE) plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, clustering = clustering) plotSim(dist_data, type = "dissimilarity", dendro = dendro_iris, axis = TRUE, k = 3) plotSim(dist_data, type = "dissimilarity", legendName = "IF", axis = TRUE, clustering = clustering) p <- plotSim(dist(iris[, 1:4]), type = "dissimilarity", log = FALSE, clustering = clustering, cluster_col = "blue") # custom palette p + scale_fill_gradient(low = "yellow", high = "red") # dsCMatrix m <- Matrix(c(0, 0, 2, 0, 3, 0, 2, 0, 0), ncol = 3) res <- adjClust(m) plotSim(m, axis = TRUE) plotSim(m, dendro = res) # dgCMatrix m <- as(m, "generalMatrix") plotSim(m) m <- as.dist(m) if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res <- hicClust(hic_imr90_40_XX, log = TRUE) plotSim(hic_imr90_40_XX, axis = TRUE) } if (requireNamespace("snpStats", quietly = TRUE)) { data(testdata, package = "snpStats") plotSim(Autosomes[1:200, 1:5], h = 3, stats = "R.squared", axis = TRUE, axistext = c("A", "B", "C", "D", "E")) } ## End(Not run)
Clustering selection from a chac object with the slope heuristic or the broken stick heuristic
select( x, type = c("capushe", "bstick"), k.max = NULL, graph = FALSE, pct = 0.15 )
select( x, type = c("capushe", "bstick"), k.max = NULL, graph = FALSE, pct = 0.15 )
x |
an object of class 'chac' |
type |
model selection approach between slope heuristic
( |
k.max |
maximum number of clusters that can be selected. Default to
|
graph |
logical. Whether the diagnostic plot for the capushe selection
is displayed or not. Default to |
pct |
minimum percentage of points for the plateau selection in
capushe selection. See |
The function returns the clustering selected by the slope heuristic,
as implemented in the R package capushe
.
Baudry J.P., Maugis C., and Michel B. (2012). Slope heuristics: overview and implementation. Statistics and Computing, 22(2), 355-470. MacArthur, R.H. (1957) On the relative abundance of bird species. Proceedings of the National Academy of Sciences, 43, 293-295.
## Not run: if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res <- hicClust(hic_imr90_40_XX, log = TRUE) selected.capushe <- select(res) table(selected.capushe) selected.bs <- select(res, type = "bstick") table(selected.bs) } ## End(Not run) res <- adjClust(dist(iris[, 1:4])) select.clust <- select(res, "bs") table(select.clust)
## Not run: if (require("HiTC", quietly = TRUE)) { load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust")) res <- hicClust(hic_imr90_40_XX, log = TRUE) selected.capushe <- select(res) table(selected.capushe) selected.bs <- select(res, type = "bstick") table(selected.bs) } ## End(Not run) res <- adjClust(dist(iris[, 1:4])) select.clust <- select(res, "bs") table(select.clust)
Adjacency-constrained hierarchical agglomerative clustering of Single Nucleotide Polymorphisms based on Linkage Disequilibrium
snpClust(x, h = ncol(x) - 1, stats = c("R.squared", "D.prime"))
snpClust(x, h = ncol(x) - 1, stats = c("R.squared", "D.prime"))
x |
either a genotype matrix of class
|
h |
band width. If not provided, |
stats |
a character vector specifying the linkage disequilibrium
measures to be calculated (using the |
Adjacency-constrained hierarchical agglomerative clustering (HAC) is HAC in which each observation is associated to a position, and the clustering is constrained so as only adjacent clusters are merged. SNPs are clustered based on their similarity as measured by the linkage disequilibrium.
In the special case where genotypes are given as input and the corresponding LD matrix has missing entries, the clustering cannot be performed. This can typically happen when there is insufficient variability in the sample genotypes. In this special case, the indices of the SNP pairs which yield missing values are returned.
If x
is of class
SnpMatrix
or matrix
,
it is assumed to be a matrix of
genotypes for
individuals. This input is converted to a LD similarity matrix
using the
snpStats::ld
. If x
is of class
dgCMatrix
, it is assumed to be a
(squared) LD matrix.
Clustering on a LD similarity other than "R.squared" or "D.prime" can be
performed by providing the LD values directly as argument x
. These
values are expected to be in [0,1], otherwise they are truncated to [0,1].
An object of class chac
(when no LD value is missing)
Dehman A. (2015) Spatial Clustering of Linkage Disequilibrium Blocks for Genome-Wide Association Studies, PhD thesis, Universite Paris Saclay.
Dehman, A. Ambroise, C. and Neuvial, P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. *BMC Bioinformatics* 16:148.
Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics, Algorithms for Molecular Biology 14(22)"
## a very small example if (requireNamespace("snpStats", quietly = TRUE)) { data(testdata, package = "snpStats") # input as snpStats::SnpMatrix fit1 <- snpClust(Autosomes[1:200, 1:5], h = 3, stats = "R.squared") # input as base::matrix fit2 <- snpClust(as.matrix(Autosomes[1:200, 1:5]), h = 3, stats = "R.squared") # input as Matrix::dgCMatrix ldres <- snpStats::ld(Autosomes[1:200, 1:5], depth = 3, stats = "R.squared", symmetric = TRUE) fit3 <- snpClust(ldres, 3) }
## a very small example if (requireNamespace("snpStats", quietly = TRUE)) { data(testdata, package = "snpStats") # input as snpStats::SnpMatrix fit1 <- snpClust(Autosomes[1:200, 1:5], h = 3, stats = "R.squared") # input as base::matrix fit2 <- snpClust(as.matrix(Autosomes[1:200, 1:5]), h = 3, stats = "R.squared") # input as Matrix::dgCMatrix ldres <- snpStats::ld(Autosomes[1:200, 1:5], depth = 3, stats = "R.squared", symmetric = TRUE) fit3 <- snpClust(ldres, 3) }