Scripts to recreate figures in paper: Integrative modelling reveals the principles of multi-scale chromatin boundary formation in human nuclear organisation (2015). For queries or help getting these running, you can contact me on twitter or open an issue at the github repository.
Scripts will run on OS X and other Unix-based systems. External dependencies should be installed somewhere on your $PATH
.
These scripts rely on an ad hoc R package I've written called "blmR". To install, use the devtools
package:
devtools::install_github("blmR", "blmoore")
Lots of commonly-installed R packages are also used, including but not limited to:
caret
corrgram
dplyr
ggplot2
(w/ dependenciesreshape2
,scales
,RColorBrewer
etc.)gridExtra
naturalsort
plotrix
pvclust
randomForest
RHmm
snow
(andRmpi
)
BSgenome
(and UCSC hg19)GenomicRanges
bedtools
bigWigAverageOverBed
*ICE
/ hiclib*
* To generate new input files, otherwise not required
Serialised rds data files are provided under data/rds
. These were built from ENCODE uniformly processed .bigWig files which are available to download from ENCODE (list of filenames used here). These files (used in Boyle et al. (2014)) come from the July 2012 data freeze and use MACSv2 to convert aligned ChIP-seq reads into a measure of "signal" relative to input chromatin background.
Some of the scripts rely on other large files not included in this repository (in order to reanalyse data from scratch), but reasonable intermediates are provided where possible and stored under data/
. For notes on which scripts are effected, see the how to run section of this README.
First clone the repository (warning: ~250 MB total) and enter the directory:
git clone https://github.com/blmoore/3dgenome
cd 3dgenome
Now scripts can be run with Rscript, i.e.:
> Rscript R/0_buildDatFiles.R
# output follows...
Use all variables is set to: FALSE
Reading in files...
[1] "HaibTfbsH1hescAtf3V0416102AlnRep0"
[1] "SydhTfbsH1hescCebpbIggrabAlnRep0"
[1] "BroadHistoneH1hescChd1a301218aStdAlnRep0"
[1] "SydhTfbsH1hescChd2IggrabAlnRep0"
[1] "OpenChromChipH1hescCmycAlnRep1"
[1] ...
Building Random Forest...
| Out-of-bag |
Tree | MSE %Var(y) |
25 | 0.005201 35.61 |
50 | 0.004946 33.87 |
75 | 0.004833 33.09 |
100 | 0.004842 33.15 |
...
Scripts are numbered in the order of which they should be run. You may wish to delete existing PDFs under figures/
to convince yourself they're being regenerated (if not they'll be overwritten).
Builds a tidy data.frame
in the format: {Y, x_1, x_2, ..., x_n} where Y is a vector of values to be predicted (compartment eigenvector profiles in this instance) and each x is a numeric vector input feature. Here each cell represents an averaged feature (fold-change relative to input chromatin) over 1 Mb.
This script generates two sets of .rds
data files under data/
:
cellType_XVars.rds
- the aforementioneddata.frame
containing input features and empirical independent variable data.cellType_XVars_RFmod.rds
- the Random Forest regression model for the respectivedata.frame
.
Plots genome-wide compartment profiles for each cell type under study. The resulting figures make up part of figure 1 (figures/f1_chr2.pdf
) and supplementary figure 1 (figures/suppl/s1_GenomewideWigglePlots.pdf
).
Generates detailed view of a ~22 Mb region of chromosome 2, indicating conservation of higher order genome organisation at multiple levels. Plots figures/f1_ZoomedRegion.pdf
which makes up the second half of figure 1 in the manuscript.
Compares boundaries for compartments and TADs across cell types and tests the significance of observed relationships. Plots supplementary figure 2 in two parts: 2a
and 2b
.
Plots Random Forest modelling results per cell type as three separate plots (f2_gmRes.pdf
, f2_h1Res.pdf
and f2_k5Res.pdf
). Also calculates variable importance per model and plots a summary of the top ten (f2b_varImpPerModel.pdf
). Combining these plots gives figure 2 in the manuscript.
Performs cross-application of cell type specific models and generates a summary plot (f3b_crossApplyBars.pdf
). Also plots a reciprocal example of cross application between two of the cell types (f3ai
and f3aii
). Together these are figure 3 in the manuscript.
This script has a non-neglible runtime (~10 mins on a modern processor) but could easily be parallelised / optimised. Here we're splitting the genome into an equal number of bins based on how variable chromatin structure is across the cell types under study. Then a model (as in script 3) is built per split, and the results compared (f4a
).
Requires external files. To run this script you need to download the ENCODE predicted chromatin states from here (i.e. gzip archives for the three cell types being used). They should be unzipped and placed under data/bedfiles/
, e.g.
# download bedfiles from ENCODE ftp (35 MB total)
wget -P data/bedfiles/ -i data/text/chromStateFiles.txt
# un-gzip archives
gunzip data/bedfiles/*.gz
Also requires R package RHmm
, which at the time of writing has been removed from CRAN. Download the newest available version from the archive and install with e.g. R CMD INSTALL <RHmm_2.0.0.tar.gz>
.
Finally, this script also has the external dependancy bedtools which should be on your $PATH
. Should be easy enough to install through your linux central repo (e.g. apt-get install bedtools
) or via your OS X pkg manager of choice (brew install bedtools
). bedtools may throw an "out of memory" exception if running on a machine with < approx. 4 GB RAM.
The output of this script includes f4b
as well as supplementary figures s6
and s7
. Combine f4b with f4a (script 5) and you'll get figure 4.
Reproducing figure 5 requires a lot of leg work, I am happy to send you intermediate files if you wish to run this script (ping me on twitter). Comments within the Rscript explain what's needed but the essential steps are:
- Call compartment and TAD boundaries (or use those provided in
data/bedfiles
) - Generate a series of equally spaced bins around each one (Python script: binAroundBed.py)
- Use
bigWigAverageOverBed
to average all input features (~300 .bigWig files, maybe 200 GB of raw data) — ideally use a cluster for this
This gives you a series of text files with intervals and averaged signal per bin, with which you can then test for enrichments per boundary and plot figure 5 (f5a, f5b) as well as supplementary figures (s9 and s10).
This generates a single figure in a resource-intensive, uninteresting way (probably a good one to skip). As written uses the pvclust
R package to find "significant" clusters in the input features through multiscale bootstrap (if a cluster is stable through undersampled and oversample data, they reason, it's significantly more clustery than would be expected by chance --- give or take, read the docs). Parallelisation is achieved with snow
and Rmpi
running on an openmpi backend (4 threads), but by all means run it single-threaded if you so desire. In the final(ish) paper, these heatmaps became supplementary figure 4 (not 10 as the title suggests, TO FIX).
Finally this script will plot the remaining supplementary figures: s3
and s5
. The only figure not regenerated through these scripts is supplementary figure 8 (MS numbering), which is UCSC browser screenshots, laboriously generated by hand (well, ish).
Scripts prefixed with letters (A, B, ...) are in the process of being tidied and instructions to run will then follow.
Below is an output of sessionInfo() for troubleshooting purposes, some loaded packages may not be required and likewise, some required packages may not be loaded. An exception caused by attached packages is likely due to version issues.
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rmpi_0.6-5 snow_0.3-13 pvclust_1.2-2 naturalsort_0.1.2 GenomicRanges_1.16.4 GenomeInfoDb_1.0.2
[7] IRanges_1.22.10 BiocGenerics_0.10.0 dplyr_0.2 data.table_1.9.2 RHmm_2.0.3 nlme_3.1-117
[13] plyr_1.8.1 blmR_0.1 RColorBrewer_1.0-5 infotheo_1.2.0 ROCR_1.0-5 gplots_2.14.1
[19] calibrate_1.7.2 randomForest_4.6-10 caret_6.0-30 lattice_0.20-29 gridExtra_0.9.1 ggplot2_1.0.0
[25] fGarch_3010.82 fBasics_3010.86 MASS_7.3-33 timeSeries_3010.97 timeDate_3010.98 corrgram_1.6
[31] seriation_1.0-13 reshape2_1.4 R.matlab_3.0.1 plotrix_3.5-7 scales_0.2.4
loaded via a namespace (and not attached):
[1] assertthat_0.1 bitops_1.0-6 BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20 caTools_1.17
[7] cluster_1.15.2 codetools_0.2-8 colorspace_1.2-4 devtools_1.5 digest_0.6.4 evaluate_0.5.5
[13] foreach_1.4.2 gclus_1.3.1 gdata_2.13.3 gtable_0.1.2 gtools_3.4.1 httr_0.4
[19] iterators_1.0.7 KernSmooth_2.23-12 labeling_0.2 lme4_1.1-7 magrittr_1.0.1 Matrix_1.1-4
[25] memoise_0.2.1 minqa_1.2.3 munsell_0.4.2 nloptr_1.0.4 nnet_7.3-8 proto_0.3-10
[31] R.methodsS3_1.6.1 R.oo_1.18.0 R.utils_1.32.4 Rcpp_0.11.2 RCurl_1.95-4.3 roxygen2_4.0.1.99
[37] splines_3.1.1 stabledist_0.6-6 stats4_3.1.1 stringr_0.6.2 tools_3.1.1 TSP_1.0-9
[43] whisker_0.3-2 XVector_0.4.0