GTF2F1: Prediction of GTF2F1 in small intestine
Assembly: Human Dec. 2013 (GRCh38/hg38) Data last updated at UCSC: 2018-10-02 10:24:33
Virtual ChIP-seq
Virtual ChIP-seq: Predicting transcription factor binding by learning from the transcriptome
Karimzadeh M, Hoffman MM. 2017.
Virtual ChIP-seq: Predicting transcription factor binding by learning from the transcriptome.
In preparation.
The free Virtual ChIP-seq software package efficiently predicts
binding of 40 TFs in any cell type that has RNA-seq and ATAC-seq (or DNase-seq) data.
Predicting transcription factor binding
Virtual ChIP-seq uses a multi-layer perceptron to predict binding of individual TFs.
It draws on chromatin accessibility, genomic conservation, and the binding
characteristics of TFs observed in previous experiments in other cell types. It also learns from the
association between gene expression and TF binding at different genomic regions.
Because the model incorporates existing ChIP-seq data, it does not need to represent TF sequence preferences in
the form of position weight matrices. For a new cell type with chromatin accessibility and gene
expression data, Virtual ChIP-seq predicts indirect TF binding, as well as binding of TFs without a known
sequence preference.
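As a rough illustration of what a multi-layer perceptron computes for a single genomic bin, the sketch below runs one forward pass through a network with a single hidden layer. It is not the trained Virtual ChIP-seq model: the feature names, feature values, layer size, and weights are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_predict(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a perceptron with one hidden layer; returns a probability."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(logit)


# Hypothetical features for one genomic bin (illustrative names and values):
# [chromatin accessibility, conservation, binding in other cell types, expression score]
bin_features = [0.8, 0.6, 0.5, 0.3]

# Hypothetical, untrained weights for a hidden layer with two units.
hidden_weights = [[1.0, 0.5, 2.0, 1.0], [-1.0, 0.0, 0.5, 0.5]]
hidden_biases = [-1.0, 0.0]
output_weights = [1.5, -0.5]
output_bias = -1.0

probability = mlp_predict(
    bin_features, hidden_weights, hidden_biases, output_weights, output_bias
)
print(f"Predicted binding probability: {probability:.3f}")
```

In a real setting the weights would be learned from the training cell types instead of being written by hand.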
Accuracy of predictions
To build a generalizable classifier that performs well on new cell types with only transcriptome
and chromatin accessibility data, we train the multi-layer perceptron on the training cell types
(A549, GM12878, HCT-116, HepG2, and HeLa-S3). We assess the performance of the model in the
validation cell types (IMR90, K562, MCF-7, NHEK, H1, Ishikawa, BJ, T47D, PANC-1, and Jurkat).
Below, we report the median and standard deviation of each performance measure
across the validation cell types.
TF       Median auROC  S.D. auROC  Median auPR  S.D. auPR  Median MCC  S.D. MCC
BACH1    0.977         0.00923     0.429        0.0508     0.384       0.0923
BHLHE40  0.918         0.00224     0.378        0.0325     0.398       0.0196
BRCA1    0.991         0.00388     0.356        0.0322     0.369       0.0223
CEBPB    0.965         0.0254      0.392        0.0735     0.371       0.042
CHD2     0.98          0.0213      0.462        0.0606     0.451       0.047
CREB1    0.98          0.107       0.519        0.164      0.448       0.109
CTCF     0.989         0.0385      0.81         0.101      0.605       0.15
E2F4     0.993         0.00786     0.502        0.0867     0.322       0.161
EGR1     0.974         0.034       0.418        0.186      0.456       0.176
ELF1     0.954         0.0374      0.496        0.0709     0.455       0.0403
ESRRA    0.939         0.0288      0.308        0.047      0.309       0.0185
FOS      0.858         0.00542     0.334        0.0152     0.369       0.02
FOXA1    0.966         0.0279      0.584        0.0133     0.453       0.0903
GABPA    0.978         0.0272      0.434        0.0605     0.414       0.0533
GATA3    0.916         0.0314      0.241        0.0627     0.312       0.0597
GTF2F1   0.991         0.0123      0.29         0.0709     0.341       0.0624
H2AZ     0.932         0.0728      0.304        0.141      0.317       0.129
HCFC1    0.988         0.00668     0.499        0.0419     0.44        0.0583
JUND     0.992         0.00984     0.319        0.18       0.346       0.142
MAFF     0.964         0.00405     0.361        0.0987     0.374       0.102
MAFK     0.983         0.00458     0.523        0.0958     0.478       0.0398
MAX      0.968         0.0269      0.459        0.115      0.416       0.0645
MAZ      0.987         0.00437     0.546        0.0798     0.455       0.063
MXI1     0.991         0.00456     0.426        0.0318     0.43        0.0305
MYC      0.978         0.114       0.312        0.191      0.319       0.154
NRF1     0.997         0.0127      0.72         0.0508     0.359       0.0593
RAD21    0.986         0.0135      0.75         0.0552     0.581       0.0952
REST     0.985         0.0181      0.562        0.126      0.439       0.0759
RFX5     0.971         0.0138      0.32         0.0461     0.305       0.0536
SIN3A    0.977         0.0095      0.413        0.0399     0.394       0.0384
SMC3     0.998         0.00005     0.779        0.0177     0.723       0.0184
SRF      0.971         0.0355      0.363        0.0833     0.398       0.0584
TAF1     0.992         0.0216      0.541        0.0558     0.484       0.0457
TBP      0.982         0.00548     0.365        0.111      0.387       0.0704
TEAD4    0.947         0.0367      0.392        0.0208     0.352       0.0445
USF1     0.917         0.0223      0.411        0.0858     0.401       0.0785
USF2     0.97          0.0128      0.471        0.0371     0.409       0.0893
YY1      0.93          0.0334      0.46         0.049      0.485       0.0665
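For readers less familiar with the measures in the table, the sketch below shows how a Matthews correlation coefficient (MCC) can be computed from confusion-matrix counts, and how a per-TF median and standard deviation can then be taken across validation cell types. The counts and cell type labels are hypothetical and are not the data behind the table. Using the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical (TP, FP, TN, FN) counts for one TF in three validation cell types.
confusion_by_cell_type = {
    "cell_type_A": (40, 60, 9860, 40),
    "cell_type_B": (55, 70, 9830, 45),
    "cell_type_C": (30, 50, 9880, 40),
}

mcc_values = [matthews_corrcoef(*counts) for counts in confusion_by_cell_type.values()]

median_mcc = statistics.median(mcc_values)
sd_mcc = statistics.stdev(mcc_values)  # sample standard deviation (assumption)

print(f"Median MCC: {median_mcc:.3f}")
print(f"S.D. MCC:   {sd_mcc:.3f}")
```

The same median and standard deviation summary applies to auROC and auPR once those values have been computed for each cell type.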
Virtual ChIP-seq accepts chromatin accessibility data in narrowPeak format
and RNA-seq data as a matrix in which rows are human gene symbols
and columns are cell types (at least one column, for your cell type of interest).
The RNA-seq measure must be normalized for length and library size (RPKM, FPKM, and TPM are accepted; raw read counts are not).
Generating the input tables for your TF of interest takes an average of 6 CPU hours (depending on the TF) and at least 8 GB of RAM.
Applying the trained model takes less than 20 minutes for most TFs and datasets.
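If you only have raw read counts, they need to be normalized before use. The sketch below shows one common way to convert counts to TPM and to lay the result out as a gene-by-cell-type matrix. The gene names, counts, and lengths are made-up placeholders (a real matrix would use human gene symbols), and the tab-separated layout with a "gene" header column is an assumption, because the exact file layout is not specified here.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts for one sample to transcripts per million (TPM)."""
    # Step 1: reads per kilobase corrects for gene length.
    reads_per_kb = {
        gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts
    }
    # Step 2: scaling to a total of one million corrects for library size.
    total = sum(reads_per_kb.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {gene: value / total * 1e6 for gene, value in reads_per_kb.items()}


def expression_matrix_tsv(tpm_by_cell_type):
    """Render {cell_type: {gene: TPM}} as a tab-separated gene-by-cell-type matrix."""
    cell_types = sorted(tpm_by_cell_type)
    genes = sorted(set().union(*(tpm_by_cell_type[c] for c in cell_types)))
    lines = ["gene\t" + "\t".join(cell_types)]
    for gene in genes:
        values = (f"{tpm_by_cell_type[c].get(gene, 0.0):.2f}" for c in cell_types)
        lines.append(gene + "\t" + "\t".join(values))
    return "\n".join(lines)


# Placeholder gene names with made-up counts and lengths, for illustration only.
raw_counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
gene_lengths_bp = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

tpm = counts_to_tpm(raw_counts, gene_lengths_bp)
print(expression_matrix_tsv({"my_cell_type": tpm}))
```

TPM values for a sample always sum to one million, which makes a quick sanity check on any matrix before it is passed on.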
Using the track hub
There are 40 supertracks, one for each transcription factor.
Each supertrack contains two bigBed9 files: one showing genomic bins with TF binding
in Cistrome DB datasets, and one showing Virtual ChIP-seq predictions in the Roadmap
consortium datasets.
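To work with these tracks outside the browser, a bigBed file can first be converted to plain BED text, for example with the UCSC bigBedToBed utility. The sketch below parses one such line using the standard nine BED fields. The example line is invented, and what the name and score columns encode in this particular hub is not stated here, so treat any interpretation of them as an assumption.

```python
from typing import NamedTuple, Tuple


class Bed9(NamedTuple):
    """The nine standard BED fields, in order."""
    chrom: str
    chrom_start: int   # 0-based start
    chrom_end: int     # end, exclusive
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: Tuple[int, ...]


def parse_bed9(line):
    """Parse one tab-separated BED9 line into a Bed9 record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    chrom, start, end, name, score, strand, thick_start, thick_end, rgb = fields
    return Bed9(
        chrom, int(start), int(end), name, int(score), strand,
        int(thick_start), int(thick_end),
        # itemRgb is usually "R,G,B"; a bare "0" becomes the 1-tuple (0,).
        tuple(int(channel) for channel in rgb.split(",")),
    )


# Invented example line, not taken from the hub.
example_line = "chr1\t1000\t1200\tGTF2F1\t850\t.\t1000\t1200\t255,0,0\n"
record = parse_bed9(example_line)
print(record.chrom, record.chrom_start, record.chrom_end, record.score)
```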
Please ask questions about Virtual ChIP-seq on our
mailing list. If you want to report a bug or request a feature,
use the Virtual ChIP-seq
issue tracker. We are interested in all comments on the package,
and on how easy the installation and documentation are to use.