PhyloCSF++ scores the coding potential of genomic regions from a whole-genome multiple sequence alignment (MSA).
The scores were computed with PhyloCSF++, a fast and easy-to-use implementation of the PhyloCSF method [2, 3].
A more detailed description of the underlying method is available here.
PhyloCSF++ raw tracks
The raw tracks (one for each of the six frames) score each codon. Green tracks represent the frames on the positive strand, red tracks frames on the negative strand.
A negative score indicates that the codon is non-coding; a positive score indicates that it is coding.
The scores are unbounded and do not take the other codons in the region into account.
Hence, we generally recommend using the smoothened tracks (named "PhyloCSF++ +1", etc.).
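The sign rule for raw scores can be illustrated with a minimal Python sketch. The function name `label_codons` and the example scores are hypothetical, and treating a score of exactly zero as "undetermined" is an assumption, since the description above only covers positive and negative scores.

```python
def label_codons(raw_scores):
    """Label each codon of one frame by the sign of its raw score.

    Each codon is judged on its own, mirroring the fact that raw scores
    do not take neighbouring codons into account.
    """
    labels = []
    for score in raw_scores:
        if score > 0:
            labels.append("coding")
        elif score < 0:
            labels.append("non-coding")
        else:
            # Assumption: the source does not say how to read a score of 0.
            labels.append("undetermined")
    return labels


# Made-up raw scores for four consecutive codons in one frame.
example_scores = [12.7, -3.4, 0.0, 48.1]
example_labels = label_codons(example_scores)
```

Because every codon is labelled independently, a single noisy codon can flip its label inside an otherwise consistent region, which is one reason the smoothened tracks are usually easier to read.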
PhyloCSF++ (smoothened) tracks
The scores in smoothened tracks are posterior probabilities, based on the raw tracks (smoothened with an HMM).
They are normalized and lie in the interval [-15, +15].
Positive scores indicate codons in coding regions, negative scores indicate codons in non-coding regions.
The power track (the branch length sum) gives a confidence measure for the PhyloCSF scores.
For each position in the genome, it provides a confidence score in the interval [0, 1] that corresponds to how many species were aligned at that position in the MSA (taking the phylogenetic distances of these species into account).
In other words, if only very few and closely related species were aligned at a position, it has a lower confidence score.
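One way to picture such a branch-length-based confidence is the sketch below. It is an illustration under stated assumptions, not the PhyloCSF++ implementation: the confidence is taken to be the summed length of all branches that lead to at least one aligned species, divided by the total branch length of the tree. The tree, its branch lengths, and the function names are hypothetical.

```python
# A node is (name, branch_length, children); leaves have an empty children list.
# Hypothetical toy tree with made-up branch lengths.
TOY_TREE = ("root", 0.0, [
    ("primates", 0.3, [("human", 0.1, []), ("chimp", 0.1, [])]),
    ("rodents", 0.2, [("mouse", 0.4, []), ("rat", 0.4, [])]),
    ("chicken", 1.0, []),
])


def total_length(node):
    """Sum of all branch lengths in the tree."""
    _, length, children = node
    return length + sum(total_length(child) for child in children)


def covered_length(node, aligned):
    """Return (length, found) for the branches leading to aligned species."""
    name, length, children = node
    if not children:
        found = name in aligned
        return (length if found else 0.0), found
    subtotal, found = 0.0, False
    for child in children:
        child_length, child_found = covered_length(child, aligned)
        subtotal += child_length
        found = found or child_found
    # Count this branch only if an aligned species sits somewhere below it.
    return (subtotal + length if found else 0.0), found


def power(tree, aligned):
    """Assumed confidence in [0, 1]: covered branch length / total branch length."""
    full = total_length(tree)
    covered, _ = covered_length(tree, aligned)
    return covered / full if full > 0 else 0.0
```

With this toy tree, aligning only human and chimp gives a lower value than aligning human and chicken, even though both cases involve two species, because the second pair spans more of the tree.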
Overview of tracks
The tracks can be downloaded here.
| Species | Assembly | Species subset (intersection of model and MSA) |
| Rat (Rattus norvegicus) | | rn6, mm10, ailMel1, ornAna2, galGal5, melGal5, xenTro7, danRer10, micOch1, hg38, panTro5, rheMac8, cavPor3, felCat8, bosTau8, oryCun2, canFam3, monDom5 |
| Fugu (Takifugu rubripes) | fr3 / fugu5 | fr3, tetNig2, oreNil1, oryLat2, danRer7, gasAcu1, latCha1, gadMor1 |
| Stickleback (Gasterosteus aculeatus) | | gasAcu1, danRer4, fr2, oryLat1, tetNig1, galGal3, mm8, hg18 |
| Tarsier (Tarsius syrichta) | | tarSyr2, micMur1, tupBel1, otoGar3, hg38, panTro4, rheMac3, mm10, canFam3 |
| Yeast (Saccharomyces cerevisiae) | | sacCer3, sacPar, sacMik, sacKud, sacBay, sacCas, sacKlu |
PhyloCSF++ vs. PhyloCSF
You might wonder what the difference is between these tools.
Technically speaking, they give the exact same scores (except for very minor differences in the smoothened scores due to randomization in the initialization of the HMM).
PhyloCSF++ was developed to make tracks available for more species. Unfortunately, the original implementation of PhyloCSF does not allow creating tracks without additional coding.
Furthermore, PhyloCSF++ is faster, supports multi-threading, and is available as static binaries, on bioconda, and as C++ code (hopefully making it easier for users to compile and run).
It also comes with additional tools so you can use these tracks to annotate the transcripts in your GFF/GTF files with PhyloCSF and confidence scores.
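As a rough picture of what annotating a transcript with track scores can mean, the sketch below averages per-position track values over a transcript's CDS intervals. This is an assumption made for illustration and does not reproduce how the PhyloCSF++ tools compute their annotations. The function name `mean_over_intervals`, the in-memory dictionary standing in for a track, and all values are hypothetical.

```python
def mean_over_intervals(track, intervals):
    """Mean of track values over 1-based, inclusive (start, end) intervals.

    `track` maps a genomic position to a score. Positions without a value
    are skipped. Returns None if no position in the intervals has a value.
    """
    values = [
        track[pos]
        for start, end in intervals
        for pos in range(start, end + 1)
        if pos in track
    ]
    return sum(values) / len(values) if values else None


# Made-up scores for a handful of positions and a two-exon CDS.
toy_track = {1: 3.0, 2: 5.0, 3: -2.0, 10: 4.0}
toy_cds = [(1, 2), (10, 10)]
toy_mean = mean_over_intervals(toy_track, toy_cds)
```

The same helper could be applied to a power track to obtain an average confidence for the transcript alongside its average score.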
PhyloCSF++ was developed by a different group. Its underlying method is the only connection to PhyloCSF.
If you use the tracks or the software in your work, please consider citing the PhyloCSF++ paper [1].
For citing the original method, see [2].
[1] Pockrandt C et al. PhyloCSF++: A fast and user-friendly implementation of PhyloCSF with annotation tools. bioRxiv, 2021.
[2] Lin MF et al. PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics, 2011.
[3] Mudge JM et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research, 2019.