Statistics comparing the simulated sequences to actual sequences

To ensure that the simulated sequences have properties similar to those of the CFTR locus, we start by computing a multiple alignment for the simulated sequences and another multiple alignment for the CFTR sequences. We then compare the two multiple alignments by studying the properties of the pair-wise alignments induced by certain pairs of species. Since the set of 20 species used in our simulation only partially overlaps the set of species for which sequences are available in the CFTR region, we restrict our attention to the sequences from human, mouse, cat, dog, cow, and pig.
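As an illustration of what "induced" means here, the following minimal sketch projects a multiple alignment onto two of its rows and drops the columns in which both rows carry a gap. The function name `induced_pairwise`, the dictionary layout, and the toy sequences are assumptions made for this example; they are not taken from the original analysis.

```python
def induced_pairwise(msa, name_a, name_b, gap="-"):
    """Project a multiple alignment onto two of its rows.

    msa maps a species name to its aligned row (all rows have equal length).
    Columns in which both selected rows carry a gap are dropped, since they
    carry no information about this pair.
    """
    row_a, row_b = msa[name_a], msa[name_b]
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    kept = [(x, y) for x, y in zip(row_a, row_b) if not (x == gap and y == gap)]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


# Toy alignment (made-up sequences, for illustration only).
toy_msa = {
    "human": "ACG-T-A",
    "mouse": "A-G-TCA",
    "dog":   "ACGAT-A",
}
human_row, mouse_row = induced_pairwise(toy_msa, "human", "mouse")
```

The fourth column is a gap in both human and mouse (it exists only because of the dog row), so it disappears from the induced alignment.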

For each induced pair-wise alignment, we report (above the diagonal) the expected number of substitutions per site, based on a Kimura 2-parameter model. The simulated sequences approximate the actual sequences quite well, except for the mouse, whose substitution rate has been slightly underestimated. We also report (below the diagonal) the fraction of non-repetitive bases that are aligned (either match or mismatch). Our simulated sequences are again in good agreement with the actual sequences, except for the mouse sequence, where the rate of insertions and deletions in our simulations was higher than that observed in the greater-CFTR region.

Actual greater-CFTR sequence:

        Hum     Mou     Cat     Dog     Cow     Pig
Hum     -       0.434   0.302   0.274   0.311   0.301
Mou     0.758   -       0.479   0.507   0.510   0.500
Cat     0.753   0.626   -       0.174   0.280   0.269
Dog     0.786   0.610   0.903   -       0.308   0.295
Cow     0.774   0.602   0.818   0.763   -       0.216
Pig     0.764   0.591   0.813   0.794   0.828   -

Simulated sequences:

        Hum     Mou     Cat     Dog     Cow     Pig
Hum     -       0.422   0.284   0.285   0.332   0.316
Mou     0.674   -       0.417   0.417   0.465   0.449
Cat     0.753   0.548   -       0.159   0.284   0.268
Dog     0.754   0.549   0.879   -       0.284   0.268
Cow     0.741   0.544   0.806   0.811   -       0.219
Pig     0.741   0.546   0.806   0.811   0.835   -
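
The two statistics reported in these tables can be sketched as follows. The distance function uses the standard Kimura 2-parameter formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed fractions of transitions and transversions among aligned sites. The aligned-fraction function rests on assumptions that the text does not spell out: repeats are taken to be soft-masked (lowercase), and the fraction is taken relative to the non-repetitive bases of the first sequence. Both function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(row_a, row_b):
    """Kimura 2-parameter distance over columns where both rows hold A, C, G or T."""
    transitions = transversions = compared = 0
    for x, y in zip(row_a.upper(), row_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no aligned sites to compare")
    p = transitions / compared
    q = transversions / compared
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def aligned_fraction(row_a, row_b, gap="-"):
    """Fraction of uppercase (assumed non-repetitive) bases of row_a aligned to a base of row_b."""
    total = aligned = 0
    for x, y in zip(row_a, row_b):
        if x == gap or not x.isupper():
            continue  # gap or soft-masked repeat base in row_a
        total += 1
        if y != gap:
            aligned += 1  # match or mismatch
    if total == 0:
        raise ValueError("row_a has no non-repetitive bases")
    return aligned / total
```

For example, `k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")` has P = 0.1 and Q = 0, giving -(1/2) ln(0.8), and `aligned_fraction("ACgtAC", "A-GTAC")` counts three of the four uppercase bases of the first row as aligned.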

2) Repetitive content

Running RepeatMasker on both the actual and simulated sequences shows general agreement in repeat content and age distribution for the human sequence. Where a line ends with three additional values (set in bold in the original), these are the average % divergence, % deletions, and % insertions within that repeat family.

RepeatMasker summaries: actual CFTR sequence first, simulated sequences second.
file name: human_T1.fasta
sequences: 1
total length: 1877426 bp (1877426 bp excl N-runs)
GC level: 38.45 %
bases masked: 756106 bp ( 40.27 %)
======================================
-------------------------------------------------- 
SINEs: 871 198016 bp 10.55 %
ALUs 529 146780 bp 7.82 % 12.8 1.39 1.16
MIRs 342 51236 bp 2.73 % 28.5 7.08 2.83

LINEs: 578 356592 bp 18.99 %
LINE1 289 266126 bp 14.18 % 19.8 4.92 2.78
LINE2 260 82881 bp 4.41 % 30.3 8.19 3.39
L3/CR1 29 7585 bp 0.40 %

LTR elements: 249 117587 bp 6.26 % 20.1 5.84 3.06
MaLRs 145 60421 bp 3.22 %
ERVL 63 27760 bp 1.48 %
ERV_classI 41 29406 bp 1.57 %
ERV_classII 0 0 bp 0.00 %

DNA elements: 227 57625 bp 3.07 % 21.3 5.37 2.76
MER1_type 126 26000 bp 1.38 %
MER2_type 45 23059 bp 1.23 %

Unclassified: 3 1627 bp 0.09 %

Total interspersed repeats: 731447 bp 38.96 %


Small RNA: 5 326 bp 0.02 %

Satellites: 0 0 bp 0.00 %
Simple repeats: 244 13428 bp 0.72 %
Low complexity: 270 11010 bp 0.59 %
==========================================

Simulated sequences:
file name: HOMO
sequences: 1.0
total length: 50975.5 bp (58899 bp excl N-runs)
GC level: 40.579000 %
bases masked: 23318.1 bp ( 45.701000 %)
======================================
------------------------------------------------- 
SINEs: 40.6 6407.8 bp 12.584000 % 
ALUs 37.6 5967.7 bp 11.744000 % 9.68 1.89 0.768
MIRs 3.0 440.1 bp 0.840000 % 24.9 4.13 2.89

LINEs: 32.6 13168.1 bp 25.767000 %
LINE1 23.9 11607.9 bp 22.659000 % 17.1 3.84 2.93
LINE2 8.7 1560.2 bp 3.109000 % 27.9 5.79 2.53
L3/CR1 0.0 0.0 bp 0.0 %

LTR elements: 6.6 2461.6 bp 4.816000 % 23.3 3.79 2.28
MaLRs 0.0 0.0 bp 0.0 %
ERVL 0.0 0.0 bp 0.0 %
ERV_classI 6.6 2461.6 bp 4.816000 %
ERV_classII 0.0 0.0 bp 0.0 %

DNA elements: 11.3 1280.1 bp 2.531000 % 16.6 3.13 1.42
MER1_type 11.3 1280.1 bp 2.531000 %
MER2_type 0.0 0.0 bp 0.0 %

Unclassified: 0.0 0.0 bp 0.0 %

Total interspersed repeats: 23317.6 bp 45.7 %


Small RNA: 0.0 0.0 bp 0.0 %

Satellites: 0.0 0.0 bp 0.0 %
Simple repeats: 0.0 0.0 bp 0.0 %
Low complexity: 0.4 10.5 bp 0.020000 %
=======================================
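
To compare the two reports family by family, the summary rows can be parsed into numbers. The sketch below is one way to do this with a regular expression written for the row layout shown above (name, element count, length in bp, percentage, and optionally the three trailing averages). The pattern and the helper names `parse_row` and `pct_by_family` are assumptions for illustration and are not part of RepeatMasker; rows without an element count, such as "Total interspersed repeats", are deliberately skipped.

```python
import re

NUM = r"\d+(?:\.\d+)?"
ROW = re.compile(
    rf"^(?P<name>[^:]+?):?\s+"
    rf"(?P<count>{NUM})\s+"
    rf"(?P<bp>{NUM})\s+bp\s+"
    rf"(?P<pct>{NUM})\s*%"
    rf"(?P<extra>(?:\s+{NUM})*)\s*$"
)


def parse_row(line):
    """Parse one summary row; return None for lines that do not fit the layout."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name").strip(),
        "count": float(m.group("count")),
        "bp": float(m.group("bp")),
        "pct": float(m.group("pct")),
        # Optional trailing values: % divergence, % deletions, % insertions.
        "extra": [float(v) for v in m.group("extra").split()],
    }


def pct_by_family(lines):
    """Map repeat family name -> percentage of sequence."""
    rows = (parse_row(line) for line in lines)
    return {r["name"]: r["pct"] for r in rows if r is not None}


# A few rows copied from the human reports above.
actual_rows = [
    "SINEs: 871 198016 bp 10.55 %",
    "ALUs 529 146780 bp 7.82 % 12.8 1.39 1.16",
    "LINEs: 578 356592 bp 18.99 %",
]
simulated_rows = [
    "SINEs: 40.6 6407.8 bp 12.584000 % ",
    "ALUs 37.6 5967.7 bp 11.744000 % 9.68 1.89 0.768",
    "LINEs: 32.6 13168.1 bp 25.767000 %",
]

actual_pct = pct_by_family(actual_rows)
simulated_pct = pct_by_family(simulated_rows)
# Simulated minus actual, in percentage points of sequence.
difference = {name: simulated_pct[name] - actual_pct[name] for name in actual_pct}
```

On these rows the simulated human sequence carries roughly 2.0, 3.9, and 6.8 percentage points more SINE, Alu, and LINE content, respectively, than the actual sequence.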

 

The same comparison for the mouse sequence:

RepeatMasker summaries: actual CFTR sequence first, simulated sequences second.
file name: mouse_T1.fasta
sequences: 1
total length: 1486509 bp (1486509 bp excl N-runs)
GC level: 40.10 %
bases masked: 484710 bp ( 32.61 %)
========================================
(columns: number of elements*, length occupied, percentage of sequence)
--------------------------------------------------
SINEs: 626 85194 bp 5.73 % 24.6 4.92 2.85
B1s 216 25346 bp 1.71 %
B2-B4 293 48273 bp 3.25 %
IDs 42 2938 bp 0.20 %
MIRs 75 8637 bp 0.58 %

LINEs: 397 233034 bp 15.68 % 20.16 5.15 2.27
LINE1 347 222408 bp 14.96 %
LINE2 45 10021 bp 0.67 %
L3/CR1 5 605 bp 0.04 %

LTR elements: 323 100821 bp 6.78 % 22.2 7.02 4.00
MaLRs 235 72374 bp 4.87 %
ERVL 21 5110 bp 0.34 %
ERV_classI 14 4980 bp 0.34 %
ERV_classII 26 8605 bp 0.58 %

DNA elements: 92 15084 bp 1.01 % 24.8 6.53 2.37
MER1_type 68 10422 bp 0.70 %
MER2_type 16 3477 bp 0.23 %

Unclassified: 14 3423 bp 0.23 %

Total interspersed repeats: 437556 bp 29.44 %


Small RNA: 10 630 bp 0.04 %

Satellites: 0 0 bp 0.00 %
Simple repeats: 552 34055 bp 2.29 %
Low complexity: 246 12570 bp 0.85 %

Simulated sequences:
file name: MUS
sequences: 1.0
total length: 37516.5 bp (40991 bp excl N-runs)
GC level: 38.785000 %
bases masked: 10845.6 bp ( 28.903000 %)
========================================
(columns: number of elements*, length occupied, percentage of sequence)
--------------------------------------------------
SINEs: 23.4 2346.2 bp 6.331000 % 16.0 2.61 2.74
B1s 6.6 569.1 bp 1.501000 %
B2-B4 14.8 1540.3 bp 4.147000 %
IDs 0.5 38.2 bp 0.114000 %
MIRs 1.5 198.6 bp 0.569000 %

LINEs: 19.6 7219.3 bp 19.213000 % 24.6 4.82 2.63
LINE1 14.9 6496.8 bp 17.241000 % 
LINE2 4.7 722.5 bp 1.970000 %
L3/CR1 0.0 0.0 bp 0.0 %

LTR elements: 1.3 86.3 bp 0.226000 % 25.2 5.25 8.56
MaLRs 0.1 10.8 bp 0.025000 %
ERVL 0.8 49.0 bp 0.129000 %
ERV_classI 0.0 0.0 bp 0.0 %
ERV_classII 0.4 26.5 bp 0.073000 %

DNA elements: 11.2 1139.9 bp 2.996000 % 17.8 2.92 1.47
MER1_type 11.2 1139.9 bp 2.996000 %
MER2_type 0.0 0.0 bp 0.0 %

Unclassified: 0.0 0.0 bp 0.0 %

Total interspersed repeats: 10791.7 bp 28.767000 %


Small RNA: 0.1 8.3 bp 0.022000 %

Satellites: 0.2 19.9 bp 0.047000 %
Simple repeats: 0.7 16.5 bp 0.047000 %
Low complexity: 0.3 9.0 bp 0.021000 %