Assembly Process Validation

To validate SGN's unigene assembly process, we compared our combined Lycopersicon build with TIGR's tomato gene index. These comparisons are based on the latest TIGR tomato gene index available at the time, published on June 1, 2002. Note that neither SGN's unigene build nor TIGR's gene index is supported by experimental evidence, and thus both remain approximations of the true nature of the genomes represented.

Because the two builds differ in input data (some EST sequences are not common to both) and in chromatogram processing, a direct comparison exposes mostly "noisy" differences. Attempts to characterize or manually curate these differences therefore lead to inconclusive results.

Thus, the data presented below serve to indicate the observed similarity between the builds and to demonstrate that neither build differs from the other to a degree that would suggest a suspicious assembly process. See this page for a discussion on the assembly process.

                               SGN Lycopersicon combined build #1    TIGR Tomato Gene Index
Total # of output sequences    31278                                 31102
Contigs (TCs)                  16200                                 15211
Singlets                       15078                                 15891
Censored inputs                14310                                 11054
Exclusive Contigs              0                                     0
Exclusive Singlets             2044                                  707

Contigs are unigenes or gene index sequences composed of the consensus of an alignment of two or more EST sequences. Singlets are sequences which have been determined not to overlap sufficiently with any other sequence in the input data set. Censored inputs are input sequences which are not common to both sets. Exclusive contigs are contigs composed entirely of input sequences which are not common to both builds. Exclusive singlets are singlets found only in the indicated build. Since no exclusive contigs were found, every contig in SGN's build and every TC in TIGR's tomato gene index contains at least one input sequence common to both builds.
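The definitions above can be sketched in code. The following is a minimal illustration, not the procedure actually used for the comparison: it assumes each build is available as a mapping from a unigene identifier to the set of EST identifiers it contains, and it treats a singlet as exclusive when its one input sequence is absent from the other build. The names `Build` and `summarize` are hypothetical.

```python
from typing import Dict, Set

# Assumed representation: unigene id -> set of member EST ids.
Build = Dict[str, Set[str]]


def summarize(build: Build, other: Build) -> Dict[str, int]:
    """Count the categories defined above for `build`, relative to `other`."""
    inputs = set().union(*build.values())
    other_inputs = set().union(*other.values())
    common = inputs & other_inputs  # input sequences present in both builds

    contigs = {u: m for u, m in build.items() if len(m) >= 2}
    singlets = {u: m for u, m in build.items() if len(m) == 1}

    return {
        "outputs": len(build),
        "contigs": len(contigs),
        "singlets": len(singlets),
        # Inputs of this build that the other build does not contain.
        "censored_inputs": len(inputs - common),
        # Contigs with no member sequence in the common set.
        "exclusive_contigs": sum(1 for m in contigs.values() if not (m & common)),
        # Singlets whose only member is not in the common set (assumption).
        "exclusive_singlets": sum(1 for m in singlets.values() if not (m & common)),
    }
```

Calling `summarize(sgn, tigr)` and `summarize(tigr, sgn)` on two such mappings would yield one column of the table each.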

After normalizing the unigene membership data to compare solely in terms of input sequences common to both builds, we find:

                               SGN      TIGR
Total # of output sequences    29234    30395
Contigs (TCs)                  15034    14432
Singlets                       14200    15963

Since the input sequences have been normalized to a common set at this point, and output sequences that result exclusively from non-common sequences have been removed from consideration, these data suggest that SGN's assembly process is slightly more lenient, allowing more sequences to be assembled into contigs. We find here that 74.5% of SGN's unigene build is identical to TIGR's gene index. Most of the remaining differences turn out to be cases where a contig in SGN is represented in TIGR as one contig and one or more singlets, or vice versa. Investigation of these cases is consistent with the claim above, that SGN's build is biased slightly toward inclusion of sequences into contigs. Although the first table lists 2044 singlets as exclusive to SGN, the number of SGN singlets has dropped by only 878 (from 15078 to 14200), not by 2044, because some contigs have become singlets after non-common input sequences were censored from consideration. The same is true for TIGR's build.
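The normalization step can be illustrated with a short sketch. It assumes the same hypothetical membership mapping (unigene identifier to set of EST identifiers) and assumes that two unigenes count as "identical" when they contain exactly the same common input sequences; the source does not state how the 74.5% figure was computed, so this definition is an assumption. The function names are hypothetical.

```python
from typing import Dict, FrozenSet, Set

# Assumed representation: unigene id -> set of member EST ids.
Build = Dict[str, Set[str]]


def normalize(build: Build, common: Set[str]) -> Set[FrozenSet[str]]:
    """Restrict each unigene to common inputs; drop unigenes left empty."""
    restricted = (frozenset(members & common) for members in build.values())
    return {m for m in restricted if m}


def compare(a: Build, b: Build) -> Dict[str, float]:
    """Compare build `a` against build `b` over their common inputs only."""
    common = set().union(*a.values()) & set().union(*b.values())
    na, nb = normalize(a, common), normalize(b, common)
    identical = na & nb  # unigenes with exactly the same membership
    return {
        "a_outputs": len(na),
        # A contig reduced to a single common member is counted as a singlet.
        "a_contigs": sum(1 for m in na if len(m) >= 2),
        "a_singlets": sum(1 for m in na if len(m) == 1),
        "identical": len(identical),
        "fraction_identical_a": len(identical) / len(na) if na else 0.0,
    }
```

Note how a contig whose members are mostly non-common can shrink to a singlet under `normalize`, which is why the singlet counts do not fall by the number of exclusive singlets.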

Since the Lycopersicon combined build and TIGR's tomato gene index contain data from three different Lycopersicon species, it is useful to look at the number of unigenes specific to Lycopersicon hirsutum and Lycopersicon pennellii, which ought to show substantial allelic variation relative to the species dominantly represented in the input data, Lycopersicon esculentum.

                                     SGN     TIGR
hirsutum specific contigs            94      157
pennellii specific contigs           147     113
hirsutum/esculentum mixed contigs    1908    1863
pennellii/esculentum mixed contigs   6552    6624

These data indicate that both TIGR's and SGN's assembly processes allow sequences that differ by small evolutionary divergence, as well as by sequencing errors, to be assembled into the same contig. It is not clear from these data whether or not orthologs are specifically isolated in the assembly. Neither assembly process at this time contains specific steps for isolating orthologs from paralogs in cross-species assemblies. This question cannot be completely settled in silico.
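A species breakdown of this kind could be tallied as in the sketch below. It assumes a hypothetical lookup `species_of` from EST identifier to species name alongside the membership mapping used for each build; the label format (species names joined with "/") is chosen for illustration only and is not taken from either project.

```python
from collections import Counter
from typing import Dict, Set


def species_profile(build: Dict[str, Set[str]], species_of: Dict[str, str]) -> Counter:
    """Count contigs by the set of species contributing member sequences."""
    counts: Counter = Counter()
    for members in build.values():
        if len(members) < 2:
            continue  # singlets are not contigs
        species = {species_of[est] for est in members}
        # e.g. "hirsutum" (species specific) or "esculentum/hirsutum" (mixed)
        counts["/".join(sorted(species))] += 1
    return counts
```

A contig labeled with a single species name corresponds to a "specific" row in the table, and a label naming two species corresponds to a "mixed" row.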

In conclusion, comparing TIGR's gene index with SGN's Lycopersicon combined unigene build indicates that each procedure confirms the predictions of the other in most cases. Differences are observed, but most are attributable to differences in the inputs to the two processes. The reader is reminded that the data above attempt to characterize differences in the outputs of two separate processes without being able to control for differences in their inputs. Thus, the conclusive power of the analysis is limited.