SGN Assembly Process Version 2

ESTs are preclustered with a custom-developed tool that coarsely identifies strong sequence overlaps. This produces a set of pairwise scores that are used in transitive closure clustering, implemented as a graph algorithm using depth-first search.

In graph-theoretic terms, the sequences are the nodes of a graph. An undirected edge between two nodes indicates a detected overlap between the sequences those nodes represent. Edges may be weighted to indicate the strength of the overlap. The connected components of the graph are discovered by depth-first search, yielding a depth-first "forest" in which each tree spans one sequence cluster.
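The clustering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SGN implementation: the function name cluster_by_overlap, the (seq_a, seq_b, score) tuple format, and the example EST names are all hypothetical.

```python
from collections import defaultdict

def cluster_by_overlap(sequence_ids, overlaps):
    """Group sequences into clusters (connected components) by depth-first search.

    sequence_ids: iterable of sequence names (the graph nodes).
    overlaps: iterable of (seq_a, seq_b, score) tuples (weighted undirected edges).
    Both input formats are assumptions made for this sketch.
    """
    # Build an undirected, weighted adjacency map.
    adjacency = defaultdict(dict)
    for seq_a, seq_b, score in overlaps:
        adjacency[seq_a][seq_b] = score
        adjacency[seq_b][seq_a] = score

    visited = set()
    clusters = []
    for start in sequence_ids:
        if start in visited:
            continue
        # Each unvisited start node roots one tree of the depth-first forest.
        members = []
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            members.append(node)
            # Push unvisited neighbors; the most recently pushed is explored first.
            stack.extend(n for n in adjacency[node] if n not in visited)
        clusters.append(sorted(members))
    return clusters


example_ids = ["est1", "est2", "est3", "est4", "est5", "est6"]
example_overlaps = [("est1", "est2", 120), ("est2", "est3", 95), ("est4", "est5", 200)]
example_clusters = cluster_by_overlap(example_ids, example_overlaps)
```

An explicit stack is used instead of recursion so that very large clusters do not run into Python's recursion limit. A sequence with no detected overlaps, such as est6 here, ends up in a cluster of its own.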

Articulation points in the graph (nodes whose removal disconnects their cluster) are discovered by analyzing the "tree edge" and "back edge" classification of edges from the depth-first search. Nodes identified as articulation points are potentially chimeric sequences, and their overlaps are analyzed further for adjacent but distinct homology regions. Sequences with adjacent but distinct homology regions are considered likely to be chimeric and are discarded. Since the discarded sequence is an articulation point, removing it breaks the cluster into two or more separate clusters, as expected.
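One standard way to find articulation points from tree edges and back edges is the discovery-time and "low" value method. The sketch below assumes that approach; the source does not say which variant SGN uses. The function name, the adjacency-dict format, and the example sequence names are hypothetical, and the graph is assumed to have at most one edge between any pair of nodes.

```python
def find_articulation_points(adjacency):
    """Return the set of articulation points of an undirected graph.

    adjacency: dict mapping each node to an iterable of its neighbors
    (an assumed format; every node must appear as a key).
    """
    discovery = {}   # order in which each node was first visited
    low = {}         # earliest discovery time reachable via tree edges plus one back edge
    articulation = set()
    counter = [0]

    def visit(node, parent):
        discovery[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in adjacency[node]:
            if neighbor == parent:
                continue
            if neighbor in discovery:
                # Back edge: neighbor was already visited earlier in the search.
                low[node] = min(low[node], discovery[neighbor])
            else:
                # Tree edge: neighbor is first reached through this node.
                children += 1
                visit(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # The subtree under neighbor cannot reach above node,
                # so removing node would cut that subtree off.
                if parent is not None and low[neighbor] >= discovery[node]:
                    articulation.add(node)
        # A root is an articulation point only if it has several DFS children.
        if parent is None and children > 1:
            articulation.add(node)

    for node in adjacency:
        if node not in discovery:
            visit(node, None)
    return articulation


# Two tightly overlapping groups joined only through one bridging sequence.
example_graph = {
    "geneA_1": ["geneA_2", "bridge"],
    "geneA_2": ["geneA_1", "bridge"],
    "bridge": ["geneA_1", "geneA_2", "geneB_1", "geneB_2"],
    "geneB_1": ["bridge", "geneB_2"],
    "geneB_2": ["bridge", "geneB_1"],
}
candidates = find_articulation_points(example_graph)
```

Here only "bridge" is reported, which makes it the candidate for chimera analysis. The recursive form is kept for readability; clusters with thousands of sequences would need an iterative version or a raised recursion limit.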

The resulting clusters are supplied as input, together with base-calling quality scores, to the CAP3 assembly program. We have used the following parameters (for the Lycopersicon combined build):

CAP3 option   default value   value used   description
-e            30              5000         "extra" number of observed differences
-s            900             401          minimum similarity score for an overlap
-p            75              90           percent identity required for overlap
-d            200             10000        maximum allowed sum of quality scores of mismatched bases in overlaps
-b            20              60           quality score threshold for scoring a base mismatch

Please see the documentation for CAP3 for further information on other parameters (which are left to default values) and complete descriptions of the above.
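As a rough illustration, the parameter values listed above could be turned into a CAP3 command line as follows. This sketch assumes the usual invocation form "cap3 File_of_reads [options]", in which CAP3 looks for quality scores in a companion file named File_of_reads.qual; check the CAP3 documentation for your version. The helper name build_cap3_command and the file name cluster_0001.fasta are hypothetical, and the code only builds the argument list without running CAP3.

```python
# Values from the "value used" column of the parameter list.
CAP3_OPTIONS = {"-e": 5000, "-s": 401, "-p": 90, "-d": 10000, "-b": 60}

def build_cap3_command(reads_path, options=CAP3_OPTIONS, executable="cap3"):
    """Return the CAP3 argument list for one cluster's FASTA file.

    reads_path: path to the cluster's reads (hypothetical file layout).
    The result can be passed to subprocess.run() on a system where
    CAP3 is installed.
    """
    command = [executable, reads_path]
    for flag, value in options.items():
        command.extend([flag, str(value)])
    return command


example_command = build_cap3_command("cluster_0001.fasta")
example_command_line = " ".join(example_command)
```

Keeping the options in one dictionary makes it easy to run the same cluster with a different "-p" value while the other settings stay fixed.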

The point here is to restrict or eliminate the effect of the "-e", "-s", "-d", and "-b" options, leaving "-p" in the driver's seat. This makes the decision to assemble or not assemble easily interpretable. The other parameters are attempts to introduce finer discrimination than the percent identity of a detected overlap alone. However, our experience has shown that these parameters, at default or similar settings, yield arbitrary assemblies and override the most intuitive measure, the percent identity in an overlap. Preliminary experiments indicate that "-p" is the most useful option for controlling CAP3's behavior, but its effects are only noticeable when the other overlap assessment options are effectively disabled.