I have results! What do they mean?


To understand the output from GEvo, let's break down the image into its parts. First, let's take a look at the graphical representation of one genomic region:

Genomic regions are broken into two halves -- the top half for the top strand of duplexed DNA and the bottom half for the bottom strand of DNA. The dashed line in the middle represents the division between the top and bottom strands.


Genomic features, such as gene models, are drawn closest to the center dashed line as arrows pointing in the 5'->3' direction. Thus, genes on the top and bottom strands point to the right and left, respectively. In the example shown above, gene models are drawn as a set of overlapping arrows. The gene is the red arrow, on top of which is drawn the mRNA (blue arrow), on top of which are drawn the CDS (coding) regions (green or yellow arrows). In the above example, you will notice that one gene has its CDS regions colored yellow instead of green. This is because this gene (At1g01030) was used to retrieve the genomic region and was highlighted yellow to provide a visual cue for identifying it quickly.


HSPs, the generic term we use for identified regions of sequence similarity, are displayed as colored, numbered boxes. In the above example, they are pink boxes with a number inside. The number is used to identify the corresponding partner of the HSP in another genomic region. If the HSP is in the "++" orientation, it is drawn on the top strand, and if the HSP is in the "+-" orientation, it is drawn on the bottom strand. In the above example, all the HSPs are in the "++" orientation and are drawn on the top strand.


Sometimes a gene is represented by multiple sets of arrows instead of only one. This happens when two or more genomic features overlap on the same strand (such as splice variants of transcripts). When this happens, they may be drawn in such a way as to show both features. This is achieved by dividing the height of the pictorial representation of the genomic features by the number of overlapping features. Although this works well for a small number of overlapping features, it can become difficult to visualize if more than 5 features overlap the same region. You can turn this behavior on and off for genomic features in the "Image Parameters". However, overlapping HSPs will always be drawn with this behavior on so that you will not miss any HSPs.
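To make the height-splitting idea concrete, here is a minimal Python sketch. It is not GEvo's actual drawing code; the function name `sub_track_layout` and the pixel values are hypothetical and serve only to illustrate why many overlapping features become hard to see.

```python
def sub_track_layout(track_height, n_overlapping):
    """Split one strand's drawing height evenly among overlapping features.

    Returns a list of (y_offset, height) pairs, one per feature.
    Illustrative only: names and units are assumptions, not GEvo internals.
    """
    if n_overlapping < 1:
        raise ValueError("need at least one feature")
    height = track_height / n_overlapping
    return [(i * height, height) for i in range(n_overlapping)]


# Three overlapping features sharing a (hypothetical) 30-pixel track
layout = sub_track_layout(30, 3)
```

With a 30-pixel track, three overlapping features each get 10 pixels, while six would get only 5 pixels each, which is why the display degrades as the overlap count grows.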

Now, let us take a look at a completed analysis (using blastz) for a small syntenic region between two Arabidopsis chromosomes centered on genes At1g01030 and At4g01500 with an additional 5000 nucleotides left and right of each gene:

From the image above, you can see all the settings used to generate the results. The results show that the region around At1g01030 contains two genes on the bottom strand, and the region around At4g01500 contains four genes on the top strand (however, the two without coding regions are annotated as a retrotransposon and a pseudogene). Also, the results show two HSPs (labeled 1 and 2) that cover two genes in each region. Since the HSPs are on the bottom strand, they are in the "+-" orientation. Since these are small regions and the results are simple, having the regions inverted does not significantly impede our ability to interpret the results. However, if the results are more complicated, having all genomic regions in the same orientation can improve our ability to interpret them. To do this, we can open one of the "Sequence Options:" menus, select a region to be "reverse complement", and regenerate our results. (Note: do not refresh the web page to do this. If you do, all the entered data, such as gene names, will be cleared. Instead, just click on the menus to open them, select your options, and hit "Go" again.)
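For readers who want to see what "reverse complement" does to a region, the following Python sketch shows the operation on a sequence string and on feature coordinates. It is only an illustration under simple assumptions (plain A/C/G/T/N sequence, 1-based inclusive coordinates); the function names are hypothetical and are not part of GEvo.

```python
# Translation table for complementing bases (upper and lower case)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_interval(start, end, region_length):
    """Map 1-based inclusive coordinates onto the reverse-complemented region.

    A feature near the left edge ends up near the right edge, and a feature
    that was on the bottom strand now reads along the top strand.
    """
    return region_length - end + 1, region_length - start + 1


flipped_seq = reverse_complement("ATGC")
flipped_pos = flip_interval(1, 3, 10)
```

This is why genes and HSPs swap strands and mirror their positions once a region is reverse complemented.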

So now that we have the regions oriented, what can we learn? First off, we can see that the two genes in the top genomic region have sequence similarity to two genes in the bottom region. This is a good indication that they may be homologs of one another. Also, they are co-linear with respect to one another, which may indicate that these regions are syntenic as well (i.e. derived from a common ancestral genomic region). However, we can still ask many questions: What are these genes? How similar are the HSPs? Are there conserved non-coding sequences? Is this really a syntenic region? Let's tackle each of these questions individually!

What are these genes? How similar are the HSPs?


GEvo's result visualizer used in these examples is known as "Gobe". The meaning of this name is a secret known only to its programmer, Dr. P. It is a flashy embedded web application written for Adobe's Flash Player. Using Gobe gives you access to higher resolution information about parts of your analysis. For example, if you want to know more about any genomic feature or HSP, simply click on it and information will appear in the info box:

When you click on an HSP, besides showing information in the info box, Gobe will also draw a line connecting the clicked HSP to its partner. This can be extremely useful for quickly identifying HSP pairs when there are many HSPs or when multiple sequences are compared.

There are many options for generating connecting lines for HSPs, but we will cover those later with more complex examples. In the meantime, note that there are links in the information box for "full annotation". These will take you to other pages in CoGe that give you more information about the genomic feature. For example, if you click on the link for a gene, you will go to CoGe's genomic feature information page, FeatView, and if you click on the link for an HSP, you will go to CoGe's HSP information page, HSPView:

Are there conserved non-coding sequences?


In case you aren't familiar with them, conserved non-coding sequences (CNS) are evolutionarily conserved DNA sequences that are usually found around genes and, although most have not been characterized, are putative gene regulatory elements. Since conservation implies function, their identification is often useful for many scientists ranging from those interested in genome evolution and structure to those interested in gene regulation. One thing to keep in mind is that CNSs are different in different organisms. Animal CNSs tend to be large (>100nt) and conserved over large evolutionary distances (humans to fish ~400my) while plant CNSs are small and vanish quickly over evolutionary time (~25nt; ~60my).


In the example above, you will notice that HSP 2 is large, extending well beyond the 3' and 5' ends of both genes. This may mean that there are CNSs, but that blastz's optimizations for finding long blocks of conserved sequence are obscuring the finer details of sequence similarity. To see this detail, we need to change alignment algorithms. Blastn is one that works well for this kind of task:

Now, you can see that the results are becoming more complicated. There are many more HSPs and the HSPs are much smaller. Although they are labeled with numbers, finding which ones pair together is becoming more difficult. However, you can still clearly see that two genes from each region pair with one another, and that the gene pair in yellow has HSPs that are clearly in non-coding sequence. To make this clearer, we can mask the coding sequence from our analysis and only look at non-coding sequence:
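As a rough illustration of what masking means, the Python sketch below replaces the bases inside a set of CDS intervals with a placeholder character so that an alignment program cannot match them. The function name, the 1-based inclusive coordinates, and the use of "N" as the mask character are assumptions made for this example and may not match how GEvo masks sequence internally.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace bases inside each (start, end) interval with mask_char.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    chars = list(seq)
    for start, end in intervals:
        # Clamp the end so an interval running past the sequence is harmless
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Mask a hypothetical CDS occupying positions 3-5 of a toy sequence
masked = mask_intervals("ACGTACGTAC", [(3, 5)])
```

After masking, any remaining HSPs must fall outside the annotated coding regions, which makes candidate CNSs easier to spot.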


Is this region really syntenic?


In case you are not familiar with the term, synteny is used to describe two or more genomic regions that are descended from the same ancestral region. This can be caused by speciation events (such as the divergence of man and mouse) or by genomic duplication events (such as plant polyploidy). In both cases, the genomes evolve over time, resulting in genomic changes at different scales. Individual nucleotides will change, and non-functional regions randomize faster than functional regions; entire regions may be deleted, duplicated, and/or translocated. Thus, in order to detect synteny, you need to find two or more large genomic regions that share a set of conserved, putatively homologous structures. One example of this, and the hallmark of synteny, is a series of homologous genes between two regions that are also colinear with respect to one another.


In the example shown above, we have two genes that are putative homologs and are colinear -- which means that these regions may be syntenic. To explore this possibility, we need to look at a larger genomic region. Also, since we are going to be examining a large genomic region, blastz is a more appropriate algorithm than blastn because of its optimizations for finding large blocks of conserved sequence. To do this in GEvo, all you need to do is change the alignment algorithm, specify additional sequence to the left and right of the genes, and press "Go":

Wait, I asked for 32kb left and right of both genes. How come the top region is only 45kb?


In this analysis, we are looking at two regions that are approximately 45kb and 65kb, respectively. The top region is shorter than expected because the gene used to center that region, At1g01030 (yellow CDS), is very close to the end of the chromosome. GEvo and CoGe can only get sequence that exists, and automatically reduce the amount of sequence requested to what is available.
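The effect can be pictured with a small Python sketch that clips a requested window to the chromosome boundaries. The function name and the gene coordinates below are made up for illustration (they are not the real coordinates of At1g01030), and this is not CoGe's actual code.

```python
def clip_region(gene_start, gene_end, pad_left, pad_right, chrom_length):
    """Return the (start, end) actually retrievable around a gene.

    Coordinates are 1-based and inclusive; the requested padding is trimmed
    wherever it would run off either end of the chromosome.
    """
    start = max(1, gene_start - pad_left)
    end = min(chrom_length, gene_end + pad_right)
    return start, end


# Hypothetical gene at 11,000-13,000 on a 30 Mb chromosome, 32 kb padding
start, end = clip_region(11_000, 13_000, 32_000, 32_000, 30_000_000)
region_length = end - start + 1
```

In this made-up case the left padding is cut to the 10,999 bases that exist before the gene, so the region is 45,000 bases long rather than the roughly 66,000 that were requested.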

What do these results mean?


Since we are analyzing a larger region than before, there are many more potential regions with sequence similarity. Previously, we had found two HSPs. Now, there are 7. Although it is possible to see that the majority of the HSPs are co-linear by matching their labels between the two regions, the Gobe flash viewer lets us highlight HSPs and draw a line connecting the HSP partners. As an aside, you can "shift-click" to select all the HSPs in a track:

Now the results are easier to interpret. We can see that there are 5 genes in the top region that have sequence similarity to 6 genes in the bottom region, and that most/all of the HSPs are co-linear with respect to the two regions. This result indicates that these genomic regions are syntenic, and are derived from the same ancestral genomic region.
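The visual judgment that "most of the HSPs are co-linear" can also be expressed numerically. The Python sketch below is one possible way to do so, not a method used by GEvo: it sorts HSP pairs by their position in the top region and measures how many of them also appear in increasing order in the bottom region (a longest increasing subsequence). The function name and the positions are hypothetical, and the sketch assumes both regions are already in the same orientation.

```python
from bisect import bisect_left


def collinear_fraction(hsp_pairs):
    """Fraction of HSP pairs that lie on one co-linear chain.

    hsp_pairs: list of (position_in_top_region, position_in_bottom_region),
    for example HSP midpoints. Uses a longest strictly increasing
    subsequence over the bottom-region positions.
    """
    if not hsp_pairs:
        return 0.0
    ordered = [bottom for _, bottom in sorted(hsp_pairs)]
    tails = []  # tails[k] = smallest end value of an increasing run of length k+1
    for value in ordered:
        i = bisect_left(tails, value)
        if i == len(tails):
            tails.append(value)
        else:
            tails[i] = value
    return len(tails) / len(ordered)


# Five hypothetical HSP pairs; one of them is out of order in the bottom region
pairs = [(100, 500), (200, 900), (300, 1500), (400, 1200), (500, 2000)]
fraction = collinear_fraction(pairs)
```

Here four of the five pairs fall on a single increasing chain, giving a fraction of 0.8. A value near 1 is consistent with synteny, while local duplications or repeats tend to lower it.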


Additionally, you will notice that there are two sets of more "interesting" HSP patterns on the left and right of the analysis:

The left region appears to be a partial local duplication of a genic region. This could be due to a local duplication of the gene followed by rapid evolution of one copy (covered by HSP 2), or it could be a conserved domain in anciently related proteins. You might be able to flesh out the story by determining the function of the genes, and you can begin this process by clicking on the genes to get their annotations (all three genes are annotated as having transcription factor activity, and the gene in the lower region with HSP 2 has a transmembrane domain).


The right region shows a homologous pair of genes with an internal repeat. You can investigate this feature in more detail by getting the annotation of these genes (unknown proteins similar to nodulin MtN21), getting the HSP sequence for further analysis, and by getting their protein sequences: