1. Introduction  
2. Installation  
3. Use XAT  
4. Output Format  

1. Introduction

XAT (cross-species alignment tool) is a cross-species cDNA-to-genome alignment tool that works at the nucleotide level. It is designed for three uses: accurate intra-species cDNA-to-genome alignment, fast positioning for cross-species mapping, and gene structure annotation for well-aligned regions that contain no frame-shifting indels.

Technically, XAT incorporates several heuristic techniques used in Blastz and SIM4, and adds some ideas of its own for performance enhancement and statistical testing. It is fast, sensitive and fairly accurate. It is capable of genome-wide alignment at a considerable speed, and can find less conserved regions with statistical reliability. XAT shows that a heuristic algorithm can achieve high speed without losing sensitivity.


2. Installation

2.1 System Requirement  
2.2 Compilation  


2.1 System Requirement

XAT is written in C. It is known to work on i386-Linux, powerpc-AIX and MIPS-IRIX, and should be portable to any POSIX-compatible system. XAT runs in 32-bit environments, but it is recommended to compile XAT on 64-bit systems, where it will be faster and more sensitive. A large amount of memory also helps performance.


2.2 Compilation

On 32-bit systems, type

    make install32

while on 64-bit systems, use

    make install64

XAT can automatically detect the 64-bit compiler options on powerpc-AIX and MIPS-IRIX. If you are using another system, please supply the correct options by modifying the Makefile.

The xat executable will be generated in the directory `bin'. You need to set the environment variable `XAT_CONFIG_FILE' to tell XAT the location of the configuration file. In the sh shell, you can achieve this with

    cd config; export XAT_CONFIG_FILE=`pwd`/xat_config


3. Use XAT

3.1 Invoking  
3.2 Command-Line Options  


3.1 Invoking

XAT can be invoked simply as

    xat <mRNA_seq> <genomic_seq> <output_file>

where <mRNA_seq> and <genomic_seq> are files in multi-sequence FASTA format.


3.2 Command-Line Options

3.2.1 Basic Options  
3.2.2 Advanced Options  
3.2.3 Debug Options  

The full XAT command line is

    xat <mRNA_seq> <genomic_seq> [<output_file>] [<option>=<value>]
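
As a minimal sketch of how such a command line is assembled, the following Python helper builds the argument list from the positional file names and a set of `<option>=<value>` pairs. The function name `build_xat_command` and the file names are hypothetical and are not part of XAT; the helper only formats arguments and does not run the program.

```python
def build_xat_command(mrna_seq, genomic_seq, output_file=None, **options):
    """Return an argument list following the pattern
    xat <mRNA_seq> <genomic_seq> [<output_file>] [<option>=<value>].

    Hypothetical helper for illustration; it is not shipped with XAT.
    """
    argv = ["xat", mrna_seq, genomic_seq]
    if output_file is not None:
        argv.append(output_file)
    # Each option is passed as a single "name=value" token.
    for name, value in options.items():
        argv.append(f"{name}={value}")
    return argv


# File names below are made up for the example.
cmd = build_xat_command("mrna.fa", "genome.fa", "out.xat", result=10, searchflag=2)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` once xat is installed and `XAT_CONFIG_FILE' is set.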


3.2.1 Basic Options

`result=[5]'
The number of best results to output. XAT reports only the several best results rather than all possible ones. Increasing this number slows XAT down, but may allow it to find more gene duplications.

`searchflag=[0]'
Tells XAT the strand of the mRNA sequences. Valid values are 0 (forward), 1 (reverse) and 2 (both). In general, cDNAs are on the forward strand, but for ESTs the strand is sometimes hard to determine. When searchflag is set to 0, only GT-AG splice sites are considered; when set to 1, only CT-AC splice sites are used; when set to 2, both are used.
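
The CT-AC pair used for the reverse strand is the reverse complement of a GT-AG intron boundary. The sketch below illustrates this with a made-up intron sequence. The table `SPLICE_SITES` and the function names are illustrative assumptions about how the three searchflag values map to intron boundaries; they are not taken from the XAT source.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# (first two bases, last two bases) of an intron as read on the genomic
# strand, keyed by searchflag. Assumed mapping, for illustration only.
SPLICE_SITES = {
    0: [("GT", "AG")],
    1: [("CT", "AC")],
    2: [("GT", "AG"), ("CT", "AC")],
}


def intron_allowed(intron, searchflag):
    """Check whether the intron boundaries fit the given searchflag."""
    intron = intron.upper()
    return any(intron.startswith(first) and intron.endswith(last)
               for first, last in SPLICE_SITES[searchflag])


intron = "GTAAGTCCTTTTCAG"          # made-up GT...AG intron
flipped = reverse_complement(intron)  # the same intron seen from the other strand
print(flipped, intron_allowed(flipped, 0), intron_allowed(flipped, 1))
```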

`flevalue=[1e-15]'
The first P-value threshold. This value not only affects the terminal exons but also determines whether the whole cDNA alignment should be discarded. It is not the direct P-value, which can be calculated as flevalue * genome_length. Please read the XAT paper if you are interested in how the P-value is calculated.
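
Reading the sentence above as "the direct P-value equals flevalue multiplied by the genome length", the arithmetic can be sketched as follows. Both this reading and the 3 Gbp genome length are assumptions made for illustration; the XAT paper remains the authoritative description.

```python
def direct_p_value(flevalue, genome_length):
    """Scale the per-position threshold by the genome length.

    Assumed interpretation of 'flevalue * genome_length'; not XAT code.
    """
    return flevalue * genome_length


# Default flevalue with a hypothetical 3 Gbp genome.
p = direct_p_value(1e-15, 3_000_000_000)
print(f"{p:.1e}")
```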

`p1=[1e-6]'
The second P-value threshold. It affects only the terminal exons.

`genesize=[1000000]'
The size of the largest permitted gene. Larger genes cannot be fully aligned by XAT. We recommend not increasing this number; otherwise both speed and specificity will suffer.

`usemap=<list_file>'
Use a list file to specify the query-subject pairs that should be aligned. By default, XAT aligns each cDNA to every genomic sequence, but sometimes you only want to align a cDNA to one specific genomic sequence; this option makes that possible. The list file has a two-column, space-delimited format: in each line, the first column is the query name and the second is the subject name.
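
A list file of this shape can be produced and read back with a few lines of Python. This is a minimal sketch: the helper names and the sequence names `cDNA_1`, `chr1` and so on are hypothetical, and the code only assumes the two-column, space-delimited layout described above.

```python
def format_usemap(pairs):
    """Render (query, subject) pairs as two-column, space-delimited lines."""
    lines = []
    for query, subject in pairs:
        # Names containing whitespace would make the columns ambiguous.
        if any(ch.isspace() for ch in query + subject):
            raise ValueError(f"whitespace in name: {query!r} {subject!r}")
        lines.append(f"{query} {subject}")
    return "\n".join(lines) + "\n"


def parse_usemap(text):
    """Read the list-file text back into (query, subject) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query, subject = line.split()
        pairs.append((query, subject))
    return pairs


pairs = [("cDNA_1", "chr1"), ("cDNA_2", "chr7")]
text = format_usemap(pairs)
print(text, end="")
```

Writing `text` to a file and passing that file name as `usemap=<list_file>' would then restrict the alignment to the listed pairs.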


3.2.2 Advanced Options

`outflag=[4]'
Output format. Valid values are:
`0'
output only the exon and indel information
`1'
output only the alignment of the exons
`2'
output both 0 and 1
`3'
standard output
`4'
standard output with alignment
`5'
cigar format

`matrix=[HUMAN_MOUSE_MATRIX]'
Specifies the scoring matrix used in alignment. In our experience, the HUMAN_MOUSE_MATRIX used in Blastz is the best choice, even for chicken-human or zebrafish-human alignment.

`gapopen=[900]'
Gap-open penalty.

`gapextend=[50]'
Gap-extension penalty.

`band=[50]'
Bandwidth for banded dynamic programming.

`processblock=[2000000]'
Specifies the total length of the cDNA sequences handled in one processing cycle. XAT reads dozens of cDNAs at a time. Given enough memory, this strategy improves speed considerably. XAT writes the alignment results at the end of each processing cycle, so do not worry if XAT outputs nothing for hours.

`remove_at=[1]'
Whether to remove poly-A or poly-T at the terminus. 0: no; 1: yes.

`linkgap=[1]'
Whether to patch break points in the cDNA with full dynamic programming. When this option is set to 1, XAT becomes slower but generally produces a better result, although the option may sometimes introduce more false positives.

`wordtemplate=[1]'
Set word template. Valid values are: 0 (no template), 1 (11/18), 2 (12/19) and 3 (12/12).


3.2.3 Debug Options

These options are mainly used for program debugging. Do not change them even if you know about the XAT algorithm.

`windowsize=[50000]'
Window size in the cDNA-positioning stage.

`wordsize1=[11]'
Set the word size for the first round of seeding. The default seed template is 110100110010101111.
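
Under the conventional reading of a spaced seed, a `1` in the template marks a position that is sampled and a `0` a position that is ignored. The sketch below, which assumes that reading, shows that the default template samples 11 positions over a span of 18, consistent with the 11/18 label of `wordtemplate=1'. The helper name `spaced_word` and both example sequences are made up.

```python
TEMPLATE = "110100110010101111"


def spaced_word(seq, start, template=TEMPLATE):
    """Extract the bases at the '1' positions of the template."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("sequence too short for the template at this offset")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


weight = TEMPLATE.count("1")   # number of sampled positions
span = len(TEMPLATE)           # total length covered by the seed
print(weight, span)

# Two made-up 18-mers that differ only at the '0' positions of the template.
a = "ACGTACGTACGTACGTAC"
b = "ACTTGAGTTTGAAGGTAC"
print(spaced_word(a, 0) == spaced_word(b, 0))
```

Because the two sequences differ only at ignored positions, they yield the same word and would seed the same hit, which is what makes a spaced template more tolerant of mismatches than a contiguous word of the same weight.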

`wordsize2=[5]'
Set the word size for the second round of seeding. Seed template is 11111.

`threshold1=[1600]'
Set the gap-free extension cut-off for the first round of seeding.

`threshold2=[1500]'
Cut-off for the second round.

`xdrop1=[2500]'
Set the X-drop (gapped extension cut-off) for the first round of seeding.

`xdrop2=[1000]'
X-drop for the second round.

`linkweight=[200]?'
The weight for each MSP when linking.

`column=[100]'
Number of columns in the alignment output. This option only affects output formats 1 and 2 (see outflag). In the standard format, alignments are written without line breaks.
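
The effect of a column limit can be pictured as cutting the three alignment rows into blocks of equal width. The sketch below only illustrates that idea; the exact layout XAT uses for formats 1 and 2 is not described in this manual, and the function name `wrap_alignment` is hypothetical.

```python
def wrap_alignment(query, match, subject, column=100):
    """Split three equally long alignment rows into blocks of `column` width."""
    if not (len(query) == len(match) == len(subject)):
        raise ValueError("alignment rows must have the same length")
    blocks = []
    for i in range(0, len(query), column):
        rows = (query[i:i + column], match[i:i + column], subject[i:i + column])
        blocks.append("\n".join(rows))
    # Blocks are separated by an empty line.
    return "\n\n".join(blocks)


wrapped = wrap_alignment("acgtac", "|||| |", "ACGTGC", column=4)
print(wrapped)
```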

`dot=[0]'
Controls whether dot information is output. 0: no; 1: yes.

`save=[0]'
Whether to save the current command-line options to the config file. 0: no; 1: yes.

`diagonal=[20]'
The diagonal range for merging two adjacent alignments into an exon. This option also means that the minimal intron that can be found by XAT is 20 bp.

`checkexon=[1]'
Whether to calculate the statistical significance of the terminal exons. It is strongly recommended to set this option to 1 (yes), because heuristic alignment always generates a number of false positives.


4. Output Format

4.1 Standard Format  
4.2 Cigar Format  


4.1 Standard Format

Here is a man-made example:
 
>mRNA	2910	851	974	+	GENOME	14713	6857	10492	1000
57  851 914 6857    6920    2.33e-28    ->  1	64,	S,
gaggaaacagcagactttagaagcggaagaggccaagaggcggttgaaggagcagtctatcttt
|||||||||||||| |.|.||||| |||||.||||||||| ||||.||||||||||||||||||
GAGGAAACAGCAGAGTCTGGAAGCTGAAGAAGCCAAGAGGAGGTTAAAGGAGCAGTCTATCTTT
54  915 974 10433   10492   4.89e-27    ->  5	17,1,9,1,32,	S,D,S,I,S,
ggtgaccatcgggatga-gaggaagagacccacatgaagaagtcagagtcggaggtggag
|||||||| |||||||| ||||||||| |||| |||||||||||.|||||.|||||||||
GGTGACCAGCGGGATGAAGAGGAAGAG-CCCAGATGAAGAAGTCGGAGTCAGAGGTGGAG
//

Each alignment begins with a `>' line and ends with `//'. The `>' line contains the following fields: mRNA sequence name, mRNA length, mRNA start position, mRNA stop position, strand, genomic sequence name, genomic sequence length, genomic start position, genomic stop position and score.

The following lines report the detailed alignment of each exon. A line starting with a number consists of the number of matched bases, mRNA start position, mRNA stop position, genomic start position, genomic stop position, the first kind of P-value, direction, the number of fragments, the length of each fragment and the type of each fragment. In theory, one can reconstruct the alignment from this line alone; the printed alignments are only there to facilitate inspection.
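
The field lists above translate directly into a small parser. The following Python sketch reads the `>' line and the second exon line of the example; it assumes that fields are separated by arbitrary whitespace and that the score is an integer, as in the example. The names `parse_header`, `parse_exon` and `Exon` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    matches: int
    mrna_start: int
    mrna_stop: int
    gen_start: int
    gen_stop: int
    p_value: float
    direction: str
    fragments: list  # (length, type) pairs, e.g. (17, "S")


def parse_header(line):
    """Parse the '>' line that opens an alignment."""
    f = line[1:].split()
    return {
        "mrna_name": f[0], "mrna_length": int(f[1]),
        "mrna_start": int(f[2]), "mrna_stop": int(f[3]), "strand": f[4],
        "gen_name": f[5], "gen_length": int(f[6]),
        "gen_start": int(f[7]), "gen_stop": int(f[8]),
        "score": int(f[9]),  # assumed integer, as in the example
    }


def parse_exon(line):
    """Parse an exon line; lengths and types are comma-terminated lists."""
    f = line.split()
    count = int(f[7])
    lengths = [int(x) for x in f[8].split(",") if x]
    types = [x for x in f[9].split(",") if x]
    if not (len(lengths) == len(types) == count):
        raise ValueError("fragment count does not match the fragment lists")
    return Exon(int(f[0]), int(f[1]), int(f[2]), int(f[3]), int(f[4]),
                float(f[5]), f[6], list(zip(lengths, types)))


header = parse_header(">mRNA\t2910\t851\t974\t+\tGENOME\t14713\t6857\t10492\t1000")
exon = parse_exon("54  915 974 10433   10492   4.89e-27    ->  5\t17,1,9,1,32,\tS,D,S,I,S,")
print(header["score"], exon.fragments)
```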


4.2 Cigar Format

Cigar stands for Concise Idiosyncratic Gapped Alignment Report. Cigar format is one of the output formats that can be generated by Exonerate. It is also used, in an altered form, in the feature tables of the Ensembl database. It is designed to contain the minimal information necessary to reconstruct an alignment. One alignment is described per line, to allow easy manipulation with UNIX tools.

The example above can be translated into cigar format:

    mRNA 851 974 + GENOME 6857 10492 + 1000 M 64 N 3512 M 17 D 1 M 9 I 1 M 32

In cigar format, this line contains the following fields: query identifier, query start position, query stop position, query strand, target identifier, target start position, target stop position, target strand and score. The remaining fields come in pairs that describe the edit path through the alignment: each pair consists of M, I, D, B or N, corresponding to a Match, Insert, Delete, Break or iNtron, followed by the length.

Note that the standard cigar format does not contain `B', as Exonerate permits no break point in the cDNA sequence. If `linkgap=1' is specified, XAT permits no break point either; if not, `B' will appear, showing that some cDNA fragments are not aligned.
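
A cigar line of this kind can be split into its nine leading fields and the list of operation/length pairs. The sketch below does so for the example line and also checks the intron length against the exon coordinates of the standard-format example (10433 - 6920 - 1). The function name `parse_cigar` is hypothetical, and the parser assumes space-separated fields and an integer score, as in the example.

```python
VALID_OPS = {"M", "I", "D", "B", "N"}  # B is the extension described above


def parse_cigar(line):
    """Split a cigar line into its leading fields and its edit path."""
    f = line.split()
    record = {
        "query": f[0], "query_start": int(f[1]), "query_stop": int(f[2]),
        "query_strand": f[3],
        "target": f[4], "target_start": int(f[5]), "target_stop": int(f[6]),
        "target_strand": f[7],
        "score": int(f[8]),
    }
    ops = f[9:]
    if len(ops) % 2:
        raise ValueError("edit path must consist of operation/length pairs")
    path = [(ops[i], int(ops[i + 1])) for i in range(0, len(ops), 2)]
    for op, _ in path:
        if op not in VALID_OPS:
            raise ValueError(f"unknown operation: {op}")
    record["path"] = path
    return record


rec = parse_cigar("mRNA 851 974 + GENOME 6857 10492 + 1000 "
                  "M 64 N 3512 M 17 D 1 M 9 I 1 M 32")

# Genomic bases strictly between the end of the first exon (6920)
# and the start of the second exon (10433).
intron_length = 10433 - 6920 - 1
print(rec["path"][1], intron_length)
```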


About this document

This document was generated by Heng Li on September, 11 2007 using texi2html
