Gblocks

Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis

Version 0.91b || January 2002
Copyright © Jose Castresana

Institut de Biologia Evolutiva (CSIC-UPF)
Passeig Marítim de la Barceloneta 37, 08003 Barcelona, Spain

Gblocks is a computer program written in ANSI C language that eliminates poorly aligned positions and divergent regions of an alignment of DNA or protein sequences. These positions may not be homologous or may have been saturated by multiple substitutions and it is convenient to eliminate them prior to phylogenetic analysis. Gblocks selects blocks in a similar way as it is usually done by hand but following a reproducible set of conditions. The selected blocks must fulfill certain requirements with respect to the lack of large segments of contiguous nonconserved positions, lack or low density of gap positions and high conservation of flanking positions, making the final alignment more suitable for phylogenetic analysis. Gblocks outputs several files to visualize the selected blocks. The use of a program such as Gblocks reduces the necessity of manually editing multiple alignments, makes the automation of phylogenetic analysis of large data sets feasible and, finally, facilitates the reproduction of the alignments and subsequent phylogenetic analysis by other researchers.

Several parameters can be modified to make the selection of blocks more or less stringent. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more adequate for longer ones. Be aware that the default options of Gblocks are stringent.

The program is available free of charge for Mac OS X, Windows, Linux and some UNIX systems.

The software and its accompanying documentation are provided as is, without guarantee of support or maintenance.


Installation

Mac OS X, Linux and UNIX: Decompress and unpack the downloaded file with the following commands (replacing OS with the system name that appears in the name of your downloaded file):

uncompress Gblocks_OS_0.91b.tar.Z
tar xvf Gblocks_OS_0.91b.tar

Windows: The downloaded file is in zip format. You can un-zip it with different compression/decompression tools.

In all cases you will get the folder "Gblocks_0.91b" with the executable program, example files and the documentation.


Quick start

You can use the alignment file of protein sequences nad3.pir, which is included in the distribution of Gblocks, to understand how the program works. The file is in NBRF/PIR format. Open Gblocks, select the option o (Open File) and enter the name of the file. Accept the default parameters with the option g (Get Blocks). You will obtain two files:

nad3.pir-gb. This is the final alignment containing the positions selected by Gblocks to be used with any tree reconstruction or genetic distance estimation program.

nad3.pir-gb.htm. This file contains the original alignment in HTML format with the selected positions highlighted and a description of the parameters used. It can be viewed with any web browser.


How the method works

Gblocks selects conserved blocks from a multiple alignment according to a set of features of the alignment positions.

First, the degree of conservation of every position of the multiple alignment is evaluated and classified as nonconserved, conserved, or highly conserved. All stretches of contiguous nonconserved positions longer than a certain value are rejected. In such stretches, alignments are normally ambiguous and, even when in some cases a unique alignment could be given, multiple hidden substitutions make them inadequate for phylogenetic analysis. In the remaining blocks, flanks are examined and positions are removed until blocks are surrounded by highly conserved positions at both flanks. This way, selected blocks are anchored by positions that can be aligned with high confidence.
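The first stage described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the code used by Gblocks: conservation is measured as the count of the most frequent residue in a column (no similarity matrix), gap characters are simply not counted, and the function names classify_columns, reject_long_nonconserved and trim_flanks are hypothetical.

```python
from collections import Counter


def classify_columns(seqs, min_conserved, min_flank):
    """Label each column 'N' (nonconserved), 'C' (conserved) or 'H' (highly conserved).

    Assumption: only identities are counted and gaps ('-') are ignored.
    """
    labels = []
    for column in zip(*seqs):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        if top >= min_flank:
            labels.append("H")
        elif top >= min_conserved:
            labels.append("C")
        else:
            labels.append("N")
    return "".join(labels)


def reject_long_nonconserved(labels, max_nonconserved):
    """Return a keep mask that drops runs of 'N' longer than max_nonconserved."""
    keep = [True] * len(labels)
    i = 0
    while i < len(labels):
        if labels[i] != "N":
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == "N":
            j += 1
        if j - i > max_nonconserved:
            for k in range(i, j):
                keep[k] = False
        i = j
    return keep


def trim_flanks(labels, keep):
    """Shrink every kept block until it starts and ends on an 'H' position."""
    keep = list(keep)
    n = len(keep)
    i = 0
    while i < n:
        if not keep[i]:
            i += 1
            continue
        j = i
        while j < n and keep[j]:
            j += 1
        start, end = i, j  # the block is the half-open range [start, end)
        while start < end and labels[start] != "H":
            keep[start] = False
            start += 1
        while end > start and labels[end - 1] != "H":
            keep[end - 1] = False
            end -= 1
        i = j
    return keep
```

In this sketch the two thresholds play the role of the Minimum Number Of Sequences For A Conserved Position and For A Flank Position, and max_nonconserved plays the role of the Maximum Number Of Contiguous Nonconserved Positions.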

Then, all gap positions (which can be defined in three different ways) are removed. Furthermore, nonconserved positions adjacent to a gap position are also eliminated until a conserved position is reached, because regions adjacent to a gap are the most difficult to align. Finally, small blocks remaining after gap cleaning are also removed.
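The gap cleaning step can be illustrated with a short Python sketch. It is a simplified reading of the paragraph above, not the program's actual code. The inputs are assumptions made for the example: labels is a string with one character per position ('N' for nonconserved, anything else for conserved), is_gap is a list of booleans marking gap positions, and keep is a list of booleans marking the positions still selected. The function names are hypothetical.

```python
def clean_gaps(labels, is_gap, keep):
    """Drop gap positions and the nonconserved positions adjacent to them."""
    keep = list(keep)
    n = len(labels)
    for i in range(n):
        if not is_gap[i]:
            continue
        keep[i] = False
        # Walk left until a conserved position (or another gap) is reached.
        j = i - 1
        while j >= 0 and labels[j] == "N" and not is_gap[j]:
            keep[j] = False
            j -= 1
        # Walk right in the same way.
        j = i + 1
        while j < n and labels[j] == "N" and not is_gap[j]:
            keep[j] = False
            j += 1
    return keep


def drop_small_blocks(keep, min_length):
    """Drop runs of kept positions that are shorter than min_length."""
    keep = list(keep)
    i = 0
    while i < len(keep):
        if not keep[i]:
            i += 1
            continue
        j = i
        while j < len(keep) and keep[j]:
            j += 1
        if j - i < min_length:
            for k in range(i, j):
                keep[k] = False
        i = j
    return keep
```

Here min_length corresponds to the Minimum Length Of A Block parameter described in the Block Parameters menu.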

An important requirement for the input alignment is that there are no individual sequences completely misaligned within a very well-conserved block (because of a frame-shifting sequencing error or high divergence in this particular sequence). If there are enough identities among the remaining sequences, Gblocks will consider that the block is conserved and it will be selected. There are methods to detect such misaligned sequences, such as the visualization of low-scoring segments implemented in the ClustalX program, which should be used to ensure that sequences with many of these segments are not included in the alignment.

The following figure is generated by Gblocks in HTML format and shows the blocks selected using default parameters from the alignment nad3.pir, which was created with the program ClustalW 1.7 using default parameters. Conserved positions as defined by the parameters given to Gblocks are highlighted, showing in color the most repeated amino acid, and selected positions are underlined with a heavy blue line. Amino acid colors (used only as a visual guide, and not indicative of any distance matrix) are: lime for A, G, S and T; aqua for P; orange for C; white for D, E, Q and N; yellow for F, W and Y; red for H, K and R; and fuchsia for I, L, M and V.

                         10        20        30        40        50        60
                 =========+=========+=========+=========+=========+=========+
nad3_parde       ------MEYLLQEYLPILVFLGMASALAIVLILAAAVIAVRN--PDPEKVSAYECGFNAF
nad3_acaca       ---------MTLEYIYIFIFFWGAFFISCLLIFLSYFLVYQE--SDIEKNSAYECGFQPF
nad3_allma       --------------MTYLVYIVFTIVLTVGLILVSYLLSQAQ--PDSEKVSAYECGFSPL
nad3_apec        -----------IFNFLTLFVSILIFLITTLITFAAHFLPSRN-TD-SEKSSPYECGFDPL
nad3_arath       ---------MMSEFAPISIYLVISLLVSLILLGVPFPFASNS-STYPEKLSAYECGFDPS
nad3_balca       -------------MNSFLIYLLIAITLSFILSIVGHRLPTRN-MD-QEKLSPYECGFDPQ
nad3_chocr       ------MKLIFTEYSAILIFFAISSLLSSVIFLLSYFLIPQK--PDQEKVSAYECGFNPF
nad3_drome       -------------MFSIIFIALLILLITTIVMFLASILSKKA-LIDREKSSPFECGFDPK
nad3_human       -------------MN-FALILMINTLLALLLMIITFWLPQLN-GY-MEKSTPYECGFDPM
nad3_ktun        -------------MFFVLSLVLFTFLLSLVLLSVSLSLTKKK-MMNREKSSPFECGFDPK
nad3_lter        -------------MILTALSSAIALLVPIIILGAAWVLASRS-TEDREKSSPFECGFDPK
nad3_marpo       -----------MEFAPIFVYLVISLLLSLILIGVSFLFASSSSLAYPEKLSAYECGFDPF
nad3_metse       ---------MYTEFYGILVLLIFSVVLSAIISGASYILGDKQ--PDREKVSAYECGFDPF
nad3_picca       MLNYFVYPYGIENDMGMKFYMMLVPMMSMVLMMINYMMTNKS-DNNMNKTGPYECGFDSF
nad3_podan       -------------MSSMTLFILFVSIIALLFLFINLIFAPHN--PYQEKYSIFECGFHSF
nad3_prowi       ----------MYEFLGILIYFFIALALSLLLLGLPFLVSTRK--ADPEKISAYECGFDPF
nad3_recam       -----MNTMILSEYLSVLIFFIFSFGLSCIILGLSYVLATQN--ADTEKLSPYECGFNPF
                                                                #############



                         70        80        90       100       110       120
                 =========+=========+=========+=========+=========+=========+
nad3_parde       D-DARMKFDVRFYLVSILFIIFDLEVAFLFPWAVSFASLS-DVAFWGLMVFLAVLTVGFA
nad3_acaca       E-DTRSKFNVRYYLIAILFMIFDLEIMYLFPWSISISTGS-FFGVWAIFLFLIILTVGFI
nad3_allma       G-DARQKFDVSFYLIAILFIIFDLEVVFILPFASVIHNVS-LLGGWITIIFLVILTIGFI
nad3_apec        N-SARVPFSFRFFLVAILFLLFDLEIALLFPLPFSVFFH--P--IHTP----LILTVGLI
nad3_arath       G-DARSRFDIRFYLVSILFLIPDLEVTFFFPWAVPPNKID-LFGFWSMMAFLFILTIGFL
nad3_balca       A-SARLPFSLRFFLVAILFLLFDLEIALLLPFPAALSARDPQLSFTLAFLILLILTIGLI
nad3_chocr       D-DARATFDIRFYLVAILFLIFDLEISFLFPWSLVLGEIS-IIGFWSMIVFLVILTIGFI
nad3_drome       S-SSRLPFSLRFFLITIIFLIFDVEIALILPMIIIMKYSNIMIWTITSIIFILILLIGLY
nad3_human       S-PARVPFSMKFFLVAITFLLFDLEIALLLPLPWALQTTNLPLMVMSSLLLIIILALSLA
nad3_ktun        S-SARLPFSMRFFLITVVFLVFDVEIVLLLPYLFSSGWSIDVFSLVGSMMILVILIIGVL
nad3_lter        S-TARIPFSTRFFLLAIIFIVFDIEIVLLMPLPTILHTSDVFTTVTTSVLFLMILLIGLI
nad3_marpo       D-DARSRFDIRFYLVSILFIIFDLEVTFLFPWAVSLNKIG-LFGFWSMMVFLFILTIGFV
nad3_metse       G-TPGRPFSIRFFLIGILFLIFDLEISFLFPWCVVCNQVF-PFGYWTMIVFLAVLTLGLV
nad3_picca       R-QSRTTYSIKFILIAILFLPFDLELTSILPYTLSMYNTN-IYGLFILLYFLLPLIIGFI
nad3_podan       LGQNRTQFGVKFFIFALVYLLLDLEILLTFPFAVSEYVNN-IYGLIILLGFITIITIGFV
nad3_prowi       D-DARGRFDIQFYLVAILFIIFDLEVAFLFPWALTLNKIG-YFGFWSMMLFLFILTVGFI
nad3_recam       D-DARGAFDVRFYLVAILFIIFDLEVAFLFPWAVALSDVT-IFGFWTMFIFLLILTVGFI
                    ############################                      #######



                        130       140       150
                 =========+=========+=========+===
nad3_parde       YEWKKGALEWA----------------------
nad3_acaca       YEWQKGALEWD----------------------
nad3_allma       YEFVSGAITDSF---------------------
nad3_apec        FEWVQGGLDWAE---------------------
nad3_arath       YEWKRGASDRE----------------------
nad3_balca       YEWMEGGLEWAE---------------------
nad3_chocr       YEWYKGALEWE----------------------
nad3_drome       HEWNQGMLNWSN---------------------
nad3_human       YEWLQKGLDWTE---------------------
nad3_ktun        HEWSEGSLEWFSSSN------------------
nad3_lter        HEWKEGSLDWSS---------------------
nad3_marpo       YEWKKGALDWE----------------------
nad3_metse       YEWLKGGLEWE----------------------
nad3_picca       IEINTKAIYMTKMFNRNVKSMTSYVKYNNKI--
nad3_podan       YELGKSALKIDSRQVITMTRFNYSSTIEYLGKI
nad3_prowi       YEWRKGALDWS----------------------
nad3_recam       YEWKKGALDWE----------------------
                 ########                         


Available options

Main menu

******************************************************
                    GBLOCKS 0.91b                     
SELECTION OF CONSERVED BLOCKS FROM MULTIPLE ALIGNMENTS
        FOR THEIR USE IN PHYLOGENETIC ANALYSIS        
******************************************************

CURRENT FILE: nad3.pir
t. Type Of Sequence: Protein

o. Open File…

b. Block Parameters…

s. Saving Options…

g. Get Blocks

q. Quit


Your Choice: 

t. specifies the type of sequences in the current alignment. It can be Protein, DNA or Codons. In protein alignments, the 20 amino acid letters are used to calculate the degree of conservation of positions, and it is possible to invoke the use of a similarity matrix. In DNA and codon alignments, only A, C, G, T and U letters are considered (other symbols are allowed but do not count in the calculations). In codon alignments, selected blocks are made to contain only complete codons (if the alignment is really based on codons).

o. opens a file. NBRF/PIR and FASTA formats are accepted (see the specifications of the formats below). There is no limit for the number of sequences or positions in the alignment as long as there is enough memory available for the program.
In NBRF/PIR-formatted alignments, the specification in the first line is used by Gblocks to define the type of sequence. In FASTA-formatted alignments, the type of sequence is always assigned to Protein. In both cases this can be changed with the option t (Type Of Sequence).
Gap-only positions are eliminated before the analysis. Note that, if there are gap-only positions, the selected block positions refer to the alignment without gap-only positions, not to the original alignment.
When entering the filename through drag and drop, which is possible in some systems, the last blank space must be removed if it was introduced by the system.

Pathnames: It is also possible to enter a file with the pathnames of a batch of files to be processed. This file must have one path (absolute or relative to the directory where Gblocks resides) per line. You can see the example file called paths included in the distribution, which has the paths to the files in the folder more_alignments. If all alignments in a pathnames file have the same number of sequences and the sequences in all alignments are in the same relative order, then it is possible to concatenate the resulting files (see below in extended saving options). If the alignments in a pathnames file do not have the same number of sequences, they can be processed but they cannot be concatenated.

b. displays the Block Parameters menu (see below).

s. displays the Saving Options menu (see below).

g. asks the program to proceed with the calculations. The program doesn't quit after a calculation so that it is possible to make other calculations with different parameters or to open a new file and re-use the previous options.

q. quits the program.
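As noted under option o, gap-only positions are removed before the analysis, so reported block positions refer to the reduced alignment. The following Python sketch shows one way to remove such columns while keeping a map back to the original coordinates. It is an illustration only; the function name drop_gap_only_columns and the returned index list are assumptions made for the example and are not part of Gblocks.

```python
def drop_gap_only_columns(seqs):
    """Remove columns made only of '-' and return (cleaned, kept).

    kept[j] is the 0-based index, in the original alignment, of
    column j of the cleaned alignment.
    """
    kept = [
        i for i, column in enumerate(zip(*seqs))
        if any(residue != "-" for residue in column)
    ]
    cleaned = ["".join(seq[i] for i in kept) for seq in seqs]
    return cleaned, kept
```

With the kept list, a position reported on the reduced alignment can be translated back to the original alignment by a simple lookup.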

Block Parameters menu

BLOCK PARAMETERS

1. Minimum Number Of Sequences For A Conserved Position: . 9
2. Minimum Number Of Sequences For A Flank Position: ..... 14
3. Maximum Number Of Contiguous Nonconserved Positions: .. 8
4. Minimum Length Of A Block: ............................ 10
5. Allowed Gap Positions: ................................ None

r. Restore Defaults
g. Get Blocks

z. Extended Block Options
m. Go To Main Menu


Your Choice: 

1. sets the Minimum Number Of Sequences For A Conserved Position, i.e. it sets the threshold for the definition of conserved positions. This value must be bigger than half the number of sequences. Bigger values of this parameter DECREASE the selected number of positions.

2. sets the Minimum Number Of Sequences For A Flank Position, i.e. it sets the threshold for the definition of flank positions. This value must be bigger than or equal to the Minimum Number Of Sequences For A Conserved Position. Bigger values of this parameter DECREASE the selected number of positions.

3. sets the Maximum Number Of Contiguous Nonconserved Positions. All segments with contiguous nonconserved positions bigger than this value are rejected. Bigger values of this parameter INCREASE the selected number of positions.

4. sets the Minimum Length Of A Block after gap cleaning. Blocks smaller than this value after gap cleaning are rejected. Bigger values of this parameter DECREASE the selected number of positions.

NOTE: In an older version of the program (0.73b) there were two parameters to modulate the length of the final blocks. The parameter to which the current menu option corresponds is Minimum Length Of A Block After Gap Cleaning, which is the most crucial. The other parameter, Minimum Length Of An Initial Block, is given in this version of the program the same value as Minimum Length Of A Block. However, if you want to reproduce exactly the same results obtained with the older version, you can give any value to Minimum Length Of An Initial Block through the command line (see below). You can view the values of both minimum length parameters with the Save Short option of Results and Parameters File in the Saving Options menu (see below).

5. toggles among three different possibilities for treating gap positions:
None: no gap positions are allowed in the final alignment. All positions with a single gap or more are treated as a gap position for the block selection procedure, and they and the adjacent nonconserved positions are eliminated.
With Half: only positions where 50% or more of the sequences have a gap are treated as a gap position. Thus, positions with a gap in less than 50% of the sequences can be selected in the final alignment if they are within an appropriate block.
All: all gap positions can be selected. Positions with gaps are not treated differently from other positions.

r. restores defaults for all parameters. These defaults are considered to be most appropriate for moderately conserved protein alignments. For rDNA-like alignments, which contain many small but very well-conserved blocks, it is advisable to manually set Minimum Length Of A Block to a smaller value, such as 5 (instead of 10, which is the default).

g. asks the program to proceed with the calculations. This option does the same here as in the Main Menu. Using this option together with the Generic File Extension option (available in the extended menu) allows you to test several parameters without leaving the Block Parameters menu.

z. displays additional options in the Block Parameters menu.

m. goes to the Main Menu.
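The three settings of option 5 (Allowed Gap Positions) can be expressed as a small Python predicate. This is a sketch of the rules as worded above, with a hypothetical function name, and it assumes that a gap is written as a hyphen and that a column is given as a string with one character per sequence.

```python
def is_gap_position(column, allowed_gaps):
    """Decide whether a column is treated as a gap position.

    allowed_gaps is one of the menu values "None", "With Half" or "All".
    """
    gaps = column.count("-")
    if allowed_gaps == "None":
        # A single gap is enough.
        return gaps >= 1
    if allowed_gaps == "With Half":
        # 50% or more of the sequences must have a gap.
        return 2 * gaps >= len(column)
    if allowed_gaps == "All":
        # Columns with gaps are treated like any other column.
        return False
    raise ValueError("unknown setting: " + repr(allowed_gaps))
```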

As explained above, it is possible to enter a file with the pathnames of a batch of files to be processed. In this case, the first two parameters, which are related to the number of sequences, are set to defaults for every alignment, so that alignments with different numbers of sequences can be processed. If one of these parameters is changed, then all alignments should have the same number of sequences; otherwise, if an alignment with a different number of sequences is found, the program will stop.

BLOCK PARAMETERS

1. Minimum Number Of Sequences For A Conserved Position: . Default
2. Minimum Number Of Sequences For A Flank Position: ..... Default
3. Maximum Number Of Contiguous Nonconserved Positions: .. 8
4. Minimum Length Of A Block: ............................ 10
5. Allowed Gap Positions: ................................ None

r. Restore Defaults
g. Get Blocks

z. Extended Block Options
m. Go To Main Menu


Your Choice: 
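The per-alignment defaults for the first two parameters follow the rules listed in the command line parameters section: 50% of the number of sequences + 1 for a conserved position, and 85% of the number of sequences for a flank position. A minimal Python sketch of that arithmetic is given below. The rounding (integer division, that is, rounding down) and the final adjustment that keeps the flank threshold from falling below the conserved threshold are assumptions of this example; they reproduce the values 9 and 14 shown above for the 17 sequences of nad3.pir, but the exact rounding used by Gblocks is not stated here. The function name is hypothetical.

```python
def default_sequence_thresholds(n_sequences):
    """Return assumed defaults (min_conserved, min_flank) for an alignment."""
    min_conserved = n_sequences // 2 + 1       # 50% of the sequences + 1
    min_flank = (85 * n_sequences) // 100      # 85% of the sequences, rounded down
    # Assumption: the flank threshold is never below the conserved threshold.
    min_flank = max(min_flank, min_conserved)
    return min_conserved, min_flank
```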

Additional options in the Block Parameters menu (they are in effect also if the menu is not extended):

6. Use Similarity Matrix: ................................ Yes
e. Generic File Extension: ............................... -gb

6. toggles between using or not using a similarity matrix to define the degree of conservation in conserved positions and flank positions in protein alignments. However, a position needs to have a number of identities bigger than half the number of sequences to start adding more values from similar amino acids. The similarity matrix used by Gblocks is derived from the Gonnet 120 matrix.

e. sets the generic file extension to be added to output files. It must have 5 characters maximum. You can change this extension when testing different parameters to avoid overwriting files. Another extension specific for each output file is also added. This option is also present in the Saving Options menu.

Saving Options menu

SAVING OPTIONS

s. Selected Blocks: ...................................... Save
p. Results And Parameters File: .......................... Save
e. Generic File Extension: ............................... -gb

z. Extended Saving Options
m. Go To Main Menu


Your Choice: 

s. toggles between saving or not saving the alignment file with the selected blocks in NBRF/PIR or FASTA format, depending on the format of the input alignment. The file receives the extension -gb (or whatever is entered in the Generic File Extension option).

p. toggles among saving an HTML file, saving a text file, saving a short text file or not saving any of them. With the first two options the original file is shown with the selected blocks underlined and, in the HTML file, with colored conserved positions (if you do not see these colors in your browser you may need to activate JavaScript and style sheets in the preferences). The parameters used and the flank positions of the selected blocks are also written in these files. These files receive the extension -gb.htm, -gb.txt or -gb.txts, respectively. Residue coloring in the HTML file is explained above.

e. sets the generic file extension to be added to output files. It must have 5 characters maximum. You can change this extension when testing different parameters to avoid overwriting files. Another extension specific for each output file is also added. This option is also present in the Block Parameters menu.

z. displays additional options in the saving menu.

m. goes to the Main Menu.

Additional options in the Saving Options menu (they work also if the menu is not extended):

v. Characters Per Line In Results And Parameters File: ... 60
n. Nonconserved Blocks: .................................. Don't Save
u. Ungapped Alignment: ................................... Don't Save
k. Mask File With The Selected Blocks: ................... Don't Save
d. Postscript File With The Selected Blocks: ............. Don't Save

v. sets the number of characters per line in the alignment shown in the Results And Parameters File.

n. toggles between saving or not saving the alignment file with the blocks NOT selected, in NBRF/PIR or FASTA format (i.e., the complementary of the selected blocks). The file receives the extension -gbComp.

u. toggles between saving or not saving the alignment file in NBRF/PIR or FASTA format where only gap positions (i.e. positions with at least one gap) have been removed. The file receives the extension -- (two hyphens).

k. toggles between saving or not saving a file that can be read by the program SeqPup, where conserved positions as defined by Gblocks are shadowed and selected blocks underlined. The file receives the extension -gbMask. When viewing this file with SeqPup the "View" pop-up menu in the alignment window must be in "Select mask 1".

d. toggles between saving or not saving a postscript file that shows schematically the selected blocks. This file is useful to quickly view the position and distribution of the selected blocks. The file receives the extension -gbPS. You need a postscript viewer or editor to view this file.

As explained above it is possible to enter a file with the pathnames of a batch of files to be processed. In this case, three extra saving options allow you to concatenate the original, the ungapped and the Gblocks alignments. However, this should only be used if all alignments have the same number of sequences (see also the warning below). If an alignment with a different number of sequences is found the program will stop.

a. Concatenated Blocks From Alignments In Batch: ......... Don't Save
c. Concatenated Input Alignments In Batch: ............... Don't Save
w. Concatenated Ungapped Alignments In Batch: ............ Don't Save

a. toggles between saving or not saving the concatenated blocks selected from all the alignments in a batch file in NBRF/PIR or FASTA format. The file receives the name paths-gb.seq (where paths is the name of the pathnames file and -gb the Generic File Extension).

c. toggles between saving or not saving all the concatenated input alignments in a batch file, without any other processing, in NBRF/PIR or FASTA format. The file receives the name paths.seq (where paths is the name of the pathnames file).

w. toggles between saving or not saving all the concatenated ungapped alignments in a batch file in NBRF/PIR or FASTA format. The file receives the name paths--.seq (where paths is the name of the pathnames file).

WARNING. All alignments in a pathnames file must have the same number of sequences and sequences must be in the same relative order in order to concatenate the results. Sequence names of the last alignment in the pathnames file will be used for the concatenated files. If you prepare a batch of unaligned sequences with the same relative order you should assure that your alignment program doesn't change the order of the sequences. For example, in ClustalW or ClustalX you must set in the Output Format Options that the output order is INPUT (and not ALIGNED, which is the default). Otherwise the concatenated files will be scrambled.
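The conditions in this warning can be made concrete with a small Python sketch of concatenation. It is not the code used by Gblocks. It assumes that each alignment is given as a list of (name, sequence) pairs in file order, and the function name is hypothetical. As in the warning, the names of the last alignment are used for the result, and sequences are joined purely by position, which is why a changed order scrambles the output.

```python
def concatenate_alignments(alignments):
    """Join alignments sequence by sequence, in the order they appear."""
    counts = {len(alignment) for alignment in alignments}
    if len(counts) != 1:
        raise ValueError("all alignments must have the same number of sequences")
    # Names are taken from the last alignment in the batch.
    names = [name for name, _ in alignments[-1]]
    joined = [
        "".join(alignment[i][1] for alignment in alignments)
        for i in range(len(names))
    ]
    return list(zip(names, joined))
```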


Command line parameters

All program parameters can be entered in the command line. The first parameter must always be the name of the alignment file or the pathnames file. If this is the only given parameter, the menu of the program is activated. For example:

Gblocks nad3.pir

The next parameters are entered according to the letter of the corresponding menu item. They can be entered in any order. The list of all parameters is:

PARAMETER NAME | MEANING (Default) | ALLOWED VALUES
(None) | Filename (No default) | Alignment or pathnames file
-t= | Type Of Sequence (Protein, DNA, Codons) | p, d, c
-b1= | Minimum Number Of Sequences For A Conserved Position (50% of the number of sequences + 1) | Any integer bigger than half the number of sequences and smaller than or equal to the total number of sequences
-b2= | Minimum Number Of Sequences For A Flank Position (85% of the number of sequences) | Any integer equal to or bigger than Minimum Number Of Sequences For A Conserved Position
-b3= | Maximum Number Of Contiguous Nonconserved Positions (8) | Any integer
-b4= | Minimum Length Of A Block (10) | Any integer equal to or bigger than 2
-b5= | Allowed Gap Positions (None, With Half, All) | n, h, a
-b6= (Only available for protein alignments; only visible in the extended block parameters menu) | Use Similarity Matrices (Yes, No) | y, n
-b0= (This option does not appear in the menu) | Minimum Length Of An Initial Block (Same as Minimum Length Of A Block) | Any integer equal to or bigger than 2
-s= | Selected Blocks (Yes, No) | y, n
-p= | Results And Parameters File (Yes, Text, Short Text, No) | y, t, s, n
-v= (Only visible in the extended saving options) | Characters Per Line In Results And Parameters File (60) | Any integer bigger than 50
-n= (Only visible in the extended saving options) | Nonconserved Blocks (Yes, No) | y, n
-u= (Only visible in the extended saving options) | Ungapped Alignment (Yes, No) | y, n
-k= (Only visible in the extended saving options) | Mask File With The Selected Blocks (Yes, No) | y, n
-d= (Only visible in the extended saving options) | Postscript File With The Selected Blocks (Yes, No) | y, n
-a= (Only available with paths files) | Concatenated Blocks From Alignments In Batch (Yes, No) | y, n
-c= (Only available with paths files) | Concatenated Input Alignments In Batch (Yes, No) | y, n
-w= (Only available with paths files) | Concatenated Ungapped Alignments In Batch (Yes, No) | y, n
-e= | Generic File Extension (-gb) | Any string with 5 or fewer characters

For example, the following command line will open an alignment file, force the alignment type to Protein, change the Generic File Extension, set the Minimum Length Of A Block to 5 and ask to save the postscript file:

Gblocks nad3.pir -t=p -e=-gb1 -b4=5 -d=y
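When Gblocks is called from a script, the same command line can be assembled programmatically. The Python sketch below only builds the list of arguments, in the form -letter=value described above, and does not run anything. The function name build_gblocks_command is hypothetical, and the executable name Gblocks is assumed to be reachable from the working directory or the search path. The only check performed is the 5-character limit on the Generic File Extension; other allowed values are not validated.

```python
def build_gblocks_command(filename, **options):
    """Return the argument list for a Gblocks call.

    The file name always comes first; the remaining parameters may
    appear in any order.
    """
    args = ["Gblocks", filename]
    for key, value in options.items():
        if key == "e" and len(str(value)) > 5:
            raise ValueError("Generic File Extension must have 5 characters maximum")
        args.append("-{}={}".format(key, value))
    return args


# The example command line shown above:
command = build_gblocks_command("nad3.pir", t="p", e="-gb1", b4=5, d="y")
```

The resulting list could then be passed to a process launcher such as subprocess.run from the Python standard library.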

Note for paths files. If one of the options a, c or w is set to yes (a concatenated file is to be saved) or parameters b1 or b2 (which are related to the number of sequences) are changed, then all alignments in the paths file should have the same number of sequences; otherwise the program will stop.


File formats

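The NBRF/PIR format has the following features, as the example below shows: each sequence starts with a line containing the symbol >, a sequence type code (P1 for proteins), a semicolon and the sequence name; a second line, which may be empty, holds a description; the sequence itself follows and is terminated by an asterisk.

For example, this is a protein alignment with three sequences in NBRF/PIR format: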

>P1;nad3_parde

------MEYLLQEYLPILVFLGMASALAIVLILAAAVIAVRN--PDPEKVSAYECGFNAF
D-DARMKFDVRFYLVSILFIIFDLEVAFLFPWAVSFASLS-DVAFWGLMVFLAVLTVGFA
YEWKKGALEWA----------------------*

>P1;nad3_picca

MLNYFVYPYGIENDMGMKFYMMLVPMMSMVLMMINYMMTNKS-DNNMNKTGPYECGFDSF
R-QSRTTYSIKFILIAILFLPFDLELTSILPYTLSMYNTN-IYGLFILLYFLLPLIIGFI
IEINTKAIYMTKMFNRNVKSMTSYVKYNNKI--*

>P1;nad3_podan

-------------MSSMTLFILFVSIIALLFLFINLIFAPHN--PYQEKYSIFECGFHSF
LGQNRTQFGVKFFIFALVYLLLDLEILLTFPFAVSEYVNN-IYGLIILLGFITIITIGFV
YELGKSALKIDSRQVITMTRFNYSSTIEYLGKI*
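A file laid out like the example above can be read with a few lines of Python. This is a minimal sketch, not the parser used by Gblocks. It assumes that every header line is followed by exactly one description line (possibly empty) and that every sequence ends with an asterisk; the function name read_pir is hypothetical.

```python
def read_pir(text):
    """Return {name: (type_code, sequence)} from NBRF/PIR formatted text."""
    records = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.startswith(">"):
            i += 1
            continue
        # Header looks like ">P1;nad3_parde": type code, semicolon, name.
        code, _, name = line[1:].partition(";")
        i += 2  # skip the header and the description line
        chunks = []
        while i < len(lines):
            chunk = lines[i].strip()
            i += 1
            if chunk.endswith("*"):
                chunks.append(chunk[:-1])  # drop the terminating asterisk
                break
            chunks.append(chunk)
        records[name.strip()] = (code, "".join(chunks))
    return records
```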

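The FASTA format is simpler: each sequence starts with a line containing the symbol > followed by the sequence name, and the sequence follows on the next lines without any terminating character.

This is the same protein alignment in FASTA format: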

>nad3_parde
------MEYLLQEYLPILVFLGMASALAIVLILAAAVIAVRN--PDPEKVSAYECGFNAF
D-DARMKFDVRFYLVSILFIIFDLEVAFLFPWAVSFASLS-DVAFWGLMVFLAVLTVGFA
YEWKKGALEWA----------------------
>nad3_picca
MLNYFVYPYGIENDMGMKFYMMLVPMMSMVLMMINYMMTNKS-DNNMNKTGPYECGFDSF
R-QSRTTYSIKFILIAILFLPFDLELTSILPYTLSMYNTN-IYGLFILLYFLLPLIIGFI
IEINTKAIYMTKMFNRNVKSMTSYVKYNNKI--
>nad3_podan
-------------MSSMTLFILFVSIIALLFLFINLIFAPHN--PYQEKYSIFECGFHSF
LGQNRTQFGVKFFIFALVYLLLDLEILLTFPFAVSEYVNN-IYGLIILLGFITIITIGFV
YELGKSALKIDSRQVITMTRFNYSSTIEYLGKI
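The FASTA layout shown above is equally easy to read from a script. The following Python sketch is an illustration with a hypothetical function name, not the reader used by Gblocks. It assumes that the whole header line after the symbol > is the sequence name and that blank lines can be ignored.

```python
def read_fasta(text):
    """Return {name: sequence} from FASTA formatted text, in file order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence lines are joined until the next header.
            records[name] += line
    return records
```

Because dictionaries keep insertion order in current Python versions, the sequences come back in the same relative order as in the file, which matters when alignments are to be concatenated.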

Citation

The following paper describes the Gblocks method as well as the advantages and apparent disadvantages of eliminating nonconserved segments from an alignment intended for phylogenetic analysis:

Another paper showing the effects of Gblocks on alignments is:


Version history

Gblocks 0.91b:
- It is now possible to select blocks that contain some gap positions inside.
- It now uses a similarity matrix in protein alignments so that conserved positions are less strictly defined.
- In codon alignments, selected blocks are made to coincide with complete codons.
- It generates an HTML output file of the original alignment with the selected blocks highlighted.
- All saving and block parameters can be entered through the command line (in UNIX and PC versions).
- It now reads and writes FASTA format (besides PIR/NBRF). In addition, gap-only positions are removed prior to the analysis.
- It can now process alignments with only two positions (previously the minimum was five) or only two sequences (previously the minimum was three), and path files with only one sequence (previously the minimum was two).
- It is possible to process path files containing alignments with different numbers of sequences; in this case the program calculates defaults for every alignment.
- The block parameters and saving menus have been simplified. As a consequence of the block parameters simplification and other changes, the results with defaults are slightly different from those of version 0.73b.
- A bug that would crash the program under certain rare conditions in the alignment ends has been fixed.

Gblocks 0.73b:
- First release.