Full-LengtherNEXT

DESCRIPTION:

FULL-LENGTHERNEXT is a tool adapted to NGS technologies that can work in parallel and in a distributed way to minimise computing time. It classifies unigenes as full-length, 5'-end, 3'-end or internal, and suggests which unknown genes are coding. FULL-LENGTHERNEXT also fixes frame shifts, one of the main errors found in incorrect entries of full-length sequence databases, and is a fast tool for comparing different transcriptome assemblies.

FEATURES/PROBLEMS:

  • FULL-LENGTHERNEXT uses scbi_mapreduce and can therefore exploit all the benefits of a cluster environment. It also works on multi-core machines and big shared-memory servers.

  • It classifies unigenes as full-length, 5'-end, 3'-end or internal.

  • FULL-LENGTHERNEXT fixes frame shifts.

  • It returns the translated protein sequence for complete genes, together with the nucleotide sequence with frame shifts fixed and the start and stop codons highlighted, which makes the gene and the UTR regions easier to find.

  • FULL-LENGTHERNEXT suggests putative new genes by analysing which of the genes classified as unknown are probably coding.

  • It produces a stats file that is useful for comparing assemblies.

SYNOPSIS:

FULL-LENGTHERNEXT takes a multi-FASTA file containing all the unigenes to analyse, plus the group to which the organism under study belongs (fungi, human, invertebrates, mammals, plants, rodents or vertebrates), so that the most appropriate databases are used. You can also set the number of CPUs to use (workers), the minimum identity percentage (default = 45%) and e-value (default = 1e-25) thresholds, the maximum distance between the query and subject gene limits (default = 15 amino acids) and, if desired, a user database of complete proteins.

full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] -d user_db [options]
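
To illustrate how a distance threshold such as the 15 amino acid default could drive the classification, here is a minimal Ruby sketch. It is not the algorithm FULL-LENGTHERNEXT actually implements: the method name classify_unigene, its arguments and the decision rule are all assumptions made for this example. It supposes that a unigene has been aligned to a reference protein, that the aligned region on the subject runs from s_start to s_end (1-based amino acid positions), and that an end counts as covered when the alignment reaches within max_distance amino acids of it.

```ruby
# Hypothetical helper, not part of the full_lengther_next gem.
# s_start, s_end: 1-based positions of the aligned region on the subject protein.
# s_len:          length of the subject protein in amino acids.
# max_distance:   tolerance in amino acids (15 mirrors the documented default).
def classify_unigene(s_start, s_end, s_len, max_distance = 15)
  covers_n_terminus = (s_start - 1) <= max_distance    # alignment reaches the protein start
  covers_c_terminus = (s_len - s_end) <= max_distance  # alignment reaches the protein end

  if covers_n_terminus && covers_c_terminus
    :full_length
  elsif covers_n_terminus
    :five_prime_end   # assumed meaning: 5'-end present, 3'-end missing
  elsif covers_c_terminus
    :three_prime_end  # assumed meaning: 3'-end present, 5'-end missing
  else
    :internal
  end
end

puts classify_unigene(3, 495, 500)    # prints full_length
puts classify_unigene(100, 300, 500)  # prints internal
```

Under these assumptions, raising max_distance makes the classification more permissive, since more partially covered alignments are counted as reaching the ends of the subject protein.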

PBS Submission script

$> cat sample_work.sh

# 12 distributed workers and 1 GB memory per worker:
#PBS -l select=12:ncpus=1:mpiprocs=1:mem=1gb
# request 10 hours of walltime:
#PBS -l walltime=10:00:00
# cd to working directory (from where job was submitted)
cd $PBS_O_WORKDIR

# create workers file with assigned node names
cat ${PBS_NODEFILE} > workers

# init seqtrimnext
source ~seqtrimnext/init_env

time seqtrimnext -t paired_ends.txt -Q fastq -w workers -s 10.0.0

Once this submission script is created, you only need to launch it with:

qsub sample_work.sh

REQUIREMENTS:

Ruby 1.9.2

Blast plus 2.2.24 or greater (prior versions have bugs that produce bad results)

INSTALL:

Installing Blast

  • Download the latest version of Blast+ from ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/

  • You can also use a precompiled version if you like

  • To install from source, decompress the downloaded file, cd to the decompressed folder, and issue the following commands:

./configure
make
sudo make install

Installing Ruby 1.9

  • You can use RVM to install Ruby:

Download latest certificates (maybe you don't need them):

$ curl -O curl.haxx.se/ca/cacert.pem
$ export CURL_CA_BUNDLE=`pwd`/cacert.pem # add this to your .bashrc or equivalent

Install RVM:

$ bash < <(curl -k rvm.beginrescueend.com/install/rvm)

Setup environment:

$ echo '[[ -s "$HOME/.rvm/scripts/rvm" ]] && . "$HOME/.rvm/scripts/rvm" # Load RVM function' >> ~/.bash_profile

Install ruby 1.9.2 (this can take a while):

$ rvm install 1.9.2

Set it as the default:

$ rvm use 1.9.2 --default

Install Full-LengtherNEXT

Full-LengtherNEXT is very easy to install. It is distributed as a ruby gem. The next command will install Full-LengtherNEXT and all the required gems:

gem install full_lengther_next

Install and rebuild Full-LengtherNEXT databases

Full-LengtherNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To install them, execute:

$ download_fln_dbs.rb

User database

In addition, Full-LengtherNEXT can use a customised database. It can be created by executing:

$ make_user_db.rb

This script needs only two parameters: a database division (fungi, human, invertebrates, mammals, plants, rodents or vertebrates) and a 'taxon', which corresponds to a specific taxonomic group such as a genus, family or order. For example, if the organism under study is a pine, the database division will be 'plants' and the taxon may be Pinus, Pinaceae or even the order Coniferales, which includes pines and the other conifers. The command line will therefore be:

$ make_user_db.rb plants Coniferales

Otherwise, if you build the user database yourself, it must contain only complete proteins and be formatted with the BLAST command:

makeblastdb -in sequences.fasta -dbtype 'prot' -parse_seqids
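
Because a user database must contain only complete proteins, it can help to check the FASTA file before formatting it. The Ruby sketch below is one hypothetical way to do that and is not part of the full_lengther_next gem: the helpers parse_fasta and complete_protein? are invented for this example, and the rule used (the sequence starts with methionine and has no internal stop symbol) is an assumption about what "complete" means, not the tool's own definition.

```ruby
# Hypothetical pre-check, not part of the full_lengther_next gem.
# Parses FASTA text into a { header => sequence } hash.
def parse_fasta(text)
  records = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?('>')
      current = line[1..-1]        # header without the leading '>'
      records[current] = ''
    elsif current
      records[current] += line     # join wrapped sequence lines
    end
  end
  records
end

# Assumption: a protein is "complete" when it starts with methionine (M)
# and contains no stop symbol (*) other than an optional trailing one.
def complete_protein?(seq)
  body = seq.upcase.sub(/\*\z/, '')
  body.start_with?('M') && !body.include?('*')
end

fasta = <<~FASTA
  >prot1 complete
  MKTAYIAKQR
  QISFVKSHFS
  >prot2 fragment
  AYIAKQRQIS
FASTA

incomplete = parse_fasta(fasta).reject { |_, seq| complete_protein?(seq) }.keys
puts incomplete.inspect   # prints ["prot2 fragment"]
```

In practice the text would come from File.read on the FASTA file you intend to pass to makeblastdb, and any headers reported as incomplete would be reviewed or removed first.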

LICENSE:

(The MIT License)

Copyright © 2012 - Noe Fernandez & Dario Guerrero

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
