Usage
Note
Pytrf is only responsible for finding tandem repeats in a given DNA sequence. It cannot extract TRs directly from FASTA files, so the examples below use pyfastx to parse the FASTA file and feed each sequence to pytrf.
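The hand-off described in this note can be wrapped in a small generator. The following is only a sketch: `iter_repeats` is a hypothetical helper name, not part of pytrf, and it assumes only what the examples on this page show, namely that a finder is called as `finder(name, seq, ...)` and can be iterated over. A stand-in finder is used so the sketch runs without pytrf or pyfastx installed.

```python
def iter_repeats(records, finder, *args, **kwargs):
    """Yield every repeat found in every (name, seq) record.

    records: any iterable of (name, sequence) pairs, e.g. pyfastx.Fastx(...)
    finder:  a callable such as pytrf.STRFinder
             (assumed call shape: finder(name, seq, ...))
    """
    for name, seq in records:
        # One finder per sequence; extra arguments are passed through unchanged.
        yield from finder(name, seq, *args, **kwargs)


# Hypothetical stand-in finder: reports (name, length) for long enough sequences.
def fake_finder(name, seq, min_len=1):
    return [(name, len(seq))] if len(seq) >= min_len else []


demo = list(iter_repeats([("chr1", "ACGT"), ("chr2", "AC")], fake_finder, min_len=3))
```

With the real libraries, the same helper would presumably be called as `iter_repeats(pyfastx.Fastx('tests/data/test.fa', uppercase=True), pytrf.STRFinder)`.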
STR identification
Pytrf provides the STRFinder class to find all microsatellites (SSRs) in a given sequence.
The fastest way to get all SSRs from a sequence:
>>> import pyfastx
>>> import pytrf
>>> # parse input fasta sequence
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz', uppercase=True)
>>> # get the first sequence from fasta
>>> name, seq = next(fa)
>>> # feed sequence to pytrf
>>> # the fastest way to get all SSRs from sequence
>>> ssrs = pytrf.STRFinder(name, seq).as_list()
You can also iterate over an STRFinder object to get exact tandem repeat (ETR) objects. An ETR object gives you access to more information and can format it as a tsv, csv or gff string.
>>> # iterate over STRFinder object to get ETR object
>>> for ssr in pytrf.STRFinder(name, seq):
...     print(ssr.chrom)
...     print(ssr.motif)
...     print(ssr.repeat)
You can define the minimum number of repeats required to call an SSR for each motif length.
>>> # change the minimum repeats for mono-, di-, tri-, tetra-, penta-, hexa-nucleotide repeat
>>> ssrs = pytrf.STRFinder(name, seq, 10, 6, 4, 3, 3, 3)
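To make the effect of the six thresholds concrete, here is a naive pure-Python scan for exact repeats. It is an illustration of the idea only, not pytrf's implementation; the function name `find_exact_strs` and the returned tuple layout are assumptions made for this sketch.

```python
def find_exact_strs(seq, min_repeats=(12, 7, 5, 4, 4, 4)):
    """Naive scan for exact repeats with motif lengths 1 to 6.

    min_repeats[k-1] is the minimum copy number for a motif of length k.
    Returns (start, end, motif, repeats) tuples with one-based coordinates.
    """
    found = []
    i = 0
    while i < len(seq):
        hit = None
        for k in range(1, len(min_repeats) + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # ran off the end of the sequence
            # Skip motifs that are copies of a shorter unit (e.g. "ATAT").
            if motif in (motif + motif)[1:-1]:
                continue
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == motif:
                copies += 1
            if copies >= min_repeats[k - 1]:
                hit = (i + 1, i + copies * k, motif, copies)
                break
        if hit:
            found.append(hit)
            i = hit[1]  # one-based end == zero-based index just past the repeat
        else:
            i += 1
    return found


seq = "GG" + "AC" * 7 + "TT" + "A" * 10 + "C"
strict = find_exact_strs(seq)                        # default thresholds
relaxed = find_exact_strs(seq, (10, 6, 4, 3, 3, 3))  # lower thresholds
```

With the default thresholds only the (AC)7 run qualifies; lowering the mononucleotide threshold to 10 also admits the run of ten A's.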
A complete example that gets all SSRs and prints them in csv format:
>>> fa = pyfastx.Fastx('tests/data/test.fa', uppercase=True)
>>> for name, seq in fa:
...     for ssr in pytrf.STRFinder(name, seq, 12, 7, 5, 4, 4, 4):
...         print(ssr.as_string(','))
GTR identification
Pytrf provides the GTRFinder class to find all generic tandem repeats (GTRs) with any motif size in a given sequence.
The fastest way to get all GTRs from a sequence:
>>> # feed sequence to pytrf
>>> # the fastest way to get all GTRs from sequence
>>> gtrs = pytrf.GTRFinder(name, seq).as_list()
Iterate over a GTRFinder object to get ETR objects:
>>> for gtr in pytrf.GTRFinder(name, seq):
...     print(gtr.chrom)
...     print(gtr.motif)
...     print(gtr.repeat)
You can customize the minimum and maximum motif size, the minimum number of repeats and the minimum repeat length.
>>> gtrs = pytrf.GTRFinder(name, seq, min_motif=20, max_motif=100, min_repeat=3, min_length=10)
A complete example that gets all GTRs and prints them in csv format:
>>> fa = pyfastx.Fastx('tests/data/test.fa', uppercase=True)
>>> for name, seq in fa:
...     for gtr in pytrf.GTRFinder(name, seq, 30, 100, 2, 10):
...         print(gtr.as_string(','))
Exact tandem repeat
When iterating over an STRFinder or GTRFinder object, an exact tandem repeat (ETR) object is returned.
An ETR is a read-only object that lets you access its attributes and convert it to the desired format.
>>> ssrs = pytrf.STRFinder(name, seq)
>>> # get one ssr
>>> ssr = next(ssrs)
>>> # get the name of the sequence on which the SSR is located
>>> ssr.chrom
>>> # get one-based start and end position
>>> ssr.start
>>> ssr.end
>>> # get repeat sequence
>>> ssr.seq
>>> # get motif sequence
>>> ssr.motif
>>> # get number of repeats
>>> ssr.repeat
>>> # get repeat length
>>> ssr.length
>>> # convert to a list
>>> ssr.as_list()
>>> # convert to a dict
>>> ssr.as_dict()
>>> # convert to a gff formatted string
>>> ssr.as_gff()
>>> # convert to tsv string
>>> ssr.as_string(separator='\t')
>>> # convert to csv string
>>> ssr.as_string(separator=',')
>>> # append a terminator to the end of the string
>>> ssr.as_string(separator=',', terminator='\n')
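The `terminator` argument is convenient when writing repeats straight to a file, because each string already ends with its own line break. A minimal sketch follows; `write_repeats` is a hypothetical helper, and `FakeRepeat` is a stand-in for an ETR whose field order is purely illustrative and does not claim to match pytrf's real column layout.

```python
import io


class FakeRepeat:
    """Stand-in for an ETR object (illustrative fields only)."""

    def __init__(self, *fields):
        self.fields = fields

    def as_string(self, separator="\t", terminator=""):
        # Mirrors the call shape used on this page: as_string(separator, terminator).
        return separator.join(str(f) for f in self.fields) + terminator


def write_repeats(repeats, handle, separator=","):
    """Write one line per repeat to an open text handle; return the line count."""
    count = 0
    for tr in repeats:
        handle.write(tr.as_string(separator=separator, terminator="\n"))
        count += 1
    return count


buf = io.StringIO()
written = write_repeats(
    [FakeRepeat("chr1", 3, 16, "AC", 7), FakeRepeat("chr1", 19, 28, "A", 10)], buf
)
```

Any object with a compatible `as_string` method can be passed in, so the same helper should work for both ETR and ATR objects.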
ATR identification
Pytrf provides the ATRFinder class to find all imperfect (approximate) tandem repeats in a given sequence.
The fastest way to get all ATRs from a sequence:
>>> # feed sequence to pytrf
>>> # the fastest way to get all ATRs from sequence
>>> itrs = pytrf.ATRFinder(name, seq).as_list()
Iterate over an ATRFinder object to get ATR objects:
>>> for atr in pytrf.ATRFinder(name, seq):
...     print(atr.chrom)
...     print(atr.motif)
...     print(atr.length)
You can customize the motif size and seed parameters.
>>> itrs = pytrf.ATRFinder(name, seq, max_motif_size=10, seed_min_repeat=3, seed_min_length=10)
A complete example that gets all ATRs and prints them in csv format:
>>> fa = pyfastx.Fastx('tests/data/test.fa', uppercase=True)
>>> for name, seq in fa:
...     for atr in pytrf.ATRFinder(name, seq):
...         print(atr.as_string(','))
Approximate tandem repeat
When iterating over an ATRFinder object, an imperfect or approximate tandem repeat (ATR) object is returned.
An ATR is a read-only object that lets you access its attributes and convert it to the desired format.
>>> atrs = pytrf.ATRFinder(name, seq)
>>> # get one ATR
>>> atr = next(atrs)
>>> # get the name of the sequence on which the ATR is located
>>> atr.chrom
>>> # get one-based start and end position
>>> atr.start
>>> atr.end
>>> # get repeat sequence
>>> atr.seq
>>> # get motif sequence
>>> atr.motif
>>> # get length
>>> atr.length
>>> # get number of matches
>>> atr.matches
>>> # get number of substitutions
>>> atr.substitutions
>>> # get number of insertions
>>> atr.insertions
>>> # get number of deletions
>>> atr.deletions
>>> # convert to a list
>>> atr.as_list()
>>> # convert to a dict
>>> atr.as_dict()
>>> # convert to a gff formatted string
>>> atr.as_gff()
>>> # convert to tsv string
>>> atr.as_string(separator='\t')
>>> # convert to csv string
>>> atr.as_string(separator=',')
>>> # append a terminator to the end of the string
>>> atr.as_string(separator=',', terminator='\n')
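The match and error counts can be combined into a single quality figure. The sketch below computes identity as matches divided by all aligned columns; this definition is an assumption made for illustration, and pytrf may define or report identity differently. `AtrCounts` is a hypothetical stand-in that carries only the four count attributes listed above.

```python
from collections import namedtuple

# Stand-in with the same count attribute names as an ATR object.
AtrCounts = namedtuple("AtrCounts", "matches substitutions insertions deletions")


def alignment_identity(atr):
    """Fraction of aligned columns that are matches (assumed definition)."""
    errors = atr.substitutions + atr.insertions + atr.deletions
    aligned = atr.matches + errors
    # Guard against an empty alignment rather than dividing by zero.
    return atr.matches / aligned if aligned else 0.0


identity = alignment_identity(AtrCounts(matches=18, substitutions=1,
                                        insertions=1, deletions=0))
```

Because the function reads only attribute names, it can be applied to any object exposing `matches`, `substitutions`, `insertions` and `deletions`, for example to filter repeats below a chosen identity.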
Commandline interface
Pytrf also provides command line tools for finding tandem repeats in fasta or fastq files.
pytrf -h
usage: pytrf command [options] fastx
a python package for finding tandem repeats from genomic sequences
options:
  -h, --help     show this help message and exit
  -v, --version  show program version number and exit
commands:
  findstr        find exact or perfect short tandem repeats
  findgtr        find exact or perfect generic tandem repeats
  findatr        find approximate or imperfect tandem repeats
  extract        get tandem repeat sequence and flanking sequence
Find exact microsatellites or simple sequence repeats (SSRs) in a fasta/q file.
pytrf findstr -h
usage: pytrf findstr [-h] [-o] [-f] [-r mono di tri tetra penta hexa] fastx
positional arguments:
  fastx                 input fasta or fastq file (gzip support)
options:
  -h, --help            show this help message and exit
  -o , --out-file       output file (default: stdout)
  -f , --out-format     output format, tsv, csv or gff (default: tsv)
  -r mono di tri tetra penta hexa, --repeats mono di tri tetra penta hexa
                        minimum repeats for each STR type (default: 12 7 5 4 4 4)
Find exact generic tandem repeats (GTRs) in a fasta/q file.
pytrf findgtr -h
usage: pytrf findgtr [-h] [-o] [-f] [-m] [-M] [-r] [-l] fastx
positional arguments:
  fastx                 input fasta or fastq file (gzip support)
options:
  -h, --help            show this help message and exit
  -o , --out-file       output file (default: stdout)
  -f , --out-format     output format, tsv, csv or gff (default: tsv)
  -m , --min-motif      minimum motif length (default: 10)
  -M , --max-motif      maximum motif length (default: 100)
  -r , --min-repeat     minimum repeat number (default: 3)
  -l , --min-length     minimum repeat length (default: 10)
Find imperfect or approximate tandem repeats (ATRs) in a fasta/q file.
pytrf findatr -h
usage: pytrf findatr [-h] [-o] [-f] [-m] [-M] [-r] [-l] [-e] [-p] [-x] fastx
positional arguments:
  fastx                 input fasta or fastq file (gzip support)
options:
  -h, --help            show this help message and exit
  -o , --out-file       output file (default: stdout)
  -f , --out-format     output format, tsv, csv or gff (default: tsv)
  -m , --min-motif-size
                        minimum motif length (default: 1)
  -M , --max-motif-size
                        maximum motif length (default: 6)
  -r , --min-seed-repeat
                        minimum repeat number for seed (default: 3)
  -l , --min-seed-length
                        minimum length for seed (default: 10)
  -e , --max-continuous-error
                        maximum number of continuous alignment errors (default: 3)
  -p , --min-identity   minimum identity from 0 to 1 (default: 0.7)
  -x , --max-extend-length
                        maximum length allowed to extend (default: 2000)
Extract tandem repeat sequences and their flanking sequences according to the results of findatr, findgtr or findstr.
pytrf extract -h
usage: pytrf extract [-h] -r [-o] [-f] [-l] fastx
positional arguments:
  fastx                 input fasta or fastq file (gzip support)
options:
  -h, --help            show this help message and exit
  -r , --repeat-file    the csv or tsv output file of findatr, findstr or findgtr
  -o , --out-file       output file (default: stdout)
  -f , --out-format     output format, tsv, csv or fasta (default: tsv)
  -l , --flank-length   flanking sequence length (default: 100)
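The commands can be chained from a script: run findstr to produce a csv file, then pass that file to extract. The sketch below only builds the argument lists, using the flags shown in the help output above; the helper names and the file names `test.fa`, `ssrs.csv` and `flank.fa` are hypothetical. Nothing is executed here, so it runs without pytrf installed.

```python
def findstr_command(fastx, out_file=None, out_format="tsv",
                    repeats=(12, 7, 5, 4, 4, 4)):
    """Build an argument list for `pytrf findstr` (flags taken from its help text)."""
    cmd = ["pytrf", "findstr", "-f", out_format, "-r", *map(str, repeats)]
    if out_file:
        cmd += ["-o", out_file]
    cmd.append(fastx)  # positional input file goes last
    return cmd


def extract_command(fastx, repeat_file, out_file=None, out_format="tsv",
                    flank_length=100):
    """Build an argument list for `pytrf extract` (flags taken from its help text)."""
    cmd = ["pytrf", "extract", "-r", repeat_file,
           "-f", out_format, "-l", str(flank_length)]
    if out_file:
        cmd += ["-o", out_file]
    cmd.append(fastx)
    return cmd


str_cmd = findstr_command("test.fa", out_file="ssrs.csv", out_format="csv",
                          repeats=(10, 6, 4, 3, 3, 3))
ext_cmd = extract_command("test.fa", "ssrs.csv", out_file="flank.fa",
                          out_format="fasta")
```

Each list can then be passed to `subprocess.run(cmd, check=True)` once pytrf is installed and on the PATH.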