picture credit: nmpdr.org
The FASTA file format is a text-based format for nucleotide or peptide sequences. It is widely used in bioinformatics and computational biology. However, the content of a FASTA file cannot be used directly as a sequence string or list in Python, or as the equivalent data type in any other language. The reason (as you can see in the picture) is that it starts with an information header that is not part of the nucleotide or peptide sequence. I wrote this code to demonstrate a number of ways in which the header can be stripped off to leave only the sequence. First, there is a shorter and more efficient example that uses a few less common file and string operations. The other two examples show how the same thing can be achieved with just the very basic string and list operations, and the romantic in me wanted to write the code for these examples too. The code can be executed online here, with an option of adding text files and testing it.
If you have come up with another exciting way to slice off the header, I would love to see the code in the comments 🙂
Here is the first example:
## To strip the header from a FASTA file.
with open('input.txt') as handle:
    lines_Minus_Header = handle.readlines()[1:]  # skip the first line, which is the FASTA header
# join the remaining lines into one string, dropping the whitespace (including the newline) around each line
FASTA = ''.join(line.strip() for line in lines_Minus_Header)
print(FASTA)
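The first example relies on the header being exactly the first line of the file. As a hedged alternative, here is a minimal sketch that recognises header lines by their leading '>' character instead, so it also copes with blank lines before or after the header. The function name `strip_fasta_headers` and the sample record are made up for illustration and are not part of the code above.

```python
def strip_fasta_headers(text):
    """Return the sequence in `text` with every '>' header line removed.

    Hypothetical helper for illustration: `text` is assumed to be the
    whole content of a FASTA file as one string.
    """
    sequence_lines = []
    for line in text.splitlines():
        line = line.strip()
        # skip empty lines and header lines, which start with '>'
        if not line or line.startswith('>'):
            continue
        sequence_lines.append(line)
    return ''.join(sequence_lines)


# made-up record, only to show the call
sample = ">example header\nGTGC\nGGGA\n"
print(strip_fasta_headers(sample))  # prints GTGCGGGA
```

Note that if the text holds several records, this sketch joins all of their sequences into one string.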
Here is the second example. It requires you to provide the header (as a string argument to the function) so that it can be removed from the FASTA sequence.
# another way to strip the FASTA header
def stripHeader(arg, arg2):  # a two-argument function. The first argument (arg)
    # is the header you want to omit to get the base-pair-only sequence;
    # the second argument (arg2) is the full FASTA record as a string.
    fasta_header = arg
    initiate_splice = len(fasta_header)
    lines_Minus_Header = arg2[initiate_splice:]  # start reading base pairs right after the header.
    FASTA = lines_Minus_Header.strip()  # remove leading and trailing whitespace for cleaner display.
    print(FASTA)

example_Header = '>gi|224589820:c95907482-95892452 Homo sapiens chromosome 8, GRCh37.p10 Primary Assembly'
example_FASTAsequence = '>gi|224589820:c95907482-95892452 Homo sapiens chromosome 8, GRCh37.p10 Primary AssemblyGTGCGGG...'
stripHeader(example_Header, example_FASTAsequence)
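One thing to keep in mind with the second example: slicing by the length of the header removes that many characters whether or not the record really starts with that header. A minimal sketch of a more careful variant, assuming you would rather get an error than a silently damaged sequence, could check with `str.startswith` first. The name `strip_known_header` and the sample strings are hypothetical.

```python
def strip_known_header(header, record):
    """Remove `header` from the start of `record`, if it is really there.

    Hypothetical variant for illustration; it returns the result
    instead of printing it.
    """
    if not record.startswith(header):
        raise ValueError("record does not start with the given header")
    return record[len(header):].strip()


# made-up header and record, only to show the call
print(strip_known_header(">example header", ">example header\nGTGCGGG"))  # prints GTGCGGG
```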
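The third example is less straightforward. It asks for a primer (a short starting sequence), finds the first place where that primer occurs, and returns the sequence from that point onwards. The primer must therefore not occur anywhere in the header.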
# another way to strip the FASTA header is to specify the primer (must be > 2 bp long) from which you want your FASTA sequence to start.
def stripHeader(arg, arg2):  # a two-argument function: the first argument is the
    # starting or primer sequence and the second argument is the full FASTA record
    arg = list(arg)  # arguments converted to type 'list' so that we can loop over them by index
    arg2 = list(arg2)
    k = 0  # index of the position in the FASTA record that is currently being tested
    while k + len(arg) <= len(arg2):  # stop once the primer no longer fits in what is left
        v = []  # matched primer characters for this position; reset at every position
        if arg2[k] == arg[0]:
            index = len(arg) - 1  # -1 because in Python lists are indexed from 0.
            while index > 0:
                if arg[index] == arg2[k + index]:  # if the primer character matches the character
                    # at the same offset in the full FASTA record, add it to list 'v'.
                    v.append(arg[index])
                index = index - 1
            if len(v) == (len(arg) - 1):  # if every character after the first one matched,
                # it means that we have found the primer sequence in the FASTA record and we can
                # begin to read from here.
                string = arg2[k:]  # reading!
                FASTA = ''.join(string)  # converting the list back into a string.
                print(FASTA)
                break
        k = k + 1  # move on to the next position

example_primer = 'GTG'
example_FASTAsequence = '>gi|224589820:c95907482-95892452 Homo sapiens chromosome 8, GRCh37.p10 Primary AssemblyGTGCGGG...'
stripHeader(example_primer, example_FASTAsequence)
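For comparison, the character-by-character search in the third example is in effect a substring search, which Python's built-in `str.find` also performs. The following is only a sketch under that assumption, not a replacement for the basic-operations version. The name `strip_before_primer` and the sample record are hypothetical.

```python
def strip_before_primer(primer, record):
    """Return `record` from the first occurrence of `primer` onwards.

    Hypothetical helper for illustration; returns None when the primer
    does not occur in the record at all.
    """
    position = record.find(primer)  # str.find gives -1 when the primer is absent
    if position == -1:
        return None
    return record[position:]


# made-up record, only to show the call
print(strip_before_primer("GTG", ">example header GTGCGGG"))  # prints GTGCGGG
```

As with the loop version, the first occurrence wins, so a short primer that also appears inside the header would cut the record in the wrong place.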