Bioinformatics-with-teja: solution to perl problem -1

Hi everyone.

As there are no replies for perl problem -1 that was posted yesterday, i am posting the solution.

#!/usr/bin/perl -w

chomp(@dna=<>);

@dna= sort {(($a=~tr/G//+$a=~tr/C//)/length($a)) <=> (($b=~tr/G//+ $b=~tr/C//)/length($b))} map {uc} @dna;

print $_,"\n" foreach (@dna);

Here is the explanation of the solution

#!usr/bin/perl -w

This 'shebang' line is usually the first line of any perl program , and it suggests the operating system to execute the perl using perl.exe located in usr/bin directory. The flag w which is written next to the shebang line as - w , tells perl to print warnings if any.

chomp(@dna=<>);

The problem states that user can enter any number of DNA strings . So to these dna strings we need an array. In perl arrays are denoted using symbol @. so here @dna is the array which will hold our DNA strings.

The operator <> is called the diamond operator which is useful to capture the command line input. As user enters the DNA strings on the command line each DNA string is sent to @dna array. Once all the strings are entered we have to press ctrl+Z to terminate input on windows , or ctrl+D on linux.

chomp function is generally used to strip the trailing new line character from a given string. chomp(@dna=<>) statement chomps the trailing new line characters from all the strings in @dna. for more information on chomp visit http://perldoc.perl.org/functions/chomp.html

once we have the DNA strings in array, we have to change all of them into upper case then sort them accoding to gc content. The following statement does the same

@dna= sort {$a=~tr/G//+$a=~tr/C// <=> $b=~tr/G//+ $b=~tr/C//} map {uc} @dna;

There is a lot to explain in above line. So i will break the line into smaller parts and explain them.

map {uc} @dna

map is a construct (a keyword) which applies a given function inside {} to all the elements in a array.

The syntax of map is

(returns array) map {function } arrayname

map applies function specified in {} to each and every element in array specified by arrayname and returns the resultant array. map doesnot modify the original array, so we have to specify a name for the array returned by map.

map {uc} @dna

This applies function uc which means uppercase to each and every element of @dna . so all the elements in @dna will be transformed to uppercase and a new array with all the elements of @dna in uppercase is returned.

sort {(($a=~tr/G//+$a=~tr/C//)/length($a)) <=> (($b=~tr/G//+ $b=~tr/C//)/length($b))} map {uc} @dna

Now the sorting part.

sort function takes an array as an argument and sorts the array and returns the resultant array.

syntax of sort

(returns array) sort myfunction array

(returns array) sort {function} array

if myfunction is ommited sort sorts the array alphabetically.

we can specify our function in sort and change the behaviour of sort according to our needs.

so we will take look at sort function closely

sort {(($a=~tr/G//+$a=~tr/C//)/length($a)) <=> (($b=~tr/G//+ $b=~tr/C//)/length($b))}

when we specify to use sort function on a array sort function automatically places two elements from an array at a time into two special variables called $a and $b.We can specify what we have to do with that $a and $b.

Here i am calculating the count of G and count of C in both $a and $b ( i.e dna strings ) and dividing them by the length of the string to give the GC percentage of the individual string, and then sorting according to the GC content. I will leave as homework on how to use tr/// and how to use <=>

More information on sort can be found at

http://perldoc.perl.org/functions/sort.html

so once the sorting is done , we have to print the sorted array. this can be done using a loop.

print $_,"\n" foreach (@dna);

Again, i will leave the interpretation of the above statement for yourself.

Thanks

Teja

6 comments:

Surendra KumarJune 5, 2009 at 9:21 AM
Hi nice blog ....

What if we have fasta file where the sequences (80 characters each line).

>seq1
cgatagactacgactacgactagacatc
cgatacgatacgacatcgacatagacta
gcatcagactacgacatcagacgacata

I have written one but its contain a way too much code ...

Thanx ........
Surendra KumarJune 5, 2009 at 9:22 AM
Can I suggest problem 2?
tejaJune 5, 2009 at 10:59 AM
yes you can suggest answers for any problem, or you cans end me any problems in perl.

Thankyou

Teja
Surendra KumarJune 8, 2009 at 5:08 AM
perl has many ways to do any particular problem. I am not quite use to perl hash. The second problem I was talking about in my earlier post was "write a perl script which intakes a fasta files (the fasta file is like first line header and remaining ten lines sequences with 80 character each), Remove the identical sequences and then print them in file along with the added information on header about the number of duplicates."... don't use any system command and use hash variable.
Surendra KumarJune 9, 2009 at 8:45 AM
Am I posting the question in the wrong section?
UnknownJune 11, 2009 at 9:56 PM
Minnu,your solution to the first problem cleared my doubts on substitution and transliteration...:)

Bioinformatics-with-teja

Thursday, June 4, 2009

solution to perl problem -1

6 comments:

Contributors

Followers

Blog Archive