Wednesday, October 03, 2007

FASTA, betta, more

My girlfriend's sister asked for help. But all she got was a 16.

The purpose of this script is to get a file(name) as a first argument, a regexp as a second argument and then search trough the file and print out the whole sequence where any of the possible regexp values are found, being that the searched sequence itself must be identified with "<>" symbols and then do a little basic math with the elements we've found.

To try this out, save the following as (for instance) sequences.txt:

>50c's and 50t's
cccccccccccccccccccccccccccccccccccccccccccccccccctttttttttt
tttttttttttttttttttttttttttttttttttttttt
>10G's and 90A's
GGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>A random sequence
ccagtcctgtctaaggactttggttcatgcgttaatttcggttagccagtggcgcccacg
taacaaagtgcgaccgacctgcttccatagcattgaaatgtccttgattggagattttac
gcaggaggactgataagttcgggtctgaattgatgcagcgaacgttcatccaatactcag
acttgcactcagtcg
>A second random sequence
cgcatgacggccctcctcgacggttcattaaccttgcacaccaactgttcctcaaccgat
ctggtgtctttctcacttacatagcagttgctgtaccatttgatgggaacccgagatcac
cgggttatcgcgggagtttattccgaattgttctcggaaatgtggcgtcggcttgaattg
ggaataatacggatcttgacaagcacgatttcatccacaatgcggcacgagtaatcccct
tctggaaggatgcagaaaggacatatacaatgagctagccacgtggcgcataacgcagct
tggttagaaactaaatctatacaggaagatacagaatgggaacagtgtcgctcagtctag
ccagctagatccgcctttagtcgaccttaggggtaaggca

And the below code as scripty.pl:

print "\n\n *girfriends_sister_school_id* \*girfriends_sisters_friend_school_id*\n\n";
print "filename:\n";
$filename = ;
open(TEMPFILE, $filename) or die("Cannot open the file $filename");
print "Insert the regular expression to find:\n";
chomp($regexp=);
while () {

if ( /^>/xms) {
$paragraph = "";
$paragraph_head = $_
}
if ( /^$/xms) {
push @paragraphs, $paragraph;
} else {
$paragraph .= $_;
}
}
push @paragraphs, $paragraph;
foreach (@paragraphs) {
while ( $_ =~ /($regexp)/gi) {
$expressions{$1} += 1;
}
}
foreach $key (keys %expressions) {
print "\n\nPattern = $key\n";
foreach (@paragraphs) {
if ($_ =~ /$key/) {
$paragraph_with_patterns = $_;
$total = length$paragraph_with_patterns;
$counterA = 0;
$counterC = 0;
$counterT = 0;
$counterG = 0;
if ($key =~ /a/){
while ( $paragraph_with_patterns =~ /a/gi) {
$counterA += 1;
$sum1 = ($counterA/$total);
}
}
if ($key =~ /c/){
while ( $paragraph_with_patterns =~ /c/gi) {
$counterC += 1;
$sum2 = ($counterC/$total);
}
}
if ($key =~ /t/){
while ( $paragraph_with_patterns =~ /t/gi) {
$counterT += 1;
$sum3 = ($counterT/$total);
}
}
if ($key =~ /g/){
while ( $paragraph_with_patterns =~ /g/gi) {
$counterG += 1;
$sum4 = ($counterG/$total);
}
}
$paragraph_with_patterns =~ s/($key)/\<$1\>/gi;
print "$paragraph_with_patterns\n\n";
%sum = ();
%sum1 = ();
%sum2 = ();
%sum3 = ();
%sum4 = ();
if ($counterA >0){
%sum1 = ("A", $counterA);
}
if ($counterC >0){
%sum2 = ("C", $counterC);
}
if ($counterT >0){
%sum3 = ("T", $counterT);
}
if ($counterG >0){
%sum4 = ("G", $counterG);
}
%sum = (%sum1,%sum2,%sum3,%sum4);
foreach $key_id (sort hashValueDescendingNum (keys(%sum))) {
$div = ($sum{$key_id}/$total);
print "$key_id = $sum{$key_id} \/ $total = $div\n";
}
}
}
}
sub hashValueDescendingNum{
$sum{$b} <=> $sum{$a};
}
exit;

Then run perl scripty.pl . Put sequences.txt as the file and a[c|t]g as the regexp when prompted.

I had alot of fun doing this wich is a good thing since I didn't get paid.

And no, neither bio::perl, nor any other module for that matter, were options.

Finally, a major thank you must go out to sab for the priceless (as in: he didn't get paid either) help.

No comments: