Aim: The goal of this lab is to get hands-on experience with regular expressions.
Let’s get started!
a. Create a directory structure to hold your work for this course and all the subsequent labs. Suggestion: CS202/Lab7
b. Write Perl scripts that implement regular expressions for the following exercises!
c. For exercises 1 and 2 below, the program should take a string as input and display either “ACCEPTED” or “REJECTED”.
Exercises
o You are in the market to buy a red pick-up truck, and you wish to develop an automated web-searching program (a spider) that searches online newsgroups and classified-ad websites daily for text containing the word “red” and the phrase “pick-up truck” close to each other, followed by a price. Specifically, you should match the word “red” and the phrase “pickup truck”, “pick-up truck”, or “pick up truck” separated by at most two other words. The truck phrase may appear before or after the word “red”. After both of them, the text must also contain a price. Sample text strings that the RE should accept or reject are given below: (Truck.pl)
ACCEPT
red pickup truck $5000
red pickup truck $5,000
red pickup truck $1,234.56
red pick-up truck $5000
red pick up truck $5000
red toyota pick-up truck $5000
red toyota 1993 pick-up truck $5000
blah blah red toyota 1993 pick-up truck blah blah $5000 blah
pickup truck red $5000
pick-up truck 1993 toyota red $5000
blah blah blah pick-up truck toyota 1993 red blah blah blah $5000
desperate: red 1993 toyota pickup truck for sale. $2,000 o.b.o.
toy pickup truck - cherry red: $12.
red red pickup pickup truck truck $5000.
REJECT
Red
Truck
pickup truck
red pickup truck
red $5000
pickup truck $5000
red truck $5000
$5000 red pickup truck
blue pickup truck $5000
red car $5000
red toyota 1993 pick-up truck
red 1993 toyota automatic pick-up truck $5000
fred's pick-up truck sold for $5000
pick-up trucks by fred: $5000
reddy for sale pickup truck: $5000)
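One possible starting point for Truck.pl is sketched below. It is not a reference solution: the names $truck, $gap, $price, $ad and classify_ad are made up for illustration, matching is assumed to be case-sensitive, and a price is assumed to be a dollar sign followed by digits with optional commas and a decimal part. A "word" is taken to be any run of non-whitespace characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three accepted spellings: pickup, pick-up, pick up (then "truck").
my $truck = qr/pick[-\s]?up\s+truck/;
# Whitespace, optionally preceded by up to two intervening words.
my $gap   = qr/(?:\s+\S+){0,2}\s+/;
# Assumed price format: "$" followed by digits, optional commas and decimals.
my $price = qr/\$\d[\d,]*(?:\.\d+)?/;

# "red" then the truck phrase, or the truck phrase then "red",
# followed somewhere later on the same line by a price.
my $ad = qr/(?:\bred\b$gap$truck\b|\b$truck\b$gap\bred\b).*$price/;

# Hypothetical helper: returns the verdict string the lab asks for.
sub classify_ad {
    my ($text) = @_;
    return $text =~ $ad ? 'ACCEPTED' : 'REJECTED';
}

# Single quotes keep Perl from interpolating "$5000" as a variable.
for my $sample ('red toyota pick-up truck $5000', 'red truck $5000') {
    printf "%-32s %s\n", $sample, classify_ad($sample);
}
```

The word boundaries (\b) are what keep "fred", "reddy" and "trucks" from matching. If you interpret "word" or "price" differently, state that assumption in your README.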
o DNA sequences are written in a simple language over the four-symbol alphabet {A,C,G,T}. Three consecutive letters are known as a codon, so ACT and TCG are both codons. A gene is a sequence of at least three codons that starts with an ATG codon and ends with a TAA, TAG, or TGA codon. You need to develop a regular expression that will match strings that contain a gene. Sample DNA sequences that should be accepted or rejected are given below: (Gene.pl)
ACCEPT
ATGCCCTAA
ATGCCCTAG
ATGCCCTGA
CATGCCCTAA
CATGCCCTAG
CATGCCCTGA
CATGCCCTAAC
CATGCCCTAGC
CATGCCCTGAT
TCATGCCCTGACC
TTATGCCCGGGTGACC
AAACTCATGCCCGGGCCCTGACCTTAA
ATGATGATGTAA
ATGAAAAACAAGAATTAA
ATGACAACCACGACTTAA
ATGAGAAGCAGGAGTTAA
ATGATAATCATGATTTAA
ATGCAACACCAGCATTAA
ATGCCACCCCCGCCTTAA
ATGCGACGCCGGCGTTAA
ATGCTACTCCTGCTTTAA
ATGGAAGACGAGGATTAA
ATGGCAGCCGCGGCTTAA
ATGGGAGGCGGGGGTTAA
ATGGTAGTCGTGGTTTAA
ATGTACTATTCATCCTCGTCTTGCTGGTGTTTATTCTTGTTTTAA
REJECT
GATTACA
ATGTAA
ATGTAG
ATGTGA
ATGCCCCTAG
ATGCCCCCTAG
CCCATGCCCCTAGCCC
CCCATGCCCCCTAGCCC
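The gene definition above translates fairly directly into a pattern: ATG, then one or more whole codons, then a stop codon. The sketch below is one way to write it, not the only one. The names $gene and classify_dna are made up for illustration, and it assumes that interior codons are unrestricted (a stop codon may also appear inside the gene) and that the input is upper-case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Start codon, at least one whole codon, then a stop codon.
# Together that is the required minimum of three codons.
my $gene = qr/ATG(?:[ACGT]{3})+(?:TAA|TAG|TGA)/;

# Hypothetical helper: returns the verdict string the lab asks for.
sub classify_dna {
    my ($sequence) = @_;
    return $sequence =~ $gene ? 'ACCEPTED' : 'REJECTED';
}

for my $sample (qw(CATGCCCTAGC ATGTAA ATGCCCCTAG)) {
    printf "%-12s %s\n", $sample, classify_dna($sample);
}
```

Of the three samples, only CATGCCCTAGC is accepted. ATGTAA has just two codons, and in ATGCCCCTAG no stop codon falls on a codon boundary after ATG. The pattern is deliberately left unanchored because the string only has to contain a gene.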
o Tokenization is the task of extracting tokens from the input text. The definition of ‘token’ depends on the application, but in most cases complete words count as tokens; sometimes punctuation marks do as well. Write a simple tokenizer that, given an input text and a set of delimiting characters, outputs one word per line by replacing each run of delimiting characters with a newline. (Token.pl)
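The replacement step can be sketched with a single substitution. This is a minimal illustration under stated assumptions: the sub name tokenize is hypothetical, the delimiters are passed as a non-empty string of individual characters, and leading or trailing delimiters are trimmed so no blank lines are produced. How Token.pl actually receives its text and delimiters (arguments, standard input, a file) is left for you to decide and document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces each run of delimiter characters in $text
# with one newline. $delims must be a non-empty string of characters.
sub tokenize {
    my ($text, $delims) = @_;
    # quotemeta escapes characters such as "]", "^" and "-" so they are
    # treated literally inside the character class.
    my $class = quotemeta $delims;
    (my $out = $text) =~ s/[$class]+/\n/g;
    # Drop the newline left by delimiters at the very start or end.
    $out =~ s/\A\n|\n\z//g;
    return $out;
}

print tokenize('Hello, world! How are you?', ' ,!?'), "\n";
```

With space, comma, exclamation mark and question mark as delimiters, this prints Hello, world, How, are and you on separate lines.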
Submitting your work:
o All source files and class files as one tar-gzipped archive.
◦ When unzipped, it should create a directory with your ID. Example: 2008CSB1001 (NO OTHER FORMAT IS ACCEPTABLE!!! Case sensitive!!!)
◦ The archive should include: Truck.pl, Gene.pl, Token.pl, and a README file
• Negative marks for any problems/errors in running your programs
o If any aspects of the tasks are confusing, make an assumption and state it clearly in your README
o README file should also have instructions on how to use/run your program!
o Submit/Upload it to Google Classroom
◦ Marks Allocation: Truck [5 points], Gene [5 points], Token [3 points], README [2 points]