Create a script that analyzes various features of a bacterial genome.

The goal of this project is to create a script that analyzes various features of a bacterial genome.
You are given the following types of files to analyze:

Sequence file is a FASTA file containing the DNA sequence of a bacterial species. The DNA is organized into 1 or more chromosomes.

Annotation file is a text file containing tab-delimited data for each gene. It should have a header line containing the following five columns: <GeneName><Chromosome><Strand><Start><Stop>. Each subsequent line contains information for a single gene. Assume the coordinate system is 1-based.

Your script should take the following arguments:
Positional arguments

Sequence file – required, must be a string
Annotation file – required, must be a string

Optional arguments

Codon analysis flag – optional, should not take a value
Gene sequence flag – optional, should take 1 or more gene names whose sequences should be returned

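The argument list above could be wired up with argparse roughly as follows. This is only a sketch: the short flags -c and -g match the example commands later in this handout, but the long names (--codon, --genes) and the attribute names are assumptions, and the provided template script may use different ones.

```python
import argparse


def build_parser():
    """Build a parser for the two positional and two optional arguments."""
    parser = argparse.ArgumentParser(
        description="Analyze various features of a bacterial genome."
    )
    # Positional arguments: both are required and stored as strings.
    parser.add_argument("sequence_file", type=str,
                        help="FASTA file containing the genome sequence")
    parser.add_argument("annotation_file", type=str,
                        help="tab-delimited gene annotation file")
    # Flag that takes no value: False unless -c is given.
    parser.add_argument("-c", "--codon", action="store_true",
                        help="report amino acid and codon usage")
    # Flag that takes one or more gene names; None when the flag is absent.
    parser.add_argument("-g", "--genes", nargs="+", metavar="GENE",
                        help="gene names whose protein sequences are printed")
    return parser


# parse_args returns a single Namespace object holding every argument.
args = build_parser().parse_args(["Seq.fa", "Annotation.txt", "-c"])
print(args.sequence_file, args.annotation_file, args.codon, args.genes)
```

In the real script, parse_args() would be called without a list so that it reads the command line.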
Your script should do the following things:
    A. Using argparse, take in all the above arguments and store them appropriately into a single object.
    B. Read in and perform error checking on the sequence and annotation files. For the sequence file, verify that:
    1. The file exists
    2. It is in proper FASTA format
    3. All nucleotides are A, C, G, or T (uppercase or lowercase are allowed)

For the annotation file, you should use pandas to read in the data. Verify that:

    1. The file exists
    2. It contains five columns
    3. The headers of the columns are named: GeneName, Chromosome, Strand, Start, Stop
    4. None of the genes have the same name
    5. Strand equals ‘+’ or ‘-’
    6. Start is less than Stop
    7. The length of the gene is divisible by 3
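One possible way to run the annotation checks with pandas is sketched below. The function name check_annotation is hypothetical. The sketch assumes that Stop is an inclusive coordinate (so gene length is Stop - Start + 1), that the columns appear in the listed order, and that Start and Stop parse as integers; the file-exists check is not shown here.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["GeneName", "Chromosome", "Strand", "Start", "Stop"]


def check_annotation(table):
    """Return a list of violation messages for an annotation DataFrame."""
    problems = []
    # Checks 2 and 3: exactly the five expected headers (order assumed).
    if list(table.columns) != EXPECTED_COLUMNS:
        problems.append(
            f"Expected columns {EXPECTED_COLUMNS}, found {list(table.columns)}"
        )
        return problems  # the remaining checks rely on the named columns
    # Check 4: gene names must be unique.
    repeated = table.loc[table["GeneName"].duplicated(), "GeneName"].unique()
    for name in repeated:
        problems.append(f"Gene name {name} appears more than once")
    for row in table.itertuples(index=False):
        # Check 5: strand must be + or -.
        if row.Strand not in ("+", "-"):
            problems.append(f"{row.GeneName}: strand {row.Strand!r} is not + or -")
        # Check 6, then check 7 only when the coordinates are ordered.
        if row.Start >= row.Stop:
            problems.append(f"{row.GeneName}: Start is not less than Stop")
        elif (row.Stop - row.Start + 1) % 3 != 0:
            problems.append(f"{row.GeneName}: length is not divisible by 3")
    return problems


# Small in-memory example; a real run would pass the file path to read_csv.
example = io.StringIO(
    "GeneName\tChromosome\tStrand\tStart\tStop\n"
    "geneA\tchr1\t+\t1\t9\n"
    "geneA\tchr1\t*\t20\t12\n"
)
table = pd.read_csv(example, sep="\t")
for message in check_annotation(table):
    print(message)
```

Collecting messages in a list, rather than stopping at the first failure, makes it straightforward to report every violation at once.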

If any of these conditions are violated, the program should print an informative statement of all of the violations and quit the program.
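The sequence-file checks and the report-everything-then-quit behavior might look something like the sketch below. The function names are hypothetical, and "proper FASTA format" is interpreted narrowly here (every sequence line must follow a '>' header that has a name, and at least one record must exist); the assignment may intend a stricter definition.

```python
import os
import sys


def parse_fasta(lines):
    """Parse FASTA lines into {name: sequence}, collecting problems as text."""
    sequences, problems, name = {}, [], None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the record name is the first word after '>'.
            fields = line[1:].split()
            if fields:
                name = fields[0]
            else:
                name = f"unnamed_{number}"
                problems.append(f"Line {number}: header has no name")
            sequences[name] = ""
        elif name is None:
            problems.append(f"Line {number}: sequence data before any header")
        else:
            invalid = sorted(set(line.upper()) - set("ACGT"))
            if invalid:
                problems.append(
                    f"Line {number}: invalid nucleotide(s) {''.join(invalid)}"
                )
            sequences[name] += line.upper()
    if not sequences:
        problems.append("No FASTA records found")
    return sequences, problems


def read_fasta(path):
    """Check that the file exists, then parse it."""
    if not os.path.isfile(path):
        return {}, [f"Sequence file not found: {path}"]
    with open(path) as handle:
        return parse_fasta(handle)


def quit_if_problems(problems):
    """Print every collected violation, then stop the program."""
    if problems:
        for message in problems:
            print(message)
        sys.exit(1)


records, problems = parse_fasta([">chr1 example", "ACGTNN", "acgt"])
print(records)
print(problems)
```

In a full script, the problem lists from the sequence file and the annotation file could be concatenated and passed to quit_if_problems once, so that all violations are printed before the program exits.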

    C. If no optional arguments are given, your script should report: name, length, number of genes, and GC content for each of the chromosomes.
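A minimal sketch of the per-chromosome summary is given below, assuming the sequences are already held in a dictionary keyed by chromosome name and that the Chromosome column of the annotation is available as a plain list. The function names and the printed layout are placeholders, not the expected output format.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def summarize(chromosomes, gene_chromosomes):
    """Return (name, length, gene count, GC fraction) for each chromosome.

    chromosomes: dict mapping chromosome name to its DNA sequence
    gene_chromosomes: one chromosome name per annotated gene
    """
    genes_per_chromosome = Counter(gene_chromosomes)
    return [
        (name, len(sequence), genes_per_chromosome[name], gc_content(sequence))
        for name, sequence in chromosomes.items()
    ]


chromosomes = {"chr1": "ATGCGC", "chr2": "ATAT"}
gene_chromosomes = ["chr1", "chr1", "chr2"]
for name, length, genes, gc in summarize(chromosomes, gene_chromosomes):
    print(f"{name}\t{length}\t{genes}\t{gc:.1%}")
```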

    D. If the codon analysis option is used, your script should report the amino acid and codon usage for the entire genome (i.e., how often each amino acid is used within all of the proteins and how often each codon is used for a given amino acid). For example:

A 5.5% - GCA: 23%; GCC: 37%; GCG: 21%; GCT: 19%
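The usage calculation in the example line can be sketched as follows. The sketch assumes the standard genetic code, that each input is a coding sequence already oriented 5' to 3' on its own strand, and that stop codons are counted under "*"; whether stops belong in the report is not stated here. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(coding_sequences):
    """Map amino acid -> (share of all codons, {codon: share within amino acid})."""
    codon_counts = Counter()
    for sequence in coding_sequences:
        sequence = sequence.upper()
        for start in range(0, len(sequence) - 2, 3):
            codon_counts[sequence[start:start + 3]] += 1
    by_amino_acid = defaultdict(dict)
    for codon, count in codon_counts.items():
        by_amino_acid[CODON_TABLE[codon]][codon] = count
    total = sum(codon_counts.values())
    usage = {}
    for amino_acid, counts in by_amino_acid.items():
        subtotal = sum(counts.values())
        usage[amino_acid] = (
            subtotal / total,
            {codon: count / subtotal for codon, count in counts.items()},
        )
    return usage


def format_usage(usage):
    """Render one line per amino acid in the style of the example above."""
    lines = []
    for amino_acid in sorted(usage):
        share, codons = usage[amino_acid]
        parts = "; ".join(
            f"{codon}: {fraction:.0%}" for codon, fraction in sorted(codons.items())
        )
        lines.append(f"{amino_acid} {share:.1%} - {parts}")
    return lines


for line in format_usage(codon_usage(["ATGGCAGCCGCCTAA"])):
    print(line)
```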

    E. If the gene sequence option is used, your script should print the protein sequence for each of the requested genes in FASTA format.
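Extracting and translating one gene could be approached along these lines. This is a sketch under several assumptions: Start and Stop are 1-based and inclusive, genes on the '-' strand are reverse-complemented before translation, the standard genetic code applies, and a trailing stop symbol is dropped from the protein. Handling of requested names that are not in the annotation is not shown. All function names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(sequence):
    """Complement each base, then reverse the result."""
    return sequence.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gene_protein(chromosome_sequence, strand, start, stop):
    """Translate the gene at 1-based inclusive coordinates start..stop."""
    # 1-based inclusive start..stop corresponds to the slice [start - 1:stop].
    dna = chromosome_sequence[start - 1:stop].upper()
    if strand == "-":
        dna = reverse_complement(dna)
    protein = "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )
    # Assumption: the trailing stop codon is not shown in the output.
    return protein.rstrip("*")


def to_fasta(name, protein, width=60):
    """Format a FASTA record with sequence lines of at most `width` characters."""
    lines = [f">{name}"]
    lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines)


forward = gene_protein("CCATGGCATAAGG", "+", 3, 11)
reverse = gene_protein("GGTTATGCCATCC", "-", 3, 11)
print(to_fasta("geneA", forward))
print(to_fasta("geneB", reverse))
```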

We’ve provided a template script for you to use. Some example outputs for this script are given below:
OUTPUTS

General Usage

Help documentation:
<user>$ python3 Assignment3_Solution.py -h

Base case:
<user>$ python3 Assignment3_Solution.py Seq.fa Annotation.txt

Codon flag:

<user>$ python3 Assignment3_Solution.py Seq.fa Annotation.txt -c

Gene flag:

<user>$ python3 Assignment3_Solution.py Seq.fa Annotation.txt -g fadA fadB X recF VV_RS00470

Error Handling

Missing command inputs:
<user>$ python3 Assignment3_Solution.py

Missing files:

<user>$ python3 Assignment3_Solution.py Seq2.fa Annotation.txt
<user>$ python3 Assignment3_Solution.py Seq2.fa Annotation2.txt

Bad input files:

<user>$ python3 Assignment3_Solution.py SeqError1.fa Annotation.txt
<user>$ python3 Assignment3_Solution.py SeqError2.fa Annotation.txt
<user>$ python3 Assignment3_Solution.py Seq.fa AnnotationError1.txt

Bad input files (duplicate gene):

<user>$ python3 Assignment3_Solution.py Seq.fa AnnotationError2.txt
