Programming Project 06 Solution

Assignment Overview

This assignment will give you more experience with the use of:
    1. Lists and tuples
    2. Functions
    3. File manipulation

The goal of this project is to extract gene lengths from a gene annotation file. With a gene annotation GFF file, you will need to extract the gene coordinates on each chromosome and calculate the average and standard deviation of gene lengths.

Assignment Background

The eukaryotic genome is composed of multiple chromosomes. On each chromosome, there are multiple genes. In bioinformatics, genome annotations can be saved in a file format called GFF. The NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/) holds many publicly available annotated organisms. These annotated genomes can be downloaded in multiple file formats, including GFF. For this project, we will focus on a relatively simple model species: Caenorhabditis elegans. This worm has a genome of six chromosomes named chrI, chrII, chrIII, chrIV, chrV, and chrX.

We provide two input files:
C.elegans_small.gff    # a small file for development
C.elegans.gff          # a real BIG data file

Project Description

    a) open_file() prompts the user to enter a filename. The program will try to open a tab-separated GFF file (a text file). An error message should be shown if the file cannot be opened. This function loops until it receives proper input and successfully opens the file. It returns a file pointer.
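
The loop described above can be sketched as follows. This is a minimal sketch, not a reference solution: the prompt and error strings are copied from the test cases later in this handout, and catching OSError (rather than only FileNotFoundError) is an assumption about which failures should trigger a retry.

```python
def open_file():
    """Prompt for a file name until a file can be opened; return the file pointer."""
    while True:
        filename = input("Input a file name: ")
        try:
            # Open in text mode; GFF files are tab-separated text.
            return open(filename, "r")
        except OSError:
            # Assumption: any OS-level failure (missing file, directory, ...)
            # is reported the same way and the user is asked again.
            print("Unable to open file.")
```

The caller is responsible for closing the returned file pointer once read_file() has finished with it.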

    b) read_file(fp) receives a file pointer to the data file and reads all the gene information. For this project, we are only interested in the following columns: the chromosome name (string) is in column 0, the gene_start is in column 3, and the gene_end is in column 4. Convert number strings to int. No other values are needed for this project. If a value is missing, use 0 as the value.
For each gene, save it in a tuple, (chromosome, gene_start, gene_end), and append each tuple to a list of genes. Sort the list and then return the sorted list of genes (sorting makes a canonical list for comparison testing on Mimir).
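
One way to read the three columns is sketched below. Several details are assumptions rather than requirements stated here: the helper name to_int is hypothetical, a "missing" value is taken to mean an empty or non-numeric field, chromosome names are lowercased to match the lowercase names in the expected output, and every data line is assumed to have at least five columns. The sample lines are made up for illustration; only the coordinates echo the small test file.

```python
import io


def to_int(text):
    """Convert a number string to int; a missing or non-numeric value becomes 0."""
    try:
        return int(text)
    except ValueError:
        return 0


def read_file(fp):
    """Return a sorted list of (chromosome, gene_start, gene_end) tuples."""
    genes_list = []
    for line in fp:
        # Skip annotation lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        chromosome = columns[0].lower()  # assumption: names are stored lowercase
        gene_start = to_int(columns[3])
        gene_end = to_int(columns[4])
        genes_list.append((chromosome, gene_start, gene_end))
    genes_list.sort()
    return genes_list


# Made-up sample input; the third data line has an empty gene_start field.
sample = io.StringIO(
    "##gff-version 3\n"
    "chrII\tsource\tgene\t25\t175\t.\t+\t.\tID=a\n"
    "chrI\tsource\tgene\t3747\t3909\t.\t-\t.\tID=b\n"
    "chrI\tsource\tgene\t\t10148\t.\t-\t.\tID=c\n"
)
genes = read_file(sample)
print(genes)
```

Because tuples compare element by element, sorting orders the genes by chromosome name first, then by start, then by end.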

    b) extract_chromosome(genes_list, chromosome) receives a list of genes (such as what was returned by the read_file() function) and a chromosome name. It extracts the gene information for this chromosome and saves it in the list chrom_gene_list. Sort and return the list (sorting makes a canonical list for comparison testing on Mimir).
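
A minimal sketch of this filtering step, assuming the chromosome argument is already in the same lowercase form as the names stored in genes_list. The sample data is a subset of the list used in Function Test 2.

```python
def extract_chromosome(genes_list, chromosome):
    """Return the sorted list of genes whose chromosome name matches."""
    # Each gene is a (chromosome, gene_start, gene_end) tuple.
    chrom_gene_list = [gene for gene in genes_list if gene[0] == chromosome]
    chrom_gene_list.sort()
    return chrom_gene_list


genes_list = [
    ('chriv', 15499, 20899),
    ('chrv', 2851, 6511),
    ('chrv', 180, 329),
    ('chrx', 151, 263),
    ('chrv', 180, 329),
]
print(extract_chromosome(genes_list, 'chrv'))
```

An exact comparison on gene[0] matters here: a prefix test such as startswith('chri') would also match 'chrii', 'chriii', and 'chriv'.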

    c) extract_genome(genes_list) receives a list of genes and extracts the gene information for each chromosome. In this function, use extract_chromosome(genes_list, chromosome) to extract the genes for each chromosome (chrom_gene_list), then save the returned result in a list genome_list (a list of six chrom_gene_lists). Sort and return the list (sorting makes a canonical list for comparison testing on Mimir).
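
The per-chromosome grouping can be sketched like this. The block repeats a compact extract_chromosome() so it runs on its own, and it uses the CHROMOSOMES constant given in the Assignment Notes; the shortened sample list is for illustration only.

```python
CHROMOSOMES = ['chri', 'chrii', 'chriii', 'chriv', 'chrv', 'chrx']


def extract_chromosome(genes_list, chromosome):
    """Return the sorted list of genes on one chromosome."""
    return sorted(gene for gene in genes_list if gene[0] == chromosome)


def extract_genome(genes_list):
    """Return a sorted list holding one chrom_gene_list per chromosome."""
    genome_list = []
    for chromosome in CHROMOSOMES:
        chrom_gene_list = extract_chromosome(genes_list, chromosome)
        genome_list.append(chrom_gene_list)
    genome_list.sort()
    return genome_list


genes_list = [
    ('chrx', 151, 263),
    ('chri', 3747, 3909),
    ('chrii', 25, 175),
    ('chriii', 1271, 2917),
    ('chriv', 695, 14926),
    ('chrv', 180, 329),
    ('chri', 4221, 10148),
]
genome_list = extract_genome(genes_list)
print(genome_list)
```

Note that a chromosome with no genes contributes an empty list, and an empty list sorts before any non-empty one.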

    d) compute_gene_length(chrom_gene_list) receives chrom_gene_list, the list of genes for a specific chromosome (such as returned from the extract_chromosome() function). For each gene, compute the gene length and save the result in a list gene_length. Given the gene length list, compute the average gene length and the standard deviation for these genes. Save the results in a tuple with the average gene length first, followed by the standard deviation. Return that tuple.

The length of one gene, gene_len, is calculated as: gene_len = gene_end – gene_start + 1

(Hint: make a list that has the lengths of the genes.)

The gene_mean is the average length of all the genes. (Hint: consider using the sum() and len() functions, which is easy to do if your lengths are in a list.)

The gene_number is a count of all the genes.

The gene_stddev is the standard deviation of all the gene lengths, calculated according to the following formula. The summation in the formula sums across all genes. That is, for each gene subtract the mean from the length and square the difference; then sum those values. Take that sum and divide by the number of genes (gene_number); then take the square root (remember to import math). (Hint: “for each gene” can be done easily by walking through a list with a for loop, if your lengths are in a list.)


$$\text{gene\_stddev} = \sqrt{\frac{\sum \left(\text{gene\_len} - \text{gene\_mean}\right)^{2}}{\text{gene\_number}}}$$
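
The computation described above can be sketched as follows. It divides by gene_number (not gene_number - 1), as the text specifies, and assumes chrom_gene_list is not empty; an empty list would raise ZeroDivisionError. With the input from Function Test 4, the lengths are 151, 151, and 2797, which gives the mean of 1033.0 listed there.

```python
import math


def compute_gene_length(chrom_gene_list):
    """Return (gene_mean, gene_stddev) for the genes of one chromosome."""
    # gene_len = gene_end - gene_start + 1 for every gene.
    gene_length = [gene_end - gene_start + 1
                   for _, gene_start, gene_end in chrom_gene_list]
    gene_number = len(gene_length)
    gene_mean = sum(gene_length) / gene_number

    # Sum of squared differences from the mean, across all genes.
    squared_sum = 0.0
    for gene_len in gene_length:
        squared_sum += (gene_len - gene_mean) ** 2

    gene_stddev = math.sqrt(squared_sum / gene_number)
    return (gene_mean, gene_stddev)


chrom_gene_list = [('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)]
result = compute_gene_length(chrom_gene_list)
print(result)
```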


    e) display_data(chrom_gene_list, chrom) receives chrom_gene_list, the list of genes for a specific chromosome, as well as chrom, the chromosome name as a string. It displays the chromosome name, the average gene length, and the standard deviation (both obtained by calling compute_gene_length() within this function). The chromosome name must be displayed with the first three characters ‘chr’ in lower case and the remaining characters in upper case, e.g. ‘chrIV’. (Hint: slicing is your friend.)
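
A sketch of the display step, built around the slicing hint. The column widths in the format string are a guess, since the exact spacing Mimir expects is not stated in this handout; only the two-decimal rounding is taken from the test cases. A compact compute_gene_length() is repeated so the block runs on its own.

```python
import math


def compute_gene_length(chrom_gene_list):
    """Return (gene_mean, gene_stddev); assumes a non-empty list."""
    gene_length = [end - start + 1 for _, start, end in chrom_gene_list]
    gene_mean = sum(gene_length) / len(gene_length)
    variance = sum((n - gene_mean) ** 2 for n in gene_length) / len(gene_length)
    return (gene_mean, math.sqrt(variance))


def display_data(chrom_gene_list, chrom):
    """Print one row: chromosome name, mean gene length, standard deviation."""
    gene_mean, gene_stddev = compute_gene_length(chrom_gene_list)
    # First three characters in lower case, the rest in upper case.
    display_name = chrom[:3].lower() + chrom[3:].upper()
    # Assumption: the widths 11, 10, and 11 are illustrative only.
    print("{:<11s}{:>10.2f}{:>11.2f}".format(display_name, gene_mean, gene_stddev))


display_data([('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)], 'chrii')
```

Slicing keeps the name formatting independent of how the user typed it, so 'CHRIII', 'chriii', and 'ChrIii' all display as 'chrIII'.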

Assignment Deliverable

The deliverable for this assignment is the following file:

proj06.py – the source code for your Python program

Be sure to use the specified file name and to submit it for grading via the Mimir system before the project deadline.

Assignment Notes
    1. The gff input files we provide have lines starting with ‘#’ as file annotations; skip these lines when reading the gff file.
    2. To parse the lines in the gff file, use the split() method. You can split on the tab character.
    3. Use this constant for chromosome names:

CHROMOSOMES = ['chri','chrii','chriii','chriv','chrv','chrx']

    4. Items 1-9 of the Coding Standard will be enforced for this project.

Test Cases

Function Test 1: read_file

Input file: C.elegans_small.gff

Returns:
[('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585), ('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663), ('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753), ('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899), ('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511), ('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]

Function Test 2: extract_chromosome

genes_list = [('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585), ('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663), ('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753), ('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899), ('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511), ('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]
chromosome = 'chrv'

Returns:

[('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511)]

Function Test 3: extract_genome

genes_list = [('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585), ('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663), ('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753), ('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899), ('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511), ('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]

Returns:

[[('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585)], [('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)], [('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753)], [('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899)], [('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511)], [('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]]

Function Test 4: compute_gene_length

chrom_gene_list = [('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)]
Returns:

(1033.0, 1247.33636201307)

Test Case 1

Gene length computation for C. elegans.


Input a file name: C.elegans_small.gff


Enter chromosome or 'all' or 'quit': chri

Chromosome Length
chromosome       mean    std-dev
chrI          3678.67    2518.14

Enter chromosome or 'all' or 'quit': chriv

Chromosome Length
chromosome       mean    std-dev
chrIV         7313.00    5053.00

Enter chromosome or 'all' or 'quit': chrX

Chromosome Length
chromosome       mean    std-dev
chrX           125.33      17.44

Enter chromosome or 'all' or 'quit': quit

Test case 2

Gene length computation for C. elegans.


Input a file name: xxx

Unable to open file.

Input a file name: C.elegans_small.gff


Enter chromosome or 'all' or 'quit': xxx
Error in chromosome. Please try again.

Enter chromosome or 'all' or 'quit': chrII

Chromosome Length
chromosome       mean    std-dev
chrII         1033.00    1247.34

Enter chromosome or 'all' or 'quit': CHIII
Error in chromosome. Please try again.

Enter chromosome or 'all' or 'quit': CHRIII

Chromosome Length
chromosome       mean    std-dev
chrIII        3967.33    2658.87

Enter chromosome or 'all' or 'quit': aLL

Chromosome Length
chromosome       mean    std-dev
chrI          3678.67    2518.14
chrII         1033.00    1247.34
chrIII        3967.33    2658.87
chrIV         7313.00    5053.00
chrV          1320.33    1655.10
chrX           125.33      17.44

Enter chromosome or 'all' or 'quit': qUiT

Test Case 3

Gene length computation for C. elegans.


Input a file name: C.elegans.gff


Enter chromosome or 'all' or 'quit': all

Chromosome Length
chromosome       mean    std-dev
chrI          2542.65    4104.10
chrII         1879.71    2945.42
chrIII        2469.57    3761.81
chrIV          535.14    1949.55
chrV          1711.47    2687.29
chrX          1575.51    3110.69

Enter chromosome or 'all' or 'quit': quit

Test Case 4

Blind test.

Grading Rubric

General Requirements:

__0__ ( 5 pts) Coding Standard 1-9

(descriptive comments, function headers, etc...)
Implementation:

__0__ ( 4 pts) open_file function (no Mimir test)
__0__ ( 4 pts) read_file function

__0__ ( 4 pts) extract_chromosome function
__0__ ( 4 pts) extract_genome function

__0__ ( 4 pts) compute_gene_length function

__0__ ( 5 pts) Pass Test1

__0__ ( 5 pts) Pass Test2
__0__ ( 5 pts) Pass Test3

__0__ ( 5 pts) Pass Test4 (Blind Test)
