Section 26.3 – cs 112: programming | computer science | University of Illinois at Chicago

Except where otherwise noted, this and all course materials for CS 112 are licensed under Attribution-NonCommercial-ShareAlike CC BY-NC-SA held by the Trustees of the University of Illinois (University of Illinois at Chicago).

Learning objectives:

  • Working with strings.
  • Slicing strings.
  • Basic functions.
  • Working with GenBank.
  • Understanding the connection between DNA, mRNA, and proteins.

Sequences in GenBank

On your computer, use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank. Once there, find a nucleotide sequence for the human coagulation factor IX, sometimes called the “Christmas factor” (F9) gene. In other words, find a DNA sequence for the gene that encodes the coagulation factor IX protein. This is found by using the search area at the top of the GenBank web page. You are looking for a specific “accession”—a sequence submission record—with the accession ID: NG_007994.

To summarize, we need to find the nucleotide sequence for the human coagulation factor IX, sometimes called the “Christmas factor” (F9) gene. To do this we:

  1. Use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank
  2. Use the search area at the top of the GenBank web page
  3. Make sure we are searching for a Nucleotide (select Nucleotide using the drop down menu).
  4. Enter the accession ID: NG_007994 in the search field
  5. Click search
  6. Verify the page we go to specifies NCBI Reference Sequence: NG_007994.1 just under the main title.

Structure of Eukaryotic Genes

Eukaryotic genes (like F9) are composed of messenger RNA (mRNA)-coding sequences called exons (expressed portions of the DNA sequence) and intervening sequences called introns (the name emphasizes their intervening role). The whole gene, introns included, is first transcribed into pre-mRNA. Intron sequences in pre-mRNA are non-coding and are removed after transcription, before the mRNA is translated into protein. The exons are then joined together (concatenated) to form mature mRNA. The process of removing introns and reconnecting exons is called ‘splicing.’ Mature mRNA consists of a coding sequence (CDS) flanked by untranslated regions (UTRs) at the 5′ and 3′ ends. The coding sequence, the portion of mRNA that codes for amino acids, is read in three-base units called codons.
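As a rough illustration of splicing, the sketch below uses an invented toy sequence (not real F9 data) in which uppercase runs stand in for exons and lowercase runs for introns. The exon positions and variable names are assumptions made for this example, and untranslated regions are ignored to keep it short.

```python
# Toy pre-mRNA: uppercase runs stand in for exons, lowercase runs for introns.
# This sequence is invented for illustration and is not taken from F9.
pre_mrna = "AUGGCC" + "guaaguuuag" + "AAAUUU" + "guccccag" + "GGGUAA"

# Hypothetical exon positions as 0-based, end-exclusive (start, end) pairs.
exons = [(0, 6), (16, 22), (30, 36)]

# Splicing: drop the introns and concatenate the exons in order.
mature = "".join(pre_mrna[start:end] for start, end in exons)

# Read the spliced sequence three bases at a time to get its codons.
codons = [mature[i:i + 3] for i in range(0, len(mature), 3)]

print(mature)   # AUGGCCAAAUUUGGGUAA
print(codons)   # ['AUG', 'GCC', 'AAA', 'UUU', 'GGG', 'UAA']
```

The same slice-and-concatenate idea applies to a DNA string, which is how the sequence appears in GenBank.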

The amino acid coding portions (CDS), along with other gene features, are annotated on the left side of the description in GenBank records. For example, you will see something similar to this in the annotations for the F9 gene:

CDS         join(5030..5117,11275..11438)

The actual line on the GenBank page will be much longer (i.e. containing more than just the ranges for two exons) but the first two ranges match exactly what is given above.

The word join in a GenBank record is analogous to a function in Python. It is an instruction to slice out and join (concatenate) the segments separated by commas within parentheses. The resulting string represents the amino acid coding sequence (CDS). Assuming we have the entire F9 gene sequence stored in a variable F9, the example above could be written in Python as:

cds = F9[5029:5117] + F9[11274:11438]

Caution: Python indexes start at 0, but GenBank coordinates start at 1. GenBank ranges also include both endpoints, while a Python slice excludes its end index. That is why each start coordinate is reduced by 1 in the Python code above (5030 becomes 5029) while each end coordinate is unchanged (5117 stays 5117). Failing to adjust indexes correctly is a common mistake in computer science, and the resulting bugs are known as off-by-one errors. While seemingly trivial, these errors can have serious consequences.
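One way to keep the coordinate adjustment in a single place is a small helper that accepts GenBank-style ranges and does the conversion itself. The function name join_ranges and the toy string below are hypothetical, chosen only for this sketch; they are not part of the lab template.

```python
def join_ranges(sequence, ranges):
    """Concatenate 1-based, inclusive (start, end) ranges of sequence."""
    pieces = []
    for start, end in ranges:
        # Subtract 1 from the start to move to 0-based indexing.
        # Leave the end alone: the inclusive end position has 0-based
        # index end - 1, and a slice stops just before its end index.
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


toy = "ABCDEFGHIJ"  # letters make the selected positions easy to check by eye
print(join_ranges(toy, [(2, 4), (7, 9)]))  # BCDGHI
```

A quick sanity check for any converted range is that the slice length equals end - start + 1, the number of positions the GenBank range covers.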

Assignment Description

  1. Write a function named extract_f9_cds which has one parameter, F9, the F9 gene sequence. The goal of this function is to extract the coding regions from the F9 gene sequence (provided in the template), concatenate them, and return the resulting string. Hint: You can confirm your program is functioning correctly by clicking on the CDS annotation in GenBank. This will highlight the relevant parts of the sequence, which should match your output.
  2. Write a function named get_max_possible_codons which has one parameter seq and returns the maximum number of codons this DNA sequence would contain if it were wholly composed of coding regions. Remember that each codon is made up of 3 nucleotide bases.
  3. Write a function named get_gc_percent which has one parameter seq. The goal of this function is to compute the proportion of G and C bases (characters) in seq to the total number of bases (characters) in seq. The returned value should be of type float in the range between 0.0 and 100.0 (as a percentage, not a fraction). To do this, use the string method count( ) to determine the number of ‘G’ bases and the number of ‘C’ bases.
  4. Write a function named get_coding_ratio which has two parameters seq and cds. The goal of this function is to calculate the proportion of coding nucleotides to total nucleotides in the entire sequence. In other words: of the total number of nucleotides in the gene (seq), what is the proportion that codes for amino acids (cds)? Remember that a ratio will be a value of type ‘float’ in the range between 0.0 and 1.0.
  5. Write a function named print_seq_info which has two parameters seq and cds. This function should use the functions you wrote for problems 1 through 4 and print a correctly formatted summary:
    Sequence length: ...
    Coding sequence length: ...
    Number of possible codons: ...
    Number of actual codons: ...
    First 4 codons of the coding sequence: ...
    Ratio of Coding NT to Total NT: ...
    GC percent of the entire sequence: ...
    GC percent of the coding sequence: ...
    • The Sequence length: output should use the built-in len( ) function with the ‘seq’ parameter.
    • The Coding sequence length: output should use the built-in len( ) function with the ‘cds’ parameter.
    • The Number of possible codons: output should use your get_max_possible_codons( ) function with the ‘seq’ parameter.
    • The Number of actual codons: output should use your get_max_possible_codons( ) function with the ‘cds’ parameter.
    • The First 4 codons of the coding sequence: output should use slicing with the ‘cds’ parameter.
    • The Ratio of Coding NT to Total NT: output should use the get_coding_ratio( ) function with both the ‘seq’ and ‘cds’ parameters.
    • The GC percent of the entire sequence: output should use the get_gc_percent( ) function with the ‘seq’ parameter.
    • The GC percent of the coding sequence: output should use the get_gc_percent( ) function with the ‘cds’ parameter.
  6. Write a few sentences explaining what this gene is and what its protein does, state the name of a disease caused by a variant (mutation) at the F9 gene, and describe one such disease-causing variant. Hint: look in the right panel on GenBank, or use the web. (You can write your answer in the same file as your Python code by commenting out the text. The starter code for Lab 3 already has a place for this near the top of the file.)
  7. Make sure you are writing your code using Good Programming Style. Aspects of Good Programming Style include (but are not limited to):
    • File Header Comment/docstring at the beginning of the file to describe the purpose of the program
    • File Header Comment/docstring at the beginning of the file to give information about the programmer/author of the program
    • Function Comments/docstrings to describe the purpose of EACH function
    • Using meaningful variable names
    • In-line comments/docstrings where needed
    • Blank lines to separate sections of your code
    • Proper use of indentation and consistent depth of indentation
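
The arithmetic behind the codon count, GC percentage, and coding ratio can be tried out on a small invented sequence before working with the full gene. The sketch below is only an illustration under stated assumptions: the toy data and the helper names (count_whole_codons, gc_percent, coding_ratio) are hypothetical and are not the names the assignment requires, and the helpers assume non-empty sequences. It also shows the header docstring, function docstrings, and spacing described in the style list.

```python
"""Toy walk-through of basic sequence statistics on an invented sequence.

Author: (your name here)
"""

# Invented 16-base sequence and its 9-base coding portion (not real F9 data).
toy_seq = "ATGGCCTTAGGCATAA"
toy_cds = toy_seq[0:6] + toy_seq[13:16]   # "ATGGCC" + "TAA"


def count_whole_codons(seq):
    """Return how many complete 3-base codons fit in seq."""
    return len(seq) // 3


def gc_percent(seq):
    """Return the percentage (0.0 to 100.0) of bases in seq that are G or C."""
    gc_bases = seq.count("G") + seq.count("C")
    return 100.0 * gc_bases / len(seq)


def coding_ratio(seq, cds):
    """Return the fraction (0.0 to 1.0) of seq's length covered by cds."""
    return len(cds) / len(seq)


print("Codons in toy_seq:", count_whole_codons(toy_seq))   # 5
print("First 2 codons of toy_cds:", toy_cds[:6])           # ATGGCC
print("GC percent of toy_seq:", gc_percent(toy_seq))       # 43.75
print("Coding ratio:", coding_ratio(toy_seq, toy_cds))     # 0.5625
```

Integer division (//) discards any leftover bases that do not form a full codon, which is why a 16-base sequence yields 5 codons rather than 5.33.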