[0, :]: Python bits

This chapter (and the entire course) assumes familiarity with basic coding concepts such as flow control, variables and booleans. Prior knowledge of Python is not required: I will briefly summarize what the reader needs to know to understand everything that follows.
Throughout these lessons we will use Python 3 (see this blog post if you want a reminder of all the changes from Python 2, otherwise ignore it).

Lists and Dictionaries

Lists

Lists are Python’s arrays, or vectors. They can contain elements of any type, even mixed together (and even other lists). A list is defined by square brackets:

myEmptyList = [] 
anotherList = [1, 'birds', ['anotherList', 4] ]

Lists can be accessed by index, starting from 0:

myList = [42, 84, 126] 
print('the Answer is:', myList[0] ) 
'the Answer is: 42' 
#List indexes can also be read backwards, 
#with the last position having index -1
print('the Answer is still:', myList[-3] ) 
'the Answer is still: 42'

Lists can be sliced; this is useful if only a part of the list is needed:

myList = [42, 84, 126] 
print( myList[0:2] ) 
[42, 84] 
#The syntax is myList[start:end], where end is not included
print( myList[1:] ) 
[84, 126] 
#if start or end (or both) are omitted, the slice keeps 
#going until the left or right end of the list 
#Important! 
#The list itself is left untouched: slicing returns a new list 
#with the selected elements, while myList 
#still contains all its starting elements.
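As a minimal sketch of the slicing rules above, the snippet below also shows negative indexes inside a slice and the optional third field (the step), which the text does not cover. The variable names are chosen only for illustration.

```python
my_list = [42, 84, 126, 168]

first_two = my_list[:2]      # start omitted: begin at index 0
last_two = my_list[-2:]      # end omitted: go on to the end of the list
every_other = my_list[::2]   # optional third field: the step

print(first_two)    # [42, 84]
print(last_two)     # [126, 168]
print(every_other)  # [42, 126]
print(my_list)      # the original list is unchanged: [42, 84, 126, 168]
```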

Elements can be added to an existing list with the append method. Being a method, it is called on the list itself, connected by a dot:

myList = [] 
myList.append('Goodbye!') 
print(myList) 
['Goodbye!']

This method is extremely useful when creating a list iteratively:

story="Goodbye my love and bring my thanks to Sebastian for his gift. All day under the sun, the fish looked happy." 
l= [] 
for i in [0, 3, 6, 9, 12, 15, 18]: 
    #for loops are explained in more detail later
    l.append(story.split()[i]) 
#...

The string method join combines all the elements of a list of strings into a single string:

#... 
separator = ' ' 
#I want the list elements to be separated by a space 
better_story = separator.join(l) 
#the syntax seems weird but remember that separator is a string 
#and join is a string method. The argument of join is the list. 
print(better_story) 

'Goodbye and thanks for All the fish'

Dictionaries

Dictionaries are objects made of pairs of elements, called key and value respectively. As in a paper dictionary, we can look up a key to find its corresponding value.
A dictionary’s characteristic signature is the curly brackets, and its elements are accessed by key instead of by position, e.g.:

firstDict= { 'firstKey': 2, 'secondKey': 'Wabba' } 
#values may be of any data type, while keys must be immutable (hashable) 
#objects such as strings, numbers or tuples; I suggest 
#not to use numbers as keys, it may lead to confusion 
print(firstDict['secondKey']) 
'Wabba' 
#to add an entry to the dictionary, simply assign a value to a new key 
firstDict['thirdKey']= 'Lubba' 
print(firstDict) 
{ 'firstKey': 2, 'secondKey': 'Wabba', 'thirdKey': 'Lubba' }

A value can be another dictionary, or a list:

secondDict= { 'anotherKey': firstDict, 'lastKey': ['Dub']*2 } 
#is a valid dictionary 

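Lists are accessed by index, so the position of each element matters. Dictionaries are accessed by key instead: since Python 3.7 they remember the order in which entries were inserted, but you cannot ask for the entry at a given position. Older Python versions did not guarantee any order at all, so it is safer to think of a dictionary simply as a collection of key-value pairs.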
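Looking up a key that does not exist raises a KeyError. The following sketch shows three common ways of working with keys safely: the in operator, the get method and the items method. The names first_dict, has_third, third and pairs are illustrative only.

```python
first_dict = {'firstKey': 2, 'secondKey': 'Wabba'}

# 'in' tests whether a key exists, without raising any error
has_third = 'thirdKey' in first_dict

# get() returns a fallback value when the key is missing
third = first_dict.get('thirdKey', 'no such key')

# items() yields the (key, value) pairs one at a time
pairs = []
for key, value in first_dict.items():
    pairs.append((key, value))

print(has_third)  # False
print(third)      # no such key
print(pairs)      # [('firstKey', 2), ('secondKey', 'Wabba')]
```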

DefaultDict

The collections module provides the wonderful defaultdict. It behaves exactly like a dictionary, except that accessing a missing key does not raise an error: the key is created on the fly with a default value.
Consider this situation:

"""Task: info is a list which contains several strings from a file. Each of these strings 
can be split in 2 fields: a method used and a score obtained (it doesn't matter 
what kind of score, just keep following me). We need to aggregate all the score per method, 
so we want to build a dictionary with methods as keys and scores as a list of floats. 
We don't know in advance how many methods will be in the file. The crucial part is 
that there will be many scores per method. """ 

standard_dict = {} 
for row in info: 
    spl = row.split()       #split the current row, not the whole list 
    method=spl[0] 
    score=float(spl[1])     #scores are stored as floats 
    standard_dict[method].append(score)

This will clearly raise an exception (a KeyError) as soon as the first method is looked up in the dictionary. The append method can only be called on a list, and standard_dict[method] does not exist yet. Therefore we need lists as values, but we can only create them once we know what the keys are. Let’s modify the code:

standard_dict = {} 
for row in info: 
    spl = row.split()       #split the current row, not the whole list 
    method=spl[0] 
    score=float(spl[1])     #scores are stored as floats 
    if method not in standard_dict: 
        standard_dict[method] = []  
    standard_dict[method].append(score)

In this way, every time a new key pops up an empty list is created for its values, and the append call won’t fail. This is exactly the situation defaultdict was created for. It automatically gives a missing key a default value of a given type.

from collections import defaultdict 
def_dict = defaultdict(list) 
for row in info: 
    spl = row.split()       #split the current row, not the whole list 
    method=spl[0] 
    score=float(spl[1])     #scores are stored as floats 
    def_dict[method].append(score)

The defaultdict argument is the data type we want to assume as default (other possibilities: int, set, str, …)
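As an illustration of a default type other than list, here is a minimal sketch that counts characters with defaultdict(int). The short sequence is made up for the example.

```python
from collections import defaultdict

seq = "AGGACUACUCGU"        # hypothetical short RNA sequence
counts = defaultdict(int)   # a missing key starts from int(), that is 0
for nucleotide in seq:
    counts[nucleotide] += 1

print(dict(counts))  # {'A': 3, 'G': 3, 'C': 3, 'U': 3}
```

Without defaultdict, the line counts[nucleotide] += 1 would fail on the first occurrence of each nucleotide, for the same reason explained above.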


Flow Controls

All the main flow controls will be presented as toy code; there should not be any obstacles in applying them straightforwardly.

For loops

for var in iterable: 
    do_something 
    do_something_else 
code_outside_for_loop

Easy enough, right?
The main questions you should ask are:

  • What is an iterable?
  • How does Python distinguish code inside the loop from code outside it?

From the Python docs, an iterable is “an object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict”.
In practice an iterable is anything you can run a for loop on. The for loop simply accesses one element of the iterable at a time and assigns it to the loop variable (which is overwritten at each iteration and, after the loop, keeps the last value it was given).
A string is an iterable, e.g.:

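myvar="How do you do?" 
for ch in myvar: 
    #ch is just a temporary variable 
    print(ch) 
#every character is printed on its own line 
#(the lines holding the three spaces are omitted below) 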
H 
o 
w 
d 
o 
y 
o 
u 
d 
o 
?

As for the way the code is interpreted, indentation is the answer.
Everything inside the for loop must be indented by one level (four spaces by convention). Every time you want to specify that a portion of code is inside a flow control structure, it must be indented. Clearly, multiple levels of indentation are needed for nested for loops.

myList=[ 'Hi', 'Mum' ]
#notice that there are two kinds of iterable here:
#the outside list and the inner strings.
for word in myList:
    #word will assume the values of 'Hi' and 'Mum'
    for letter in word:
        #I can iterate over each word to access letters one by one
        print(letter)
H
i
M
u
m

While loops

While loops are pretty much like in other languages. Their structure is:

while condition:
    do_something

where condition is anything that evaluates to either True or False (a boolean).
In Python a boolean value can come from a comparison (like in other languages):

tmp=3
print( tmp > 4 )
False
print( 5 % tmp == 2 )
True

or from the variables themselves. This is a useful feature of Python. Every built-in type has a value that evaluates to False:

myString=''
myInt=0
myFloat=0.0
myList=[]
myDict={}

While every other value evaluates to True:

falseString=''
trueString='Hello!'
falseList=[]
trueList=["I can be Anything!"]
#...and so on
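A common use of this feature is looping until a container is empty. The sketch below is only an illustration (the task names are made up): the list itself is the condition of the while loop.

```python
to_do = ['open file', 'parse file', 'write results']
done = []

# a non-empty list evaluates to True, an empty one to False
while to_do:
    task = to_do.pop(0)   # remove and return the first element
    done.append(task)

print(done)   # ['open file', 'parse file', 'write results']
print(to_do)  # []
```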

An example of while loop used to find prime numbers:

max_prime=10
#let's set up a limit for our search
p=2
while p < max_prime:
    for i in range(2,p):
        if p % i == 0:
            break
    else:
        print(p)
    p += 1

There is surely a lot going on up there. Let’s break it down:
While p (which is equal to 2 at the beginning) is less than our set limit, run a for loop.
In the for loop, the temporary variable i takes, one at a time, the values that come from the iterable range.
The range(start, end) function is easy to understand: it produces the integers from start to end, one at a time. Take note: the end is not included. In Python 3 range does not build a list by itself, but calling list(range(2,4)) will output [2, 3].
For each i we use an if statement (new entry!), which is straightforward as well. If the condition evaluates to True, then the indented code is executed (you can add else-if conditions with the elif statement). In this case the condition to evaluate is: is the remainder of the division of p by i equal to 0? If yes, we have found a divisor, and p is not prime. In that case, the break statement (in Python a command that is not followed by parentheses is called a statement, like if, for and while) simply breaks out of the innermost loop that contains it, here the for loop, and execution goes back to the body of the while loop.
If the for loop is not interrupted by a break, it runs to its end and reaches the else statement (beware: this else is indented at the for level and is not executed if the for hits a break). The variable p is printed because it does not have any divisors, and is therefore a prime number.
The last part increases the value of p by one (this syntax is equivalent to writing p = p+1). In this way we apply the whole algorithm to the next number, and so on until p breaks the condition stated at the beginning of the while loop.
Question: why does the range function start from 2?
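As a sketch of an alternative, the same search can be written with an outer for loop over range instead of a while loop with a manual counter. The for/else logic is identical; the list primes and the limit of 20 are chosen only for illustration.

```python
max_prime = 20
primes = []
for p in range(2, max_prime):
    for i in range(2, p):
        if p % i == 0:
            break             # divisor found: p is not prime
    else:
        primes.append(p)      # runs only if the inner loop never hit break

print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
```

Collecting the results in a list instead of printing them makes it possible to reuse them later in the script.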


Remember to give the while loop a reachable exit condition, or your code will be stuck forever (for loops, instead, usually come to an end naturally):

p=0
while p < 3:
    print(p)
    p += 1
    p %= 3

This seems easy enough to avoid, but in real-life code it happens more often than you imagine!
Question: what will the above code print?

Summary: Flow Control dos and don’ts

  • else statements can be used both for ifs and fors: the former executes if the condition is False, the latter if the loop comes to an end without encountering breaks.
  • Variables can be used as booleans.
  • Indentation is what defines which code belongs to a flow control statement.


Core methods and libraries

Zip

The zip function takes as input two or more lists and returns an iterator that yields the elements sharing the same index in all lists, until the shortest list ends. In practice, since it returns an iterator, it is best used directly in a for loop:

list1=['Agricola','Caylus','Puerto Rico']
list2=[8,7.9,8.1,10]
#since the zip iterator yields more than one value
#we can pick them both
for game,score in zip(list1,list2):
    #print already separates its arguments with a space
    print(game, "has a score of", score)
    #or in a more elegant way, using the string format method
    #print("{} has a score of {}".format(game,score))
Agricola has a score of 8
Caylus has a score of 7.9
Puerto Rico has a score of 8.1
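Another frequent use of zip, sketched here under the same toy data, is building a dictionary from a list of keys and a list of values. The names games, scores and score_by_game are illustrative.

```python
games = ['Agricola', 'Caylus', 'Puerto Rico']
scores = [8, 7.9, 8.1, 10]

# zip stops at the shortest input, so the extra score (10) is dropped
score_by_game = dict(zip(games, scores))

print(score_by_game)
# {'Agricola': 8, 'Caylus': 7.9, 'Puerto Rico': 8.1}
```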

Map

Map takes as input a function and an iterable, and returns an iterator that applies the function to each element. The iterable itself is not modified. A typical use is converting the elements of a list on the fly to print them on screen or write them to a file:

phosphorylated_positions=[12,1337,2442]
print( "This protein is phosphorylated at the following amino acids:", ", ".join( map(str, phosphorylated_positions)  )  )
"This protein is phosphorylated at the following amino acids: 12, 1337, 2442"

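Map is just as handy in the opposite direction, when numbers arrive as text. A minimal sketch, assuming a made-up line of whitespace-separated integers:

```python
row = "12 1337 2442"   # hypothetical line read from a file

# map returns an iterator, list() collects its values
positions = list(map(int, row.split()))

print(positions)       # [12, 1337, 2442]
print(sum(positions))  # 3791
```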
Enumerate

If you iterate over a list with a for loop, you have no direct access to the index of the element you are using at each iteration. The enumerate function behaves as if you had zipped your list with the list of numbers from 0 up to its last index. In this way you have direct access to each element’s index.
A simple use case is keeping track of all the positions in a primary RNA sequence where a special character is found. Assume we received the following sequence from a friendly lab:
AGGACUACUCGUMACUGCACMUUGGGGGGAACAGUMGUUGMAUAGCUAUGC
where M are methylated positions. With enumerate we can easily keep track of those indexes:

seq="AGGACUACUCGUMACUGCACMUUGGGGGGAACAGUMGUUGMAUAGCUAUGC"
positions=[]
#we will use a list to keep track of the indexes
for index, nucleotide in enumerate(seq):
    #values are compared with ==, not with is
    if nucleotide == 'M':
        positions.append(index)
print("\t".join(map(str,positions)))
"12	20	35	40"

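Biological positions are usually reported counting from 1. As a sketch, enumerate accepts an optional start argument that shifts the counter; the rest of the loop is the same as above.

```python
seq = "AGGACUACUCGUMACUGCACMUUGGGGGGAACAGUMGUUGMAUAGCUAUGC"
positions = []
# start=1 makes enumerate count from 1 instead of 0
for position, nucleotide in enumerate(seq, start=1):
    if nucleotide == 'M':
        positions.append(position)

print(positions)  # [13, 21, 36, 41]
```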
Itertools module

This module is imported to do combinatorial work. It allows you to create all the permutations and combinations of a set of elements, with or without replacement. These are very specific tasks, yet a few of its functions can be very useful in bioinformatics.
The product() function takes as input one or more iterables and a repeat parameter, and computes the cartesian product of their elements. A cartesian product is simply the equivalent of nested for loops. An example will clarify its use:

import itertools
#create all possible DNA 2-mers
print(list(itertools.product('ACGT', repeat=2)))
[('A', 'A'),
 ('A', 'C'),
 ('A', 'G'),
 ('A', 'T'),
 ('C', 'A'),
 ('C', 'C'),
 ('C', 'G'),
 ('C', 'T'),
 ('G', 'A'),
 ('G', 'C'),
 ('G', 'G'),
 ('G', 'T'),
 ('T', 'A'),
 ('T', 'C'),
 ('T', 'G'),
 ('T', 'T')]

The iterable was the string ‘ACGT’ and repeat indicated the length of each tuple. The call to list() is necessary if we want to “see” the product, since the functions of itertools return iterators: objects that emit their components one at a time when driven by the right tools (like for loops). Do not confuse iterators with iterables: every iterator is an iterable, but an iterable is not necessarily an iterator. A list, for example, is iterable but is a completely different object from an iterator: you can print it, index it and loop over it as many times as you like.
We would have achieved the same result by using nested for loops (yet with more, unnecessary code):

l=[]
for x in 'ACGT':
    for y in 'ACGT':
        #the number of nested loops equals the repeat parameter
        l.append((x,y))
print(l)
[('A', 'A'),
 ('A', 'C'),
 ('A', 'G'),
 ('A', 'T'),
 ('C', 'A'),
 ('C', 'C'),
 ('C', 'G'),
 ('C', 'T'),
 ('G', 'A'),
 ('G', 'C'),
 ('G', 'G'),
 ('G', 'T'),
 ('T', 'A'),
 ('T', 'C'),
 ('T', 'G'),
 ('T', 'T')]
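The distinction between iterators and iterables drawn above has a practical consequence: an iterator can be consumed only once. A minimal sketch, with a shorter alphabet to keep the output small (the variable names are illustrative):

```python
import itertools

kmers = itertools.product('AC', repeat=2)   # an iterator
first_pass = list(kmers)    # consumes the iterator
second_pass = list(kmers)   # nothing is left to emit

print(first_pass)   # [('A', 'A'), ('A', 'C'), ('C', 'A'), ('C', 'C')]
print(second_pass)  # []

bases = ['A', 'C']          # a list is iterable, but not an iterator
print(list(bases), list(bases))  # it can be read as many times as needed
```

If the values are needed more than once, store them in a list first, or call the itertools function again.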

The permutations(), combinations() and combinations_with_replacement() functions do exactly what their names promise: they return iterators over the corresponding combinatorial arrangements of their arguments (which must be iterables):

from itertools import permutations, combinations, combinations_with_replacement
#extract all possible permutations of an RNA sequence
sequence="UCUGUCUCUU"
perms=permutations(sequence)
#extract all possible permutations from subset of fixed length
perms_fix_len=permutations(sequence, 4)
###
#Combinations draws all combinations of size k from the iterable
#this task is suitable for creating sets of groups. Let's assume
#we want to compare (maybe align) every sequence with the others, pairwise
seqs= ['UCUGUCUCUU','ACCGUAUAGCUUUUUA','GAAAUUCGAACAACCUAG']
k=2
combs= combinations(seqs, k)
#a call to list(combs) would give:
[('UCUGUCUCUU', 'ACCGUAUAGCUUUUUA'),
 ('UCUGUCUCUU', 'GAAAUUCGAACAACCUAG'),
 ('ACCGUAUAGCUUUUUA', 'GAAAUUCGAACAACCUAG')]
for seq1, seq2 in combs:
    #suppose we have access to a generic align() function
    align(seq1,seq2)
#combinations with replacement has an identical syntax
#but it can also be used with k > the number of elements of the iterable.
combinations_with_replacement(seqs, k)
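To compare the three functions side by side, here is a small sketch on a three-letter alphabet, where the results are short enough to inspect. The input string and the variable names are chosen only for illustration.

```python
from itertools import permutations, combinations, combinations_with_replacement

bases = 'ACG'
perms = list(permutations(bases, 2))   # order matters, no repeated element
combs = list(combinations(bases, 2))   # order does not matter
combs_rep = list(combinations_with_replacement(bases, 2))  # repeats allowed

print(len(perms), len(combs), len(combs_rep))  # 6 3 6
print(combs)      # [('A', 'C'), ('A', 'G'), ('C', 'G')]
print(combs_rep)
# [('A', 'A'), ('A', 'C'), ('A', 'G'), ('C', 'C'), ('C', 'G'), ('G', 'G')]
```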

Question: why didn’t I write the results of permutations on screen as usual?


Summary: Core methods and libraries dos and don’ts


Function definition, Write your own library

It’s important to keep your code tidy. Besides being a pleasure for the eye, well-organized code is more maintainable. You cannot imagine how many times you will go back to a script you wrote months before, only to spend an entire day figuring out how it could possibly have worked.
Since I am preaching about modularization, the least I can do is write this section on a different page: Tidying up.


File Manipulation

As bioinformaticians, you are going to do a lot of file opening, parsing and writing. Let’s briefly go over the main functions we have at our disposal. Jump to File Handling

Code examples:

1-function definitions and time computation
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les1/count.py
2-list comprehensions, zip
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les2/Lesson2.ipynb
3-itertools, dictionaries
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les3/Lesson3.ipynb