File Handling

Whether you end up building software for bioinformatics or simply (not that simply) doing data analysis, your work will necessarily pass through data files.
The main door is the open function, which takes as input the path of the file you want to open (absolute, or relative to the current working directory) and an optional parameter, the opening mode, which says whether you want to read the file (‘r’), write to it (‘w’) or append something to it (‘a’).
Write mode does not need the file to exist when open is called: the file will be created on the spot. Be careful though, because if the file already exists, ‘w’ will overwrite its content (to avoid this, use ‘a’ mode instead).

f = open('workfile', 'w')
#we can write to a file called workfile, located in the current
#working directory (usually the folder the script is run from)
#if mode is omitted, it defaults to 'r'
g = open('data/datafile')
#we want to read a file which is in a folder 'data'
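
The difference between the three modes can be checked on a throwaway file. The sketch below is only an illustration: the temporary directory and the text written are assumptions, not part of the original example. Each handle is closed right after use, for reasons explained just below.

```python
import os
import tempfile

# Work in a temporary directory so no real file is touched (assumption).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "workfile")

f = open(path, "w")      # 'w' creates the file because it does not exist yet
f.write("first\n")
f.close()

f = open(path, "a")      # 'a' keeps the existing content and adds to the end
f.write("second\n")
f.close()

f = open(path)           # mode omitted: defaults to 'r'
after_append = f.read()
f.close()

f = open(path, "w")      # reopening with 'w' empties the file first
f.write("third\n")
f.close()

f = open(path)
after_overwrite = f.read()
f.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```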

Opening does not actually read or write the file. Notice that we assigned the result of the open function to a variable. This is because the function returns a file object, a handler. What is a handler? Exactly what the name suggests: it is an object through which we can manipulate the file, which has been readied for reading or writing. As long as we have that variable, we have a direct stream to the file. At the end of our operations, we should remember to close the stream with the handler's close method. Closing files is important, since leaving them open can create conflicts during execution or when files are written concurrently (if two different functions try to open and write to the same file, there may be confusion). Some implementations of Python (like CPython) will close a file by themselves once its file object is no longer referenced, but not every implementation will.
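
As a minimal sketch of explicit closing (the temporary path and the text written are assumptions made for illustration), a try/finally block guarantees that close is called even if something goes wrong while writing. The closed attribute of the file object tells us whether the stream is still usable.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "workfile")  # hypothetical location

f = open(path, "w")
try:
    f.write("some data\n")
finally:
    f.close()             # runs even if write raises an exception

print(f.closed)           # True

write_failed = False
try:
    f.write("more data\n")
except ValueError:
    write_failed = True   # writing to a closed file raises ValueError
```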


Before explaining how to manipulate a file, note that a common good practice is to call open within a Python statement called with. This removes the need to call the close method:

with open('workfile', 'w') as f:
    pass #work with the file object f here
#outside the with statement the file object is closed.

This construct is way more pythonic and works like a charm (please refer to the docs for more details).
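
One reason to prefer with is that the file is closed even when the block is interrupted by an error. The following sketch simulates such a failure; the temporary path and the RuntimeError are assumptions chosen only to show the behaviour.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "workfile")  # hypothetical location

try:
    with open(path, "w") as f:
        f.write("partial result\n")
        raise RuntimeError("something went wrong")  # simulated failure
except RuntimeError:
    pass

print(f.closed)   # True: the with statement closed the file anyway

with open(path) as g:
    content = g.read()
print(repr(content))  # 'partial result\n', written before the error
```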
Once the file object is stored, we have several options to read it:
a for loop, and the readlines and readline methods.
The file object is iterable, so a for loop can access its contents, which are the lines (the chunks of text separated by ‘\n’). This method is suitable for big files, but you have no control over the flow: at each iteration you will be reading the next line. There may be times when this is not desirable, as we will see shortly.
The readlines method reads all the lines of the file and returns a list. It is not efficient if working on big files or if we need to simply stop scanning the file at a certain point. Yet in this way you have constant and simultaneous access to all the lines, and there may be cases where it is useful.
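
The two approaches described so far can be compared on a small file. This is a sketch under assumptions: the file name and its three lines are made up for the occasion.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "datafile")  # hypothetical file
with open(path, "w") as f:
    f.write("line 1\nline 2\nline 3\n")

# for loop: one line at a time, always moving forward
seen = []
with open(path) as f:
    for line in f:
        seen.append(line.strip())
        if line.startswith("line 2"):
            break        # stop scanning, the third line is never processed

# readlines: the whole file in memory as a list, accessible by index
with open(path) as f:
    lines = f.readlines()

print(seen)              # ['line 1', 'line 2']
print(repr(lines[-1]))   # 'line 3\n'
```

Note that each line keeps its trailing ‘\n’, which is why strip is commonly applied before using it.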
The readline method instead reads one line each time it is called (the file object keeps track of the current position in the file, so each call starts where the previous one stopped).
The advantage of this approach is that you keep control over the flow of lines (you cannot go back, but you decide when the next line is read). A common use for this method is within a while loop:

with open(datafile) as f:
    line = f.readline()
    while line:
        #process the current line here
        line = f.readline()

where the loop condition is the boolean value of the variable line. This variable will contain one line of the file at a time, until the last line has been read. At that point, the readline method will simply return an empty string, which evaluates to False.
This construct is especially useful for dealing with files that have information scattered across lines, but with a constant structure, like FASTA files. In the next example we will see how to selectively remove some entries.

#this fasta file is made up of these entries:
...other 10000 entries
#Our task is to remove entries from chrX
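with open(datafile) as f:
    line = f.readline()
    while line:
        if line.startswith(">") and not line.startswith(">chrX"):
            print(line.strip()) #write the ID line
            line = f.readline() #move on to the sequence line
            print(line.strip())
            line = f.readline() #move on to the dot bracket
            print(line.strip())
        line = f.readline() #move on to the next line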

The fact that our key discriminator (‘>chrX’) appears only once every 3 rows makes it hard for us to use a for loop (you are invited to try writing a solution with it).
With readline instead, we can start a subroutine each time we encounter a valid ID line and print it together with the following 2 lines. Notice that the outer while loop does not care about which lines you are reading, as long as there is a line.
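
To try the idea without a real data file, the same logic can be run on an in-memory text. This is only a sketch: the three records, the function name drop_records and the use of io.StringIO in place of an opened file are all assumptions made for illustration, and each record is assumed to span exactly 3 lines.

```python
import io

# Hypothetical three-line records: ID, sequence, dot-bracket structure.
fasta_text = (
    ">chr1_a\nGGGAAACCC\n(((...)))\n"
    ">chrX_b\nAUGCAU\n......\n"
    ">chr2_c\nGCGCAAGC\n((....))\n"
)

def drop_records(handle, prefix):
    """Return the lines of every 3-line record whose ID does not start with prefix."""
    kept = []
    line = handle.readline()
    while line:
        if line.startswith(prefix):
            handle.readline()      # skip the sequence line
            handle.readline()      # skip the dot-bracket line
        else:
            kept.append(line.strip())               # ID line
            kept.append(handle.readline().strip())  # sequence line
            kept.append(handle.readline().strip())  # dot-bracket line
        line = handle.readline()   # next ID line, or '' at the end
    return kept

kept = drop_records(io.StringIO(fasta_text), ">chrX")
print("\n".join(kept))
```

Here the unwanted record is skipped explicitly by reading its two remaining lines, so the outer loop always lands on an ID line.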


Writing to a file is done through the file object's write method. We pass it a string as argument and it will write that string to the file the object is associated with. Note that write does not add a newline on its own.

"""assume a list with the results of your last experiment:
performance points for different datasets"""
results={'background':0.7333, 'IRES_HCV':0.9126, 'IRE': 0.9921}
with open(resultsfile, 'w') as f:
    HEADER = "source\tscore"
    f.write(HEADER)
    for data in results:
        #this cycles over the keys; before Python 3.7 their order is not guaranteed
        row="\n{}\t{:.2f}".format(data, results[data])
        f.write(row)

The resulting file will look like this:

source	score
background	0.73
IRES_HCV	0.91
IRE	0.99
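
A quick way to check what was written is to read the file back. The sketch below repeats the example in a self-contained form; the temporary path results.tsv is an assumption, and the row order relies on dictionaries preserving insertion order (guaranteed from Python 3.7).

```python
import os
import tempfile

results = {'background': 0.7333, 'IRES_HCV': 0.9126, 'IRE': 0.9921}
resultsfile = os.path.join(tempfile.mkdtemp(), "results.tsv")  # hypothetical path

with open(resultsfile, 'w') as f:
    written = f.write("source\tscore")  # write returns the number of characters written
    for name, score in results.items():
        f.write("\n{}\t{:.2f}".format(name, score))

# Read the file back and split each row on the tab character.
with open(resultsfile) as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

print(written)  # 12
for row in rows:
    print(row)
```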

The string format method may seem complicated, but it actually helps you compose strings without thinking about type conversion and number formatting, which are controlled separately by format specifiers (like {:.2f}). Take a look at this format help page for more (the official docs are somewhat confusing).
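
A few format specifiers in action may help. This is a small sketch; the values are taken from the results above, while the choice of specifiers is ours.

```python
name, score = "IRE", 0.9921

two_decimals = "{}\t{:.2f}".format(name, score)      # fields filled in order
percentage = "{:.1%}".format(score)                   # as a percentage, one decimal
padded = "{:>10}|".format(name)                       # right-aligned in 10 characters
by_keyword = "{n}: {s:.3f}".format(n=name, s=score)   # fields filled by keyword
f_string = f"{name}\t{score:.2f}"                     # same specifiers in an f-string (Python 3.6+)

print(two_decimals)   # IRE, a tab, then 0.99
print(percentage)     # 99.2%
print(padded)         # 7 spaces, then IRE|
print(by_keyword)     # IRE: 0.992
```

The f-string on the last assignment produces the same text as the first format call, so you can pick whichever reads better in your code.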