Object Oriented Evolution

Jul 21 2010

From Tophat to Pysam

I recently needed to interact with a SAM file in python that was created as part of a Tophat run. Rather rolling my own parser I decided to use the handy pysam module. However, I immediately ran into trouble. I couldn’t get pysam to open my file. It turns out pysam requires a BAM file not a SAM file.  This meant I need to convert filetypes. So, I fired up samtools and tried a conversion.

>>> samtools view -bS -o accepted_hits.bam accepted_hits.sam 

This produced an error about a missing ‘@SQ line.’  Turns that the Tophat SAM output doesn’t contain the necessary information about contig/scaffold/chrm locations. So, to generate that data I made a ‘fasta index’ file from the genome I was working on.

>>> samtools faidx mygenome.fa 

Then I ran ‘samtools view’ with the index.

>>> samtools view -bS -t mygenome.fa.fai -o accepted_hits.bam accepted_hits.sam

This created BAM file I need with no errors.  Well, almost.  Running it in pysam generated an error. To get it to work there required more manipulation.  The BAM file needed sorting and indexing.

>>> samtools sort accepted_hits.bam accepted_hits.sorted.bam
>>> samtools index accepted_hits.sorted.bam

accepted_hits.sorted.bam’ was searchable when I opened it in pysam.  Whew!

Apr 20 2010

Inferring Heterozygosity from SNP data

This looks straightforward. 

n = total number of reads aligning to a base

y = total number of reads aligning to a base that match the reference

p = probability that reads match the reference base

Bases with LOD>3 were called SNPs or sites of heterozygosity.

From Bloom et al. (2009) Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics vol. 10 (1) pp. 221

Apr 15 2010

Why we are “Python Fanboys”

I think I can safely say that we’re all Python evangelists.  In fact, I’m pretty sure my first conversation with ‘FixedByDrift’ revolved around just how enamored with Python he is.  Likewise my first email with ‘AmbiguousBase’ was a code exchange of microsatellite finding software written in Python. We just love Python. So why should you learn to love Python too and why do we use it so much?  A few salient points follow:

  1. Python is an object oriented scripting language that is easy to write. More importantly, it’s easy to read.  Instead of a cluster-f#$k of brackets and colons, if you’re ever tried to decipher C or Perl code you didn’t write you know what I’m talking about, indentation is used to delimit code blocks.  This makes for some surprisingly clean code and for code that is easy to collaborate on.
  2. Python has useful scientific modules.  Numpy, Scipy and MatPlotLib make Python an essentially free version of Matlab. BioPython lets you interact with biological sequences and many of the online biological repositories. Rpy2 allows you to interact with the R statistical language from within Python. However, since Python is a general-purpose language, you can seamlessly combine these modules with web frameworks (e.g., Django) and GUI frameworks such as PyQt
  3. The type of an object is defined when you create it.  There are no complicated scalars, floats, and hashes to keep track of.  If you want to create an object that consists of a string the syntax is simply:  
    >> test = 'test string'
    It’s really as simple as that. In a similar vein Python deals with pointers behind the scenes.  You don’t ever have to declare pointers when writing Python code.

That is all I can think of for the time being. I’ll continue updating this post as more things I love about Python come to mind.

Page 1 of 1