Thursday, October 22, 2009

ASCII Histograms

I have recently come across an interestng problem while working on my random variable implementation for BIP. Any type in Python is expected to have a __str__(self) method which returns an adequate and expressive string representation of the object. Well, as far as I could think, the most straightforward representation of a random variable is its probability distribution. Probability distributions most often depicted graphically by a continuous density function, or a histogram. So my challenge was how to bring the information conveyed by a histogram to a concise ascii string, suitable to be the output of a print statement?

I immediately rejected the boring solution of representing the distribution by its moments (mean, variance, skewness, etc.). I wanted a full histogram in as few ascii characters as possible. So I set out to implement my own ASCII histogram generator. I can anticipate that it was a very simple task given the handy histogram function in Numpy and how easy it is to do string formatting in Python. It was nevertheless a fun couple of hours of programming. I ended up implementing a horizontal and a vertical histogram. The ascii histogram proved to be very useful since it helped enormously in debugging code involving probability calculations with simple print statements. Probabilistic simulations are extremely hard to test because the results of a given operation are never strictly the same. However, they should have the same probability distribution, so by looking at the rough shape of the histogram, you tell you if your calculations are going in the right direction.

Curiously, such a simple and expressive representation for probability distributions is not available in any package I knew, so I decided to share the code with the scientific Python community so that people that may put it to good use. The code below is part of BIP and consequently under GPL license. Any suggestions of improvements are welcome.

# -*- coding: utf-8 -*-
class Histogram(object):
    Ascii histogram
    def __init__(self, data, bins=10):
        Class constructor
            - `data`: array like object
        """ = data
        self.bins = bins
        self.h = histogram(, bins=self.bins)
    def horizontal(self, height=4, character ='|'):
        """Returns a multiline string containing a
        a horizontal histogram representation of
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=25)
        >>> print h.horizontal(5,'|')
        106            |||
        -3.42                         3.09
        his = """"""
        bars = self.h[0]/max(self.h[0])*height
        for l in reversed(range(1,height+1)):
            line = ""
            if l == height:
                line = '%s '%max(self.h[0]) #histogram top count
                line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
            for c in bars:
                if c >= ceil(l):
                    line += character
                    line += ' '
            line +='\n'
            his += line
        his += '%.2f'%self.h[1][0] + ' '*(self.bins) +'%.2f'%self.h[1][-1] + '\n'
        return his
    def vertical(self,height=20, character ='|'):
        Returns a Multi-line string containing a
        a vertical histogram representation of
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> Histogram(d,bins=10)
        >>> print h.vertical(15,'*')
        -2.14: ***
        -1.51: *********
        -0.87: *************
        -0.23: ***************
        0.41 : ***********
        1.04 : ********
        1.68 : *
        2.32 :
        his = """"""
        xl = ['%.2f'%n for n in self.h[1]]
        lxl = [len(l) for l in xl]
        bars = self.h[0]/max(self.h[0])*height
        his += ' '*(max(bars)+2+max(lxl))+'%s\n'%max(self.h[0])
        for i,c in enumerate(bars):
            line = xl[i] +' '*(max(lxl)-lxl[i])+': '+ character*c+'\n'
            his += line
        return his
if __name__ == "__main__":
    from numpy.random import normal
    d = normal(size=1000)
    h = Histogram(d,bins=10)
    print h.vertical(15)
    print h.horizontal(5)