Thursday, October 22, 2009

ASCII Histograms

I recently came across an interesting problem while working on my random variable implementation for BIP. Any type in Python is expected to have a __str__(self) method which returns an adequate and expressive string representation of the object. As far as I could think, the most straightforward representation of a random variable is its probability distribution. Probability distributions are most often depicted graphically, by a continuous density function or a histogram. So my challenge was: how do I bring the information conveyed by a histogram to a concise ASCII string, suitable to be the output of a print statement?

I immediately rejected the boring solution of representing the distribution by its moments (mean, variance, skewness, etc.). I wanted a full histogram in as few ASCII characters as possible, so I set out to implement my own ASCII histogram generator. It turned out to be a very simple task, given the handy histogram function in NumPy and how easy it is to do string formatting in Python. It was nevertheless a fun couple of hours of programming. I ended up implementing a horizontal and a vertical histogram. The ASCII histogram proved to be very useful, since it helped enormously in debugging code involving probability calculations with simple print statements. Probabilistic simulations are extremely hard to test because the results of a given operation are never strictly the same. However, they should have the same probability distribution, so by looking at the rough shape of the histogram you can tell whether your calculations are going in the right direction.
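
A minimal sketch of that last point, under a few assumptions: it uses NumPy's `default_rng` (only available in newer NumPy releases), the helper name `normalized_hist` is hypothetical, and the seeds, bin count and range are arbitrary. Two runs with different seeds never produce the same draws, yet their normalized histograms should almost coincide.

```python
import numpy as np


def normalized_hist(sample, bins, value_range):
    """Return bin frequencies (summing to 1) for the given sample."""
    counts, _ = np.histogram(sample, bins=bins, range=value_range)
    return counts / counts.sum()


# Two runs of the same "simulation" with different (arbitrary) seeds.
run_a = np.random.default_rng(1).normal(size=10000)
run_b = np.random.default_rng(2).normal(size=10000)

freq_a = normalized_hist(run_a, bins=10, value_range=(-4, 4))
freq_b = normalized_hist(run_b, bins=10, value_range=(-4, 4))

# The raw draws differ, but the histogram shapes should be close.
same_values = np.array_equal(run_a, run_b)
max_gap = np.abs(freq_a - freq_b).max()
print(same_values, max_gap)
```

Comparing shapes this way is only a rough check, not a statistical test, but it is usually enough to spot a calculation that has gone badly wrong.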

Curiously, such a simple and expressive representation for probability distributions is not available in any package I know of, so I decided to share the code with the scientific Python community so that people may put it to good use. The code below is part of BIP and is consequently under the GPL license. Any suggestions for improvement are welcome.

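# -*- coding: utf-8 -*-
# True division keeps the bar heights from being truncated on Python 2.
from __future__ import division
from math import ceil
from numpy import histogram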
class Histogram(object):
    """
    Ascii histogram
    """
    def __init__(self, data, bins=10):
        """
        Class constructor
        
        :Parameters:
            - `data`: array-like object
            - `bins`: number of bins (default 10)
        """
        self.data = data
        self.bins = bins
        self.h = histogram(self.data, bins=self.bins)
    def horizontal(self, height=4, character ='|'):
        """Returns a multiline string containing
        a horizontal histogram representation of self.data
        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=25)
        >>> print(h.horizontal(5,'|'))
        106            |||
                      |||||
                      |||||||
                    ||||||||||
                   |||||||||||||
        -3.42                         3.09
        """
        his = """"""
        bars = self.h[0]/max(self.h[0])*height
        for l in reversed(range(1,height+1)):
            line = ""
            if l == height:
                line = '%s '%max(self.h[0]) #histogram top count
            else:
                line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
            for c in bars:
                if c >= ceil(l):
                    line += character
                else:
                    line += ' '
            line +='\n'
            his += line
        his += '%.2f'%self.h[1][0] + ' '*(self.bins) +'%.2f'%self.h[1][-1] + '\n'
        return his
    def vertical(self,height=20, character ='|'):
        """
        Returns a multiline string containing
        a vertical histogram representation of self.data
        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use
        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=10)
        >>> print(h.vertical(15,'*'))
                              236
        -3.42:
        -2.78:
        -2.14: ***
        -1.51: *********
        -0.87: *************
        -0.23: ***************
        0.41 : ***********
        1.04 : ********
        1.68 : *
        2.32 :
        """
        his = """"""
        xl = ['%.2f'%n for n in self.h[1]]
        lxl = [len(l) for l in xl]
        bars = self.h[0]/max(self.h[0])*height
        # bars may hold floats; string repetition needs plain ints
        his += ' '*(int(max(bars))+2+max(lxl))+'%s\n'%max(self.h[0])
        for i,c in enumerate(bars):
            line = xl[i] +' '*(max(lxl)-lxl[i])+': '+ character*int(c)+'\n'
            his += line
        return his
            
if __name__ == "__main__":
    from numpy.random import normal
    d = normal(size=1000)
    h = Histogram(d,bins=10)
    print(h.vertical(15))
    print(h.horizontal(5))
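
To connect this back to the original motivation, here is a rough sketch of how a `__str__` method could hand back an ASCII histogram. The `RandomVariable` class below is purely hypothetical and is not BIP's actual implementation; the 20-character bar width and the `default_rng` seed are arbitrary choices, and the code assumes Python 3 with a recent NumPy.

```python
import numpy as np


class RandomVariable:
    """Hypothetical wrapper around a sample; not BIP's actual class."""

    def __init__(self, sample, bins=10):
        self.sample = np.asarray(sample, dtype=float)
        self.bins = bins

    def __str__(self):
        counts, edges = np.histogram(self.sample, bins=self.bins)
        # Scale the bars so the tallest bin is 20 characters wide.
        widths = counts * 20 // counts.max()
        # One row per bin, labelled with the bin's left edge.
        rows = ['%8.2f: %s' % (edge, '*' * int(w))
                for edge, w in zip(edges[:-1], widths)]
        return '\n'.join(rows)


rv = RandomVariable(np.random.default_rng(0).normal(size=1000))
text = str(rv)
print(text)
```

With something like this in place, a plain `print(rv)` while debugging shows the shape of the distribution instead of an opaque object address.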

8 comments:

jsalvati said...

This is pretty neat. Why did you choose GPL though? Why not let proprietary users use it?

Adrian Dragulescu said...

You might want to check
http://en.wikipedia.org/wiki/Stemplot

It is implemented in R. For example, using 100 Normal random variables you get:
> stem(rnorm(100))

The decimal point is at the |

-2 | 31
-1 | 9877
-1 | 443332110000
-0 | 9988877766655
-0 | 4433333221100
 0 | 000111111111222222333444
 0 | 55556666777799
 1 | 001122233334
 1 | 555
 2 | 023

Flavio Coelho said...

@jsalvati: because proprietary developers don't give back to the community.

Flavio Coelho said...

@Adrian: I was referring to Python, not R. But I still think my histogram looks nicer than R's stem plot.

Michael said...

Looks great. Maybe you should mention it on the scipy/numpy forums; I am sure they would like to integrate it.

torgdoo said...

Thank you so much for posting this.

I had to make a few changes so it would work for me:
* import numpy
* import math
* use float() to get floating-point division

here's the result of a diff with --unified=2


--- histogram.py 2010/10/19 17:03:11 1.1
+++ histogram.py 2010/10/19 17:42:17 1.2
@@ -9,4 +9,7 @@
 #
 #
+import numpy as np
+import math
+
 class Histogram(object):
     """
@@ -18,9 +21,10 @@

         :Parameters:
-            - `data`: array like object
+            - `data`: array-like object
+            - `bins`: number of bins (default 10)
         """
         self.data = data
         self.bins = bins
-        self.h = histogram(self.data, bins=self.bins)
+        self.h = np.histogram(self.data, bins=self.bins)
     def horizontal(self, height=4, character ='|'):
         """Returns a multiline string containing a
@@ -40,5 +44,5 @@
         """
         his = """"""
-        bars = self.h[0]/max(self.h[0])*height
+        bars = self.h[0]/float(max(self.h[0]))*height
         for l in reversed(range(1,height+1)):
             line = ""
@@ -48,5 +52,5 @@
                 line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
             for c in bars:
-                if c >= ceil(l):
+                if c >= math.ceil(l):
                     line += character
                 else:
@@ -81,5 +85,5 @@
         xl = ['%.2f'%n for n in self.h[1]]
         lxl = [len(l) for l in xl]
-        bars = self.h[0]/max(self.h[0])*height
+        bars = self.h[0]/float(max(self.h[0]))*height
         his += ' '*(max(bars)+2+max(lxl))+'%s\n'%max(self.h[0])
         for i,c in enumerate(bars):


is there really no way to format a comment with a fixed-width font?

Matthew Miller said...

Could you be specific about which version or versions of the GPL you intend? Thanks!

Mark Hamilton said...

I like your post; it definitely beats looking at a histogram as a list of numbers!

I ran into a compatibility problem with older versions of Python (2.6 for me, but it should manifest in anything without future division). The line
bars = self.h[0]/max(self.h[0])*height
could be changed to
bars = self.h[0]*height/max(self.h[0])
(i.e. do the multiplication before the division) in order to not lose precision from integer division.
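
The effect of the operation order can be illustrated with a small sketch. The bin counts below are made up for illustration, and `//` stands in for Python 2's integer `/` so that the example behaves the same on Python 3.

```python
# Hypothetical bin counts; `//` emulates Python 2's integer `/`.
counts = [2, 40, 118, 236, 177, 59]
height = 15
top = max(counts)

# Divide first: every quotient below the maximum truncates to 0.
divide_first = [c // top * height for c in counts]

# Multiply first: truncation happens only once, at the very end.
multiply_first = [c * height // top for c in counts]

print(divide_first)    # [0, 0, 0, 15, 0, 0]
print(multiply_first)  # [0, 2, 7, 15, 11, 3]
```

Dividing first leaves a single full-height bar and nothing else, which matches the symptom described above; multiplying first keeps the shape of the histogram.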
