I recently came across an interesting problem while working on my random variable implementation for BIP. Any type in Python is expected to have a __str__(self) method that returns an adequate and expressive string representation of the object. As far as I could tell, the most straightforward representation of a random variable is its probability distribution. Probability distributions are most often depicted graphically, by a continuous density function or a histogram. So my challenge was: how do I bring the information conveyed by a histogram into a concise ASCII string, suitable as the output of a print statement?
I immediately rejected the boring solution of representing the distribution by its moments (mean, variance, skewness, etc.). I wanted a full histogram in as few ASCII characters as possible, so I set out to implement my own ASCII histogram generator. I can say up front that it was a very simple task, given the handy histogram function in NumPy and how easy string formatting is in Python. It was nevertheless a fun couple of hours of programming. I ended up implementing a horizontal and a vertical histogram. The ASCII histogram proved very useful, since it helped enormously in debugging code involving probability calculations with simple print statements. Probabilistic simulations are extremely hard to test because the results of a given operation are never exactly the same from run to run. However, they should have the same probability distribution, so by looking at the rough shape of the histogram you can tell whether your calculations are going in the right direction.
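The "same distribution, different numbers" idea can also be checked numerically. The following is only a rough sketch: the helper names, the fixed bin range and the 0.05 tolerance are assumptions made for illustration and are not part of BIP.

```python
import numpy as np


def normalized_counts(sample, bins=10, value_range=(-4.0, 4.0)):
    """Return the fraction of the binned sample that falls in each bin."""
    counts, _ = np.histogram(sample, bins=bins, range=value_range)
    return counts / float(counts.sum())


def same_rough_shape(a, b, tol=0.05):
    """True if no bin fraction differs by more than tol (an arbitrary threshold)."""
    diff = np.abs(normalized_counts(a) - normalized_counts(b))
    return bool(diff.max() <= tol)


# Two runs of the "same" simulation with different seeds, plus a shifted one.
run_a = np.random.RandomState(1).normal(size=10000)
run_b = np.random.RandomState(2).normal(size=10000)
shifted = np.random.RandomState(3).normal(loc=2.0, size=10000)

print(np.array_equal(run_a, run_b))      # the raw numbers differ between runs
print(same_rough_shape(run_a, run_b))    # but the histogram shape agrees
print(same_rough_shape(run_a, shifted))  # a shifted distribution does not
```

Both samples are binned over the same fixed range so that the bins line up; with automatically chosen bin edges the two histograms would not be directly comparable.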
Curiously, such a simple and expressive representation for probability distributions was not available in any package I knew of, so I decided to share the code with the scientific Python community so that people may put it to good use. The code below is part of BIP and is consequently under the GPL license. Any suggestions for improvement are welcome.
# -*- coding: utf-8 -*-
# True division and print() so the code behaves the same on Python 2 and 3
from __future__ import division, print_function

from math import ceil

from numpy import histogram


class Histogram(object):
    """
    Ascii histogram
    """
    def __init__(self, data, bins=10):
        """
        Class constructor

        :Parameters:
            - `data`: array like object
            - `bins`: number of bins
        """
        self.data = data
        self.bins = bins
        # self.h is a tuple: (counts per bin, bin edges)
        self.h = histogram(self.data, bins=self.bins)

    def horizontal(self, height=4, character='|'):
        """Returns a multiline string containing
        a horizontal histogram representation of self.data

        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use

        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=25)
        >>> print(h.horizontal(5,'|'))
        106 |||
        |||||
        |||||||
        ||||||||||
        |||||||||||||
        -3.42 3.09
        """
        his = ""
        # Scale the counts so that the tallest bin is `height` characters high
        bars = self.h[0]/max(self.h[0])*height
        for l in reversed(range(1, height+1)):
            line = ""
            if l == height:
                line = '%s ' % max(self.h[0])  # histogram top count
            else:
                line = ' '*(len(str(max(self.h[0])))+1)  # add leading spaces
            for c in bars:
                if c >= ceil(l):
                    line += character
                else:
                    line += ' '
            line += '\n'
            his += line
        his += '%.2f' % self.h[1][0] + ' '*(self.bins) + '%.2f' % self.h[1][-1] + '\n'
        return his

    def vertical(self, height=20, character='|'):
        """
        Returns a multiline string containing
        a vertical histogram representation of self.data

        :Parameters:
            - `height`: Height of the histogram in characters
            - `character`: Character to use

        >>> d = normal(size=1000)
        >>> h = Histogram(d,bins=10)
        >>> print(h.vertical(15,'*'))
        236
        -3.42:
        -2.78:
        -2.14: ***
        -1.51: *********
        -0.87: *************
        -0.23: ***************
        0.41 : ***********
        1.04 : ********
        1.68 : *
        2.32 :
        """
        his = ""
        xl = ['%.2f' % n for n in self.h[1]]  # bin edge labels
        lxl = [len(l) for l in xl]
        bars = self.h[0]/max(self.h[0])*height
        # bars holds floats; truncate to int before repeating characters
        his += ' '*(int(max(bars))+2+max(lxl)) + '%s\n' % max(self.h[0])
        for i, c in enumerate(bars):
            line = xl[i] + ' '*(max(lxl)-lxl[i]) + ': ' + character*int(c) + '\n'
            his += line
        return his


if __name__ == "__main__":
    from numpy.random import normal
    d = normal(size=1000)
    h = Histogram(d, bins=10)
    print(h.vertical(15))
    print(h.horizontal(5))
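To connect this back to the original motivation, here is a minimal sketch of how an ASCII histogram could serve as the __str__ output of a random variable. The RandomVariable class and the ascii_histogram helper are hypothetical names invented for this illustration; this is not the actual BIP implementation.

```python
import numpy as np


def ascii_histogram(data, bins=10, width=20, character='*'):
    """Compact stand-in for a vertical ASCII histogram (illustration only)."""
    counts, edges = np.histogram(data, bins=bins)
    top = counts.max()
    lines = []
    for count, left_edge in zip(counts, edges[:-1]):
        # Multiply before dividing so integer arithmetic keeps its precision
        bar = character * int(count * width // top)
        lines.append('%8.2f: %s' % (left_edge, bar))
    return '\n'.join(lines)


class RandomVariable(object):
    """Hypothetical random variable represented by a sample of draws."""

    def __init__(self, sample):
        self.sample = np.asarray(sample, dtype=float)

    def __str__(self):
        # print(rv) now shows the rough shape of the distribution
        return ascii_histogram(self.sample)


rv = RandomVariable(np.random.RandomState(0).normal(size=1000))
print(rv)
```

With this in place, a plain print of the object is enough to inspect the distribution while debugging.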
This is pretty neat. Why did you choose GPL though? Why not let proprietary users use it?
You might want to check
http://en.wikipedia.org/wiki/Stemplot
It is implemented in R. For example, using 100 Normal random variables you get:
> stem(rnorm(100))

  The decimal point is at the |

  -2 | 31
  -1 | 9877
  -1 | 443332110000
  -0 | 9988877766655
  -0 | 4433333221100
   0 | 000111111111222222333444
   0 | 55556666777799
   1 | 001122233334
   1 | 555
   2 | 023

@jsalvati: because proprietary developers don't give back to the community.
@Adrian: I was referring to Python, not R. But I still think my histogram looks nicer than R's stem plot.
Looks great. Maybe you should mention it on the scipy/numpy forums; I am sure they would like to integrate it.
Thank you so much for posting this.
I had to make a few changes so it would work for me:
* import numpy
* import math
* use float() to get floating-point division
here's the result of a diff with --unified=2
--- histogram.py 2010/10/19 17:03:11 1.1
+++ histogram.py 2010/10/19 17:42:17 1.2
@@ -9,4 +9,7 @@
 #
 #
+import numpy as np
+import math
+
 class Histogram(object):
     """
@@ -18,9 +21,10 @@
         :Parameters:
-            - `data`: array like object
+            - `data`: array-like object
+            - `bins`: number of bins (default 10)
         """
         self.data = data
         self.bins = bins
-        self.h = histogram(self.data, bins=self.bins)
+        self.h = np.histogram(self.data, bins=self.bins)
     def horizontal(self, height=4, character ='|'):
         """Returns a multiline string containing a
@@ -40,5 +44,5 @@
         """
         his = """"""
-        bars = self.h[0]/max(self.h[0])*height
+        bars = self.h[0]/float(max(self.h[0]))*height
         for l in reversed(range(1,height+1)):
             line = ""
@@ -48,5 +52,5 @@
                 line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
             for c in bars:
-                if c >= ceil(l):
+                if c >= math.ceil(l):
                     line += character
                 else:
@@ -81,5 +85,5 @@
         xl = ['%.2f'%n for n in self.h[1]]
         lxl = [len(l) for l in xl]
-        bars = self.h[0]/max(self.h[0])*height
+        bars = self.h[0]/float(max(self.h[0]))*height
         his += ' '*(max(bars)+2+max(lxl))+'%s\n'%max(self.h[0])
         for i,c in enumerate(bars):
is there really no way to format a comment with a fixed-width font?
Could you be specific about which version or versions of the GPL you intend? Thanks!
I like your post; it definitely beats looking at a histogram as a list of numbers!
I ran into a compatibility problem with older versions of Python (2.6 for me, but it should manifest in anything without future division). The line
bars = self.h[0]/max(self.h[0])*height
could be changed to
bars = self.h[0]*height/max(self.h[0])
(i.e. do the multiplication before the division) in order to not lose precision from integer division.
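The effect of the operation order is easy to see on a small example. The bin counts below are made up for illustration, and `//` is used to reproduce what `/` did for integers under Python 2 without future division.

```python
counts = [3, 9, 17, 25]  # made-up bin counts, for illustration only
height = 5
top = max(counts)

# Divide first: every bin except the tallest collapses to zero
divide_first = [c // top * height for c in counts]

# Multiply first: only the final fractional part is truncated
multiply_first = [c * height // top for c in counts]

# True (floating-point) division keeps the fractional bar heights
true_division = [c / float(top) * height for c in counts]

print(divide_first)    # [0, 0, 0, 5]
print(multiply_first)  # [0, 1, 3, 5]
print(true_division)
```

Multiplying first still truncates each bar height to an integer, so it is a smaller loss of precision rather than none at all; converting to float, or enabling future division, avoids the truncation entirely.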