Copyright © 2009 Purdea Andrei

Webharvest and Jython

Webharvest is a scripting framework for collecting data from web pages. It integrates many useful technologies such as XPath, XSLT, XQuery, regular expressions, and several scripting languages. The original project only supports JavaScript, BeanShell, and Groovy.

Before I found out about this project, I used Python, my favourite programming language, extensively for this kind of work. Webharvest was an impressive little tool, but I missed the power of Python. Since Webharvest is written in Java, it wasn't too hard to integrate it with Jython, an implementation of the Python language in Java.


With Jython support added, Python code can be placed directly in a script element:

<script language="jython">
    for i in range(100):
        print i,
</script>

Be careful with indentation! Since I wanted good-looking source files, I had to find a way to let the Python snippets be indented along with the surrounding markup. I did this by inserting "if True:" as the first line of the evaluated code. This means you can't write code starting in the first column of the file: every snippet must be indented by at least one level.
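
The effect of this trick can be tried in plain Python. The following is only a minimal sketch of the idea, not the actual integration code; the helper name wrap_snippet and the sample snippet are made up for illustration.

```python
def wrap_snippet(snippet):
    """Prefix an indented snippet with 'if True:' so it parses as one block.

    Hypothetical helper: it only mirrors the idea described in the text.
    """
    return "if True:\n" + snippet


# An indented snippet, as it might appear inside a <script> element.
snippet = (
    "        total = 0\n"
    "        for i in range(5):\n"
    "            total += i\n"
)

namespace = {}
exec(wrap_snippet(snippet), namespace)
print(namespace["total"])  # 0 + 1 + 2 + 3 + 4 = 10
```

A snippet that starts in the first column, such as "x = 1", is no longer a valid body for the inserted "if True:" line and fails with an indentation error, which is why the code has to be indented.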

I found that Webharvest lacked escape/unescape functions, which are frequently needed when parsing web pages. A useful class for this is StringEscapeUtils from the Apache Commons Lang component, which I also included in my modified version.
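
As a rough sketch of how this might look from a script, the snippet below assumes the Commons Lang 2.x package name (org.apache.commons.lang) and its escapeHtml/unescapeHtml methods; the bundled version may differ. So that the example can also be tried outside Jython, it falls back to Python's standard html module, which is a stand-in and not part of Webharvest.

```python
try:
    # Under Jython, with Commons Lang 2.x on the classpath (assumption).
    from org.apache.commons.lang import StringEscapeUtils
    escape = StringEscapeUtils.escapeHtml
    unescape = StringEscapeUtils.unescapeHtml
except ImportError:
    # Stand-in for trying the example under plain CPython 3.
    from html import escape, unescape

original = "<b>Tom & Jerry</b>"

# Turn markup characters into entities, then back again.
escaped = escape(original)
restored = unescape(escaped)

print(escaped)
print(restored)
```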