Js Crawler

Stop Press!

Shanti Rao has created an extended standalone Windows Javascript interpreter based on Mozilla's SpiderMonkey. This can be used to do any kind of programming you like; Rao has even created a web server using it.

Of course, this doesn't really solve my problems, because it is Windows-specific and, anyway, I wanted something that would run in the browser.

Javascript Crawler

If you want to see the crawler in action, visit Crawler3.

Creating the crawler was unreasonably difficult for a number of reasons:

How it works

This is very simple. Here is the pseudo-code:

  push the URL of a page to a stack

  while stack not empty {
    pop URL from stack
    open page for URL
    add all links from this page that have not yet been visited to the stack
  }

Yes, I know that if even one page points at an external page, this is probably not going to terminate until hell freezes over. Just getting this far was so hard that I'm not feeling sympathetic to cries of "help, it won't stop"; I'm just glad I managed to make it go at all. In fact, it probably only works in IE6. Opera will have to wait (even though it is my favourite browser); I haven't got a copy of Netscape, and I haven't tried Mozilla in ages.

Obviously, the algorithm described by the pseudo-code is too simple. Other sections on this page address the problems.
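For illustration, the loop in the pseudo-code might be sketched in Javascript roughly as follows. This is not the code behind Crawler3, just a minimal sketch under some assumptions: `getLinks` is a hypothetical callback standing in for "open the page and collect its links", the `site` object is made-up test data instead of real pages, and `maxPages` is a cap I have added so that the loop stops even if the links never run out.

```javascript
// Sketch only. getLinks(url) is a hypothetical callback that returns an
// array of the URLs linked from a page (or nothing if the page is unknown).
// maxPages is an added safety cap, not part of the original pseudo-code.
function crawl(startUrl, getLinks, maxPages) {
  var stack = [startUrl];   // URLs waiting to be opened
  var seen = {};            // URLs already opened or already on the stack
  var order = [];           // pages in the order they were opened
  var has = Object.prototype.hasOwnProperty;

  seen[startUrl] = true;

  while (stack.length > 0 && order.length < maxPages) {
    var url = stack.pop();             // pop URL from stack
    order.push(url);                   // "open" the page
    var links = getLinks(url) || [];
    for (var i = 0; i < links.length; i++) {
      // only add links that have not yet been visited or queued
      if (!has.call(seen, links[i])) {
        seen[links[i]] = true;
        stack.push(links[i]);
      }
    }
  }
  return order;
}

// Made-up three-page site in which the pages link to one another in a loop.
var site = {
  "/index": ["/a", "/b"],
  "/a": ["/index", "/b"],
  "/b": ["/a"]
};

var pages = crawl("/index", function (url) { return site[url]; }, 100);
// pages is ["/index", "/b", "/a"]
```

Marking a URL as seen when it is pushed, rather than when it is popped, keeps the same page from being put on the stack twice. Because this is a stack, the most recently found link is opened first, which is why "/b" comes before "/a" in the result.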

What is wrong with it

What it might be useful for

What next?

In no particular order, here are some things that I intend to do: