Js Crawler
Stop Press!
Shanti Rao has created an extended standalone Windows Javascript
interpreter based on Mozilla's SpiderMonkey. This can be used to do any
kind of programming you like; Rao has even created a web server using
it.
Of course this doesn't really solve my problems, because it is
Windows-specific and anyway I wanted something that would run in the browser.
If you want to see the crawler in action then visit Crawler3.
Creating the crawler was unreasonably difficult for a number of
reasons:
- Microsoft, Netscape and W3C disagree on a number of quite
fundamental points so that simply finding the names of the
properties or methods that are relevant is a task for Peter Wimsey
or Hercule Poirot rather than a working programmer like me.
- The very specification of Javascript turns it into a second-class
language by denying it the ability to access files outside the
domain of the executing script. Yes, I know it is supposed to be
more secure that way, but really we all execute insecure code all the
time. Every time you download a new program these days it calls home
to see if there is a new version and offers to download it for you.
This process plainly involves reading files from another domain, so
why should Javascript not be allowed to do the same? I'm not
suggesting that it be allowed to read files from local disks, just
that it have the same ability to read public web sites that the user
has.
How it works
This is very simple. Here is the pseudo-code:
push the URL of a page to a stack
while stack not empty {
pop URL from stack
open page for URL
add all links from this page that have not yet been visited to the stack
}
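The pseudo-code above can be sketched in Javascript roughly as follows. This is only an illustration, not the crawler's actual source: `crawl` and `getLinks` are hypothetical names, and `getLinks` stands in for whatever opens a page and collects its links. One deliberate difference from the pseudo-code is that a URL is marked as seen when it is pushed rather than when it is opened, so the same page is never on the stack twice.

```javascript
// Minimal sketch of the stack-based crawl described above.
// `getLinks` is a hypothetical callback: given a URL it returns an array
// of the URLs linked from that page (how the page is loaded is up to it).
function crawl(startUrl, getLinks) {
  var stack = [startUrl];
  var seen = new Set([startUrl]); // URLs already pushed at some point
  var visited = [];               // order in which pages were opened

  while (stack.length > 0) {
    var url = stack.pop();
    visited.push(url);
    var links = getLinks(url) || [];
    for (var i = 0; i < links.length; i++) {
      if (!seen.has(links[i])) {
        seen.add(links[i]);
        stack.push(links[i]);
      }
    }
  }
  return visited;
}

// Usage with an in-memory "site" standing in for real pages.
var site = {
  "a.html": ["b.html", "c.html"],
  "b.html": ["a.html", "c.html"],
  "c.html": []
};
var order = crawl("a.html", function (url) { return site[url]; });
```

Because a stack is used, the crawl is depth-first; swapping `pop` for `shift` would visit pages breadth-first instead.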
Yes, I know that if even one page points at an external page then this
is probably not going to terminate until hell freezes over. Just
getting this far was so hard that I'm not feeling sympathetic to cries
of "help, it won't stop"; I'm just glad I managed to make it go
at all. In fact it probably only goes in IE6. Opera will have to
wait (even though it is my favourite browser), I haven't got a copy
of Netscape and I haven't tried Mozilla in ages.
Obviously the algorithm described by the pseudo-code is too simple. Other
sections on this page address the problems.
What is wrong with it
- Doesn't stop crawling until it runs out of links, which could be never.
- Doesn't work in my favourite browser.
- Doesn't actually do anything useful!
- Opens a window instead of keeping the process in the background.
- Doesn't crawl Javascript links properly. Probably doesn't do it at
all.
- Adds all pages to the stack even if they are already there.
- Wastes time waiting for pages, time that could be used loading others.
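The first item in the list is the one that makes the crawler run away, and the usual cure is to refuse links that fall outside some scope. The sketch below is one possible filter, not something the crawler does today. The name `inScope` and the `maxDepth` parameter are made up for the example, and it assumes links are already absolute (as the `href` property of an anchor is in a browser) and that the standard `URL` object is available.

```javascript
// Decide whether a link found `depth` steps from the start page should be
// followed. All names here are illustrative.
function inScope(link, startUrl, depth, maxDepth) {
  var target, start;
  try {
    target = new URL(link);   // assumes an absolute URL
    start = new URL(startUrl);
  } catch (e) {
    return false;             // unparseable link: skip it
  }
  // Skip javascript:, mailto: and other non-web schemes.
  if (target.protocol !== "http:" && target.protocol !== "https:") return false;
  // Stay on the host the crawl started from.
  if (target.host !== start.host) return false;
  // Directory of the starting page, e.g. "/docs/" for "/docs/index.html".
  var baseDir = start.pathname.slice(0, start.pathname.lastIndexOf("/") + 1);
  // Don't climb above the starting page's directory.
  if (target.pathname.indexOf(baseDir) !== 0) return false;
  // Don't wander more than maxDepth steps from the start.
  return depth <= maxDepth;
}

// Usage: http://example.com/docs/index.html is a placeholder start page.
var start = "http://example.com/docs/index.html";
var follow = inScope("http://example.com/docs/a.html", start, 1, 3);
```

To use it, the crawler would have to store each URL's depth alongside it on the stack and call the filter before pushing a link.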
What it might be useful for
- A search engine for a small web site.
- A concordance creator.
- A site map creator.
What next?
In no particular order here are some things that I intend to do:
- don't crawl outside own host
- don't crawl above starting page
- crawl only n steps away from starting point
- create concordance of words found with links to the pages on which
found
- create site map
- operate in the background
- store some state as cookies. How much can a cookie hold?
- multithreading. Well not exactly; just the ability to start a new
page download before giving up an earlier one. This can easily be
done by adding another list of pages so that every time a new page
is opened it is added to the list. Instead of timing out and giving
up we could have a two stage timeout so that the first timeout
simply opens the next link. Each link would be given an amount of
time to reach a readyState of "complete". The number of links open at a
given time should be limited to some user defined value. The list
would be scanned at intervals looking for pages that have arrived
and pages that are still trying. Any that have been trying too long
would be discarded and replaced with new links.
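The two-stage timeout in the last item is easier to see as code. The sketch below is one reading of it, not a finished design: `scan`, `softTimeout`, `hardTimeout` and `maxOpen` are invented names, and the rule "open the next link only when no page is still inside its first timeout" is my interpretation of the description. The function only does the bookkeeping; the caller is assumed to start the real download for whatever URL comes back in `started` and to set `done` when the page arrives.

```javascript
// One pass of the periodic scan. All names are illustrative.
//   loading - array of { url, startedAt, done } for pages currently open
//   queue   - array of URLs waiting to be opened
//   now     - current time in milliseconds
//   opts    - { maxOpen, softTimeout, hardTimeout }, times in milliseconds
// Mutates `loading` and `queue`; reports what happened during this pass.
function scan(loading, queue, now, opts) {
  var finished = [], abandoned = [], stillOpen = [];
  var fresh = false; // true if some page is still inside its first timeout
  for (var i = 0; i < loading.length; i++) {
    var page = loading[i];
    var age = now - page.startedAt;
    if (page.done) {
      finished.push(page.url);          // arrived: hand back to the caller
    } else if (age >= opts.hardTimeout) {
      abandoned.push(page.url);         // second timeout: give up on it
    } else {
      stillOpen.push(page);             // keep trying
      if (age < opts.softTimeout) fresh = true;
    }
  }
  // First timeout passed (or nothing loading) and there is room: open one more.
  var started = null;
  if (!fresh && stillOpen.length < opts.maxOpen && queue.length > 0) {
    started = queue.shift();
    stillOpen.push({ url: started, startedAt: now, done: false });
  }
  loading.length = 0;
  for (var j = 0; j < stillOpen.length; j++) loading.push(stillOpen[j]);
  return { finished: finished, abandoned: abandoned, started: started };
}

// Usage: at most two pages open, 1 s first timeout, 5 s second timeout.
var opts = { maxOpen: 2, softTimeout: 1000, hardTimeout: 5000 };
var loading = [];
var queue = ["a.html", "b.html", "c.html"];
var r0 = scan(loading, queue, 0, opts);    // starts a.html
var r1 = scan(loading, queue, 500, opts);  // a.html inside first timeout: nothing new
var r2 = scan(loading, queue, 1000, opts); // first timeout passed: b.html starts as well
```

In the browser this would be driven by something like `setInterval`, passing the current time in as `now`; keeping the clock as a parameter makes the logic easy to try out without waiting for real timeouts.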