JSpider start-up process

The JSpider engine has a fairly complex start-up process that combines property files, static factory methods, abstractions and reflective invocation. These notes clarify where runtime properties, configuration and storage facilities originate. Unless stated otherwise, class references are to JSpider library classes.

JSpider main entry point

The primary JSpider class takes two command-line arguments: the base URL at which to start spidering, and a reference to a configuration directory. If no configuration is specified, a directory called "default" is checked for configuration property files.

The JSpider class completes the following steps:

  1. Loads the relevant configuration using the static ConfigurationFactory.getConfiguration method.
  2. Creates a new JSpider instance.
  3. Calls the start method on the JSpider instance.
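The argument handling and the three steps above can be sketched roughly as follows. This is an illustration only, not JSpider source: the class StartupSketch, its Args holder and the startupSteps method are hypothetical names, and the steps are recorded as strings rather than performed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the entry point described above; not JSpider source.
final class StartupSketch {
    static final String DEFAULT_CONFIG = "default";

    /** Parsed command line: base URL plus configuration directory reference. */
    static final class Args {
        final URL baseUrl;
        final String configName;

        Args(URL baseUrl, String configName) {
            this.baseUrl = baseUrl;
            this.configName = configName;
        }
    }

    /** Applies the "default" fallback when only the base URL is given. */
    static Args parse(String[] argv) {
        if (argv.length < 1 || argv.length > 2) {
            throw new IllegalArgumentException("usage: <baseURL> [configuration]");
        }
        try {
            URL base = URI.create(argv[0]).toURL();
            return new Args(base, argv.length == 2 ? argv[1] : DEFAULT_CONFIG);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad base URL: " + argv[0], e);
        }
    }

    /** Returns the three start-up steps, in order, as a readable trace. */
    static List<String> startupSteps(Args args) {
        List<String> steps = new ArrayList<>();
        steps.add("load configuration '" + args.configName + "'"); // step 1
        steps.add("create spider for " + args.baseUrl);            // step 2
        steps.add("start spider");                                 // step 3
        return steps;
    }

    public static void main(String[] argv) {
        for (String step : startupSteps(parse(argv))) {
            System.out.println(step);
        }
    }
}
```

Keeping the parsing separate from the steps makes the "default" fallback easy to check on its own.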

ConfigurationFactory

The two configuration factory methods that JSpider may call create a singleton JSpiderConfiguration instance from a new PropertiesConfiguration. Methods also exist to assign a JSpiderConfiguration directly and to "clean" the configuration by setting it to null. The PropertiesConfiguration assembles the JSpider, plugin and site configuration settings from the directories and property files under the configuration directory given on the command line.
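The singleton behaviour described here (create on first request, assign directly, or reset to null) might look something like the sketch below. It uses java.util.Properties as a stand-in for the real configuration classes; ConfigHolderSketch, its method names and the "config.name" key are all assumptions made for illustration, and no property files are actually read.

```java
import java.util.Properties;

// Hypothetical lazy singleton holder; not the JSpider ConfigurationFactory.
final class ConfigHolderSketch {
    private static Properties current; // the single shared configuration, or null

    private ConfigHolderSketch() {
    }

    /** Returns the shared configuration, creating it on first use. */
    static synchronized Properties get(String configName) {
        if (current == null) {
            Properties created = new Properties();
            // Placeholder for assembling settings from files under configName.
            created.setProperty("config.name", configName);
            current = created;
        }
        return current;
    }

    /** Variant that falls back to the "default" configuration. */
    static synchronized Properties get() {
        return get("default");
    }

    /** Assigns a configuration directly, replacing any existing one. */
    static synchronized void set(Properties configuration) {
        current = configuration;
    }

    /** "Cleans" the configuration so the next get() builds a fresh one. */
    static synchronized void clean() {
        current = null;
    }
}
```

One consequence of this design is that a second call with a different name returns the instance created by the first call, until the configuration is cleaned.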

JSpider instance

The JSpider constructor instantiates a SpiderNest object, gets a SpiderContext from the static SpiderContextFactory.createContext(URL) method, and passes the context to the nest to get a Spider instance. The spider nest reads the number of spider threads and thinker threads from the context and passes these to the spider implementation.
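The wiring between the nest, the context and the spider can be pictured with the simplified sketch below. The nested Context, Nest and Spider classes are hypothetical stand-ins rather than the JSpider types, and the thread counts of 5 and 1 are placeholder values chosen for the example; in JSpider they come from the loaded configuration.

```java
import java.net.URI;

// Hypothetical wiring sketch; the nested types are not the JSpider classes.
final class WiringSketch {
    /** Carries the base URL and the thread counts the nest will read. */
    static final class Context {
        final URI baseUrl;
        final int spiderThreads;
        final int thinkerThreads;

        Context(URI baseUrl, int spiderThreads, int thinkerThreads) {
            this.baseUrl = baseUrl;
            this.spiderThreads = spiderThreads;
            this.thinkerThreads = thinkerThreads;
        }
    }

    /** Receives the context together with the thread counts taken from it. */
    static final class Spider {
        final Context context;
        final int spiderThreads;
        final int thinkerThreads;

        Spider(Context context, int spiderThreads, int thinkerThreads) {
            this.context = context;
            this.spiderThreads = spiderThreads;
            this.thinkerThreads = thinkerThreads;
        }
    }

    /** Reads the thread counts from the context and hands them to the spider. */
    static final class Nest {
        Spider createSpider(Context context) {
            return new Spider(context, context.spiderThreads, context.thinkerThreads);
        }
    }

    /** Static factory; the counts 5 and 1 are assumed placeholder values. */
    static Context createContext(URI baseUrl) {
        return new Context(baseUrl, 5, 1);
    }

    /** Mirrors the constructor sequence: nest, then context, then spider. */
    static Spider wire(URI baseUrl) {
        Nest nest = new Nest();
        Context context = createContext(baseUrl);
        return nest.createSpider(context);
    }
}
```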

Spider implementation

The spider implementation contains two WorkerThreadPools: one for the spiders and one for the thinkers. It also creates thread pool monitors that periodically report the status of the pools. The spider's crawl method gets an EventDispatcher from the SpiderContext and dispatches a SpideringStartedEvent carrying the base URL passed on the command line. It then creates new DispatchSpiderTasks and DispatchThinkerTasks instances and assigns them to the worker thread pools.

By the time the crawl method returns, the spidering process is complete. Before it shuts down the spider context, the method dispatches a SpideringSummaryEvent and a SpideringStoppedEvent.
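The ordering of events around the two pools can be illustrated with the sketch below. It swaps in java.util.concurrent executors for JSpider's WorkerThreadPool and records events as plain strings in a list instead of using an EventDispatcher; CrawlSketch and everything inside it are hypothetical, and the thread pool monitors are left out.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical crawl lifecycle; standard executors replace WorkerThreadPool.
final class CrawlSketch {
    /** Runs one task per pool and returns the events in dispatch order. */
    static List<String> crawl(String baseUrl, int spiderThreads, int thinkerThreads) {
        List<String> events = new CopyOnWriteArrayList<>();
        ExecutorService spiders = Executors.newFixedThreadPool(spiderThreads);
        ExecutorService thinkers = Executors.newFixedThreadPool(thinkerThreads);

        // The started event goes out before any work is handed to the pools.
        events.add("SpideringStarted " + baseUrl);

        // One dispatcher task per pool; the real tasks would feed further work.
        spiders.execute(() -> {
            events.add("spider task ran");
        });
        thinkers.execute(() -> {
            events.add("thinker task ran");
        });

        shutdownAndWait(spiders);
        shutdownAndWait(thinkers);

        // Summary and stopped events follow once both pools have drained.
        events.add("SpideringSummary");
        events.add("SpideringStopped");
        return events;
    }

    /** Stops accepting new work and waits for queued tasks to finish. */
    private static void shutdownAndWait(ExecutorService pool) {
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }
}
```

The two task entries may appear in either order, but the started event is always first and the summary and stopped events are always last.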


This document was last modified by Philip Shaw on 2005-04-12 09:22:34
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html