Heritrix

Heritrix is the Internet Archive's Web crawler. It is released under the GPL, but it depends on many other packages whose licence terms may not be compatible with the GPL. In many cases the dependency only provides handling for specific content types and may not be critical for (X)HTML-only retrieval.

  • Classes in the org.archive.crawler.extractor package depend on packages in com.anotherbigidea.flash.*. These Flash parsing packages are released under a BSD License that is compatible with the GPL; see the JavaSWF2-BSD License.
  • Classes in the org.archive.crawler.extractor package depend on packages in com.lowagie.text.pdf.*. This iText PDF parsing package is released under both the Library General Public License (see the master version) and the Mozilla Public License.
  • One class, org.archive.util.GateSync, depends on the class EDU.oswego.cs.dl.util.concurrent.Sync. The package as a whole is released into the public domain, but its CopyOnWriteArrayList and ConcurrentReaderHashMap classes are released under a special licence from Sun Microsystems; see the TECHNOLOGY LICENSE FROM SUN MICROSYSTEMS, INC. TO DOUG LEA (PDF). Whether the crawler depends on these two classes has not been established.
  • Various classes depend on Apache packages released under the Apache Software License version 2.0:
    • The Command Line Interface (CLI) package, org.apache.commons.cli.
    • The Commons Collections package, org.apache.commons.collections.
    • The Commons HTTP Client package, org.apache.commons.httpclient.
    • The Commons Logging package, org.apache.commons.logging.
    • The Commons Net package, org.apache.commons.net.
    • The Commons Pool package, org.apache.commons.pool.
    • The Jakarta POI package, org.apache.poi.hdf.extractor.
  • Classes in the org.archive.crawler package depend on classes in the org.mortbay.http and org.mortbay.jetty packages. These packages are released under the Apache Software License version 2.0 with special restrictions.
  • Classes in various packages depend on the Java DNS package, org.xbill.DNS, released under the BSD License.
  • JUnit tests depend on the junit.extensions and junit.framework packages; see secondary dependencies on JUnit below.
  • Classes in org.archive.util and org.archive.datamodel depend on classes in the st.ata.util package, which does not appear to be maintained except by the Heritrix project. The source code contains neither licence information nor a copyright statement.

Heritrix also depends on standard Java extensions that may not be fully implemented by the GNU Classpath extensions:

  • Classes in many packages depend on the javax.management package.
  • Classes in several packages depend on the javax.net and javax.net.ssl packages.
  • Classes in the org.archive.crawler package depend on the javax.xml.parsers and javax.xml.transform packages, which should be compatible with GNU JAXP.
  • Classes in the org.archive.settings package depend on classes in org.xml.sax, which should be compatible with GNU JAXP.
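
Dependency lists like the ones above can be cross-checked by scanning the import statements in the Java source. The following is a minimal sketch in Python, not the method used to compile this page: the function name external_imports and the sample source text are hypothetical, and the prefix list is simply copied from the packages named above. It only inspects import lines, so fully qualified class names used inline in code would be missed.

```python
import re
from collections import defaultdict

# Matches Java import lines such as "import org.xbill.DNS.Name;",
# "import org.xml.sax.*;" or "import static junit.framework.Assert.fail;".
# The captured name excludes a trailing ".*" wildcard.
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)

# Package prefixes taken from the lists on this page (an assumption:
# extend or trim this tuple to suit the source tree being checked).
EXTERNAL_PREFIXES = (
    "com.anotherbigidea.flash",
    "com.lowagie.text.pdf",
    "EDU.oswego.cs.dl.util.concurrent",
    "org.apache.commons",
    "org.apache.poi",
    "org.mortbay",
    "org.xbill.DNS",
    "junit",
    "st.ata.util",
    "javax.management",
    "javax.net",
    "javax.xml",
    "org.xml.sax",
)


def external_imports(java_source):
    """Group the imports found in one Java source text by matching prefix."""
    found = defaultdict(list)
    for name in IMPORT_RE.findall(java_source):
        for prefix in EXTERNAL_PREFIXES:
            # Match the prefix itself or any package/class beneath it.
            if name == prefix or name.startswith(prefix + "."):
                found[prefix].append(name)
                break
    return dict(found)


if __name__ == "__main__":
    # Hypothetical source text, for illustration only.
    sample = (
        "package org.archive.util;\n"
        "import EDU.oswego.cs.dl.util.concurrent.Sync;\n"
        "import java.util.List;\n"
        "import org.xml.sax.*;\n"
    )
    for prefix, names in external_imports(sample).items():
        print(prefix, "<-", ", ".join(names))
```

Running the function over every .java file in a source tree and merging the results by prefix gives a rough per-package view of which external packages are imported where. Imports from java.* are deliberately ignored because they are not in the prefix list.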

This document was last modified by Philip Shaw on 2004-11-03 06:39:28
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html