Grunk

Grunk is a framework for spidering and indexing metadata from semi-structured text formats. Grunk itself is released under the GPL, but it depends on the following Apache packages:

  • The emerge.grunk.regex.RegEx class depends on the org.apache.oro.text.regex package, released under the Apache Software License.
  • The nsca.emerge.grunk.xml.XSLTransformer class depends on org.apache.xml.serialize.XMLSerializer, which is part of the Apache Xerces XML parser, released under the Apache Software License.

Grunk also has dependencies on standard Java extensions, which should be compatible with GNU JAXP:

  • The nsca.emerge.grunk.xml.XSLTransformer class depends on the javax.xml.transform and javax.xml.transform.stream packages.
  • Various packages depend on the org.w3c.dom package.
  • Various packages depend on the org.xml.sax package.
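As an illustration of what the javax.xml.transform and javax.xml.transform.stream dependency involves, here is a minimal sketch that applies an XSLT stylesheet to an XML string using only the standard JAXP API. The class name JaxpTransformSketch and the sample record are assumptions made for this example; it is not the code of Grunk's XSLTransformer class, whose internals are not described here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration only: this is NOT Grunk's XSLTransformer.
public class JaxpTransformSketch {

    /** Applies an XSLT stylesheet to an XML document and returns the output as a string. */
    public static String transform(String xml, String xslt) {
        try {
            // The factory picks whichever JAXP implementation is on the classpath.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data: a one-field metadata record.
        String xml = "<record><title>Grunk</title></record>";
        String xslt =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\">"
          + "<xsl:value-of select=\"/record/title\"/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

Because the sketch touches only javax.xml.transform interfaces, it should run unchanged against any conforming JAXP implementation.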

Initial review notes
====================

Grunk turns out to be more of a content parser than a spidering application per se. It is a tool for analysing source data structures and applying appropriate parsing tools to the content.

Grunk uses layered sets of Importer, Scanner, and Preprocessor components to identify an appropriate parsing scheme for a source and then apply it. This makes the system quite large in terms of the number of classes, and biases it towards plain text formats rather than HTML or XML. Grunk also appears able to handle extremely large input sources.
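The layered approach described above can be sketched roughly as follows. This is a hypothetical outline, not Grunk's real API: the component names Importer, Scanner, and Preprocessor come from the notes above, but the role given to each one, the method signatures, and the Scheme and LayeredParsingSketch classes are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: these types are NOT Grunk's real classes.
public class LayeredParsingSketch {

    /** Assumed role: decides whether a parsing scheme applies to the raw text. */
    public interface Scanner { boolean matches(String raw); }

    /** Assumed role: normalises the raw text before it is parsed. */
    public interface Preprocessor { String prepare(String raw); }

    /** Assumed role: extracts metadata fields from the prepared text. */
    public interface Importer { Map<String, String> extract(String prepared); }

    /** One parsing scheme: a scanner plus the steps to run when it matches. */
    public static final class Scheme {
        final Scanner scanner;
        final Preprocessor preprocessor;
        final Importer importer;

        public Scheme(Scanner scanner, Preprocessor preprocessor, Importer importer) {
            this.scanner = scanner;
            this.preprocessor = preprocessor;
            this.importer = importer;
        }
    }

    /** Tries each scheme in order and applies the first whose scanner accepts the source. */
    public static Optional<Map<String, String>> parse(String raw, List<Scheme> schemes) {
        for (Scheme scheme : schemes) {
            if (scheme.scanner.matches(raw)) {
                String prepared = scheme.preprocessor.prepare(raw);
                return Optional.of(scheme.importer.extract(prepared));
            }
        }
        return Optional.empty(); // no scheme recognised this source
    }

    /** Example importer for "Key: value" lines, a simple plain-text layout. */
    public static Map<String, String> keyValueLines(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** A single demo scheme: match on ':', normalise line endings, read key/value lines. */
    public static List<Scheme> demoSchemes() {
        return Collections.singletonList(new Scheme(
            raw -> raw.contains(":"),
            raw -> raw.replace("\r\n", "\n"),
            LayeredParsingSketch::keyValueLines));
    }

    public static void main(String[] args) {
        // Made-up sample source in a plain-text key/value layout.
        String source = "Title: Grunk\r\nLicence: GPL\r\n";
        System.out.println(parse(source, demoSchemes()));
    }
}
```

Even in this reduced form, each new source format needs its own scanner, preprocessor, and importer, which suggests why a design along these lines would grow to a large number of classes.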

This document was last modified by Philip Shaw on 2004-11-03 06:30:23
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html