Documenttype configuration

A documenttype node specifies which document factory is used to extract the contents of an OpenCms resource with a given resource type and/or mimetype into a Lucene index document. The given document factory is used for every combination of the listed resource types and the listed mimetypes.

<documenttype>
    <name>...</name>
    <class>...</class>
    <mimetypes>
        <mimetype>...</mimetype>
        ...
    </mimetypes>
    <resourcetypes>
        <resourcetype>...</resourcetype>
        ...
    </resourcetypes>
</documenttype>

Configuration nodes

The following nodes are used to specify a documenttype:

  • the <name> node gives the documenttype a unique name
  • the <class> node specifies the fully qualified class name of the document factory
  • zero or more <mimetype> nodes each specify a mimetype of resource contents handled by the given document factory. When a resource is indexed, its mimetype is derived from the extension of the resource name.
  • one or more <resourcetype> nodes each specify an OpenCms resource type handled by the given document factory

Example 1

This example shows how to configure a documenttype for PDF documents:

<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
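
Following the same pattern, a documenttype for HTML files stored as plain resources might look like the sketch below. The name html is an arbitrary label chosen for illustration, and the mimetype text/html is an assumption about what the .html extension resolves to in a given installation. The factory class and the resource type are taken from the lists of available document classes and resource types on this page.

<documenttype>
    <!-- hypothetical name, chosen for illustration only -->
    <name>html</name>
    <class>org.opencms.search.documents.CmsDocumentHtml</class>
    <mimetypes>
        <!-- assumed mimetype for resources named *.html -->
        <mimetype>text/html</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>

With this node, only resources of type plain whose mimetype resolves to text/html would be handled by CmsDocumentHtml. The node does not cover other combinations.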

Available document classes

Currently, these document factories are part of the OpenCms search package:

  • org.opencms.search.documents.CmsDocumentGeneric
    Extracts index data from a VFS resource. This factory extracts only property data such as title, description and keywords, not the content. It is used as the base class of the other document factories.
  • org.opencms.search.documents.CmsDocumentHtml
    Extracts index data from a resource that contains HTML as plain text.
  • org.opencms.search.documents.CmsDocumentMsExcel
    Extracts index data from a document in Microsoft Excel 97(-2002) file format (BIFF8).
  • org.opencms.search.documents.CmsDocumentMsPowerPoint
    Extracts index data from a document in Microsoft PowerPoint file format.
  • org.opencms.search.documents.CmsDocumentMsWord
    Extracts index data from a document in Microsoft Word 97 file format.
  • org.opencms.search.documents.CmsDocumentPdf
    Extracts index data from a document in Adobe Portable Document Format.
  • org.opencms.search.documents.CmsDocumentPlainText
    Extracts index data from a document in plain text format.
  • org.opencms.search.documents.CmsDocumentRtf
    Extracts index data from a document in Rich Text (RTF) file format.
  • org.opencms.search.documents.CmsDocumentXmlContent
    Extracts index data from a resource of type xmlcontent.
  • org.opencms.search.documents.CmsDocumentXmlPage
    Extracts index data from a resource of type xmlpage.
    All tags in the content are filtered away, so the xmlpage elements can contain both XML and HTML data.
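
As an illustration of how one of these factories is wired to a resource type, a documenttype that routes xmlpage resources to CmsDocumentXmlPage could be sketched as follows. The name xmlpage is a label chosen here, and text/html is an assumption that the pages are named with an .html extension. Both should be checked against the actual installation rather than taken as the shipped configuration.

<documenttype>
    <!-- hypothetical name, chosen for illustration only -->
    <name>xmlpage</name>
    <class>org.opencms.search.documents.CmsDocumentXmlPage</class>
    <mimetypes>
        <!-- assumption: xmlpage resources are named *.html -->
        <mimetype>text/html</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>xmlpage</resourcetype>
    </resourcetypes>
</documenttype>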

Available resource types

Currently, OpenCms uses the following resource types:

  • binary (org.opencms.file.types.CmsResourceTypeBinary)
  • folder (org.opencms.file.types.CmsResourceTypeFolder)
  • image (org.opencms.file.types.CmsResourceTypeImage)
  • jsp (org.opencms.file.types.CmsResourceTypeJsp)
  • plain (org.opencms.file.types.CmsResourceTypePlain)
  • pointer (org.opencms.file.types.CmsResourceTypePointer)
  • xmlcontent (org.opencms.file.types.CmsResourceTypeXmlContent)
  • xmlpage (org.opencms.file.types.CmsResourceTypeXmlPage)