Documenttype configuration
A documenttype node specifies which document factory is used to pull the contents of an OpenCms resource with a particular resource type and/or mimetype into a Lucene index document. The given document factory is used for every combination of the specified resource types and the specified mimetypes.
<documenttype>
    <name>...</name>
    <class>...</class>
    <mimetypes>
        <mimetype>...</mimetype>
        ...
    </mimetypes>
    <resourcetypes>
        <resourcetype>...</resourcetype>
        ...
    </resourcetypes>
</documenttype>
Configuration nodes
The following nodes are used to specify a documenttype:
- the <name> node gives the documenttype a unique name
- the <class> node specifies the package/class name of the document factory
- zero or more <mimetype> nodes each specify a mimetype of resource contents handled by the given document factory. When a resource is indexed, its mimetype is derived from the extension of the resource name.
- one or more <resourcetype> nodes each specify an OpenCms resource type handled by the given document factory
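Because the <mimetype> nodes are optional, a documenttype can be declared without any mimetype restriction. The following is a minimal sketch, not a verified configuration: it assumes that an empty <mimetypes> node is accepted and that the factory is then selected by resource type alone. The name xmlcontent is an illustrative choice.

```xml
<!-- Sketch: no <mimetype> entries, assumed to mean "select by resource type only" -->
<documenttype>
    <name>xmlcontent</name>
    <class>org.opencms.search.documents.CmsDocumentXmlContent</class>
    <mimetypes></mimetypes>
    <resourcetypes>
        <resourcetype>xmlcontent</resourcetype>
    </resourcetypes>
</documenttype>
```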
Example 1
This example shows how to configure a documenttype for PDF documents:
<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
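Read against the matching rule, the PDF configuration covers two combinations of resource type and mimetype. The annotated copy below adds only comments to make those combinations explicit; it assumes that a resource name ending in .pdf is mapped to the mimetype application/pdf.

```xml
<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <!-- combination 1: resource type binary + mimetype application/pdf -->
        <resourcetype>binary</resourcetype>
        <!-- combination 2: resource type plain + mimetype application/pdf -->
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
```

Under this reading, a binary or plain resource whose name yields a different mimetype would not be handled by this documenttype.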
Available document classes
Currently, these document factories are part of the OpenCms search package:
- org.opencms.search.documents.CmsDocumentGeneric
  Extracts index data from a VFS resource. This factory extracts only property data such as title, description and keywords, not the content, and serves as the base class of the other document factories.
- org.opencms.search.documents.CmsDocumentHtml
  Extracts index data from a resource that contains HTML as plain text.
- org.opencms.search.documents.CmsDocumentMsExcel
  Extracts index data from a document in Microsoft Excel 97(-2002) file format (BIFF8).
- org.opencms.search.documents.CmsDocumentMsPowerPoint
  Extracts index data from a document in Microsoft PowerPoint file format.
- org.opencms.search.documents.CmsDocumentMsWord
  Extracts index data from a document in Microsoft Word 97 file format.
- org.opencms.search.documents.CmsDocumentPdf
  Extracts index data from a document in Adobe Portable Document Format.
- org.opencms.search.documents.CmsDocumentPlainText
  Extracts index data from a document in plain text format.
- org.opencms.search.documents.CmsDocumentRtf
  Extracts index data from a document in Rich Text (RTF) file format.
- org.opencms.search.documents.CmsDocumentXmlContent
  Extracts index data from a resource of type xmlcontent.
- org.opencms.search.documents.CmsDocumentXmlPage
  Extracts index data from a resource of type xmlpage. All tags in the content are stripped, so the xmlpage elements can contain both XML and HTML data.
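Any of these factories can be wired up in the same way as the PDF factory. The following sketch does so for Word documents. The name msword and the choice of resource types are illustrative assumptions, and it further assumes that the .doc extension is mapped to application/msword, the registered mimetype for Word 97 documents.

```xml
<!-- Sketch: Word documents stored as binary or plain resources -->
<documenttype>
    <name>msword</name>
    <class>org.opencms.search.documents.CmsDocumentMsWord</class>
    <mimetypes>
        <mimetype>application/msword</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
```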
Available resource types
Currently, OpenCms uses the following resource types:
- binary (org.opencms.file.types.CmsResourceTypeBinary)
- folder (org.opencms.file.types.CmsResourceTypeFolder)
- image (org.opencms.file.types.CmsResourceTypeImage)
- jsp (org.opencms.file.types.CmsResourceTypeJsp)
- plain (org.opencms.file.types.CmsResourceTypePlain)
- pointer (org.opencms.file.types.CmsResourceTypePointer)
- xmlcontent (org.opencms.file.types.CmsResourceTypeXmlContent)
- xmlpage (org.opencms.file.types.CmsResourceTypeXmlPage)