Set up a Solr schema.xml for AEM
Now that we have successfully convinced AEM to use Solr as its indexer, the next step is to create a schema that Solr uses for index and query processing.
Why do we need a schema?
Solr does not know anything about your data structure, but you want it to perform complex operations such as fulltext searches and faceting. To allow Solr to build a fast index, you need to define which fields you want to index and which operations should be performed at index or query time [1].
There is an excellent book by Trey Grainger [2] and Timothy Potter that gives a good overview of the capabilities of Solr [3]. Although it is written for Solr 5, most of the concepts are the same for Solr 6 and need only minimal adjustments.
By default, Solr 6 uses a managed schema [4], which allows you to use the Schema API [5] to modify the schema. You can change this behavior per core in solrconfig.xml and enable the classic schema.xml, which we’ll use in this example.
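As a minimal sketch, switching a core to the classic schema usually comes down to one element in that core's solrconfig.xml. The snippet assumes a stock Solr 6 configuration; check your own file for an existing schemaFactory element before adding a second one.

```xml
<!-- solrconfig.xml (per core): read the schema from schema.xml
     instead of the managed schema that the Schema API edits -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```

With this setting, schema changes are made by editing schema.xml and reloading the core.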
The Jackrabbit project provides a basic core configuration that you can use with Solr 4.x [6] and as the base for a custom configuration. I recommend having a look at its schema.xml, which is the base for the following definitions.
Schema.xml for AEM
You can find an example of a basic schema.xml [7] in the aem-solr GitHub repository [8], which I’ll explain here.
Unique Key
The uniqueKey field is the identity of an indexed document. If a new document with an already existing uniqueKey is indexed, it replaces the existing entry. For structured content such as JCR content, the path is a great identifier and is therefore used.
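A minimal sketch of such a declaration could look like this. It assumes the path is held in a field named path_exact (the stored path field of this schema) and that a fieldtype called string exists; both should be checked against the actual schema.xml.

```xml
<!-- Sketch: the path field doubles as the document identity.
     The type name "string" is an assumption. -->
<field name="path_exact" type="string" indexed="true" stored="true" required="true"/>

<!-- Indexing a second document with the same path replaces the first one -->
<uniqueKey>path_exact</uniqueKey>
```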
Fields
Path*
Since you will most likely want to restrict your queries to certain paths rather than always query the complete index, some adjustments are required here. The Jackrabbit Oak Solr indexer supports multiple path fields out of the box that should be added to your schema [9]. The documentation also provides some examples of how those fields are used.
Note: Only the field path_exact is stored in our index and is therefore retrievable. All other path fields are only used for indexing.
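To make the stored versus index-only distinction concrete, here is a hedged sketch of a subset of the path fields. The names path_des and path_child follow the Oak Solr defaults as far as I recall them, and the type names are placeholders, so treat all of them as assumptions and compare them with the Oak documentation and the actual schema.xml.

```xml
<!-- Exact path: stored, so it can be returned in search results -->
<field name="path_exact" type="string" indexed="true" stored="true"/>

<!-- Index-only helpers for path restrictions (names and types are assumptions) -->
<field name="path_des" type="descendent_path" indexed="true" stored="false"/>
<field name="path_child" type="children_path" indexed="true" stored="false"/>

<!-- Feed the helper fields from the exact path at index time -->
<copyField source="path_exact" dest="path_des"/>
<copyField source="path_exact" dest="path_child"/>
```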
JCR/Sling and DAM attributes
The schema.xml
contains some interesting JCR attributes like jcr_title
or jcr_lastModified
that can be queried as string or date (e.g before xyz). To allow queries of DAM assets, you can also see the mimetype attributes of DAM.
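As an illustration only, two such attributes might be declared as follows. The type names string and date are assumptions; the real definitions are in the schema.xml of the repository.

```xml
<!-- Hypothetical definitions; type names are assumptions -->
<field name="jcr_title" type="string" indexed="true" stored="true"/>
<field name="jcr_lastModified" type="date" indexed="true" stored="true"/>
```

With a date-typed field, a "before" restriction can be expressed as a range query such as jcr_lastModified:[* TO 2017-01-01T00:00:00Z].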
Content attributes
For this example I’ll use three different JCR properties that should be indexed:
Fieldname | Index as |
---|---|
headline | Simple String, no fulltext search |
title | Simple String, no fulltext search |
text | English text, indexed for fulltext search, suggestions etc |
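Translated into field definitions, the table could look roughly like the sketch below. The attribute values (stored, multiValued) are my assumptions for illustration, not necessarily what the repository's schema.xml uses.

```xml
<!-- Plain strings: matched as a whole, no analysis, no fulltext search -->
<field name="headline" type="string" indexed="true" stored="true"/>
<field name="title" type="string" indexed="true" stored="true"/>

<!-- English text: analyzed for fulltext search;
     multiValued so that several source values can end up in it -->
<field name="text" type="text_en" indexed="true" stored="true" multiValued="true"/>
```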
Fieldtypes
All fieldtypes you can find in the schema.xml are quite simple and by the book. There are primitive fieldtypes like int or string, but also types that support fulltext searches, like text_en.
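For orientation, here is a hedged sketch of what such fieldtypes typically look like in a Solr 6 schema. The analyzer chain for text_en is a simplified assumption (no stop words or synonyms), not a copy of the repository's definition.

```xml
<!-- Primitive types -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>

<!-- Simplified English fulltext type: tokenize, lowercase, strip possessives, stem -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```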
For the two *_path fieldtypes, rules are defined that replace parts of the path or group it by slashes.
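One common way to group a path by slashes is the path hierarchy tokenizer that ships with Solr. The sketch below follows the descendent_path type from Solr's stock example schema; whether the aem-solr schema defines it exactly like this is an assumption.

```xml
<fieldType name="descendent_path" class="solr.TextField">
  <!-- Index time: /content/site/en becomes the tokens
       /content, /content/site and /content/site/en -->
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <!-- Query time: the queried path is kept as one token,
       so it matches every document located below it -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```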
Summary
For a simple AEM application where you want to perform fulltext searches on predefined fields (like text), the provided schema is a good starting point. You can extend it by adding additional fields or by using the copyField [10] mechanism to index more fields into the already defined ones.
If your application uses a property named richText which you want to index, the following definition would copy it into the text field and merge the results:
<copyField source="richText" dest="text"/>
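For the copyField rule to load, the surrounding schema needs matching field definitions. The sketch below shows one possible arrangement; the attribute choices for richText and text are assumptions for illustration.

```xml
<!-- Source field: as far as I know it has to be declared (or match a
     dynamicField), otherwise Solr rejects the copyField rule.
     Here it is neither indexed nor stored on its own. -->
<field name="richText" type="string" indexed="false" stored="false"/>

<!-- Destination: multiValued so that the values of text and richText
     can both be indexed into it -->
<field name="text" type="text_en" indexed="true" stored="true" multiValued="true"/>

<copyField source="richText" dest="text"/>
```

The copy happens before analysis, so the richText value goes through the analyzer of the text field, not its own.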
The next post will deal with a sample application you can set up to get a better insight into the steps achieved so far.