Chapter 4.  Configuration

Table of Contents

Mandatory configuration.
General
Database
Indexing/search
Distributed mode
Non-mandatory
General
Classifying
Zookeeper
Highlight and MoreLikeThis
Metadata

Sample files are available:

For standalone single node: aether.prop.standalone
For standalone using cloud services: aether.prop.cloud
For two nodes: aether.prop.testnode1 and aether.prop.testnode2

There are two modes, standalone and multinode. Standalone mode is selected if Hibernate or Lucene is chosen. (All nodes in a multinode setup are assumed to connect to the same Solr and HBase; this is not checked.) In standalone mode the node name is automatically localhost; otherwise a node name has to be set. Standalone mode uses no global lock.
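The mode rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the application's own code; the function name resolve_mode and the choice to raise an error when no node name is given are assumptions.

```python
def resolve_mode(db, index, nodename=None):
    """Return (mode, node name) for the given db and index settings."""
    # Standalone is selected if Hibernate or Lucene is chosen.
    if db == "hibernate" or index == "lucene":
        # In standalone mode the node name is always localhost.
        return "standalone", "localhost"
    # Assumption: a missing node name is treated as an error in multinode mode.
    if not nodename:
        raise ValueError("multinode mode needs a node name")
    return "multinode", nodename
```

For example, resolve_mode("hbase", "solr", "node1") gives multinode mode on node1, while resolve_mode("hibernate", "lucene") ignores any node name and gives standalone mode on localhost.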

The content source is a list of directories given in dirlist. The directories may be on an ordinary local filesystem or on HDFS. (More about HDFS later.) Exceptions are listed in dirlistnot.

The indexed filenames and metadata are stored in a database. This may be either the standalone H2 (with Hibernate) or the multinode HBase (with or without DataNucleus). The standalone database is currently too slow to handle disks and data sizes as large as the multinode one can.

The indexed file content and metadata are stored in Lucene or Solr. Lucene is the standalone option, while Solr is the multinode option.

In multinode mode, the distributedlockmode setting selects big or small lock granularity, using Curator. If the mode is big, only one node at a time may run the indexer or other write operations.

If the mode is small, locking is at the level of the indexed item.
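The difference between the two granularities can be pictured as the choice of lock name. The Python sketch below is an illustration only: the lock names are hypothetical, and it does not show how the application actually calls Curator.

```python
def lock_key(distributedlockmode, item_id):
    """Return the name of the lock a writer would take for item_id."""
    if distributedlockmode == "big":
        # One shared lock: every write operation contends for it.
        return "/aether/lock/global"  # hypothetical lock name
    if distributedlockmode == "small":
        # One lock per indexed item: writers on different items
        # do not block each other.
        return "/aether/lock/item/" + item_id  # hypothetical lock name
    raise ValueError("unknown distributedlockmode: " + repr(distributedlockmode))
```

With big, two nodes working on different items ask for the same lock, so one waits. With small, they only wait for each other when they touch the same item.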

In multinode mode with distributedprocess set, Hazelcast is used to distribute the work across the nodes, and Hazelcast locking is used. Configure aether.prop:

Mandatory configuration.

General

Which directories to include, and subdirectories to exclude.

dirlist=file:/home/dir/dir1,hdfs:/home/dir/dir2...
dirlistnot=/home/dir/dir1/tmp

Entries in dirlistnot are plain paths starting with /, without a file: or hdfs: prefix.

Known prefixes for dirlist entries are file, hdfs and none (defaulting to file).
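As an illustration of how these two settings could be read, here is a small Python sketch. The function names are made up for the example. It assumes that dirlistnot may hold several comma-separated paths and that an exclusion covers everything below the listed directory; neither is confirmed by the application.

```python
KNOWN_PREFIXES = ("file", "hdfs")

def parse_dirlist(value):
    """Split a dirlist value into (prefix, path) pairs."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("/"):
            # No prefix given: defaults to file.
            entries.append(("file", item))
            continue
        prefix, sep, path = item.partition(":")
        if sep and prefix in KNOWN_PREFIXES:
            entries.append((prefix, path))
        else:
            # Assumption: anything else is rejected.
            raise ValueError("unknown dirlist entry: " + item)
    return entries

def is_excluded(path, dirlistnot):
    """True if path is one of the dirlistnot directories or lies below one."""
    excluded = [p.strip().rstrip("/") for p in dirlistnot.split(",") if p.strip()]
    return any(path == p or path.startswith(p + "/") for p in excluded)
```

Here parse_dirlist("file:/home/dir/dir1,/home/dir/dir3") yields two file entries, and is_excluded("/home/dir/dir1/tmp/a.txt", "/home/dir/dir1/tmp") is true.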

Name of the node.

  Nodename=...

(If db is hibernate or index is lucene, the node name is localhost anyway.)

Maximum number of files to reindex each turn, or 0 for no limit.

reindexlimit=10000

Maximum number of files to index each turn, or 0 for no limit.

indexlimit=10000

The (re)indexing limits may be exceeded slightly, due to the parallel nature of the application.

Do not try to index a file that has failed this many times.

  failedlimit=10
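How indexlimit and failedlimit interact can be sketched as follows. This is a sequential illustration in Python, so it never overshoots the limit the way the parallel application may. The function name and the failures mapping (file name to failure count) are assumptions for the example.

```python
from itertools import islice

def select_for_turn(files, failures, indexlimit=10000, failedlimit=10):
    """Pick the files to index in one turn."""
    # Skip files that have already failed failedlimit times or more.
    eligible = (f for f in files if failures.get(f, 0) < failedlimit)
    if indexlimit == 0:
        # 0 means no limit.
        return list(eligible)
    return list(islice(eligible, indexlimit))
```

With files a, b, c, d, where b has failed 10 times, an indexlimit of 2 selects a and c, and an indexlimit of 0 selects a, c and d.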

Tika conversion timeout, in seconds.

tikatimeout=600

Timeout for other conversions, in seconds.

  othertimeout=600

Database

The database may be HBase, DataNucleus/HBase or Hibernate/H2.

Hibernate/H2

Hibernate/H2 relevant settings:

db=hibernate
h2dir=...
	  

Hibernate/H2 is for standalone only.

HBase

HBase relevant settings:

db=hbase
hbasequorum=localhost
hbaseport=2181
hbasemaster=localhost:2181
	

DataNucleus/HBase

DataNucleus/HBase relevant settings:

db=datanucleus
	  

Before starting Jetty, run this in an HBase shell:

create 'IndexFiles', { NAME => 'IndexFiles' }
create 'Files', { NAME => 'Files' }

(This is a temporary fix until DataNucleus creates the tables itself.)

Indexing/search

Indexing and search can use Lucene directly, or Solr. Lucene is for standalone only.

Lucene

Lucene relevant settings:

index=lucene
lucenepath=...
	  

Solr

Solr relevant settings:

index=solr
solrurl=http://localhost:8983/solr/mystuff

mkdir -p server/solr/MYCORE/conf
rsync -v -a server/solr/configsets/basic_configs/conf/  server/solr/MYCORE/conf
	  

If creating the core manually, create server/solr/MYCORE/core.properties. Then go to server/solr/MYCORE/conf and apply the patch there with

  patch -p0 < core.nostore.patch

(or core.store.patch, if using highlight and morelikethis)
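The backend-specific settings listed in the Database and Indexing/search sections can be checked together before startup. The Python sketch below is illustrative only. It treats every "relevant" setting as required, which is an assumption, and it lists no extra settings for db=datanucleus because none are documented above.

```python
# Settings listed as relevant for each backend in this chapter.
RELEVANT = {
    "db": {
        "hibernate": ["h2dir"],
        "hbase": ["hbasequorum", "hbaseport", "hbasemaster"],
        "datanucleus": [],
    },
    "index": {
        "lucene": ["lucenepath"],
        "solr": ["solrurl"],
    },
}

def missing_settings(props):
    """Return a list of problems found in a dict of aether.prop settings."""
    problems = []
    for selector, backends in RELEVANT.items():
        choice = props.get(selector)
        if choice not in backends:
            problems.append(f"{selector}: unknown or missing value {choice!r}")
            continue
        for key in backends[choice]:
            if not props.get(key):
                problems.append(f"{selector}={choice}: missing {key}")
    return problems
```

An empty list means that every setting listed for the chosen db and index backends is present.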

Langdetect

Download langdetect, and put profiles* in the same directory as pom.xml. Configure the languages to be used for classification (default en) with:

  languages=en,fr

Distributed mode

There are two config settings for this. With distributedlockmode=small there is a lock for each index entry. With distributedprocess=true the files are placed in a distributed queue.