Sample files are available.
For a standalone single node: aether.prop.standalone
For standalone using cloud services: aether.prop.cloud
For two nodes: aether.prop.testnode1 and aether.prop.testnode2
There are two modes, standalone and multinode. Standalone mode is selected if Hibernate or Lucene is chosen. (Multiple nodes are assumed to connect to the same Solr and HBase; this is not checked.) In standalone mode the nodename is automatically localhost; otherwise a nodename has to be set. Standalone mode uses no global locker.
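The mode rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the application's own code. The helper names parse_props and resolve_mode are hypothetical, the value index=lucene is assumed by analogy with index=solr, and the nodename key is written in lowercase as in the prose.

```python
def parse_props(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def resolve_mode(props):
    """Return (mode, nodename) following the rule described in the text."""
    # Assumption: index=lucene is how the Lucene choice is spelled.
    standalone = props.get("db") == "hibernate" or props.get("index") == "lucene"
    if standalone:
        # Standalone always runs as localhost, whatever nodename says.
        return "standalone", "localhost"
    nodename = props.get("nodename")
    if not nodename:
        raise ValueError("nodename must be set in multinode mode")
    return "multinode", nodename
```

For example, resolve_mode(parse_props("db=hbase\nindex=solr\nnodename=node1")) yields a multinode result with nodename node1, while any configuration with db=hibernate resolves to standalone on localhost.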
The content source is the list of directories in dirlist. Each directory may be on the ordinary local filesystem or on HDFS. (More about HDFS later.) Exceptions are listed in dirlistnot.
The indexed filenames and metadata are stored in a database. This may be either the standalone H2 (with Hibernate) or the multinode HBase (with or without DataNucleus). The standalone database is currently too slow to handle disks as large as multinode can.
The indexed file content and metadata are stored in Lucene or Solr. Lucene is for standalone, while Solr is for multinode.
If using multinode, the distributedlockmode setting selects big or small lock granularity, using Curator. If the mode is big, only one node at a time may run the indexer or other write operations.
With small granularity, locking is at the level of the indexed item.
If using multinode with distributedprocess set, Hazelcast is used to distribute processing across the nodes, together with Hazelcast locking.
Configure aether.prop:
Which directories to include, and subdirectories to exclude.
dirlist=file:/home/dir/dir1,hdfs:/home/dir/dir2...
dirlistnot=/home/dir/dir1/tmp
(dirlistnot takes plain paths starting with /, without the file: or hdfs: prefix.)
Known prefixes are file, hdfs and none (defaulting to file).
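As a rough illustration of how these two settings could be interpreted, the following Python sketch splits a dirlist entry into its prefix and path, and checks a path against dirlistnot. It is not the application's code. The function names are hypothetical, and both the comma-separated form of dirlistnot and the rule that an exclusion covers everything below the listed directory are assumptions.

```python
def split_location(entry):
    """Split a dirlist entry into (scheme, path); no prefix means 'file'."""
    for scheme in ("file", "hdfs"):
        prefix = scheme + ":"
        if entry.startswith(prefix):
            return scheme, entry[len(prefix):]
    return "file", entry


def is_excluded(path, dirlistnot):
    """True if path equals, or lies below, one of the excluded directories."""
    for excluded in dirlistnot:
        excluded = excluded.rstrip("/")
        # Compare on a directory boundary so /tmp does not match /tmpfiles.
        if path == excluded or path.startswith(excluded + "/"):
            return True
    return False


dirlist = "file:/home/dir/dir1,hdfs:/home/dir/dir2".split(",")
dirlistnot = "/home/dir/dir1/tmp".split(",")

locations = [split_location(entry) for entry in dirlist]
skip = is_excluded("/home/dir/dir1/tmp/a.txt", dirlistnot)
```

Matching on a directory boundary is a deliberate choice in this sketch: a plain string-prefix test would also exclude sibling directories whose names merely start with the excluded name.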
Name of the node.
Nodename=...
(If db is hibernate, or the index is Lucene, it is localhost anyway.)
Maximum number of items to reindex each turn, or 0 for no limit.
reindexlimit=10000
Maximum number of items to index each turn, or 0 for no limit.
indexlimit=10000
The (re)indexing limits may be exceeded slightly, due to the parallel nature of the application.
Do not try to index a file that has failed this many times.
failedlimit=10
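The interplay of the per-turn limit and the failure limit might look like the Python sketch below. It is a sequential simplification under stated assumptions, not the application's implementation: the function and parameter names are hypothetical, failedlimit is assumed to have no special 0 value, and the real application may overshoot the limit slightly because it works in parallel.

```python
def select_for_indexing(candidates, failed_counts, indexlimit=10000, failedlimit=10):
    """Pick the files to index in one turn.

    candidates    -- iterable of filenames waiting to be indexed
    failed_counts -- dict mapping filename to its number of failed attempts
    indexlimit    -- maximum number of files per turn, 0 meaning no limit
    failedlimit   -- skip files that have failed this many times
    """
    selected = []
    for name in candidates:
        if failed_counts.get(name, 0) >= failedlimit:
            continue  # this file has failed too often, do not retry it
        if indexlimit > 0 and len(selected) >= indexlimit:
            break  # the limit for this turn has been reached
        selected.append(name)
    return selected
```

The same selection would apply to reindexing, with reindexlimit in place of indexlimit.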
Timeout in seconds for Tika conversion.
tikatimeout=600
Timeout in seconds for other conversions.
othertimeout=600
The database may be HBase, DataNucleus/HBase or Hibernate/H2.
Hibernate/H2 relevant settings:
db=hibernate
h2dir=...
Hibernate/H2 is for standalone only.
HBase relevant settings:
db=hbase
hbasequorum=localhost
hbaseport=2181
hbasemaster=localhost:2181
The index can be Lucene directly or Solr. Lucene is for standalone only.
Solr relevant settings:
index=solr
solrurl=http://localhost:8983/solr/mystuff

Set up the core configuration directory:
mkdir -p server/solr/MYCORE/conf
rsync -v -a server/solr/configsets/basic_configs/conf/ server/solr/MYCORE/conf

If creating the core manually, create server/solr/MYCORE/core.properties. Then go to server/solr/MYCORE/conf and apply the patch with
patch -p0 < core.nostore.patch
(or core.store.patch, if using highlight and morelikethis)
Download from here, and put profiles* in the same directory as pom.xml.
Languages to be used for classification (default en) are configured with:
languages=en,fr