Non-mandatory

General

If using Hadoop HDFS

hdfsconffs=hadoop uri (sets fs.default.name)
    

See also Hadoop.

Where to log

logdir=...
    

Whether to enable downloading

downloader=true
    

Whether to use authentication

authenticate=true
    

There are two accounts: the admin user has username/password admin/admin, and the ordinary user has user/user. The admin has access to the control panel and configuration, and sees more search results than the user, including results on indexing time usage and error messages from indexing. If authentication is not used, everything is open. In both cases, access from localhost determines admin rights.

mltcount=...
mltmindf=...
mltmintf=...

These set the More Like This search document count, minimum document frequency and minimum term frequency, respectively. They may influence the time More Like This queries take.
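
As a rough illustration of what the two frequency thresholds mean, the sketch below filters candidate terms the way Lucene-style More Like This term selection is commonly described: terms that occur too rarely in the source document, or in too few indexed documents, are dropped. It is written in Python for brevity. The function name and sample data are hypothetical, and this is not the application's actual implementation.

```python
from collections import Counter


def select_mlt_terms(doc_terms, doc_freq, min_tf, min_df):
    """Return the terms of one document that pass both thresholds.

    doc_terms: list of tokens from the source document
    doc_freq:  dict mapping term -> number of indexed documents containing it
    min_tf:    minimum occurrences within the source document (cf. mltmintf)
    min_df:    minimum number of documents containing the term (cf. mltmindf)
    """
    term_freq = Counter(doc_terms)
    return sorted(
        term
        for term, tf in term_freq.items()
        if tf >= min_tf and doc_freq.get(term, 0) >= min_df
    )


# Hypothetical sample data, for illustration only.
tokens = ["hadoop", "hadoop", "spark", "spark", "spark", "zookeeper"]
doc_freq = {"hadoop": 10, "spark": 1, "zookeeper": 7}

selected = select_mlt_terms(tokens, doc_freq, min_tf=2, min_df=5)
print(selected)
```

Lower thresholds let more terms through, which tends to make the resulting query larger and slower.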

Classifying

There are two types, Mahout and OpenNLP.

Mahout

Mahout-relevant settings. For the older MapReduce-based classifier:

classify=mahout
mahoutbasepath=.../mahoutc/LANG
mahoutalgorithm=bayes (or cbayes)
mahaoutmodelpath=.../mahout/model
mahoutlabelindexfilepath=.../mahout/labelindex
mahoutdictionarypath=.../mahout/dataset-vectors/dictionary.file-0
mahoutdocumentfrequencypath=.../mahout/dataset-vectors/df-count/part-r-00000
mahaoutconffs=hadoop uri (sets fs.default.name in this specific case)

For the newer Spark-based classifier:

classify=mahoutspark
mahoutbasepath=.../mahoutc/LANG
mahoutalgorithm=bayes (or cbayes)
mahaoutmodelpath=.../mahout/model
mahoutdictionarypath=.../mahout/dataset-vectors/dictionary.file-0
mahoutdocumentfrequencypath=.../mahout/dataset-vectors/df-count/part-r-00000
mahaoutconffs=hadoop uri (sets fs.default.name in this specific case)
mahoutsparkmaster=spark-master

LANG is replaced by each of the configured detected languages, so the corresponding files and directories must exist for each language. If mahoutbasepath is set, it is prepended to the other paths, which are then interpreted as relative paths. For more about Spark, see Spark.
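
The path handling described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the application's code: the function names, the example base path and the language codes are hypothetical, and the real paths depend on the installation.

```python
import posixpath


def resolve_path(path, lang, base_path=None):
    """Resolve one configured path for one language.

    If a base path is given, it is prepended and `path` is treated as
    relative to it. The LANG placeholder is then replaced by `lang`.
    """
    if base_path:
        path = posixpath.join(base_path, path)
    return path.replace("LANG", lang)


def resolve_all(settings, languages):
    """Return {language: {setting name: resolved path}}."""
    base = settings.get("mahoutbasepath")
    return {
        lang: {
            key: resolve_path(value, lang, base)
            for key, value in settings.items()
            if key != "mahoutbasepath"
        }
        for lang in languages
    }


# Hypothetical values; the real paths depend on the installation.
settings = {
    "mahoutbasepath": "/data/mahoutc/LANG",
    "mahoutlabelindexfilepath": "mahout/labelindex",
}
resolved = resolve_all(settings, ["en", "no"])
print(resolved["en"]["mahoutlabelindexfilepath"])
```

Each resolved path would have to exist for every configured language, which is why one set of model files per language is needed.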

Training can be based on the Bayes part of examples/bin/classify-20newsgroups.sh in the Mahout distribution; for more about this, see Mahout.

OpenNLP

OpenNLP-relevant settings:

classify=opennlp
opennlpmodelpath=.../opennlp/LANG-doccat.bin
  

LANG is replaced by each of the configured detected languages, so the corresponding file must exist for each language.

Training is standard, done with ./bin/opennlp DoccatTrainer in the OpenNLP distribution.
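
The OpenNLP document categorizer trainer reads one document per line, with the category first and the text after a whitespace separator. The Python sketch below shows one way to prepare such a file and to derive the per-language model file name. The helper names, the sample categories and texts, and the relative path template are hypothetical and only for illustration.

```python
def doccat_line(category, text):
    """Format one sample: category, one space, text collapsed to one line."""
    if not category or any(ch.isspace() for ch in category):
        raise ValueError("category must be a single non-empty token")
    return category + " " + " ".join(text.split())


def model_path(template, lang):
    """Replace the LANG placeholder in a configured model path."""
    return template.replace("LANG", lang)


# Hypothetical training samples.
samples = [
    ("sports", "The team won\nthe final."),
    ("politics", "Parliament  passed the bill."),
]
training_data = "\n".join(doccat_line(c, t) for c, t in samples) + "\n"
print(training_data, end="")

# Hypothetical path template, following the LANG-doccat.bin naming above.
print(model_path("opennlp/LANG-doccat.bin", "en"))
```

Collapsing whitespace keeps each document on a single line, so a newline inside a text cannot be mistaken for the start of a new sample.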

Zookeeper

Configured by

zookeeper=...
  

Should be used in a multinode environment (not yet mandatory).

Highlight and MoreLikeThis

Configured by

highlightmlt=true
  

If using Solr, also go to server/solr/MYCORE/conf and apply the patch with

patch -p0 < core.store.patch

Beware that this also stores the content, and increases disk space usage.

Metadata

Tika provides the metadata, but it is not always sufficient. Therefore, a property Files-Content-Type has been added, holding the content type probed by the Files class, for cases not supported by Tika.
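
The idea of recording a separately probed content type next to the extracted metadata can be sketched as below. This is an assumption-laden Python illustration: mimetypes.guess_type from the Python standard library stands in for the Files class probe and only looks at the file name, the function name and sample metadata are hypothetical, and the application's own logic may differ.

```python
import mimetypes


def add_probed_content_type(metadata, filename):
    """Return a copy of metadata with Files-Content-Type set when a probe succeeds."""
    result = dict(metadata)
    # Stand-in probe: guesses the type from the file extension only.
    probed, _encoding = mimetypes.guess_type(filename)
    if probed is not None:
        result["Files-Content-Type"] = probed
    return result


# Hypothetical case: the extractor returned no content type for this file.
metadata = add_probed_content_type({"title": "Annual report"}, "report.pdf")
print(metadata["Files-Content-Type"])
```

Keeping the probed value under its own property name leaves any content type reported by Tika untouched.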