Commit 4168a4b066964a57c0b41612308f82a972e3a5ac

Authored by conradverm
1 parent 121a1f92

KTS-673

"The search algorithm needs some work"
Added. Basic documentation on the search

Committed By: Conrad Vermeulen
Reviewed By: Kevin Fourie

git-svn-id: https://kt-dms.svn.sourceforge.net/svnroot/kt-dms/trunk@7232 c91229c3-7414-0410-bfa2-8a42b809f60b
search2/docs/adminguide.txt 0 → 100644
  1 +SEARCH2 Administrator Guide
  2 +===========================
  3 +
  4 +TODO: put this on the wiki.
  5 +
  6 +Configuration
  7 +-------------
  8 +
  9 +[search]
  10 +; The number of results per page
  11 +; defaults to 25
  12 +resultsPerPage = default
  13 +
  14 +; The date format used when making queries using widgets
  15 +; defaults to Y-m-d .... NOTE Future development
  16 +dateFormat = default
  17 +
  18 +[indexer]
  19 +; The core indexing class
  20 +coreClass=PHPLuceneIndexer
  21 +
  22 +; The number of documents to be indexed in a cron session
  23 +; defaults to 20
  24 +batchDocuments = default
  25 +
  26 +; The location of the lucene indexes
  27 +luceneDirectory=${varDirectory}/indexes
  28 +
  29 +; The URL for the Java Lucene Server. This should match the Lucene Server configuration.
  30 +; Defaults to http://localhost:8875
  31 +javaLuceneURL = default
  32 +
  33 +Setting up the Lucene Directory
  34 +-------------------------------
  35 +
  36 +If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. Some more information is available
  37 +in ktroot/bin/luceneserver/README.TXT
  38 +
  39 +Edit the config.ini and ensure that the 'javaLuceneURL' field is correct.
  40 +
  41 +If using the PHP Lucene Server, you need to run the search2/indexing/bin/recreateIndex.php script.
  42 +
  43 +Migration
  44 +---------
  45 +
  46 +Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes.
  47 +This is done using the search2/indexing/bin/migrate.php script. (This script can be heavy on the system - care should be taken when running it.)
  48 +
  49 +Search Results Ranking
  50 +----------------------
  51 +
  52 +Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the
  53 +relevance of search results according to your needs.
  54 +
  55 +Status
  56 +------
  57 +
  58 +TODO:
  59 +The lucene indexers should provide some statistics and general information on the lucene index. A diagnostics
  60 +function should also be available to ensure that the correct versions of the documents are indexed, and possibly to reschedule indexing if there is a mismatch for
  61 +some reason. (This feature could be heavy on the system - care should be taken when implementing it.)
  62 +
  63 +Background Tasks
  64 +----------------
  65 +search2/indexing/bin/cronIndexer.php - task to batch index files.
  66 +search2/indexing/bin/optimise.php - task to optimise the lucene index.
  67 +
  68 +The indexing script should be run frequently - say every 5 minutes. The number of documents indexed per run is configurable in config.ini (batchDocuments) and
  69 +defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that the load does not
  70 +impact the performance of the system.
  71 +
  72 +The lucene index requires optimisation to ensure that performance is optimal. This could be run once a day around midnight, or weekly depending on frequency
  73 +of updates to the index.
  74 +
  75 +HOWTO - how to run a php script from the command line
  76 +-----------------------------------------------------
  77 +
  78 +php -Cq script.php
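As an illustration of the schedule described under Background Tasks, a crontab along the following lines could be used. This is a sketch only: /path/to/ktroot is a placeholder for the actual installation directory, and midnight is just one of the optimisation times the guide suggests.

```
# Index pending documents every 5 minutes (the path is a placeholder).
*/5 * * * * php -Cq /path/to/ktroot/search2/indexing/bin/cronIndexer.php
# Optimise the lucene index once a day at midnight.
0 0 * * * php -Cq /path/to/ktroot/search2/indexing/bin/optimise.php
```

If indexing runs more often than this, batchDocuments in config.ini can be lowered to keep each run short.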
search2/docs/architecture.txt 0 → 100644
  1 +SEARCH2 ARCHITECTURE
  2 +====================
  3 +
  4 +TODO: put this on the wiki.
  5 +
  6 +Introduction
  7 +------------
  8 +
  9 +Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible
  10 +enough to accommodate KnowledgeTree's metadata and document content.
  11 +
  12 +The previous search was implemented using mysql's full text indexes, but it was found to be rather limiting from the perspective
  13 +of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation.
  14 +
  15 +The complexity of integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source.
  16 +
  17 +KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The idea was to provide a way to create an
  18 +expression which could be used. The expression is evaluated to identify the subexpressions that should run on lucene and those
  19 +that should run on the metadata in the database, with the results finally being merged.
  20 +
  21 +New Database Requirements
  22 +-------------------------
  23 +
  24 +In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents
  25 +are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process
  26 +items in the 'pending' index queue.
  27 +
  28 +The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values
  29 +are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion
  30 +
  31 +The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields
  32 +and when results from the database and Lucene must be merged.
  33 +
  34 +The 'search_saved' table stores the expressions. The 'type' field describes what the saved search would be used for. These features will be used
  35 +in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription
  36 +
  37 +The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background.
  38 +
  39 +Folder Structure
  40 +----------------
  41 +
  42 +The core search functionality is located in the ktroot/search2 folder. This contains an 'indexing' folder and a 'search' folder.
  43 +The 'indexing' folder contains the core functionality regarding indexing using Lucene - using the Java Lucene server or the PHP Lucene Server.
  44 +The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene
  45 +and the database, ranking and merging results.
  46 +
  47 +search2/indexing/bin - various scripts that can be run from the command line.
  48 +search2/indexing/extractors - text extractors used to extract text from various files.
  49 +search2/indexing/extractorHooks - hooking mechanisms around extraction process.
  50 +search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation.
  51 +search2/indexing/lib - libraries that may be required that are specific to Lucene.
  52 +search2/indexing/test - some basic test scripts.
  53 +
  54 +
  55 +search2/search - the primary location of search functionality.
  56 +search2/search/bin - various scripts that can be run from the command line.
  57 +search2/search/fields - the fields that can be used in expressions.
  58 +search2/search/test - some basic test scripts.
  59 +
  60 +bin/luceneserver - the location of the Java Lucene Server.
  61 +
  62 +Additional Search Requirements
  63 +------------------------------
  64 +
  65 +The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project.
  66 +See http://pear.php.net/package/PHP_ParserGenerator for more details.
  67 +
  68 +Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated
  69 +into the ZendFramework. See http://framework.zend.com for more details.
  70 +
  71 +search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface to the PHP ZendFramework.
  72 +
  73 +search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server
  74 +must be running for this to work.
search2/docs/extractors.txt 0 → 100644
  1 +SEARCH2 - HOWTO WRITE AN EXTRACTOR
  2 +==================================
  3 +
  4 +All extractors are located in the search2/indexing/extractors folder.
  5 +
  6 +Naming Convention
  7 +-----------------
  8 +
  9 +The extractor must be a class descended from DocumentExtractor, and its name must end with the suffix 'Extractor'. The filename for the class
  10 +should have the same name as the class, but with the extension '.inc.php'.
  11 +
  12 +Example
  13 +-------
  14 +
  15 +The simplest extractor is the following:
  16 +
  17 +class SomeExtractor extends DocumentExtractor
  18 +{
  19 +    public function getDisplayName()
  20 +    {
  21 +        return _kt('Some Extractor');
  22 +    }
  23 +
  24 +    public function getSupportedMimeTypes()
  25 +    {
  26 +        return array('text/plain','text/csv');
  27 +    }
  28 +
  29 +    public function extractTextContent()
  30 +    {
  31 +        $content = file_get_contents($this->sourcefile);
  32 +        if (false === $content)
  33 +        {
  34 +            return false;
  35 +        }
  36 +
  37 +        $result = file_put_contents($this->targetfile, $this->filter($content));
  38 +
  39 +        return false !== $result;
  40 +    }
  41 +
  42 +    public function diagnose()
  43 +    {
  44 +        return null;
  45 +    }
  46 +}
  47 +
  48 +The filename is 'SomeExtractor.inc.php'.
  49 +
  50 +Note that the DocumentExtractor class has some attributes that can be referenced:
  51 +1) sourcefile - the source filename from which the text must be extracted
  52 +2) targetfile - the target filename where the text that is extracted should be saved.
  53 +
  54 +The class requires 4 methods:
  55 +1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users.
  56 +2) getSupportedMimeTypes() - tells the system what mime types the extractor supports.
  57 +3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile.
  58 +4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message.
  59 +
  60 +Writing an extractor based on a command line application
  61 +--------------------------------------------------------
  62 +
  63 +To illustrate how this can be done, the PDFExtractor is displayed:
  64 +
  65 +class PDFExtractor extends ApplicationExtractor
  66 +{
  67 +    public function __construct()
  68 +    {
  69 +        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}');
  70 +    }
  71 +
  72 +    public function getSupportedMimeTypes()
  73 +    {
  74 +        return array('application/pdf');
  75 +    }
  76 +}
  77 +
  78 +Note that the constructor takes the parameters:
  79 +
  80 +function __construct($section, $appname, $command, $displayname, $params)
  81 +
  82 +The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is
  83 +used by default. If you rely on $command, it should be accessible via the PATH environment variable.
  84 +
  85 +$displayname is the friendly name that will be displayed in the dashboard.
  86 +
  87 +Note that $params should contain {source} and {target} placeholders. These will be replaced by the system.
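The ApplicationExtractor source is not part of this commit, so the following is only a sketch of how the {source} and {target} placeholders in $params might be expanded into a command line. buildExtractorCommand is a hypothetical helper written for illustration, not a KnowledgeTree function, and the file paths are made up.

```php
<?php
// Hypothetical helper, NOT part of KnowledgeTree: shows one way the
// {source} and {target} placeholders in $params could be expanded.
function buildExtractorCommand($command, $params, $sourcefile, $targetfile)
{
    // Quote the filenames so that spaces and shell metacharacters are safe.
    $replacements = array(
        '{source}' => escapeshellarg($sourcefile),
        '{target}' => escapeshellarg($targetfile),
    );

    // strtr() replaces every placeholder key with its quoted filename.
    return $command . ' ' . strtr($params, $replacements);
}

// Example with the PDFExtractor parameters from the text (paths are made up).
echo buildExtractorCommand(
    'pdftotext',
    '-nopgbrk -enc UTF-8 {source} {target}',
    '/tmp/in.pdf',
    '/tmp/out.txt'
), "\n";
```

Quoting the filenames before substitution is a design choice of this sketch; whether the real ApplicationExtractor does the same is an assumption.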
search2/docs/userguide.txt 0 → 100644
  1 +SEARCH2 User Guide
  2 +==================
  3 +
  4 +TODO: put this on the wiki.
  5 +
  6 +The new search engine provides for more complicated search expressions than were possible in the past.
  7 +
  8 +Expression Language
  9 +-------------------
  10 +
  11 +The core of the search engine is the 'expression language'.
  12 +
  13 +Expressions may be built up using the following grammar:
  14 +expr ::= expr { AND | OR } expr
  15 +expr ::= NOT expr
  16 +expr ::= (expr)
  17 +expr ::= expr { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value
  18 +expr ::= field BETWEEN value AND value
  19 +expr ::= field DOES [ NOT ] CONTAIN value
  20 +expr ::= field IS [ NOT ] LIKE value
  21 +value ::= "search text here"
  22 +
  23 +A field may be one of the following:
  24 +CheckedOut , CheckedOutBy , CheckedoutDelta , Created , CreatedBy , CreatedDelta , DiscussionText , DocumentId ,
  25 +DocumentText , DocumentType , Filename , Filesize , Folder , GeneralText , IsCheckedOut , IsImmutable ,
  26 +Metadata , MimeType , Modified , ModifiedBy , ModifiedDelta , Tag , Title , Workflow ,
  27 +WorkflowID , WorkflowState , WorkflowStateID
  28 +
  29 +A 'field' may also refer to metadata using the following syntax:
  30 +["fieldset name"]["field name"]
  31 +
  32 +Note that 'values' must be contained within "double quotes".
  33 +
  34 +User Interface Features
  35 +-----------------------
  36 +
  37 +A) Quick Search widget
  38 +
  39 +This appears on the main navigation bar. Text entered into this widget will be searched according to two options:
  40 +1) metadata only
  41 +2) filename, title, metadata and document content
  42 +
  43 +B) Text Extractor Diagnostics Plugin
  44 +
  45 +This is available via the dashboard to the administrator.
  46 +The results may also be obtained by running the search2/indexing/bin/diagnose.php script.
  47 +
  48 +C) Search Portlet
  49 +
  50 +When browsing through the repository, the search portlet will be available to the right. It will provide a few extra options regarding search.
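A few expressions built from the grammar in the Expression Language section, as a rough illustration. They assume that a field name may appear on the left-hand side of a comparison and that dates are written in the default Y-m-d format. The search values, and the fieldset and field names in the last line, are made up.

```
Title CONTAINS "budget"
Filename ENDS WITH ".pdf" AND NOT (DocumentType = "Invoice")
Created BETWEEN "2007-01-01" AND "2007-06-30"
DocumentText CONTAINS "contract" OR DiscussionText CONTAINS "contract"
["Invoice Details"]["Customer"] = "Acme"
```

Every value is wrapped in double quotes, as the guide requires.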