Commit 4168a4b066964a57c0b41612308f82a972e3a5ac
1 parent
121a1f92
KTS-673
"The search algorithm needs some work"
Added. Basic documentation on the search
Committed By: Conrad Vermeulen
Reviewed By: Kevin Fourie
git-svn-id: https://kt-dms.svn.sourceforge.net/svnroot/kt-dms/trunk@7232 c91229c3-7414-0410-bfa2-8a42b809f60b
Showing 4 changed files with 289 additions and 0 deletions
search2/docs/adminguide.txt
0 → 100644
| 1 | +SEARCH2 Administrator Guide | ||
| 2 | +=========================== | ||
| 3 | + | ||
| 4 | +TODO: put this on the wiki. | ||
| 5 | + | ||
| 6 | +Configuration | ||
| 7 | +------------- | ||
| 8 | + | ||
| 9 | +[search] | ||
| 10 | +; The number of results per page | ||
| 11 | +; defaults to 25 | ||
| 12 | +resultsPerPage = default | ||
| 13 | + | ||
| 14 | +; The date format used when making queries using widgets | ||
| 15 | +; defaults to Y-m-d .... NOTE Future development | ||
| 16 | +dateFormat = default | ||
| 17 | + | ||
| 18 | +[indexer] | ||
| 19 | +; The core indexing class | ||
| 20 | +coreClass=PHPLuceneIndexer | ||
| 21 | + | ||
| 22 | +; The number of documents to be indexed in a cron session | ||
| 23 | +; defaults to 20 | ||
| 24 | +batchDocuments = default | ||
| 25 | + | ||
| 26 | +; The location of the lucene indexes | ||
| 27 | +luceneDirectory=${varDirectory}/indexes | ||
| 28 | + | ||
| 29 | +; The URL for the Java Lucene Server. This should match the Lucene Server configuration. | ||
| 30 | +; Defaults to http://localhost:8875 | ||
| 31 | +javaLuceneURL = default | ||
| 32 | + | ||
| 33 | +Setting up the Lucene Directory | ||
| 34 | +------------------------------- | ||
| 35 | + | ||
| 36 | +If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. Some more information is available | ||
| 37 | +in ktroot/bin/luceneserver/README.TXT | ||
| 38 | + | ||
| 39 | +Edit the config.ini and ensure that the 'javaLuceneURL' field is correct. | ||
| 40 | + | ||
| 41 | +If using the PHP Lucene Server, you need to run the search2/indexing/bin/recreateIndex.php script. | ||
| 42 | + | ||
| 43 | +Migration | ||
| 44 | +--------- | ||
| 45 | + | ||
| 46 | +Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes. | ||
| 47 | +This is done using the search2/indexing/bin/migrate.php script. (This can be heavy on the system - care should be taken when running it.) | ||
| 48 | + | ||
| 49 | +Search Results Ranking | ||
| 50 | +---------------------- | ||
| 51 | + | ||
| 52 | +Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the | ||
| 53 | +relevance of search results according to your needs. | ||
| 54 | + | ||
| 55 | +Status | ||
| 56 | +------ | ||
| 57 | + | ||
| 58 | +TODO: | ||
| 59 | +The Lucene indexers should provide some statistics and general information on the Lucene index. A diagnostics | ||
| 60 | +function should also be available to ensure that the correct versions of the documents are indexed, and possibly to reschedule indexing if there is a mismatch for | ||
| 61 | +some reason. (This feature could be heavy on the system - care should be taken when implementing it.) | ||
| 62 | + | ||
| 63 | +Background Tasks | ||
| 64 | +---------------- | ||
| 65 | +search2/indexing/bin/cronIndexer.php - task to batch index files. | ||
| 66 | +search2/indexing/bin/optimise.php - task to optimise the lucene index. | ||
| 67 | + | ||
| 68 | +The indexing script should be run frequently - say every 5 minutes. The number of documents indexed per run can be configured in config.ini ('batchDocuments'). This | ||
| 69 | +defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that the load does not | ||
| 70 | +impact the performance of the system. | ||
| 71 | + | ||
| 72 | +The Lucene index requires periodic optimisation to keep search performance optimal. The optimise script could be run once a day around midnight, or weekly, depending on the frequency | ||
| 73 | +of updates to the index. | ||
| 74 | + | ||
| 75 | +HOWTO - how to run a php script from the command line | ||
| 76 | +----------------------------------------------------- | ||
| 77 | + | ||
| 78 | +php -Cq script.php |
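Note on adminguide.txt: the configuration section sets several values to the literal `default` and lists the documented defaults in comments. A minimal sketch of how such values could be resolved is shown below. The helper `resolveSetting()` is hypothetical and not part of search2, and the `batchDocuments = 10` override is only an illustration; how KnowledgeTree itself resolves `default` is not stated in the guide.

```php
<?php
// Sketch only: resolveSetting() is a hypothetical helper, not part of search2.
// It maps the literal value 'default' to the default documented in adminguide.txt.
function resolveSetting(array $config, string $section, string $key, $documentedDefault)
{
    $value = $config[$section][$key] ?? 'default';
    return ($value === 'default') ? $documentedDefault : $value;
}

// A cut-down config.ini fragment; batchDocuments is overridden here for illustration.
$ini = <<<INI
[search]
resultsPerPage = default

[indexer]
batchDocuments = 10
javaLuceneURL = default
INI;

// The second argument asks for one sub-array per [section].
$config = parse_ini_string($ini, true);

$resultsPerPage = (int) resolveSetting($config, 'search', 'resultsPerPage', 25);
$batchDocuments = (int) resolveSetting($config, 'indexer', 'batchDocuments', 20);
$javaLuceneURL  = resolveSetting($config, 'indexer', 'javaLuceneURL', 'http://localhost:8875');

echo "$resultsPerPage $batchDocuments $javaLuceneURL\n";
// prints "25 10 http://localhost:8875"
```

Lowering `batchDocuments` in this way is the adjustment the guide suggests when the indexing script is scheduled to run more often.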
search2/docs/architecture.txt
0 → 100644
| 1 | +SEARCH2 ARCHITECTURE | ||
| 2 | +==================== | ||
| 3 | + | ||
| 4 | +TODO: put this on the wiki. | ||
| 5 | + | ||
| 6 | +Introduction | ||
| 7 | +------------ | ||
| 8 | + | ||
| 9 | +Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible enough | ||
| 10 | +to accommodate KnowledgeTree's metadata and document content. | ||
| 11 | + | ||
| 12 | +The previous search was implemented using MySQL's full text indexes, but it was found to be rather limiting from the perspective | ||
| 13 | +of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation. | ||
| 14 | + | ||
| 15 | +The complexity of integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source. | ||
| 16 | + | ||
| 17 | +KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The idea was to provide a way to build a single | ||
| 18 | +search expression. The expression is evaluated to identify the subexpressions that should run on Lucene and those | ||
| 19 | +that should run on the metadata in the database, with the results finally being merged. | ||
| 20 | + | ||
| 21 | +New Database Requirements | ||
| 22 | +------------------------- | ||
| 23 | + | ||
| 24 | +In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents | ||
| 25 | +are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process | ||
| 26 | +items in the 'pending' index queue. | ||
| 27 | + | ||
| 28 | +The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values | ||
| 29 | +are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion | ||
| 30 | + | ||
| 31 | +The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields | ||
| 32 | +and when results from the database and Lucene must be merged. | ||
| 33 | + | ||
| 34 | +The 'search_saved' table stores the expressions. The 'type' field describes what the saved search is used for. These features will be used | ||
| 35 | +in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription | ||
| 36 | + | ||
| 37 | +The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background. | ||
| 38 | + | ||
| 39 | +Folder Structure | ||
| 40 | +---------------- | ||
| 41 | + | ||
| 42 | +The core search functionality is located in the ktroot/search2 folder. This comprises an 'indexing' folder and a 'search' folder. | ||
| 43 | +The 'indexing' folder contains the core functionality regarding indexing using Lucene - using the Java Lucene server or the PHP Lucene Server. | ||
| 44 | +The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene | ||
| 45 | +and the database, ranking and merging results. | ||
| 46 | + | ||
| 47 | +search2/indexing/bin - various scripts that can be run from the command line. | ||
| 48 | +search2/indexing/extractors - text extractors used to extract text from various files. | ||
| 49 | +search2/indexing/extractorHooks - hooking mechanisms around extraction process. | ||
| 50 | +search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation. | ||
| 51 | +search2/indexing/lib - libraries that may be required that are specific to Lucene. | ||
| 52 | +search2/indexing/test - some basic test scripts. | ||
| 53 | + | ||
| 54 | + | ||
| 55 | +search2/search - the primary location of search functionality. | ||
| 56 | +search2/search/bin - various scripts that can be run from the command line. | ||
| 57 | +search2/search/fields - the fields that can be used in expressions. | ||
| 58 | +search2/search/test - some basic test scripts. | ||
| 59 | + | ||
| 60 | +bin/luceneserver - the location of the Java Lucene Server. | ||
| 61 | + | ||
| 62 | +Additional Search Requirements | ||
| 63 | +------------------------------ | ||
| 64 | + | ||
| 65 | +The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project. | ||
| 66 | +See http://pear.php.net/package/PHP_ParserGenerator for more details. | ||
| 67 | + | ||
| 68 | +Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated | ||
| 69 | +into the ZendFramework. See http://framework.zend.com for more details. | ||
| 70 | + | ||
| 71 | +search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface to the PHP ZendFramework. | ||
| 72 | + | ||
| 73 | +search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server | ||
| 74 | +must be running for this to work. |
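Note on architecture.txt: the 'pending' index queue and its 'what' field can be illustrated with a small sketch. This is not the search2 implementation. The queue below is an in-memory array standing in for rows of the 'index_files' table, `document_id` is a hypothetical column name, and `targetsFor()` and `nextBatch()` are hypothetical helpers. Only the 'C', 'D' and 'A' codes come from the text.

```php
<?php
// Sketch only: the queue is an in-memory array standing in for rows of 'index_files'.
// Only the 'what' column is documented; 'document_id' is a hypothetical column name.
function targetsFor(string $what): array
{
    switch ($what) {
        case 'C': return array('content');
        case 'D': return array('discussion');
        case 'A': return array('content', 'discussion');
    }
    throw new InvalidArgumentException("Unknown 'what' value: $what");
}

// Take at most $batchSize pending entries, as a single background run would.
function nextBatch(array $pending, int $batchSize): array
{
    return array_slice($pending, 0, $batchSize);
}

$pending = array(
    array('document_id' => 1, 'what' => 'C'),
    array('document_id' => 2, 'what' => 'A'),
    array('document_id' => 3, 'what' => 'D'),
);

foreach (nextBatch($pending, 2) as $row) {
    echo $row['document_id'], ': ', implode(' + ', targetsFor($row['what'])), "\n";
}
// prints:
// 1: content
// 2: content + discussion
```

The third entry stays in the queue and would be picked up by the next run.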
search2/docs/extractors.txt
0 → 100644
| 1 | +SEARCH2 - HOWTO WRITE AN EXTRACTOR | ||
| 2 | +================================== | ||
| 3 | + | ||
| 4 | +All extractors are located in the search2/indexing/extractors folder. | ||
| 5 | + | ||
| 6 | +Naming Convention | ||
| 7 | +----------------- | ||
| 8 | + | ||
| 9 | +The extractor must be a class descended from DocumentExtractor, and its name must end with 'Extractor'. The filename for the class | ||
| 10 | +should be the class name followed by the extension '.inc.php'. | ||
| 11 | + | ||
| 12 | +Example | ||
| 13 | +------- | ||
| 14 | + | ||
| 15 | +The simplest extractor is the following: | ||
| 16 | + | ||
| 17 | +class SomeExtractor extends DocumentExtractor | ||
| 18 | +{ | ||
| 19 | +    public function getDisplayName() | ||
| 20 | +    { | ||
| 21 | +        return _kt('Some Extractor'); | ||
| 22 | +    } | ||
| 23 | + | ||
| 24 | +    public function getSupportedMimeTypes() | ||
| 25 | +    { | ||
| 26 | +        return array('text/plain','text/csv'); | ||
| 27 | +    } | ||
| 28 | + | ||
| 29 | +    public function extractTextContent() | ||
| 30 | +    { | ||
| 31 | +        $content = file_get_contents($this->sourcefile); | ||
| 32 | +        if (false === $content) | ||
| 33 | +        { | ||
| 34 | +            return false; | ||
| 35 | +        } | ||
| 36 | + | ||
| 37 | +        $result = file_put_contents($this->targetfile, $this->filter($content)); | ||
| 38 | + | ||
| 39 | +        return false !== $result; | ||
| 40 | +    } | ||
| 41 | + | ||
| 42 | +    public function diagnose() | ||
| 43 | +    { | ||
| 44 | +        return null; | ||
| 45 | +    } | ||
| 46 | +} | ||
| 47 | + | ||
| 48 | +The filename is 'SomeExtractor.inc.php'. | ||
| 49 | + | ||
| 50 | +Note that the DocumentExtractor class has some attributes that can be referenced: | ||
| 51 | +1) sourcefile - the source filename from which the text must be extracted | ||
| 52 | +2) targetfile - the target filename where the text that is extracted should be saved. | ||
| 53 | + | ||
| 54 | +The class requires 4 methods: | ||
| 55 | +1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users. | ||
| 56 | +2) getSupportedMimeTypes() - tells the system what mime types the extractor supports. | ||
| 57 | +3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile. | ||
| 58 | +4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message. | ||
| 59 | + | ||
| 60 | +Writing an extractor based on a command line application | ||
| 61 | +-------------------------------------------------------- | ||
| 62 | + | ||
| 63 | +To illustrate how this can be done, the PDFExtractor is displayed: | ||
| 64 | + | ||
| 65 | +class PDFExtractor extends ApplicationExtractor | ||
| 66 | +{ | ||
| 67 | +    public function __construct() | ||
| 68 | +    { | ||
| 69 | +        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}'); | ||
| 70 | +    } | ||
| 71 | + | ||
| 72 | +    public function getSupportedMimeTypes() | ||
| 73 | +    { | ||
| 74 | +        return array('application/pdf'); | ||
| 75 | +    } | ||
| 76 | +} | ||
| 77 | + | ||
| 78 | +Note that the constructor takes the parameters: | ||
| 79 | + | ||
| 80 | +function __construct($section, $appname, $command, $displayname, $params) | ||
| 81 | + | ||
| 82 | +The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is | ||
| 83 | +used by default. If you rely on $command, it should be accessible via the PATH environment variable. | ||
| 84 | + | ||
| 85 | +$displayname is the friendly name that will be displayed in the dashboard. | ||
| 86 | + | ||
| 87 | +Note that $params should contain {source} and {target} placeholders. These will be replaced by the system. |
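Note on extractors.txt: the text says the system replaces the {source} and {target} placeholders in $params, but does not show how. The sketch below is one way this could work, under stated assumptions: `buildExtractorCommand()` is a hypothetical helper, not ApplicationExtractor's actual code, and the shell quoting via `escapeshellarg()` is an added precaution that the source does not mention.

```php
<?php
// Sketch only: buildExtractorCommand() is a hypothetical helper and does not
// reproduce how ApplicationExtractor actually builds its command line.
function buildExtractorCommand(string $command, string $params, string $source, string $target): string
{
    // strtr() replaces both placeholders in a single pass, so a filename that
    // happens to contain the text "{target}" is not substituted a second time.
    $args = strtr($params, array(
        '{source}' => escapeshellarg($source),
        '{target}' => escapeshellarg($target),
    ));

    return $command . ' ' . $args;
}

$cmd = buildExtractorCommand(
    'pdftotext',
    '-nopgbrk -enc UTF-8 {source} {target}',
    '/tmp/in.pdf',
    '/tmp/out.txt'
);

echo $cmd, "\n";
```

The exact quoting in the printed command depends on the platform, since `escapeshellarg()` quotes differently on Windows and on Unix-like systems.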
search2/docs/userguide.txt
0 → 100644
| 1 | +SEARCH2 User Guide | ||
| 2 | +================== | ||
| 3 | + | ||
| 4 | +TODO: put this on the wiki. | ||
| 5 | + | ||
| 6 | +The new search engine provides for more complicated search expressions than were possible in the past. | ||
| 7 | + | ||
| 8 | +Expression Language | ||
| 9 | +------------------- | ||
| 10 | + | ||
| 11 | +The core of the search engine is the 'expression language'. | ||
| 12 | + | ||
| 13 | +Expressions may be built up using the following grammar: | ||
| 14 | +expr ::= expr { AND | OR } expr | ||
| 15 | +expr ::= NOT expr | ||
| 16 | +expr ::= (expr) | ||
| 17 | +expr ::= expr { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value | ||
| 18 | +expr ::= field BETWEEN value AND value | ||
| 19 | +expr ::= field DOES [ NOT ] CONTAIN value | ||
| 20 | +expr ::= field IS [ NOT ] LIKE value | ||
| 21 | +value ::= "search text here" | ||
| 22 | + | ||
| 23 | +A field may be one of the following: | ||
| 24 | +CheckedOut , CheckedOutBy , CheckedoutDelta , Created , CreatedBy , CreatedDelta , DiscussionText , DocumentId , | ||
| 25 | +DocumentText , DocumentType , Filename , Filesize , Folder , GeneralText , IsCheckedOut , IsImmutable , | ||
| 26 | +Metadata , MimeType , Modified , ModifiedBy , ModifiedDelta , Tag , Title , Workflow , | ||
| 27 | +WorkflowID , WorkflowState , WorkflowStateID | ||
| 28 | + | ||
| 29 | +A 'field' may also refer to metadata using the following syntax: | ||
| 30 | +["fieldset name"]["field name"] | ||
| 31 | + | ||
| 32 | +Note that 'values' must be contained within "double quotes". | ||
| 33 | + | ||
| 34 | +User Interface Features | ||
| 35 | +----------------------- | ||
| 36 | + | ||
| 37 | +A) Quick Search widget | ||
| 38 | + | ||
| 39 | +This appears on the main navigation bar. Text entered into this widget will be searched according to two options: | ||
| 40 | +1) metadata only | ||
| 41 | +2) filename, title, metadata and document content | ||
| 42 | + | ||
| 43 | +B) Text Extractor Diagnostics Plugin | ||
| 44 | + | ||
| 45 | +This is available via the dashboard to the administrator. | ||
| 46 | +The results may also be obtained by running the search2/indexing/bin/diagnose.php script. | ||
| 47 | + | ||
| 48 | +C) Search Portlet | ||
| 49 | + | ||
| 50 | +When browsing through the repository, the search portlet will be available to the right. It will provide a few extra options regarding search. |
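Note on userguide.txt: the grammar and the metadata field syntax can be made concrete by assembling one expression as a string. The sketch below is illustrative only. `quoteValue()` and `metadataField()` are hypothetical helpers, the fieldset "Invoice" and field "Customer" are invented names, and the date values assume the Y-m-d format mentioned in the administrator guide. The guide does not say how a double quote inside a value should be escaped, so the sketch rejects such values rather than guessing.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of search2.
function quoteValue(string $text): string
{
    // The user guide requires values in "double quotes" but does not document
    // how to escape an embedded quote, so such values are rejected here.
    if (strpos($text, '"') !== false) {
        throw new InvalidArgumentException('Embedded double quotes are not supported in this sketch');
    }
    return '"' . $text . '"';
}

// Builds the ["fieldset name"]["field name"] form used for metadata fields.
function metadataField(string $fieldset, string $field): string
{
    return '[' . quoteValue($fieldset) . '][' . quoteValue($field) . ']';
}

$expr = 'Title CONTAINS ' . quoteValue('budget')
      . ' AND (Created BETWEEN ' . quoteValue('2007-01-01') . ' AND ' . quoteValue('2007-12-31') . ')'
      . ' AND ' . metadataField('Invoice', 'Customer') . ' = ' . quoteValue('Acme');

echo $expr, "\n";
// prints: Title CONTAINS "budget" AND (Created BETWEEN "2007-01-01" AND "2007-12-31") AND ["Invoice"]["Customer"] = "Acme"
```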