Commit 4168a4b066964a57c0b41612308f82a972e3a5ac
1 parent 121a1f92
KTS-673
"The search algorithm needs some work"
Added. Basic documentation on the search

Committed By: Conrad Vermeulen
Reviewed By: Kevin Fourie

git-svn-id: https://kt-dms.svn.sourceforge.net/svnroot/kt-dms/trunk@7232 c91229c3-7414-0410-bfa2-8a42b809f60b
Showing 4 changed files with 289 additions and 0 deletions
search2/docs/adminguide.txt
0 → 100644
| 1 | +SEARCH2 Administrator Guide | |
| 2 | +=========================== | |
| 3 | + | |
| 4 | +TODO: put this on the wiki. | |
| 5 | + | |
| 6 | +Configuration | |
| 7 | +------------- | |
| 8 | + | |
| 9 | +[search] | |
| 10 | +; The number of results per page | |
| 11 | +; defaults to 25 | |
| 12 | +resultsPerPage = default | |
| 13 | + | |
| 14 | +; The date format used when making queries using widgets | |
| 15 | +; defaults to Y-m-d .... NOTE Future development | |
| 16 | +dateFormat = default | |
| 17 | + | |
| 18 | +[indexer] | |
| 19 | +; The core indexing class | |
| 20 | +coreClass=PHPLuceneIndexer | |
| 21 | + | |
| 22 | +; The number of documents to be indexed in a cron session | |
| 23 | +; defaults to 20 | |
| 24 | +batchDocuments = default | |
| 25 | + | |
| 26 | +; The location of the lucene indexes | |
| 27 | +luceneDirectory=${varDirectory}/indexes | |
| 28 | + | |
| 29 | +; The URL for the Java Lucene Server. This should match the Lucene Server configuration. | |
| 30 | +; Defaults to http://localhost:8875 | |
| 31 | +javaLuceneURL = default | |
| 32 | + | |
| 33 | +Setting up the Lucene Directory | |
| 34 | +------------------------------- | |
| 35 | + | |
| 36 | +If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. Some more information is available | |
| 37 | +in ktroot/bin/luceneserver/README.TXT | |
| 38 | + | |
| 39 | +Edit the config.ini and ensure that the 'javaLuceneURL' field is correct. | |
| 40 | + | |
| 41 | +If using the PHP Lucene Server, run the search2/indexing/bin/recreateIndex.php script. | |
| 42 | + | |
| 43 | +Migration | |
| 44 | +--------- | |
| 45 | + | |
| 46 | +Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes. | |
| 47 | +This is done using the search2/indexing/bin/migrate.php script. (This can be heavy on the system - care should be taken when running it.) | |
| 48 | + | |
| 49 | +Search Results Ranking | |
| 50 | +---------------------- | |
| 51 | + | |
| 52 | +Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the | |
| 53 | +relevance of search results according to your needs. | |
| 54 | + | |
| 55 | +Status | |
| 56 | +------ | |
| 57 | + | |
| 58 | +TODO: | |
| 59 | +The lucene indexers should provide some statistics and general information on the lucene index. A diagnostics | |
| 60 | +function should also be available to ensure that the correct version of each document is indexed, and possibly to reschedule indexing if there is a mismatch for | |
| 61 | +some reason. (This feature could be heavy on the system - care should be taken when implementing.) | |
| 62 | + | |
| 63 | +Background Tasks | |
| 64 | +---------------- | |
| 65 | +search2/indexing/bin/cronIndexer.php - task to batch index files. | |
| 66 | +search2/indexing/bin/optimise.php - task to optimise the lucene index. | |
| 67 | + | |
| 68 | +The indexing script should be run frequently - say every 5 minutes. The number of documents indexed per run is set by 'batchDocuments' in config.ini and | |
| 69 | +defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that indexing does not put | |
| 70 | +a serious load on the system. | |
| 71 | + | |
| 72 | +The lucene index must be optimised periodically to keep search performance good. This could be run once a day around midnight, or weekly, depending on how often | |
| 73 | +the index is updated. | |
| 74 | + | |
| 75 | +HOWTO - how to run a php script from the command line | |
| 76 | +----------------------------------------------------- | |
| 77 | + | |
| 78 | +php -Cq script.php | |
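The 'default' placeholder values in the [search] and [indexer] sections of adminguide.txt can be illustrated with a minimal sketch. It assumes a loader that swaps the word 'default' for the value documented in the comment above each setting. The function name resolveDefaults and the $documentedDefaults table are hypothetical and are not taken from the KnowledgeTree code base. luceneDirectory is left out because the guide does not describe how ${varDirectory} is expanded.

```php
<?php
// Defaults as documented in adminguide.txt. The lookup table and the
// function below are a hypothetical sketch, not KnowledgeTree's own loader.
$documentedDefaults = [
    'search'  => ['resultsPerPage' => '25', 'dateFormat' => 'Y-m-d'],
    'indexer' => ['batchDocuments' => '20', 'javaLuceneURL' => 'http://localhost:8875'],
];

// Sample config.ini fragment; batchDocuments is overridden, the rest use 'default'.
$ini = <<<'INI'
[search]
resultsPerPage = default
dateFormat = default

[indexer]
coreClass = PHPLuceneIndexer
batchDocuments = 10
javaLuceneURL = default
INI;

// Replace every 'default' value that has a documented default.
function resolveDefaults(array $config, array $defaults): array
{
    foreach ($config as $section => $settings) {
        foreach ($settings as $key => $value) {
            if ($value === 'default' && isset($defaults[$section][$key])) {
                $config[$section][$key] = $defaults[$section][$key];
            }
        }
    }
    return $config;
}

// INI_SCANNER_RAW keeps every value as a plain string.
$config = resolveDefaults(parse_ini_string($ini, true, INI_SCANNER_RAW), $documentedDefaults);

foreach ($config as $section => $settings) {
    foreach ($settings as $key => $value) {
        echo "[$section] $key = $value\n";
    }
}
```

Saved as a standalone file, the sketch can be run the same way as the scripts above, for example with php -Cq followed by the file name.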
search2/docs/architecture.txt
0 → 100644
| 1 | +SEARCH2 ARCHITECTURE | |
| 2 | +==================== | |
| 3 | + | |
| 4 | +TODO: put this on the wiki. | |
| 5 | + | |
| 6 | +Introduction | |
| 7 | +------------ | |
| 8 | + | |
| 9 | +Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible enough | |
| 10 | +to accommodate KnowledgeTree's metadata and document content. | |
| 11 | + | |
| 12 | +The previous search was implemented using mysql's full text indexes, but it was found to be rather limiting from the perspective | |
| 13 | +of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation. | |
| 14 | + | |
| 15 | +The complexity of integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source. | |
| 16 | + | |
| 17 | +KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The idea was to let the user write a single search | |
| 18 | +expression. The expression is evaluated to identify the subexpressions that should run on lucene and those | |
| 19 | +that should run on the metadata in the database, with the results finally being merged. | |
| 20 | + | |
| 21 | +New Database Requirements | |
| 22 | +------------------------- | |
| 23 | + | |
| 24 | +In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents | |
| 25 | +are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process | |
| 26 | +items in the 'pending' index queue. | |
| 27 | + | |
| 28 | +The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values | |
| 29 | +are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion | |
| 30 | + | |
| 31 | +The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields | |
| 32 | +and when results from the database and Lucene must be merged. | |
| 33 | + | |
| 34 | +The 'search_saved' table stores the expressions. The 'type' field describes what the saved search would be used for. The features will be used | |
| 35 | +in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription | |
| 36 | + | |
| 37 | +The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background. | |
| 38 | + | |
| 39 | +Folder Structure | |
| 40 | +---------------- | |
| 41 | + | |
| 42 | +The core search functionality is located in the ktroot/search2 folder. It contains an 'indexing' folder and a 'search' folder. | |
| 43 | +The 'indexing' folder contains the core functionality regarding indexing using Lucene - using the Java Lucene server or the PHP Lucene Server. | |
| 44 | +The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene | |
| 45 | +and the database, ranking and merging results. | |
| 46 | + | |
| 47 | +search2/indexing/bin - various scripts that can be run from the command line. | |
| 48 | +search2/indexing/extractors - text extractors used to extract text from various files. | |
| 49 | +search2/indexing/extractorHooks - hooking mechanisms around extraction process. | |
| 50 | +search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation. | |
| 51 | +search2/indexing/lib - libraries that may be required that are specific to Lucene. | |
| 52 | +search2/indexing/test - some basic test scripts. | |
| 53 | + | |
| 54 | + | |
| 55 | +search2/search - the primary location of search functionality. | |
| 56 | +search2/search/bin - various scripts that can be run from the command line. | |
| 57 | +search2/search/fields - the fields that can be used in expressions. | |
| 58 | +search2/search/test - some basic test scripts. | |
| 59 | + | |
| 60 | +bin/luceneserver - the location of the Java Lucene Server. | |
| 61 | + | |
| 62 | +Additional Search Requirements | |
| 63 | +------------------------------ | |
| 64 | + | |
| 65 | +The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project. | |
| 66 | +See http://pear.php.net/package/PHP_ParserGenerator for more details. | |
| 67 | + | |
| 68 | +Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated | |
| 69 | +into the ZendFramework. See http://framework.zend.com for more details. | |
| 70 | + | |
| 71 | +search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface to the PHP ZendFramework. | |
| 72 | + | |
| 73 | +search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server | |
| 74 | +must be running for this to work. | |
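The 'what' codes of the 'index_files' queue described in architecture.txt can be read as a small lookup. The sketch below only illustrates that mapping. The function name indexTargets and the labels 'content' and 'discussion' are hypothetical; only the three codes 'C', 'D' and 'A' come from the notes above.

```php
<?php
// Hypothetical helper: maps the 'what' code of an index_files row to the
// parts of a document that should be indexed. Only the codes themselves
// are taken from architecture.txt; everything else is illustrative.
function indexTargets(string $what): array
{
    $targets = [
        'C' => ['content'],
        'D' => ['discussion'],
        'A' => ['content', 'discussion'],
    ];
    if (!isset($targets[$what])) {
        throw new InvalidArgumentException("Unknown 'what' code: $what");
    }
    return $targets[$what];
}

// Print the meaning of each documented code.
foreach (['C', 'D', 'A'] as $code) {
    echo $code, ' => ', implode(', ', indexTargets($code)), "\n";
}
```

Rejecting unknown codes with an exception is a design choice of this sketch; a background task might prefer to log and skip such rows.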
search2/docs/extractors.txt
0 → 100644
| 1 | +SEARCH2 - HOWTO WRITE AN EXTRACTOR | |
| 2 | +================================== | |
| 3 | + | |
| 4 | +All extractors are located in the search2/indexing/extractors folder. | |
| 5 | + | |
| 6 | +Naming Convention | |
| 7 | +----------------- | |
| 8 | + | |
| 9 | +The extractor must be a class that descends from DocumentExtractor, and its name must end with 'Extractor'. The filename for the class | |
| 10 | +should have the same name as the class, but with the extension '.inc.php'. | |
| 11 | + | |
| 12 | +Example | |
| 13 | +------- | |
| 14 | + | |
| 15 | +The simplest extractor is the following: | |
| 16 | + | |
| 17 | +class SomeExtractor extends DocumentExtractor | |
| 18 | +{ | |
| 19 | +    public function getDisplayName() | |
| 20 | +    { | |
| 21 | +        return _kt('Some Extractor'); | |
| 22 | +    } | |
| 23 | + | |
| 24 | +    public function getSupportedMimeTypes() | |
| 25 | +    { | |
| 26 | +        return array('text/plain','text/csv'); | |
| 27 | +    } | |
| 28 | + | |
| 29 | +    public function extractTextContent() | |
| 30 | +    { | |
| 31 | +        $content = file_get_contents($this->sourcefile); | |
| 32 | +        if (false === $content) | |
| 33 | +        { | |
| 34 | +            return false; | |
| 35 | +        } | |
| 36 | + | |
| 37 | +        $result = file_put_contents($this->targetfile, $this->filter($content)); | |
| 38 | + | |
| 39 | +        return false !== $result; | |
| 40 | +    } | |
| 41 | + | |
| 42 | +    public function diagnose() | |
| 43 | +    { | |
| 44 | +        return null; | |
| 45 | +    } | |
| 46 | +} | |
| 47 | + | |
| 48 | +The filename is 'SomeExtractor.inc.php'. | |
| 49 | + | |
| 50 | +Note that the DocumentExtractor class has some attributes that can be referenced: | |
| 51 | +1) sourcefile - the source filename from which the text must be extracted | |
| 52 | +2) targetfile - the target filename where the text that is extracted should be saved. | |
| 53 | + | |
| 54 | +The class requires 4 methods: | |
| 55 | +1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users. | |
| 56 | +2) getSupportedMimeTypes() - tells the system what mime types the extractor supports. | |
| 57 | +3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile. | |
| 58 | +4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message. | |
| 59 | + | |
| 60 | +Writing an extractor based on a command line application | |
| 61 | +-------------------------------------------------------- | |
| 62 | + | |
| 63 | +To illustrate how this can be done, the PDFExtractor is shown below: | |
| 64 | + | |
| 65 | +class PDFExtractor extends ApplicationExtractor | |
| 66 | +{ | |
| 67 | +    public function __construct() | |
| 68 | +    { | |
| 69 | +        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}'); | |
| 70 | +    } | |
| 71 | + | |
| 72 | +    public function getSupportedMimeTypes() | |
| 73 | +    { | |
| 74 | +        return array('application/pdf'); | |
| 75 | +    } | |
| 76 | +} | |
| 77 | + | |
| 78 | +Note that the constructor takes the parameters: | |
| 79 | + | |
| 80 | +function __construct($section, $appname, $command, $displayname, $params) | |
| 81 | + | |
| 82 | +The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is | |
| 83 | +used by default. If you rely on $command, it should be accessible via the PATH environment variable. | |
| 84 | + | |
| 85 | +$displayname is the friendly name that will be displayed in the dashboard. | |
| 86 | + | |
| 87 | +Note that $params should contain {source} and {target} placeholders. These will be replaced by the system. | |
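The {source} and {target} placeholders in $params can be pictured with a short sketch of the substitution step. This is an assumption about how such a replacement might work, not the actual ApplicationExtractor code. The function name buildExtractorCommand and the file paths are hypothetical; the command and parameter string are the ones passed by PDFExtractor above.

```php
<?php
// Hypothetical sketch of {source}/{target} substitution. The real
// ApplicationExtractor may build and run its command differently.
function buildExtractorCommand(string $command, string $params, string $source, string $target): string
{
    // strtr() with an array replaces each placeholder once and does not
    // rescan the inserted text, so a path cannot trigger a second replacement.
    $args = strtr($params, [
        '{source}' => escapeshellarg($source),
        '{target}' => escapeshellarg($target),
    ]);
    return $command . ' ' . $args;
}

$cmd = buildExtractorCommand(
    'pdftotext',
    '-nopgbrk -enc UTF-8 {source} {target}',
    '/tmp/in.pdf',   // hypothetical source file
    '/tmp/out.txt'   // hypothetical target file
);

echo $cmd, "\n";
```

Quoting the paths with escapeshellarg() is a choice made in this sketch so that filenames containing spaces survive the shell; the exact quoting style it produces depends on the platform.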
search2/docs/userguide.txt
0 → 100644
| 1 | +SEARCH2 User Guide | |
| 2 | +================== | |
| 3 | + | |
| 4 | +TODO: put this on the wiki. | |
| 5 | + | |
| 6 | +The new search engine supports more complex search expressions than were possible in the past. | |
| 7 | + | |
| 8 | +Expression Language | |
| 9 | +------------------- | |
| 10 | + | |
| 11 | +The core of the search engine is the 'expression language'. | |
| 12 | + | |
| 13 | +Expressions may be built up using the following grammar: | |
| 14 | +expr ::= expr { AND | OR } expr | |
| 15 | +expr ::= NOT expr | |
| 16 | +expr ::= (expr) | |
| 17 | +expr ::= field { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value | |
| 18 | +expr ::= field BETWEEN value AND value | |
| 19 | +expr ::= field DOES [ NOT ] CONTAIN value | |
| 20 | +expr ::= field IS [ NOT ] LIKE value | |
| 21 | +value ::= "search text here" | |
| 22 | + | |
| 23 | +A field may be one of the following: | |
| 24 | +CheckedOut, CheckedOutBy, CheckedoutDelta, Created, CreatedBy, CreatedDelta, DiscussionText, DocumentId, | |
| 25 | +DocumentText, DocumentType, Filename, Filesize, Folder, GeneralText, IsCheckedOut, IsImmutable, | |
| 26 | +Metadata, MimeType, Modified, ModifiedBy, ModifiedDelta, Tag, Title, Workflow, | |
| 27 | +WorkflowID, WorkflowState, WorkflowStateID | |
| 28 | + | |
| 29 | +A 'field' may also refer to metadata using the following syntax: | |
| 30 | +["fieldset name"]["field name"] | |
| 31 | + | |
| 32 | +Note that 'values' must be contained within "double quotes". | |
| 33 | + | |
| 34 | +User Interface Features | |
| 35 | +----------------------- | |
| 36 | + | |
| 37 | +A) Quick Search widget | |
| 38 | + | |
| 39 | +This appears on the main navigation bar. Text entered into this widget is searched according to one of two options: | |
| 40 | +1) metadata only | |
| 41 | +2) filename, title, metadata and document content | |
| 42 | + | |
| 43 | +B) Text Extractor Diagnostics Plugin | |
| 44 | + | |
| 45 | +This is available via the dashboard to the administrator. | |
| 46 | +The results may also be obtained by running the search2/indexing/bin/diagnose.php script. | |
| 47 | + | |
| 48 | +C) Search Portlet | |
| 49 | + | |
| 50 | +When browsing through the repository, the search portlet is available on the right. It provides a few extra search options. | |
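The expression language in userguide.txt can be illustrated by composing a sample expression as text. The sketch assumes that a comparison takes a field on its left-hand side and a double-quoted value on its right. The helper names metadataField and comparison, and the sample values, are hypothetical. The code only builds a string and does not run a search.

```php
<?php
// Hypothetical helpers for composing expression strings that follow the
// grammar in userguide.txt. They only build text; they do not run a search.

// Build a metadata field reference: ["fieldset name"]["field name"]
function metadataField(string $fieldset, string $field): string
{
    return sprintf('["%s"]["%s"]', $fieldset, $field);
}

// Build a single comparison such as: Title CONTAINS "budget"
function comparison(string $field, string $operator, string $value): string
{
    // The guide does not say how to escape a double quote inside a value,
    // so this sketch refuses such values instead of guessing.
    if (strpos($value, '"') !== false) {
        throw new InvalidArgumentException('Value must not contain a double quote');
    }
    return sprintf('%s %s "%s"', $field, $operator, $value);
}

// Combine comparisons with parentheses, OR and AND.
$expr = '(' . comparison('Title', 'CONTAINS', 'budget')
      . ' OR ' . comparison('Filename', 'ENDS WITH', '.pdf') . ')'
      . ' AND ' . comparison(metadataField('fieldset name', 'field name'), '=', 'some value');

echo $expr, "\n";
```

The parentheses are written out explicitly because the guide does not state the relative precedence of AND and OR.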