diff --git a/search2/docs/adminguide.txt b/search2/docs/adminguide.txt
new file mode 100644
index 0000000..03a6160
--- /dev/null
+++ b/search2/docs/adminguide.txt
@@ -0,0 +1,78 @@
+SEARCH2 Administrator Guide
+===========================
+
+TODO: put this on the wiki.
+
+Configuration
+-------------
+
+[search]
+; The number of results per page
+; defaults to 25
+resultsPerPage = default
+
+; The date format used when making queries using widgets
+; defaults to Y-m-d .... NOTE Future development
+dateFormat = default
+
+[indexer]
+; The core indexing class
+coreClass=PHPLuceneIndexer
+
+; The number of documents to be indexed in a cron session
+; defaults to 20
+batchDocuments = default
+
+; The location of the Lucene indexes
+luceneDirectory=${varDirectory}/indexes
+
+; The URL of the Java Lucene Server. This should match the Lucene Server configuration.
+; Defaults to http://localhost:8875
+javaLuceneURL = default
+
+Setting up the Lucene Directory
+-------------------------------
+
+If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. More information is available
+in ktroot/bin/luceneserver/README.TXT
+
+Edit the config.ini and ensure that the 'javaLuceneURL' field is correct.
+
+If using the PHP Lucene Server, you need to run the search2/indexing/bin/recreateIndex.php script.
+
+Migration
+---------
+
+Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes.
+This is done using the search2/indexing/bin/migrate.php script. (This can be heavy on the system - care should be taken when running it.)
+
+Search Results Ranking
+----------------------
+
+Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the
+relevance of search results according to your needs.
+
+Status
+------
+
+TODO:
+The Lucene indexers should provide some statistics on the Lucene index. They should provide some general information on the index, but a diagnostics
+function should also be available to ensure that the correct versions of the documents are indexed and possibly reschedule indexing if there is a mismatch for
+some reason. (This feature could be heavy on the system - care should be taken when implementing it.)
+
+Background Tasks
+----------------
+search2/indexing/bin/cronIndexer.php - task to batch index files.
+search2/indexing/bin/optimise.php - task to optimise the Lucene index.
+
+The indexing script should be run frequently - say every 5 minutes. The number of documents indexed per run can be configured in config.ini (batchDocuments). This
+defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that the load does not
+impact the performance of the system.
+
+The Lucene index must be optimised periodically to keep search performance good. The optimise script could be run once a day around midnight, or weekly, depending on how often
+the index is updated.
+
+HOWTO - how to run a php script from the command line
+-----------------------------------------------------
+
+php -Cq script.php
diff --git a/search2/docs/architecture.txt b/search2/docs/architecture.txt
new file mode 100644
index 0000000..4233861
--- /dev/null
+++ b/search2/docs/architecture.txt
@@ -0,0 +1,74 @@
+SEARCH2 ARCHITECTURE
+====================
+
+TODO: put this on the wiki.
+
+Introduction
+------------
+
+Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible enough
+to accommodate KnowledgeTree's metadata and document content.
+
+The previous search was implemented using MySQL's full text indexes, but it was found to be rather limiting from the perspective
+of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation.
+
+The difficulty in integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source.
+
+KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The approach is to express a search as a single
+expression. The expression is evaluated to identify the subexpressions that should run on Lucene and those
+that should run on the metadata in the database, and the results are finally merged.
+
+New Database Requirements
+-------------------------
+
+In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents
+are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process
+items in the 'pending' index queue.
+
+The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values
+are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion
+
+The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields
+and when results from the database and Lucene must be merged.
+
+The 'search_saved' table stores the expressions. The 'type' field describes what the saved search would be used for. These features will be used
+in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription
+
+The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background.
+
+Folder Structure
+----------------
+
+The core search functionality is located in the ktroot/search2 folder. It comprises an 'indexing' folder and a 'search' folder.
+The 'indexing' folder contains the core functionality for indexing with Lucene, using either the Java Lucene Server or the PHP Lucene Server.
+The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene
+and the database, ranking and merging results.
+
+search2/indexing/bin - various scripts that can be run from the command line.
+search2/indexing/extractors - text extractors used to extract text from various files.
+search2/indexing/extractorHooks - hooking mechanisms around the extraction process.
+search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation.
+search2/indexing/lib - libraries that may be required that are specific to Lucene.
+search2/indexing/test - some basic test scripts.
+
+
+search2/search - the primary location of search functionality.
+search2/search/bin - various scripts that can be run from the command line.
+search2/search/fields - the fields that can be used in expressions.
+search2/search/test - some basic test scripts.
+
+bin/luceneserver - the location of the Java Lucene Server.
+
+Additional Search Requirements
+------------------------------
+
+The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project.
+See http://pear.php.net/package/PHP_ParserGenerator for more details.
+
+Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated
+into the Zend Framework. See http://framework.zend.com for more details.
+
+search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface with the PHP Zend Framework.
+
+search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server
+must be running for this to work.
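As an illustration of the ranking and merging step described in this architecture note, the sketch below combines hits from the database and from Lucene by summing per-field weights. It is only a sketch under stated assumptions: the weights, the document ids, the layout of the result sets and the mergeRanked() function are all hypothetical and are not KnowledgeTree's actual implementation. It also assumes the two subexpressions are combined with OR.

```php
<?php
// Hypothetical weights, standing in for rows of the 'search_ranking' table.
$weights = array('Title' => 3.0, 'DocumentText' => 1.0);

// Hypothetical result sets: document id => list of fields that matched.
$databaseHits = array(12 => array('Title'), 40 => array('Title'));
$luceneHits   = array(12 => array('DocumentText'), 77 => array('DocumentText'));

// Merge any number of result sets into one score per document id.
// Assumption: a document's score is the sum of the weights of its matched fields.
function mergeRanked(array $resultSets, array $weights)
{
    $scores = array();
    foreach ($resultSets as $hits) {
        foreach ($hits as $docId => $fields) {
            foreach ($fields as $field) {
                // Fields without a configured weight contribute nothing.
                $weight = isset($weights[$field]) ? $weights[$field] : 0.0;
                $current = isset($scores[$docId]) ? $scores[$docId] : 0.0;
                $scores[$docId] = $current + $weight;
            }
        }
    }
    arsort($scores); // highest score first, document ids kept as keys
    return $scores;
}

print_r(mergeRanked(array($databaseHits, $luceneHits), $weights));
```

With these inputs, document 12 ranks first because it matches in both sources (3.0 + 1.0), followed by 40 and then 77. An AND between the subexpressions would instead keep only the documents present in both result sets.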
diff --git a/search2/docs/extractors.txt b/search2/docs/extractors.txt
new file mode 100644
index 0000000..6c79d49
--- /dev/null
+++ b/search2/docs/extractors.txt
@@ -0,0 +1,87 @@
+SEARCH2 - HOWTO WRITE AN EXTRACTOR
+==================================
+
+All extractors are located in the search2/indexing/extractors folder.
+
+Naming Convention
+-----------------
+
+The extractor must be a class that extends DocumentExtractor, and its name must end with 'Extractor'. The file containing the class
+should have the same name as the class, but with the extension '.inc.php'.
+
+Example
+-------
+
+The simplest extractor is the following:
+
+class SomeExtractor extends DocumentExtractor
+{
+    public function getDisplayName()
+    {
+        return _kt('Some Extractor');
+    }
+
+    public function getSupportedMimeTypes()
+    {
+        return array('text/plain','text/csv');
+    }
+
+    public function extractTextContent()
+    {
+        $content = file_get_contents($this->sourcefile);
+        if (false === $content)
+        {
+            return false; // the source file could not be read
+        }
+
+        $result = file_put_contents($this->targetfile, $this->filter($content));
+
+        return false !== $result; // true if the extracted text was written
+    }
+
+    public function diagnose()
+    {
+        return null; // no problems to report
+    }
+}
+
+The filename is 'SomeExtractor.inc.php'.
+
+Note that the DocumentExtractor class has some attributes that can be referenced:
+1) sourcefile - the source filename from which the text must be extracted.
+2) targetfile - the target filename where the extracted text should be saved.
+
+The class requires 4 methods:
+1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users.
+2) getSupportedMimeTypes() - tells the system what mime types the extractor supports.
+3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile.
+4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message.
+
+Writing an extractor based on a command line application
+--------------------------------------------------------
+
+To illustrate how this can be done, the PDFExtractor is shown here:
+
+class PDFExtractor extends ApplicationExtractor
+{
+    public function __construct()
+    {
+        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}');
+    }
+
+    public function getSupportedMimeTypes()
+    {
+        return array('application/pdf');
+    }
+}
+
+Note that the constructor takes the parameters:
+
+function __construct($section, $appname, $command, $displayname, $params)
+
+The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is
+used by default. If you rely on $command, it should be accessible via the PATH environment variable.
+
+$displayname is the friendly name that will be displayed in the dashboard.
+
+Note that $params should contain {source} and {target} placeholders. These will be replaced by the system.
diff --git a/search2/docs/userguide.txt b/search2/docs/userguide.txt
new file mode 100644
index 0000000..a935526
--- /dev/null
+++ b/search2/docs/userguide.txt
@@ -0,0 +1,50 @@
+SEARCH2 User Guide
+==================
+
+TODO: put this on the wiki.
+
+The new search engine supports more complicated search expressions than were possible in the past.
+
+Expression Language
+-------------------
+
+The core of the search engine is the 'expression language'.
+
+Expressions may be built up using the following grammar:
+expr ::= expr { AND | OR } expr
+expr ::= NOT expr
+expr ::= (expr)
+expr ::= expr { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value
+expr ::= field BETWEEN value AND value
+expr ::= field DOES [ NOT ] CONTAIN value
+expr ::= field IS [ NOT ] LIKE value
+value ::= "search text here"
+
+A field may be one of the following:
+CheckedOut, CheckedOutBy, CheckedoutDelta, Created, CreatedBy, CreatedDelta, DiscussionText, DocumentId,
+DocumentText, DocumentType, Filename, Filesize, Folder, GeneralText, IsCheckedOut, IsImmutable,
+Metadata, MimeType, Modified, ModifiedBy, ModifiedDelta, Tag, Title, Workflow,
+WorkflowID, WorkflowState, WorkflowStateID
+
+A 'field' may also refer to metadata using the following syntax:
+["fieldset name"]["field name"]
+
+Note that 'values' must be contained within "double quotes".
+
+User Interface Features
+-----------------------
+
+A) Quick Search widget
+
+This appears on the main navigation bar. Text entered into this widget is searched using one of two options:
+1) metadata only
+2) filename, title, metadata and document content
+
+B) Text Extractor Diagnostics Plugin
+
+This is available via the dashboard to the administrator.
+The results may also be obtained by running the search2/indexing/bin/diagnose.php script.
+
+C) Search Portlet
+
+When browsing through the repository, the search portlet is available on the right. It provides a few extra search options.
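To make the expression grammar from the Expression Language section more concrete, the sketch below builds a few expression strings that follow it. The helper functions quoteValue() and metadataField() are hypothetical and are not part of KnowledgeTree. The 'Invoice'/'Customer' fieldset is made up, and the date values assume the Y-m-d format that the administrator guide gives as the default.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of KnowledgeTree.

// Values must be wrapped in double quotes. The guide does not say how a quote
// inside a value should be escaped, so this sketch simply rejects such values.
function quoteValue($text)
{
    if (strpos($text, '"') !== false) {
        throw new InvalidArgumentException('value must not contain a double quote');
    }
    return '"' . $text . '"';
}

// Metadata fields use the ["fieldset name"]["field name"] syntax.
function metadataField($fieldset, $field)
{
    return '[' . quoteValue($fieldset) . '][' . quoteValue($field) . ']';
}

$examples = array(
    // field operator value
    'Title CONTAINS ' . quoteValue('budget'),
    // AND, NOT and parentheses
    '(Filename ENDS WITH ' . quoteValue('.pdf') . ') AND NOT (DocumentText CONTAINS '
        . quoteValue('draft') . ')',
    // BETWEEN, assuming Y-m-d dates
    'Created BETWEEN ' . quoteValue('2007-01-01') . ' AND ' . quoteValue('2007-12-31'),
    // metadata field from a made-up fieldset
    metadataField('Invoice', 'Customer') . ' = ' . quoteValue('Acme'),
);

foreach ($examples as $expression) {
    echo $expression, "\n";
}
```

The script prints one expression per line, for example Title CONTAINS "budget" and ["Invoice"]["Customer"] = "Acme". Expressions of this shape are what the grammar describes; whether a particular field accepts a particular operator is not covered by this guide.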