Peter M. Groen / knowledgetree

Commit 4168a4b066964a57c0b41612308f82a972e3a5ac

Authored by conradverm 2007-09-27 10:02:34 +0000

KTS-673

"The search algorithm needs some work"
Added. Basic documentation on the search

Committed By: Conrad Vermeulen
Reviewed By: Kevin Fourie

git-svn-id: https://kt-dms.svn.sourceforge.net/svnroot/kt-dms/trunk@7232 c91229c3-7414-0410-bfa2-8a42b809f60b


Showing 4 changed files with 289 additions and 0 deletions

search2/docs/adminguide.txt 0 → 100644


	1	+SEARCH2 Administrator Guide
	2	+===========================
	3	+
	4	+TODO: put this on the wiki.
	5	+
	6	+Configuration
	7	+-------------
	8	+
	9	+[search]
	10	+; The number of results per page
	11	+; defaults to 25
	12	+resultsPerPage = default
	13	+
	14	+; The date format used when making queries using widgets
	15	+; defaults to Y-m-d (NOTE: planned for future development)
	16	+dateFormat = default
	17	+
	18	+[indexer]
	19	+; The core indexing class
	20	+coreClass=PHPLuceneIndexer
	21	+
	22	+; The number of documents to be indexed in a cron session
	23	+; defaults to 20
	24	+batchDocuments = default
	25	+
	26	+; The location of the lucene indexes
	27	+luceneDirectory=${varDirectory}/indexes
	28	+
	29	+; The URL for the Java Lucene Server. This should match the Lucene Server configuration.
	30	+; Defaults to http://localhost:8875
	31	+javaLuceneURL = default
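
As an illustration of how the `default` placeholders above could be resolved, here is a minimal sketch. It assumes only the defaults stated in the comments of the guide (25, Y-m-d, 20, http://localhost:8875). The function name `resolveSearchDefaults` is hypothetical and this is not KnowledgeTree's own config loader.

```php
<?php
// Hypothetical helper: replace 'default' (or a missing key) with the
// defaults documented in the administrator guide.
function resolveSearchDefaults(array $config): array
{
    $defaults = [
        'search'  => ['resultsPerPage' => '25', 'dateFormat' => 'Y-m-d'],
        'indexer' => ['batchDocuments' => '20', 'javaLuceneURL' => 'http://localhost:8875'],
    ];
    foreach ($defaults as $section => $keys) {
        foreach ($keys as $key => $fallback) {
            $value = $config[$section][$key] ?? 'default';
            if ($value === 'default') {
                $config[$section][$key] = $fallback;
            }
        }
    }
    return $config;
}

// A trimmed-down config.ini fragment; batchDocuments is overridden.
$ini = <<<'INI'
[search]
resultsPerPage = default
dateFormat = default

[indexer]
coreClass = PHPLuceneIndexer
batchDocuments = 10
javaLuceneURL = default
INI;

// INI_SCANNER_RAW keeps every value as a plain string.
$config = resolveSearchDefaults(parse_ini_string($ini, true, INI_SCANNER_RAW));
```

Raw scanning is used so that values such as `${varDirectory}/indexes` would be left untouched rather than interpolated by PHP's ini parser.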
	32	+
	33	+Setting up the Lucene Directory
	34	+-------------------------------
	35	+
	36	+If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. Some more information is available
	37	+in ktroot/bin/luceneserver/README.TXT
	38	+
	39	+Edit the config.ini and ensure that the 'javaLuceneURL' field is correct.
	40	+
	41	+If using the PHP Lucene Server, you need to run the search2/indexing/bin/recreateIndex.php script.
	42	+
	43	+Migration
	44	+---------
	45	+
	46	+Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes.
	47	+This is done using the search2/indexing/bin/migrate.php script. (This can be heavy on the system, so care should be taken when running it.)
	48	+
	49	+Search Results Ranking
	50	+----------------------
	51	+
	52	+Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the
	53	+relevance of search results according to your needs.
	54	+
	55	+Status
	56	+------
	57	+
	58	+TODO:
	59	+The Lucene indexers should provide some statistics and general information on the Lucene index. A diagnostics
	60	+function should also be available to ensure that the correct versions of the documents are indexed, and possibly to reschedule indexing if there is a mismatch for
	61	+some reason. (This feature could be heavy on the system, so care should be taken when implementing it.)
	62	+
	63	+Background Tasks
	64	+----------------
	65	+search2/indexing/bin/cronIndexer.php - task to batch index files.
	66	+search2/indexing/bin/optimise.php - task to optimise the lucene index.
	67	+
	68	+The indexing script should be run frequently, say every 5 minutes. The config.ini allows the number of documents indexed per run to be configured. This
	69	+defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that the load does not
	70	+impact the performance of the system.
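
The trade-off between batch size and run frequency can be made concrete with a little arithmetic. The sketch below is only an upper bound: it assumes every run finishes within its interval and indexes a full batch. The function and parameter names are hypothetical.

```php
<?php
// Upper bound on documents indexed per hour for a given schedule.
// $batchDocuments mirrors the batchDocuments setting; $intervalMinutes is
// the assumed number of minutes between runs of cronIndexer.php.
function maxDocumentsPerHour(int $batchDocuments, int $intervalMinutes): int
{
    if ($batchDocuments < 0 || $intervalMinutes < 1) {
        throw new InvalidArgumentException('Batch must be >= 0 and interval >= 1 minute');
    }
    $runsPerHour = intdiv(60, $intervalMinutes);
    return $runsPerHour * $batchDocuments;
}

$defaultRate = maxDocumentsPerHour(20, 5); // 12 runs of 20 documents = 240
$fasterRate  = maxDocumentsPerHour(10, 1); // 60 runs of 10 documents = 600
```

Halving the batch while running every minute still raises the hourly ceiling, but spreads the work into smaller bursts.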
	71	+
	72	+The lucene index requires optimisation to ensure that performance is optimal. This could be run once a day around midnight, or weekly depending on frequency
	73	+of updates to the index.
	74	+
	75	+HOWTO - how to run a php script from the command line
	76	+-----------------------------------------------------
	77	+
	78	+php -Cq script.php

search2/docs/architecture.txt 0 → 100644


	1	+SEARCH2 ARCHITECTURE
	2	+====================
	3	+
	4	+TODO: put this on the wiki.
	5	+
	6	+Introduction
	7	+------------
	8	+
	9	+Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible enough
	10	+to accommodate KnowledgeTree's metadata and document content.
	11	+
	12	+The previous search was implemented using MySQL's full text indexes, but it was found to be rather limiting from the perspective
	13	+of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation.
	14	+
	15	+The complexity of integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source.
	16	+
	17	+KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The idea was to let the user build a single
	18	+expression. The expression is evaluated to identify the subexpressions that should run on Lucene and those
	19	+that should run on the metadata in the database, with the results finally being merged.
	20	+
	21	+New Database Requirements
	22	+-------------------------
	23	+
	24	+In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents
	25	+are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process
	26	+items in the 'pending' index queue.
	27	+
	28	+The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values
	29	+are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion
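
The 'what' codes above map naturally onto a lookup table. The following sketch uses only the three documented codes. The constant and function names are hypothetical and are not taken from the KnowledgeTree source.

```php
<?php
// Codes as documented for the 'what' field of the 'index_files' table.
const INDEX_WHAT = [
    'C' => ['content'],
    'D' => ['discussion'],
    'A' => ['content', 'discussion'],
];

// Hypothetical helper: list the parts a queue entry asks to have indexed.
function partsToIndex(string $what): array
{
    if (!array_key_exists($what, INDEX_WHAT)) {
        throw new InvalidArgumentException("Unknown 'what' code: $what");
    }
    return INDEX_WHAT[$what];
}

$parts = partsToIndex('A'); // both content and discussion
```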
	30	+
	31	+The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields
	32	+and when results from the database and Lucene must be merged.
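
To make the merge step more tangible, here is one possible way to combine weighted scores from the two sources. It is a sketch under stated assumptions: results are combined as a union (as for an OR), scores for the same document are simply added, and the weights from 'search_ranking' have already been applied. The actual search2 merge logic may differ, and `mergeRankedResults` is a hypothetical name.

```php
<?php
// Hypothetical merge: each input maps document id => weighted score.
function mergeRankedResults(array $databaseHits, array $luceneHits): array
{
    $merged = $databaseHits;
    foreach ($luceneHits as $documentId => $score) {
        // A document found by both sources accumulates both scores.
        $merged[$documentId] = ($merged[$documentId] ?? 0) + $score;
    }
    arsort($merged); // highest combined score first, keys preserved
    return $merged;
}

$databaseHits = [12 => 3.0, 40 => 1.5]; // illustrative metadata matches
$luceneHits   = [40 => 2.0, 7 => 1.0];  // illustrative content matches
$ranked = mergeRankedResults($databaseHits, $luceneHits);
// Document 40 scores 3.5, document 12 scores 3.0, document 7 scores 1.0.
```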
	33	+
	34	+The 'search_saved' table stores the expressions. The 'type' field describes what the saved search would be used for. The features will be used
	35	+in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription
	36	+
	37	+The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background.
	38	+
	39	+Folder Structure
	40	+----------------
	41	+
	42	+The core search functionality is located in the ktroot/search2 folder. It comprises an 'indexing' folder and a 'search' folder.
	43	+The 'indexing' folder contains the core functionality regarding indexing using Lucene - using the Java Lucene server or the PHP Lucene Server.
	44	+The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene
	45	+and the database, ranking and merging results.
	46	+
	47	+search2/indexing/bin - various scripts that can be run from the command line.
	48	+search2/indexing/extractors - text extractors used to extract text from various files.
	49	+search2/indexing/extractorHooks - hooking mechanisms around extraction process.
	50	+search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation.
	51	+search2/indexing/lib - libraries that may be required that are specific to Lucene.
	52	+search2/indexing/test - some basic test scripts.
	53	+
	54	+
	55	+search2/search - the primary location of search functionality.
	56	+search2/search/bin - various scripts that can be run from the command line.
	57	+search2/search/fields - the fields that can be used in expressions.
	58	+search2/search/test - some basic test scripts.
	59	+
	60	+bin/luceneserver - the location of the Java Lucene Server.
	61	+
	62	+Additional Search Requirements
	63	+------------------------------
	64	+
	65	+The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project.
	66	+See http://pear.php.net/package/PHP_ParserGenerator for more details.
	67	+
	68	+Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated
	69	+into the ZendFramework. See http://framework.zend.com for more details.
	70	+
	71	+search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface to the PHP ZendFramework.
	72	+
	73	+search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server
	74	+must be running for this to work.

search2/docs/extractors.txt 0 → 100644


	1	+SEARCH2 - HOWTO WRITE AN EXTRACTOR
	2	+==================================
	3	+
	4	+All extractors are located in the search2/indexing/extractors folder.
	5	+
	6	+Naming Convention
	7	+-----------------
	8	+
	9	+The extractor must be a class descended from DocumentExtractor, and its name must end with 'Extractor'. The file containing the class
	10	+should have the same name as the class, with the extension '.inc.php'.
	11	+
	12	+Example
	13	+-------
	14	+
	15	+The simplest extractor is the following:
	16	+
	17	+class SomeExtractor extends DocumentExtractor
	18	+{
	19	+    public function getDisplayName()
	20	+    {
	21	+        return _kt('Some Extractor');
	22	+    }
	23	+
	24	+    public function getSupportedMimeTypes()
	25	+    {
	26	+        return array('text/plain','text/csv');
	27	+    }
	28	+
	29	+    public function extractTextContent()
	30	+    {
	31	+        $content = file_get_contents($this->sourcefile);
	32	+        if (false === $content)
	33	+        {
	34	+            return false;
	35	+        }
	36	+
	37	+        $result = file_put_contents($this->targetfile, $this->filter($content));
	38	+
	39	+        return false !== $result;
	40	+    }
	41	+
	42	+    public function diagnose()
	43	+    {
	44	+        return null;
	45	+    }
	46	+}
	47	+
	48	+The filename is 'SomeExtractor.inc.php'.
	49	+
	50	+Note that the DocumentExtractor class has some attributes that can be referenced:
	51	+1) sourcefile - the source filename from which the text must be extracted
	52	+2) targetfile - the target filename where the text that is extracted should be saved.
	53	+
	54	+The class requires 4 methods:
	55	+1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users.
	56	+2) getSupportedMimeTypes() - tells the system what mime types the extractor supports.
	57	+3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile.
	58	+4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message.
	59	+
	60	+Writing an extractor based on a command line application
	61	+--------------------------------------------------------
	62	+
	63	+To illustrate how this can be done, the PDFExtractor is displayed:
	64	+
	65	+class PDFExtractor extends ApplicationExtractor
	66	+{
	67	+    public function __construct()
	68	+    {
	69	+        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}');
	70	+    }
	71	+
	72	+    public function getSupportedMimeTypes()
	73	+    {
	74	+        return array('application/pdf');
	75	+    }
	76	+}
	77	+
	78	+Note that the constructor takes the parameters:
	79	+
	80	+function __construct($section, $appname, $command, $displayname, $params)
	81	+
	82	+The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is
	83	+used by default. If you rely on $command, it should be accessible via the PATH environment variable.
	84	+
	85	+$displayname is the friendly name that will be displayed in the dashboard.
	86	+
	87	+Note that $params should contain {source} and {target} placeholders. These will be replaced by the system.
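
How the {source} and {target} placeholders might be turned into a command line can be sketched as follows. This is not the ApplicationExtractor implementation. `buildExtractorCommand` is a hypothetical helper, the file paths are made up, and quoting the paths with `escapeshellarg` is an assumption about what a careful implementation would do.

```php
<?php
// Hypothetical helper: substitute the placeholders in $params and prepend
// the command. strtr() with an array replaces both placeholders in a
// single pass, so a path containing "{target}" is not substituted again.
function buildExtractorCommand(string $command, string $params, string $source, string $target): string
{
    $arguments = strtr($params, [
        '{source}' => escapeshellarg($source),
        '{target}' => escapeshellarg($target),
    ]);
    return $command . ' ' . $arguments;
}

$commandLine = buildExtractorCommand(
    'pdftotext',
    '-nopgbrk -enc UTF-8 {source} {target}',
    '/tmp/in.pdf',
    '/tmp/out.txt'
);
```

The exact quoting in `$commandLine` depends on the platform, since `escapeshellarg` quotes differently on Windows and on Unix-like systems.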

search2/docs/userguide.txt 0 → 100644


	1	+SEARCH2 User Guide
	2	+==================
	3	+
	4	+TODO: put this on the wiki.
	5	+
	6	+The new search engine provides for more complicated search expressions than were possible in the past.
	7	+
	8	+Expression Language
	9	+-------------------
	10	+
	11	+The core of the search engine is the 'expression language'.
	12	+
	13	+Expressions may be built up using the following grammar:
	14	+expr ::= expr { AND | OR } expr
	15	+expr ::= NOT expr
	16	+expr ::= (expr)
	17	+expr ::= expr { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value
	18	+expr ::= field BETWEEN value AND value
	19	+expr ::= field DOES [ NOT ] CONTAIN value
	20	+expr ::= field IS [ NOT ] LIKE value
	21	+value ::= "search text here"
	22	+
	23	+A field may be one of the following:
	24	+CheckedOut, CheckedOutBy, CheckedoutDelta, Created, CreatedBy, CreatedDelta, DiscussionText, DocumentId,
	25	+DocumentText, DocumentType, Filename, Filesize, Folder, GeneralText, IsCheckedOut, IsImmutable,
	26	+Metadata, MimeType, Modified, ModifiedBy, ModifiedDelta, Tag, Title, Workflow,
	27	+WorkflowID, WorkflowState, WorkflowStateID
	28	+
	29	+A 'field' may also refer to metadata using the following syntax:
	30	+["fieldset name"]["field name"]
	31	+
	32	+Note that 'values' must be contained within "double quotes".
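
The quoting rules above can be illustrated by composing an expression string in code. This is a sketch only: the helper names are hypothetical, the fieldset "Invoice" and field "Supplier" are invented for the example, and since the guide does not say how a double quote inside a value should be escaped, such values are rejected rather than guessed at.

```php
<?php
// Hypothetical helper: wrap a value in the required double quotes.
function quoteValue(string $value): string
{
    if (strpos($value, '"') !== false) {
        throw new InvalidArgumentException('Value must not contain a double quote');
    }
    return '"' . $value . '"';
}

// Hypothetical helper: build a ["fieldset name"]["field name"] reference.
function metadataField(string $fieldset, string $field): string
{
    return '[' . quoteValue($fieldset) . '][' . quoteValue($field) . ']';
}

$expression = 'Title CONTAINS ' . quoteValue('budget')
    . ' AND ' . metadataField('Invoice', 'Supplier') . ' = ' . quoteValue('Acme');
// $expression is: Title CONTAINS "budget" AND ["Invoice"]["Supplier"] = "Acme"
```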
	33	+
	34	+User Interface Features
	35	+-----------------------
	36	+
	37	+A) Quick Search widget
	38	+
	39	+This appears on the main navigation bar. Text entered into this widget will be searched according to two options:
	40	+1) metadata only
	41	+2) filename, title, metadata and document content
	42	+
	43	+B) Text Extractor Diagnostics Plugin
	44	+
	45	+This is available via the dashboard to the administrator.
	46	+The results may also be obtained by running the search2/indexing/bin/diagnose.php script.
	47	+
	48	+C) Search Portlet
	49	+
	50	+When browsing through the repository, the search portlet will be available to the right. It will provide a few extra options regarding search.