Peter M. Groen / knowledgetree

Commit 4168a4b066964a57c0b41612308f82a972e3a5ac

Authored by conradverm 2007-09-27 10:02:34 +0000

KTS-673

"The search algorithm needs some work"
Added. Basic documentation on the search

Committed By: Conrad Vermeulen
Reviewed By: Kevin Fourie

git-svn-id: https://kt-dms.svn.sourceforge.net/svnroot/kt-dms/trunk@7232 c91229c3-7414-0410-bfa2-8a42b809f60b

Showing 4 changed files with 289 additions and 0 deletions

search2/docs/adminguide.txt 0 → 100644

		1	+SEARCH2 Administrator Guide
		2	+===========================
		3	+
		4	+TODO: put this on the wiki.
		5	+
		6	+Configuration
		7	+-------------
		8	+
		9	+[search]
		10	+; The number of results per page
		11	+; defaults to 25
		12	+resultsPerPage = default
		13	+
		14	+; The date format used when making queries using widgets
		15	+; defaults to Y-m-d .... NOTE Future development
		16	+dateFormat = default
		17	+
		18	+[indexer]
		19	+; The core indexing class
		20	+coreClass=PHPLuceneIndexer
		21	+
		22	+; The number of documents to be indexed in a cron session
		23	+; defaults to 20
		24	+batchDocuments = default
		25	+
		26	+; The location of the lucene indexes
		27	+luceneDirectory=${varDirectory}/indexes
		28	+
		29	+; The URL for the Java Lucene Server. This should match the Lucene Server configuration.
		30	+; Defaults to http://localhost:8875
		31	+javaLuceneURL = default
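
To illustrate how the 'default' placeholders above could be resolved, here is a minimal sketch. It assumes a hypothetical helper, resolveSearchConfig(), that is not part of KnowledgeTree; the fallback values are the ones stated in the comments of the configuration listing. INI_SCANNER_RAW is used so that values such as ${varDirectory} are left untouched rather than interpolated by PHP.

```php
<?php
// Hypothetical helper (not KnowledgeTree code): replaces the literal value
// 'default' with the defaults documented in the configuration comments.
function resolveSearchConfig(string $iniText): array
{
    $defaults = [
        'search'  => ['resultsPerPage' => 25, 'dateFormat' => 'Y-m-d'],
        'indexer' => ['batchDocuments' => 20, 'javaLuceneURL' => 'http://localhost:8875'],
    ];

    // Raw mode keeps every value as the string written in the file.
    $parsed = parse_ini_string($iniText, true, INI_SCANNER_RAW);
    if ($parsed === false) {
        throw new InvalidArgumentException('The configuration text is not valid INI');
    }

    $resolved = [];
    foreach ($parsed as $section => $values) {
        foreach ($values as $key => $value) {
            if ($value === 'default' && isset($defaults[$section][$key])) {
                $value = $defaults[$section][$key];
            }
            $resolved[$section][$key] = $value;
        }
    }
    return $resolved;
}
```

Explicit values are returned as strings exactly as written; only the 'default' placeholder is substituted.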
		32	+
		33	+Setting up the Lucene Directory
		34	+-------------------------------
		35	+
		36	+If using the Java Lucene Server, simply start the server. Ensure that it is configured correctly. Some more information is available
		37	+in ktroot/bin/luceneserver/README.TXT
		38	+
		39	+Edit the config.ini and ensure that the 'javaLuceneURL' field is correct.
		40	+
		41	+If using the PHP Lucene Server, you need to run the search2/indexing/bin/recreateIndex.php script.
		42	+
		43	+Migration
		44	+---------
		45	+
		46	+Migrating to the new server requires that the content of the full text tables is extracted and inserted into the Lucene indexes.
		47	+This is done using the search2/indexing/bin/migrate.php script. (This can place a heavy load on the system, so take care when running it.)
		48	+
		49	+Search Results Ranking
		50	+----------------------
		51	+
		52	+Review the 'search_ranking' table to find the weightings associated with matching subexpressions. These may be modified to improve the
		53	+relevance of search results according to your needs.
		54	+
		55	+Status
		56	+------
		57	+
		58	+TODO:
		59	+The lucene indexers should provide some statistics and general information on the lucene index. A diagnostics
		60	+function should also be available to ensure that the correct version of each document is indexed, and possibly to reschedule indexing if there is a mismatch for
		61	+some reason. (This feature could be heavy on the system - care should be taken when implementing.)
		62	+
		63	+Background Tasks
		64	+----------------
		65	+search2/indexing/bin/cronIndexer.php - task to batch index files.
		66	+search2/indexing/bin/optimise.php - task to optimise the lucene index.
		67	+
		68	+The indexing script should be run frequently - say every 5 minutes. The config.ini allows the number of documents indexed per run to be configured; this
		69	+defaults to 20. If the script is run more often, you may want to decrease the number of documents indexed per run so that the load does not
		70	+impact the performance of the system.
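
The trade-off between run frequency and batch size can be worked through with a little arithmetic. The sketch below assumes a hypothetical helper, documentsPerHour(), which is not part of KnowledgeTree; it only multiplies the number of whole runs per hour by the batch size.

```php
<?php
// Hypothetical helper: estimates how many documents the cron indexer can
// process per hour for a given batchDocuments value and run interval.
function documentsPerHour(int $batchDocuments, int $intervalMinutes): int
{
    if ($batchDocuments < 0 || $intervalMinutes <= 0) {
        throw new InvalidArgumentException('Batch size must be >= 0 and interval must be > 0');
    }
    // Only whole runs that fit into an hour are counted.
    $runsPerHour = intdiv(60, $intervalMinutes);
    return $runsPerHour * $batchDocuments;
}

echo documentsPerHour(20, 5), "\n"; // 12 runs x 20 documents = 240
echo documentsPerHour(10, 1), "\n"; // 60 runs x 10 documents = 600
```

With the documented defaults (20 documents every 5 minutes), roughly 240 documents can be indexed per hour.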
		71	+
		72	+The lucene index requires optimisation to ensure that performance is optimal. This could be run once a day around midnight, or weekly depending on frequency
		73	+of updates to the index.
		74	+
		75	+HOWTO - how to run a php script from the command line
		76	+-----------------------------------------------------
		77	+
		78	+php -Cq script.php

search2/docs/architecture.txt 0 → 100644

		1	+SEARCH2 ARCHITECTURE
		2	+====================
		3	+
		4	+TODO: put this on the wiki.
		5	+
		6	+Introduction
		7	+------------
		8	+
		9	+Locating documents easily should be one of the most important features of the DMS. The new search implementation must be flexible enough
		10	+to accommodate KnowledgeTree's metadata and document content.
		11	+
		12	+The previous search was implemented using mysql's full text indexes, but it was found to be rather limiting from the perspective
		13	+of returning useful results. We decided to adapt a known search library - Lucene - to remedy the situation.
		14	+
		15	+The complexity of integrating Lucene with KnowledgeTree is that the data is now separated between a database and an external source.
		16	+
		17	+KnowledgeTree needs to provide a mechanism by which the two can be queried easily. The idea was to provide an expression language for
		18	+this purpose. An expression is evaluated to identify the subexpressions that should run on Lucene and those
		19	+that should run on the metadata in the database, with the results finally being merged.
		20	+
		21	+New Database Requirements
		22	+-------------------------
		23	+
		24	+In order to further improve the user experience, the indexing of documents is to be scheduled as a background task. When documents
		25	+are added to or checked into KnowledgeTree, a reference to the document is added to a 'pending' index queue. The background task will process
		26	+items in the 'pending' index queue.
		27	+
		28	+The index queue is maintained by the 'index_files' table. It has a 'what' field that identifies what should be indexed. Possible values
		29	+are: 'C' = Content, 'D' = Discussion, 'A' = Content and Discussion
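
As an illustration of how the 'what' codes could be interpreted by a background task, here is a minimal sketch. The constant INDEX_WHAT and the function partsToIndex() are hypothetical names chosen for this example; only the codes 'C', 'D' and 'A' come from the description above.

```php
<?php
// Codes of the 'what' field in the 'index_files' table, as documented above.
// The array values are illustrative labels, not KnowledgeTree identifiers.
const INDEX_WHAT = [
    'C' => ['content'],
    'D' => ['discussion'],
    'A' => ['content', 'discussion'],
];

// Hypothetical helper: returns the parts of a document that a queue entry
// asks the indexer to process.
function partsToIndex(string $what): array
{
    if (!array_key_exists($what, INDEX_WHAT)) {
        throw new InvalidArgumentException("Unknown 'what' value: " . $what);
    }
    return INDEX_WHAT[$what];
}
```

Rejecting unknown codes early keeps a malformed queue entry from being silently skipped.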
		30	+
		31	+The 'search_ranking' table is used to associate weightings with different fields. The weights are used when subexpressions match on various fields
		32	+and when results from the database and Lucene must be merged.
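
One way to picture the merge step is the sketch below. It is an assumption-laden illustration, not the actual ranking code: the function mergeRankedResults(), the hit keys 'document_id' and 'field', and the shape of $weights (field name => weighting, standing in for rows of 'search_ranking') are all hypothetical.

```php
<?php
// Hypothetical sketch: sums the weighting of every matching field per
// document, across database hits and Lucene hits, then sorts by score.
function mergeRankedResults(array $dbHits, array $luceneHits, array $weights): array
{
    $scores = [];
    foreach (array_merge($dbHits, $luceneHits) as $hit) {
        $id = $hit['document_id'];
        $weight = $weights[$hit['field']] ?? 0; // unknown fields add nothing
        $scores[$id] = ($scores[$id] ?? 0) + $weight;
    }
    arsort($scores); // highest combined score first, document ids preserved as keys
    return $scores;
}
```

A document matched by both sources accumulates the weights from each match, so it ranks above a document matched by only one of them.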
		33	+
		34	+The 'search_saved' table stores the expressions. The 'type' field describes what the saved search would be used for. These features will be used
		35	+in future versions. The types defined include: 'S' = Saved Search, 'C' = Conditional Permission, 'W' = Workflow Guard, 'B' = Subscription
		36	+
		37	+The 'search_saved_events' table tracks events so that the subscribed search functionality can run in the background.
		38	+
		39	+Folder Structure
		40	+----------------
		41	+
		42	+The core search functionality is located in the ktroot/search2 folder. It comprises an 'indexing' folder and a 'search' folder.
		43	+The 'indexing' folder contains the core functionality regarding indexing using Lucene - using the Java Lucene server or the PHP Lucene Server.
		44	+The 'search' folder contains the core search functionality that deals with evaluating a search expression, breaking it up into parts for Lucene
		45	+and the database, ranking and merging results.
		46	+
		47	+search2/indexing/bin - various scripts that can be run from the command line.
		48	+search2/indexing/extractors - text extractors used to extract text from various files.
		49	+search2/indexing/extractorHooks - hooking mechanisms around extraction process.
		50	+search2/indexing/indexers - the location of the actual indexers that could be used. Only one may be used in an installation.
		51	+search2/indexing/lib - libraries that may be required that are specific to Lucene.
		52	+search2/indexing/test - some basic test scripts.
		53	+
		54	+
		55	+search2/search - the primary location of search functionality.
		56	+search2/search/bin - various scripts that can be run from the command line.
		57	+search2/search/fields - the fields that can be used in expressions.
		58	+search2/search/test - some basic test scripts.
		59	+
		60	+bin/luceneserver - the location of the Java Lucene Server.
		61	+
		62	+Additional Search Requirements
		63	+------------------------------
		64	+
		65	+The search2 expression engine is built using a 'compiler' tool called phplemon, which is part of the PEAR PHP_ParserGenerator project.
		66	+See http://pear.php.net/package/PHP_ParserGenerator for more details.
		67	+
		68	+Lucene is an Apache project - http://lucene.apache.org. The 'main' project is Java based, but it has also been ported to PHP and incorporated
		69	+into the ZendFramework. See http://framework.zend.com for more details.
		70	+
		71	+search2/indexing/PHPLuceneIndexer.inc.php contains the code to interface to the PHP ZendFramework.
		72	+
		73	+search2/indexing/JavaXMLRPCLuceneIndexer.inc.php contains the code to interface with the Java Lucene Server. The Java Lucene Server
		74	+must be running for this to work.

search2/docs/extractors.txt 0 → 100644

		1	+SEARCH2 - HOWTO WRITE AN EXTRACTOR
		2	+==================================
		3	+
		4	+All extractors are located in the search2/indexing/extractors folder.
		5	+
		6	+Naming Convention
		7	+-----------------
		8	+
		9	+The extractor must be a class descended from DocumentExtractor, and its name must end with the suffix 'Extractor'. The filename for the class
		10	+should have the same name as the class, but with the extension '.inc.php'.
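
The naming convention can be checked mechanically. The sketch below assumes a hypothetical helper, extractorFilename(), which is not part of KnowledgeTree; it validates the 'Extractor' suffix and derives the expected filename.

```php
<?php
// Hypothetical helper: validates an extractor class name against the naming
// convention and returns the filename the class should live in.
function extractorFilename(string $className): string
{
    // A valid PHP identifier with at least one character before 'Extractor'.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*Extractor$/', $className)) {
        throw new InvalidArgumentException("Class name must end with 'Extractor': " . $className);
    }
    return $className . '.inc.php';
}

echo extractorFilename('PDFExtractor'), "\n"; // PDFExtractor.inc.php
```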
		11	+
		12	+Example
		13	+-------
		14	+
		15	+The simplest extractor is the following:
		16	+
		17	+class SomeExtractor extends DocumentExtractor
		18	+{
		19	+    public function getDisplayName()
		20	+    {
		21	+        return _kt('Some Extractor');
		22	+    }
		23	+
		24	+    public function getSupportedMimeTypes()
		25	+    {
		26	+        return array('text/plain','text/csv');
		27	+    }
		28	+
		29	+    public function extractTextContent()
		30	+    {
		31	+        $content = file_get_contents($this->sourcefile);
		32	+        if (false === $content)
		33	+        {
		34	+            return false;
		35	+        }
		36	+
		37	+        $result = file_put_contents($this->targetfile, $this->filter($content));
		38	+
		39	+        return false !== $result;
		40	+    }
		41	+
		42	+    public function diagnose()
		43	+    {
		44	+        return null;
		45	+    }
		46	+}
		47	+
		48	+The filename is 'SomeExtractor.inc.php'.
		49	+
		50	+Note that the DocumentExtractor class has some attributes that can be referenced:
		51	+1) sourcefile - the source filename from which the text must be extracted
		52	+2) targetfile - the target filename where the text that is extracted should be saved.
		53	+
		54	+The class requires 4 methods:
		55	+1) getDisplayName() - provides the system with a friendly name for the extractor which will be displayed to users.
		56	+2) getSupportedMimeTypes() - tells the system what mime types the extractor supports.
		57	+3) extractTextContent() - the function that does the work. It must read from sourcefile and write to targetfile.
		58	+4) diagnose() - it must return null if there are no problems. Otherwise, it should return a string with an error/informational message.
		59	+
		60	+Writing an extractor based on a command line application
		61	+--------------------------------------------------------
		62	+
		63	+To illustrate how this can be done, the PDFExtractor is displayed:
		64	+
		65	+class PDFExtractor extends ApplicationExtractor
		66	+{
		67	+    public function __construct()
		68	+    {
		69	+        parent::__construct('extractors','pdftotext','pdftotext','PDF Text Extractor','-nopgbrk -enc UTF-8 {source} {target}');
		70	+    }
		71	+
		72	+    public function getSupportedMimeTypes()
		73	+    {
		74	+        return array('application/pdf');
		75	+    }
		76	+}
		77	+
		78	+Note that the constructor takes the parameters:
		79	+
		80	+function __construct($section, $appname, $command, $displayname, $params)
		81	+
		82	+The application path is resolved from $section/$appname in the config.ini. If it is not found in the config.ini, the $command is
		83	+used by default. If you rely on $command, it should be accessible via the PATH environment variable.
		84	+
		85	+$displayname is the friendly name that will be displayed in the dashboard.
		86	+
		87	+Note that $params should contain {source} and {target} placeholders. These will be replaced by the system.
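
To make the path resolution and placeholder substitution concrete, here is a minimal sketch. It is not the ApplicationExtractor implementation: resolveApplication() and buildCommandLine() are hypothetical helpers, and $config is assumed to be a plain array of the form section => [appname => path] standing in for the config.ini lookup.

```php
<?php
// Hypothetical stand-in for the config.ini lookup: use the configured path
// for $section/$appname if present, otherwise fall back to $command.
function resolveApplication(array $config, string $section, string $appname, string $command): string
{
    return $config[$section][$appname] ?? $command;
}

// Hypothetical helper: replaces the {source} and {target} placeholders in
// $params with shell-escaped filenames and prepends the application.
function buildCommandLine(string $app, string $params, string $source, string $target): string
{
    $args = strtr($params, [
        '{source}' => escapeshellarg($source),
        '{target}' => escapeshellarg($target),
    ]);
    return $app . ' ' . $args;
}

$app = resolveApplication([], 'extractors', 'pdftotext', 'pdftotext');
echo buildCommandLine($app, '-nopgbrk -enc UTF-8 {source} {target}', '/tmp/in.pdf', '/tmp/out.txt'), "\n";
```

Escaping the filenames before substitution is a design choice of this sketch; it guards against spaces or shell metacharacters in document paths.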

search2/docs/userguide.txt 0 → 100644

		1	+SEARCH2 User Guide
		2	+==================
		3	+
		4	+TODO: put this on the wiki.
		5	+
		6	+The new search engine provides for more complicated search expressions than were possible in the past.
		7	+
		8	+Expression Language
		9	+-------------------
		10	+
		11	+The core of the search engine is the 'expression language'.
		12	+
		13	+Expressions may be built up using the following grammar:
		14	+expr ::= expr { AND | OR } expr
		15	+expr ::= NOT expr
		16	+expr ::= (expr)
		17	+expr ::= field { < | <= | = | > | >= | CONTAINS | STARTS WITH | ENDS WITH } value
		18	+expr ::= field BETWEEN value AND value
		19	+expr ::= field DOES [ NOT ] CONTAIN value
		20	+expr ::= field IS [ NOT ] LIKE value
		21	+value ::= "search text here"
		22	+
		23	+A field may be one of the following:
		24	+CheckedOut , CheckedOutBy , CheckedoutDelta , Created , CreatedBy , CreatedDelta , DiscussionText , DocumentId ,
		25	+DocumentText , DocumentType , Filename , Filesize , Folder , GeneralText , IsCheckedOut , IsImmutable ,
		26	+Metadata , MimeType , Modified , ModifiedBy , ModifiedDelta , Tag , Title , Workflow ,
		27	+WorkflowID , WorkflowState , WorkflowStateID
		28	+
		29	+A 'field' may also refer to metadata using the following syntax:
		30	+["fieldset name"]["field name"]
		31	+
		32	+Note that 'values' must be contained within "double quotes".
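
The grammar and quoting rules above can be exercised with a small expression builder. This is a sketch only: quoteValue(), metadataField() and comparison() are hypothetical helpers, and the fieldset "Invoice" with field "Amount" is an invented example. Because the guide does not say how a double quote inside a value is escaped, the sketch rejects such values instead of guessing.

```php
<?php
// Hypothetical helper: wraps a value in the double quotes the language requires.
function quoteValue(string $value): string
{
    if (strpos($value, '"') !== false) {
        throw new InvalidArgumentException('Values containing double quotes are not handled in this sketch');
    }
    return '"' . $value . '"';
}

// Hypothetical helper: builds the ["fieldset name"]["field name"] metadata syntax.
function metadataField(string $fieldset, string $field): string
{
    return '["' . $fieldset . '"]["' . $field . '"]';
}

// Hypothetical helper: builds a single comparison from the grammar above.
function comparison(string $field, string $operator, string $value): string
{
    $allowed = ['<', '<=', '=', '>', '>=', 'CONTAINS', 'STARTS WITH', 'ENDS WITH'];
    if (!in_array($operator, $allowed, true)) {
        throw new InvalidArgumentException('Unsupported operator: ' . $operator);
    }
    return $field . ' ' . $operator . ' ' . quoteValue($value);
}

$expr = comparison('Title', 'CONTAINS', 'budget')
    . ' AND '
    . comparison(metadataField('Invoice', 'Amount'), '>=', '100');
echo $expr, "\n"; // Title CONTAINS "budget" AND ["Invoice"]["Amount"] >= "100"
```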
		33	+
		34	+User Interface Features
		35	+-----------------------
		36	+
		37	+A) Quick Search widget
		38	+
		39	+This appears on the main navigation bar. Text entered into this widget will be searched according to two options:
		40	+1) metadata only
		41	+2) filename, title, metadata and document content
		42	+
		43	+B) Text Extractor Diagnostics Plugin
		44	+
		45	+This is available via the dashboard to the administrator.
		46	+The results may also be obtained by running the search2/indexing/bin/diagnose.php script.
		47	+
		48	+C) Search Portlet
		49	+
		50	+When browsing through the repository, the search portlet will be available to the right. It will provide a few extra options regarding search.