|
The following table compares Rubicon with DBISAM's full text search capability. Please be sure to read the explanations that follow the table as they elaborate on each point. A test program and database are available at the bottom of this page for download. As the table and explanations illustrate, Rubicon emphasizes speed and customization in all aspects of indexing, updating, and searching.
IntegrationRubicon is a set of add on components that handle the full text search (FTS) functions of building, updating, and searching of word indexes. DBISAM offers a built in FTS feature that does not require any additional components. DBISAM FTS is accessed through SQL function like TEXTSEARCH and TEXTOCCURS. PortableRubicon FTS is portable across a wide range of SQL and table based databases including Advantage, ADO, Apollo, BDE, DBISAM, Direct Oracle Access, FlashFiler, InterBase Express, InterBase Objects, NexusDB, ODBC Express, and Topaz. FTS Index StorageRubicon FTS indexes are stored in separate tables. DBISAM FTS indexes are stored in the table's idx file. Rubicon stores its indexes in a separate table, so you have the choice of backing up this file. FTS Index LocationRubicon indexes are not tied to the database of the table being indexed. The Rubicon indexes may be stored anywhere. Index on Separate PCSince indexing is resource intensive, it is often useful to perform this task on a separate PC. Rubicon can read the target data from a remote system and create the FTS indexes on the local system. DBISAM can index a table on another system, but the FTS indexes have to be written back to that system. Indexing Open TableRubicon does not require exclusive access to the table when indexing so the table may be open for read access by other users. When adding FTS indexing to an existing DBISAM table, the table must be restructured which requires exclusive access. Access to Indexed WordsThe words indexed by Rubicon are visible in the "words" table created by Rubicon. This table includes the number of occurrences of the word in the table. The ability to view the words provides useful feedback when fine tuning the parsing and definition of words. Index All Field TypesRubicon can index all native fields types plus calculated and lookup fields. Custom Field HandlingWith Rubicon you can customize the handling of the text in the field. If the field contains RTF or HTML, the field can be converted to plain text before indexing. If there are certain parts of the text you do not wish to be indexed, you can prevent those parts from being indexed. Custom Word HandlingRubicon allows you to completely control which words are indexed. You can customize the parsing routine and inspect each word before it is indexed. For instance, the parsing routine may return '00000xyz' as a word, but by using the Rubicon TrbAccept component (or your own code) this word can be rejected for indexing. Partial IndexingPartial indexing is useful when you are trying to refine your word parsing and definition routines. Instead of indexing the entire table, you can index the first 1000 or so records and inspect the results. Segmented IndexingIndexing very large tables (number of records, number of unique words, or both) is very resource intensive and may be impossible unless the task is broken down into a series of smaller tasks. Rubicon addresses this issue by using segmented indexing. Segmentation allows Rubicon to divide the table being indexed into small pieces virtually (Rubicon never alters the target table) and each piece or segment is indexed individually. Index 50mb SpeedTime to index a 50mb table of 15,000 newsgroup messages with 326,000 unique words. Rubicon extensively uses in memory caching and index compression to minimize disk access. Index 50mb SizeSize of the indexes created. Rubicon stores its indexes in the same table type as the table being indexed, in this case a DBISAM table. The combined size of the DBISAM blb, dat, and idx files are 52mb. For DBISAM FTS, the indexes are stored in the idx file. After restructuring the DBISAM table, the idx file size grows by 29mb. The Rubicon index size could be significantly reduced by using a 64 byte BlobBlockSize and by taking advantage of Rubicon's custom word handling to reject many of the 326k words. Search LogicOperations supported by the FTS engine. The Rubicon NEAR operator has a default distance of eight words. The LIKE operator uses a modified soundex routine, but a custom routine may also be used. Wildcard SearchesRubicon supports single and multi character wildcard searches. Matching Words & CountsMatching words and counts are available via the Rubicon MatchingWords method. This information provides valuable feedback to the user, especially when using wildcard searches. RankingRubicon search results can be ranked by count, percent (record with highest count is 100%), and presence (word only counted once per record). Search performance can be improved when ranking is disabled. Limit Match ResultsIt is not unusual for users to submit vague searches that return thousands of matches. In these instances it is not generally useful to return a complete result set, but rather return the first 100 matching records and allow the user to refine the search. With Rubicon you can perform the search, determine the number of matching records, and then decide whether to create a dataset of matching records and, if so, how many records to include. With DBISAM you can limit the query dataset size by using the MaxRowCount property, but then you cannot determine how many records actually matched the search criteria. Match DataSet OptionalRubicon does not require that a match dataset be created, which can be time consuming (especially with a large number of matching records). Instead the first page of match results can be created and displayed to the user by navigating through the indexed table using the Rubicon FindFirst and FindNext methods. These methods navigate in natural order or in rank order. Search Speed Narrow/Normal/VagueTime to complete ranked searches and create a dataset containing the matching records. The matching dataset sizes for narrow, normal, and vague searches are 14, 238, 5641 records, respectively. The search words were "century", "workstation" , and "windows". When performing non proximity searches with Rubicon, the matching record count is available immediately (under 100ms on the slowest systems, less than 1ms on a 2.8ghz P4) regardless of the number of matching records. Thus the Rubicon search times are almost entirely comprised of ranking and the creation of the dataset containing the matching records. To manage the search time, your Rubicon application can execute the search, immediately determine the number of matching records, and then decide how to proceed on the more time consuming tasks of ranking and match dataset creation. Rubicon match results can be created in any supported dataset. During testing it was discovered that a TClientDataSet performed much better than an in memory TDBISAMTable. The Rubicon vague search was performed in 18 seconds with a TClientDataSet and 60 seconds using an in memory TDBISAM table. The test program available for download does not use a TClientDataSet. Test Program & DatabaseDownload the test program (executable and source code) and database files and unzip them in their own directory. Run the program, create the FTS indexes, and perform searches.
Test ResultsTests were performed under Windows XP, 2.8ghz P4 (hyperthreading enabled), 7200rpm ATA drive, 1gb memory.
|
| Copyright 2006 © Tamarack Associates | ||
| www.TamarackA.com | Last updated 10/23/03 | www.FullTextSearch.com |