SearchPDF for Millions of Documents
Overview
SearchPDF can build a searchable index over millions of documents, provided the documents are organized for good performance. This paper discusses how to organize documents and search-indexes when the document collection consists of millions of pages.
SearchPDF uses dtSearch Indexes to perform a search. dtSearch Indexes are generated in a batch process before making the collection available for searching.
dtSearch Index Size Limits
There are practical limits to the size of a dtSearch Index. A single dtSearch Index can hold roughly 4 to 8 gigabytes of text. Technically, the limit is that none of the individual *.ix files in a dtSearch Index can exceed 2 gigabytes in size. When more than 4 to 8 gigabytes of text are to be indexed, we recommend building multiple indexes. One common approach is to have one or more "archive" indexes, organized by date (e.g., one for 1998 documents, one for 1999 documents, etc.), and a "Current" index for the documents that are still being added.
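The archive-by-date approach can be planned ahead of time with simple arithmetic. The following is a minimal sketch, not part of SearchPDF or dtSearch: it assumes you already know the approximate amount of extracted text per year, and it uses the conservative end of the range (4 gigabytes) as a cap. The function name plan_indexes and the cap value are illustrative assumptions.

```python
GIB = 1024 ** 3  # bytes in one gigabyte (binary)


def plan_indexes(text_bytes_by_group, cap_bytes=4 * GIB):
    """Greedily pack groups (for example, years) into indexes.

    text_bytes_by_group maps a group label to its estimated text size in
    bytes. Groups are taken in the order given, and a new index is started
    whenever adding the next group would exceed cap_bytes.
    The 4 GiB default is an assumed, conservative planning figure.
    """
    plans = []
    current, current_size = [], 0
    for group, size in text_bytes_by_group.items():
        if size > cap_bytes:
            raise ValueError(f"group {group!r} alone exceeds the cap; split it further")
        if current and current_size + size > cap_bytes:
            plans.append(current)
            current, current_size = [], 0
        current.append(group)
        current_size += size
    if current:
        plans.append(current)
    return plans


# Hypothetical sizes: 1998 fills most of an index on its own,
# while 1999 and 2000 fit together in a second index.
print(plan_indexes({"1998": 3 * GIB, "1999": 2 * GIB, "2000": 1 * GIB}))
```

Each inner list in the result corresponds to one archive index to build. The real constraint is the 2 gigabyte limit on individual *.ix files, so treat the output as a starting point and check the generated index files.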
Maximum Number of Files and Folder Structure
File access under Windows 9x (using the FAT32 file system) can become quite slow when folders contain more than a few thousand documents. Therefore, for document collections containing hundreds of thousands of documents, we recommend that the documents be organized into subfolders, preferably with fewer than 1,000 documents in each folder (this number will vary depending on your Windows environment). The maximum number of files and subfolders in a single FAT32 folder is about 65,000 (65,534 directory entries, and fewer when long file names are used).
Fortunately, today you will most likely be using Windows XP or Windows 2000 with the NTFS file system, and NTFS does not have the FAT32 limit on the number of files in a folder. Under NTFS the only hard ceiling is the per-volume limit of about 4 billion files. Even so, with NTFS we recommend limiting the number of document files in one folder to about 200,000 for best performance.
Divide and Conquer
One user has reported building a single dtSearch Index for more than 2 million documents. However, this dtSearch Index took a long time to generate (and we assume that the maximum size limit on a dtSearch Index was not reached, otherwise it would not have worked). For better performance, and to avoid reaching dtSearch Index size limits, it is better to create multiple, smaller dtSearch Indexes and then search them concurrently. There is no limit on the number of dtSearch Indexes that can be searched concurrently in SearchPDF.
An Example Strategy for a 5 Million Document Collection
OK. You have 5 million documents. This is not a problem if you organize the documents and search-indexes correctly, as discussed below.
Logical Division of Documents and Indexes
The first thing to do is to determine if these documents can be logically divided into collections. For example, if the collection is shipping receipts and invoices, you really have two collections or sub-collections: receipts and invoices. Now store each sub-collection under a separate folder tree.
Next, see if each sub-collection can be further logically divided. In our example, we might divide invoices by year. In this case, we can now, under the Invoices folder, have further "year" subfolders, and the invoice documents can now be sorted into the year subfolders.
Once you have completed the logical division of the document collection, it is time to check the number of files in each document subfolder. If the total number of documents in a subfolder exceeds 200,000, divide those documents further, purely because of volume, into additional subfolders, perhaps named "vol1", "vol2", etc.
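The file-count check can be automated. The sketch below is only an illustration using the Python standard library; the function name oversized_folders is an assumption, and the 200,000 default simply mirrors the recommendation in this paper.

```python
import os


def oversized_folders(root, limit=200_000):
    """Return {folder_path: file_count} for folders holding more than limit files.

    Only files directly inside each folder are counted; files in its
    subfolders are counted against those subfolders instead.
    """
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > limit:
            result[dirpath] = len(filenames)
    return result
```

Calling oversized_folders on the top of the collection tree returns the folders that are candidates for a further "vol1", "vol2" split. An empty result means every folder is within the limit.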
Now that we have our 5 million documents organized in a folder tree, we can define the multiple search-indexes that will be generated for the collection. Define the dtSearch Indexes so that they correspond to the logical subdivision of documents. In our example, we will then have an "Invoices - 2001 - Part A" index, an "Invoices - 2001 - Part B" index, etc.
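As a rough illustration of how index names can follow the logical subdivision, the sketch below derives one name per collection and year, adding "Part A", "Part B", and so on only where the document count exceeds 200,000. The function name index_names and the input format are assumptions made for this example; they are not part of SearchPDF or dtSearch.

```python
import math
import string

MAX_DOCS_PER_FOLDER = 200_000  # per-folder recommendation from this paper


def index_names(doc_counts):
    """Map {(collection, year): document_count} to a list of index names."""
    names = []
    for (collection, year), count in doc_counts.items():
        parts = max(1, math.ceil(count / MAX_DOCS_PER_FOLDER))
        if parts > len(string.ascii_uppercase):
            raise ValueError("more than 26 parts; subdivide the collection further")
        if parts == 1:
            names.append(f"{collection} - {year}")
        else:
            for letter in string.ascii_uppercase[:parts]:
                names.append(f"{collection} - {year} - Part {letter}")
    return names


# Hypothetical counts: 350,000 invoices need two parts, 120,000 receipts need one.
print(index_names({("Invoices", 2001): 350_000, ("Receipts", 2001): 120_000}))
```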
Division of Documents and Indexes based only on Volume
If your documents cannot be logically divided into groups, it is fine to divide them using volume as the only criterion.
In another example, let's say that our 5 million documents are all "discovery documents". There is no logical subdivision of these documents.
Here is what we can do. We will create 25 subfolders named "001", "002", etc. We will place 200,000 documents into each of these subfolders. Now, we will generate a dtSearch Index for each of these subfolders, named "001", "002", etc.
Now on our SearchPDF search page, we will be searching 25 indexes concurrently, the indexes named "001" through "025".
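The volume-only split is simple to script. The following minimal sketch assigns file names to numbered folders in sorted order; it only computes the assignment and leaves the actual file moves to you. The function name assign_volumes and its parameters are illustrative assumptions.

```python
def assign_volumes(filenames, per_folder=200_000, width=3):
    """Group file names into numbered folders ("001", "002", ...).

    Returns a dict mapping each folder name to the list of file names
    assigned to it, with at most per_folder names in each folder.
    """
    if per_folder <= 0:
        raise ValueError("per_folder must be positive")
    volumes = {}
    for i, name in enumerate(sorted(filenames)):
        folder = f"{i // per_folder + 1:0{width}d}"
        volumes.setdefault(folder, []).append(name)
    return volumes


# Small demonstration with 5 files and 2 files per folder.
demo = assign_volumes([f"doc{i}.pdf" for i in range(5)], per_folder=2)
print(sorted(demo))
```

With 5 million file names and the default of 200,000 per folder, the last file lands in folder "025", matching the 25 indexes described above.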
Summary
SearchPDF can be used to perform search and retrieval operations on millions of documents, as long as the folder structure for the storage of these documents is organized correctly, and as long as multiple search-indexes are used, so as not to exceed the size-limits for a single dtSearch Index.
Public Example
There is one website that is public and accessible, and that searches millions of documents using dtSearch Indexes.
It's a collection of over 4 million internal tobacco industry documents, with citations and OCR. It's a fairly custom configuration (the searches run in parallel across several machines, the data is all in xml, etc.), but it does show the power of dtSearch with large collections.
Multiple Server Configuration for Performance
Search query processing can be configured for multiple servers in SearchPDF, and this approach is recommended for document collections that exceed 5 million documents. In this design, the search query specified by the user is submitted to multiple servers, each of which supports search-indexes for a subset of the complete document collection. The search results are then aggregated on the primary server and returned to the user in aggregated form.
The purpose of using multiple servers is performance. It is a "divide and conquer" strategy for rapid, parallel processing of search queries.
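The fan-out and aggregation pattern can be sketched in a few lines. This is a generic illustration of the idea, not SearchPDF's actual implementation or API: each server is represented by a hypothetical Python callable that takes a query and returns (score, document id) pairs, and the merge step simply sorts all hits by score.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(query, backends, limit=10):
    """Send query to every backend in parallel and merge the hits.

    backends maps a server name to a callable(query) that returns a list
    of (score, doc_id) tuples. Both the callables and the scoring scheme
    are assumptions made for this sketch.
    """
    if not backends:
        return []
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
        hits = []
        for name, future in futures.items():
            for score, doc_id in future.result():
                hits.append((score, name, doc_id))
    # Highest score first; ties are broken by server name, then document id.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits[:limit]


# Two stand-in "servers", each covering part of the collection.
backends = {
    "server-a": lambda q: [(0.9, "a1"), (0.2, "a2")],
    "server-b": lambda q: [(0.5, "b1")],
}
print(search_all("invoice", backends, limit=2))
```

In a real deployment each callable would wrap a network request to one search server, and the merge step would need scores that are comparable across indexes.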