Finding a node in a skip list takes O(log n) operations. Since Lucene's term index is a skip list over the unique terms, the query "do any documents contain term t?" takes log(number of unique terms), and the query "find all documents containing t" takes log(number of unique terms) + the number of matches. If your index is split across multiple segments (i.e. it has not been optimized down to a single segment), you have to multiply these formulae by the number of segments, since each segment is searched independently.
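To make that segment multiplication concrete, here is a minimal sketch in Java. It does not use Lucene's real classes; SegmentModel and IndexModel are made-up names, and a TreeMap stands in for the skip-list term dictionary, since both give logarithmic lookup over sorted terms.

```java
// Toy model (NOT Lucene's actual data structures): each segment keeps its
// own sorted term dictionary, so locating a term costs O(log T) per
// segment, and an unoptimized index with many segments multiplies that.
import java.util.List;
import java.util.TreeMap;

class SegmentModel {
    // term -> posting list of document ids, kept sorted by term
    final TreeMap<String, List<Integer>> terms = new TreeMap<>();

    List<Integer> postings(String term) {
        return terms.getOrDefault(term, List.of()); // O(log T)
    }
}

class IndexModel {
    final List<SegmentModel> segments;

    IndexModel(List<SegmentModel> segments) { this.segments = segments; }

    // "Do any documents contain term t?" -- every segment is probed
    // independently, hence the multiplication by the segment count.
    boolean containsTerm(String term) {
        for (SegmentModel segment : segments) {      // #segments times ...
            if (!segment.postings(term).isEmpty()) { // ... O(log T) each
                return true;
            }
        }
        return false;
    }
}
```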
The real optimization in Lucene comes from the fact that you never search for all documents matching the query, only the top k. Call T the number of unique terms in your index, and say your query has s sub-queries (terms). Finding the s posting streams, one per sub-query, takes s log T, since finding a posting stream is just finding that term in the skip list. Now suppose there are p instances of each term in your query, i.e. each of the s posting streams contains p entries.
Finding the top k matches then means streaming through those s·p postings while keeping a heap of the current best k, at log k per posting, i.e. sp log k, so the entire search runs in
O(s log T + sp log k)
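Here is a sketch of that top-k step in the same toy model (again hypothetical names, not Lucene's API; it also assumes scores are precomputed per posting and ignores combining the scores of a document that matches several terms). Each of the s·p postings costs one O(log k) heap operation.

```java
// Top-k over s posting streams of ~p postings each: every posting is
// offered to a min-heap that holds the best k hits seen so far.
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class TopK {
    record Hit(int docId, double score) {}

    static List<Hit> topK(List<List<Hit>> postingStreams, int k) {
        // Min-heap ordered by score: the root is the weakest current top-k hit.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.score(), b.score()));
        for (List<Hit> stream : postingStreams) {   // s streams ...
            for (Hit hit : stream) {                // ... of roughly p postings each
                heap.offer(hit);                    // O(log k)
                if (heap.size() > k) {
                    heap.poll();                    // O(log k), evict the weakest
                }
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort((a, b) -> Double.compare(b.score(), a.score())); // best first
        return best;
    }
}
```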
This answers another common question about Lucene queries: what is the performance of "special" query types like range and prefix queries? The answer is that they add an extra penalty of up to O(T), since they first have to enumerate the term index to find every term that falls in their range or shares their prefix. The time spent actually finding documents for those terms is not affected (although the user, of course, experiences the whole thing as "query time").
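As an illustration, a prefix query against the toy SegmentModel above might expand like this (a sketch only, with a hypothetical PrefixQueryModel class; Lucene's real term enumeration is more sophisticated). The walk over the sorted term dictionary is the part that grows with the number of terms inspected.

```java
// Expanding a prefix query in the toy model: enumerate matching terms
// first, then hand back one posting stream per matching term.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PrefixQueryModel {
    // Returns one posting stream per term that starts with `prefix`.
    static List<List<Integer>> expand(SegmentModel segment, String prefix) {
        List<List<Integer>> streams = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry :
                segment.terms.tailMap(prefix).entrySet()) {   // enumerate terms
            if (!entry.getKey().startsWith(prefix)) {
                break;                                        // past the prefix range
            }
            streams.add(entry.getValue());                    // one stream per term
        }
        return streams;
    }
}
```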
If you've investigated Lucene performance on the indexing side of things, you know the key variable is the "merge factor", which (in the simplified model used here) bounds the number of partitions the index can have; call that bound p. A larger p means faster writes and slower reads.
Say you have d documents, T unique terms, and your memory can hold b documents at once. Then indexing can be done in

O(d · (d/b)^(1/p))
(Note that in the case where p = 1, this reduces to O(d²/b): the in-memory buffer fills d/b times, and each flush re-merges the entire existing index.)
But what about the other side of things? If p = d, the index degenerates into a linked list of single-document partitions: indexing time is linear, but so is search time, which is a huge degradation from the sp log k stated above. In practice p is always much less than d (the operating system will probably throw "too many open files" if p even reaches the double digits, unless you're using the compound file format, cfs). Ignoring practical considerations, though, you essentially have to run the query once per partition, so the number of partitions (roughly) linearly increases query time.
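To see the write-side trade-off numerically, here is a crude, self-contained simulation (MergeSimulation is a made-up name; this is neither the geometric partitioning algorithm from the Lester/Moffat/Zobel paper below nor Lucene's actual merge policy). It counts how many documents get written, including rewrites during merges, when at most p partitions are allowed and the two smallest partitions are merged whenever the limit is exceeded.

```java
// Simulate buffered indexing of d documents with a buffer of b documents
// and at most p on-disk partitions, counting total documents written.
import java.util.ArrayList;
import java.util.List;

class MergeSimulation {
    static long indexingCost(long d, long b, int p) {
        List<Long> partitions = new ArrayList<>();   // partition sizes in docs
        long cost = 0;
        for (long indexed = 0; indexed < d; indexed += b) {
            long flush = Math.min(b, d - indexed);
            partitions.add(flush);
            cost += flush;                           // writing the new partition
            while (partitions.size() > p) {          // too many partitions:
                partitions.sort(Long::compare);      // merge the two smallest,
                long merged = partitions.remove(0) + partitions.remove(0);
                partitions.add(merged);
                cost += merged;                      // rewriting both of them
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        long d = 1_000_000, b = 1_000;
        for (int p : new int[] {1, 2, 4, 16}) {
            System.out.println("p=" + p + " cost=" + indexingCost(d, b, p));
        }
    }
}
```

With p = 1 the write count behaves like d²/b; raising p drops it toward d, at the price of having more partitions to search at query time.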
- Cutting, D., and J. Pedersen. "Space Optimizations for Total Ranking." In Proceedings of RIAO'97, 1997. http://lucene.sourceforge.net/papers/riao97.ps
- Lester, N., A. Moffat, and J. Zobel. "Fast On-line Index Construction by Geometric Partitioning." In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM), 2005.