Minnesota Search Sprint

by earnest on Mon, 05/12/2008 - 12:11

So much was learned at the search sprint. First, I'd like to take the time to thank Chad for hosting us. I'd also like to thank everyone who came and participated. Sometimes it was slow going, but we definitely reached some large and important conclusions. It was a great experience; not only did it focus on Drupal's search implementation/framework, but it also covered wider topics such as unit testing and what a framework should cover, just to name a couple. First, let's go over the reason we were there in the first place.

Drupal Search

High-level Overview of Findings

Let us start with a camp-fire talk of what we (search sprinters) found and talked about during our sprint. The current search implementation in Drupal is good. However, we also know that "search" is not what Drupal is made for; it's one aspect of the picture. Relevance, search, matching, and all its sub-genres make up a whole field of study. PhD theses are built on this subject matter. That said, we also all agreed that search can probably be handled better by 3rd party implementations (using a dedicated search appliance, knowledge network, Lucene, etc.). The current framework allows for 3rd party search integration, but only as a one-sided search. For example, say I want to search a 3rd party knowledge base on Company A's intranet when I go to the search page. This is totally possible in the current framework; however, that search is totally isolated. It has its own tab, and is basically a single information silo.

Following that paradigm, one can think of each search implementation as its own information silo. In order to get the whole picture of what one is searching for, one has to go to each tab, and search for the same item.

As we mentioned before in another post, one cannot override the default content (node) and user searches. These are tightly bound to the "node" and "user" modules, both intrinsic to Drupal. In order to override these searches, one has to create a new hook_search in a module.

Now, the immediate question that comes to mind is "why would I want to separate the node and user searches from the node and user modules?" Indexing. As I mentioned earlier in this post, 3rd party appliances/engines/implementations may have a better or different implementation of search, and the current framework does not allow us to leverage these implementations for the default node and user searches.

For example, what if your site's nodes were tons of code snippets, and there were a special search engine/indexer that indexed programming code in a nice, fast, efficient way and was able to return complex relevance scores? If we were able to use our own "codeindexer" to index nodes, and then have this search show up under a "Content" tab or whatnot, that would be fantastic.

General Recommendations

We found a few recommendations during the sprint. For this post, I am going to talk about the indexing recommendation as that is what I hit the most. One should read the posts/blogs of the other participants for additional information and/or details on other recommendations, etc.

  • Doug Green
  • Chad Fennel
  • David Lesieur
  • Blake Lucchesi
  • Robert Douglass

For this post, I am going to talk about the aspect of the "larger pie" that I worked on. The search sprint came to a grand vision (obviously this vision will likely go through a transformation as it's refined); however, to reach that place we need to fix some pieces to complete the puzzle. Most of the patches we made are a small part of the larger picture.

Splitting the Search Index

Regarding my above example of splitting up search indexing, we started on a path to do just that. We first split the search into 2 modules: nodesearch and usersearch. The main problem with this implementation off the bat is that Drupal uses the module name as the callback. Thus, once this is split, the search path now looks like search/nodesearch or search/usersearch. As a result, I needed to deal with the "tabs" and "path" issue.
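To make the path issue concrete, here is a minimal sketch in Python (not Drupal's actual PHP code). The function name `search_path` is hypothetical; it only illustrates the rule described above, that the path is built from the module name.

```python
def search_path(module_name: str) -> str:
    """Build a search path as 'search/' plus the implementing module's name."""
    return "search/" + module_name


# Before the split, the core modules provide the search callbacks.
paths_before = [search_path(m) for m in ("node", "user")]

# After the split, the new module names leak into the URLs.
paths_after = [search_path(m) for m in ("nodesearch", "usersearch")]

print(paths_before)
print(paths_after)
```

Any bookmark or link that pointed at the old paths would stop matching after the split, which is why the tabs and the paths have to be handled together.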

Merging Search Tabs

The item that I tasked myself with was the merging of the different search tabs. I knew I needed to do 2 things:

  1. Give the ability to merge different search implementations under 1 tab.
  2. Merge the search results together, as order matters.

This led to a few problems:

  1. Efficiency due to implementation ( this will be discussed later in the post )
  2. How/should one "re-normalize" the search results?
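One possible answer to the second question, offered only as a sketch and not as what the patch does, is to rescale each implementation's scores to a common range before merging. The example below uses min-max normalization in Python; the function name and the result structure (a dict with a "score" key) are assumptions made for illustration.

```python
def normalize_scores(results):
    """Rescale one implementation's scores to the range [0, 1] (min-max)."""
    if not results:
        return []
    scores = [r["score"] for r in results]
    low, high = min(scores), max(scores)
    span = high - low
    normalized = []
    for r in results:
        # If every score is equal, treat all results as equally relevant.
        value = 1.0 if span == 0 else (r["score"] - low) / span
        normalized.append({**r, "score": value})
    return normalized


raw = [
    {"title": "a", "score": 2.0},
    {"title": "b", "score": 6.0},
    {"title": "c", "score": 4.0},
]
print([r["score"] for r in normalize_scores(raw)])
```

Whether scores from two unrelated engines are really comparable after rescaling is exactly the open question; this only puts them on the same scale.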

Next, I merged the tabs (#252211). This provides the following:

  • A search implementation's results can be merged under another search implementation's tab, and the merged search's own tab will not show (though it will still be searched).
  • "Merged tabs" results are re-ordered by score, so that there is some ordering.

As a result, there is now a back-end interface to have a search implementation's results show up under another tab, and that tab will now show unified results.

Say you wanted to have the user search implementation show up under the 'Content' tab (the node search implementation). When you search for 'admin', all nodes relating to admin will be searched, AND the users found by the user search relating to admin will also be shown, all under the same tab.
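The 'Content' tab example can be pictured as a simple mapping from each search implementation to the tab its results appear under. This is a hypothetical Python sketch, not the interface from the patch; `MERGE_INTO`, `visible_tabs`, and `implementations_for_tab` are names invented for illustration.

```python
# Hypothetical configuration: implementation name -> tab it is shown under.
# Here the user search is merged into the node ("Content") tab.
MERGE_INTO = {"node": "node", "user": "node"}


def visible_tabs(merge_into):
    """Tabs that are still shown: only the targets of the mapping get a tab."""
    seen = []
    for target in merge_into.values():
        if target not in seen:
            seen.append(target)
    return seen


def implementations_for_tab(tab, merge_into):
    """Every implementation that must be run when the given tab is searched."""
    return [impl for impl, target in merge_into.items() if target == tab]


print(visible_tabs(MERGE_INTO))
print(implementations_for_tab("node", MERGE_INTO))
```

With this mapping the user tab disappears, but the user search still runs whenever the node tab is searched.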

However, then comes the problem of normalizing the results. The current implementation is:

  1. Loop through the searches that should be in that tab
  2. Loop through the merged array of results again and order them by their scores
  3. Loop through results and theme each result
  4. Display
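The four steps above can be sketched as follows. This is a simplified Python illustration, not the PHP from the patch; `merged_search`, the two toy search functions, and `theme_result` are hypothetical stand-ins, and the scores are made up.

```python
def merged_search(keys, implementations, theme):
    """Run every implementation for a tab, merge by score, and theme the results."""
    merged = []
    # Step 1: loop through the searches that belong under this tab.
    for search in implementations:
        merged.extend(search(keys))
    # Step 2: order the merged results by score, highest first.
    merged.sort(key=lambda result: result["score"], reverse=True)
    # Step 3: theme each result; the caller handles step 4 (display).
    return [theme(result) for result in merged]


def node_search(keys):
    return [
        {"title": f"Node about {keys}", "score": 0.9},
        {"title": f"Old node mentioning {keys}", "score": 0.2},
    ]


def user_search(keys):
    return [{"title": f"User {keys}", "score": 0.5}]


def theme_result(result):
    return f"{result['title']} ({result['score']:.1f})"


page = merged_search("admin", [node_search, user_search], theme_result)
for line in page:
    print(line)
```

Note that the sort only gives a meaningful order if the scores from the different implementations are on a comparable scale, which is the normalization problem mentioned above.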

A simple rudimentary analysis of this work-flow is something like:

Let L = number of rows actually processed (number of times we loop)
Let S = number of searches that should be under that tab
Let R = number of results that will/should be displayed on a page
L = (S * R) * 2 * R

On the premise that:

  1. We must loop through each of the results for the number that MAY be displayed [ (S * R) ].
  2. We must then loop through all the results again to order them by scoring [ (S * R) * 2 ]. (Note: perhaps this could change with a better sorting algorithm than a linear pass.)
  3. We must then loop through the results to theme each returned one [ (S * R) * 2 * R ].
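To get a feel for how the estimate grows, here is a small Python sketch that evaluates the formula as posted. For comparison, and purely as an alternative reading of the work-flow, it also counts the loops under the assumption that the three passes run one after another and are not nested. Both function names are hypothetical.

```python
def loops_as_posted(s, r):
    """The estimate from the post: L = (S * R) * 2 * R."""
    return (s * r) * 2 * r


def loops_if_sequential(s, r):
    """Alternative count, assuming three sequential passes:
    collect S*R rows, visit S*R rows again to order them, then theme R rows."""
    return s * r + s * r + r


# Two merged searches, ten results per page.
print(loops_as_posted(2, 10))
print(loops_if_sequential(2, 10))
```

Under the posted formula the count grows with the square of the page size, while the sequential reading grows linearly; which one applies depends on how the loops are actually nested in the implementation.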

As you can see, this is not very efficient, because lemme tell ya, the graph of that function is pretty nasty :). One nice thing to implement would be a way to get ALL search results in 1 query. However, that is for another discussion, in which we talked about passing around a SearchQueryObject/Structure and then creating the "search implementation" from that (e.g. perhaps it would create SQL to search Drupal's search index, or perhaps XML to search a different type of index).
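As a rough illustration of the SearchQueryObject idea, the Python sketch below describes a search once and renders it either as SQL or as XML. Everything here is an assumption for illustration: the class name, its fields, the table and column names in the SQL, and the XML element names are hypothetical and do not reflect an agreed design or Drupal's actual schema. It also assumes at least one keyword and one type are given.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class SearchQuery:
    """Backend-neutral description of a search (hypothetical structure)."""
    keywords: list
    types: list
    limit: int = 10

    def to_sql(self):
        """Render one parameterized query covering every requested type."""
        word_marks = ", ".join(["?"] * len(self.keywords))
        type_marks = ", ".join(["?"] * len(self.types))
        sql = (
            "SELECT sid, type, SUM(score) AS score FROM search_index "
            f"WHERE word IN ({word_marks}) AND type IN ({type_marks}) "
            "GROUP BY sid, type ORDER BY score DESC LIMIT ?"
        )
        return sql, [*self.keywords, *self.types, self.limit]

    def to_xml(self):
        """Render the same query as XML for a different kind of index."""
        root = ET.Element("search", limit=str(self.limit))
        for word in self.keywords:
            ET.SubElement(root, "keyword").text = word
        for search_type in self.types:
            ET.SubElement(root, "type").text = search_type
        return ET.tostring(root, encoding="unicode")


query = SearchQuery(keywords=["admin"], types=["node", "user"])
sql, params = query.to_sql()
print(sql)
print(params)
print(query.to_xml())
```

Because the SQL form asks for all types at once and orders by score in the database, it hints at how the merged search might be done in 1 query instead of one query per implementation.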

My current path on the patch is to:

  • Clean up the interface
  • Add more tests
  • Work on a way to perhaps do the merged search in 1 query.

That said, any comments/thoughts would def. be appreciated.

Patches

Issue number | Who              | Summary                                     | Status
256792       | Doug             | refactor search form                        | needs review
22627        | David/Doug       | pager count                                 | needs review
256678       | Doug/Ernest      | search type help                            | almost RTBC
145242       | Doug             | refactor node rank                          | RTBC
252211       | Earnest          | merged tabs                                 | needs work
70722        | David/Djun       | search exposes private data in search query | working on simpletest
***          | David/Robert/all | Search Parsed Query                         | BLUE SKY
54622        | David            | db_rewrite patch                            | Open
257033       | Blake/Chad       | test coverage for search simplify           | needs work
257007       | Robert           | inputs for search simplify                  | Open

Patch details and notes below:

  • #145252 refactor node rank. I introduced this a while ago, but we rerolled it this weekend and wrote a test case, giving it enough attention to RTBC. This patch has already spawned another patch #257216, that will expose link relevancy as an additional scoring factor.
  • #257279 search performance improvement, remove extra join. This was a hidden gem that Robert found today that has been around for a long time. This patch can and should be backported to 5.x and 6.x. This is a big win with little cost, and is also RTBC.
  • #256678 - Display search help based on type. This is a pretty simple patch that displays different help text when searching for nodes than it displays for users. It could use a few positive comments to move it along.
  • #22627 - Show result count and ranges. This patch adds a position count and total number of results to the search results page, that can be themed. This pretty much works, but needs to be retested, and needs a few more positive comments about the concept.
  • #256792 - refactor advanced search form and keywords. I blogged about this patch before and all it's missing is the test case, which Chad has made progress on, but not uploaded yet.
  • #257196 - ignore JavaScript during indexing. This came to my attention via Arthur this morning. I've got a pretty simple solution to this that I think is a good first step, but it definitely needs a few other HTML experts to review and comment on it.
  • #177722 - devel batch patch for creating lots of nodes. I wrote this during the Barcelona DrupalCon for the original 6.x search patch, so that we could test search on big datasets. I think that this is pretty much working now, and I used it last night to create 100,000 nodes.
  • #257033 - test coverage for search simplify. This patch adds a needed test case.
  • #70722 - search results expose private information. This patch fixes a problem and only needs a test case that I think Djun is working on.
  • #257244 - improper normalization of comment and statistics node ranks. It appears that the reason the comment and statistics node ranks don't work quite as expected is that the normalization of their scores may be off. This patch definitely needs review. When you do so, please read Robert's article on search results.

Tags: Drupal, Minnesota, search sprint
