Tuesday 17 September 2013

Multiple term search algorithm (naive)

The other week I was working on a search algorithm for a website I have been developing, which utilises Java 7, Spring MVC 3, Jetty, Bootstrap, jquery et al. The algorithm has to search certain database tables and return a list of results in a semi-useful order.

What is semi-useful to the customer? For me, just being returned a list of results in alphabetical order isn't what I would deem as "semi-useful". There is also the separate issue of how to deal with the search terms: take the entire contents of the search box as an entire search term, split on spaces/punctuation (e.g. commas), try and form multi-word search terms that would make sense etc. etc.

For the search term issue, I actually cheated slightly. Due to the nature of the data for the website I was able to cheat, and state that search terms should be comma separated. I know this isn't necessarily the best solution, but provides slightly more flexibility than "everything in the search box is one search term".

Now the search term conundrum is out of the way, how should I go about creating a suitable search algorithm? Initially I took the approach of just producing something naive that iterates over the given search terms, and goes poking the database for results.


Now this was a little bit too naive as duplicate objects were present when a record was returned for multiple terms. Changing the LinkedList to a Set resolved this (overriding both hashCode and equals to ensure each object of ResultData is unique, and the hashcode is deterministic).


Now for the more interesting part: returning the results in a semi-useful order. Providing each record returned a weighting, based on how many search terms returned it as a result seemed pretty suitable (higher the weighting, the more likely the record is the right one for the customer).

Simply use Map<ResultData, Integer>, incrementing the integer value related to a ResultData key if the key already exists, or putting a new instance of ResultData into the map with a integer value of one.


There are a few things to note in the above implementation. First, results has been changed from a Collection to a Map (as Maps aren't part of Java collections). Secondly, the introduction of the nested for-loop that iterates over the databases results, checking if each ResultData record has been returned previously (results.get(record) will return null if the record has not been seen before), and adding the record to the map with an updated weighting. And finally, the search method now returns null. Well that doesn't sound right. A sorted map (using a HashMap is not suitable here as the ordering of it's key-pair values are non-deterministic) is sorted by key, whilst we want ordering by descending order of value.

The next step here would be create a new another map, which contains a bespoke comparator for ResultData that takes in the initial map of results. This map can be used to get the value for each key, and provide some order to the second map. This sounds quite horrible, and I really want to go have some tea, so I am vetoing this solution, and instead going hunting for some other data types that may be useful to me.

After some quick searching, I decided that using the Google Collections (now known as Guava) would be the way forward (and fun to try something new). Reviewing the API resulted in me discovering the TreeMultiMap, and rubbing my hands together rather happily.
Implementation of Multimap whose keys and values are ordered by their natural ordering or by supplied comparators. In all cases, this implementation uses Comparable.compareTo(T) or Comparator.compare(T, T) instead of Object.equals(java.lang.Object) to determine equivalence of instances.

Time to create a comparator it seems for ResultData, and start playing with TreeMultiMap.


The RecordDataComparator is simply ordering RecordData in alphabetical order by comparing title's (and if they're equal, unique id's).

The new search algorithm is more interesting. This is iterating over each record returned by the database, and checking if the record already exists in the value side of the map (note that it is now MultiMap<Integer, ResultData> and not Map<ResultData, Integer>). If the record has already been added to the map then we need to find the key by iterating over the key set, identifying it's key (i.e. it's weighting), and adding the record to key with a higher value. If the record does not already exist in the map then we add it with a weighting of 1. Every key-pair element in the map is ordered by the key (descending order), and then by the ResultDataComparator. By calling sortedResults.values(), you receive the search results in descending weighted order (and subsequently in alphabetical order).

How MultiMap works is that when you add a value to a key that is already present within the map, it chains that value onto the end of the values that already exist for that key. This allowed me to creating weighting "buckets", where records are added put into buckets based on how frequent they are returned by the database.

I haven't performance tested this solution yet, but I suspect it will suffer the same pitfalls as many queueing strategies, where the majority of the elements will sit in the keys 1 and 2, making the checks on the value sets of these keys rather expensive.

This is only an initial implementation of a searching algorithm that provides results in a semi-useful order, and can no doubt use some more work!!