Tuesday, 17 September 2013

Multiple term search algorithm (naive)

The other week I was working on a search algorithm for a website I have been developing, which utilises Java 7, Spring MVC 3, Jetty, Bootstrap, jquery et al. The algorithm has to search certain database tables and return a list of results in a semi-useful order.

What is semi-useful to the customer? For me, just being returned a list of results in alphabetical order isn't what I would deem as "semi-useful". There is also the separate issue of how to deal with the search terms: take the entire contents of the search box as an entire search term, split on spaces/punctuation (e.g. commas), try and form multi-word search terms that would make sense etc. etc.

For the search term issue, I actually cheated slightly. Due to the nature of the data for the website I was able to cheat, and state that search terms should be comma separated. I know this isn't necessarily the best solution, but provides slightly more flexibility than "everything in the search box is one search term".

Now the search term conundrum is out of the way, how should I go about creating a suitable search algorithm? Initially I took the approach of just producing something naive that iterates over the given search terms, and goes poking the database for results.


Now this was a little bit too naive as duplicate objects were present when a record was returned for multiple terms. Changing the LinkedList to a Set resolved this (overriding both hashCode and equals to ensure each object of ResultData is unique, and the hashcode is deterministic).


Now for the more interesting part: returning the results in a semi-useful order. Providing each record returned a weighting, based on how many search terms returned it as a result seemed pretty suitable (higher the weighting, the more likely the record is the right one for the customer).

Simply use Map<ResultData, Integer>, incrementing the integer value related to a ResultData key if the key already exists, or putting a new instance of ResultData into the map with a integer value of one.


There are a few things to note in the above implementation. First, results has been changed from a Collection to a Map (as Maps aren't part of Java collections). Secondly, the introduction of the nested for-loop that iterates over the databases results, checking if each ResultData record has been returned previously (results.get(record) will return null if the record has not been seen before), and adding the record to the map with an updated weighting. And finally, the search method now returns null. Well that doesn't sound right. A sorted map (using a HashMap is not suitable here as the ordering of it's key-pair values are non-deterministic) is sorted by key, whilst we want ordering by descending order of value.

The next step here would be create a new another map, which contains a bespoke comparator for ResultData that takes in the initial map of results. This map can be used to get the value for each key, and provide some order to the second map. This sounds quite horrible, and I really want to go have some tea, so I am vetoing this solution, and instead going hunting for some other data types that may be useful to me.

After some quick searching, I decided that using the Google Collections (now known as Guava) would be the way forward (and fun to try something new). Reviewing the API resulted in me discovering the TreeMultiMap, and rubbing my hands together rather happily.
Implementation of Multimap whose keys and values are ordered by their natural ordering or by supplied comparators. In all cases, this implementation uses Comparable.compareTo(T) or Comparator.compare(T, T) instead of Object.equals(java.lang.Object) to determine equivalence of instances.

Time to create a comparator it seems for ResultData, and start playing with TreeMultiMap.


The RecordDataComparator is simply ordering RecordData in alphabetical order by comparing title's (and if they're equal, unique id's).

The new search algorithm is more interesting. This is iterating over each record returned by the database, and checking if the record already exists in the value side of the map (note that it is now MultiMap<Integer, ResultData> and not Map<ResultData, Integer>). If the record has already been added to the map then we need to find the key by iterating over the key set, identifying it's key (i.e. it's weighting), and adding the record to key with a higher value. If the record does not already exist in the map then we add it with a weighting of 1. Every key-pair element in the map is ordered by the key (descending order), and then by the ResultDataComparator. By calling sortedResults.values(), you receive the search results in descending weighted order (and subsequently in alphabetical order).

How MultiMap works is that when you add a value to a key that is already present within the map, it chains that value onto the end of the values that already exist for that key. This allowed me to creating weighting "buckets", where records are added put into buckets based on how frequent they are returned by the database.

I haven't performance tested this solution yet, but I suspect it will suffer the same pitfalls as many queueing strategies, where the majority of the elements will sit in the keys 1 and 2, making the checks on the value sets of these keys rather expensive.

This is only an initial implementation of a searching algorithm that provides results in a semi-useful order, and can no doubt use some more work!!

Monday, 5 August 2013

System.Net.HttpWebRequest resource constraints and API design

At work we had a user report connectivity issues through our .NET API that had been wrapped in a third party's system, where they were using four different user accounts to trade on four different strategies. The user was receiving the following exception:

2013-07-23 19:37:14,015 [11] ERROR TradingSession [(null)] - The operation has timed out, Description: URI: https://testapi.13.com/cancel, Exception: System.Net.WebException: The operation has timed out
   at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
   at System.Net.HttpWebRequest.GetRequestStream()

This exception message did not appear in our main code base, so where do you begin? We eliminated network latency, resource constraints (the user's server had 4GB RAM instead of the recommended 8GB by the third party), the vendor system, and our main code base; leaving only the API.

After I wrote an aggressive (and very naive) trading bot we managed to replicate the issue consistently. Whilst traversing the .NET API I found a suspicious piece of code:


When a HttpWebRequest (see stack trace) is created, it is provided with a ServicePoint object by the ServicePoiuntManager, which instantiates the ServicePoint with whatever values it currently has.

ServicePointManager.DefaultConnectionLimit = 5 meant no more than five simultaneous web connections could be made, and and all further requests would be queued once that limit is met. The user was making so many web requests that they these requests were being queued up and timing out before they could be executed.

So we have found our problem, now how do we resolve it? Just hard coding a higher number for DefaultConnectionLimit is not the correct solution as you're just mitigating the problem, so making this value configurable is the ultimate goal.

The obvious place to do this is in the constructor as to guarantee that no web requests have been made before the value is set, and also because the HttpInvoker (which HttpWebRequest belongs to) to instantiated in the API constructor. However, when I found our API constructor I saw the following:


Making an API-breaking change is not an option (just adding the property onto one of constructors) is not an option, and there is no way that I am creating a further two instructors to just deal with this optional property. What if we want to make another property configurable next week, do we create another four constructors to deal with all the properties? I think not.

The most obvious solution to me was to use a builder pattern for our API instantiation. Using a builder pattern doesn't feel "right" for an API, especially for a mature one. So we opted for mystery option three (which had been established elsewhere in the internals of our API), and created the following:



By creating a new Options object we were able to avoid adding two new constructors to accommodate the different property options available, didn't break the API, and produced a solution that is expandable for the future.

Hopefully you have found this interesting for both potential contention issues within HttpWebRequest, and some basic API design.

For more information on ServicePointManager and HttpWebRequest have a look at MSDN blogs.