it’s been a while since my last post – and as interests fade with time, others jump up faster than a beach ball at a nickelback concert.

so i’ve been looking into solr the last couple of days.  solr is relatively new in the arena and probably outshone a bit in popularity by other search projects such as lucene and nutch.  “but why solr?”, you may find yourself asking.

well, solr has a couple of tricks up its sleeve, which is likely due to the fact that it’s a fresher take on the old, dare i call it legacy, search engines.

some of the features of solr include:

  • highly scalable java search server (as i will try to show you through this article)
  • works with the lucene search library (tried and tested)
  • you can update your engine with xml using a kind of lightweight restful service.
  • can parse html, openoffice and microsoft office documents, pdfs etc. using lucene parsers.
  • custom tokenizer, filter and analyzer steps for control over indexing and query processing
  • extremely rich indexing of fields and metadata, including numbers
  • can combine fields for fulltext-type searching
  • tuned for high performance even when updating
  • spell checkers included, etc., to name only a couple of features.

click here to view a more complete list.

solr currently has apis for ruby, php, python, json, forrest/cocoon.

the obvious elephant missing from the list above is that solr does not ship with a crawler of any sort.  luckily some friendly open source guys already released a guide for integrating nutch with solr for an all round experience.

you might also want to look into localsolr – which enables geographical and spatial searches with solr

now… where to begin:

first things first – since solr is java based, it has to be wrapped in a java container of some sort.  i chose the friendly open source tomcat as a basis.  i installed tomcat 5.5 using this helpful guide as a reference.  read a bit more than what is required and make sure you understand the concept of web applications and the configuration of tomcat.

after tomcat is installed, it’s time for solr.  and as luck would have it :-) – a solr installation in tomcat is probably one of the easiest, most straightforward ways of doing it – and i just followed this handy guide, provided by solr, as a reference.

now after solr is up and running, you should be able to access your admin screen via the following url http://127.0.0.1:8180/solr

that would be assuming your tomcat port is 8180 and solr is installed under the directory solr in your web applications.

now, after installing solr and accessing the admin screen, you should be able to view your configuration and perform a basic search.  updating the schema and importing data into solr is a whole other post on its own, and as soon as i figure it out, you will be the first to know – but for now we are going to focus on replication and scalability.
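a basic search does not have to go through the admin screen; it can also be sent over plain http.  the sketch below assumes the same tomcat port and context as above and uses solr's standard /select handler with a q parameter.  the function name solr_search and the SOLR_URL variable are just names made up for this example.

```shell
#!/bin/sh
# base url of the solr web application (assumption: tomcat on port 8180, context "solr")
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8180/solr}"

# run a basic query against the /select handler and print the raw xml response
# -G turns the data into a query string, --data-urlencode escapes spaces etc.
solr_search() {
    curl -s -G --data-urlencode "q=$1" "$SOLR_URL/select"
}

# usage (not run automatically):
#   solr_search 'beach ball'
```

on a fresh install with nothing indexed yet, this should simply come back with an empty result set.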

look at the following diagram (done by me using gliffy)

Solr Replication and Load Balancing

as you can see above, there are 3 main levels:

  1. load balancers (wrapped in a monitoring service – keepalived which is optional)
  2. solr slave instances
  3. solr master

you can see that it isn’t the most original idea in the book, but it works, especially if you run it in a cloud where the number of instances is almost unlimited.

searching with solr

let’s start from the bottom up.  at the base you have the solr master, which is the solr instance where all your data importing from xml happens.

above that are the slave instances, which replicate their indexes from the master.  the slaves in turn handle all search requests from the load balancers, which sit at the top.  this reduces load on the master, which is used as the primary indexer.

the load balancers will also be wrapped in a monitoring system, which will alert you in case something goes wrong and also reduce the risk of a single point of failure at the entry point.  this monitoring system is really optional, since the load balancer i will be using has built-in failover checks.  nevertheless, it is good practice to look into a monitoring system like keepalived for your architecture.

now for the specifics – updating the index of your master is almost a trivial exercise, since solr has a kind of restful approach to doing this:

#!/bin/sh
# the following would be in a .sh file
URL=http://localhost:8180/solr/update

# post every xml file given on the command line to the update handler
for f in "$@"; do
  echo "Posting file $f to $URL"
  curl "$URL" --data-binary "@$f" -H 'Content-type:text/xml; charset=utf-8'
done

# send the commit command to make sure all the changes are flushed and visible
curl "$URL" --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

so – if you save the above script as e.g. post.sh and make it executable (chmod a+x post.sh), you can run it with an xml dataset (./post.sh dataset.xml), and it will update the index using solr’s semi restful interface.  all you have to do is change the url in the above script to the solr instance you wish to update.
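if you don’t have a dataset.xml lying around yet, here is a small sketch that writes one in solr’s xml update format (an add element wrapping one doc with a few field elements).  the field names id and name are an assumption on my side: they have to match fields defined in your own schema.xml.  the function names are made up for this example.

```shell
#!/bin/sh
# escape the xml special characters &, < and > in a value
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# print an <add> message containing a single document
# $1 = value for the "id" field, $2 = value for the "name" field
# (both field names are assumptions and must exist in your schema.xml)
make_add_doc() {
    printf '<add>\n  <doc>\n'
    printf '    <field name="id">%s</field>\n' "$(xml_escape "$1")"
    printf '    <field name="name">%s</field>\n' "$(xml_escape "$2")"
    printf '  </doc>\n</add>\n'
}

# usage (not run automatically):
#   make_add_doc 1 'my first document' > dataset.xml
#   ./post.sh dataset.xml
```

the escaping step matters more than it looks: a stray & or < in a field value makes the whole xml message invalid.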

then comes the replication of data to the slaves – this is done using the scripts in the bin directory of your solr installation.  here are the descriptions of these scripts from the solr website.  we would most likely be interested in the snapshooter, which takes a snapshot of the current master index.  from there you would use the snappuller on each slave to pull the latest snapshot from the master, and the snapinstaller to install it into the slave’s index.

after you’ve mastered these scripts, you can run them as cron jobs every so often to update the slave indexes.  i’ve also read on their site that in the new release they will be building replication into solr as a feature, where you can just tweak a couple of configuration settings and it should replicate without external tools, so keep a lookout for the new solr version.
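as a rough sketch of what such a cron job on a slave could call: the wrapper below runs snappuller and only runs snapinstaller (the script from the same bin directory that installs a pulled snapshot) if the pull succeeded.  the path /opt/solr/bin and the name sync_slave are assumptions for this example, and it assumes the scripts already know where your master lives.

```shell
#!/bin/sh
# location of the solr replication scripts (assumption: adjust to your solr home)
SOLR_BIN="${SOLR_BIN:-/opt/solr/bin}"

# pull the latest snapshot from the master and, only if that worked,
# install it into this slave's index
sync_slave() {
    "$SOLR_BIN/snappuller" && "$SOLR_BIN/snapinstaller"
}

# usage (not run automatically), e.g. from a script called by cron:
#   sync_slave
```

saved with a final sync_slave call as, say, /opt/solr/bin/sync_slave.sh (a hypothetical path), a crontab entry like "*/5 * * * * /opt/solr/bin/sync_slave.sh" would refresh the slave every five minutes.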

load balancing with haproxy

so as a load balancing solution i chose haproxy, which offers high availability and balancing for tcp and http based applications.  haproxy does not serve content the way apache does, nor does it cache the way squid does, but it’s small, simple and works very well.

to install haproxy:

mkdir /opt/haproxy
cd /opt/haproxy
wget http://haproxy.1wt.eu/download/1.3/src/haproxy-1.3.15.7.tar.gz
gunzip haproxy-1.3.15.7.tar.gz
tar -xf haproxy-1.3.15.7.tar
cd haproxy-1.3.15.7
make
cp ./haproxy /usr/bin/haproxy


then create a file called haproxy.cfg that contains the following data:


global
log 127.0.0.1   local0               #logs all haproxy info to local0 log
log 127.0.0.1   local1 notice    #logs all notifications to local1 log


daemon                                     #specifies haproxy to run as a daemon instance
maxconn         4096                 # total max connections (dependent on ulimit)


defaults   #setup some default values
log            global
mode       http
option      httplog
option      dontlognull


clitimeout        60000       # maximum inactivity time on the client side
srvtimeout        30000       # maximum inactivity time on the server side
timeout connect   4000        # maximum time to wait for a connection attempt to a server to succeed


option            httpclose     # disable keepalive (HAProxy does not yet support the HTTP keep-alive mode)
option            abortonclose  # enable early dropping of aborted requests from pending queue
option            httpchk       # enable HTTP protocol to check on servers health
option            forwardfor    # enable insert of X-Forwarded-For headers


balance roundrobin            # each server is used in turns, according to assigned weight

stats enable                  # enable web-stats at /haproxy?stats
stats auth        admin:pass  # force HTTP Auth to view stats
stats refresh     5s        # refresh rate of stats page


listen myloadbalancer 127.0.0.1:8888 #where the loadbalancer should listen for requests
server slave1 192.168.1.6:8180 #a slave

server slave2 192.168.1.6:8180 #another slave

and that is it.  now all you do is run:

/usr/bin/haproxy -f haproxy.cfg
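a small optional safety net: haproxy has a -c flag that only checks the config file and exits, so you can refuse to start on a broken config.  the sketch below assumes the binary lives in /usr/bin/haproxy as installed above; the function name start_haproxy and the HAPROXY variable are made up for this example.

```shell
#!/bin/sh
# path to the haproxy binary (assumption: copied to /usr/bin as in the install steps)
HAPROXY="${HAPROXY:-/usr/bin/haproxy}"

# check the config first (-c), and only start haproxy if the check passes
# $1 = config file, defaults to haproxy.cfg in the current directory
start_haproxy() {
    cfg="${1:-haproxy.cfg}"
    if "$HAPROXY" -c -f "$cfg"; then
        "$HAPROXY" -f "$cfg"
    else
        echo "not starting haproxy: $cfg failed the config check" >&2
        return 1
    fi
}

# usage (not run automatically):
#   start_haproxy haproxy.cfg
```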

and bob’s your uncle – just go to http://127.0.0.1:8888/haproxy?stats to verify that it is running (the username is: admin and the password: pass – as specified in the config file above)

just note – if you get binding errors, make sure the port you specify in the config file is open, and not in use by something like apache, tomcat, yaws etc.

now if you have all your slave instances running a version of solr, you should easily be able to connect to them in a round robin fashion, e.g. by connecting to http://127.0.0.1:8888/solr.  you can confirm this in the syslog or with the above stats link.
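to give the balancer something to distribute, you can fire a handful of requests at it and then watch the per-server session counters on the stats page.  this is only a sketch: the function name poke_balancer is made up, and the default url assumes your solr config exposes the /solr/admin/ping handler; any url that your slaves answer will do.

```shell
#!/bin/sh
# send a number of requests through the load balancer and print each http status
# $1 = url to request (assumption: the ping handler behind the balancer on port 8888)
# $2 = number of requests, defaults to 10
poke_balancer() {
    url="${1:-http://127.0.0.1:8888/solr/admin/ping}"
    count="${2:-10}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # -o /dev/null drops the body, -w prints only the status code
        curl -s -o /dev/null -w '%{http_code}\n' "$url"
        i=$((i + 1))
    done
}

# usage (not run automatically):
#   poke_balancer http://127.0.0.1:8888/solr/admin/ping 20
```

with round robin balancing and equal weights, the session counts for slave1 and slave2 on the stats page should climb at roughly the same rate.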

monitor that architecture, with keepalived

last, but not least, would be enabling the load balancing monitor.  keepalived was suggested to me as a viable option; however, with haproxy you can do failover checks and run multiple load balancers under a shared ip, so this isn’t really a necessary part.  still, as explained earlier, it is good practice to run a monitor like keepalived on your system.

conclusion

because it is late, and i still have to proofread this sucker, i haven’t really got around to keepalived, although – do yourself a favor and read up on it – i know i certainly will have a cup of java and research it a bit more.

so – there you go – you have a fully load balanced system – running a search engine, which should scale horizontally with your hardware buying budget. :-)