Dewald Botha
open source web thoughts
it’s been a while since my last post – and as interests fade with time, others jump up faster than a beach ball at a nickelback concert.
so i’ve been looking into solr the last couple of days. solr is relatively new in the arena and is probably outshone a bit in popularity by other search projects such as lucene and nutch. “but why solr?”, you may find yourself asking.
well, solr has a couple of tricks up its sleeve – which is likely due to the fact that it’s a fresher take on the old, dare i call it legacy, search engines (it is actually built on top of lucene).
some of the features of solr include:
click here to view a more complete list.
solr currently has apis for ruby, php, python, json and forrest/cocoon.
the obvious elephant in the room is that solr does not ship with a crawler of any sort. luckily some friendly open source guys have already released a guide for integrating nutch with solr for an all-round experience.
you might also want to look into localsolr – which enables geographical and spatial searches with solr
first things first – since solr is java based, it has to be wrapped in a java container of some sort. i chose the friendly open source tomcat as a basis. i installed tomcat 5.5 using this helpful guide as a reference. read a bit more than what is required and make sure you understand the concept of web applications and the configuration of tomcat.
after tomcat is installed, it’s time for solr. and as luck would have it, a solr installation in tomcat is probably one of the easiest, most straightforward ways of doing it – i just followed this handy guide, provided by solr, as a reference.
now after solr is up and running, you should be able to access your admin screen via the following url http://127.0.0.1:8180/solr
that would be assuming your tomcat port is 8180 and solr is installed under the directory solr in your web applications.
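if you would rather check this from the shell than from a browser, something along these lines should do. it is only a rough sketch: the defaults are the host, port and webapp name assumed above, and `solr_admin_url` and `solr_is_up` are helper names i made up for this example.

```shell
#!/bin/sh
# build the solr url from host, port and webapp name
# (the defaults are the values assumed in this post, change them to match your tomcat)
solr_admin_url() {
    host=${1:-127.0.0.1}
    port=${2:-8180}
    webapp=${3:-solr}
    echo "http://$host:$port/$webapp"
}

# return 0 if solr answers without an http error (status below 400),
# non-zero if the request fails or times out after 5 seconds
solr_is_up() {
    curl --silent --fail --output /dev/null --max-time 5 "$(solr_admin_url "$@")"
}
```

after sourcing the file you would run something like `solr_is_up && echo "solr is up"`, or pass a different host, port and webapp name as arguments.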
now after installing solr and accessing the admin screen, you should be able to view your configuration and perform a basic search. updating the schema and importing data into solr is a whole other post on its own, and as soon as i figure it out, you will be the first to know – but for now we are going to focus on replication and scalability.
look at the following diagram (done by me using gliffy)

as you can see above – there are 3 main levels
you can see that it isn’t the most original idea in the book, but it works, especially if you run it within a cloud where the number of instances could be almost unlimited.
let’s start from the bottom up – at the base you have the solr master, which is the solr instance where all your data importing from xml happens.
above that are the slave instances, which replicate their indexes from the master. the slaves in turn handle all search requests coming from the load balancers, which sit at the top. this reduces load on the master, which is used as the primary indexer.
also, the load balancers will be wrapped in a monitoring system, which will alert you in case something goes wrong and also reduces the risk of a single point of entry failure. this monitoring system is really optional, since the load balancer i will be using has built-in failover checks. but nevertheless it is good practice to look into a monitoring system like keepalived for your architecture.
now for the specifics – to update the index of your master is almost a trivial exercise – since solr has a kind of a restful approach of doing this -
#!/bin/sh
#the following would be in a .sh file
URL=http://localhost:8180/solr/update
#post every file given on the command line to the update handler
for f in "$@"; do
  echo "Posting file $f to $URL"
  curl "$URL" --data-binary "@$f" -H 'Content-type:text/xml; charset=utf-8'
done
#send the commit command to make sure all the changes are flushed and visible
curl "$URL" --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
so – if you save the above script as e.g. post.sh and make it executable (chmod a+x post.sh), you can run it with an xml dataset (./post.sh dataset.xml), and it will update the index using solr’s semi restful interface. all you have to do is change the url in the above script to the solr instance you wish to update.
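for reference, a dataset file for the update handler is just solr’s xml update format: an `<add>` element wrapping one or more `<doc>` elements. the sketch below writes a tiny one to play with. the field names (`id`, `name`) are assumptions on my side and have to match whatever is defined in your schema.xml, and `write_sample_dataset` is a made-up helper name.

```shell
#!/bin/sh
# write a minimal solr xml update file to the given path (default: dataset.xml)
# note: the field names are placeholders, they must exist in your schema.xml
write_sample_dataset() {
    out=${1:-dataset.xml}
    cat > "$out" <<'EOF'
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="name">my first solr document</field>
  </doc>
</add>
EOF
}
```

after sourcing it, `write_sample_dataset dataset.xml` gives you a file that you can hand to the posting script.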
then comes the replication of data to the slaves – this is done by using the scripts in the bin directory of your solr installation. here are the descriptions of these scripts from the solr website. we would most likely be interested in the snapshooter, which takes a snapshot of the current master index. from there you would use the snappuller on each slave, which pulls the latest snapshot from the master, and the snapinstaller, which installs the pulled snapshot so that the slave’s index is updated.
after you’ve mastered these scripts, you can run them as cron jobs every so often to update the slave indexes. i’ve also read on their site that in the new release they will be building replication into solr as a feature, where you just tweak a couple of configuration settings and it replicates without external tools, so keep a lookout for the new solr version.
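a rough sketch of what the slave side of such a cron job could look like. it pulls a snapshot with snappuller and then installs it with snapinstaller (the script in the same bin directory that actually swaps the pulled snapshot in). i am assuming here that the scripts pick up their settings from conf/scripts.conf, so no flags are passed, that they live in `/opt/solr/bin` (a made-up path), and `replicate_once` is just a wrapper name i invented.

```shell
#!/bin/sh
# pull the newest snapshot from the master and install it on this slave.
# SOLR_BIN is an assumed location, point it at your own solr bin directory.
SOLR_BIN=${SOLR_BIN:-/opt/solr/bin}

replicate_once() {
    # only install when the pull succeeded
    "$SOLR_BIN/snappuller" && "$SOLR_BIN/snapinstaller"
}

# example crontab entry on a slave (every 5 minutes), assuming this file is
# saved as /opt/solr/bin/replicate.sh and ends with a call to replicate_once:
# */5 * * * * /opt/solr/bin/replicate.sh >> /var/log/solr-replicate.log 2>&1
```

the `&&` matters: if the pull fails there is nothing new to install, so the slave just keeps serving its current index until the next run.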
so as a load balancing solution i chose haproxy, which offers high availability and balancing for tcp and http based applications. haproxy does not serve content the way apache does, nor does it do caching the way squid does, but it’s small, simple and works very well.
to install haproxy:
mkdir /opt/haproxy
cd /opt/haproxy
wget http://haproxy.1wt.eu/download/1.3/src/haproxy-1.3.15.7.tar.gz
gunzip haproxy-1.3.15.7.tar.gz
tar -xf haproxy-1.3.15.7.tar
cd haproxy-1.3.15.7
make
cp ./haproxy /usr/bin/haproxy
then create a file called haproxy.cfg that contains the following data:
global
    log 127.0.0.1 local0 #logs all haproxy info to local0 log
    log 127.0.0.1 local1 notice #logs all notifications to local1 log
    daemon #specifies haproxy to run as a daemon instance
    maxconn 4096 # total max connections (dependent on ulimit)
defaults #setup some default values
    log global
    mode http
    option httplog
    option dontlognull
    clitimeout 60000 # maximum inactivity time on the client side
    srvtimeout 30000 # maximum inactivity time on the server side
    timeout connect 4000 # maximum time to wait for a connection attempt to a server to succeed
    option httpclose # disable keepalive (HAProxy does not yet support the HTTP keep-alive mode)
    option abortonclose # enable early dropping of aborted requests from pending queue
    option httpchk # enable HTTP protocol to check on servers health
    option forwardfor # enable insert of X-Forwarded-For headers
    balance roundrobin # each server is used in turns, according to assigned weight
    stats enable # enable web-stats at /haproxy?stats
    stats auth admin:pass # force HTTP Auth to view stats
    stats refresh 5s # refresh rate of stats page
listen myloadbalancer 127.0.0.1:8888 #where the loadbalancer should listen for requests
    server slave1 192.168.1.6:8180 check #a slave ("check" turns on the health checks for this server)
    server slave2 192.168.1.6:8180 check #another slave (give each slave its own address)
and that is it. now all you do is run:
/usr/bin/haproxy -f haproxy.cfg
and bob’s your uncle – just go to http://127.0.0.1:8888/haproxy?stats to verify that it is running (the username is: admin and the password: pass – as specified in the config file above)
just note – if you get binding errors, make sure the port you specify in the config file is open, and not in use by something like apache, tomcat, yaws etc.
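it can also save some head scratching to let haproxy parse the config before starting it for real. the `-c` flag makes haproxy check the file and exit without starting anything, so it catches syntax mistakes, though not a port that is already taken. the sketch below wraps that in a small function; `check_haproxy_config` and the `HAPROXY_BIN` variable are names i made up, and the default binary path is the one from the install steps above.

```shell
#!/bin/sh
# check a haproxy config file without starting the load balancer.
# returns 2 if the file cannot be read, otherwise haproxy's own exit status.
check_haproxy_config() {
    cfg=${1:-haproxy.cfg}
    haproxy_bin=${HAPROXY_BIN:-/usr/bin/haproxy}
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    # -c: check mode, -f: config file to check
    "$haproxy_bin" -c -f "$cfg"
}
```

something like `check_haproxy_config haproxy.cfg && /usr/bin/haproxy -f haproxy.cfg` then only starts haproxy when the file parses cleanly.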
now if you have all your slave instances running a version of solr, you should easily be able to connect to them in a round robin fashion, e.g. by connecting to http://127.0.0.1:8888/solr – you can confirm this in the syslog or with the above stats link.
last, but not least, would be enabling the load balancing monitor – keepalived was suggested to me as a viable option. however, with haproxy you can do failover checks and run multiple load balancers under a shared ip, so this isn’t really a necessary part, but as explained earlier, it is good practice to run a monitor like keepalived on your system.
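in case "round robin" is new to you, here is a toy illustration of the idea in shell. it is not how haproxy implements it (the real thing also takes server weights into account), it just shows that request number n goes to server n modulo the number of servers. `pick_server` is a made-up function name.

```shell
#!/bin/sh
# print which server a plain, unweighted round robin would pick.
# usage: pick_server <request number, starting at 0> <server> [<server> ...]
pick_server() {
    n=$1
    shift
    # skip (n modulo number of servers) entries, the next one is the pick
    shift $(( n % $# ))
    echo "$1"
}
```

so with two slaves, `pick_server 0 slave1 slave2` and `pick_server 2 slave1 slave2` both land on slave1, while requests 1 and 3 land on slave2.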
because it is late, and i still have to proofread this sucker, i haven’t really got around to keepalived, although – do yourself a favor and read up on it – i know i certainly will have a cup of java and research it a bit more.
so – there you go – you have a fully load balanced system – running a search engine, which should scale horizontally with your hardware buying budget.
This entry was posted by dewaldbotha on January 14, 2009 at 8:29 am, and is filed under architecture, java, solr, tomcat.