<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Source Web Thoughts &#187; tomcat</title>
	<atom:link href="http://blog.dewaldbotha.co.za/tag/tomcat/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.dewaldbotha.co.za</link>
	<description></description>
	<lastBuildDate>Fri, 25 Nov 2011 09:17:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>the great search balancing act</title>
		<link>http://blog.dewaldbotha.co.za/2009/01/14/the-great-search-balancing-act/</link>
		<comments>http://blog.dewaldbotha.co.za/2009/01/14/the-great-search-balancing-act/#comments</comments>
		<pubDate>Wed, 14 Jan 2009 06:29:24 +0000</pubDate>
		<dc:creator>dewaldbotha</dc:creator>
				<category><![CDATA[architecture]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tomcat]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[haproxy]]></category>
		<category><![CDATA[keepalived]]></category>
		<category><![CDATA[load balancing]]></category>
		<category><![CDATA[replication]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://blog.dewaldbotha.co.za/2009-01-14/the-great-search-balancing-act/</guid>
		<description><![CDATA[it&#8217;s been a while since my last post &#8211; and as interests fade with time, others jump up faster than a beach ball at a nickelback concert. so i&#8217;ve been looking into solr the last couple of days.  solr is relatively new in the arena and probably outshined a bit in popularity by other search <a href="http://blog.dewaldbotha.co.za/2009/01/14/the-great-search-balancing-act/" class="more-link">More &#62;</a>]]></description>
			<content:encoded><![CDATA[<p>it&#8217;s been a while since my last post &#8211; and as interests fade with time, others jump up faster than a beach ball at a nickelback concert.</p>
<p>so i&#8217;ve been looking into <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> the last couple of days.  <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> is relatively new in the arena and probably outshined a bit in popularity by other search engines such as <a href="http://lucene.apache.org/solr/" title="Lucene" target="_blank">lucene</a> and <a href="http://lucene.apache.org/nutch/" title="Nutch" target="_blank">nutch</a>.  &#8220;but why <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a>?&#8221;, you may find yourself asking.</p>
<p>well <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> has a couple of tricks up its sleeve &#8211; which is likely due to the fact that it&#8217;s a fresher version of the old, dare i call it legacy, search engines.</p>
<p><span id="more-9"></span><strong>some of the features of <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> include:</strong></p>
<ul>
<li>highly scalable <a href="http://www.java.com/en/" title="Java" target="_blank">java</a> search server (as i will try to show you through this article)</li>
<li>works with the <a href="http://lucene.apache.org/solr/" title="Lucene" target="_blank">lucene</a> search library (tried and tested)</li>
<li>you can update your engine with <a href="http://en.wikipedia.org/wiki/XML" title="XML" target="_blank">xml</a> using a kind of lightweight <a href="http://java.sun.com/developer/technicalArticles/WebServices/restful/" title="Resftul" target="_blank">restful</a> service.</li>
<li>can parse html, openoffice, microsoft office suite documents, pdf&#8217;s etc using <a href="http://lucene.apache.org/solr/" title="Lucene" target="_blank">lucene</a> parsers.</li>
<li>custom tokenizer, filter and analyzer  steps for control over indexing and query processing</li>
<li>extremely rich indexing of fields and metadata, including numbers</li>
<li>can combine fields for fulltext-type searching</li>
<li>tuned for high performance even when updating</li>
<li>spell checkers included, etc &#8211; to name only a couple of features.</li>
</ul>
<p><a href="http://lucene.apache.org/solr/features.html" title="Solr Features" target="_blank">click here to view a more complete list.</a></p>
<p><a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> currently has api&#8217;s for <a href="http://www.ruby-lang.org/en/" title="Ruby" target="_blank">ruby</a>, <a href="http://www.php.net/" title="PHP" target="_blank">php</a>, <a href="http://www.python.org/" title="Python" target="_blank">python</a>, <a href="http://www.json.org/" title="JSON" target="_blank">json</a>, <a href="http://forrest.apache.org/" title="Forrest" target="_blank">forrest</a>/<a href="http://cocoon.apache.org/" title="Cocoon" target="_blank">cocoon</a>.</p>
<p>the obvious elephant in the room, missing from the list above, is that solr does not ship with a crawler of any sort.  luckily some friendly open source guys already <a href="http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html" title="Integrate Nutch with Solr as a crawler" target="_blank">released a guide for integrating nutch</a> with <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> for an all round experience.</p>
<p>you might also want to look into <a href="http://www.gissearch.com/localsolr" title="localsolr" target="_blank">localsolr</a> &#8211; which enables geographical and spatial searches with <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a></p>
<h3>now&#8230; where to begin:</h3>
<p>first things first &#8211; since <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> is <a href="http://www.java.com/en/" title="Java" target="_blank">java</a> based, it has to be wrapped in a <a href="http://www.java.com/en/" title="Java" target="_blank">java</a> container of some sort.  i chose the friendly open source <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a> as a basis.  i installed <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a> 5.5 using <a href="http://linux-sxs.org/internet_serving/c140.html" title="Tomcat Installation" target="_blank">this helpful guide as a reference</a>.  read a bit more than what is required and make sure you understand the concept of web applications and the configuration of <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a>.</p>
<p>after <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a> is installed, it&#8217;s time for <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a>.  and as luck would have it <img src='http://blog.dewaldbotha.co.za/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  &#8211; a <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> installation in <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a> is probably one of the easiest, most straightforward ways of doing it &#8211; and i <a href="http://wiki.apache.org/solr/SolrTomcat" title="Install Solr in Tomcat" target="_blank">just followed this handy guide</a>, provided by <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a>, as a reference.</p>
<p>now after <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> is up and running, you should be able to access your admin screen via the following url <a href="http://localhost:8180/solr/" title="Solr" target="_blank">http://127.0.0.1:8180/solr</a></p>
<p>that would be assuming your <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a> port is 8180 and <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> is installed under the directory <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> in your web applications.</p>
<p>now after installing <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> &#8211; and accessing the admin screen &#8211; you should be able to view your configuration and perform a basic search.  updating the schema and importing data into <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> is a whole other post on its own, and as soon as i figure it out, you will be the first to know &#8211; but for now we are going to focus on replication and scalability.</p>
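<p>as a rough sketch of what a basic search looks like from the command line, the helpers below just build the urls involved.  the function names are made up for this post, the defaults assume the tomcat port (8180) and webapp name (solr) used above, and the search goes through solr&#8217;s standard /select handler.</p>

```shell
#!/bin/sh
# Build the base url of a solr webapp from host, port and webapp name.
# The defaults mirror the setup above (tomcat on 8180, webapp called "solr").
solr_base_url() {
    host=${1:-127.0.0.1}
    port=${2:-8180}
    app=${3:-solr}
    printf 'http://%s:%s/%s\n' "$host" "$port" "$app"
}

# Build a search url against the standard /select handler.
# $1 is the query, $2..$4 are the optional host, port and webapp name.
# The query is passed through as-is, so it must already be url-safe.
solr_select_url() {
    printf '%s/select?q=%s\n' "$(solr_base_url "$2" "$3" "$4")" "$1"
}
```

<p>something like <code>curl "$(solr_select_url tomcat)"</code> should then return the matching documents as xml, assuming the instance is up and has data indexed.</p>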
<p>look at the following diagram (done by me using <a href="http://www.gliffy.com/" title="Gliffy" target="_blank">gliffy</a>)</p>
<p style="text-align: center"><img src="http://farm4.static.flickr.com/3397/3194302830_7100b86480_o.jpg" alt="Solr Replication and Loab Balancing" /></p>
<p>as you can see above &#8211; there are 3 main levels:</p>
<ol>
<li>load balancers (wrapped in a monitoring service &#8211; <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a> which is optional)</li>
<li><a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> slave instances</li>
<li><a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> master</li>
</ol>
<p>you can see that it isn&#8217;t the most original idea in the book, but it works.  especially if you run it within a cloud, where the number of instances is almost unlimited.</p>
<h3>searching with <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a></h3>
<p>let&#8217;s start from the bottom up &#8211; at the base &#8211; you have the <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> master &#8211; which would be the <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> instance where all your data importing from <a href="http://en.wikipedia.org/wiki/XML" title="XML" target="_blank">xml</a> happens.</p>
<p>above that are the slave instances, which replicate their indexes from the master.  the slaves in turn handle all search requests from the load balancers, which sit at the top.  this reduces load on the master, which will be used as the primary indexer.</p>
<p>also the load balancers will be wrapped in a monitoring system, which will alert you in case something goes wrong, and also reduces the risk of the entry point becoming a single point of failure.  this monitoring system is really optional, since the load balancer i will be using has built in failover checks.  but nevertheless it is good practice to look into a monitoring system like <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a> for your architecture.</p>
<p>now for the specifics &#8211; to update the index of your master is almost a trivial exercise &#8211; since <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> has a kind of a <a href="http://java.sun.com/developer/technicalArticles/WebServices/restful/" title="Resftul" target="_blank">restful</a> approach of doing this -</p>
<p><code><strong>#the following would be in a .sh file<br />
FILES=$*<br />
URL=http://localhost:8180/solr/update</strong></code><br />
<code><br />
<strong>#post every file given on the command line to the update url<br />
for f in $FILES; do<br />
echo Posting file $f to $URL<br />
curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'<br />
done<br />
</strong></code><code><br />
<strong>#send the commit command to make sure all the changes are flushed and visible<br />
curl $URL --data-binary '&lt;commit/&gt;' -H 'Content-type:text/xml; charset=utf-8'</strong></code></p>
<p>so &#8211; if you save the above script as e.g. post.sh and make it executable (chmod a+x post.sh), then you could run it with an xml dataset (./post.sh dataset.xml), and it will update the index using its semi restful interface.  all you have to do is change the url in the above script to the <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> instance you wish to update.</p>
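<p>in case you are wondering what such an xml dataset looks like, here is a minimal sketch that prints a solr update message with a single document.  the function name is made up, and the field names (id and name) are an assumption borrowed from the example schema that ships with solr &#8211; your own schema decides which fields are valid.</p>

```shell
#!/bin/sh
# Print a minimal solr xml update message containing a single document.
# Assumes the schema defines "id" and "name" fields (as the example schema
# shipped with solr does); the values must already be xml-escaped.
make_solr_doc() {
    printf '<add>\n'
    printf '  <doc>\n'
    printf '    <field name="id">%s</field>\n' "$1"
    printf '    <field name="name">%s</field>\n' "$2"
    printf '  </doc>\n'
    printf '</add>\n'
}
```

<p>running <code>make_solr_doc 1 'my first document' &gt; dataset.xml</code> gives you a file you can hand to post.sh.</p>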
<p>then comes the replication of data to the slaves &#8211; this is done by using the scripts in the bin directory of your <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> installation.  <a href="http://wiki.apache.org/solr/SolrCollectionDistributionScripts" title="Solr Update Scripts" target="_blank">here are the descriptions of these scripts</a> from the <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> website.  we would most likely be interested in the snapshooter &#8211; which takes a snapshot of the current master index &#8211; and from there you would use the snappuller on each slave &#8211; which pulls the latest snapshot from the master &#8211; followed by the snapinstaller, which installs the pulled snapshot as the slave&#8217;s live index.</p>
<p>after you&#8217;ve mastered these scripts, you can run them as cron jobs every so often to update the slave indexes.  i&#8217;ve also read on their site that in the new release they will be building replication into <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> as a feature, where you can just tweak a couple of configuration settings and it should replicate without external tools, so keep a lookout for the new <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> version.</p>
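<p>a cron driven sync on a slave could be sketched roughly like this.  the function name, the /opt/solr/bin path and the five minute interval are all assumptions of mine, and the scripts are called without arguments, which assumes their settings (master host, data directory and so on) are already filled in in solr&#8217;s scripts.conf.</p>

```shell
#!/bin/sh
# Pull the latest snapshot from the master and install it on this slave.
# $1 is the directory holding the solr distribution scripts; the default
# path is an assumption. The scripts are run without arguments, so they
# are expected to take their settings from scripts.conf.
sync_slave() {
    bin_dir=${1:-/opt/solr/bin}
    # only install when the pull succeeded
    "$bin_dir/snappuller" && "$bin_dir/snapinstaller"
}

# a crontab entry on each slave could then run a small script that calls
# sync_slave, for example every five minutes (hypothetical script name):
#   */5 * * * * /opt/solr/bin/sync_slave.sh
```

<p>chaining the two scripts with &amp;&amp; means a failed pull never triggers an install, so a slave simply keeps serving its old index until the next run.</p>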
<h3>load balancing with <a href="http://haproxy.1wt.eu/" title="Haproxy" target="_blank">haproxy</a></h3>
<p>so as a load balancing solution i chose <a href="http://haproxy.1wt.eu/" title="Haproxy" target="_blank">haproxy</a>, which offers high availability and balancing for tcp and http based applications.  <a href="http://haproxy.1wt.eu/" title="Haproxy" target="_blank">haproxy</a> does not serve content like <a href="http://www.apache.org/" title="Apache" target="_blank">apache</a> does, neither does it do caching the way <a href="http://www.squid-cache.org/" title="squid" target="_blank">squid</a> does, but it&#8217;s small, simple and works very well.</p>
<p>to install <a href="http://haproxy.1wt.eu/" title="Haproxy" target="_blank">haproxy</a>:</p>
<p><code><strong>mkdir /opt/haproxy<br />
cd /opt/haproxy<br />
wget http://haproxy.1wt.eu/download/1.3/src/haproxy-1.3.15.7.tar.gz<br />
gunzip haproxy-1.3.15.7.tar.gz<br />
tar -xf haproxy-1.3.15.7.tar<br />
cd haproxy-1.3.15.7<br />
make<br />
cp ./haproxy /usr/bin/haproxy</strong><br />
</code><br />
then create a file called haproxy.cfg that contains the following data:</p>
<p><code><br />
<strong>global<br />
log 127.0.0.1   local0               #logs all haproxy info to local0 log<br />
log 127.0.0.1   local1 notice    #logs all notifications to local1 log</strong></code><br />
<code><br />
<strong>daemon                                     #specifies haproxy to run as a daemon instance<br />
maxconn         4096                 # total max connections (dependent on ulimit)</strong></code><br />
<code><br />
<strong>defaults   #setup some default values<br />
log            global<br />
mode       http<br />
option      httplog<br />
option      dontlognull</strong><br />
</code><code><br />
<strong>clitimeout        60000       # maximum inactivity time on the client side<br />
srvtimeout        30000       # maximum inactivity time on the server side<br />
timeout connect   4000        # maximum time to wait for a connection attempt to a server to succeed</strong><br />
</code><code><br />
<strong>option            httpclose     # disable keepalive (HAProxy does not yet support the HTTP keep-alive mode)<br />
option            abortonclose  # enable early dropping of aborted requests from pending queue<br />
option            httpchk       # enable HTTP protocol to check on servers health<br />
option            forwardfor    # enable insert of X-Forwarded-For headers</strong><br />
</code><code><br />
<strong>balance roundrobin            # each server is used in turns, according to assigned weight</strong><br />
</code><code><br />
<strong>stats enable                  # enable web-stats at /haproxy?stats<br />
stats auth        admin:pass  # force HTTP Auth to view stats<br />
stats refresh     5s        # refresh rate of stats page</strong><br />
</code><code><br />
<strong>listen myloadbalancer 127.0.0.1:8888 #where the loadbalancer should listen for requests<br />
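server slave1 192.168.1.6:8180 check #a slave (check enables the health checks set up by option httpchk)</strong></code><br />
<code><strong>server slave2 192.168.1.6:8180 check #another slave (point this at the second slave's own address)</strong><br />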
</code><br />
</p>
<p>and that is it.  now all you do is run:</p>
<p><code><strong>/usr/bin/haproxy -f haproxy.cfg</strong></code></p>
<p>and bob&#8217;s your uncle &#8211; just go to <a href="http://127.0.0.1:8888/haproxy?stats" title="Haproxy Admin" target="_blank">http://127.0.0.1:8888/haproxy?stats</a> to verify that it is running (the username is: admin and the password: pass &#8211; as specified in the config file above)</p>
<p>just note &#8211; if you get binding errors, make sure the port you specify in the config file is open, and not in use by something like <a href="http://www.apache.org/" title="apache" target="_blank">apache</a>, <a href="http://tomcat.apache.org/" title="Apache Tomcat" target="_blank">tomcat</a>, <a href="http://yaws.hyber.org/" title="yaws" target="_blank">yaws</a> etc.</p>
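<p>to catch typos in the config before they bite, you could let haproxy validate the file first and only start it when the check passes.  this is just a sketch: the function name and the HAPROXY override variable are made up, and it assumes your haproxy build supports the -c check mode.</p>

```shell
#!/bin/sh
# Validate a haproxy config and only start the daemon if it passes.
# HAPROXY can be overridden; the default matches the install path used above.
start_haproxy() {
    cfg=${1:-haproxy.cfg}
    haproxy_bin=${HAPROXY:-/usr/bin/haproxy}
    # -c only checks the configuration file and exits
    if "$haproxy_bin" -c -f "$cfg"; then
        "$haproxy_bin" -f "$cfg"
    else
        echo "not starting haproxy: $cfg failed the config check" >&2
        return 1
    fi
}
```

<p>calling <code>start_haproxy haproxy.cfg</code> then either starts the balancer or tells you the config is broken, instead of leaving you guessing why nothing is listening.</p>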
<p>now if you have all your slave instances running a version of <a href="http://lucene.apache.org/solr/" title="Solr" target="_blank">solr</a> &#8211; you should easily be able to connect to them in a round robin fashion &#8211; e.g. by connecting to <a href="http://127.0.0.1:8888/solr" title="Solr" target="_blank">http://127.0.0.1:8888/solr</a> &#8211; you can confirm this in the syslog or with the above stats link.</p>
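<p>if you would rather check the slaves from a script than from the stats page, the sketch below reads haproxy&#8217;s stats in csv form and prints the status of each server.  the function name is made up, and it assumes your haproxy version offers the csv export (the stats url with ;csv appended).  the columns are looked up by name from the header line rather than by position, since the column list can differ between versions.</p>

```shell
#!/bin/sh
# Read haproxy's csv stats from stdin and print "proxy/server status" for
# every real server line, skipping the FRONTEND and BACKEND summary rows.
server_status() {
    awk -F, '
        /^# / {
            sub(/^# /, "", $1)               # header line starts with "# "
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF > 1 && col["svname"] &&
        $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" {
            print $(col["pxname"]) "/" $(col["svname"]), $(col["status"])
        }
    '
}
```

<p>with the config above that would be something like <code>curl -s -u admin:pass 'http://127.0.0.1:8888/haproxy?stats;csv' | server_status</code>, which should list slave1 and slave2 along with their current status.</p>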
<h3>monitor that architecture, with <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a></h3>
<p>last, but not least, would be enabling the load balancing monitor &#8211; <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a> was suggested to me as a viable option, however, with <a href="http://haproxy.1wt.eu/" title="Haproxy" target="_blank">haproxy</a> you can do failover checks and <a href="http://www.howtoforge.com/high-availability-load-balancer-haproxy-heartbeat-debian-etch" title="HaProxy Multiple load balancers on shared ip" target="_blank">run multiple loadbalancers under a shared ip</a>, so this isn&#8217;t really a necessary part, but as explained earlier, it is good practice to run a monitor like <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a> on your system.</p>
<h3>conclusion</h3>
<p>because it is late, and i still have to proofread this sucker, i haven&#8217;t really got around to <a href="http://www.keepalived.org/" title="Keepalived" target="_blank">keepalived</a>, although &#8211; do yourself a favor and read up on it &#8211; i know i certainly will have a cup of java and research it a bit more.</p>
<p>so &#8211; there you go &#8211; you have a fully load balanced system &#8211; running a search engine, which should scale horizontally with your hardware buying budget. <img src='http://blog.dewaldbotha.co.za/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dewaldbotha.co.za/2009/01/14/the-great-search-balancing-act/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

