<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
	xmlns:media="http://search.yahoo.com/mrss/"
>

<channel>
	<title>Solution Hacker</title>
	<atom:link href="http://www.solutionhacker.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.solutionhacker.com</link>
	<description>This blog provides solutions for entrepreneurs!</description>
	<pubDate>Sat, 11 Oct 2008 08:54:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
		<!-- podcast_generator="podPress/8.8" -->
		<copyright>&#xA9; </copyright>
		<managingEditor>rayhon1014@hotmail.com ()</managingEditor>
		<webMaster>rayhon1014@hotmail.com()</webMaster>
		<category></category>
		<itunes:keywords></itunes:keywords>
		<itunes:subtitle></itunes:subtitle>
		<itunes:summary>This blog is dedicated to showing you how to use web technologies to make money!</itunes:summary>
		<itunes:author></itunes:author>
		<itunes:category text="Society &amp; Culture"/>
		<itunes:owner>
			<itunes:name></itunes:name>
			<itunes:email>rayhon1014@hotmail.com</itunes:email>
		</itunes:owner>
		<itunes:block>No</itunes:block>
		<itunes:explicit>no</itunes:explicit>
		<itunes:image href="http://www.solutionhacker.com/wp-content/plugins/podpress/images/powered_by_podpress_large.jpg" />
		<image>
			<url>http://www.solutionhacker.com/wp-content/plugins/podpress/images/powered_by_podpress.jpg</url>
			<title>Solution Hacker</title>
			<link>http://www.solutionhacker.com</link>
			<width>144</width>
			<height>144</height>
		</image>
		<item>
		<title>Java System Architecture Resources - Links</title>
		<link>http://www.solutionhacker.com/2008/09/22/java-system-architecture-resources-links/</link>
		<comments>http://www.solutionhacker.com/2008/09/22/java-system-architecture-resources-links/#comments</comments>
		<pubDate>Mon, 22 Sep 2008 22:12:57 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[11. Architect Corner]]></category>

		<category><![CDATA[algorithm]]></category>

		<category><![CDATA[dangling reference]]></category>

		<category><![CDATA[garbage collector]]></category>

		<category><![CDATA[memory leak]]></category>

		<category><![CDATA[memory management]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=176</guid>
		<description><![CDATA[Memory Management
One strength of the Java™ 2 Platform, Standard Edition (J2SE™) is that it performs automatic memory
management, thereby shielding the developer from the complexity of explicit memory management. However, that doesn&#8217;t mean memory leaks cannot happen. So, I decided to summarize some key areas of this topic. They [...]]]></description>
			<content:encoded><![CDATA[<h2>Memory Management</h2>
<p>One strength of the Java™ 2 Platform, Standard Edition (J2SE™) is that it performs automatic memory management, thereby shielding the developer from the complexity of explicit memory management. However, that doesn&#8217;t mean memory leaks cannot happen. So, I decided to summarize some key areas of this topic. They are:</p>
<ol>
<li>How does the garbage collector work?</li>
<li>How can a memory leak still show up?</li>
<li>What is a weak reference?</li>
<li>What is stored in the heap and in the stack?</li>
</ol>
<p>If you want to understand this topic in detail, you can read this <a href="https://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf">article</a> from Sun. The key points are summarized below:</p>
<ol>
<li>Without automatic memory management (GC), we may face two common issues: <strong>dangling references</strong> (deallocating an object while others still reference it) and <strong>space leaks</strong> (objects that are no longer referenced but have not been deallocated either). GC is great, but it doesn&#8217;t solve every memory allocation problem. For example, a list of objects can keep growing until it uses up all the free memory.</li>
<li><strong>Garbage collection</strong> takes time and resources, and sometimes that cost is not acceptable for a real-time, mission-critical system.</li>
<li>The task of fulfilling an allocation request, which involves <strong>finding a block of unused memory of a certain size in the heap</strong>, is a difficult one. The main problem for most dynamic memory allocation algorithms is to <strong>avoid fragmentation</strong> while keeping both allocation and deallocation efficient. One approach to eliminating fragmentation is called <strong>compaction</strong>.</li>
<li>It is also desirable that a garbage collector operate <strong>efficiently</strong>, without introducing long pauses during which the application is not running.</li>
</ol>
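<p>The growing-list leak from point 1, together with the weak reference mentioned in the question list, can be sketched in a few lines. This is a Python sketch rather than Java (Java exposes the same idea through <code>java.lang.ref.WeakReference</code>). The <code>Session</code> class, the <code>history</code> list and <code>handle_request</code> are hypothetical names made up for illustration.</p>

```python
import gc
import weakref


class Session:
    """Stand-in for any object an application allocates (hypothetical)."""

    def __init__(self, user):
        self.user = user


# Grows forever: every Session appended here stays strongly reachable,
# so the garbage collector is not allowed to reclaim it.
history = []


def handle_request(user):
    session = Session(user)
    history.append(session)       # the "leak": nothing ever removes entries
    return weakref.ref(session)   # a weak reference does not keep it alive


ref = handle_request("alice")
gc.collect()
alive_while_listed = ref() is not None   # the list still holds the object

history.clear()                          # drop the strong references
gc.collect()
alive_after_clear = ref() is not None    # nothing strong is left
```

<p>While the object sits in the list, the collector must keep it, no matter how often it runs. Once the list lets go, the weak reference alone does not prevent reclamation, which is why weak references are a common tool for caches that should not pin their entries in memory.</p>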
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/09/22/java-system-architecture-resources-links/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Amazon Web Service Solutions</title>
		<link>http://www.solutionhacker.com/2008/09/07/amazon-web-service-solutions/</link>
		<comments>http://www.solutionhacker.com/2008/09/07/amazon-web-service-solutions/#comments</comments>
		<pubDate>Mon, 08 Sep 2008 05:22:18 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[05. Scale your site]]></category>

		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=174</guid>
		<description><![CDATA[When we talk about SOA, I think of Amazon. It is the company that takes SOA to the next level, proving to the world that it is a viable solution for us. Great! I decided to spend some time learning from Amazon by reviewing the web services it provides, reading the related interviews [...]]]></description>
			<content:encoded><![CDATA[<p><img align="left" style="margin-right: 10px; width: 176px; height: 100px;" src="http://g-ecx.images-amazon.com/images/G/01/00/10/00/14/19/27/100014192753._V46777512_.gif" alt="" /> When we talk about <strong>SOA</strong>, I think of Amazon. It is the company that takes SOA to the next level, proving to the world that it is a viable solution for us. Great! I decided to spend some time learning from Amazon by reviewing the web services it provides, reading the related interviews and blogs, studying how to build an application on top of its infrastructure, and developing an application that consumes data from its <strong>Web Services</strong>. I believe the best way to learn SOA is to get a taste of the services provided by a company that relies heavily on it to scale its business. Before I delve deeper, I need to clarify one thing. Many people use the terms SOA and Web Service interchangeably. To be honest, I was one of them. However, by definition, they are not the same. <strong><span id="articleBody">SOA is about design</span></strong><span id="articleBody">; <strong>Web services are a specific technology set that supports distributed computing. </strong>Web services make it easier to create a service-based system, but only if your developers are using SOA design principles, where functions are packaged into modular, shareable, distributable services that can be used and reused by multiple consumers. At Amazon, each service is independent and encapsulates three things: data, business logic and a public service interface. Each service owns its data, which is never directly accessed by other services. According to its CTO, this is the core architecture that scales Amazon.<br />
</span>&#160;</p>
<div style="page-break-after: always;"><span style="display: none;">&#160;</span></div>
<p><span id="more-174"></span></p>
<p><!--more--></p>
<h2>Video Presentation</h2>
<p>In this presentation, <a href="http://www.infoq.com/presentations/GrepTheWeb-Jinesh-Varia">Jinesh Varia</a>, an evangelist from Amazon, shows how to build a regular-expression-based search engine called &#8220;GrepTheWeb&#8221; on top of the Amazon infrastructure: SQS, SimpleDB, EC2 and S3. The most interesting thing he mentions is the on-demand architecture powered by Hadoop and the Amazon infrastructure. <em>&#8220;At time t0, you have no infrastructure. At time t1, when regular expression comes in, the system reaches the execution phase and the whole infrastructure is ready for it. At time t2, the request is fulfilled, the whole infrastructure is gone&#8230;&#8221; </em>This gave me a taste of cloud computing and how powerful it can be.</p>
<ul>
<li><a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1632&amp;categoryID=100">Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures</a></li>
<li><a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1633&amp;categoryID=152">Building GrepTheWeb in the Cloud, Part 2: Best Practices</a></li>
</ul>
<h2>Web Resource</h2>
<p><a href="http://highscalability.com/amazon-architecture">High Scalability</a> has an article about Amazon&#8217;s architecture. The author follows up on various sources and consolidates the key information he found.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/09/07/amazon-web-service-solutions/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Database Performance - Indexing</title>
		<link>http://www.solutionhacker.com/2008/08/24/database-performance-indexing/</link>
		<comments>http://www.solutionhacker.com/2008/08/24/database-performance-indexing/#comments</comments>
		<pubDate>Sun, 24 Aug 2008 08:14:21 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[15. Database Performance]]></category>

		<category><![CDATA[B-Tree]]></category>

		<category><![CDATA[balanced Tree]]></category>

		<category><![CDATA[database performance]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=171</guid>
		<description><![CDATA[I focus on two things when analyzing a database. First, I find out how it manages the data. Second, I look at how it scales in terms of data volume and traffic. Today, I will talk about the indexing scheme that most databases use. It [...]]]></description>
			<content:encoded><![CDATA[<p>I focus on two things when analyzing a database. First, I find out how it manages the data. Second, I look at how it scales in terms of data volume and traffic. Today, I will talk about the indexing scheme that most databases use. It is B-Tree indexing.</p>
<h2>B-Tree Indexing</h2>
<div class="mmobj"><img src="http://publib.boulder.ibm.com/infocenter/idshelp/v10/topic/com.ibm.adref.doc/adref014.gif" alt="begin figure description - The paragraph that precedes this figure describes the content of the figure. - end figure description" /></div>
<div class="mmobj">&nbsp;</div>
<div class="mmobj">Many people think <strong>B-Tree </strong>means binary tree. That is not right. If the index were really a binary tree, 1 million index values would need a tree about 20 levels deep even when perfectly balanced, and each node retrieval is equivalent to a read operation, so reaching a leaf node would take far too many reads to perform well. Instead, the B is commonly read as &quot;balanced&quot;: a B-tree never becomes lopsided as nodes are added and removed. Apart from that, each node can have many child nodes, so millions of records can be handled by a <strong>balanced tree</strong> only 2-3 levels deep. That is why it performs so well: it offers O(log n) performance for single-record lookups.</div>
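<p>The depth argument above is simple arithmetic, sketched here in Python. The fanout values (2 for a binary tree, 100 and 1000 for index pages) are illustrative assumptions, not figures from any particular database.</p>

```python
def levels_needed(n_keys, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers n_keys."""
    levels = 1
    capacity = fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


binary_levels = levels_needed(1_000_000, 2)      # two children per node
wide_levels = levels_needed(1_000_000, 100)      # assumed 100 entries per page
wider_levels = levels_needed(1_000_000, 1000)    # assumed 1000 entries per page
```

<p>With these assumptions a binary tree needs 20 levels for a million keys, while a node holding 100 entries needs 3 and a node holding 1000 entries needs 2. Since each level costs roughly one page read, the wide tree is the one that stays cheap.</p>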
<div class="mmobj">&nbsp;</div>
<div class="mmobj">The fundamental unit of an index is the <strong><em>index item</em></strong>. An index item contains a key value that represents the value of the indexed column for a particular row. It also contains <strong>rowid </strong>information that the database server uses to locate the row in a data page. A node is an index page that stores a group of index items.</div>
<div class="mmobj">&nbsp;</div>
<div class="mmobj">It is interesting to note that the leaf nodes of the index actually form a <strong>doubly linked list.</strong> Once we find where to start in the leaf nodes (the first matching value), doing an <strong>ordered </strong>scan of values (an index range scan) is very easy. We don&#8217;t have to navigate the structure any more; we just go forward through the leaf nodes. That makes solving <strong>range-based queries </strong>such as &quot;BETWEEN 20 AND 30&quot; much easier.</div>
<div class="mmobj">&nbsp;</div>
<div class="mmobj">
<h2>When should you use a B-Tree index?</h2>
<p>To understand when you should use a B-Tree index, you should know that there are two ways to use an index. <strong>First</strong>, you can use the index <u>as a means to access rows in a table via rowid.</u> In that case, you want to access only a very small percentage of the rows in the table. Otherwise, you go through the &quot;index then row&quot; cycle many times (which implies many I/Os), and that is worse than pulling rows in bulk to reduce the number of I/Os (the costly part of a database operation). According to experiments, a full scan is faster when too high a percentage of rows is accessed via the index. <strong>Second</strong>, you can use the index <u>as a means to answer the query if the index contains enough information </u>to answer the entire query. In this case, we don&#8217;t need to go to the table at all; the index is used as a thinner version of the table. So, if you want to access a large percentage of rows via an index, consider answering the query from the information in the index alone.</p>
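<p>The second usage, answering a query from the index alone, is easy to observe. The sketch below uses SQLite only because it ships with Python, as a stand-in for other databases. The <code>users</code> table and the <code>idx_age_name</code> index are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER, name TEXT, bio TEXT);
    CREATE INDEX idx_age_name ON users (age, name);
    """
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column holds the detail text


# Both requested columns live in the index, so the table is never touched.
covered = plan("SELECT name FROM users WHERE age BETWEEN 20 AND 30")

# bio is not in the index, so each matching entry needs a trip to the table.
needs_table = plan("SELECT bio FROM users WHERE age BETWEEN 20 AND 30")
```

<p>For the first query SQLite reports a covering index, which is the &quot;thinner version of the table&quot; case. For the second it cannot, because <code>bio</code> has to be fetched from the table rows.</p>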
<h2>MySQL Indexing</h2>
<p>There are several rules to remember for MySQL indexing:</p>
<ol>
<li>MySQL usually uses only one index per table per query. A UNION is an exception because each part is treated as a separate query, and MySQL 5.0 and later can sometimes combine several indexes through the index merge optimization.</li>
<li>To get around that, you can create multicolumn indexes.</li>
<li>When there is more than one index to choose from, MySQL makes an educated guess based on the statistics it has gathered.</li>
<li><strong>MyISAM </strong>keeps indexes in a file completely separate from the table rows. Table rows are stored in no particular order and are retrieved via the row pointer stored in the index items.</li>
<li><strong>InnoDB </strong>uses <strong>clustered indexes</strong>: the primary key and the record itself are stored together, and records are kept in primary-key order. When your data is almost always looked up by its primary key, clustered indexes can make lookups incredibly fast because a single lookup pulls out the record in question.</li>
<li>A primary key cannot contain NULL, whereas a unique index can.</li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/08/24/database-performance-indexing/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Secret of Warren Buffett</title>
		<link>http://www.solutionhacker.com/2008/08/09/secret-of-warren-buffett/</link>
		<comments>http://www.solutionhacker.com/2008/08/09/secret-of-warren-buffett/#comments</comments>
		<pubDate>Sat, 09 Aug 2008 08:26:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=169</guid>
		<description><![CDATA[Recently I came across a great book named &#34;Even Buffett Isn&#8217;t Perfect&#34; that talks about Warren Buffett. What differentiates this book from the others is that it is not simply a love letter to Warren but an objective analysis of what contributes to Warren&#8217;s success. Below are some of the key points [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I came across a great book named &quot;Even Buffett Isn&#8217;t Perfect&quot; that talks about Warren Buffett. What differentiates this book from the others is that it is not simply a love letter to Warren but an objective analysis of what contributes to Warren&#8217;s success. Below are some of the key points I got from this book:</p>
<ol>
<li><strong>Buffett loves the insurance business</strong>. By steering Berkshire into the insurance business, Buffett was able to get his hands on a tremendous amount of float.</li>
<li>Buffett puts a lot of focus on <strong>good-quality management </strong>that he believes will create long-term value.</li>
<li>Buffett relies heavily on <strong>discounted cash flow </strong>(DCF) analysis to find stocks and companies that can be purchased for less than their <strong>intrinsic value</strong>. In other words, Buffett doesn&#8217;t buy cheap stock, he buys stock cheap.</li>
<li>The key here is how to accurately calculate the intrinsic value.</li>
<li><strong>Diversification </strong>lowers your risk, but it also locks you into the market rate of return. Buffett believes that if you know what you are doing and really understand how to evaluate a business, holding a dozen stocks is sufficient diversification.</li>
<li>Studies suggest that <strong>value stocks</strong> beat <strong>growth stocks </strong>(i.e., those with high price multiples such as P/E) over the long term, but that growth stocks are a better bet for investors with shorter horizons. They also show that small-cap stocks do better than large-cap stocks over the long term.</li>
<li><strong>Buffett is more a buyer of companies than a buyer of stocks.</strong></li>
<li>Buffett actively petitioned against efforts to eliminate personal taxes on dividends, even though dividends are paid to shareholders only after corporations have paid their own taxes. In other words, the situation results in double taxation. However, lower tax rates could benefit everyone by stimulating economic growth, which means better job opportunities. The author believes taxing the investment class at higher rates is likely to do more harm than good.</li>
</ol>
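<p>To make the DCF idea in point 3 concrete, here is a minimal Python sketch of discounting a stream of cash flows. All the numbers (three yearly cash flows of 100, a 10% discount rate, a price of 200) are invented for illustration and do not come from the book.</p>

```python
def present_value(cash_flows, rate):
    """Discount each year's cash flow back to today and add them up."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows, start=1))


# Hypothetical inputs: expected cash per year and the return we demand.
intrinsic_value = present_value([100.0, 100.0, 100.0], 0.10)
asking_price = 200.0
bought_cheap = asking_price < intrinsic_value
```

<p>The result depends entirely on the cash flow forecast and the discount rate, which is why point 4 calls accurate estimation the hard part.</p>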
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/08/09/secret-of-warren-buffett/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Evolution of XML parsing technologies</title>
		<link>http://www.solutionhacker.com/2008/07/16/evolution-of-xml-parsing-technologies/</link>
		<comments>http://www.solutionhacker.com/2008/07/16/evolution-of-xml-parsing-technologies/#comments</comments>
		<pubDate>Wed, 16 Jul 2008 18:35:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[11. Architect Corner]]></category>

		<category><![CDATA[axiom]]></category>

		<category><![CDATA[Axis2]]></category>

		<category><![CDATA[data binding]]></category>

		<category><![CDATA[dom]]></category>

		<category><![CDATA[rpc]]></category>

		<category><![CDATA[sax]]></category>

		<category><![CDATA[soap]]></category>

		<category><![CDATA[StAX]]></category>

		<category><![CDATA[web service]]></category>

		<category><![CDATA[XFire]]></category>

		<category><![CDATA[xml]]></category>

		<category><![CDATA[xml parser]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=167</guid>
		<description><![CDATA[Introduction
A few years ago there were two main XML parsing technologies: SAX and DOM.

SAX is event-driven: events are fired and forgotten as the XML is parsed. Advantages: It doesn&#8217;t need to hold the whole XML document in memory, and you don&#8217;t need to wait until the whole document has been parsed before the first [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>A few years ago there were two main XML parsing technologies: <strong>SAX </strong>and <strong>DOM</strong>.</p>
<ol>
<li><strong>SAX </strong>is <strong>event-driven</strong>: events are fired and forgotten as the XML is parsed. <strong>Advantages</strong>: It doesn&#8217;t need to hold the whole XML document in memory, and you don&#8217;t need to wait until the whole document has been parsed before the first event is emitted. <strong>Disadvantages</strong>: It uses a <strong>push API</strong>, so the parser holds control during parsing. Clients cannot control the parsing, and it is a poor fit for XML manipulation.</li>
<li><strong>DOM </strong>converts the XML into an object <strong>tree </strong>in memory before manipulation. <strong>Advantages</strong>: It is easier to manipulate the XML. <strong>Disadvantages</strong>: It eats up a lot of memory, which makes it a poor fit for documents larger than a few MB or for memory-constrained environments such as J2ME.</li>
</ol>
<p>A <strong>pull API</strong> is a more comfortable alternative for streaming processing of XML. A pull API is based around the more familiar <strong>iterator design pattern</strong> rather than the <strong>observer design pattern</strong>. In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available. In a pull API the client program drives the parser. In a push API the parser drives the client. That idea led to the invention of <strong>StAX</strong>.</p>
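<p>The push/pull difference is easiest to see side by side. StAX itself is a Java API, so the sketch below uses Python&#8217;s standard library as a stand-in: <code>xml.sax</code> for the push (observer) style and <code>xml.dom.pulldom</code> for the pull (iterator) style. The <code>orders</code> document and the helper names are made up for illustration.</p>

```python
import xml.sax
from xml.dom import pulldom

DOC = "<orders><order id='1'/><order id='2'/><order id='3'/></orders>"


class OrderHandler(xml.sax.ContentHandler):
    """Push style: the parser calls us back for every element it meets."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "order":
            self.ids.append(attrs["id"])


def push_all_ids(doc):
    handler = OrderHandler()
    xml.sax.parseString(doc.encode("utf-8"), handler)   # parser drives the client
    return handler.ids


def pull_first_id(doc):
    """Pull style: we ask for the next event and may simply stop asking."""
    for event, node in pulldom.parseString(doc):        # client drives the parser
        if event == pulldom.START_ELEMENT and node.tagName == "order":
            return node.getAttribute("id")
    return None


all_ids = push_all_ids(DOC)
first_id = pull_first_id(DOC)
```

<p>In the push version the handler is called for every element whether it is wanted or not. In the pull version the loop returns as soon as it has what it needs, and the rest of the document is never requested.</p>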
<p>In this article, I will introduce a new object model from Axis2 named <strong>AXIOM </strong>that uses StAX underneath for XML parsing. With it, XML parsing costs less memory and offers better control.<span id="more-167"></span><!--more--></p>
<h2>Evolution of Axis</h2>
<p>One of the first-generation SOAP engines, <strong>Apache SOAP</strong>, uses a <strong>DOM-based</strong> object model internally to represent the XML document, where the XML handling techniques force the entire XML object model to be built at once. The second-generation <strong>Apache Axis </strong>shifted to <strong>SAX </strong>to avoid keeping the complete information in memory. SAX, however, has a major constraint: it is built around a &quot;<strong>push</strong>&quot; technique, and once the parsing of the XML document starts it cannot be stopped. To jump over this hurdle, Apache Axis has to record SAX events. So, effectively, the XML message has to be kept in memory in the form of SAX events, making Apache Axis yet another memory-intensive programming model.</p>
<p><strong>Axis2 </strong>avoids keeping the complete SOAP message in memory by introducing a new object model for representing the SOAP message, <b>AXIOM</b>. <a href="http://jaxmag.com/itr/online_artikel/psecom,id,726,nodeid,147.html">AXIOM</a> takes a dramatically new approach. Although AXIOM has an &quot;external&quot; resemblance to DOM, the difference is that it <strong>generates objects only when required</strong>. This <strong>&quot;on-demand building&quot; </strong>feature gives AXIOM the edge needed to overcome the memory barrier that early SOAP engines failed to pass.</p>
<p>An interesting feature of AXIOM is that it is based on <strong>pull parsing</strong>. It is capable of generating pull events from the object model that has been built. Further, if the object model happens to be half built, AXIOM can shift to the underlying pull parser to generate pull events directly from the <strong>stream</strong>. The heart of AXIOM is the XML pull parser, since it is the only parsing model that supports pausing the parsing process. AXIOM uses the <strong>Streaming API for XML (StAX),</strong> which makes it easy to manipulate while using only a fraction of the memory of a conventional object model. Combined with the speed of the <strong>streaming pull parser</strong>, AXIOM puts Axis2 leaps ahead of its predecessors in efficiency and speed.</p>
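<p>The &quot;build objects only when required&quot; idea can be imitated with Python&#8217;s <code>xml.dom.pulldom</code>. This is not AXIOM and not its API, only an analogy under that assumption: events are pulled from the stream, and a DOM subtree is built just for the element the caller asks for. The envelope content and the <code>build_only</code> helper are hypothetical.</p>

```python
from xml.dom import pulldom

ENVELOPE = (
    "<envelope><header><trace>abc</trace></header>"
    "<body><order id='7'><sku>X-1</sku><qty>2</qty></order></body></envelope>"
)


def build_only(doc, tag):
    """Stream through doc and build a DOM subtree for the first <tag> only."""
    events = pulldom.parseString(doc)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)   # pull just enough events to finish this node
            return node
    return None


order = build_only(ENVELOPE, "order")
order_id = order.getAttribute("id")
sku = order.getElementsByTagName("sku")[0].firstChild.data
```

<p>Only the <code>order</code> element is expanded into a navigable subtree here. The header events were pulled and dropped as they went by, which is where the memory saving of an on-demand model comes from.</p>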
<p><strong>Apart from the new parser, Axis2 also has other new features:</strong></p>
<ol>
<li>Pluggable data binding: you can pick and choose among JAXB, Castor and XMLBeans for XML-to-Java conversion.</li>
<li>Improved support for message-style interaction (RPC vs. message-based)</li>
<li>Improved handlers</li>
</ol>
<p>The goal of this article is to focus on parsing technology, so I will not discuss the new features of Axis2 in detail. If you want to find out more, read <a href="http://www.jaxmag.com/itr/online_artikel/psecom,cur,,_psframe,,linkobject,print_,nocontainer,1_,id,747,nodeid,147.html">this</a>.</p>
<p>&nbsp;</p>
<h2>Reference</h2>
<p><a href="http://www.xml.com/lpt/a/1287">An Introduction to StAX</a></p>
<p><a href="http://jaxmag.com/itr/online_artikel/psecom,id,726,nodeid,147.html">Fast and lightweight object model for XML</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/16/evolution-of-xml-parsing-technologies/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Salesforce.com opens up Google Data API</title>
		<link>http://www.solutionhacker.com/2008/07/11/salesforcecom-opens-up-google-data-api/</link>
		<comments>http://www.solutionhacker.com/2008/07/11/salesforcecom-opens-up-google-data-api/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 15:51:23 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[16. Salesforce]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Google Data API]]></category>

		<category><![CDATA[salesforce]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=166</guid>
		<description><![CDATA[Salesforce + Google
It is good news that Salesforce.com has made the Google Data API available on its platform. To understand the full potential of the new platform, I googled around to see whether anyone had talked about it. Here is the first article I found that covers some use cases on this [...]]]></description>
			<content:encoded><![CDATA[<h2>Salesforce + Google</h2>
<div id="logo">It is good news that <strong>Salesforce</strong>.com has made the <strong>Google Data API </strong>available on its platform. To understand the full potential of the new platform, I googled around to see whether anyone had talked about it. Here is the first <a href="http://code.google.com/apis/gdata/articles/salesforce.html">article</a> I found that covers some use cases on this topic.&nbsp;</div>
<p>After getting a taste of its power, let&#8217;s <a href="http://wiki.apexdevnet.com/index.php/Google_Data_APIs_Toolkit_Setup">set it up</a> and try it ourselves.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/11/salesforcecom-opens-up-google-data-api/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Powerful Full Text Search - Part 3 Solr</title>
		<link>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-3-solr/</link>
		<comments>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-3-solr/#comments</comments>
		<pubDate>Mon, 07 Jul 2008 06:45:24 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[02. Build your site]]></category>

		<category><![CDATA[05. Scale your site]]></category>

		<category><![CDATA[indexing]]></category>

		<category><![CDATA[lucene]]></category>

		<category><![CDATA[solr]]></category>

		<category><![CDATA[web service]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=165</guid>
		<description><![CDATA[Introduction of Solr
Solr is a standalone enterprise search server with a web-service-like API. You put documents in it (called &#34;indexing&#34;) via XML over HTTP (RESTful). You query it via HTTP GET and receive XML results.

Advanced Full-Text Search Capabilities
 Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML and HTTP
Comprehensive HTML Administration Interfaces
Scalability [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction of Solr</h2>
<p><strong>Solr </strong>is a standalone enterprise search server with a web-service-like API. You put documents in it (called &quot;indexing&quot;) via XML over HTTP (RESTful). You query it via HTTP GET and receive XML results.</p>
<ul>
<li>Advanced Full-Text Search Capabilities</li>
<li><strong> Optimized</strong> for High Volume Web Traffic</li>
<li>Standards Based Open Interfaces - XML and HTTP</li>
<li>Comprehensive HTML <strong>Administration Interfaces</strong></li>
<li>Scalability - Efficient <strong>Replication </strong>to other Solr Search Servers</li>
<li>Flexible and Adaptable with XML configuration</li>
<li>Extensible Plugin Architecture</li>
</ul>
<p><span id="more-165"></span></p>
<h2><!--more-->Set up Solr</h2>
<p>To set up Solr, follow this <a href="http://lucene.apache.org/solr/tutorial.html">guideline</a>. Once Solr is set up, you practically have an indexing service up and running.</p>
<p>The HTTP/XML interface of the indexer has two main access points: the <em>update URL</em>, which maintains the index,  and the <em>select URL</em>, which is used for queries. In the default configuration, they are found at:</p>
<ul>
<li><code>http://[hostname:port]/solr/update</code></li>
<li><code>http://</code><code>[hostname:port]</code><code>/solr/select</code></li>
</ul>
<p>To add a document to the index, we POST an XML representation of its fields to the <em>update URL</em>. You can also delete documents, or update one by re-posting it with the same unique key. Every change operation must be followed by a commit, which flushes the change to the file system. Once we have indexed some data, an HTTP GET on the <em>select URL</em> does the querying.</p>
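<p>To make the two access points more concrete, here is a minimal Java sketch that only builds the strings involved: the <code>&lt;add&gt;</code> message you would POST to the update URL, and the GET URL for a query. It does not talk to a server. The class name, the field names <code>id</code> and <code>title</code>, and the host <code>localhost:8983</code> are assumptions for illustration; the real field names depend on your Solr schema.</p>
<pre name="code" class="java">
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrMessageSketch {

    // Escape the characters that are special inside XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace(&quot;&amp;&quot;, &quot;&amp;amp;&quot;).replace(&quot;&lt;&quot;, &quot;&amp;lt;&quot;)
                .replace(&quot;&gt;&quot;, &quot;&amp;gt;&quot;).replace(&quot;\&quot;&quot;, &quot;&amp;quot;&quot;);
    }

    // Build the add message that is POSTed to the update URL.
    static String addMessage(Map&lt;String, String&gt; fields) {
        StringBuilder xml = new StringBuilder(&quot;&lt;add&gt;&lt;doc&gt;&quot;);
        for (Map.Entry&lt;String, String&gt; f : fields.entrySet()) {
            xml.append(&quot;&lt;field name=\&quot;&quot;).append(escapeXml(f.getKey())).append(&quot;\&quot;&gt;&quot;)
               .append(escapeXml(f.getValue())).append(&quot;&lt;/field&gt;&quot;);
        }
        return xml.append(&quot;&lt;/doc&gt;&lt;/add&gt;&quot;).toString();
    }

    // Build the GET URL for a query against the select URL.
    static String selectUrl(String hostAndPort, String query) {
        try {
            return &quot;http://&quot; + hostAndPort + &quot;/solr/select?q=&quot; + URLEncoder.encode(query, &quot;UTF-8&quot;);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map&lt;String, String&gt; doc = new LinkedHashMap&lt;String, String&gt;();
        doc.put(&quot;id&quot;, &quot;post-165&quot;);                   // hypothetical unique key
        doc.put(&quot;title&quot;, &quot;Full Text Search &amp; Solr&quot;); // hypothetical field
        System.out.println(addMessage(doc));         // body to POST to /solr/update
        System.out.println(&quot;&lt;commit/&gt;&quot;);             // POST this afterwards to flush the change
        System.out.println(selectUrl(&quot;localhost:8983&quot;, &quot;title:solr&quot;));
    }
}
</pre>
<p>Re-posting a document with the same <code>id</code> value is what the update case above amounts to.</p>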
<h2>Powerful features Behind Solr</h2>
<p>If you have followed the guide above, you are already familiar with <strong>indexing</strong>, <strong>searching</strong> and facet browsing. Now let's get down to how to make Solr a <strong>scalable</strong> solution with great <strong>performance</strong>.</p>
<p><u><strong>Caching<br />
</strong></u></p>
<p>TBA</p>
<p><u><strong>Distribution and Replication</strong></u></p>
<p>For applications that receive large volumes of queries, a single Solr server may not be enough to meet performance requirements.  Therefore, Solr provides mechanisms for <strong>replicating the Lucene index across multiple servers</strong> that are part of a load-balanced suite of query servers.  The replication process is handled through a combination of event listeners enabled through the solrconfig.xml file and several shell scripts (located in solr/bin of the example application).</p>
<p>In a replicating architecture, one Solr server acts as the <strong>master </strong>server, providing copies of the index (called <code>snapshots</code>) to one or more <strong>slave </strong>servers that handle query requests.  Indexing commands are sent to the master server and queries are sent to the slave servers. The master server can create snapshots manually or by configuring the <code>&lt;updateHandler&gt;</code> section of solrconfig.xml to trigger snapshot creation when <code>commit</code> and/or <code>optimize</code> events are received.  In either the manual or the event-driven process, the <code>snapshooter</code> script is invoked on the master server, creating a directory on the server named <code>snapshot.yyyymmddHHMMSS</code> where <code>yyyymmddHHMMSS</code> is the actual time the snapshot was created. The slave servers then use <strong><i>rsync</i> </strong>to copy only those files in the Lucene index that have been changed.</p>
<pre class="xml" name="code">
&lt;listener event=&quot;postCommit&quot; class=&quot;solr.RunExecutableListener&quot;&gt;
    &lt;str name=&quot;exe&quot;&gt;snapshooter&lt;/str&gt;
    &lt;str name=&quot;dir&quot;&gt;solr/bin&lt;/str&gt;
    &lt;bool name=&quot;wait&quot;&gt;true&lt;/bool&gt;
    &lt;arr name=&quot;args&quot;&gt; &lt;str&gt;arg1&lt;/str&gt; &lt;str&gt;arg2&lt;/str&gt; &lt;/arr&gt;
    &lt;arr name=&quot;env&quot;&gt; &lt;str&gt;MYVAR=val1&lt;/str&gt; &lt;/arr&gt;
&lt;/listener&gt;</pre>
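<p>The snapshot naming scheme has a handy property that a small Java sketch can show: because the timestamp is zero-padded and ordered from year down to second, the newest snapshot is simply the one whose name sorts last. This is only an illustration under my own assumptions. I read <code>yyyymmddHHMMSS</code> as year, month, day, hour, minute and second, the class and method names are made up, and the real work is done by the Solr shell scripts, not by code like this.</p>
<pre name="code" class="java">
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SnapshotNameSketch {

    // Java pattern letters for year, month, day, hour, minute, second.
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern(&quot;yyyyMMddHHmmss&quot;);

    // Directory name for a snapshot taken at the given time.
    static String snapshotName(LocalDateTime createdAt) {
        return &quot;snapshot.&quot; + createdAt.format(STAMP);
    }

    // Zero-padded stamps of equal length sort chronologically as plain strings.
    static String newest(String a, String b) {
        return a.compareTo(b) &gt;= 0 ? a : b;
    }

    public static void main(String[] args) {
        String older = snapshotName(LocalDateTime.of(2008, 7, 5, 23, 59, 59));
        String newer = snapshotName(LocalDateTime.of(2008, 7, 6, 11, 22, 34));
        System.out.println(newest(older, newer));
    }
}
</pre>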
<h2>Reference</h2>
<p>Below are some cool references I found:</p>
<ol>
<li><span style="color: rgb(153, 153, 153);"><a href="http://www.ibm.com/developerworks/java/library/j-solr1/">Search smarter with Apache Solr, Part 1:</a> </span>Essential features and the Solr schema</li>
<li><span style="color: rgb(153, 153, 153);"><a href="http://www.ibm.com/developerworks/java/library/j-solr2/">Search smarter with Apache Solr, Part 2:</a> </span>Solr for the enterprise</li>
<li><a href="http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf">Advanced Lucene</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-3-solr/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Powerful Full Text Search - Part 2 Nutch</title>
		<link>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-2-nutch/</link>
		<comments>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-2-nutch/#comments</comments>
		<pubDate>Sun, 06 Jul 2008 11:22:34 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[02. Build your site]]></category>

		<category><![CDATA[11. Architect Corner]]></category>

		<category><![CDATA[crawler]]></category>

		<category><![CDATA[hadoop]]></category>

		<category><![CDATA[HDFS]]></category>

		<category><![CDATA[indexing service]]></category>

		<category><![CDATA[lucene]]></category>

		<category><![CDATA[nutch]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=164</guid>
		<description><![CDATA[Introduction of Nutch &#38; Hadoop
After Lucene, its author created another powerful tool: Nutch, a crawler built on top of Lucene. With Nutch, you can launch a multi-threaded crawler to gather information from the Net. At the time of writing, Nutch is at version 0.9. Nutch comes with [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction of Nutch &amp; Hadoop</h2>
<p>After Lucene, its author created another powerful tool: <strong>Nutch</strong>, a crawler built on top of Lucene. With Nutch, you can launch a multi-threaded crawler to gather information from the Net. At the time of writing, Nutch is at version 0.9. It comes with a list of cool features, including <strong>whole-Web crawling</strong> and <strong>local file crawling</strong> for the intranet, <strong>indexing</strong> content all the while.</p>
<p><strong>Hadoop </strong>was designed to handle the petabytes of data that Nutch could potentially store and process. In fact, Hadoop has its own file system: the <strong>Hadoop Distributed File System</strong> (HDFS), which can run on any old run-of-the-mill, low-cost hardware.</p>
<p>Hadoop works by storing part of the file system&#8217;s data across all the servers in the cluster. As new queries come in, HDFS follows the &quot;moving computation is cheaper than moving data&quot; rule, meaning that moving the processing of the query as close as possible to the data will be faster than placing the query at random within the cluster and moving data long distances across the network.</p>
<p>I searched around to see if anyone could give me some tips on this tool. Surprisingly, I didn&#8217;t find much. But don&#8217;t worry, I found a few resources that can at least get you started playing with it.</p>
<p><span id="more-164"></span></p>
<h2>Set up Nutch</h2>
<p>Here is the <a href="http://peterpuwang.googlepages.com/NutchGuideForDummies.htm">guide written by Peter Wang</a> that I followed to bring my Nutch up. Follow it and get your Nutch running before going further. By the way, if you want to run Nutch with Solr, this is a good <a href="http://wiki.apache.org/nutch/RunningNutchAndSolr">tutorial</a>.</p>
<h2>Nutch Architectural Review</h2>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/06/powerful-full-text-search-part-2-nutch/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Powerful Full Text Search Engine - Part 1 Lucene Introduction</title>
		<link>http://www.solutionhacker.com/2008/07/04/powerful-full-text-search-engine-lucene-part-1/</link>
		<comments>http://www.solutionhacker.com/2008/07/04/powerful-full-text-search-engine-lucene-part-1/#comments</comments>
		<pubDate>Fri, 04 Jul 2008 08:38:53 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[02. Build your site]]></category>

		<category><![CDATA[10. Unleash your system]]></category>

		<category><![CDATA[digester]]></category>

		<category><![CDATA[full-text search]]></category>

		<category><![CDATA[grep]]></category>

		<category><![CDATA[lucene]]></category>

		<category><![CDATA[REST]]></category>

		<category><![CDATA[solr]]></category>

		<category><![CDATA[web service]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=160</guid>
		<description><![CDATA[Introduction of Lucene
I have heard of Lucene and its powerful full-text search capability many times. Today I decided to take a look at it. Before diving into the user guide, I looked for a Google Tech Talk video about Lucene. Here is what I found:

After I finished this video, [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction of Lucene</h2>
<p>I have heard of Lucene and its powerful full-text search capability many times. Today I decided to take a look at it. Before diving into the user guide, I looked for a Google Tech Talk video about Lucene. Here is what I found:</p>
<p><embed type="application/x-shockwave-flash" src="http://video.google.com/googleplayer.swf?docid=3675183678687996842&amp;hl=en&amp;fs=true" allowfullscreen="true" style="width: 400px; height: 326px;" id="VideoPlayback"></embed></p>
<p>After watching this video, I was convinced that Lucene is a really great tool for me, so I decided to take a deeper look. After a quick search, I found a great <a href="http://immike.net/blog/2007/07/03/full-text-search-with-apache-lucene/">blog</a> that showed me how to use Lucene with Digg. With Solr on top of Lucene, you can make Lucene available as a RESTful web service. Awesome, isn&#8217;t it? In this article, I list all the information I found during my little research on Lucene, and I hope you will find it useful.</p>
<p><span id="more-160"></span></p>
<h3><!--more-->Architecture Overview</h3>
<p>Before we dig into the code or the setup guides, I would like to get a high-level picture of Lucene first. I borrowed a diagram from this article that helped me grasp the key components of search.</p>
<p><img src="http://www.solutionhacker.com/wp-content/uploads/lucene1.gif" alt="" /></p>
<p>This high-level picture shows that the <strong>search keywords</strong> you enter (normally in a form) become an HTTP <strong>search request</strong>, which the <strong>query parser</strong> translates into a form the search engine understands. The search engine then runs the search against the index files previously prepared by the <strong>indexer</strong>. After that, the results are <strong>ranked by a predefined ranking algorithm</strong> and returned to the user. The <strong>source</strong> of the data can be a web service, a database or documents in your file system. The diagram also shows that you can launch a <strong>spider</strong> or <strong>crawler</strong>, as Google does, to fetch web pages from the Internet and feed them to the indexer as your source.</p>
<h2>Get one step deeper</h2>
<p style="text-align: left;"><strong>Now you know the high-level flow of how search works. Let&#8217;s go one step further into the details.</strong></p>
<ol>
<li>What search interface should you use?</li>
<li>How does the search interface communicate with your search engine?</li>
<li>What kinds of search does the search engine provide?</li>
<li>How does the search engine index the documents?</li>
<li>How are results ranked, and what kinds of ranking algorithms are normally used?</li>
</ol>
<p><strong>Below are the answers to the questions above:</strong></p>
<ol>
<li>Up to you. I would use Flex as I want to provide a rich search interface to my users.</li>
<li>Flex can talk HTTP, Web Service or RemoteObject AMF. If you put a web service layer on top of Lucene (i.e. Solr), you can use REST calls (i.e. plain HTTP) to obtain the results.</li>
<li>Lucene supports several kinds of advanced searches like: 
<ul>
<li><u>Boolean operators</u> - users can compose queries using AND, OR and NOT.</li>
<li><u>Field Search</u> - restricts the search to specific fields, such as title, author or content.</li>
<li><u>Wildcard Search </u>- supports * and ?.</li>
<li><u>Fuzzy Search </u>- Lucene provides a fuzzy search that&#8217;s based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query &quot;think~&quot; searches for terms <strong>similar </strong>in spelling to the term &quot;think.&quot; The key word here is &quot;similar&quot;: it is about spelling, not meaning. Horse and donkey are related in meaning but not in spelling, whereas hose and horse are only one letter apart.</li>
<li><u>Range Search </u>- for example, on an age or a date field.</li>
</ul>
</li>
<li>Large topic. I will go back to it later.</li>
<li>Up to you. If you want to look at the most popular ranking algorithm in the world, check out <strong>Google PageRank</strong>. It is one of the algorithms that many of us would like to understand. When I wanted my <a href="http://www.justproposed.com">wedding website - Justproposed.com</a> to show up at the top of the results for the search keywords &quot;wedding website&quot;, I looked into <strong>SEO</strong>. It is a fun area to explore. Generally speaking, a document weighs more if the query keywords appear in its title, and ranks higher if the keyword frequency is higher, and so on. However, I know Google also puts a lot of weight on links, so ranking is not based purely on the document itself. How to collect that additional information during crawling is beyond the scope of this article.</li>
</ol>
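<p>To get a feel for what &quot;similar in spelling&quot; means in the fuzzy search answer above, here is a small Java sketch of the classic Levenshtein edit distance: the number of single-character insertions, deletions and substitutions needed to turn one word into another. This is the textbook algorithm, not Lucene&#8217;s own code, and the class and method names are my own.</p>
<pre name="code" class="java">
public class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j &lt;= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i &lt;= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletes
            for (int j = 1; j &lt;= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance(&quot;hose&quot;, &quot;horse&quot;));   // 1: insert the r
        System.out.println(distance(&quot;horse&quot;, &quot;donkey&quot;)); // 4: far apart in spelling
    }
}
</pre>
<p>A fuzzy search can then treat terms within a small distance of the query word as matches, which is why hose can match horse while donkey cannot.</p>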
<h2>Get your hand dirty</h2>
<p>Look into this great <a href="http://www-128.ibm.com/developerworks/java/library/wa-lucene2/index.html?ca=drs-">article</a>.</p>
<p><em>What this article doesn&#8217;t mention is that you need to create your <strong>dataDir </strong>and <strong>indexDir </strong>folders under C: and put some HTML files into dataDir before you start the web server. If you add new HTML files later, you need to clean out indexDir and restart the web server in order to rebuild the index.</em></p>
<p>I have got the application up and running. It was a nice trial. My next step is to enhance this example. I will do the following:</p>
<ol>
<li>Use Flex as search interface</li>
<li>Use Solr to expose the Lucene search engine as Web Service.</li>
<li>Have Flex call my search engine via REST.</li>
<li>Display the results in Flex.</li>
</ol>
<p>After I have my new enhancements working, I would do the following:</p>
<ol>
<li>Look into how Lucene does the indexing.</li>
<li>Look into Nutch, so I can have it crawl some sites and put the HTML files in dataDir for me automatically.</li>
</ol>
<h2>How Does Lucene Index the Documents?</h2>
<p>Yes, I haven&#8217;t forgotten to answer question 4. Here is the <a href="http://www.ibm.com/developerworks/library/wa-lucene/">article </a>that answers it. To summarize, here are several key points I extracted from this article.</p>
<ol>
<li><u>Content Extraction</u> - Lucene only indexes <strong>text</strong>, so parsers are used to extract the content from different types of documents, such as Word, HTML and PDF files. If you have a type of document for which you cannot find a parser, it is your responsibility to extract the content for Lucene. This <a href="http://www.ibm.com/developerworks/web/library/j-lucene/">article </a>shows you how to use <strong>Digester </strong>to extract content from XML and feed it to Lucene. If you have a large pool of XML to extract content from, pay attention to the parsing time. Someone has done <a href="http://grep.codeconsult.ch/2005/02/16/lucene-rocks/">this </a>and published some performance numbers as a reference, although that article is a bit outdated.</li>
<li><u>Content Preprocessing</u><strong> - Analyzer</strong>: before text is indexed, it is passed through an <code>Analyzer</code>, which extracts the indexable <strong>tokens </strong>from the text and discards the rest.  Lucene comes with a few different <code>Analyzer</code> implementations.  Some of them skip stop words (frequently-used words that don&#8217;t help distinguish one document from another, such as &quot;a,&quot; &quot;an,&quot; &quot;the,&quot; &quot;in,&quot; &quot;on,&quot; etc.), some convert all tokens to lowercase letters so that searches are not case-sensitive, and so on.</li>
<li><u>Indexing</u><strong> - IndexWriter </strong>is the key component in the indexing process. This class uses the Analyzer that you pass in as a parameter to create a new index, or open an existing one, and add documents to it. You set up <strong>documents </strong>made of <strong>fields </strong>and feed them to the IndexWriter to do the job. In the code below, you fetch a list of .txt files from a directory, along with metadata such as their paths, and feed them to the IndexWriter, which indexes them one after another.</li>
<li><u>Configuration </u>- You can <strong>configure </strong>IndexWriter to achieve better performance by increasing its buffer size, because the bottleneck is normally the I/O on the index files.</li>
<li>Lucene uses the <strong>inverted index </strong>concept. An inverted index is an inside-out arrangement of documents in which terms take center stage: each term points to a list of the documents that contain it. In contrast, in a <i>forward index</i>, documents take center stage, and each document refers to a list of the terms it contains. With an inverted index, you can easily find which documents contain certain terms.</li>
</ol>
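<pre name="code" class="java">
// Fragment using the older Lucene API from the referenced article.
// Assumes textFiles is a File[] listing of the data directory and
// indexWriter is an already opened IndexWriter.
// Needs java.io.FileReader, java.io.Reader and
// org.apache.lucene.document.Document / Field.
for (int i = 0; i &lt; textFiles.length; i++) {
    // Only index regular files whose name ends with .txt
    if (textFiles[i].isFile() &amp;&amp; textFiles[i].getName().endsWith(&quot;.txt&quot;)) {
        Reader textReader = new FileReader(textFiles[i]);
        Document document = new Document();
        // The file content, tokenized and indexed
        document.add(Field.Text(&quot;content&quot;, textReader));
        // The file path, indexed verbatim without tokenizing
        document.add(Field.Keyword(&quot;path&quot;, textFiles[i].getPath()));
        indexWriter.addDocument(document);
    }
}
</pre>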
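<p>The analyzer and inverted index ideas from the list above can be sketched in a few lines of plain Java. This is a toy written under my own assumptions, not how Lucene is implemented: the class name, the tiny stop word list and the split-on-non-alphanumerics rule are all made up for illustration.</p>
<pre name="code" class="java">
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Hypothetical stop word list, taken from the examples in the text.
    private static final Set&lt;String&gt; STOP_WORDS =
            new HashSet&lt;String&gt;(Arrays.asList(&quot;a&quot;, &quot;an&quot;, &quot;the&quot;, &quot;in&quot;, &quot;on&quot;));

    // Inverted index: each term points to the ids of the documents containing it.
    private final Map&lt;String, SortedSet&lt;Integer&gt;&gt; postings =
            new HashMap&lt;String, SortedSet&lt;Integer&gt;&gt;();

    // Toy analyzer: lowercase, split on non-alphanumerics, drop stop words.
    static List&lt;String&gt; analyze(String text) {
        List&lt;String&gt; tokens = new ArrayList&lt;String&gt;();
        for (String t : text.toLowerCase(Locale.ROOT).split(&quot;[^a-z0-9]+&quot;)) {
            if (!t.isEmpty() &amp;&amp; !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one document: record its id under every token it contains.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            SortedSet&lt;Integer&gt; ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet&lt;Integer&gt;();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
    }

    // Look up a single word; the query goes through the same analyzer.
    SortedSet&lt;Integer&gt; search(String word) {
        List&lt;String&gt; terms = analyze(word);
        SortedSet&lt;Integer&gt; ids = terms.isEmpty() ? null : postings.get(terms.get(0));
        return ids == null ? new TreeSet&lt;Integer&gt;() : ids;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, &quot;The horse is in the barn&quot;);
        index.add(2, &quot;A donkey and a Horse&quot;);
        System.out.println(index.search(&quot;HORSE&quot;)); // both documents contain the term
    }
}
</pre>
<p>Because the query goes through the same analyzer as the documents, searching for &quot;HORSE&quot; finds both documents, while a stop word like &quot;the&quot; finds nothing.</p>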
<p>Lucene offers four different types of fields from which a developer can choose: Keyword, UnIndexed, UnStored, and Text.</p>
<blockquote>
<p><strong>Keyword </strong>fields are not tokenized, but are indexed and stored in the index verbatim.  This field is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.</p>
<p><strong>UnIndexed </strong>fields are neither tokenized nor indexed, but their value is stored in the index word for word.  This field is suitable for fields that you need to display with search results, but whose values you will never search directly.  Because this type of field is not indexed, searches against it are slow. Since the original value of a field of this type is stored in the index, this type is not suitable for storing fields with very large values, if index size is an issue.</p>
<p><strong>UnStored </strong>fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This field is suitable for indexing large amounts of text that does not need to be retrieved in its original form, such as the bodies of Web pages, or any other type of text document.</p>
<p><strong>Text </strong>fields are tokenized, indexed, and stored in the index. This implies that fields of this type can be searched, but be cautious about the size of the field stored as Text field.</p>
</blockquote>
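<p>The four field types quoted above boil down to three yes/no properties: tokenized, indexed and stored. The following Java sketch just restates them as a small lookup table. The enum is my own summary for illustration and is not part of the Lucene API.</p>
<pre name="code" class="java">
public class FieldKindSketch {

    // One constant per field type; flags follow the descriptions quoted above.
    enum FieldKind {
        KEYWORD(false, true, true),
        UNINDEXED(false, false, true),
        UNSTORED(true, true, false),
        TEXT(true, true, true);

        final boolean tokenized;
        final boolean indexed;
        final boolean stored;

        FieldKind(boolean tokenized, boolean indexed, boolean stored) {
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }

        // Only indexed fields can be found through the index.
        boolean searchable() { return indexed; }

        // Only stored fields can be shown verbatim with the search results.
        boolean retrievable() { return stored; }
    }

    public static void main(String[] args) {
        for (FieldKind kind : FieldKind.values()) {
            System.out.println(kind + &quot; tokenized=&quot; + kind.tokenized
                    + &quot; indexed=&quot; + kind.indexed + &quot; stored=&quot; + kind.stored);
        }
    }
}
</pre>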
<h2>Conclusion</h2>
<p>To use Lucene, there are three main concepts you need to grasp:</p>
<ol>
<li>Indexer - create search engine indexes</li>
<li>Analyzer - splits text into tokens that make sense for the search engine. The structure is: a document is a sequence of fields, each field is a name/value pair, and each value is broken into tokens. <span style="font-weight: bold;">Field </span>values may be <span style="font-style: italic;">stored</span>, <span style="font-style: italic;">indexed </span>or <span style="font-style: italic;">analyzed/tokenized </span>(and, now, <span style="font-style: italic;">vectored</span>). The <a href="http://lucene.sourceforge.net/talks/pisa/">lecture notes</a> from Doug Cutting will give you more detail.</li>
<li>Searcher</li>
</ol>
<p>You may think of using <strong>grep </strong>or a <strong>database </strong>to achieve what Lucene does. <strong>Grep </strong>is a powerful Linux tool, but it becomes inefficient when you search files that are several MB in size. The reason is that grep doesn&#8217;t build an index of your files ahead of the search, so it has to scan them every time. A database can index a varchar field, but not in as sophisticated a way as Lucene. Oracle may provide full-text search, but I am not familiar with it. One key thing to remember: Lucene is open source, free and does the job extremely well. Why bother digging into a costly Oracle solution?</p>
<p>Lucene gives our web applications a rich search engine capability. It has many features that I haven&#8217;t had a chance to discuss in this article. I will continue to write more articles on this topic as my research moves forward. Have a nice day!</p>
<h2>Reference</h2>
<p>The <a href="http://blog.lucene.com/">blog </a>of Doug Cutting, the creator of Lucene</p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/04/powerful-full-text-search-engine-lucene-part-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Grid Computing - Part 1 Introduction</title>
		<link>http://www.solutionhacker.com/2008/07/03/grid-computing-part-1-introduction/</link>
		<comments>http://www.solutionhacker.com/2008/07/03/grid-computing-part-1-introduction/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 11:30:39 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[05. Scale your site]]></category>

		<category><![CDATA[11. Architect Corner]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=158</guid>
		<description><![CDATA[Introduction from Cameron Purdy
]]></description>
			<content:encoded><![CDATA[<h2>Introduction from Cameron Purdy</h2>
<p>&nbsp;</p>
<p><object height="344" width="425"><param name="movie" value="http://www.youtube.com/v/4Sq45B8wAXc&amp;hl=en&amp;fs=1" /><param name="allowFullScreen" value="true" /><embed height="344" width="425" allowfullscreen="true" type="application/x-shockwave-flash" src="http://www.youtube.com/v/4Sq45B8wAXc&amp;hl=en&amp;fs=1"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/2008/07/03/grid-computing-part-1-introduction/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
