<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
xmlns:rawvoice="http://www.rawvoice.com/rawvoiceRssModule/"
>

<channel>
	<title>Solution Hacker &#187; grep</title>
	<atom:link href="http://www.solutionhacker.com/tag/grep/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.solutionhacker.com</link>
	<description>This blog provides solutions for entrepreneurs!</description>
	<lastBuildDate>Sun, 05 Feb 2012 00:45:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=205</generator>
<!-- podcast_generator="Blubrry PowerPress/2.0.4" -->
	<itunes:summary>This blog provides solutions for entrepreneurs!</itunes:summary>
	<itunes:author>Solution Hacker</itunes:author>
	<itunes:explicit>no</itunes:explicit>
	<itunes:image href="http://www.solutionhacker.com/wp-content/plugins/powerpress/itunes_default.jpg" />
	<itunes:subtitle>This blog provides solutions for entrepreneurs!</itunes:subtitle>
	<image>
		<title>Solution Hacker &#187; grep</title>
		<url>http://www.solutionhacker.com/wp-content/plugins/powerpress/rss_default.jpg</url>
		<link>http://www.solutionhacker.com</link>
	</image>
		<item>
		<title>Powerful Linux Text Processing Commands</title>
		<link>http://www.solutionhacker.com/uncategorized/powerful-linux-text-processing-commands/</link>
		<comments>http://www.solutionhacker.com/uncategorized/powerful-linux-text-processing-commands/#comments</comments>
		<pubDate>Fri, 17 Apr 2009 17:21:57 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[6. Uncategorized]]></category>
		<category><![CDATA[System]]></category>
		<category><![CDATA[cat]]></category>
		<category><![CDATA[find]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[sort]]></category>
		<category><![CDATA[uniq]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=226</guid>
		<description><![CDATA[<h2>Common Text Processing Commands</h2>
<p><img height="120" width="120" alt="" src="http://www.solutionhacker.com/wp-content/uploads/image/linux-logo.jpg" class="alignleft" />In our daily work, we deal with lots of data. The data is normally stored in text format so that humans can read it easily. With so much data, we need ways to handle it. There are several things we frequently do with it: <strong>Search</strong>, <strong>Filter</strong>, <strong>Sort</strong> and <strong>Analyze</strong>. Linux has some powerful commands for these jobs: <strong>cat</strong>, <strong>grep</strong>, <strong>find</strong>, <strong>sort</strong>, <strong>uniq</strong> and so on. I find these commands quite powerful, so I decided to write them down as my reference. In this tutorial I will go over the basic text processing commands and how to use them together for tasks we often encounter in the workplace.</p>
<!--more-->
<h3>cat</h3>
<p>The power of "cat" is not just that it prints a file to the screen: it concatenates the contents of a list of files and streams the result through a pipe to another program as input.</p>
<blockquote>
<p><tt>cat * &#124; sort</tt></p>
</blockquote>
<h3>find</h3>
<p>The power of find is that it lists the filenames that match criteria based on the <strong>metadata </strong>of the files, such as type, size or modification time.</p>
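<p>Here is a minimal sketch of find matching on metadata. The scratch directory and the file names are made up for illustration; only the -type and -name tests are used.</p>
<pre name="code" class="bash">
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
mkdir "$dir/logs"
touch "$dir/notes.txt" "$dir/logs/app.log" "$dir/logs/old.txt"

# Regular files whose name ends in .txt, anywhere under $dir.
txt_files=$(find "$dir" -type f -name '*.txt' | sort)
echo "$txt_files"

# Directories only (this includes $dir itself).
dirs=$(find "$dir" -type d | sort)
echo "$dirs"
</pre>
<p>Quoting '*.txt' matters: it makes find do the matching instead of letting the shell expand the pattern first.</p>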
<h3>grep</h3>
<p>"grep" lists the lines, and the files they come from, whose content matches a pattern written as a regular expression. You can use it as a content search across the files in your file system.</p>
<blockquote>
<p><tt>grep -H -R --color -n -P abc *</tt></p>
</blockquote>
<p>options:</p>
<ol>
    <li>--color (highlight the matching part of each line in color)</li>
    <li>-n (show the line number of each match)</li>
    <li>-P PATTERN (treat the pattern as a Perl regular expression; a GNU grep option)</li>
    <li>-R (search directories recursively)</li>
    <li>-l (only list the names of the files that contain a match)</li>
    <li>-H (show the filename in front of each match)</li>
</ol>
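<p>A small sketch of grep reading from a pipe instead of files. The sample lines are made up, and -c (count matching lines) and -E (extended regular expressions, used here as a portable stand-in for -P) are options that are not in the list above.</p>
<pre name="code" class="bash">
# Sample lines are made up for illustration.
sample='error: disk full
ok: backup done
error: network down
ok: disk check'

# -n prefixes each matching line with its line number.
matches=$(printf '%s\n' "$sample" | grep -n 'disk')
echo "$matches"
# 1:error: disk full
# 4:ok: disk check

# -c counts the matching lines instead of printing them.
error_count=$(printf '%s\n' "$sample" | grep -c '^error')
echo "$error_count"

# -E allows alternation with | inside the pattern.
either=$(printf '%s\n' "$sample" | grep -E 'backup|network')
echo "$either"
</pre>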
<h3>cut</h3>
<p>"cut" extracts sections from each line of input (<a href="http://en.wikipedia.org/wiki/Cut_%28Unix%29">example of usage</a>). The command below extracts the 5th field and every field after it from each line of fileA, using a colon as the delimiter.</p>
<blockquote>
<p><tt>cut -d ":" -f 5- fileA <br />
</tt></p>
</blockquote>
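<p>options:</p>
<ol>
    <li>-c (select by character position)</li>
    <li>-b (select by byte position)</li>
    <li>-f 5- (select by field, from field 5 to the end of the line, when the line can be split by a delimiter)</li>
    <li>-d '&#124;' (use the pipe character as the delimiter; quote it so the shell does not treat it as a pipe)</li>
</ol>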
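<p>The field and character options can be tried on a single line. The sample line below is made up in the style of /etc/passwd.</p>
<pre name="code" class="bash">
# A made-up line with seven colon-separated fields.
line='alice:x:1000:1000:Alice Example:/home/alice:/bin/bash'

# -f 5- keeps field 5 and everything after it.
tail_fields=$(printf '%s\n' "$line" | cut -d ':' -f 5-)
echo "$tail_fields"   # Alice Example:/home/alice:/bin/bash

# -f 1,7 keeps only fields 1 and 7.
name_shell=$(printf '%s\n' "$line" | cut -d ':' -f 1,7)
echo "$name_shell"    # alice:/bin/bash

# -c 1-5 keeps the first five characters.
first_chars=$(printf '%s\n' "$line" | cut -c 1-5)
echo "$first_chars"   # alice
</pre>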
<h3>sort</h3>
<p><span id="intelliTXT" name="intelliTxt">  The <strong>sort</strong> command sorts a file according to fields--the individual pieces of data on each line. By default, <strong>sort</strong> assumes that the fields are just words separated by <strong>blanks</strong>, but you can specify an alternative field delimiter if you want (such as commas or colons). Output from <strong>sort</strong> is printed to the screen, unless you redirect it to a file.</span></p>
<blockquote>
<p><span style="font-family: Courier;"><b>donors.data</b><br />
Bay Ching 500000 China<br />
Jack Arta 250000 Indonesia<br />
Cruella Lumper 725000 Malaysia</span></p>
<p>Let's take this sample donors file and sort it by last name. The following shows the command to sort the file on the <strong>second field (last name) </strong>and the output from the command. Older versions of sort wrote this key as <strong>+1 -2</strong>, counting fields from zero; that form is obsolete, and <strong>-k 2,2</strong> is the portable equivalent:</p>
<p><span style="font-family: Courier;"><strong>sort -k 2,2 donors.data</strong><br />
Jack Arta 250000 Indonesia<br />
Bay Ching 500000 China<br />
Cruella Lumper 725000 Malaysia</span></p>
</blockquote>
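<p>If the file is delimited by another character, such as a comma or a colon, use <em><strong>-t </strong></em>to tell sort the delimiter (for example <em><strong>-t , </strong></em>or <em><strong>-t:</strong></em>). You can also use<em><strong> -u</strong></em> to drop duplicates. The example below sorts a colon-delimited file on its second field (the ID number):</p>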
<p><span name="intelliTxt" id="intelliTXT"><blockquote>
<p><span style="font-family: Courier;"><strong>sort -t: -k 2,2 company.data</strong><br />
Nasium, Jim:031762:Marketing <br />
Jucacion, Ed:396082:Sales<br />
Itorre, Jan:406378:Sales<br />
Ancholie, Mel:636496:Research</span></p>
<p>To sort the file on the third field (department name) and suppress the duplicates, use this command:</p>
<p><span style="font-family: Courier;"><strong>sort -t: -u -k 3 company.data</strong><br />
Nasium, Jim:031762:Marketing<br />
Ancholie, Mel:636496:Research<br />
Itorre, Jan:406378:Sales</span></p>
</blockquote>
<p>Note that the line for Ed Jucacion did not print, because he's in Sales, and we asked the command (with the <strong>-u</strong> flag) to suppress lines that were the same in the <strong>sort</strong> field.</p>
</span></p>
<p>option:<span name="intelliTxt" id="intelliTXT">
<ol>
    <li><strong>-f</strong> Treat lowercase letters as uppercase when comparing (so "Bill" and "bill" are treated the same); the output itself is not changed.</li>
    <li><strong>-r</strong> Sort in reverse order (so "Z" starts the list instead of "A").</li>
    <li><strong>-n</strong> Sort a column in numerical order.</li>
    <li><strong>-t<em>x</em></strong> Use <em>x</em> as the field delimiter (replace <em>x</em> with a comma or other character).</li>
    <li><strong>-u</strong> Suppress all but one line in each set of lines with equal sort fields (so if you sort on a field containing last names, only one "Smith" will appear even if there are several).</li>
    <li>Specify the sort key with <strong>-k <em>m</em>,<em>n</em></strong>: start at the first character of the <em>m</em>th field and end at the last character of the <em>n</em>th field (if ,<em>n</em> is omitted, the key runs to the end of the line). The obsolete form <strong>+<em>m</em> -<em>n</em></strong> counts fields from zero, so <strong>+1 -2</strong> means the same as <strong>-k 2,2</strong>.</li>
</ol>
</span></p>
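<p>The options above can be combined, as in this sketch that feeds the sample donors and company rows to sort through a pipe. The row order of the company data and the variable names are my own assumptions; the last command pipes through cut so that only the department names are shown.</p>
<pre name="code" class="bash">
donors='Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia'

# Sort on the second field only (last name).
by_name=$(printf '%s\n' "$donors" | sort -k 2,2)
echo "$by_name"

# Sort on the third field, numerically (n) and in reverse (r),
# so the largest donation comes first.
by_amount=$(printf '%s\n' "$donors" | sort -k 3,3nr)
echo "$by_amount"

company='Jucacion, Ed:396082:Sales
Nasium, Jim:031762:Marketing
Itorre, Jan:406378:Sales
Ancholie, Mel:636496:Research'

# One line per department: sort on field 3 with -u, then keep that field.
depts=$(printf '%s\n' "$company" | sort -t: -u -k 3,3 | cut -d ':' -f 3)
echo "$depts"
</pre>
<p>Without the n modifier the amounts would be compared as text, which only happens to work here because all three numbers have the same number of digits.</p>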
<h3>uniq</h3>
<p>uniq - line level uniqueness. It prints the unique lines in a sorted file, retaining only one of a run of matching lines. Optionally, it can show only lines that appear exactly once, or lines that appear more than once. <strong>uniq requires sorted input since it compares only consecutive lines.</strong></p>
<p>option:</p>
<ol>
    <li>-u (print only the unique lines, that is, lines that appear exactly once)</li>
    <li>-d (print only the duplicated lines, that is, lines that appear more than once)</li>
    <li>-c (prefix each line with its number of occurrences)</li>
</ol>
<p><code>bash$ cat testfile<br />
This line occurs only once.<br />
This line occurs twice.<br />
This line occurs twice.<br />
This line occurs three times.<br />
This line occurs three times.<br />
This line occurs three times.</code></p>
<p><code><br />
bash$ uniq -c testfile<br />
1 This line occurs only once.<br />
2 This line occurs twice.<br />
3 This line occurs three times.</code></p>
<p><code><br />
bash$ sort testfile &#124; uniq -c &#124; sort -nr <br />
3 This line occurs three times.<br />
2 This line occurs twice.<br />
1 This line occurs only once.  </code></p>
<h3>wc</h3>
<p>wc - word count. Apart from counting words, it can also report the following:</p>
<ol>
    <li><tt class="USERINPUT"><b>wc -w</b></tt> gives only the word count.</li>
    <li><tt class="USERINPUT"><b>wc -l</b></tt> gives only the line count.</li>
    <li><tt class="USERINPUT"><b>wc -c</b></tt> gives only the byte count.</li>
    <li><tt class="USERINPUT"><b>wc -m</b></tt> gives only the character count.</li>
    <li><tt class="USERINPUT"><b>wc -L</b></tt> gives only the length of the longest line.</li>
</ol>
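<p>A quick check of the counters on a two-line sample (the text is made up). Some wc implementations pad the numbers with spaces, so the values are compared as numbers rather than as strings.</p>
<pre name="code" class="bash">
text='one two three
four five'

words=$(printf '%s\n' "$text" | wc -w)   # 5 words
lines=$(printf '%s\n' "$text" | wc -l)   # 2 lines
bytes=$(printf '%s\n' "$text" | wc -c)   # 24 bytes, newlines included

echo "words=$words lines=$lines bytes=$bytes"
</pre>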
<h3>tr</h3>
<p>"tr" translates or deletes characters. It is handy for data cleaning jobs. It works character by character, so it cannot do pattern replacement (sed, covered below, handles that).</p>
<blockquote>
<p>tr '[:lower:]' '[:upper:]'</p>
</blockquote>
<p>The above command converts all lowercase letters to uppercase.</p>
<blockquote>
<p>tr '.' '/'</p>
</blockquote>
<p>The command above converts every . character to /. Note that -d switches tr from translating to deleting, so leave it off when you translate. You may ask when you would want to delete characters. A common use case is converting Windows (DOS) files to Unix format by removing the carriage returns:</p>
<blockquote>
<p>tr -d '\r' &#60; input_dos_file.txt &#62; output_unix_file.txt</p>
</blockquote>
<p>options:</p>
<ol>
    <li>-s (squeeze each run of a repeated character into a single character, e.g. tr -s '\n')</li>
    <li>-d (delete characters, e.g. tr -d '\000')</li>
</ol>
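<p>The translate, squeeze and delete modes can be compared side by side. The input strings below are made up for illustration.</p>
<pre name="code" class="bash">
# Translate: lowercase to uppercase.
upper=$(printf 'hello world\n' | tr '[:lower:]' '[:upper:]')
echo "$upper"      # HELLO WORLD

# Translate: one character to another.
path=$(printf 'org.apache.lucene.Document\n' | tr '.' '/')
echo "$path"       # org/apache/lucene/Document

# Squeeze: collapse runs of spaces into one space.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')
echo "$squeezed"   # a b c

# Delete: strip the carriage return from a DOS line ending.
no_cr=$(printf 'dos line\r\n' | tr -d '\r')
echo "$no_cr"      # dos line
</pre>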
<h3>sed</h3>
<p>"tr" can do character replacement, but if you want pattern replacement, you need sed. Usage: sed -e 's/pattern/replacement/flags'</p>
<blockquote>
<p>sed -e 's/one/another/'</p>
<p>sed -e 's/[aeiou]/_/g'</p>
</blockquote>
<p>Note the use of the "g" flag in the second command: it applies the replacement to every match on a line instead of just the first one. The quotes keep the shell from interpreting characters such as [ and ].</p>
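<p>To see what the "g" flag changes, here is a small sketch that runs the same substitution with and without it on made-up input.</p>
<pre name="code" class="bash">
# Without g: only the first match on the line is replaced.
first=$(printf 'one one one\n' | sed -e 's/one/another/')
echo "$first"    # another one one

# With g: every match on the line is replaced.
all=$(printf 'one one one\n' | sed -e 's/one/another/g')
echo "$all"      # another another another

# A bracket expression matches any single vowel.
vowels=$(printf 'text processing\n' | sed -e 's/[aeiou]/_/g')
echo "$vowels"   # t_xt pr_c_ss_ng
</pre>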
<h3>awk</h3>
<p>&#160;&#160;</p>
<h2>Put them all together</h2>
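<p>This pipeline concatenates every file in the current directory, keeps the lines that mention lucene-core, takes the second space-separated field, sorts it so that uniq can drop the duplicates, turns each . into /, and finally uses awk to append .class to each result.</p>
<p><strong><code>cat * &#124; grep lucene-core &#124; cut -f2 -d' ' &#124; sort &#124; uniq &#124; tr '.' '/' &#124; awk '{printf "%s.class\n", $1}'</code></strong></p>]]></description>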
			<wfw:commentRss>http://www.solutionhacker.com/uncategorized/powerful-linux-text-processing-commands/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Powerful Full Text Search Engine &#8211; Part 1 Lucene Introduction</title>
		<link>http://www.solutionhacker.com/implement-your-idea/build-your-website/powerful-full-text-search-engine-lucene-part-1/</link>
		<comments>http://www.solutionhacker.com/implement-your-idea/build-your-website/powerful-full-text-search-engine-lucene-part-1/#comments</comments>
		<pubDate>Fri, 04 Jul 2008 08:38:53 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Site Building]]></category>
		<category><![CDATA[System]]></category>
		<category><![CDATA[digester]]></category>
		<category><![CDATA[full-text search]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[REST]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[web service]]></category>

		<guid isPermaLink="false">http://www.solutionhacker.com/?p=160</guid>
		<description><![CDATA[<h2>Introduction of Lucene</h2>
<p>I have heard of Lucene and its powerful full-text search capability many times. Today, I decided to take a look at it. Before diving into the user guide, I went to Google Tech Talks to find a video about Lucene. Here is what I found:&#160;</p>
<embed type="application/x-shockwave-flash" src="http://video.google.com/googleplayer.swf?docid=3675183678687996842&#38;hl=en&#38;fs=true" allowfullscreen="true" style="width: 400px; height: 326px;" id="VideoPlayback"></embed>
<p>After watching this video, I found Lucene to be a really great tool for me, so I decided to take a deeper look at it. After a quick search, I found a great <a href="http://immike.net/blog/2007/07/03/full-text-search-with-apache-lucene/">blog </a>that showed me how to use Lucene with Digg. With Solr on top of Lucene, you can make Lucene available as a RESTful Web Service. Awesome, isn't it? In this article, I will list all the information I found during my little research on Lucene, and I hope you will find it useful.</p>
<p><!--more--></p>
<h3>Architecture Overview</h3>
<p>Before we dig into the code or the setup guidelines, I would like to get a high-level picture of Lucene first. I borrowed a diagram from this article that helped me grasp the key components in search.</p>
<p><img src="http://www.solutionhacker.com/wp-content/uploads/lucene1.gif" alt="" /></p>
<p>This high-level picture shows that the <strong>search keywords</strong> you enter (normally in a form) become an HTTP <strong>search request</strong>, which the <strong>Query parser </strong>then translates into a form the search engine understands. The search engine performs the search against the index files previously prepared by the <strong>Indexer</strong>. After that, the results are <strong>ranked by a predefined ranking algorithm </strong>and returned to the user. The <strong>source </strong>of the data can be a Web Service, a database or documents in your file system. The diagram also shows that you can launch a <strong>spider </strong>or <strong>crawler</strong>, as Google does, to obtain data from web pages on the Internet and feed it to the Indexer as your source.</p>
<h2>Get one step deeper</h2>
<p style="text-align: left;"><strong>Now you know the high-level flow of how search works. Let's go one step deeper into the details.</strong></p>
<ol>
    <li>Which search interface should we use?</li>
    <li>How does the search interface communicate with the search engine?</li>
    <li>What kinds of search does the search engine provide?</li>
    <li>How does the search engine index the documents?</li>
    <li>How are the results ranked, and which ranking algorithms are normally used?</li>
</ol>
<p><strong>Below are the answers to the questions above:</strong></p>
<ol>
    <li>Up to you. I would use Flex because I want to provide a rich search interface to my users.</li>
    <li>Flex can talk HTTP, Web Service or RemoteObject AMF. If you put a web service layer on top of Lucene (i.e. Solr), you can use REST calls (i.e. HTTP) to obtain the results.</li>
    <li>Lucene supports several kinds of advanced searches like: <br />
    <ul>
        <li><u>Boolean operators</u> - users can compose query using AND, OR, NOT</li>
        <li><u>Field Search</u> - what fields the search operates on? like title, author or content?</li>
        <li><u>Wildcard Search </u>- supports * and ?.</li>
        <li><u>Fuzzy Search </u>- Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query &#34;think~&#34; searches for terms <strong>similar </strong>in spelling to the term &#34;think.&#34; The key here is the word &#34;similar&#34;: edit distance measures spelling, not meaning. Hose and horse are close in spelling, while horse and donkey are related in meaning but not in spelling.</li>
        <li><u>Range Search </u>- age, date, etc.</li>
    </ul>
    </li>
    <li>Large topic. I will go back to it later.</li>
    <li>Up to you. If you want to look at a popular ranking algorithm, check out <strong>Google Page Rank. </strong>It is one of the algorithms that many of us are interested in. Because I wanted my <a href="http://www.justproposed.com">wedding website - Justproposed.com </a>to show up at the top of the results when users type &#34;wedding website&#34; as search keywords, I looked into <strong>SEO</strong>. It is a fun area to explore. Generally speaking, a query keyword that appears in the title carries more weight, and a document with a higher keyword frequency ranks higher. However, I know Google puts a lot of weight on links, so the ranking is not based purely on the document itself. How to obtain that additional information during crawling is beyond the scope of this article.</li>
</ol>
<h2>Get your hands dirty</h2>
<p>Look into this great <a href="http://www-128.ibm.com/developerworks/java/library/wa-lucene2/index.html?ca=drs-">article</a>.</p>
<p><em>The thing this article doesn't mention is that you need to create your <strong>dataDir </strong>and <strong>indexDir </strong>folders on the C drive and drag a list of html files into the dataDir before you start the web server. If you drag new html files into it, you need to clean up your indexDir and restart your web server in order to rebuild the indexes.</em></p>
<p>I have got the application up and running. It was a nice trial run. My next step is to enhance this example. I will do the following:</p>
<ol>
    <li>Use Flex as search interface</li>
    <li>Use Solr to expose the Lucene search engine as Web Service.</li>
    <li>Have Flex call my search engine via REST.</li>
    <li>Display the results in Flex.</li>
</ol>
<p>Once these enhancements are working, I will do the following:</p>
<ol>
    <li>Look into how Lucene does the indexing.</li>
    <li>Look into Nutch, so that I can have it crawl some sites and put the html files in dataDir for me automatically.</li>
</ol>
<h2>How does Lucene index the documents?</h2>
<p>Yes, I haven't forgotten to answer question 4. Here is the <a href="http://www.ibm.com/developerworks/library/wa-lucene/">article </a>that answers it. To summarize, here are several key points I extracted from this article.</p>
<ol>
    <li><u>Content Extraction</u> - Lucene only takes <strong>text </strong>for indexing. So, it provides different types of parsers to extract content from different types of documents like word, html, doc, pdf, etc. If you have another type of document for which you cannot find a parser, it is your responsibility to extract the content for Lucene. This <a href="http://www.ibm.com/developerworks/web/library/j-lucene/">article </a>shows you how to use <strong>Digester </strong>to extract content from XML and feed it to Lucene. If you have a large pool of XML for content extraction, you need to pay attention to the parsing time. Someone has done <a href="http://grep.codeconsult.ch/2005/02/16/lucene-rocks/">this </a>and obtained some performance numbers as a reference. However, that article is a bit outdated.</li>
    <li><u>Content Preprocessing</u><strong> - Analyzer </strong>is used to extract the <strong>tokens </strong>from your text content. Before text is indexed, it is passed through an <code>Analyzer</code>. <code>Analyzer</code>s are in charge of extracting indexable tokens out of the text to be indexed, and eliminating the rest. Lucene comes with a few different <code>Analyzer</code> implementations. Some of them skip stop words (frequently used words that don't help distinguish one document from another, such as &#34;a,&#34; &#34;an,&#34; &#34;the,&#34; &#34;in,&#34; &#34;on,&#34; etc.), some convert all tokens to lowercase letters so that searches are not case-sensitive, and so on.</li>
    <li><u>Indexing</u><strong> - IndexWriter </strong>is the key component in the indexing process. This class uses the Analyzer that you pass in as a parameter to create a new index or open an existing index and add documents to it. You need to set up <strong>fields </strong>and <strong>documents </strong>and feed them to the IndexWriter to do the job. In the code below, you fetch a list of .txt files and their metadata (such as the path) from a directory and feed them to the IndexWriter, which indexes them one after another.</li>
    <li><u>Configuration </u>- You can <strong>configure </strong>IndexWriter to achieve better performance by increasing the buffer size, because the bottleneck normally occurs during the IO of the index files.</li>
    <li>Lucene uses the <strong>inverted index </strong>concept. An inverted index is an inside-out arrangement of documents in which terms take center stage. Each term points to a list of documents that contain it. In contrast, in a <i>forward index</i>, documents take center stage, and each document refers to a list of terms it contains. You can use an inverted index to easily find which documents contain certain terms.</li>
</ol>
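<pre name="code" class="java">
// Assumes textFiles is a File[] listing the data directory and
// indexWriter is an already opened IndexWriter.
for(int i = 0; i &#60; textFiles.length; i++){
      // Index only regular files whose name ends with .txt
      if(textFiles[i].isFile() &#38;&#38; textFiles[i].getName().endsWith(&#34;.txt&#34;)){
        Reader textReader = new FileReader(textFiles[i]);
        Document document = new Document();
        // Text field built from a Reader: the content is tokenized and indexed
        document.add(Field.Text(&#34;content&#34;,textReader));
        // Keyword field: the path is indexed and stored verbatim
        document.add(Field.Keyword(&#34;path&#34;,textFiles[i].getPath()));
        indexWriter.addDocument(document);
      }
}
</pre>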
<p>Lucene offers four different types of fields from which a developer can choose: Keyword, UnIndexed, UnStored and Text.</p>
<blockquote>
<p><strong>Keyword </strong>fields are not tokenized, but are indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.</p>
<p><strong>UnIndexed </strong>fields are neither tokenized nor indexed, but their value is stored in the index word for word. This type is suitable for fields that you need to display with search results, but whose values you will never search directly. Because this type of field is not indexed, queries cannot match against it. Since the original value is stored in the index, this type is not suitable for very large values if index size is an issue.</p>
<p><strong>UnStored </strong>fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This type is suitable for indexing large amounts of text that do not need to be retrieved in their original form, such as the bodies of Web pages, or any other type of text document.</p>
<p><strong>Text </strong>fields are tokenized, indexed, and stored in the index. Fields of this type can be searched and their original value retrieved, but because the value is stored, be cautious about the size of anything you store as a Text field.</p>
</blockquote>
<h2>Conclusion</h2>
<p>To use Lucene, there are 3 main concepts you need to grasp. They are:</p>
<ol>
    <li>Indexer - creates the search engine indexes.</li>
    <li>Analyzer - splits text into tokens that make sense for the search engine. The structure is: a document is a sequence of fields, each field is a name/value pair, and each value is broken into tokens. <span style="font-weight: bold;">Field </span>values may be <span style="font-style: italic;">stored</span>, <span style="font-style: italic;">indexed</span>, or <span style="font-style: italic;">analyzed (tokenized)</span> (and, now, <span style="font-style: italic;">vectored</span>). The <a href="http://lucene.sourceforge.net/talks/pisa/">lecture notes</a> from Doug Cutting will give you more detail.</li>
    <li>Searcher - runs queries against the indexes and returns the matching documents.</li>
</ol>
<p>You may think of using <strong>grep</strong> or a <strong>database</strong> to achieve what Lucene does. <strong>Grep</strong> is a powerful Linux tool, but it does not build an index ahead of time, so every search has to scan every file from start to finish. Once your files reach several MB in size, that becomes inefficient. A database can index a varchar field, but not in as sophisticated a way as Lucene. Oracle may provide something comparable, but I am not familiar with it. One key thing to remember: Lucene is open source, free, and does the job extremely well. Why bother digging into Oracle's costly solution?</p>
<p>Lucene has given us a rich search engine capability for our web application. It has many features that I haven't had a chance to discuss in this article. I will continue to write more articles on this topic as my research moves forward. Have a nice day!</p>
<h2>Reference</h2>
<p>The <a href="http://blog.lucene.com/">blog</a> of Lucene's creator, Doug Cutting</p>
]]></description>
			<content:encoded><![CDATA[<h2>Introduction to Lucene</h2>
<p>I have heard about Lucene and its powerful full-text search capability many times. Today, I decided to take a look at it. Before diving into the user guide, I went to Google Tech Talks to find a video about Lucene. Here is what I found:</p>
<p><embed type="application/x-shockwave-flash" src="http://video.google.com/googleplayer.swf?docid=3675183678687996842&amp;hl=en&amp;fs=true" allowfullscreen="true" style="width: 400px; height: 326px;" id="VideoPlayback"></embed></p>
<p>After watching this video, I was convinced that Lucene is a great tool for me, so I decided to take a deeper look at it. After a quick search, I found a great <a href="http://immike.net/blog/2007/07/03/full-text-search-with-apache-lucene/">blog </a>that showed me how to use Lucene with Digg. With Solr on top of Lucene, you can make Lucene available as a RESTful web service. Awesome, isn&#8217;t it? In this article, I will list all the information I found during my little research on Lucene, and I hope you will find it useful.</p>
<p><span id="more-160"></span></p>
<h3><!--more-->Architecture Overview</h3>
<p>Before we dig into the code or setup guidelines, I would like to have a high-level picture of Lucene first. I borrowed a diagram from this article that helped me grasp the key components of search.</p>
<p><img src="http://www.solutionhacker.com/wp-content/uploads/lucene1.gif" alt="" /></p>
<p>This high-level picture shows that the <strong>search keywords</strong> you enter (normally in a form) become an HTTP <strong>search request</strong>, which the <strong>query parser</strong> translates into a form the search engine understands. The search engine then runs the search against the index files previously prepared by the <strong>Indexer</strong>. After that, the results are <strong>ranked by a predefined ranking algorithm</strong> and returned to the user. The <strong>source</strong> of the data can be a web service, a database, or documents in your file system. The diagram also shows that you can launch a <strong>spider</strong> or <strong>crawler</strong>, as Google does, to fetch web pages from the Internet and feed them to the Indexer as your source.</p>
<h2>Going one step deeper</h2>
<p style="text-align: left;"><strong>Now you know the high-level flow of how search works. Let's go one step further into the details.</strong></p>
<ol>
<li>Which search interface should you use?</li>
<li>How does the search interface communicate with your search engine?</li>
<li>What kinds of search does the search engine provide?</li>
<li>How does the search engine index the documents?</li>
<li>How are results ranked, and what kinds of ranking algorithms are normally used?</li>
</ol>
<p><strong>Below are the answers to the questions above:</strong></p>
<ol>
<li>It is up to you. I would use Flex because I want to give my users a rich search interface.</li>
<li>Flex can talk HTTP, Web Service, or RemoteObject AMF. If you put a web service layer on top of Lucene (e.g. Solr), you can use a REST call over HTTP to obtain the results.</li>
<li>Lucene supports several kinds of advanced searches:
<ul>
<li><u>Boolean operators</u> &#8211; users can compose queries using AND, OR, and NOT.</li>
<li><u>Field search</u> &#8211; restrict the search to specific fields, such as title, author, or content.</li>
<li><u>Wildcard search</u> &#8211; supports * (any sequence of characters) and ? (any single character).</li>
<li><u>Fuzzy search</u> &#8211; Lucene provides a fuzzy search based on an edit distance algorithm. Put the tilde character (~) at the end of a single search word to make it fuzzy. For example, the query &quot;think~&quot; searches for terms <strong>similar</strong> in spelling to &quot;think&quot;. The key word here is &quot;similar&quot;: it means similar in spelling, not in meaning. Horse and donkey are related in meaning but not in spelling, so a fuzzy search does not connect them; hose and horse differ by a single letter, so a fuzzy search can treat them as related.</li>
<li><u>Range search</u> &#8211; for values such as ages and dates.</li>
</ul>
</li>
<li>This is a large topic. I will come back to it later.</li>
<li>It is up to you. If you want to look at a popular ranking algorithm, check out <strong>Google PageRank</strong>, which many of us are curious about. I looked into <strong>SEO</strong> because I wanted my <a href="http://www.justproposed.com">wedding website &#8211; Justproposed.com </a>to show up at the top of the results when users search for &quot;wedding website&quot;. It is a fun area to explore. Generally speaking, a query keyword that appears in the title carries more weight, a document in which the keyword appears more often ranks higher, and so on. However, I know Google also puts a lot of weight on links; ranking is not based purely on the content of the document itself. How to collect that additional information during crawling is beyond the scope of this article.</li>
</ol>
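<p>To get a feel for what &quot;similar in spelling&quot; means in a fuzzy search, here is a minimal sketch of the textbook Levenshtein edit distance in plain Java. It is only an illustration of the idea: the class and method names are my own, and it is not the code Lucene runs internally.</p>

```java
public class EditDistance {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty string into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // The row just computed becomes the previous row for the next pass.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hose", "horse"));   // 1
        System.out.println(distance("horse", "donkey")); // 4
    }
}
```

<p>A small distance such as the one between hose and horse is what makes two terms count as similar; how small it has to be depends on the threshold the search engine applies.</p>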
<h2>Get your hands dirty</h2>
<p>Look into this great <a href="http://www-128.ibm.com/developerworks/java/library/wa-lucene2/index.html?ca=drs-">article</a>.</p>
<p><em>One thing this article doesn&#8217;t mention is that you need to create your <strong>dataDir </strong>and <strong>indexDir </strong>folders under the C drive and drag some HTML files into dataDir before you start the web server. If you add new HTML files later, you need to clean out indexDir and restart the web server to rebuild the indexes.</em></p>
<p>I have got the application up and running. It was a nice trial. My next step is to enhance this example. I will do the following:</p>
<ol>
<li>Use Flex as the search interface.</li>
<li>Use Solr to expose the Lucene search engine as a web service.</li>
<li>Have Flex call my search engine via REST.</li>
<li>Display the results in Flex.</li>
</ol>
<p>After I have my new enhancements working, I will do the following:</p>
<ol>
<li>Look into how Lucene does the indexing.</li>
<li>Look into Nutch, so that I can have it crawl some sites and put the HTML files in dataDir for me automatically.</li>
</ol>
<h2>How Does Lucene Index Documents?</h2>
<p>Yes, I haven&#8217;t forgotten to answer question 4. Here is the <a href="http://www.ibm.com/developerworks/library/wa-lucene/">article </a>that answers it. To summarize, here are several key points I extracted from it.</p>
<ol>
<li><u>Content Extraction</u> &#8211; Lucene only indexes <strong>text</strong>, so you need a parser to extract the content from each type of document, such as Word, HTML, or PDF. If you have a type of document for which you cannot find a parser, it is your responsibility to extract the content for Lucene. This <a href="http://www.ibm.com/developerworks/web/library/j-lucene/">article </a>shows you how to use <strong>Digester </strong>to extract content from XML and feed it to Lucene. If you have a large pool of XML to extract content from, pay attention to the parsing time. Someone has done <a href="http://grep.codeconsult.ch/2005/02/16/lucene-rocks/">this </a>and published some performance numbers for reference, although that article is a bit outdated.</li>
<li><u>Content Preprocessing</u><strong> &#8211; Analyzer </strong>is used to extract the <strong>tokens </strong>to be indexed from your text content. Before text is indexed, it is passed through an <code>Analyzer</code>. <code>Analyzer</code>s are in charge of extracting indexable tokens out of the text and eliminating the rest. Lucene comes with a few different <code>Analyzer</code> implementations. Some of them skip stop words (frequently used words that don't help distinguish one document from another, such as &quot;a,&quot; &quot;an,&quot; &quot;the,&quot; &quot;in,&quot; &quot;on,&quot; etc.), some convert all tokens to lowercase so that searches are not case-sensitive, and so on.</li>
<li><u>Indexing</u><strong> - IndexWriter </strong>is the key component in the indexing process. You pass it an Analyzer when you create it; it then creates a new index or opens an existing one and adds documents to it. You set up <strong>fields </strong>and <strong>documents </strong>and feed them to the IndexWriter to do the job. In the code below, you fetch the .txt files in a directory, along with metadata such as their paths, and feed them to the IndexWriter, which indexes them one after another.</li>
<li><u>Configuration </u>- You can <strong>configure </strong>the IndexWriter for better performance by increasing its buffer size, because the bottleneck is normally the disk I/O on the index files.</li>
<li>Lucene uses an <strong>inverted index</strong>. An inverted index is an inside-out arrangement of documents in which terms take center stage: each term points to a list of the documents that contain it. By contrast, in a <i>forward index</i>, documents take center stage, and each document refers to a list of the terms it contains. An inverted index makes it easy to find which documents contain a given term.</li>
</ol>
<pre name="code" class="java">
// textFiles (File[]) and indexWriter (IndexWriter) are assumed to be set up earlier.
for (int i = 0; i &lt; textFiles.length; i++) {
    // Index only regular files whose names end in .txt
    if (textFiles[i].isFile() &amp;&amp; textFiles[i].getName().endsWith(&quot;.txt&quot;)) {
        Reader textReader = new FileReader(textFiles[i]);
        Document document = new Document();
        // Full-text content: tokenized and indexed
        document.add(Field.Text(&quot;content&quot;, textReader));
        // File path: indexed verbatim as a single term and stored
        document.add(Field.Keyword(&quot;path&quot;, textFiles[i].getPath()));
        indexWriter.addDocument(document);
    }
}
</pre>
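<p>To make the analyzer and inverted index ideas from the list above more concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class name, the stop-word list and the method names are my own assumptions, chosen only to show how tokens end up as term-to-document lookups.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TinyInvertedIndex {

    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // The inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Analyzer step: lowercase the text, split it on anything that is not
    // a letter or digit, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Indexing step: record the document id under every term it contains.
    void addDocument(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search step: one map lookup instead of scanning every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "The horse is in the barn");
        index.addDocument(2, "A donkey and a horse");

        System.out.println(index.search("horse"));  // [1, 2]
        System.out.println(index.search("donkey")); // [2]
        System.out.println(index.search("the"));    // []
    }
}
```

<p>Searching for a stop word returns nothing because the analyzer step removed it before indexing, which is the trade-off you accept in exchange for a smaller index.</p>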
<p>Lucene offers four different types of fields from which a developer can choose: Keyword, UnIndexed, UnStored, and Text.</p>
<blockquote>
<p><strong>Keyword </strong>fields are not tokenized, but are indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.</p>
<p><strong>UnIndexed </strong>fields are neither tokenized nor indexed, but their value is stored in the index word for word. This type is suitable for fields that you need to display with search results, but whose values you will never search directly. Because this type of field is not indexed, queries cannot match against it. Since the original value is stored in the index, this type is not suitable for very large values if index size is an issue.</p>
<p><strong>UnStored </strong>fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This type is suitable for indexing large amounts of text that do not need to be retrieved in their original form, such as the bodies of Web pages, or any other type of text document.</p>
<p><strong>Text </strong>fields are tokenized, indexed, and stored in the index. Fields of this type can be searched and their original value retrieved, but because the value is stored, be cautious about the size of anything you store as a Text field.</p>
</blockquote>
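<p>One way I find it helpful to remember the four field types is as combinations of three yes/no properties. The enum below is a hypothetical model written for this article, not part of Lucene's API; it only restates the descriptions above in code form.</p>

```java
public class FieldTypes {

    // Hypothetical model: each field type is a combination of three flags.
    enum FieldType {
        KEYWORD(false, true, true),
        UNINDEXED(false, false, true),
        UNSTORED(true, true, false),
        TEXT(true, true, true);

        final boolean tokenized;
        final boolean indexed;
        final boolean stored;

        FieldType(boolean tokenized, boolean indexed, boolean stored) {
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }

        // Only indexed fields can be matched by a query.
        boolean searchable() {
            return indexed;
        }

        // Only stored fields can be shown verbatim with the search results.
        boolean retrievable() {
            return stored;
        }
    }

    public static void main(String[] args) {
        // Print one row per field type with its three flags.
        for (FieldType type : FieldType.values()) {
            System.out.printf("%-10s tokenized=%-5b indexed=%-5b stored=%b%n",
                    type, type.tokenized, type.indexed, type.stored);
        }
    }
}
```

<p>Reading the flags side by side makes the trade-off visible: storing a value costs index size, while indexing it is what makes it searchable.</p>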
<h2>Conclusion</h2>
<p>To use Lucene, there are 3 main concepts you need to grasp. They are:</p>
<ol>
<li>Indexer - creates the search engine indexes.</li>
<li>Analyzer - splits text into tokens that make sense for the search engine. The structure is: a document is a sequence of fields, each field is a name/value pair, and each value is broken into tokens. <span style="font-weight: bold;">Field </span>values may be <span style="font-style: italic;">stored</span>, <span style="font-style: italic;">indexed</span>, or <span style="font-style: italic;">analyzed (tokenized)</span> (and, now, <span style="font-style: italic;">vectored</span>). The <a href="http://lucene.sourceforge.net/talks/pisa/">lecture notes</a> from Doug Cutting will give you more detail.</li>
<li>Searcher - runs queries against the indexes and returns the matching documents.</li>
</ol>
<p>You may think of using <strong>grep</strong> or a <strong>database</strong> to achieve what Lucene does. <strong>Grep</strong> is a powerful Linux tool, but it does not build an index ahead of time, so every search has to scan every file from start to finish. Once your files reach several MB in size, that becomes inefficient. A database can index a varchar field, but not in as sophisticated a way as Lucene. Oracle may provide something comparable, but I am not familiar with it. One key thing to remember: Lucene is open source, free, and does the job extremely well. Why bother digging into Oracle's costly solution?</p>
<p>Lucene has given us a rich search engine capability for our web application. It has many features that I haven't had a chance to discuss in this article. I will continue to write more articles on this topic as my research moves forward. Have a nice day!</p>
<h2>Reference</h2>
<p>The <a href="http://blog.lucene.com/">blog</a> of Lucene's creator, Doug Cutting</p>
]]></content:encoded>
			<wfw:commentRss>http://www.solutionhacker.com/implement-your-idea/build-your-website/powerful-full-text-search-engine-lucene-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

