Archive | System RSS feed for this section

Learning Hive

Starting to learn Hive

As I mentioned in my last article,  I was getting excited about the potential of Hive. Today, I decide to start my journey to learn this. I found a great introductory video that gives you a nice warm-up of using Hive (A basic knowledge of how hadoop and mapreduce work would be helpful for you to digest the material inside).

Below are some highlights from this video

Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed how close the SQL syntax like the regular SQL for RDBMS. Below are some SQLs used in this tutorial.

//———- Set up your tables in HIVE —————–
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE;

DESCRIBE shakespeare;

//———- Load data into Hive table from Hadoop HDFS ——————-
LOAD DATA INPATH “shakespeare_freq” INTO TABLE shakespeare;

//———- Query against the data using hive sql interface ————–
select * from shakespeare limit 10;
select * from sakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

//show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

//———- Create a merge table and populate it using dataset joining by 2 different tables
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);

//———- Query the merge table ———————
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;

To prepare the data for Hive to load in, the demo uses another mapreduce job to achieve. Remember to delete the log before doing Hive table load.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq ‘\w+’

//remove the mapreduce job log
hadoop fs -rmr shakespeare_freq/_logs

Often time, large scale data processing system always IO bound. So for mapreduce job, your mapper is always waiting for data to load from disk. Hadoop mitigates the problem via during parallel load from lots of hard drives. However, a single hard drive is still max out at 75MB/s read as physical limit and nothing we can do about this. In order to achieve good speed, the key is to eliminate # of hadoop pass

Since Hive is on top of Hadoop’s HDFS, it will have the same restrictions as it. So, you cannot do UPDATE, DELETE and INSERT records as regular RDMS. However, you can do bulk load to add more new files (data) to the table and you can do delete a file from Hive.

Hive needs to store metadata of the tables out from the HDFS. You can use regular rdms to achieve the job. But when you start Hive locally, it will seek for the local metastore. So, in distributed environment, you may need to centralize the metastore in a remote location. There is wiki on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gama, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase
Leave a comment Continue Reading →

Hive on Amazon EC2 cloud

adserving-ec2-hive-system-arch

 

I ever worked for a display ad network company that collects over 400 million of impression/ click logs per day. With this amount of data, my ex-company bought a supercomputer and cross their fingers that it can handle the grow in both volume and analytic demand of the data. It is obviously not a scalable solution. However, what is the best solution?

Although I haven’t worked for this company anymore, it is still an interesting problem to solve. I have a great friend who proposed a shared nothing solution for this company. The solution is to partition the data across a set of Postgresql databases and put Greenplum on top of them to parallelize the query —there is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture). I like this approach. The only thing is that Greenplum is not free and it may be difficult for a startup to face this upfront cost. Apart from that, this setting requires all the databases are running on the same network that hindered us to move this in the elastic cloud like Amazon EC2.

Later on, I joined a great company in the same industry that seeks for a solution in the cloud to host its data warehouse. So, I got a  chance to revisit this problem. During the research, I came across an interesting technology – column-based database (eg. infobright and lucid db). The idea of column-based data store is that traditional database stores and fetches data in row from data files into the memory. It is inefficient if your query only requires few columns for computation. So, column-based data stores your data in column with effective compression algorithm due to all values in it has the same data type. This solution is great but it doesn’t do MPP (ie. massive parallel processing) and it is also not ready for cloud yet.

Here comes another solution. That is Hive on top of Hadoop on top of Amazon cloud. It is an interesting idea. Check out this video to learn about this.


If you are not sure what Hadoop is and want to get some warm up in massive computing. I suggest you go through the following 5 excellent Google lectures.


Leave a comment Continue Reading →

Powerful Linux Text Processing Commands

Common Text Processing Commands

In our daily life, we deal with lots of data. The data normally is stored in text format for the ease of human to read. With the large amount of data we have, we need ways to deal with it. There are several things we frequently do on the data: Search, Filter, Sort and Analysis. In Linux, there are some powerful commands that I can use: cat, grep, find, sort, unique and etc. I found those commands quite powerful. So, I decide to put these down as my reference. This tutorial I will go over the basic text processing commands and how we use them together to achieve the tasks we often encounter in our workplace. 

cat

The power of “cat” is not just output a file to screen but to concatenates a list of file content and stream through the pipe to another program as input.

cat * | sort

find

The power of find is to list out the matched filenames based on metadata of the files like type, size, create date…

grep

“grep” helps you to list out the file(s) with the content that match the pattern(s) in regular expression. You can use it as content search across the files in your file system.

grep -H -R --color -n -P abc *

option:

  1. –color (highlight matching part in content with color)
  2. -n (show line number)
  3. -P PATTERN (perl regular expression pattern)
  4. -R (recursively)
  5. -l (only list out the filenames that match the pattern)
  6. -H show filename that matched.

cut

“cut” extracts sections from each line of input. (example of usage). Below the command will extract the 5th field and the rest from each line of file A using delimiter colon.

cut -d ":" -f 5- fileA

option:

  1. -c (character)
  2. -b (byte)
  3. -f 5- (field if the line can be broken down by delimiter)
  4. -d | (delimiter is pipe character)

sort 

The sort command sorts a file according to fields–the individual pieces of data on each line. By default, sort assumes that the fields are just words separated by blanks, but you can specify an alternative field delimiter if you want (such as commas or colons). Output from sort is printed to the screen, unless you redirect it to a file.

donor.data
Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia

Let’s take this sample donors file and sort it according to the donation amount. The following shows the command to sort the file on the second field (last name) and the output from the command:

sort +1 -2 donors.data
Jack Arta 250000 Indonesia
Bay Ching 500000 China
Cruella Lumper 725000 Malaysia

If the file is delimited by comma, you can use -t , to tell the sort the delimiter. You can use -u to output the uniqueness as well.


sort -t: +1 -2 company.data
Nasium, Jim:031762:Marketing
Jucacion, Ed:396082:Sales
Itorre, Jan:406378:Sales
Ancholie, Mel:636496:Research

To sort the file on the third field (department name) and suppress the duplicates, use this command:

sort -t: -u +2 company.data
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Itorre, Jan:406378:Sales

Note that the line for Ed Jucacion did not print, because he’s in Sales, and we asked the command (with the -u flag) to suppress lines that were the same in the sort field.

option:

  1. -f Make all lines uppercase before sorting (so “Bill” and “bill” are treated the same).
  2. -r Sort in reverse order (so “Z” starts the list instead of “A”).
  3. -n Sort a column in numerical order
  4. -tx Use x as the field delimiter (replace x with a comma or other character).
  5. -u Suppress all but one line in each set of lines with equal sort fields (so if you sort on a field containing last names, only one “Smith” will appear even if there are several).
  6. Specify the sort keys like this: +m Start at the first character of the m+1th field. -n End at the last character of the nth field (if -N omitted, assume the end of the line)

uniq

uniq – line level uniqueness. It prints the unique lines in a sorted file, retaining only one of a run of matching lines. Optionally, it can show only lines that appear exactly once, or lines that appear more than once. uniq requires sorted input since it compares only consecutive lines.

option:

  1. -u (print the unqiue lines only – lines only appear once)
  2. -d (print the duplicate lines only – lines appear more than once)
  3. -c (prefix each line with occurrence)

[code]]czoyMjY6XCJiYXNoJCBjYXQgdGVzdGZpbGU8YnIgLz4NClRoaXMgbGluZSBvY2N1cnMgb25seSBvbmNlLjxiciAvPg0KVGhpcyBsaW57WyYqJl19ZSBvY2N1cnMgdHdpY2UuPGJyIC8+DQpUaGlzIGxpbmUgb2NjdXJzIHR3aWNlLjxiciAvPg0KVGhpcyBsaW5lIG9jY3VycyB0aHJlZXtbJiomXX0gdGltZXMuPGJyIC8+DQpUaGlzIGxpbmUgb2NjdXJzIHRocmVlIHRpbWVzLjxiciAvPg0KVGhpcyBsaW5lIG9jY3VycyB0aHJlZSB0e1smKiZdfWltZXMuXCI7e1smKiZdfQ==[[/code]

[code]]czoxMzk6XCI8YnIgLz4NCmJhc2gkIHVuaXEgLWMgdGVzdGZpbGU8YnIgLz4NCjEgVGhpcyBsaW5lIG9jY3VycyBvbmx5IG9uY2UuPGJ7WyYqJl19ciAvPg0KMiBUaGlzIGxpbmUgb2NjdXJzIHR3aWNlLjxiciAvPg0KMyBUaGlzIGxpbmUgb2NjdXJzIHRocmVlIHRpbWVzLlwiO3tbJiomXX0=[[/code]

[code]]czoxNjA6XCI8YnIgLz4NCmJhc2gkIHNvcnQgdGVzdGZpbGUgfCB1bmlxIC1jIHwgc29ydCAtbnIgPGJyIC8+DQozIFRoaXMgbGluZSB7WyYqJl19b2NjdXJzIHRocmVlIHRpbWVzLjxiciAvPg0KMiBUaGlzIGxpbmUgb2NjdXJzIHR3aWNlLjxiciAvPg0KMSBUaGlzIGxpbmUgb2NjdXtbJiomXX1ycyBvbmx5IG9uY2UuICBcIjt7WyYqJl19[[/code]

wc

wc – word count. Apart from word count, it also does the following

  1. wc -w gives only the word count.
  2. wc -l gives only the line count.
  3. wc -c gives only the byte count.
  4. wc -m gives only the character count.
  5. wc -L gives only the length of the longest line.

tr

“tr” translate or delete characters. It is used for data cleaning job. Can we do pattern replacement?

tr ‘[:lower:]‘ ‘[:upper:]‘

The above command will convert all the lowest case to upper case.

tr ‘.’ ‘/’

The above will convert all the . character to /. And for translation, you cannot have -d option on. You may be asking when would we do that. Here is the common use case – convert window files to unix formatted file:

tr -d ‘\r’ < input_dos_file.txt > output_unix_file.txt

option:

  1. -s (squeeze the repeated characters into one character. eg. tr -s ‘\n’ )
  2. -d (delete characters eg. tr -d ‘\000′)

sed

“tr” can do character replacement. But if you want to do pattern replacement, you need to use sed. usage: sed -e s/pattern/replacement/flags

sed -e s/one/another

sed -e s/[aeiou]/_/g

 Note the use of the “g” flag so that you apply the pattern/replacement to every match instead of just the first one.

awk

  

Put them all together

[code]]czo4NjpcImNhdCAqIHxncmVwIGx1Y2VuZS1jb3JlfGN1dCAtZjIgLWRcJyBcJ3x1bmlxfHRyIFwnLlwnIFwnL1wnfCBhd2sgXCd7cHJpbnRmIFwiJXtbJiomXX1zLmNsYXNzXFxuXCIsICQxfVwnXCI7e1smKiZdfQ==[[/code]

Leave a comment Continue Reading →

Linux System Overview – File System

Linux File System Basic

Ext3 (successor of Ext2) is the standard file system for Linux: It is robust, fast and suitable for all fields of use. The main difference between them is that Ext3 has a journal that records the pending operations for fast recovery purpose in the event of system crash. This record guarantees a consistent file system at all times and reduces the time needed for checking a mounted file system from several hours to a few seconds b/c instead of checking the entire disk, the system can check just those areas noted in the journal as having pending operations.

Like all decent Unix file systems, Ext3 uses three general data structures: directories, inodes and data blocks. Directories only contain file names and the inode numbers assigned to them. Each file has one i-node that contains a list of disk block’s starting sector addresses as a file content is normally not stored in contiguous disk blocks in disk drive due to constant add and delete and the size is dynamic (ie. external fragmentation). If the file content are scattered, it takes longer to retrieve its content as it takes more header spins physically.

http://www.heise-online.co.uk/images/110398/0/1

Under the hood, each disk block can span multiple disk sectors and each sector has the size of 512 bytes. Disk sector is the smallest addressable unit on hard disks. Ext3 uses block sizes of 1024, 2048 or 4096 bytes. In theory, Ext3 supports block sizes up to 64 KB, but in x86 and x64 architectures, 4 KB is the maximum: This block size corresponds to that of the kernel’s memory pages in RAM, which makes paging easier for the operating system. Ext3 uses 32-bit values (4 bytes as integer in Java) to assign block numbers, which means that it can only address about four billion blocks – 4 TB at a block size of 1024 bytes, 16 TB at 4096 bytes. So, larger block size allows you to create large file system. On the other hand, large blocks can waste a lot of disk space because files always use a whole block even if they only contain a few bytes: On average, every file wastes half a block - the larger the blocks and the smaller the files, the more noticeable the effect is. This effect called internal fragmentation.

Optimization with sacrifice

For an efficient file system, it needs to quickly find the data belonging to a file name. For example, for filename “abc.txt”, OS needs to traverse a list of directory entries before the inode of the file is located (depends on the depth of the folder hierarchy) and then traverse all the data block pointers to retrieve the content. To optimize the speed, Ext3 writes the inodes into static tables on the disk during formatting. One consequence of this is that the number of inodes can’t be altered after the file system has been set up. As every file needs to be assigned to one specific inode there can’t be more files than inodes. It is not scalable for handling large number of files. By default, mke2fs creates one inode for every 4 KB in file systems up to 512 MB, otherwise one inode for every 8 KB. Although you can tune this number to increase the number of inodes, it is only changeable at setup time (not dynamic). By the way, each inode itself consumes 128  bytes in size. 

Handle large size file

How is it possible to fit the millions of data block numbers required for gigabyte-sized files into a static data structure of 128 bytes? It isn’t – one Ext3 inode stores exactly 15 block numbers. The first twelve point directly to data blocks, block 13 to a data block containing block numbers (indirectly addressed blocks), block 14 to a block pointing to blocks with block numbers (double indirect), and block 15 points to triple indirect blocks. Therefore, at a block size of 4 KB (that is 1024 block numbers with 4 bytes per indirect block) one inode can handle 12 + 1024 + 10242 + 10243, around a billion block numbers. The resulting maximum file size of just over 4 TB.

Power of B-Tree Indexing

Now you know how inode uses hierarchical pointers to handle file with large size. However, if a directory has tons of files, how directory entries make it efficient to locate the inode. Ext2 originally stored the file names within a directory as a linked list. While this is a elegantly simple data structure it has the disadvantage that operations take longer and longer with a growing number of entries. Ext3 can manage directories in B-Tree+ structure if [code]]czo5OlwiZGlyX2luZGV4XCI7e1smKiZdfQ==[[/code] is set (not default). This drastically speeds up directory operations. Performance loss is only experienced when the directories are filled with hundreds of thousands of files. This is usually caused by a caching effect as Linux kernel doesn’t use unlimited memory for caching directory structures even you add more memory.

$ sudo tune2fs -O dir_index /dev/hda1

Run the above command as root. Do note that the indexing will take up much more space, but then hard disk space is not too expensive nowadays. If you don’t want to tweak the OS default setting but you still want to store large number of files. You can restructure the directory so that it does not contain that many files. Without doing this, in a default (untuned) Ext3 partition, each subsequent write degrades horribly past the 2000 file limit. So, keeping the items in a directory to within 2000 files should be fine. If you want to go this route, there are approaches to restructure your folder:

  1. Date based – YEAR > MONTH > DATE > HOUR if your files is uniformly distributed across the time.
  2. Hash based – break down the hash into several parts as folder name (check this)
  3. Id based – reverse the id and break it down use 2 digits each to make sure it is uniformly distributed

NOTE: I don’t want to use random number here as I want to locate the file via its metadata later.

Alternative solution for large number of files in a directory

ReiserFS can handle up to 2^31 files per dir (that’s 2 billion), with a max of 2^32 (4 billion) files on the filesys total. It can handle up to 64000 subdirs in a directory. Ext3 has a limit of 32000 subdirs per dir. The max number of files per dir is theoretically unlimited (actually around 130 trillion), but performance becomes terrible with above 10-15 thousand files. The max number of total files on the filesys is limited by the number of inodes you have. With a 1 gig file system and a 4k block/ inode ratio (the default), you have around 260000 inodes, and that’s also the max number of files you can have.

Reference

Here are some good references

  1. The Unix and Internet Fundamental How to – Eric Raymond
  2. Tuning Linux file system – Ext3 by Oliver Diedrich
  3. Handling large number of files in a directory – Roopinder Singh
  4. Super fast Ext4 filesystem arrives in Ubuntu 9.04If the benchmark is correct, it outperforms all the file system nowadays dramatically.
  5. Introduction to Linux file systems and files
  6. Extreme performance monitoring and tuning in Linux
  7. Simple Help with simple answer - simplehelp.net
Leave a comment Continue Reading →

Plenty of Fish – Cash cow!

A site called “PlentyOfFish.com” is currently getting 30 million hits a day. The number doesn’t blow me off. However, what surprise me is that this site is basically operated by single man “Markus Frind”. How does he achieved that? If you want to hear how he does that, you can go to his interview from this link. Otherwise, you can read the summary I got from his interview.

The stuff I learnt from Markus

You may think that Markus must spend a lot of $$ to maintain his site. A picture of server farm may be popped up in your head. Hahaha… all he needs is just 1 web server and 3 database servers. This is the cost that you and me can afford. No bother to write your business plan and wait for VC $$ nowadays. :grin:

Here are some quick tips for Markus

  1. You need a lot of RAM. RAM is cheap, go ahead to power up your box with tons of RAMs please!
  2. Markus uses Akamai CDN to offload the bandwidth of fetching images across different locales.
  3. Separate R/W database operation.
  4. Markus uses one database as master for write and 2 databases as slave to handle the searches (read). According to him, radius-based searches demand lots of resources. “If you have one system to do just one thing, it will do it much efficiently.”
  5. Markus put RAM to both web and db servers. “If you can load your whole db in the RAM, do it!”
  6. Optimize the db access is the key to handle lots of requests.
  7. Denormalization is necessary if you want to reduce the number of joins that can potentially slow down your queries.
  8. PlentyOfFish.com is purely based on “Word of Mouth” marketing. Do things right, your users will spread it out for you. Cheapest marketing strategy ever!
  9. PlentyOfFish.com is FREE site. Because it is free, it doesn’t have high requirements like uptime. It can be down without much issues.
  10. PlentyOfFish.com solely monetized from advertisement like Google Ads. Just this, Markus is making around 10 million annually. Amazing!
  11. PlentyOfFish.com is purely using Microsoft solution like IIS, ASP.NET and SQL Server. In fact, you can build it using other solution like Apache, Spring, MySQL

I love to see how people like Markus beat down the giant like Match.com. One man beats hundreds of people with simple system settings. Incredible! Folks, there is no excuse whining no $$ to start your business!:lol:

Although it sounds easy for Markus during the interview, there are areas the interviewer didn’t cover:

  1. PlentyOfFish.com webfront is not looking good. How could it attract the first set of users in the first place? FREE
  2. If you go to a FREE site without data, you may leave it right away. How PlentyOfFish.com attracts the first real user? Did PlentyOfFish.com crawl competitors’ data to power his site as bootstrap?
  3. PlentyOfFish.com purely makes $$ from Google AdSense. However, according to John Chow, Adsense is not a good place to make $$. Why is that?

What possibly may go wrong for his approach:

His database architecture is traditional master-slave approach. It can offload the read but not write operations. Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases as well. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled. This problem can be tackled by shard/ federation. I will discuss this topic later.

 

Leave a comment Continue Reading →

Powerful Full Text Search Engine – Part 1 Lucene Introduction

Introduction of Lucene

I have heard of Lucene and its powerful full text search capability many times. Today, I decide to take a look at it. Before I dive into the user guide, I went to Google Tech Talk to find a video related to Lucene first. Here is what I found: 

After I finished this video, I found Lucene a really great tool for me. So, I decided to have a deeper look at it. After a quick search,  I found a great blog that showed me how to use Lucene with Digg. With Solr on top of Lucene, you can make Lucene available as RESTful Web Service. It is so awesome, isn’t it? In this article, I will list you all the information I found during my little research on Lucene and I hope you will feel it useful.

Architecture Overview

Before we dig into the code or set up guidelines, I would like to have a high level picture of Lucene first. I borrow a diagram from this article that helps me to grasp the key components in search.

This high level picture shows you that your search keywords you entered (normally using a form) will become a HTTP search request and later been translated into a form that search engine understands by Query parser. Search engine will perform the search operation against the indexed files that was previously prepared by Indexer. After that, the result will be ranked based on predefined ranking algorithm and returned to the user. The source of the data can be from Web Service, database or documents in your file system. In this diagram, it shows you that you can launch spider or crawler like Google to obtain the data from web pages on the Internet and feed it to Indexer as your source.

Get one step deeper

Now you know the high level flow of how search works. Lets get one step into the detail.

  1. What search interface to use?
  2. How search interface communicates with your search engine?
  3. What kind of search the search engine provides?
  4. How search engine indexes the documents?
  5. How result be ranked and what kind of ranking algorithms we normally use?

Below is the answers of the questions above:

  1. Up to you. I would use Flex as I want to provide a rich search interface to my users.
  2. Flex can talk HTTP, Web Service or RemoteObject AMF. If you put web service layer on Lucene (ie. Solr), you can use REST call (ie. HTTP) to obtain the result.
  3. Lucene supports several kinds of advanced searches like:
    • Boolean operators – users can compose query using AND, OR, NOT
    • Field Search – what fields the search operates on? like title, author or content?
    • Wildcard Search - supports * and ?.
    • Fuzzy Search - Lucene provides a fuzzy search that’s based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for the terms similar in spelling to the term "think." The key here is the word "similar". Do we consider horse and donkey are related? Or you have hose and horse be related somehow in spelling?
    • Range Search - age, date and etc
  4. Large topic. I will go back to it later.
  5. Up to you. If you want to look at the popular ranking algorithm in the world, check out Google Page Rank. It is one of the algorithms that many of us interested to know. Before I want to have my wedding website – Justproposed.com be shown on at the top of the result when users type "wedding website" as search keywords, I have looked into SEO. It is a fun area to explore. Generally speaking, if the query keywords shown in the title, it weights more, If the keyword frequency is higher, it ranks higher..blah blah. However, I know Google has weighted a lot on the links. It is not just purely based on the document that you have. How to obtain the additional information during the crawling is beyond the scope of this article.

Get your hand dirty

Look into this great article.

The thing this article doesn’t mention is that you need to create you dataDir and indexDir folders under C and drag a list of html files into the dataDir before you start the web server. If you drag new htmls into it, you need to clean up your indexDir and restart your web server in order to rebuild the indexes.

I have got the application up and running. It is nice trial. My next step is to enhance this example. I will do the following:

  1. Use Flex as search interface
  2. Use Solr to expose the Lucene search engine as Web Service.
  3. Have Flex calls my search engine via REST.
  4. Display the result on Flex.

After I have my new enhancements working, I would do the following:

  1. Look into how Lucene do the indexing
  2. Look into Nutch,. So I can have it crawled some sites and put the htmls in dataDir for me automatically.

How Lucene Indexes the documents?

Yes. I haven’t forgot to answer the question 4. Here is the article that answers your question. To summarize, here are several key points I extracted from this article.

  1. Content Extraction – Lucene only takes text for index. So, it provides different types of parsers to extract content from different types of document like word, html, doc, pdf and etc. If you have other type of document that you cannot find a parser, you take the responsibility to extract the content out for Lucene. This article shows you how to use Digester to extract content out from XML and feed Lucene. If you have a large pool of XML for content extraction, you need to pay attention on the parsing time. There is someone who has done this and obtain some performance number as reference. However, the article was a bit outdated.
  2. Content Preprocessing – Analyzer is used to extract the token from your text content to be indexed. Before text is indexed, it is passed through an [code]]czo4OlwiQW5hbHl6ZXJcIjt7WyYqJl19[[/code]. [code]]czo4OlwiQW5hbHl6ZXJcIjt7WyYqJl19[[/code]s are in charge of extracting indexable tokens out of text to be indexed, and eliminating the rest. Lucene comes with a few different [code]]czo4OlwiQW5hbHl6ZXJcIjt7WyYqJl19[[/code] implementations. Some of them deal with skipping stop words (frequently-used words that don’t help distinguish one document from the other, such as "a," "an," "the," "in," "on," etc.), some deal with converting all tokens to lowercase letters, so that searches are not case-sensitive, and so on.
  3. Indexing – IndexWriter is the key component in the indexing process. This class will use Analyzer that you passed in as parameter to create a new index or open an existing index and add documents to it. You need to set up fields and documents and feed them to the IndexWriter to do the job. Like the code below, you fetches a list of .txt files and its metadata like path from a directory and feed them for IndexWriter. IndexWriter will index them one after one.
  4. Configuration - You can configure IndexWriter to achieve better performance via increasing the buffer size because the bottleneck normally happen during the IO of the index files.
  5. Lucene uses inverted index concept. An inverted index is an inside-out arrangement of documents in which terms take center stage. Each term points to a list of documents that contain it. On the contrary, in a forwarding index, documents take the center stage, and each document refers to a list of terms it contains. You can use an inverted index to easily find which documents contain certain terms. Lucene uses an inverted index as its index structure.

 
for(int i = 0; i &lt; textFiles.length; i++){
      if(textFiles[i].isFile() &gt;&gt; textFiles[i].getName().endsWith(&quot;.txt&quot;)){
        Reader textReader = new FileReader(textFiles[i]);
        Document document = new Document();
        document.add(Field.Text(&quot;content&quot;,textReader));
        document.add(Field.Keyword(&quot;path&quot;,textFiles[i].getPath()));
        indexWriter.addDocument(document);
      }
}

Lucene offers four different types of fields from which a developer can choose: Keyword,UnIndexed,UnStored,and Text.

Keyword fields are not tokenized, but are indexed and stored in the index verbatim. This field is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.

UnIndexed fields are neither tokenized nor indexed, but their value is stored in the index word for word. This field is suitable for fields that you need to display with search results, but whose values you will never search directly. Because this type of field is not indexed, searches against it are slow. Since the original value of a field of this type is stored in the index, this type is not suitable for storing fields with very large values, if index size is an issue.

UnStored fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This field is suitable for indexing large amounts of text that does not need to be retrieved in its original form, such as the bodies of Web pages, or any other type of text document.

Text fields are tokenized, indexed, and stored in the index. This implies that fields of this type can be searched, but be cautious about the size of the field stored as Text field.

Conclusion

To use Lucene, there are 3 main concepts you need to grasp. There are:

  1. Indexer – create search engine indexes
  2. Analyzer – Split text into tokens that make sense for the search engine. The structure is like document -> a sequence of fields and each field is name/value pair -> tokens. Field values may be stored, indexed or analyzed/ tokenize, (and, now, vectored). The lecture note from Doug Cutting will give you more detail.
  3. Searcher

You may think of using grep to achieve or database to achieve what Lucene does. Grep is powerful Linux tool, however, if you want it to search on files with several MB in size, you will see that the tool is inefficient. The reason is grep doesn’t prepare the indexes of your files ahead of the time you do the search. Database can do indexing but not so sophisticated as Lucene in your varchar field. Oracle may provide one but I am not familiar with it. One key thing to remember: Lucene is open source, free and does the job extremely well. Why bother to dig into Oracle costly solution?

Lucene has given us a rich search engine capability on our web application. It has many features that I haven’t got a chance to discuss them all in this article. I will continue to write more articles on this topic as my research moves forward. Have a nice day! :lol:

Reference

The blog of the Lucene and Solr creator – Doug Cutting

 

Leave a comment Continue Reading →

Tomcat Performance Tuning

Most companies I have worked for use Tomcat as Servlet Container. It is de facto standard just like how Apache been used as Web Server. However, most of us just drag our war file to the webapp folder and use Tomcat with all the settings as default out of the box. It works fine in development environment but may not in production. This article will give you advice in several areas:

  1. Production Tomcat Architecture
  2. Tuning tomcat for performance
  3. Resolving problems which affect availability

 Production Tomcat Architecture

In production Tomcat relies on a number of resources which can impact its overall performance. Understanding the overall system architecture is key to tuning performance and troubleshooting problems.

  1. Hardware: CPU(s), memory, network IO and file IO
  2. OS: SMP (symmetric multiprocessing) and thread support
  3. JVM: version, tuning memory usage, and tuning GC
  4. Tomcat: version (example, Tomcat 6 supports NIO)
  5. Application: Application design can have the largest impact on overall performance
  6. Database: concurrent db connection is allowed (pooling and object caching)
  7. Web Server: Apache can sit in front of Tomcat and serves the static content. It also can do load balancing across multiple Tomcat instances.
  8. Network: Network delays.
  9. Remote Client: How fast is the communication protocol? Content can be compressed. 

Performance Tuning

How to measure and test performance

  • Request latency is key b/c it reflects the responsiveness of your site for visitors.
  • Test environment should match production as closely as possible.
  • The data volume is important to simulate in database side.
  • Test HTTP requests with different request parameters (test corner cases)
  • Use load test to simulate the traffics (ex. JMeter)
  • Final tests should be over longer periods like days because JVM performance changes over time and can actually improve if using HotSpot. Memory leaks, db temporary unavailable, etc can only be found when running longer tests.

JVM version, memory usage and GC

  • Sun Java 1.3 and later releases inlcude HotSpot profiling optimizer customized for long running server application.
  • Tomcat will freeze processing of all requests while the JVM is performing GC. On a poorly tuned JVM this can last 10′s of seconds. Most GC’s should take < 1 second and never exceed 10 seconds
  • Tune the -Xms (min) and -Xmx (max) java stack memory (set them to the same value can improve GC performance)
  • Make sure the java process always keeps the memory it uses resident in physical memory and not swapped out to virtual memory.
  • Use -Xincgc to enable incremental garbage collection
  • Try reducing -Xss thread stack memory usage

Tomcat version and configuration

  • Tomcat 6 supports NIO.
  • Set “reloadable” false – remove unnecessary detection overhead
  • Set “liveDeploy” to false – liveDeploy controls whether your webapps directory is periodically checked for new war files. This is done using background thread.
  • Set “debug” to 0
  • Set “swallowOutput” to true – This makes sure all output to stdout or stderr for a web application gets directed to the web application log rather than the console or catalina.out. This make it easier to troubleshoot problems.
  • Connector configuration – minProcessor, maxProcessor, acceptCount, enableLookups. Don’t set the acceptCount too high b/c this sets the number of pending requests awaiting processing. It is better deny few requests than overload Tomcat and cause problems for all requests. Set “enableLookups” to false b/c DNS lookups can add significant delays.

Database connection pool

  • We use connection pool provided by Spring instead
  • Using middleware to persist and cache objects from your database can significantly improve performance b/c of fewer db calls, less thrashing of the JVM for creation and subsequent GC of object craeted for resultset.

Application design and profiling

  • If the data used to generate a dynamic page rarely changes, modify it to a static page which you regenerate periodically.
  • Cache dynamic page
  • Use tool like JProble to profle your web applications during development phase
  • Look for possible thread synchronization bottlenecks
  • Date and Time thread synchronization bottleneck 

Troubleshooting

Collecting and analyzing log data

Common problems

  • Broken pipe – For HTTP Connector indicates that the remote client aborted the request. For web server JK Connector indicates that the web server process or thread was terminated. These are normal and rarely due to a problem with Tomcat. However, if you have long request, the connectionTimeout may close the connection before you send your response back.
  • Tomcat freezes or pauses with no request being processed – usually due to a long pause of JVM GC. A long pause can cause a cascading effect and high load once Tomcat starts handling requests again. Don’t set the “acceptCount” too high and use java -verbose:gc startup argument to collect GC data.
  • Out of Memory Exception – look into application code to fix the leak (profile tool can help). Increase available memory on the system via -Xmx. Restart tomcat!
  • Database connection failure – connection used up when traffic is spike.
  • Random connection close exception - when you close your connection twice. First close(), the connection returns to the pool. It may be picked up by another thread. Now, second close() may close a connection that is being used by other thread. Don’t close connection twice, use JDBC Template from Spring to avoid this problem. 

Reference

  1. JavaWorld GC Article
  2. Sun HotSpot Performance Document
  3. Tomcat Performance Slides

  

Leave a comment Continue Reading →

Power of awk

If you have a file of records, and you want to find out which record(s) meets the criteria like field1=xyz, field2=abc… How would you approach it? Simple! Load the file to database, write a sql with where clause and have the database taken care of it for you. Is it the simplest way? May not! Awk can achieve the same job without using db.

How Awk works?

awk program [files]
awk -f pfile [files]

Awk runs through a text file by reading and processing one record at time. Its commands are written with the intention that they act repetitively on each record as it is read in to awk. A record that has been read by awk is broken into separate fields ($1, $2, …etc), and actions can be performed on the separate fields as well as on the whole record.

When you type awk as a command, you must also provide two additional pieces of information or arguments. The first is the program or script to be executed, and the second is some method of identifying the file on which to perform the actions. Awk can be used as a pipe, and the file does not need to be explicitly named on the command line like: ls -l | awk ‘{print}’

By default, awk breaks down the field from the record via space and assigns each field value to variable $1, $2… So, you can do something like: ls -l | awk ‘{print $1 $3}’ to filter out some of the information before outputting it. So far so good! If you really run the command above, you may notice all the fileds output are concatenated into a single string. To separate them out, you could put ” ” to the statement to separate out $1 and $3. Or you tell awk what it should use as delimiter.

Executing more than one set of commands

So far, we are telling to do one thing per record. What if we want it to do more than one set of commands on a record? Use (;) to separate the commands: ls -l | awk ‘{ttl+=$5; print $9 }’. As you can see, awk takes variable (ie. ttl) as well. If you want to add pre-processing and post-processing commands, you can do this:

ls -l|awk ‘
BEGIN{print “Custom Directory Listing”}
{ttl+=$5;
print $9 ” ” $5 ” “$3}
END{print “Total ” ttl ” bytes”}’

There are some built-in variable that you may find useful:

  • FILENAME
  • FS – input field separator (you can set it FS = : via awk -F: ‘{…}’)
  • OFS – output field separator
  • NF – number of fields in the current record
  • NR – number of current record (line #)
  • RS – record separator
  • $0 – entire input record
  • $n – nth field in the current record. And field are separated by FS.

If tests and conditions

You can test the field via the following operators ( ==, >, <, >=, <=, !=) and conditions can combined via (&&, ||, !) . Loop that you can use:

  • while (condition) command
  • do command while (condition)
  • for (set; test; increment) command (break & continue work as expected)

Pattern matching

Now you know the basic of awk. To discuss the power of awk, it is hard not mentioning its pattern matching feature. For example, if I want to search on a list of employee records that has ‘CA’, I can write my awk command like followings: awk ‘/AL/ {print $3,$2}’ emp_names

Pattern can be:

  • /AL|IN/ (AL or IN)
  • $2 ~ /A|B|C/ (with letter A, B or C in the 2nd field)
  • $1 !~ /pattern/
  • $1 != prev {print; prev = $1} (print all input lines in which the first field is different from the previous first field)
  • /[0-9]+/
  • /a-zA-Z]+/
  • /^abc/ (match any string with abc at the beginning)
  • /p$/ (match any string with p at end of the string)
  • . (single character)
  • * (any character)
  • + (at least one)
  • ? (1 or 0)
  • {n}, {n,m}, {n,} (specify the occurrence range)
Leave a comment Continue Reading →

WebDav vs FTP

Today, I have come across a technical issue that a process is taking too long to download a file from one of our file server. The reason is due to the number of the files of a folder is increased over time and finally reach to ~ 12000. If you use ftp, you need to establish tcp connection per file download.

  1. Webdav is a protocol. It enhances http, and works thru http, so you only need to open port 80 (or 443 if https)
  2. It allows you to work with remote files as if they were on your machine.
  3. Example, ftp, for you to edit a txt file on an ftp server, you need to download the file, edit, and upload the revised file. Webdav allows you to work with the file, fromthe server, without downloading.
  4. It also allows for “locking” of files, which means, multiple users can work with a file at the same time, but only 1 at a time can make changes.
  5. Wedav is also secure because it works along side ntfs permissions, and can be used over https, so everything is encrypted. With FTP you’d have to setup a vnp connection to the ftp server, and enable IPsec policies over that vpn.
  6. Webdav allows you to pipe multiple transfers, as opposed to ftp opening a new connection for every transfer.
  7. It is more efficient, and nativley support by windows (webfolders IS webdav) and ultimatly more secure than FTP, and the fact that when working with xp, every app sees the file as if it were local, makes it a much more efficient webmastering/file transfering solution.
  8. It means less ports to open on your firewall, and a smaller attack zone for hackers, cause it works from port 80 with http
Leave a comment Continue Reading →

Basic hardware knowledge

What should we look at for a machine?

  1. CPU (how many core, how many physical cpu(s), how fast, 64 bits?, cache size)
  2. Memory RAM
  3. IO speed 

Dual-core CPU vs multiprocessor

A dual-core CPU is a CPU with two separate cores on the same die, each with its own cache. It’s the equivalent of getting two microprocessors in one. Multi-processor has 2 or more CPUs physically on the motherboard.  An attractive value of dual core processor is that it does not require new motherboard. For now, I have heard of quad-core CPU as well (4x). For example, Dell PowerEdge 6950 has 4 quad-core AMD Opteron 8300 processors.In a single-core or traditional processor the CPU is fed strings of instructions it must order, execute, then selectively store in its cache for quick retrieval. When data outside the cache is required, it is retrieved through the system bus from random access memory (RAM) or from storage devices. Accessing these slows down performance to the maximum speed the bus, RAM or storage device will allow, which is far slower than the speed of the CPU. The situation is compounded when multi-tasking. In this case the processor must switch back and forth between two or more sets of data streams and programs. CPU resources are depleted and performance suffers.In a dual core processor each core handles incoming data strings simultaneously to improve efficiency. Now when one is executing the other can be accessing the system bus or executing its own code. Adding to this favorable scenario, both AMD and Intel’s dual-core flagships are 64-bit.To utilize a dual core processor, the operating system must be able to recognize multi-threading and the software must have simultaneous multi-threading technology (SMT) written into its code. SMT enables parallel multi-threading wherein the cores are served multi-threaded instructions in parallel. Without SMT the software will only recognize one core.

Memory is important

If you cache stuff, you need memory. Cache is one of the key weapons to boost your performance. It can reduce the number of database calls dramatically and eliminate unnecessary query processing time. So, you want to buy your system more RAM and it is cheap too!

Disk Storage

When we talk about IO, we are looking at the power of our storage device. Here we are going to understand some of the common terms

  1. SCSI – It’s a fast bus that can connect lots of devices to a computer at the same time, including hard drives, scanners, CD-ROM/RW drives, printers and tape drives. Other technologies, like serial-ATA (SATA), have largely replaced it in new systems, but SCSI is still in use.  
  2. RAID – SCSI is often used to control a redundant array of independent discs (RAID). Other technologies, like serial-ATA (SATA), can also be used for this purpose. A RAID is a series of hard drives treated as one big drive. Disk arrays stripe data across multiple disks and access them in parallel to achieve high throughput for the system. But large disk array is highly vulnerable to disk failure. The solution to the problem of lower reliability in disk arrays is to improve the availability of the system via redundancy (fault tolerant!). However, redundancy has its disadvantage of lowering the write performance because of maintaining the consistency across 2 replica. Data striping (0) - improve performance; Redundancy via mirroring (1) improves availability. RAID 01 – mirror of strips, RAID 10 – strip of mirror. For 10 machines, you adopt RAID 10. You will have 5 set of mirrors and each set contain its own strips. Clearly, RAID 1+0 is more robust than RAID 0+1.
  3. SAS – The newest type of SCSI, called Serial Attached SCSI (SAS), uses SCSI commands but transmits data serially. SAS uses a point-to-point serial connection to move data at 3.0 gigabits per second.
  4. SATA vs SAS. In term of storage (GB/$), SATA is a LOT better than SAS. But in term of performance, SATA is 10K rpm is slower than SAS (15K rpm). In addition, SAS is more reliable.
  5. iSCSI – iSCSI is one of two main approaches to storage data transmission over IP networks; the other method, Fibre Channel over IP.
  6. SAN – Storage Area Network (SAN) is a high-speed subnetwork of shared storage devices. A storage device is a machine that contains nothing but a disk or disks (disk array) for storing data. A SAN’s architecture works in a way that makes all storage devices available to all servers on a LAN or WAN. As more storage devices are added to a SAN, they too will be accessible from any server in the larger network. In this case, the server merely acts as a pathway between the end user and the stored data.
Leave a comment Continue Reading →