Posted by admin on November 18, 2009
Starting to learn Hive. As I mentioned in my last article, I was getting excited about the potential of Hive. Today I decided to start learning it. I found a great introductory video that offers a nice warm-up for using Hive (a basic knowledge of how Hadoop and MapReduce work would [...]
Posted by admin on November 11, 2009
I once worked for a display ad network company that collected over 400 million impression and click logs per day. To cope with this amount of data, my former company bought a supercomputer and crossed its fingers that it could handle the growth in both data volume and analytic demand. It is obviously not a [...]
Posted by admin on April 17, 2009
Common Text Processing Commands. In our daily life, we deal with lots of data. The data is normally stored in text format so that humans can read it easily. With the large amount of data we have, we need ways to deal with it. There are several things we frequently do with the data: Search, [...]
Posted by admin on October 15, 2008
A site called “PlentyOfFish.com” is currently getting 30 million hits a day. The number doesn’t blow me away. What surprises me, however, is that the site is basically operated by a single man, Markus Frind. How did he achieve that? If you want to hear how he does it, you can go to his interview from [...]
Posted by admin on July 4, 2008
Introduction to Lucene. I have heard of Lucene and its powerful full-text search capability many times. Today I decided to take a look at it. Before diving into the user guide, I went to Google Tech Talks to find a video about Lucene. Here is what I found: After I finished [...]
Posted by admin on May 29, 2008
Most companies I have worked for use Tomcat as their servlet container. It is the de facto standard, just as Apache is for web servers. However, most of us just drop our WAR file into the webapps folder and use Tomcat with all the default settings out of the box. It works fine in [...]
Posted by admin on April 4, 2008
If you have a file of records and you want to find out which records meet criteria like field1=xyz, field2=abc… how would you approach it? Simple! Load the file into a database, write a SQL query with a WHERE clause, and have the database take care of it for you. Is it the simplest way? Maybe not! [...]
Posted by admin on March 20, 2008
Today I came across a technical issue: a process was taking too long to download a file from one of our file servers. The reason is that the number of files in a folder has increased over time and finally reached ~12000. If you use ftp, you need to [...]
Posted by admin on November 14, 2007
What should we look at in a machine? CPU (how many cores, how many physical CPUs, how fast, 64-bit?, cache size), memory (RAM), and IO speed. Dual-core CPU vs. multiprocessor: a dual-core CPU is a CPU with two separate cores on the same die, each with its own cache. It’s the equivalent of getting two microprocessors in [...]
Posted by admin on October 11, 2007
Configure Tomcat. Change the port to 80: edit install_dir/conf/server.xml and change the port attribute of the Connector element from 8080 to 80. Turn on servlet reloading: edit install_dir/conf/context.xml and change <Context> to <Context reloadable="true">. Change Tomcat's default AJP/1.3 connector port: edit install_dir/conf/server.xml and change the value of the port attribute in the AJP/1.3 Connector element. [...]