Tag Archives: hadoop

Learning Hive

Starting to learn Hive

As I mentioned in my last article, I was getting excited about the potential of Hive. Today, I decided to start learning it. I found a great introductory video that gives you a nice warm-up on using Hive (a basic knowledge of how Hadoop and MapReduce work will help you digest the material).

Below are some highlights from this video.

Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed at how close its SQL syntax is to the regular SQL of an RDBMS. Below are some of the SQL statements used in this tutorial.

-- ---------- Set up your tables in Hive ----------
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

DESCRIBE shakespeare;

-- ---------- Load data into the Hive table from Hadoop HDFS ----------
LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;

-- ---------- Query the data using the Hive SQL interface ----------
select * from shakespeare limit 10;
select * from shakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- ---------- Create a merged table and populate it by joining two tables ----------
-- (assumes the kjv table was created and loaded the same way as shakespeare;
--  the merged schema is inferred from the column names used in the query below)
CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);

-- ---------- Query the merged table ----------
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;

To prepare the data for Hive to load, the demo runs a separate MapReduce job. Remember to delete the job logs before loading the data into the Hive table.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq '\w+'

# remove the mapreduce job log
hadoop fs -rmr shakespeare_freq/_logs
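To make the shape of the prepared data concrete, here is a minimal Python sketch of what that job computes, assuming the grep example emits one tab-separated "count, match" line per distinct match of the \w+ pattern (which is what the shakespeare table definition expects). The function names and the sample text are made up for illustration; this is not the Hadoop implementation.

```python
import re
from collections import Counter

def word_frequencies(lines, pattern=r"\w+"):
    """Count every regex match across all input lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(pattern, line))
    return counts

def to_hive_rows(counts):
    """Format counts as 'freq<TAB>word' lines, highest frequency first."""
    # Ties are broken alphabetically so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{freq}\t{word}" for word, freq in ordered]

# Hypothetical sample input standing in for the Shakespeare text.
sample = ["to be or not to be", "that is the question"]
rows = to_hive_rows(word_frequencies(sample))
for row in rows:
    print(row)
```

Each printed line has the frequency first and the word second, matching the (freq INT, word STRING) column order of the table.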

Large-scale data processing systems are often I/O bound. In a MapReduce job, your mapper spends much of its time waiting for data to load from disk. Hadoop mitigates the problem by loading in parallel from lots of hard drives. However, a single hard drive still maxes out at about 75 MB/s of read throughput, a physical limit we can do nothing about. To achieve good speed, the key is to reduce the number of Hadoop passes over the data.
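A quick back-of-the-envelope calculation shows why both the number of disks and the number of passes matter. This is only a sketch: it assumes every disk sustains the 75 MB/s quoted above, that data is spread evenly, and that nothing else (network, CPU, job startup) costs time. The 1 TB dataset size is an arbitrary example.

```python
def scan_seconds(total_mb, disks, mb_per_sec_per_disk=75.0):
    """Time to read total_mb once when it is spread evenly over `disks` drives."""
    return total_mb / (disks * mb_per_sec_per_disk)

def job_seconds(total_mb, disks, passes, mb_per_sec_per_disk=75.0):
    """Every full pass over the data pays the scan cost again."""
    return passes * scan_seconds(total_mb, disks, mb_per_sec_per_disk)

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB (decimal units)

print(f"1 disk, 1 pass:      {scan_seconds(ONE_TB_MB, 1) / 3600:.1f} hours")
print(f"100 disks, 1 pass:   {scan_seconds(ONE_TB_MB, 100) / 60:.1f} minutes")
print(f"100 disks, 3 passes: {job_seconds(ONE_TB_MB, 100, 3) / 60:.1f} minutes")
```

Under these assumptions, one disk needs about 3.7 hours to scan 1 TB, while 100 disks need a little over two minutes per pass. Adding disks shortens each pass, but the total still grows linearly with the number of passes.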

Since Hive sits on top of Hadoop’s HDFS, it inherits the same restrictions. You cannot UPDATE, DELETE, or INSERT individual records as in a regular RDBMS. However, you can bulk load new files (data) into a table, and you can delete a file from Hive.

Hive needs to store the metadata of its tables outside HDFS. You can use a regular RDBMS for the job. But when you start Hive locally, it looks for a local metastore. So, in a distributed environment, you may need to centralize the metastore in a remote location. There is a wiki page on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gamma, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase

Hive on Amazon EC2 cloud

[Figure: ad-serving system architecture with Hive on EC2]

I once worked for a display ad network company that collected over 400 million impression/click logs per day. Faced with this amount of data, my ex-company bought a supercomputer and crossed its fingers that it could handle the growth in both data volume and analytic demand. That is obviously not a scalable solution. But what is the best solution?

Although I no longer work for this company, it is still an interesting problem to solve. A good friend of mine proposed a shared-nothing solution for this company: partition the data across a set of PostgreSQL databases and put Greenplum on top of them to parallelize the queries. There is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture). I like this approach. The only catch is that Greenplum is not free, and the upfront cost may be difficult for a startup to bear. Apart from that, this setup requires all the databases to run on the same network, which kept us from moving it to an elastic cloud like Amazon EC2.
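The shared-nothing idea can be sketched in a few lines of Python. This is an illustration only, not how Greenplum or PostgreSQL work internally: each "shard" is just a list standing in for one independent database, rows are routed by a hash of a hypothetical ad_id field, every shard counts its own clicks, and a coordinator merges the partial results.

```python
import zlib
from collections import Counter

def shard_for(key, num_shards):
    """Route a key to a shard with a stable hash (same key, same shard)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def partition(rows, num_shards):
    """Split rows across shards; no shard shares storage with another."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row["ad_id"], num_shards)].append(row)
    return shards

def local_click_counts(shard):
    """Work done independently on one shard: count clicks per ad."""
    return Counter(row["ad_id"] for row in shard if row["event"] == "click")

def parallel_click_counts(rows, num_shards=4):
    """Coordinator: run the local query on every shard, then merge."""
    total = Counter()
    for shard in partition(rows, num_shards):
        total.update(local_click_counts(shard))
    return total

# Hypothetical log records for illustration.
sample_logs = [
    {"ad_id": "ad-1", "event": "impression"},
    {"ad_id": "ad-1", "event": "click"},
    {"ad_id": "ad-2", "event": "click"},
    {"ad_id": "ad-1", "event": "click"},
    {"ad_id": "ad-3", "event": "impression"},
]
print(dict(parallel_click_counts(sample_logs)))
```

In a real deployment the loop over shards would run concurrently on separate machines. The sketch runs it sequentially to keep the data flow easy to follow.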

Later on, I joined a great company in the same industry that was looking for a cloud solution to host its data warehouse. So, I got a chance to revisit this problem. During the research, I came across an interesting technology: column-based databases (e.g. Infobright and LucidDB). A traditional database stores data in rows and fetches whole rows from the data files into memory, which is inefficient if your query needs only a few columns for its computation. A column-based data store keeps your data by column and can compress it effectively because all values in a column have the same data type. This solution is great, but it doesn’t do MPP (i.e. massively parallel processing) and it is not ready for the cloud yet.
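Here is a small Python sketch of the two points made above: a query that needs one column only has to touch that column, and a column of same-typed, repetitive values compresses well. Run-length encoding is used purely as an example of such compression; it is an assumption for illustration, not a claim about what Infobright or LucidDB actually use. The sample rows are made up.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]

# Hypothetical ad log rows, stored row by row as a traditional database would.
rows = [
    {"country": "US", "event": "impression", "cost": 2},
    {"country": "US", "event": "click", "cost": 5},
    {"country": "US", "event": "impression", "cost": 1},
    {"country": "CA", "event": "impression", "cost": 3},
]

columns = to_columns(rows)
total_cost = sum(columns["cost"])                     # reads one column only
country_rle = run_length_encode(columns["country"])   # compact column storage
print(total_cost, country_rle)
```

The sum never looks at the country or event values, whereas a row-oriented scan would have to load every field of every row to answer the same question.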

Here comes another solution: Hive on top of Hadoop on top of the Amazon cloud. It is an interesting idea. Check out this video to learn about it.


If you are not sure what Hadoop is and want a warm-up in massive computing, I suggest you go through the following 5 excellent Google lectures.



Powerful Full Text Search – Part 2 Nutch

Introduction of Nutch & Hadoop

After Lucene, the author created another powerful tool: Nutch. Nutch is a powerful crawler built on top of Lucene. With Nutch, you can launch a multi-threaded crawler to obtain information from the Net. At the time of writing, Nutch is at version 0.9. Nutch comes with a list of cool features, including whole-Web crawling and local file crawling for the intranet, indexing all the while.

Hadoop was designed to handle the petabytes of data that Nutch could potentially store and process. In fact, Hadoop has its own file system: the Hadoop Distributed File System (HDFS), which can run on any old run-of-the-mill, low-cost hardware.
Hadoop works by storing a part of the file system’s data on each server in the cluster. As new queries come in, HDFS follows the "moving computation is cheaper than moving data" rule: moving the processing of the query as close as possible to the data is faster than placing the query at random within the cluster and moving data long distances across the network.
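The rule can be illustrated with a toy placement function in Python. This is a deliberately simplified sketch, not Hadoop's actual scheduler: it assumes we know which nodes hold a replica of the data block and which nodes are currently free, and the node names are made up.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one data block.

    Returns (node, is_local). A free node that already stores the block is
    preferred, so the computation moves to the data. Otherwise the task falls
    back to any free node and the data has to travel over the network.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, True
    if free_nodes:
        return free_nodes[0], False
    return None, False

# Hypothetical cluster state: the block lives on n2 and n3.
print(pick_node({"n2", "n3"}, ["n1", "n2"]))  # local placement on n2
print(pick_node({"n3"}, ["n1"]))              # remote read from n1
```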

I have searched around to see if anyone can give me some tips on this tool. Surprisingly, I didn’t find much. But don’t worry, I have found some resources that can at least get you started playing with it.

Set up Nutch

Here is the guide written by Peter Wang that I followed to bring my Nutch up. Follow it and get your Nutch running before going further. By the way, if you want to run Nutch with Solr, this is a good tutorial.

Nutch Architectural Review