Starting to learn Hive
As I mentioned in my last article, I was getting excited about the potential of Hive. Today, I decide to start my journey to learn this. I found a great introductory video that gives you a nice warm-up of using Hive (A basic knowledge of how hadoop and mapreduce work would be helpful for you to digest the material inside).
Below are some highlights from this video
Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed how close the SQL syntax like the regular SQL for RDBMS. Below are some SQLs used in this tutorial.
//———- Set up your tables in HIVE —————–
SHOW TABLES;CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE;
DESCRIBE shakespeare;
//———- Load data into Hive table from Hadoop HDFS ——————-
LOAD DATA INPATH “shakespeare_freq” INTO TABLE shakespeare;//———- Query against the data using hive sql interface ————–
select * from shakespeare limit 10;
select * from sakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;//show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;//———- Create a merge table and populate it using dataset joining by 2 different tables
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);//———- Query the merge table ———————
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;
To prepare the data for Hive to load in, the demo uses another mapreduce job to achieve. Remember to delete the log before doing Hive table load.
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq ‘\w+’
//remove the mapreduce job log
hadoop fs -rmr shakespeare_freq/_logs
Often time, large scale data processing system always IO bound. So for mapreduce job, your mapper is always waiting for data to load from disk. Hadoop mitigates the problem via during parallel load from lots of hard drives. However, a single hard drive is still max out at 75MB/s read as physical limit and nothing we can do about this. In order to achieve good speed, the key is to eliminate # of hadoop pass
Since Hive is on top of Hadoop’s HDFS, it will have the same restrictions as it. So, you cannot do UPDATE, DELETE and INSERT records as regular RDMS. However, you can do bulk load to add more new files (data) to the table and you can do delete a file from Hive.
Hive needs to store metadata of the tables out from the HDFS. You can use regular rdms to achieve the job. But when you start Hive locally, it will seek for the local metastore. So, in distributed environment, you may need to centralize the metastore in a remote location. There is wiki on the Hive site that documents how to set it up.
See Hive in Action
Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.
Other projects similar to Hadoop
- Parallel databases: Gama, Bubba, Volcano
- Google: Sawzall
- Yahoo: Pig
- IBM Research: JAQL
- Microsoft: DryadLINQ, SCOPE
- Greenplum: YAML MapReduce
- Aster Data: In-database MapReduce
- Business.com: CloudBase