Tag Archives: data warehouse

Learning Hive

Starting to learn Hive

As I mentioned in my last article, I was getting excited about the potential of Hive. Today I decided to start learning it. I found a great introductory video that gives you a nice warm-up on using Hive (a basic knowledge of how Hadoop and MapReduce work will help you digest the material).

Below are some highlights from the video.

Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed at how close its SQL syntax is to the regular SQL of an RDBMS. Below are some of the SQL statements used in the tutorial.

-- ---------- Set up your tables in Hive ----------
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

DESCRIBE shakespeare;

-- ---------- Load data into the Hive table from Hadoop HDFS ----------
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

-- ---------- Query the data using the Hive SQL interface ----------
select * from shakespeare limit 10;
select * from shakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- Show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- ---------- Create a merged table and populate it by joining two tables ----------
-- Assumes a kjv table with the same layout as shakespeare has already been created and loaded.
-- The column names below are taken from the query against merged that follows.
CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);

-- ---------- Query the merged table ----------
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;

To prepare the data for Hive to load, the demo runs another MapReduce job. Remember to delete the job logs before loading the Hive table.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq '\w+'

# remove the MapReduce job log
hadoop fs -rmr shakespeare_freq/_logs

Large-scale data processing systems are often I/O bound. In a MapReduce job, the mapper spends much of its time waiting for data to load from disk. Hadoop mitigates the problem by reading in parallel from lots of hard drives. However, a single hard drive still maxes out at about 75 MB/s of read throughput as a physical limit, and there is nothing we can do about that. To achieve good speed, the key is to minimize the number of Hadoop passes over the data.
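To put the 75 MB/s figure in perspective, here is a back-of-the-envelope sketch in Python. The 1 TB dataset size, the drive counts, and the function name `scan_seconds` are my own illustrative assumptions, not numbers from the video. It also assumes perfect parallelism and ignores network and CPU overhead, so treat the results as a lower bound.

```python
def scan_seconds(dataset_mb, drives, mb_per_sec=75.0, passes=1):
    """Idealized time to read a dataset `passes` times, striped evenly over `drives` disks."""
    if drives < 1 or passes < 1:
        raise ValueError("drives and passes must be at least 1")
    return passes * dataset_mb / (drives * mb_per_sec)


ONE_TB_MB = 1_000_000  # 1 TB in decimal MB; an assumed dataset size

if __name__ == "__main__":
    for drives in (1, 10, 100):
        for passes in (1, 3):
            hours = scan_seconds(ONE_TB_MB, drives, passes=passes) / 3600
            print(f"{drives:>3} drives, {passes} pass(es): {hours:.2f} h")
```

Under these assumptions a single drive needs roughly 3.7 hours for one pass over 1 TB, while 100 drives need a bit over two minutes. Every extra pass multiplies the total, which is why cutting the number of passes matters so much.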

Since Hive sits on top of Hadoop’s HDFS, it inherits the same restrictions. You cannot UPDATE, DELETE, or INSERT individual records as you would in a regular RDBMS. However, you can bulk load to add new files (data) to a table, and you can delete a file from Hive.

Hive needs to store table metadata outside HDFS. You can use a regular RDBMS for this. When you start Hive locally, it looks for a local metastore, so in a distributed environment you may need to centralize the metastore in a remote location. There is a wiki page on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gamma, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase

How to build a data warehouse

Operational databases are most commonly designed using normalized modeling, often using third-normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.

Example: Order-Management Schema for operational database

relatonalmodel.JPG

Data warehouses differ from operational databases in the way they are designed; they are optimized for efficient querying and not for updating. Data warehouses provide a read-only version of the data in the operational databases, which is optimized for querying. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas produced are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit to a web page, etc.

Example: Order-Management Dimension Schema

dimensionmodeling.JPG
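Here is a minimal sketch of a star schema, written in Python with the built-in sqlite3 module so it runs anywhere. The table and column names (`dim_product`, `dim_date`, `fact_order_line`) and the sample rows are made up for illustration; they are not taken from the diagram.

```python
import sqlite3


def build_star_schema(conn):
    """Create two dimension tables and one fact table, then add a few sample rows."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, year INTEGER);
        -- Each fact row is one measurable event: a single order line.
        CREATE TABLE fact_order_line (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id    INTEGER REFERENCES dim_date(date_id),
            quantity   INTEGER,
            amount     REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(1, "2008-12-30", 2008), (2, "2009-01-15", 2009)])
    conn.executemany("INSERT INTO fact_order_line VALUES (?, ?, ?, ?)",
                     [(1, 2, 2, 20.0), (2, 2, 1, 15.0), (3, 2, 4, 40.0), (1, 1, 1, 10.0)])


def sales_by_category(conn, year):
    """Typical warehouse query: join the fact table to its dimensions and aggregate."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_order_line AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        WHERE d.year = ?
        GROUP BY p.category
        ORDER BY p.category
    """, (year,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    print(sales_by_category(conn, 2009))
```

Every report query has the same shape: start from the fact table, join out to whichever dimensions you want to filter or group by, and aggregate the measures.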

The key benefits of a data warehouse are simplification and consolidation of data. It normally gathers data from different operational databases into a single dimensional model for reporting and analysis. Dimensional modeling also offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. The approaches we take to reduce the complexity are: (1) we try to model one aspect of the system in each dimensional-model schema; (2) we denormalize the schema to reduce the number of joins.

ETL Process

Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally, the data must be loaded into the data warehouse. There are some areas you need to pay attention to when you perform the ETL:

  1. During extraction, you put a lot of strain on the operational database. To deal with this problem, we can replicate a low-cost copy of the operational database onto the warehouse machine before doing the extraction. The output of the extraction SQL can be a CSV file.
  2. Transformation can mean computing summary data or converting postal codes into geocodes (i.e. latitude and longitude) to power "within X miles" queries. You can use Perl for this job. The output of the transformation may be another CSV file.
  3. Finally, you load the data from CSV into the dimensional model. To speed up the load in MySQL, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared with the TRUNCATE command before loading.
  4. You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it's full we swap it into place with RENAME. This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
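The four steps can be sketched end to end in a few lines. This is only an illustration under stated assumptions: I use Python and SQLite instead of Perl and MySQL, a pair of tables (`fact_orders` and `fact_orders_loading`) instead of two databases, DELETE as a stand-in for TRUNCATE, and invented table names, columns, and rows throughout.

```python
import csv
import io
import sqlite3


def extract(op_conn):
    """Extract: dump the rows of interest from the operational table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["order_id", "quantity", "unit_price"])
    writer.writerows(
        op_conn.execute("SELECT order_id, quantity, unit_price FROM orders ORDER BY order_id")
    )
    return buf.getvalue()


def transform(raw_csv):
    """Transform: compute a new field (amount) and emit a second CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["order_id", "amount"])
    for row in csv.DictReader(io.StringIO(raw_csv)):
        writer.writerow([row["order_id"], int(row["quantity"]) * float(row["unit_price"])])
    return out.getvalue()


def load_and_swap(wh_conn, transformed_csv):
    """Load into the idle table, then swap it with the live one in one transaction.

    wh_conn must be in autocommit mode (isolation_level=None) so that the
    explicit BEGIN/COMMIT below control the transaction.
    """
    rows = [(int(r["order_id"]), float(r["amount"]))
            for r in csv.DictReader(io.StringIO(transformed_csv))]
    wh_conn.execute("DELETE FROM fact_orders_loading")  # stand-in for TRUNCATE
    wh_conn.executemany("INSERT INTO fact_orders_loading VALUES (?, ?)", rows)
    wh_conn.execute("BEGIN")
    wh_conn.execute("ALTER TABLE fact_orders RENAME TO fact_orders_tmp")
    wh_conn.execute("ALTER TABLE fact_orders_loading RENAME TO fact_orders")
    wh_conn.execute("ALTER TABLE fact_orders_tmp RENAME TO fact_orders_loading")
    wh_conn.execute("COMMIT")


def demo():
    """Run one ETL cycle against two throwaway in-memory databases."""
    op_conn = sqlite3.connect(":memory:")
    op_conn.execute("CREATE TABLE orders (order_id INTEGER, quantity INTEGER, unit_price REAL)")
    op_conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 2, 9.5), (2, 1, 20.0)])

    wh_conn = sqlite3.connect(":memory:", isolation_level=None)
    for table in ("fact_orders", "fact_orders_loading"):
        wh_conn.execute(f"CREATE TABLE {table} (order_id INTEGER, amount REAL)")
    wh_conn.execute("INSERT INTO fact_orders VALUES (0, 1.0)")  # stale row from the previous cycle

    load_and_swap(wh_conn, transform(extract(op_conn)))
    return wh_conn.execute("SELECT order_id, amount FROM fact_orders ORDER BY order_id").fetchall()


if __name__ == "__main__":
    print(demo())
```

Readers only ever query `fact_orders`, so they see either the old data or the new data, never a half-loaded table. After the swap, the old data sits in the loading table, ready to be cleared at the start of the next cycle.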

Quick Tips

  1. CSV isn't a strict standard. Using XML can solve character issues, but it might not perform as well due to formatting overhead.
  2. Transformation is not always needed. If it isn't, use "SELECT … INTO TABLE" for a straight database-to-database extract-and-load.
  3. Incremental loading is highly desirable. Triggers can achieve that.
  4. Our operational database uses MySQL's InnoDB backend, which provides referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse for better performance, since it is read-only and transactional features are not needed.
  5. MySQL does not support bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL has built in-memory bitmaps from ordinary indexes at query time (bitmap scans) since version 8.1, and a number of commercial database systems support bitmap indexes directly.
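To show why bitmap indexes suit low-cardinality columns, here is a toy sketch in Python. It is not how any database actually stores them; the column values are made up, and plain Python integers stand in for the bitmaps.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Two low-cardinality columns of a hypothetical six-row fact table.
region = ["east", "west", "east", "north", "west", "east"]
status = ["paid", "paid", "open", "paid", "open", "paid"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

if __name__ == "__main__":
    # WHERE region = 'east' AND status = 'paid' becomes a single bitwise AND.
    print(rows_from_bitmap(region_idx["east"] & status_idx["paid"]))
    # WHERE region IN ('east', 'west') AND status = 'open'
    print(rows_from_bitmap((region_idx["east"] | region_idx["west"]) & status_idx["open"]))
```

With only a handful of distinct values per column, the index is a handful of bitmaps, and combining filters across columns is cheap bitwise arithmetic. That is the kind of multi-dimension filtering a star schema query does all the time.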

Hibernate vs iBatis

Hibernate is great. However, I don’t think it fits every data access requirement. At its core, it is an ORM tool that helps you map your object model to a relational model. If you have full control of your relational model and perform lots of CRUD operations, it is certainly a great tool for you. Its transparent persistence, two-level caching, dirty checking, lazy/eager data retrieval, and SQL generation can indeed save us lots of development time. However, one tool doesn’t fit all! Why not?

At my current company, I created a reporting tool that interfaces with a dimensional model in a data warehouse. In this setting, you deal with a star schema with denormalized dimensions. I often need to tune query performance by looking at the explain plan. Without full control of the SQL, that job is hard to do. Apart from that, a reporting tool mostly issues read-only, set-based queries against the data warehouse. The result sets returned don’t fit into my OO model at all. Again, Hibernate just doesn’t fit in. People in my company argue that I should use named queries in Hibernate for the sake of sticking with the standard. I am like, OK, whatever… I already know a tool called iBatis that lets me do my job cleanly. Why the hell would I be motivated to try named queries, which are basically a way to bypass the ORM model and query the database directly? What benefit would I get from that? The cache? Our fact and dimension tables are updated by ETL, not by my reporting app. Unless the ETL sends me an event when the update cycle finishes so that I can flush my cache, I simply don’t think it helps me much.

Anyway, this is just my own little perspective. You don’t have to agree with me. The point is not that I don’t like Hibernate, but that I don’t like being pushed to use it only because it is the “standard” to someone. If Hibernate could help me construct my SQL based on user input, stream my results directly to my presentation layer, populate my model automatically based on the mapping I provide, detect data warehouse changes, and take care of my cache, then I would be more than happy to adopt it in my reporting app. Otherwise, I am not eager to dump my iBatis DAO layer unless the political game leaves me no choice.
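To make "construct my SQL based on user input" and "populate my model based on a mapping" a bit more concrete, here is a rough sketch of the idea. It is plain Python with sqlite3, not iBatis or Hibernate API; the tables, the `ALLOWED_FILTERS` whitelist, and the `SalesRow` class are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SalesRow:
    """A flat report row; it maps to a result set, not to an entity."""
    region: str
    total: float


# Whitelist of filterable columns, so user input never ends up in the SQL text itself.
ALLOWED_FILTERS = {"region": "d.region", "year": "f.year"}


def build_report_query(filters):
    """Assemble the SQL by hand, adding one bound WHERE clause per requested filter."""
    sql = ["SELECT d.region, SUM(f.amount) AS total",
           "FROM fact_sales f JOIN dim_store d ON d.store_id = f.store_id"]
    clauses, params = [], []
    for key, value in filters.items():
        if key not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {key}")
        clauses.append(f"{ALLOWED_FILTERS[key]} = ?")
        params.append(value)
    if clauses:
        sql.append("WHERE " + " AND ".join(clauses))
    sql.append("GROUP BY d.region ORDER BY d.region")
    return "\n".join(sql), params


def run_report(conn, filters):
    """Run the query and map each result row straight onto SalesRow."""
    sql, params = build_report_query(filters)
    return [SalesRow(*row) for row in conn.execute(sql, params)]


def demo_connection():
    """Tiny in-memory warehouse with invented data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute("CREATE TABLE fact_sales (store_id INTEGER, year INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "east"), (2, "west")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 2008, 10.0), (1, 2009, 5.0), (2, 2009, 7.0), (1, 2009, 2.5)])
    return conn


if __name__ == "__main__":
    print(run_report(demo_connection(), {"year": 2009}))
```

The SQL stays fully visible, so I can paste it into an explain plan and tune it, and the mapping is just "one result column per field" with no entity lifecycle behind it.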
