Archive | August, 2009

Java 5 Features – Enum and Annotation

Intent

This article summarizes some of the new and interesting features in Java 5 and how they change the way I code.

Enum

I have used int constants to make my life easier because named constants help avoid typos. However, this int enum pattern has several drawbacks:

  1. Java doesn’t provide a namespace for groups of int constants. I can either prefix each constant (like ABC_) or use inner interfaces to organize them.
  2. They are compile-time constants, so their values are compiled into the clients that use them. If a value changes, the clients must be recompiled.
  3. There is no easy way to translate an int constant into a printable string during debugging.
  4. You cannot easily iterate over all the constants in a group.
  5. You need a way to validate that a given int is a valid enum value.

Use the new enum type in Java 5 instead:

public enum Apple {FUJI, PIPPIN, GRANNY_SMITH}

An enum is a full-fledged class, effectively final, that exports one instance for each enumeration constant via a public static final field.

  1. A namespace is provided by the enum type name.
  2. You can add or reorder enumeration constants without recompiling clients.
  3. You can translate an enum constant into a printable string via its toString() method.
  4. Every enum type provides a values() method for iterating over its constants in declaration order.
  5. Compile-time type checking replaces the manual validation check.
  6. You can associate data with each enum constant.
  7. Enums implement Comparable and Serializable, and their constants should be immutable.
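The points above can be sketched with the Apple enum. This is a minimal illustration, not a definitive design: the color field, the color() accessor, and the AppleDemo class are assumptions added here to show how data can be attached to constants, and the color values are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// The Apple enum from above, extended with a hypothetical "color" field
enum Apple {
    FUJI("red"), PIPPIN("green"), GRANNY_SMITH("green");

    private final String color; // data associated with each constant

    Apple(String color) {
        this.color = color;
    }

    String color() {
        return color;
    }
}

class AppleDemo {
    // Collects the names of all apples with the given color, in declaration order
    static List<String> namesByColor(String color) {
        List<String> names = new ArrayList<String>();
        for (Apple apple : Apple.values()) {      // values() iterates every constant
            if (apple.color().equals(color)) {
                names.add(apple.toString());      // toString() gives the printable name
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesByColor("green"));
        // valueOf() validates a name: an unknown name throws IllegalArgumentException
        System.out.println(Apple.valueOf("FUJI") == Apple.FUJI);
        // Enums are Comparable, ordered by declaration order
        System.out.println(Apple.FUJI.compareTo(Apple.PIPPIN) < 0);
    }
}
```

Because a method can declare a parameter of type Apple, the compiler rejects any value that is not one of the three constants, which is what makes the manual validation of the int pattern unnecessary.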

EnumSet

If the elements of an enumerated type are used primarily in sets, it is traditional to use the int enum pattern and assign a different power of 2 to each constant, for example READ = 1 << 2, WRITE = 1 << 1, EXECUTE = 1 << 0 to represent the permissions of each entity in Unix. This representation lets you use the bitwise OR operation to combine several constants into a set, known as a bit field. The bit field representation also lets you perform set operations such as union and intersection efficiently using bitwise arithmetic. But bit fields have all the disadvantages of int enums mentioned above.
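As a small sketch of the bit field pattern described above, the three permission constants can be combined as follows. The Permissions class name and the has() helper are assumptions made for illustration.

```java
class Permissions {
    static final int READ = 1 << 2;    // binary 100
    static final int WRITE = 1 << 1;   // binary 010
    static final int EXECUTE = 1 << 0; // binary 001

    // Tests whether the given permission bit is set in the bit field
    static boolean has(int flags, int permission) {
        return (flags & permission) != 0;
    }

    public static void main(String[] args) {
        int owner = READ | WRITE;    // bitwise OR builds the set {READ, WRITE}
        int group = READ | EXECUTE;  // the set {READ, EXECUTE}

        System.out.println(owner | group); // union: 7
        System.out.println(owner & group); // intersection: 4, i.e. READ
        System.out.println(has(owner, EXECUTE)); // false

        // The drawbacks remain: a bit field prints as a bare number,
        // and nothing stops a caller from passing an unrelated int.
        System.out.println(owner); // 6
    }
}
```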

Now, the java.util package provides EnumSet to efficiently represent sets of values drawn from a single enum type. This class implements the Set interface and internally uses a bit vector to represent the set of values. For example, if your enum type has 64 or fewer constants, the entire EnumSet can be represented as a single long, so its performance is comparable to that of a bit field.

The EnumSet class provides three benefits that a normal set does not:

  1. Various factory methods that simplify constructing a set from an enum type
  2. Guaranteed ordering of the elements in the set, based on the order in which the enumeration constants are declared
  3. Performance and memory benefits that are hard to match with a general-purpose set implementation
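The following sketch shows these benefits with the same Unix-style permissions expressed as an enum. The Permission enum and the EnumSetDemo helper methods are assumptions made for illustration; the EnumSet factory methods (of, noneOf, complementOf) are part of java.util.

```java
import java.util.EnumSet;
import java.util.Set;

enum Permission { READ, WRITE, EXECUTE }

class EnumSetDemo {
    // Union of two permission sets
    static Set<Permission> union(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.addAll(b);
        return result;
    }

    // Intersection of two permission sets
    static Set<Permission> intersection(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Elements iterate in declaration order, regardless of the order given here
        Set<Permission> group = EnumSet.of(Permission.EXECUTE, Permission.READ);
        System.out.println(group); // [READ, EXECUTE]

        Set<Permission> owner = EnumSet.of(Permission.READ, Permission.WRITE);
        System.out.println(union(owner, group));        // [READ, WRITE, EXECUTE]
        System.out.println(intersection(owner, group)); // [READ]

        // Everything except READ
        System.out.println(EnumSet.complementOf(EnumSet.of(Permission.READ))); // [WRITE, EXECUTE]
    }
}
```

Unlike the int bit field, the set prints readable names and the compiler only accepts Permission constants as elements.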

Annotation

Annotations are a new language feature introduced in J2SE 5.0. Simply put, annotations allow developers to mark classes, methods, and fields with secondary information that is not part of the operating code. You can think of annotations as a way to extend the Java language.

Before annotations arrived in Java 5, you would use naming patterns to indicate that some program elements, such as methods, demanded special treatment by a tool or a framework. For example, JUnit required its users to name test methods with a pattern like testXXX(). This works, but it has some big disadvantages:

  1. Typos: a misspelled method name fails silently because the tool simply ignores the method.
  2. Naming patterns don’t provide a way to associate parameter values with program elements.

Annotations solve these problems. To use them, you:

  1. Create your own marker annotation or parameterized annotation (@interface is the keyword). A marker annotation has no parameters associated with it. You can also annotate the annotation itself with meta-annotations such as @Retention and @Target.
  2. Annotate the program elements.
  3. Write a processor to handle your annotated code. Generally, annotations never change the semantics of the annotated code, but they enable it for special treatment by tools. The metadata of a Method now carries the additional information your processor needs. You can use Method’s isAnnotationPresent() to check whether a method is annotated with a certain annotation type. If your annotation carries a parameter, you can use Method’s getAnnotation() to get the annotation object and call value() to obtain the parameter.
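The three steps above can be sketched as follows. This is only an illustration: the @Check and @Label annotations and the Sample and CheckRunner classes are hypothetical names invented here, while @Retention, @Target, isAnnotationPresent(), and getAnnotation() are the standard Java APIs.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Step 1a: a marker annotation (no parameters), kept at runtime for reflection
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Check { }

// Step 1b: a parameterized annotation with a single value() parameter
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Label {
    String value();
}

// Step 2: annotate program elements
class Sample {
    @Check @Label("addition")
    public static void addsUp() {
        if (1 + 1 != 2) throw new IllegalStateException("arithmetic is broken");
    }

    @Check
    public static void alwaysFails() {
        throw new IllegalStateException("boom");
    }

    public static void notChecked() { } // ignored by the processor
}

// Step 3: a processor that gives annotated methods special treatment
class CheckRunner {
    // Invokes every static @Check method and returns how many completed normally
    static int run(Class<?> type) {
        int passed = 0;
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isAnnotationPresent(Check.class)) continue;
            try {
                m.setAccessible(true);
                m.invoke(null);
                passed++;
            } catch (Exception e) {
                // InvocationTargetException wraps whatever the method threw
            }
        }
        return passed;
    }

    // Returns the @Label value of the named method, or null if it has none
    static String labelOf(Class<?> type, String methodName) {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                Label label = m.getAnnotation(Label.class);
                return label == null ? null : label.value();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(run(Sample.class));               // 1
        System.out.println(labelOf(Sample.class, "addsUp")); // addition
    }
}
```

Without RetentionPolicy.RUNTIME the annotations would not be visible through reflection, so the processor would find nothing to run.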

Reference

Below are some related articles I found useful:

  1. http://www.javalobby.org/java/forums/t16967.html
  2. Annotation in Tiger – Part 1 Meta-Annotation
  3. Annotation in Tiger – Part 2 Custom Annotation

 

 


How to Build a Data Warehouse

Operational databases are most commonly designed using normalized modeling, often third normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.

Example: Order-Management Schema for an operational database (image: relatonalmodel.JPG)

Data warehouses differ from operational databases in the way they are designed: they are optimized for efficient querying, not for updating. A data warehouse provides a read-only version of the data in the operational databases. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas it produces are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit to a web page, etc.

Example: Order-Management Dimension Schema (image: dimensionmodeling.JPG)

The key benefits of a data warehouse are simplification and consolidation of data. It normally gathers data from different operational databases into a single dimensional model for reporting and analysis. Dimensional modeling also offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. The approaches we take to reduce the complexity are (1) modeling one aspect of the system in each dimensional schema and (2) denormalizing the schema to reduce the number of joins.

ETL Process

Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally, the data must be loaded into the data warehouse. There are some areas you need to pay attention to when you perform the ETL:

  1. Extraction puts a lot of strain on the operational database. To deal with this problem, we can replicate a low-cost copy of the operational database on the warehouse machine before doing the extraction. The output of the extraction queries can be a CSV file.
  2. Transformation can include computing summary data or converting postal codes into geocodes (i.e. latitude and longitude) that power "within X miles" queries. You can use Perl to do this job. The output of the transformation may be another CSV file.
  3. Finally, you load the data from the CSV files into the dimensional model. To speed up the load in MySQL, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared with the TRUNCATE command before loading.
  4. You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when the load is complete we swap it into place with RENAME. This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
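The load and swap steps above can be sketched as a small Java helper that only builds the MySQL statements; executing them (for example through JDBC) is left out. Several things here are assumptions rather than the setup described above: the WarehouseLoadPlan class, the database names, the use of LOAD DATA INFILE to read the CSV file, and the third scratch database used as a holding place during the rename.

```java
import java.util.ArrayList;
import java.util.List;

class WarehouseLoadPlan {
    // Statements for loading one table from a CSV file, in execution order
    static List<String> loadStatements(String table, String csvPath) {
        List<String> sql = new ArrayList<String>();
        sql.add("TRUNCATE TABLE " + table);                  // clear old rows first
        sql.add("ALTER TABLE " + table + " DISABLE KEYS");   // skip index maintenance during the load
        sql.add("LOAD DATA INFILE '" + csvPath + "' INTO TABLE " + table
                + " FIELDS TERMINATED BY ','");              // assumed bulk-load statement
        sql.add("ALTER TABLE " + table + " ENABLE KEYS");    // rebuild indexes once
        return sql;
    }

    // One RENAME TABLE statement that swaps every table between the live and
    // loading databases; MySQL applies all the renames in a single atomic step.
    static String swapStatement(String live, String loading, String scratch, List<String> tables) {
        StringBuilder sql = new StringBuilder("RENAME TABLE ");
        for (int i = 0; i < tables.size(); i++) {
            String t = tables.get(i);
            if (i > 0) sql.append(", ");
            sql.append(live).append('.').append(t).append(" TO ").append(scratch).append('.').append(t);
            sql.append(", ").append(loading).append('.').append(t).append(" TO ").append(live).append('.').append(t);
            sql.append(", ").append(scratch).append('.').append(t).append(" TO ").append(loading).append('.').append(t);
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        for (String statement : loadStatements("orders", "/tmp/orders.csv")) {
            System.out.println(statement + ";");
        }
        List<String> tables = new ArrayList<String>();
        tables.add("orders");
        System.out.println(swapStatement("warehouse_live", "warehouse_loading", "warehouse_tmp", tables) + ";");
    }
}
```

After the swap, the previous live tables sit in the loading database, ready to be truncated and refilled by the next ETL run.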

Quick Tips

  1. The CSV format isn't a standard. Using XML can solve character-escaping issues, but it might not perform as well due to formatting overhead.
  2. A transform step is not always needed. If it isn't, use "SELECT … INTO TABLE" to provide a straight database-to-database extract-and-load (in MySQL, the equivalent form is INSERT … SELECT).
  3. Incremental loading is highly desirable. Triggers can be used to achieve that.
  4. Our operational database uses MySQL's InnoDB backend, which provides referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse for better performance, since the warehouse is read-only and transactional features are not needed.
  5. MySQL does not support bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL has supported bitmap index scans, which build bitmaps in memory at query time, since version 8.1, and a number of commercial database systems support bitmap indexes.