Book Review for “Pentaho Reporting 3.5 for Java Developers”
Nov 18th, 2009 by admin

Recently I was invited to review a book named “Pentaho Reporting 3.5 for Java Developers”. The timing was great because I was planning to use Pentaho for my project and found the documentation on the Pentaho website insufficient. As a programmer, to be honest, I don’t like user guides. I look for a developer guide. I need to know the API, the service interfaces, the features that can be ported to and used in other projects, the design, the code samples, the extension points, etc. Although Pentaho is an open-source project, it approaches its customers as a black-box solution: deploy the war and follow the user guide. What if I want reporting services as part of my own project? How can I use Pentaho as a jar, and what services does it provide? What if I want to plug a caching service into the db layer and replace the straight JDBC calls with Hibernate to do my row-based filtering? In Pentaho, all query jobs are expressed as xactions, and it does provide a way to output the result set as XML. However, that is the only hook I found. What if I want to replace the reporting UI completely? How can I access the report scheduling service, security service, rendering service, etc.? Don’t get me wrong. Pentaho is among the best open-source solutions I can find on the Net. I just want to find out if I can use it like a regular Apache library.

With high expectations, I started reading this book. It is well organized and written in a cookbook format. To my dismay, its focus is on helping users get familiar with the Report Designer UI for creating report definitions. It also shows the code that runs the report definition you created and generates the report in different formats on the web. Not bad! There are a few chapters I want to highlight:

  • Chapter 5: Working with Data Sources. It shows the reporting engine’s data API, which you can use to customize the way the engine accesses data sources.
  • Chapter 12: Using BI Server

Installation and Setup

  1. Download the BI Server 3.5 stable version from http://sourceforge.net/projects/pentaho. Note: Pentaho Reporting 3.5 is included in this release.
  2. Unzip the distribution into any folder you want. Inside the unzipped folder, there are 2 main folders: bi-server-ce and administration-console.
  3. Before you start the server, you need to set a publish password so that you can send reports from the Report Designer application to the server. Go to the bi-server-ce/pentaho-solutions/system folder and edit publisher_config.xml, for example: <publisher-password>password</publisher-password>
  4. Start Pentaho by running ./start-pentaho.sh (Mac) under bi-server-ce.
  5. Access the BI Server via a web browser: http://localhost:8080/pentaho (internally, Pentaho BI Server uses Tomcat as its web server).
  6. To complete the startup, go into the administration-console folder and run ./start-pac.sh. This starts the administration console application, hosted within a Jetty web server. It is accessible at http://localhost:8099 in your browser. The default username and password are admin and password. The administration console is used for managing database connections and users for the Pentaho BI Server.
  7. The installation is easy, isn’t it?
  • After you have the BI Server and the administration console up, you can publish a report from your Report Designer UI to the BI Server.
  • Once the report is published, you can leverage many features provided by the BI Server. For example:
    • Report Scheduling – You can schedule your report. When a report is ready, it appears in your Workspace (View | Workspace).
    • Report Security – You can configure permissions on folders and individual published reports for different users and/or roles using the Share tab.
    • Report Bursting and Emailing – You can wrap your report in an action sequence (.xaction) and enable advanced features like report bursting and emailing. An action sequence is like a simplified workflow that orchestrates various Pentaho components to get a job done. For example, to send the output of a report via email, you need to define the “to”, “from” and message content for the email component. The message comes from the report component, which is fed the output format and the prpt file (Note: use Pentaho Design Studio, an Eclipse plugin, to create your xaction). After you finish your xaction, you can publish it to the BI Server as well. To do this, copy your xaction to the pentaho-solutions folder, then do “Tools | Refresh | Repository Cache” in the UI. Before executing the report on the server, you need to configure the BI Server’s email settings in bi-server-ce/pentaho-solutions/system/smtp-email/email_config.xml and do “Tools | Refresh | System Settings” in the UI. Now it is ready for you to execute from the UI. This is just one example of how to use an action sequence. In fact, it opens up lots of possibilities.
  • Ad Hoc Reporting – One more interesting feature of the BI Server is that it allows you to create reports on the fly based on a predefined Pentaho Metadata model, as well as SQL queries and CSV files. You can use Pentaho Metadata Editor to describe the data source and publish the model to the BI Server. Once it is published, users are able to define their own reports without any knowledge of SQL or table relationships.

Core API described in this book

The code segment below is taken from p. 342 of the book.

// Boot the reporting engine. In a servlet, call this from the init() method.
ClassicEngineBoot.getInstance().start();

// Load the report .prpt file. In a servlet, the path could be obtained from ServletContext instead.
ResourceManager manager = new ResourceManager();
manager.registerDefaults();
Resource res = manager.createDirectly(new URL("file:data/metadata_table.prpt"), MasterReport.class);
MasterReport report = (MasterReport) res.getResource();

// Predefined output. Pick one format per response and set the matching content type.
response.setContentType("application/pdf");
PdfReportUtil.createPDF(report, response.getOutputStream());
// ExcelReportUtil.createXLS(report, response.getOutputStream());
// RTFReportUtil.createRTF(report, response.getOutputStream());

OR

// Generate output from a custom output processor
List<PojoOutputProcessor.PojoObject> objs = PojoUtil.createPojoReport(report);

// Write the PojoObject list to System.out
for (PojoOutputProcessor.PojoObject obj : objs) {
    System.out.println(obj.x + ", " + obj.y + ": " + obj.text);
}

Learning Hive
Nov 18th, 2009 by admin

Starting to learn Hive

As I mentioned in my last article, I am excited about the potential of Hive. Today I decided to start learning it. I found a great introductory video that gives a nice warm-up on using Hive (basic knowledge of how Hadoop and MapReduce work will help you digest the material).

Below are some highlights from this video.

Hive is a SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed at how close its syntax is to the regular SQL of an RDBMS. Below are some of the SQL statements used in this tutorial.

-- ---------- Set up your tables in Hive ----------
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

DESCRIBE shakespeare;

-- ---------- Load data into the Hive table from Hadoop HDFS ----------
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

-- ---------- Query the data using the Hive SQL interface ----------
SELECT * FROM shakespeare LIMIT 10;
SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
SELECT freq, count(1) AS f2 FROM shakespeare GROUP BY freq SORT BY f2 DESC LIMIT 10;

-- show me the plan
EXPLAIN SELECT freq, count(1) AS f2 FROM shakespeare GROUP BY freq SORT BY f2 DESC LIMIT 10;

-- ---------- Create a merged table and populate it by joining 2 different tables ----------
-- (kjv is assumed to be created and loaded the same way as shakespeare;
--  the column names of merged are inferred from the query below)
CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);
INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word);

-- ---------- Query the merged table ----------
SELECT word, shake_f, kjv_f, (shake_f + kjv_f) AS ss FROM merged SORT BY ss LIMIT 20;

To prepare the data for Hive to load, the demo runs another MapReduce job. Remember to delete the job’s log directory before loading the output into the Hive table.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq '\w+'

# remove the MapReduce job log
hadoop fs -rmr shakespeare_freq/_logs

Large-scale data processing systems are often IO bound, so in a MapReduce job your mapper spends much of its time waiting for data to load from disk. Hadoop mitigates the problem by loading in parallel from lots of hard drives. However, a single hard drive still maxes out at about 75 MB/s of read throughput, a physical limit we can do nothing about. To achieve good speed, the key is to minimize the number of Hadoop passes over the data.
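To put rough numbers on this, here is a back-of-the-envelope sketch in Java. Only the 75 MB/s figure comes from the video. The 1 TB data set, the 100-drive cluster and the assumption that the load spreads perfectly evenly are mine, purely for illustration.

```java
class ScanTime {
    // Seconds needed to read dataMegabytes when the load is spread evenly across the drives.
    static double scanSeconds(double dataMegabytes, int drives, double mbPerSecondPerDrive) {
        return dataMegabytes / (drives * mbPerSecondPerDrive);
    }

    public static void main(String[] args) {
        double oneTerabyte = 1000000.0; // in MB; hypothetical data set size
        // One drive at 75 MB/s: roughly 13,333 seconds, i.e. about 3.7 hours per pass.
        System.out.printf("1 drive:    %.0f s%n", scanSeconds(oneTerabyte, 1, 75.0));
        // 100 drives in parallel: roughly 133 seconds per pass.
        System.out.printf("100 drives: %.0f s%n", scanSeconds(oneTerabyte, 100, 75.0));
    }
}
```

Every extra pass over the data pays this cost again, which is why cutting the number of passes matters more than tuning a single pass.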

Since Hive sits on top of Hadoop’s HDFS, it inherits the same restrictions. You cannot UPDATE, DELETE or INSERT individual records as in a regular RDBMS. However, you can bulk load new files (data) into a table, and you can delete a file from Hive.

Hive needs to store the metadata of its tables outside HDFS. You can use a regular RDBMS for this. When you start Hive locally, it looks for a local metastore, so in a distributed environment you may need to centralize the metastore in a remote location. There is a wiki page on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gamma, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase
Hive on Amazon EC2 cloud
Nov 11th, 2009 by admin

[Figure: ad serving system architecture with Hive on EC2]


I once worked for a display ad network company that collected over 400 million impression/click logs per day. To cope with this amount of data, my ex-company bought a supercomputer and crossed its fingers that it could handle the growth in both data volume and analytic demand. That is obviously not a scalable solution. So what is the best solution?

Although I no longer work for this company, it is still an interesting problem to solve. A good friend of mine proposed a shared-nothing solution for it: partition the data across a set of PostgreSQL databases and put Greenplum on top of them to parallelize the queries, so there is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture). I like this approach. The catches are that Greenplum is not free, so the upfront cost may be hard for a startup to bear, and that this setup requires all the databases to run on the same network, which hindered us from moving it to an elastic cloud like Amazon EC2.

Later on, I joined a great company in the same industry that was looking for a cloud solution to host its data warehouse, so I got a chance to revisit this problem. During the research, I came across an interesting technology: the column-based database (e.g. Infobright and LucidDB). A traditional database stores data in rows and fetches whole rows from the data files into memory, which is inefficient if your query only needs a few columns for its computation. A column-based data store instead stores your data by column, which also allows effective compression because all values in a column have the same data type. This solution is great, but it doesn’t do MPP (i.e. massively parallel processing) and it is not ready for the cloud yet.

Here comes another solution: Hive on top of Hadoop on top of the Amazon cloud. It is an interesting idea. Check out this video to learn about it.


If you are not sure what Hadoop is and want a warm-up in massive computing, I suggest you go through the following 5 excellent Google lectures.


Java 5 Features – Enum and Annotation
Aug 18th, 2009 by admin

Intent

I want to summarize some new and interesting Java 5 features in this article and how they change the way I code.

Enum

I used to use int constants to make my life easier because they avoid typos. However, this int enum pattern has several drawbacks:

  1. Java doesn’t provide a namespace for int enum groups. I can either prefix my constants (like ABC_) or use inner interfaces to organize them.
  2. They are compile-time constants that get compiled into the clients that use them, so clients must be recompiled whenever a value changes.
  3. There is no easy way to translate int enum constants into printable strings during debugging.
  4. You cannot easily iterate over all the int enum constants in a group.
  5. You need a way to validate that an int is a valid enum value.

Use the new enum type in Java 5:

public enum Apple {FUJI, PIPPIN, GRANNY_SMITH}

An enum is a full-fledged final class that exports one instance for each enumeration constant via a public static final field.

  1. A namespace is provided via the enum type name.
  2. You can reorder and add enumeration constants without recompiling the clients.
  3. You can translate enums into printable strings via the toString() method.
  4. The enum type provides a values() method to iterate over your enumeration constants (in declaration order).
  5. Compile-time type checking takes care of the validation.
  6. You can associate data with each enum constant.
  7. Enums are immutable by nature (their fields should be final), serializable and comparable.
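As a minimal sketch of points 3, 4 and 6, here is the Apple enum again with some data attached. The prices and the AppleDemo wrapper class are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class AppleDemo {
    // Each constant carries its own data (a hypothetical price per pound).
    enum Apple {
        FUJI(1.99), PIPPIN(2.49), GRANNY_SMITH(1.79);

        private final double pricePerPound; // final, so the constant stays immutable

        Apple(double pricePerPound) {
            this.pricePerPound = pricePerPound;
        }

        double pricePerPound() {
            return pricePerPound;
        }
    }

    // Collects the printable names in declaration order using values() and toString().
    static List<String> names() {
        List<String> result = new ArrayList<String>();
        for (Apple apple : Apple.values()) {
            result.add(apple.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names());
        // valueOf() doubles as validation: an unknown name throws IllegalArgumentException.
        System.out.println(Apple.valueOf("FUJI").pricePerPound());
    }
}
```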

EnumSet

If the elements of an enumerated type are used primarily in sets, it is traditional to use the int enum pattern, assigning a different power of 2 to each constant, like READ = 1 << 2, WRITE = 1 << 1, EXECUTE = 1 << 0 to represent the permissions of each entity in Unix. This representation lets you use the bitwise OR operation to combine several constants into a set, known as a bit field. The bit field representation also lets you perform set operations such as union and intersection efficiently using bitwise arithmetic. But bit fields have all the disadvantages of the int enum pattern mentioned above.

Now, the java.util package provides EnumSet to efficiently represent sets of values drawn from a single enum type. This class implements the Set interface and internally uses a bit vector to represent the set of values. For example, if your enum type has 64 or fewer values, the entire EnumSet can be represented as a single long, so its performance is comparable to that of a bit field.

The EnumSet class provides three benefits a normal set does not:

  1. Various creation methods that simplify the construction of a set based on an enumeration
  2. Guaranteed ordering of the elements in the set, based on the order in which the enumeration constants are declared
  3. Performance and memory benefits that a regular set implementation cannot come close to
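Here is a small sketch of the Unix permission example rewritten with EnumSet. The Permission enum and the helper method names are my own, chosen for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

class PermissionDemo {
    enum Permission { READ, WRITE, EXECUTE }

    // Union of two permission sets, the EnumSet counterpart of bitwise OR.
    static Set<Permission> union(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.addAll(b);
        return result;
    }

    // Intersection of two permission sets, the counterpart of bitwise AND.
    static Set<Permission> intersection(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Permission> owner = EnumSet.of(Permission.EXECUTE, Permission.READ);
        Set<Permission> group = EnumSet.of(Permission.READ, Permission.WRITE);
        // Iteration follows declaration order, not insertion order.
        System.out.println(union(owner, group));
        System.out.println(intersection(owner, group));
        // complementOf() is one of the creation methods a normal set lacks.
        System.out.println(EnumSet.complementOf(EnumSet.of(Permission.READ)));
    }
}
```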

Annotation

Annotations are a new language feature introduced in J2SE 5.0. Simply put, annotations allow developers to mark classes, methods and members with secondary information that is not part of the operating code. You can see annotations as a way to extend the Java language.

Before annotations arrived in Java 5, you might use naming patterns to indicate that some program elements, such as methods, demanded special treatment by a tool or a framework. For example, JUnit required its users to name test methods with a pattern like testXXX(). It works, but with some big disadvantages:

  1. Typos go undetected.
  2. It doesn’t provide a way to associate parameter values with program elements.

Annotations solve these problems. To use them, you can:

  1. Create your own marker annotation (@interface is the keyword) or parameterized annotation. A marker annotation has no parameters associated with it. You can also annotate the annotation itself (i.e. meta-annotations), for example with @Retention and @Target.
  2. Annotate the program elements.
  3. Write a processor to handle your annotated code. Generally, annotations never change the semantics of the annotated code, but enable it for special treatment by tools. The metadata of a Method now carries additional info for your job. You can use Method’s isAnnotationPresent() to check if a method is annotated with a certain annotation type. If your annotation carries a parameter, you can use Method’s getAnnotation() to get the annotation object and call value() on it to obtain the parameter.
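The three steps can be sketched as follows. The @Check annotation, the Sample class and the scan() method are hypothetical names I made up for this example. Only @Retention, @Target, isAnnotationPresent() and getAnnotation() come from the JDK.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class AnnotationDemo {
    // Step 1: a parameterized annotation. The meta-annotations keep it
    // available at runtime and restrict it to methods.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Check {
        String value();
    }

    // Step 2: annotate the program elements.
    static class Sample {
        @Check("fast") public void login() { }
        @Check("slow") public void report() { }
        public void helper() { } // not annotated, so the processor skips it
    }

    // Step 3: a tiny processor that finds annotated methods and reads their parameter.
    static Map<String, String> scan(Class<?> type) {
        Map<String, String> found = new TreeMap<String, String>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isAnnotationPresent(Check.class)) {
                found.put(method.getName(), method.getAnnotation(Check.class).value());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan(Sample.class)); // prints {login=fast, report=slow}
    }
}
```

Without RetentionPolicy.RUNTIME the annotation would not be visible through reflection, and the processor would find nothing.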

Reference

Below are some related articles I find useful:

  1. http://www.javalobby.org/java/forums/t16967.html
  2. Annotation in Tiger – Part 1 Meta-Annotation
  3. Annotation in Tiger – Part 2 Custom Annotation

 

 

Create a Virtual Company
Jul 13th, 2009 by admin

Nowadays there are many tools available on the Net, ranging from IM to cloud computing, that certainly lower the barrier for entrepreneurs like us. Today, I am going to list the tools that help me run my company:

Set up virtual office

  1. Skype – saves you from long-distance bills
    • FREE. For a fee, it also lets you call regular phone lines.
    • When I am tired of typing, I switch to this.
    • If you like face-to-face conversation, I would use iChat on Mac. All you need is an AOL account.
  2. Yugma – Web conferencing
    • FREE up to 20 attendees.
    • It has Skype integration.
  3. Free conferencing
    • FREE
    • Useful when you don’t have internet access or your laptop is not next to you, because it gives you a dedicated line to dial in. However, the number is not TOLL-FREE.
    • Why 712 area code? Check this out
  4. Google tools – all FREE
    • Google Docs (shareable)
    • Google Calendar (it can sync with iCal on my Mac now; if you use Outlook, you need to install a plugin to do the calendar sync). Follow this guide to set it up.
    • Gmail (have Gmail host your company mail – yourname@yourcompany.com)
  5. MediaWiki – good wiki tool for information sharing
  6. Posterous – create your company blog via email :)
  7. iPhone
    • Not FREE
    • I use it to sync email and calendar and to access the Web.
  8. VNC – Remote desktop tool
    • FREE
    • For Mac, download Vine Server (OSXvnc) from here.
    • If you are using DSL that assigns your IP address dynamically, it is quite a headache to keep track of it. You can obtain a domain name from DynDNS to abstract away the IP address.
    • By default, a VNC server listens on port 5900. If you want to do remote desktop from outside your subnet, make sure your DSL router opens that port and forwards the requests to your machine.
    • Here is a web-based VNC viewer. With that, you can do remote desktop from anywhere. Just key in your DynDNS domain and you are done.
    • If you want to secure this access, you can password-protect your box by configuring the VNC server.
    • Here is a great article for that.
    • Some people use the LogMeIn service, which provides more secure remote access features. However, it is NOT FREE. To me, a VNC server is a good enough solution.

Build your virtual dev team

  1. Trac – combines wiki, ticket system and project planning in one
    • FREE
    • A bit complicated to set up at a web hosting company
    • It integrates with Subversion as well
    • Bugzilla is pretty good for issue tracking as well
  2. Dreamhost for web hosting
    • Below $10 per month
    • I currently use it for hosting my own blog and Subversion repository.
    • No Java support yet.
    • For a VPS solution, this one is cheap and my buddy said it is great.
  3. VirtualBox – run several operating systems on your laptop. Very appealing!
    • FREE
    • I set up Ubuntu on my Mac. Full screen, shared folders, shared mouse. I love it.
    • With this, I can ensure all developers are working in the same environment. Furthermore, I can have dev, QA and production use the same environment.
  4. OmniGraffle – graphical design tool on Mac
    • NOT FREE but cheap
    • Free stencils are available here.
  5. Amazon AWS
    • Much cheaper than hiring your own team to keep your system running 24×7
    • Cloud computing allows you to scale on demand.
    • Processing power via EC2
    • Storage via S3
    • CDN via CloudFront
    • Messaging via Amazon SQS

 

Adobe Air with SQLite database
Jul 1st, 2009 by admin

Recently, I have been trying to build an interactive reporting tool that needs to deal with lots of data. The data is not dynamic because it basically comes from historical performance log files. However, the volume is large (over a few million rows) and I still want my clients to interact with this large amount of data with ease. So I am looking into Adobe AIR, as I heard that it comes with the embedded database “SQLite”. I believe it should perform better than a web-based application because the data is local and SQLite is lightweight and fast. Apart from that, SQLite supports parameterized queries, strongly-typed results, asynchronous/synchronous processing, indexes, views, triggers, transactions and most of SQL92. On top of that, it has a small footprint and is cross-platform and open source. The tradeoff with SQLite is its weak concurrency support, because a write locks the whole database. However, that is totally fine for a desktop application because it normally serves only one user. For more info on SQLite, check out my notes below.

Update

SQLite 3 has been released and addresses some of the issues in version 2:

  • BLOB support
  • Fulltext searching
  • Connection shared between threads
  • Improve concurrency

However, it still doesn’t support writeable view, nested transaction and foreign key.

Presentation

Here is a nice presentation from Paul Robertson; look at it first.

Notes from the video:

  1. Warm up with general SQL tips (prefer joins over subqueries, avoid IN, avoid LIKE, specify column names in SELECT and INSERT, avoid unnecessary joins)
  2. An AIR SQL connection can attach up to 10 databases at a time; you can qualify your table names with the database name.
  3. Don’t reuse the same SQLStatement instance for different SQL text; keep one instance per statement so that it stays prepared.
  4. Use a transaction to batch insert/update/delete operations (48x faster!)
  5. Index the columns used in WHERE clauses; columns that are used together should be indexed together.
  6. Create table structure before you add data because internally SQLite…?
  7. Handle large result sets in parts for a perceived performance gain (detailed)

There are several things I want to find out:

  1. Can SQLite handle large datasets?
    • Yes. According to the spec, it can handle terabytes of data.
  2. Does SQLite support pagination?
    • Yes. Look at the SQLStatement object.
  3. How does SQLite synchronize with updated data from a remote database?

Some notes about SQLite

Below are some SQLite tips and practices I obtained from different sources:

  • A big advantage of SQLite over a flat file is the ability to index your data.
  • Using parameterized queries protects against SQL injection and makes the quote (') escaping problems go away. It is also much faster because SQLite can reuse the execution plan of a statement when you use parameters.
  • Make sure to import the records in a transaction so that SQLite doesn’t spend a lot of time updating indexes until everything is imported.
  • The SQLite documentation states that SQLite databases can be terabytes in size, and that the primary limitation of SQLite is concurrency (many users at the same time).
  • “The SQLite database is pretty damn fast. I was getting near instantaneous searching with databases that were ~100,000 records. Somewhere around 800,000 – 1,000,000 records you start losing performance, waiting a few seconds for a search” – by Daniel
  • Each database is contained within a single file.


SOA Approach for Business
Apr 27th, 2009 by admin

What is SOA?

What is ESB?

ESB = Enterprise Service Bus. The definition is flexible, but in general it’s a conduit for messages of multiple, different formats, between application endpoints, over more than one protocol.

Mule vs ServiceMix

  1. Compared to Mule, the major difference for ServiceMix is its architectural design, which is fundamentally based on the Java Business Integration (JBI) standard. Mule provides a JBI binding so that Mule components can interact with JBI containers, including the ServiceMix JBI container. However, the internal Mule APIs are not based on the JBI standard.
  2. JBI uses a notion of Message Exchanges and a Normalized Message to communicate between components, whereas Mule uses a “POJO / Endpoint” architecture.
  3. Mule doesn’t require your service to implement any Mule interface. The components are wired up through the mule-config.xml file.

The video above can give you a good taste of Mule, but the example runs as a standalone app. If you want to use Mule in your web application, I found this video a good starting point.

What is EDA?

Asynchronous messaging lets two or more applications send data to each other without having to wait for receipt confirmation. The infrastructure guarantees message delivery, even if the receiving application isn’t currently running or the network connection is interrupted. It sounds simple enough, but asynchronous messaging demands a new way of thinking about system architecture.

Caveat of asynchronous messaging

  1. The convenience of sending guaranteed messages literally around the world to applications running on different platforms or technologies is a huge benefit. What’s the catch? The messaging system can guarantee that the message will be delivered, but it does not guarantee when it will be delivered.
  2. Worse yet, if an application sends two messages, the messaging system doesn’t guarantee that they’ll arrive in the same order.
  3. Because destinations are unidirectional, the application receives responses on a different destination than the one on which it publishes requests. This means that the responses can arrive in a different order than the requests were sent. As a result, the application must explicitly correlate incoming response messages to the requests. In the asynchronous world, correlation is such a common need that the JMS API includes the methods getJMSCorrelationID and setJMSCorrelationID in the message interface.
  4. How do we put the sender and the receiver in one transaction?
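To make the correlation point (caveat 3) concrete, here is a minimal in-memory sketch in plain Java. It does not use JMS at all: the CorrelationDemo class and its send/receive methods are hypothetical stand-ins, and in a real JMS client the ID would travel on the message via setJMSCorrelationID and getJMSCorrelationID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class CorrelationDemo {
    // Requests that are still waiting for a response, keyed by correlation ID.
    private final Map<String, String> pending = new HashMap<String, String>();
    // Responses that have been matched, keyed by the original request.
    private final Map<String, String> answers = new HashMap<String, String>();

    // Called when a request is published; returns the ID to attach to the outgoing message.
    String send(String request) {
        String correlationId = UUID.randomUUID().toString();
        pending.put(correlationId, request);
        return correlationId;
    }

    // Called when a response arrives, in any order. Returns false for unknown IDs.
    boolean receive(String correlationId, String response) {
        String request = pending.remove(correlationId);
        if (request == null) {
            return false;
        }
        answers.put(request, response);
        return true;
    }

    String answerFor(String request) {
        return answers.get(request);
    }

    public static void main(String[] args) {
        CorrelationDemo client = new CorrelationDemo();
        String first = client.send("price of FUJI");
        String second = client.send("price of PIPPIN");
        // The responses arrive in the opposite order of the requests.
        client.receive(second, "2.49");
        client.receive(first, "1.99");
        System.out.println(client.answerFor("price of FUJI"));
    }
}
```

The application never relies on arrival order: each response finds its request through the ID alone.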

Reference

  1. Enterprise Architecture Patterns – by Martin Fowler
  2. An Asynchronous World
  3. Errant Architecture

Top 10 Flex Programming Tips
Apr 18th, 2009 by admin

Here are some interesting tips I picked up while working on Flex programming. I will cover Embedding, Binding, Event Handling, Function Pointers, Mixins and more. I hope these tips will make your life easier when you work with Flex.



Powerful Extension of Flex DataGrid – Part 1
Apr 17th, 2009 by admin

Features wanted!

To make the Flex DataGrid complete, I would like to have the following features:

  • AutoComplete Search – Locate the data I want quickly if there are too many rows in my grid.
  • Internationalization – Handle currency, number and date formats.
  • Data Export – Output the data in CSV format, so users can import it into Excel.
  • Pagination – If I give the total number of records, the subset of data rows and the number of rows per page, the grid should be able to paginate and fire events when the user clicks on other pages.

In this article I will show you how to make these happen.

Powerful Linux Text Processing Commands
Apr 17th, 2009 by admin

Common Text Processing Commands

In our daily work, we deal with lots of data. The data is normally stored in text format so that humans can read it easily. With the large amount of data we have, we need ways to deal with it. There are several things we frequently do with the data: search, filter, sort and analyze. In Linux, there are some powerful commands I can use for this: cat, grep, find, sort, uniq, etc. I find these commands quite powerful, so I decided to write them down as my reference. In this tutorial I will go over the basic text processing commands and how to use them together to accomplish the tasks we often encounter in our workplace.

