My buddy Steve sent me an article today that contains some important data shared by four startup companies. I like this article a lot because it reflects what I have experienced since I started my online wedding company, JustProposed.com. As an entrepreneur, I found it was far more difficult than I expected. You need to believe in your idea and be persistent in pushing it out. I have spent many sleepless nights and faced quite a few discouraging facts… However, when I saw people actually using it and leaving good comments and feedback, my energy came back.
Startups – It is full of ups and downs
Memory leak is possible in Java
I previously thought that if I picked Java for programming, I would not need to worry about allocating and freeing memory as in C++. Java’s memory management certainly helps to minimize the chance of a memory leak. However, it does not eliminate it. Confusing, right? The key is the definition of a memory leak. If a memory leak means an object allocated on the heap that nothing references any more, then Java handles it well through object reachability (not reference counting). However, the definition can be generalized to include an object in memory that is no longer used but is still referenced, so the garbage collector cannot clean it up. When does this happen in Java? When a long-lifecycle object holds a reference to a short-lifecycle object. Below is the list of possible fixes:
- Explicitly remove the reference to the short life cycle object
- Shorten the life cycle of the anchor
- Weaken the reference
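Here is a minimal sketch of the situation described above and of the first fix (explicitly removing the reference). The class and method names (RequestLog, record, forget) are made up for illustration; a static list plays the long-lifecycle anchor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not library code.
class RequestLog {
    // Long-lived anchor: a static list lives as long as the class is loaded.
    private static final List<byte[]> HISTORY = new ArrayList<>();

    // Each call parks a short-lived buffer in the long-lived list,
    // so the garbage collector cannot reclaim it while it stays there.
    static void record(byte[] requestBody) {
        HISTORY.add(requestBody);
    }

    // Fix: explicitly remove the reference once the object is no longer needed.
    static void forget(byte[] requestBody) {
        HISTORY.remove(requestBody);
    }

    // Number of buffers the anchor is still keeping alive.
    static int retained() {
        return HISTORY.size();
    }

    public static void main(String[] args) {
        byte[] body = new byte[1024];
        record(body);
        System.out.println("retained after record: " + retained());
        forget(body);
        System.out.println("retained after forget: " + retained());
    }
}
```

For the "weaken the reference" fix, the JDK offers java.lang.ref.WeakReference and java.util.WeakHashMap, which let the collector reclaim an object even though the anchor still points at it.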
You can use the signs below to tell whether your system has a memory leak:
- Examine the heap memory status after several GCs. If memory usage keeps trending up even after GCs, you may have a memory leak.
- OutOfMemoryError – an easy indicator
- Performance decline – This may be the overhead of page swapping, which takes place when memory is in demand.
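The first check can be sketched with the standard Runtime API. HeapWatch and usedHeapBytes are hypothetical names, and System.gc() is only a request that the JVM is free to ignore, so treat the numbers as a rough trend rather than a measurement.

```java
// Hypothetical helper; prints used heap after a few GC requests.
class HeapWatch {
    // Bytes of heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.gc(); // only a hint; the JVM may not collect immediately
            System.out.println("used heap after GC request " + i + ": "
                    + usedHeapBytes() / 1024 + " KB");
            Thread.sleep(1000);
        }
    }
}
```

If the printed figure keeps climbing across many samples while the application is doing steady work, that is the upward trend described above.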
AOP – injecting behavior to your object
Introduction of AOP
Aspect-oriented programming (AOP) is a good solution for cross-cutting concerns such as logging, security, auditing, exception handling and transactions. Without AOP, the code for those cross-cutting concerns is spread all over your system. With AOP, you can dynamically inject behavior into your class (weaving) or object (proxy-based). The injection process involves the following pieces:
- Pointcut – a language construct that selects join points
- Join point – a specific point in the program’s execution
- Advice – the piece of an aspect implementation to be executed at a pointcut
- Aspect – combines these primitives
Spring proxy-based AOP vs AspectJ weaving (bytecode enhancement)
- Spring proxy-based AOP
  - Good: easy to use, with no post-compiler needed. It now supports AspectJ pointcut expressions. It is instance-based.
  - Weak: carries the overhead of reflection, does not intercept internal calls, and is limited to method join points.
  - Usage: Service and Data Access layers (not the domain layer)
- AspectJ weaving
  - Good: addresses all the weaknesses of proxy-based AOP
  - Weak: needs a post-compilation step on your code for bytecode enhancement. It is type-based.
  - Usage: all layers, including the domain layer
  - Supports weaving (linking aspects with) binary class files either offline or at runtime as classes are loaded into the virtual machine.
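To see why a proxy-based approach cannot intercept internal calls, here is a small sketch that uses the JDK's own java.lang.reflect.Proxy rather than Spring itself. Greeter, PlainGreeter and ProxyAopSketch are hypothetical names, and the "advice" is just a log entry.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical service interface with one method that calls the other.
interface Greeter {
    void outer();
    void inner();
}

class PlainGreeter implements Greeter {
    public void outer() {
        inner(); // internal call: goes to "this", not through the proxy
    }
    public void inner() { }
}

class ProxyAopSketch {
    // Wraps the target in a JDK dynamic proxy that logs every call it sees.
    static Greeter advise(Greeter target, List<String> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            log.add("before " + method.getName()); // the "advice"
            return method.invoke(target, args);    // reflective call to the target
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] { Greeter.class },
                handler);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Greeter greeter = advise(new PlainGreeter(), log);
        greeter.outer();
        System.out.println(log); // prints [before outer]
    }
}
```

Only the call that enters through the proxy is logged; the nested inner() call never reaches the handler. Bytecode weaving avoids this because the advice is compiled into the class itself.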
Use Annotation vs XML for aspect wiring
- Annotation approach: Spring can understand @AspectJ aspects, so it is possible to share entire aspect definitions between Spring and AspectJ. Enabling this capability is easy: just include the <aop:aspectj-autoproxy> element in your configuration. With autoproxying enabled this way, any bean defined in your application context whose declared type is an @AspectJ aspect will be interpreted as an aspect by Spring AOP, and beans in the context will be advised accordingly. However, your POJO becomes coupled to the AspectJ annotations.
- XML approach: Your POJO is agnostic about aspects, and you do the wiring in the XML. The <aop:config> listing further below does the same job as the annotation-based configuration. (It gives centralized management with an agnostic POJO, but it can complicate already overloaded XML.)
Configuration for the annotation approach:
<bean id="helloService"
  class="org.aspectprogrammer.hello.spring.HelloService"/>
<aop:aspectj-autoproxy/>
<bean id="helloFromAspectJ"
  class="org.aspectprogrammer.hello.aspectj.HelloFromAspectJ"/>
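Let’s look at the POJO classes:
// HelloService.java: a plain POJO with no knowledge of aspects
public class HelloService {
  public void main() {
    System.out.println("Hello World!");
  }
}

// HelloFromAspectJ.java
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;

@Aspect
public class HelloFromAspectJ {
  // Matches the execution of any method named main
  @Pointcut("execution(* main(..))")
  public void mainMethod() {}

  // Runs after a matched method returns normally
  @AfterReturning("mainMethod()")
  public void sayHello() {
    System.out.println("Hello from AspectJ!");
  }
}

// SpringBoot.java: test the AOP behavior injection
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class SpringBoot {
  public static void main(String[] args) {
    ApplicationContext context = new ClassPathXmlApplicationContext(
        "org/aspectprogrammer/hello/spring/application-context.xml");
    HelloService service = (HelloService) context.getBean("helloService");
    service.main();
  }
}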
Configuration for the XML approach:
  <bean id="helloService"
     class="org.aspectprogrammer.hello.spring.HelloService"/>
  <aop:config>
    <aop:aspect ref="helloFromSpringAOP">
            <aop:pointcut id="mainMethod"
                         expression="execution(* main(..))"/>
            <aop:after-returning pointcut-ref="mainMethod"
                         method="sayHello"/>
    </aop:aspect>
  </aop:config>
  <bean id="helloFromSpringAOP"
        class="org.aspectprogrammer.hello.spring.HelloAspect"/>
Linux Commands
Grep
Pattern matching against a set of files
- grep [options] "pattern" [filename or wildcard]
- options:
  - -n: show line numbers
  - -v: invert the match (print the lines that do not match)
  - -l: list only the names of the files that match
  - -i: case-insensitive
  - -x: match whole lines only
  - -r: recurse through the directory
- In its default (basic) mode, grep does not treat +, ? and | as operators. Use egrep (or grep -E) instead.
- Regular expressions for the pattern
  - /^Mary/ : Mary at the beginning of the line
  - /Mary$/ : Mary at the end of the line
  - /[a-z]a.e/ : a single character in the set a-z, followed by a, then any single character, then e, e.g. have
  - /[^a-z]a/ : any character not in the set a-z, followed by a, e.g. the Ma in Mary
  - /abc|abb/ : alternation of patterns, like OR
  - /A+B*C?D/ : + means >=1, * means >=0, ? means 0 or 1
  - /a{5} b{,6} c{4,8}/ : 5 of a, 0-6 of b, 4-8 of c
- Use with other utilities:
  - tail -n8 a_file | grep "boo" (grep now searches only a portion of a file)
  - find | grep "hello" (searches the file names that find prints)
  - find . -exec grep "boo" {} \; (searches every file under the current directory, recursively, for the pattern "boo". {} is replaced by each filename that find produces, and \; ends the command)
  - grep "\([a-z]\)\1" a_file (back-reference: \1 repeats whatever the group matched, so this finds a doubled lowercase letter such as the oo in boo)
Find
The find program is a search utility, mostly found on Unix-like platforms. It searches through one or more directory tree(s) of a filesystem, locating files based on some user-specified criteria. Further, find allows the user to specify an action to be taken on each matched file. Thus, it is an extremely powerful program for applying actions to many files. It also supports regexp matching.
- find /var/ftp/mp3 -name "*.mp3" -type f -exec chmod 744 {} \;
The above command has several things to learn:
- It searches the directory "/var/ftp/mp3" and its subdirectories for filenames matching the pattern "*.mp3".
- -type f restricts the match to regular files, excluding directories, special files, pipes and symbolic links.
- "-exec chmod 744 {} \;" – each matched filename is fed into an action for execution. This action can be any command, such as grep or chmod. The example above changes their permissions to 744.
find ~jsmith -exec grep "LOG" '{}' /dev/null \; -print
/home/jsmith/scripts/errpt.sh:cp $LOG $FIXEDLOGNAME
/home/jsmith/scripts/errpt.sh:cat $LOG
/home/jsmith/scripts/title:USER=$LOGNAME
The commands below achieve much the same thing (note that grep takes the pattern first and the directory second):
find /tmp -exec grep -H "search string" '{}' \; -print
grep -R "search string" /tmp
A little shell programming - bash
Now we know how the commands "find" and "grep" work. It is time to unleash their power in a shell script.
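#!/bin/bash
# px: list processes, optionally filtered by a fixed string
if [ $# = 1 ]
then
{
   # quote the argument so a pattern with spaces stays one word
   ps -ax | fgrep -i "$1";
}
else
{
   ps -ax;
}
fi;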
The program above, named "px", checks whether exactly one parameter is passed in via "$# = 1". If so, the ps command generates a detailed listing of all processes running on the system, and the output is piped into fgrep, with the first parameter as the pattern, to strip out irrelevant information. Otherwise it prints the full listing.
search ()
{
   if [ $# = 1 ]
   then
   {
       for i in `find . -path './dev' -prune -o -print 2> /dev/null`
       do
       {
           fgrep -i "$1" "$i" > /dev/null 2>&1;
           if [ $? = 0 ]
           then
           {
               echo "$i"
           } fi;
       } done;
   }
   else
   {
       echo "Wrong number of arguments!"
   } fi;
}
This is a shell function instead of a script. Functions are basically shell scripts loaded into the shell’s memory (usually via your .profile file when you log in), and are therefore available as commands without having to put the executable file on your $PATH. More importantly, they execute in your current shell, without starting a child or subshell process.
- find . -path './dev' -prune -o -print 2> /dev/null
  - This command generates a list of all files in the current directory and its subdirectories, except for the dev directory directly under it (-path './dev' matches only that path). We generally want to avoid looking in a dev directory because it is traditionally where device files and other special files are kept, so we "prune" that subtree from the search. Performing a search on special files can produce some interesting results, but it’s definitely not recommended. We print out any other filenames we come across, and we redirect any errors to /dev/null because we don’t really want them and they would only confuse matters.
  - We execute the find statement by placing it in backquotes, which have a special meaning in bash. In effect, the expression in the backquotes evaluates to the output of the command when it’s executed, so if the find command printed the names of three files, file1, file2 and file3, the for loop would effectively be: for i in file1 file2 file3
- Inside the loop we call: fgrep -i $1 $i > /dev/null 2>&1;
  - We call fgrep for a case-insensitive search (the "i" option) using the pattern passed on the command line ($1) on the currently active filename ($i), and we ignore everything it prints out.
  - The normal action for fgrep is to print out the line of text that matches the search pattern, but we’re not interested in that here. All we want is the name of the file that contains the search pattern, which the exit status check gives us.
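For comparison, the same idea can be sketched in Java with java.nio.file. This is only an analogue of the shell function, not a replacement for it: FileSearch, search and underTopLevelDev are hypothetical names, files are read as ISO-8859-1 text so decoding never fails, and unreadable files are skipped silently in the spirit of 2> /dev/null.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogue of the search() shell function.
class FileSearch {
    // Returns regular files under root whose text contains the pattern,
    // ignoring case and skipping everything below root/dev.
    static List<Path> search(Path root, String pattern) {
        String needle = pattern.toLowerCase(Locale.ROOT);
        List<Path> candidates;
        try (Stream<Path> walk = Files.walk(root)) {
            candidates = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> !underTopLevelDev(root, p))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Path> hits = new ArrayList<>();
        for (Path p : candidates) {
            try {
                String text = new String(Files.readAllBytes(p), StandardCharsets.ISO_8859_1);
                if (text.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(p); // like "echo $i" in the shell version
                }
            } catch (IOException unreadable) {
                // ignore unreadable files, like redirecting errors to /dev/null
            }
        }
        return hits;
    }

    // True when the first path element below root is named "dev".
    static boolean underTopLevelDev(Path root, Path file) {
        Path relative = root.relativize(file);
        return relative.getNameCount() > 0
                && relative.getName(0).toString().equals("dev");
    }
}
```

Unlike find with -prune, Files.walk still descends into the dev directory and the filter merely discards what it finds there, so the shell version remains the cheaper choice on a real /dev tree.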
A bit of Linux I/O redirection
- 1 = standard output
- 2 = standard error
- > overwrites the target
- >> appends to the target
- 2>&1 : "redirect standard error (2) to the same place as standard output (1)"
- cmd 2>&1 1>outfile.txt (2 is pointed at wherever 1 currently points, which is stdout, then 1 is pointed at outfile.txt). So error messages go to stdout and standard output goes to outfile.txt
- cmd 1>outfile.txt 2>&1 (1 is pointed at outfile.txt, then 2 is pointed at wherever 1 currently points, which is outfile.txt as well. Now both standard output and standard error go to outfile.txt, combined!)
- /dev/null: discard area
Transaction and Concurrency
Resources such as databases and caches need to be managed for concurrent access to avoid damaging data integrity, and this is often achieved with a locking mechanism (read and write locks). This mechanism is very important for isolating a transaction from other transactions. There are 5 different isolation levels that can be achieved via read and write locks. The higher the isolation level, the lower the degree of concurrency, so our goal is to isolate the transaction while minimizing the impact on concurrency. In this article, I will go through how to use read/write locks to achieve the various isolation levels and how to improve concurrency while achieving the highest degree of transaction isolation.
5 isolation levels
- NONE - There is no locking at this level. Your code needs to take care of it.
- READ-UNCOMMITTED - Writes take an exclusive lock while reads do not need to acquire a lock. This means that when Tx1 gets a write lock and changes the data without committing, Tx2 can read the uncommitted data (dirty read).
- READ-COMMITTED - Uses a read-write lock: reads succeed in acquiring the lock when there are only reads, writes have to wait until no readers hold the lock, and readers are blocked until no writers hold the lock. However, reads typically release the read lock when done, so a subsequent read of the same data in the same transaction has to re-acquire a read lock. This leads to non-repeatable reads, where 2 reads of the same data might return different values because a write can happen between them.
- REPEATABLE-READ - Data can be read while there is no write and vice versa. A write lock is not granted until the Tx holding the read lock has committed. This level prevents "non-repeatable reads" but does not completely prevent the so-called "phantom read", where new data can be inserted into the tree from another transaction.
- SERIALIZABLE - Data access is synchronized with exclusive locks. Only 1 writer or reader can have the lock at any given time. Locks are released at the end of the transaction. Regarded as very poor for performance and thread/transaction concurrency.
Optimistic Locking
The lock mechanism above uses pessimistic locking. To achieve better concurrency, you may want to use optimistic locking. When optimistic locking is turned on, the isolation level is ignored because a different mechanism protects data integrity. With optimistic locking, the data is versioned and all calls in the transaction work on a copy of the data rather than the actual data. When the transaction commits, its workspace is merged back and the version is checked. If there is a version mismatch, the transaction throws a RollbackException when committing and the commit fails. Optimistic locking uses the same locks we speak of above, but the locks are only held for a very short duration: at the start of a transaction to build a workspace, and when the transaction commits and has to merge data back into the tree. So while optimistic locking may occasionally fail if version validation fails, or may run slightly slower than pessimistic locking due to the inevitable overhead of maintaining workspaces, versioning data and validating on commit, it does buy you a near-SERIALIZABLE degree of data integrity while maintaining a very high level of concurrency.
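The version check at commit time can be sketched with a tiny in-memory cell. This is an illustration under my own assumptions, not how any particular cache implements it: VersionedCell, Workspace, begin and commit are hypothetical names, and a failed commit returns false instead of throwing a RollbackException.

```java
// Hypothetical in-memory illustration of version-checked commits.
class VersionedCell {
    // A transaction's private copy of the data, tagged with the version it was read at.
    static final class Workspace {
        final long version;
        String value;
        Workspace(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private long version = 0;
    private String value;

    VersionedCell(String initial) {
        this.value = initial;
    }

    // Start of a transaction: the lock is held only long enough to copy the data.
    synchronized Workspace begin() {
        return new Workspace(version, value);
    }

    // Commit: the lock is held only for the version check and the merge.
    synchronized boolean commit(Workspace ws) {
        if (ws.version != version) {
            return false; // someone else committed first: version mismatch
        }
        value = ws.value;
        version++;
        return true;
    }

    synchronized String value() {
        return value;
    }
}
```

Two transactions that both start from version 0 can work on their copies without blocking each other; whichever commits second sees the version has moved on and fails, which is the occasional failure mentioned above.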
Hibernate – Caching
Hibernate uses three different caches for objects to reduce SQL calls:
- First-level cache – the Session, on a per-transaction basis. That is to say, if an object is modified several times within the same transaction, Hibernate will generate only one SQL UPDATE at the end of the tx.
- Second-level cache – Session Factory (cache available between transactions – cluster or JVM level).
- Query-level cache - Cache actual result rather than just persistent objects.
Caching Concurrency Strategy
There are four caching concurrency strategies available:
- Read-only: This strategy is useful for data that is read frequently but never updated. This is by far the simplest and best-performing cache strategy.
- Read/write: Read/write caches may be appropriate if your data needs to be updated. They carry more overhead than read-only caches. In non-JTA environments, each transaction should be completed when Session.close() or Session.disconnect() is called.
- Nonstrict read/write: This strategy does not guarantee that two transactions won’t simultaneously modify the same data. Therefore, it may be most appropriate for data that is read often but only occasionally modified.
- Transactional: This is a fully transactional cache that may be used only in a JTA environment.
Caching Cluster Support
Apart from the caching strategy, we can also evaluate a cache implementation in terms of cluster support. Only SwarmCache and JBoss TreeCache have cluster support. They achieve this via IP multicasting. To ensure cluster safety, SwarmCache uses clustered invalidation whereas JBoss TreeCache uses replication.
Example of usage
- Country: Airport (1:M) : a Country has a set of Airports.
- Employee: Language (M:M) : an Employee can speak a set of Languages and a Language can be spoken by many different Employees.
- Country: Employee (1:M) : a Country has many Employees.
Relationship Modeling
In the relational model, a 1:M relationship is modeled by giving the "M" side an FK to the "1" side, and the "1" side does not even know the "M" side exists. Only when you delete a row on the "1" side with "delete-cascade" turned on will the database take care of the rows on the "M" side that reference it. In the object model, by contrast, the "1" side references the "M" side via a collection class like Set, and the "M" side may or may not hold a reference back to the "1" side (depending on whether you want bidirectional navigation).
An M:M relationship, on the other hand, is modeled with a join table in the relational model. Neither M side has knowledge of the relationship, because that knowledge is kept solely in the join table. In the object model, however, the knowledge can be kept on one "M" side or on both.
The main cause of the difference between the relational and object models is that a table structure has no way to reference a non-fixed number of rows on the "M" side. Even if you tried to use more than one column to simulate it, you would not know how many columns to use. In the object world, object references do not have this limitation.
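On the object side, the relationships above might look like the plain classes below. The class names come from the example; the field and method names (airports, languages, addAirport) are my own assumptions, and no Hibernate annotations or mappings are involved.

```java
import java.util.HashSet;
import java.util.Set;

// 1:M, held on the "1" side as a collection.
class Country {
    final String name;
    final Set<Airport> airports = new HashSet<>();

    Country(String name) {
        this.name = name;
    }

    // Keeps both ends of the bidirectional link in sync.
    void addAirport(Airport airport) {
        airports.add(airport);
        airport.country = this;
    }
}

class Airport {
    final String code;
    Country country; // optional back-reference for bidirectional navigation

    Airport(String code) {
        this.code = code;
    }
}

class Language {
    final String name;

    Language(String name) {
        this.name = name;
    }
}

// M:M, with the knowledge kept on the Employee side only.
class Employee {
    final String surname;
    final Set<Language> languages = new HashSet<>();

    Employee(String surname) {
        this.surname = surname;
    }
}
```

In the database the Employee to Language link would live in a join table, while here it is simply a Set on one side.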
<hibernate-mapping package="com.wakaleo.articles.caching.businessobjects">
   <class name="Employee" table="EMPLOYEE" dynamic-update="true">
      <meta attribute="implement-equals">true</meta>
      <cache usage="read-write"/>

      <id name="id" type="long" unsaved-value="null">
         <column name="emp_id" not-null="true"/>
         <generator class="increment"/>
      </id>
      <property column="emp_surname" name="surname" type="string"/>
      <property column="emp_firstname" name="firstname" type="string"/>

      <many-to-one name="country"
            column="cn_id"
            class="com.wakaleo.articles.caching.businessobjects.Country"
            not-null="true"/>

      <!-- Lazy loading is deactivated to demonstrate caching behavior -->
      <set name="languages" table="EMPLOYEE_SPEAKS_LANGUAGE" lazy="false">
         <cache usage="read-write"/>
         <key column="emp_id"/>
         <many-to-many column="lan_id" class="Language"/>
      </set>
   </class>
</hibernate-mapping>
- hibernate.cfg.xml – Hibernate configuration like cache-level, show sql and mapping files.
<hibernate-configuration>
…
  <property name="hibernate.cache.use_query_cache">true</property>
  <property name="hibernate.cache.use_second_level_cache">
        false
  </property>
  <property name="hibernate.cache.provider_class">
       org.hibernate.cache.EhCacheProvider
  </property>
</hibernate-configuration>
- DAO for searching
How the cache works
Hibernate does not cache object references, only the primitive properties of an object. Because of that, associations are not cached by default, but you can specify in the mapping file which associations you want cached.
Data warehouse 101
To build a data warehouse, you will use the techniques of dimensional modeling. Here are the guidelines you can follow:
- Divide the world into measurements and context.
- Numeric measurements are placed in the Fact table, whereas context is broken down into Dimensions. A fact table in a pure star schema consists of multiple foreign keys, each paired with a primary key in a dimension, together with the facts containing the measurements.
- Build the FK-PK pairs as surrogate keys that are just sequentially assigned integers.
- Use a special record in each Dimension to represent "unknown" or "none", because we want to avoid putting null in an FK.
- Resist snowflaking the dimension tables; leave them in flat second normal form, because flat tables are much more efficient to query. Snowflaking a dimension into third normal form, while not incorrect, destroys the ability to use bitmap indexes and increases the user-perceived complexity of the design.
- Semi-additive fact – Most fact tables are huge, with millions or even billions of rows, so you almost never fetch a single record into your answer set. Rather, you fetch a very large number of records, which you compress into digestible form by adding, counting, averaging, or taking the min or max. Bank balances and inventory levels represent intensities that are awkward to express in an additive format: summing a balance over 1 month is not really meaningful. Normally, we still treat these semi-additive facts as if they were additive, but just before presenting the results to the end user, we divide the answer by the number of time periods to get the right result. This technique is called averaging over time.
- Slowly changing dimension -
- Hierarchical Dimension – There are 2 types of hierarchies. One is the "Parent-Child" relationship and the other is the "Array of Levels". An array of levels looks like Country -> State -> City -> Store. Parent-child suits things like product categories, which can be nested in different ways.
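The averaging-over-time technique from the semi-additive bullet comes down to one division. A minimal sketch, with a hypothetical AverageOverTime class and made-up daily balances:

```java
// Hypothetical helper illustrating averaging over time for a semi-additive fact.
class AverageOverTime {
    // Sums the balance across time periods as if it were additive,
    // then divides by the number of periods to get an average level.
    static double averageBalance(double[] balancesPerPeriod) {
        if (balancesPerPeriod.length == 0) {
            throw new IllegalArgumentException("need at least one period");
        }
        double sum = 0;
        for (double balance : balancesPerPeriod) {
            sum += balance;
        }
        return sum / balancesPerPeriod.length;
    }

    public static void main(String[] args) {
        double[] dailyBalances = {100.0, 100.0, 400.0}; // made-up sample data
        System.out.println(averageBalance(dailyBalances)); // prints 200.0
    }
}
```

The raw sum of 600 says nothing useful about the account, while the average of 200 is the figure an end user can actually read as a balance.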
Flex – Cairngorm Microarchitecture
What is Cairngorm?
Adobe Consulting Group has defined an architectural framework for Flex applications, named "Cairngorm". The framework borrows quite a few design patterns from GoF and J2EE. Its goal is to help us lay the groundwork for complicated RIAs. It is interesting to see how those patterns work together in a seamless way. The framework not only cleans up our code, but also decouples our components. The result is very neat and elegant!
Let’s go through some interesting points of this microarchitecture. I assume you have already read the excellent 6-part article series from Steven Webster.
Encapsulate your Flex component and give it no dependency on the Cairngorm framework, to promote high reusability
- For better encapsulation, the input (VO) and output (event) are non-Cairngorm objects.
- Internally, the component binds its controls to a VO, and externally a VO from the ModelLocator is linked to it. So whenever a Command changes the VO in the ModelLocator, the component’s controls that are linked to it are updated.
- A user gesture is first captured by the component via an event listener, then re-dispatched as a custom event and bubbled up. The View that uses the component captures the custom event and re-dispatches it as a type of CairngormEvent loaded with the VO. In Cairngorm 2.2, you can simply call the "dispatch()" method on the CairngormEvent. The FrontController then takes care of the rest. The dispatched CairngormEvent should be non-bubbling and cancelable.
The View is the place to lay out the controls and interact with Cairngorm objects (normally it is non-reusable)
- The View can set up a control with a VO from the ModelLocator, capture the custom event from the control and re-dispatch it as a type of CairngormEvent.
- The View contains state definitions that are explicitly bound to the ModelLocator, which holds the state the View should be in. The code below indicates that when search_state changes, searchState() is invoked and alters the state of the View.
<mx:Binding source="{ModelLocator.getInstance().search_state}" destination="searchState" />
Command should be the one that changes the model
- The FrontController is a registry that associates Events with Commands. This mapping can be M:1.
- Consider the ModelLocator a mediator through which Views interact with one another by changing the model.
- The Delegate may intercept the response by registering its own result and fault handlers. The goal is to hide the details of the remote call. Therefore, it is good practice to convert XML to a VO (via a Factory) before invoking the Command’s result and fault handlers.
var token:AsyncToken = service.send(params);
var responder:mx.rpc.Responder = new mx.rpc.Responder(searchBooks_onResult, searchBooks_onFault);
token.addResponder(responder);
How to unit test a Cairngorm-enabled project
- VO, Service, FrontController, Event and ModelLocator are simple classes that do not need tests.
- A Command can be tested with a mock Delegate.
- The Model can be tested if it contains logic.
- A Factory can be tested if it contains parsing logic.
- A Delegate can be tested with a mock Service (though writing a mock service is a bit tricky).
- A control can be tested by adding a listener for the custom event it dispatches.
References
- http://www.adobe.com/devnet/flex/articles/cairngorm_pt6_print.html (very good 6 parts articles)
- http://jharbs.com/blog/?p=96
- http://blog.thinkingdigital.org/?p=3 (Eclipse plugin for Cairngorm)
- http://www.ericfeminella.com/blog/cairngen/ (code generation for cairngorm)
- http://www.zeuslabs.us/ (Yahoo engineer for Cairngorm)
- Amazon example from Jesse Warden
- http://www.ariaware.com/products/arp/manual.php (ARP – another framework for RIA)
Update
Recently, I have heard that PureMVC provides a cleaner framework than Cairngorm, as Cairngorm uses a lot of singleton framework classes. I haven’t had a chance to look into it. I will be sure to keep you updated if I find it interesting.
Pick the right database for data warehouse
For those who don’t want to go down the licensing path, open source is definitely a better solution. But can an open source DBMS be used to build your data warehouse? I am not the best person to answer this question, but I have seen more and more small and medium-sized companies launch business intelligence platforms powered by open source DBMSs like PostgreSQL and MySQL. Before MySQL v5 was released, I did not recommend MySQL for data warehousing because it lacked key features that others provide, such as triggers, stored procedures and partitioning. Now I suggest revisiting it, especially as I have heard MySQL has become a gold partner of the great open source business intelligence platform "Pentaho". Below is a rough comparison chart of DBMSs that I got from devx.com. Take a look at it first.
There are debates about whether to choose PostgreSQL or MySQL. One case study shows PostgreSQL is better in an OLTP system. For select queries, however, another study shows MySQL v5 is 2X faster than PostgreSQL v8. For a data warehouse application, MySQL sounds like the better option, as the workload is mostly read-only.
Here is the summary I obtained from an article that compares PostgreSQL with MySQL:
- MySQL uses traditional row-level locking. PostgreSQL uses something called Multi Version Concurrency Control (MVCC) by default. MVCC is a little different from row-level locking in that transactions on the database are performed on a snapshot of the data and then serialized.
- MySQL supports the advanced feature of data partitioning within a database whereas PostgreSQL does not.
- PostgreSQL has many of the database features that Oracle, DB2, or MS-SQL has, including triggers, views, inheritance, sequences, stored procedures, cursors, and user-defined data types. MySQL’s development version, version 5.0, supports views, stored procedures, and cursors. MySQL’s future version, version 5.1, will support triggers.
- PostgreSQL supports user-defined data types, while MySQL does not.
- Both MySQL and PostgreSQL have support for single-master, multi-slave replication scenarios. PostgreSQL offers additional support for multi-master, multi-slave replication from a third-party vendor, as well as additional replication methods.
- MySQL uses a threaded model for server processes, wherein all of the users connect to a single database daemon for access. PostgreSQL uses a non-threaded model where every new connection to the database gets a new database process.
- MySQL does not support bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.
Pentaho – Quick Start
The goal of this post is to walk you through an awesome business intelligence framework named "Pentaho". I believe in the philosophy of "Learn by Practice", so I will show you the steps to get Pentaho up and running for a fictitious company. Through this exercise, you should come to understand how Pentaho works and what features it provides. Let’s start.
Installation of Pentaho
- Download the Pentaho Demo (PCI).
- Read the Pentaho Quick Start and the Creating Pentaho Solutions PDF documents. You can get those documents from the download center as well.
- Unzipping the downloaded file produces a pentaho-demo directory. This is the server root, commonly referred to as the PCI root or PCI install directory. To start the server, Windows users run start-pentaho.bat; *nix users run start-pentaho.sh.
- Open an internet browser and navigate to http://localhost:8080/. This may take a little while, as the server needs to warm up.
- Now you should see the Pentaho web front. Try these samples out to make sure the setup is correct:
  - Getting Started::Hello World
  - Reporting
  - Chart Examples. Shows some of the included charting capabilities
  - Analysis / OLAP Examples. Demonstrates slice and dice
  - Dashboards. There is a demo "Flash Dashboard" that uses an XML-driven free Flash chart. It looks pretty good! However, I suggest using Flex Charting, although it is not free
- When you’re done with Pentaho, locate the stop-pentaho script in the PCI installation directory. Execute the script to stop the server.
- For more info on how to set up PCI as a server and how to configure the email service, take a look at Roland’s blog.
Create a new sample
- Start pentaho demo as stated above.
- Create a folder for the "MySQL" sample at %PCI%/pentaho-solutions/samples/mysql.
- To make the mysql folder display on the entry page, we need to put index.xml in the mysql folder. You may notice that we are using variables for the name and description. The values of the variables are defined in index.properties under the same folder. The reason is to support internationalization: you can, for example, define index_cn.properties for Chinese wording. Note: click "Update Solution Repository" under the Admin tab to refresh the change.
- Download the MySQL sample database, sakila.
- Download the MySQL v5 database and its JDBC driver.
- Run sakila_schema.sql and then sakila_data.sql against the MySQL database. Now you have your sample movie database ready.
- Create a file named mysql-ds.xml in $DEMO_BASE/jboss/server/default/deploy/
- Edit the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/web.xml. Add the following right below the solution5 resource-ref entry:
<resource-ref>
<description>sakila</description>
<res-ref-name>jdbc/sakila</res-ref-name>
<res-type>javax.sql.DataSource</res-type>
<res-auth>Container</res-auth>
</resource-ref>
- Edit the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/jboss-web.xml. Add the following right below the solution5 entry:
<resource-ref>
<res-ref-name>jdbc/sakila</res-ref-name>
<res-type>javax.sql.DataSource</res-type>
<jndi-name>java:/sakila</jndi-name>
</resource-ref>
- Copy your MySQL JDBC driver library into the following directory: $DEMO_BASE/jboss/server/default/lib
- Create your own myFirst.xaction file.
- Create your own myFirst.properties with title and description.
- Restart pentaho
- Open up Firefox (cough IE) and hit the following URL: http://localhost:8080/pentaho/ViewAction?&solution=samples&path=mysql&action=myFirst.xaction
- That’s it!
index.xml:
<index>
<name>%directory_name</name>
<description>%directory_description</description>
<icon>folder.png|dashboard.jpg</icon>
<visible>true</visible>
<display-type>list</display-type>
</index>
mysql-ds.xml:
<?xml version="1.0" encoding="UTF-8"?>
<datasources>
<local-tx-datasource>
<jndi-name>sakila</jndi-name>
<connection-url>jdbc:mysql://localhost/sakila</connection-url>
<driver-class>com.mysql.jdbc.Driver</driver-class>
<user-name>root</user-name>
<password>honr</password>
</local-tx-datasource>
</datasources>
myFirst.xaction:
<?xml version="1.0" encoding="UTF-8"?>
<action-sequence>
<name>myFirst.xaction</name>
<title>%title</title>
<version>1</version>
<logging-level>debug</logging-level>
<documentation>
<author>Raymond Hon</author>
<description>%description</description>
<help/>
<result-type>rule</result-type>
<icon>SQL_Datasource.png</icon>
</documentation>
<inputs/>
<outputs>
<rule-result>
<type>list</type>
</rule-result>
</outputs>
<resources/>
<actions>
<action-definition>
<action-outputs>
<rule-result type="list" />
</action-outputs>
<component-name>SQLLookupRule</component-name>
<action-type>rule</action-type>
<component-definition>
<jndi>sakila</jndi>
<query>
<![CDATA[select * from actor where actor_id = 1]]>
</query>
</component-definition>
</action-definition>
</actions>
</action-sequence>