
Session Management – Part 1

Session management is one of the key topics that every serious web developer and architect needs to master. This article goes through several key topics:

  • Persistent vs non-persistent web connections – web performance!
  • Concerns about using cookies – security and size limitations
  • Server-side session management challenges in scalable web applications
  • Achieving linear scalability through stateless servers - start moving the session to the client

Today, I will walk through all these topics at a high level. A series of articles will follow to develop each topic further if necessary. Let's start!

Persistent vs non-persistent web connections

  1. HTTP is a stateless protocol. Before HTTP 1.1, it also did not keep connections open by default: each request made by a Web browser, for an image, an HTML page, or other Web object, was made via a new connection.
  2. HTTP 1.1 introduced persistent connections as the default behavior (i.e. Keep-Alive), so a Web browser can establish a single connection through which multiple requests can be made.
  3. A persistent connection only saves the cost of reconnecting; each request is still independent. So how can state be maintained across stateless HTTP requests?
    • Normally, we keep the session on the server side and give the client a session id that it sends back to link subsequent requests to the same session.
    • Normally, the client (most often a web browser) stores the session id in a cookie.
    • However, if cookies are disabled, the session id is normally embedded in the URL (i.e. URL rewriting).
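The session id mechanism above can be sketched in a few lines of plain Java. This is only a minimal illustration, not how a servlet container implements it: SessionRegistry and its methods are hypothetical names, and the only real convention used is the ;jsessionid= path parameter that Java containers append when rewriting URLs.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory session registry; a sketch, not a container implementation. */
final class SessionRegistry {
    private static final SecureRandom RANDOM = new SecureRandom();
    private final Map<String, Map<String, Object>> sessions = new ConcurrentHashMap<>();

    /** Creates an empty session and returns the id the client must send back. */
    String create() {
        byte[] raw = new byte[16];
        RANDOM.nextBytes(raw);
        StringBuilder id = new StringBuilder();
        for (byte b : raw) {
            id.append(String.format("%02x", b)); // two hex digits per byte
        }
        sessions.put(id.toString(), new ConcurrentHashMap<>());
        return id.toString();
    }

    /** Returns the server-side state for an id, or null if the id is unknown. */
    Map<String, Object> lookup(String sessionId) {
        return sessionId == null ? null : sessions.get(sessionId);
    }

    /** URL rewriting fallback: put the id in a path parameter, before any query string. */
    static String rewriteUrl(String url, String sessionId) {
        int query = url.indexOf('?');
        String suffix = ";jsessionid=" + sessionId;
        return query < 0 ? url + suffix : url.substring(0, query) + suffix + url.substring(query);
    }

    public static void main(String[] args) {
        SessionRegistry registry = new SessionRegistry();
        String id = registry.create();                        // first request: server issues an id
        registry.lookup(id).put("user", "alice");             // the state itself stays on the server
        System.out.println(registry.lookup(id).get("user"));  // later request carrying the same id
        System.out.println(rewriteUrl("/cart?item=1", id));   // fallback when cookies are disabled
    }
}
```

The client only ever holds the opaque id; everything else lives in the server-side map.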

Concerns of using cookie

What do we need to pay attention to when we store info in a cookie?

  1. Size limitations and security concerns.
  2. How long can a cookie last? By default, it expires when the browser exits. In Java, you can call cookie.setMaxAge(int) with a lifetime in seconds (not a date) if you want the info to outlive the browser session. A negative value keeps the default behavior, and setMaxAge(0) tells the browser to delete the cookie.
  3. Normally, we don't keep all state info in the cookie, because the information could be sensitive and we cannot protect it while it sits on the client's filesystem. Cookies are also limited in size. For these two reasons, we normally store just the session id in the cookie and keep the session on the server side. This approach saves bandwidth as well.
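Here is a small sketch of the three cookie lifetimes described in item 2. To keep it runnable without a servlet container, it uses java.net.HttpCookie from the JDK as a stand-in; javax.servlet.http.Cookie.setMaxAge(int) follows the same rules (negative = session cookie, zero = delete, positive = seconds to live). The class and method names are made up for this example.

```java
import java.net.HttpCookie;

/** Illustrates cookie lifetimes; java.net.HttpCookie stands in for the servlet Cookie class. */
final class CookieLifetimeDemo {

    /** Session cookie: max-age is left unset (-1), so the browser drops it on exit. */
    static HttpCookie sessionCookie(String sessionId) {
        HttpCookie cookie = new HttpCookie("JSESSIONID", sessionId);
        cookie.setHttpOnly(true); // keep the id away from page scripts
        return cookie;
    }

    /** Persistent cookie: max-age is a lifetime in seconds, not a date. */
    static HttpCookie persistentCookie(String name, String value, long lifetimeSeconds) {
        HttpCookie cookie = new HttpCookie(name, value);
        cookie.setMaxAge(lifetimeSeconds);
        return cookie;
    }

    /** Deletion: a max-age of zero tells the browser to discard the cookie. */
    static HttpCookie deletionCookie(String name) {
        HttpCookie cookie = new HttpCookie(name, "");
        cookie.setMaxAge(0);
        return cookie;
    }

    public static void main(String[] args) {
        long thirtyDays = 30L * 24 * 60 * 60;
        System.out.println(sessionCookie("abc123").getMaxAge());
        System.out.println(persistentCookie("theme", "dark", thirtyDays).getMaxAge());
        System.out.println(deletionCookie("theme").hasExpired());
    }
}
```

Note that only the session id and a harmless preference are stored here; nothing sensitive goes into the cookie.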

Server side session management challenges

At first glance, keeping the session on the server side sounds like a great solution. However, it always raises concerns when it comes to scale. Imagine you need to replicate client session state across multiple servers to achieve high availability. Both the replication time and the memory limit of each server will keep your system from scaling linearly. To solve or minimize this, we can be selective about what we store in the session, use sticky sessions so that a session does not have to be replicated across all the machines, or even store the state on the client where possible, for example in a rich client UI (e.g. Flex or Silverlight). A post will be written about this topic later on.

Transient vs Persistent State

  1. A session on the server can time out (after ~30 minutes of inactivity).
  2. A session on the server can be persisted to a file across Tomcat restarts.
  3. Persistent state should be stored in a database.
  4. Objects put in the session should be Serializable.
  5. Avoid putting too much info in the session, because we don't want to carry too much baggage during session replication. One server crashing because of memory depletion can spread to other servers via session replication. Not good! Should we reconsider storing the session on the client? This article talks about it.
  6. Session replication is needed to support failover. Sticky sessions are simpler but suffer data loss when the box goes down. We can assign one or two servers as backups to avoid losing the session. To go for the sticky session approach, we need to identify the "sticky" part: what can we use to link separate requests? Using the IP address can potentially overload a box, because some Internet service providers use a set of proxy servers to serve many clients. This subject can be developed further. We will come back to it later!
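To make item 6 a bit more concrete, here is a rough sketch of sticky routing that uses the session id (rather than the client IP) as the "sticky" part, with the next server in the list acting as the backup. StickyRouter and the server names are invented for illustration; real load balancers offer more refined schemes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sticky-session router keyed on the session id. */
final class StickyRouter {
    private final List<String> servers;

    StickyRouter(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        this.servers = new ArrayList<>(servers);
    }

    /** Same session id always maps to the same index, so requests stick to one box. */
    private int indexFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), servers.size());
    }

    /** The server that owns the session. */
    String primaryFor(String sessionId) {
        return servers.get(indexFor(sessionId));
    }

    /** The one server that receives a replica, instead of replicating to every machine. */
    String backupFor(String sessionId) {
        return servers.get((indexFor(sessionId) + 1) % servers.size());
    }

    public static void main(String[] args) {
        StickyRouter router = new StickyRouter(Arrays.asList("app-1", "app-2", "app-3"));
        System.out.println(router.primaryFor("session-42"));
        System.out.println(router.backupFor("session-42"));
    }
}
```

Because the key is the session id, many users sitting behind the same ISP proxy still spread across the servers, which is exactly the problem with IP-based stickiness. With a single server, the backup is the primary itself, so this sketch only helps with two or more boxes.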

Flex Hacking Series Part 1 – Event Model

Event Model

Event Flow

The idea is simple. Here is the regular event flow: the user interacts with the UI, an event is generated, broadcast via the event dispatcher (bubbling up the display hierarchy if enabled) and captured by any registered listeners, and a set of actions is taken in response. To understand it in a bit more detail, you can check this article and play with its demo. In short, under the hood, it has 3 phases: capturing, targeting and bubbling. The display list always starts from Stage as the root at the top, then SystemManager, then your Application. The event is created by the Flash Player, travels down from Stage to the target component and bubbles back up to Stage (if enabled). On this round trip, it triggers the event listeners' actions. Most components communicate with others using events, which convey useful information and data, but only visual components (objects in the display list) can participate in the event flow described above.

In practice, you normally just need to register a listener on the target component for the particular event it dispatches. You seldom need to understand the event flow above in detail. However, if you want to exercise more control over the event, like stopping it from propagating, you need to understand how it works first.

Cancel/ Stop the event

Within Flex, by default, events only broadcast themselves to their parent component. If you want the event to broadcast to its parent's parent (and all the way up your component chain in the display object hierarchy), then you tell that event to bubble. Any listener along the chain can stop the event from travelling further by calling stopPropagation() or stopImmediatePropagation(). The difference between the two is that stopImmediatePropagation() not only prevents the event from moving to the next node, but also prevents any other listeners on the current node from receiving it. Note that the cancelable flag set during event construction does not block these calls; it only controls whether preventDefault() can cancel the event's default behavior.
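If the three phases and the two stop methods are still fuzzy, the toy model below may help. It is written in plain Java so it can be run anywhere, and it is only a simulation: EventFlowSketch, Node and Event are hypothetical names and none of this is the Flex or Flash Player API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of capture, target and bubble phases; not the Flex API. */
final class EventFlowSketch {

    static final class Event {
        final boolean bubbles;
        boolean stopped;            // set by stopPropagation()
        boolean stoppedImmediately; // set by stopImmediatePropagation()
        Event(boolean bubbles) { this.bubbles = bubbles; }
        void stopPropagation() { stopped = true; }
        void stopImmediatePropagation() { stopped = true; stoppedImmediately = true; }
    }

    static final class Node {
        final String name;
        final Node parent;
        final List<Consumer<Event>> captureListeners = new ArrayList<>(); // used on ancestors only
        final List<Consumer<Event>> listeners = new ArrayList<>();        // target and bubble phases
        Node(String name, Node parent) { this.name = name; this.parent = parent; }

        /** Runs the listeners of one node; returns false when the event must not travel on. */
        private boolean fire(List<Consumer<Event>> handlers, Event e) {
            for (Consumer<Event> handler : handlers) {
                handler.accept(e);
                if (e.stoppedImmediately) break; // remaining listeners on this node are skipped
            }
            return !e.stopped;
        }

        /** Dispatches an event whose target is this node. */
        void dispatch(Event e) {
            List<Node> ancestors = new ArrayList<>(); // root first
            for (Node n = parent; n != null; n = n.parent) ancestors.add(0, n);
            for (Node n : ancestors) {                        // 1. capturing: root down to parent
                if (!n.fire(n.captureListeners, e)) return;
            }
            if (!fire(listeners, e)) return;                  // 2. targeting
            if (!e.bubbles) return;
            for (int i = ancestors.size() - 1; i >= 0; i--) { // 3. bubbling: parent up to root
                Node n = ancestors.get(i);
                if (!n.fire(n.listeners, e)) return;
            }
        }
    }

    public static void main(String[] args) {
        Node stage = new Node("Stage", null);
        Node app = new Node("Application", stage);
        Node button = new Node("Button", app);
        stage.captureListeners.add(e -> System.out.println("capture at Stage"));
        button.listeners.add(e -> System.out.println("target at Button"));
        app.listeners.add(e -> System.out.println("bubble at Application"));
        button.dispatch(new Event(true));
    }
}
```

In this model, stopPropagation() lets the current node finish its listeners before the trip ends, while stopImmediatePropagation() cuts off the remaining listeners on the same node as well.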

Some events have an associated default behavior. For example, the doubleclick event has an associated default behavior that highlights the word under the mouse pointer at the time of the event. Your event listener can cancel this behavior by calling the preventDefault() method. 

Define your own Custom Event

Now you understand the event flow. Next, you need to know how to create a custom event to carry additional info. This article helps you to achieve this goal.


Hibernate vs iBatis

Hibernate is great. However, I don’t see it fitting every data access requirement. At its core, it is an ORM tool that helps you map your object model to a relational model. If you have full control of your relational model and perform lots of CRUD operations, it is certainly a great tool for you. Its transparent persistence, two-level caching, dirty checking, lazy/eager data retrieval and SQL generation can indeed save us lots of development time. However, one tool doesn’t fit all! Why not?

At my current company, I have created a reporting tool that interfaces with a dimensional model in a data warehouse. In this setting, you deal with a star schema with denormalized dimensions. Often, I need to tune query performance by looking into the explain plan. Without full control of the SQL, that job is hard to do. Apart from that, a reporting tool mostly issues read-only, set-based queries to the data warehouse. The result sets returned don’t fit into my OO model at all. Again, Hibernate just doesn’t fit in. People at my company argue that I should use named queries in Hibernate for the sake of sticking with the standard. I am like ok, whatever… I already know a tool called iBatis that lets me do my job cleanly. Why the hell would I be motivated to try named queries, which are basically a way to bypass the ORM model and query the database directly? What benefit would I get from this? The cache? Our fact and dimension tables are updated by ETL, not by my reporting app. Unless the ETL sends me an event when the update cycle finishes so that I can flush my cache, I simply don’t think the cache helps me much.

Anyway, it is just my own little perspective. You don’t have to agree with me. The point here is not that I don’t like Hibernate, but that I don’t like being pushed to use it only because it is “standard” to someone. If Hibernate could help me construct my SQL based on user input, stream my results directly to my presentation layer, populate my model automatically based on the mapping I provide, detect data warehouse changes and take care of my cache, then I would be more than happy to adopt it in my reporting app. Otherwise, I am not eager to dump my iBatis DAO layer unless the political game leaves me no choice.

Reference

  1. http://www.nofluffjuststuff.com/media.jsp?id=19
  2. http://www.javalobby.org/articles/hibernate-query-101/
  3. How to use named parameters and named query in Hibernate?
  4. Don’t repeat your DAO 

Reporting solution!

Open source reporting

My company needs a reporting engine, but it doesn’t want to go for an expensive commercial one like MicroStrategy. In fact, I don’t know why we would need to pay so much when there are tools out there for FREE. As usual, I googled the Net and found two seemingly promising open source reporting solutions.

  1. Pentaho Reporting
  2. Jasper Reporting

Both of them are bundled with a suite of tools for OLAP, data mining, ETL, etc. For my part, I just want a non-invasive reporting engine that can easily integrate into our architecture. To my dismay, I found out Pentaho doesn’t go this route. It basically gives you a preconfigured reporting server. You can build your reports and deploy them following the manual. However, I can hardly imagine a reporting solution that satisfies all the business requirements without customization. All I expected from Pentaho was a jar file with documentation that shows me how to use its API to generate reports in different formats and how to integrate it with our database. I attempted to look into the code and extract the parts I wanted from Pentaho. However, I found out the engine is actually not that powerful. Once you strip out the workflow part, it is basically a simple SQL executor that renders the result according to the UI info embedded in the report definition. What is wrong with that?

  1. We want to handle pagination and data streaming because our data volume is huge. In Pentaho, you need to take care of these yourself. So you write your own SQL, paginate it yourself, and stream it yourself if the result set is huge. Isn’t that what we are already doing without it? Apart from that, each report in Pentaho needs a report definition. It supports dynamic SQL via token replacement. That is too primitive for me: I want control flow, because I may decide which tables to join based on the input filters.
  2. On the UI side, Pentaho helps you render your result into graphs, tables, etc. Again, I don’t like this UI solution either. I found that JFreeChart is not as powerful as the Flex solution. I am adopting Flex, which gives me much more powerful visualization tools. All I want is to ship the data from my query’s result to my Flex app.
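To show what I mean by control flow in item 1, here is a hand-rolled sketch in plain Java that decides which tables to join based on the input filters and paginates the result. It is not Pentaho or iBatis code; the star schema (fact_sales, dim_region, dim_date), the ReportQuery class and the LIMIT/OFFSET syntax (MySQL/PostgreSQL style) are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical report query whose joins depend on which filters the user supplied. */
final class ReportQuery {
    final String sql;
    final List<Object> params;

    private ReportQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    /** Null filters are skipped; page is 1-based. */
    static ReportQuery build(String region, Integer year, int page, int pageSize) {
        StringBuilder sql = new StringBuilder(
            "SELECT f.product_id, SUM(f.amount) AS total FROM fact_sales f");
        List<String> where = new ArrayList<>();
        List<Object> params = new ArrayList<>();

        if (region != null) { // join the region dimension only when it is filtered on
            sql.append(" JOIN dim_region r ON r.region_id = f.region_id");
            where.add("r.name = ?");
            params.add(region);
        }
        if (year != null) {   // same for the date dimension
            sql.append(" JOIN dim_date d ON d.date_id = f.date_id");
            where.add("d.year = ?");
            params.add(year);
        }
        if (!where.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", where));
        }
        // Pagination is still our job: bind the page size and the row offset.
        sql.append(" GROUP BY f.product_id ORDER BY f.product_id LIMIT ? OFFSET ?");
        params.add(pageSize);
        params.add((page - 1) * pageSize);
        return new ReportQuery(sql.toString(), params);
    }

    public static void main(String[] args) {
        ReportQuery query = build("EMEA", null, 2, 50);
        System.out.println(query.sql);
        System.out.println(query.params);
    }
}
```

Plain token replacement cannot express the two if blocks above, and values are bound as ? parameters rather than pasted into the SQL string.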

How about Jasper? Pretty much the same, but the good thing about Jasper is that it gives you the jar and the documentation on how to use it, instead of a reporting server like Pentaho. So I can use it as a report renderer to generate PDF and Excel, like the other utility libraries I use. So, what is my final solution?

I finally decided to create my own report definition that my Flex UI can take and render into the reporting interface. So I don’t need to create a form for each report. Apart from that, my report definition has an iBatis SQL template embedded in it, so I can leverage iBatis’ dynamic SQL feature, which supports control flow logic, and its automatic result class population. Yes, I still need to handle pagination and streaming myself. But at least it already saves me time. The populated result objects are returned to Flex via AMF, so I don’t need to marshal and unmarshal them as XML. That saves processing time and costs less bandwidth. In the end, my solution combines the best in the market:

  1. Powerful reporting widgets provided by Flex
  2. Fast streaming and RPC protocol – AMF
  3. Good dynamic sql generation and mapping tool from iBatis
  4. Good reporting rendering tool from Jasper that helps me to do PDF and Excel generation

My solution is more flexible, as I can plug in a Hibernate mapping if I don’t want to write my own SQL at all. Apart from that, no UI work is needed to deploy a new report unless my generic reporting interface is not enough.

Later, if I really need the workflow engine provided by Pentaho, I can plug it in. Again, though, the documentation provided doesn’t give clear instructions or APIs for how to do it.

Reference

Below are references I used to build my solution:

  1. Flexible reporting with JasperReport and iBatis
  2. How Kodo JPA handles large result set (its optimization guide is a good reference even if you don’t use Kodo)
  3. Process Large Result Sets in Java Web Application
  4. Streaming architecture

 


How to pick a good web hosting company for Java webapp

I currently use Dreamhost for my own company “JustProposed.com”, which is powered by a typical LAMP stack. It is a great shared hosting solution, but it doesn’t support websites powered by Java. If you have a Java website, I suggest you try a web hosting company that offers a VPS (Virtual Private Server) solution. A VPS occupies a middle ground between a dedicated server and regular shared hosting. You get the features of dedicated hosting in a shared environment. In other words, your virtual server runs on one of the host servers, and the host server runs a number of virtual servers. Each virtual server shares the host server’s memory, CPU, Internet connection and other resources. No one VPS can monopolize resources. Each VPS gets a guaranteed share of the server’s CPU, disk IO and network.

Before you pick a web hosting company for your Java webapp, you had better get familiar with the system needs of your webapp first. Here are some topics that you may encounter:

  1. Shared JVM vs private JVM – The two biggest problems with shared Java hosting are memory leaks and security. If someone has a memory leak or memory-hungry code in a shared environment, all of the people sharing that JVM suffer the same memory problems. Apart from that, it is not secure. Therefore, you don’t want to share a JVM with other users even though it is cheaper. One more disadvantage of a shared JVM is that you cannot restart Tomcat whenever you wish.
  2. PermGen space (default = 64MB) is used for things that do not change (or do not change often), e.g. Java classes. So large, complex apps will often need lots of PermGen space. Similarly, if you are doing frequent war/ear/jar deployments to running servers like Tomcat or JBoss, you may need to restart the server after a few deploys or increase your PermGen space. To increase the PermGen space, use something like: -XX:MaxPermSize=128m
  3. Heap size setting – Java has a couple of settings that help control how much memory it uses:
    • -Xms sets the initial (minimum) heap size.
    • -Xmx sets the maximum heap size. When choosing -Xmx, keep in mind that it has to be enough for you to run your app. If it is set too low, you may get java.lang.OutOfMemoryError (even when there is sufficient spare memory on the server).
    • We typically specify the same amount of memory for both flags (-Xms and -Xmx) so that the JVM claims its full heap at startup. This way, the JVM doesn’t need to resize the heap at runtime, which is a leading cause of JVM instability.
    • If you don’t specify a memory size in the JVM startup flags, the JVM limits the heap to 64MB (512MB on Linux), no matter how much physical memory you have on the server!
    • For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all the RAM on the server. Otherwise, the JVM will only be able to use 2GB or less of memory. 64-bit JVMs are typically only available for JDK 5.0 and later.
    • Suggested memory size: PermGen + Max Heap = 256MB. Of course, the more you get the better!
  4. Garbage collection (GC) – With a large heap, garbage collection could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multi-gigabyte heap.
    • Single-threaded vs concurrent GC – In JDK 1.3 and earlier, GC is a single-threaded operation that stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but also results in very poor performance on multi-CPU computers, since all other CPUs must sit idle while one CPU runs at 100% to free up heap space. It is crucial to select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable, so we strongly recommend upgrading to JDK 5.0.
    • Pick a good GC algorithm – Parallel GC frees up memory faster but with longer pauses. Concurrent GC has shorter pauses but doesn’t free up all memory at once.
  5. How to make your Tomcat available on port 80
  6. More on Java system tuning on multi-core servers
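Once you have a VPS, it is worth checking what the JVM actually picked up from your -Xms/-Xmx flags and which garbage collectors it is running. The small program below is one way to do that using only standard JDK calls (Runtime and the java.lang.management beans); the class name JvmSettingsReport is just something I made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Prints the heap limit and garbage collectors the running JVM is actually using. */
final class JvmSettingsReport {

    /** Upper bound of the heap in megabytes, i.e. the effective -Xmx. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Heap currently reserved by the JVM in megabytes (starts near -Xms). */
    static long currentHeapMegabytes() {
        return Runtime.getRuntime().totalMemory() / (1024 * 1024);
    }

    /** Names of the active collectors, as reported by the JVM itself. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("Current heap (MB): " + currentHeapMegabytes());
        System.out.println("Collectors: " + collectorNames());
    }
}
```

Run it with the same flags as your server, for example java -Xms128m -Xmx128m JvmSettingsReport, and compare the numbers with what the hosting plan promises. The collector names vary by JVM vendor and version.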

Below are some of the great VPS options on the Net:

RimuHosting.com provides Xen-based VPS. Xen is the virtualization technology that Amazon uses for EC2.

  • Guaranteed 99.9% uptime
  • They target at most 8 customers per CPU core, with 16% usage.
  • The host server is a 2U Supermicro with 32GB of memory, 8 2.4GHz Xeon CPU cores and 4TB of disk space.
  • You can get 256MB memory and 4GB disk space allocated for about $30.
  • Great how-to wiki page


SliceHost.com also provides Xen-based VPS.

  • Guaranteed 99.9% uptime
  • The host server has 16GB of memory, a 64-bit quad-core CPU with 8+ GHz, RAID-10 disk storage and a Gigabit network.
  • You can get 256MB memory, 10GB disk space and 100GB bandwidth for about $20.
  • No contracts, no setup fees.
  • Great how-to wiki page

WestHost.com costs $10 per month.

  • Guaranteed 99.9% uptime
  • Lots of hard drive space but no guarantee on memory allocation.

http://www.perfectblogger.com/2007/10/why-i-think-slicehost-is-the-best/

How to get SVN service for FREE?

If your website is in PHP, I believe Dreamhost is enough for you. Apart from using Dreamhost for web hosting, I also use it as my SVN repository. I have a project named “Justproposed.com” that I want to work on in collaboration with my partners. I want to use SVN, which I have found very helpful at work. However, I don’t want to pay too much to do that. I wondered whether any web hosting company on the Net provides this service, so that I could pay my regular web hosting fee and get this extra service for FREE (it is practically free if the web hosting company you pick gives you tons of disk space). I searched through different postings and feedback and eventually found “Dreamhost”, which gives me exactly what I want.

DreamHost

If you use “SOLUTIONHACKER” as the code, you will get your first year FREE as well. Apart from the great service they provide, there are many supporters on the Net who help each other. Here is a good article that shows you how to set up your first SVN repository on Dreamhost. Follow it to get your project started in a cheap way!

http://www.jtbullitt.com/projects/tech/svn-for-website-development


Wiring up Flex, Mate, BlazeDS, Spring, Hibernate and MySQL with Maven 2 – Part 1

Introduction

This article builds on the great work that Sébastien Arbogast has done. He has written 3 articles that show you how to wire up Flex, BlazeDS, Spring, Hibernate and MySQL with Maven as the build process. I have included his articles below for your reference.

  1. The Flex, Spring, and BlazeDS full stack – Part 1: Creating a Flex module
  2. The Flex, Spring and BlazeDS full stack – Part 2: Writing the to-do list server
  3. The Flex, Spring and BlazeDS full stack – Part 3: Putting the application together

I have found Sebastien’s work to be a good foundation for my own project. To contribute back to the community, I will write a series of articles to show you how you can customize and extend the todolist sample.

What is in the Part 1 of the series…

  1. Enhancements on the Maven build process
    • Leverage RSL to factor out the framework swc, so the size of the application swf is reduced. Apart from that, I also take advantage of the Flash Player Cache, which is available as of version 9 update 3, to cache the framework libraries.
    • Clean up the Flex and BlazeDS dependencies in POM as the latest version of the sdk is available and the BlazeDS dependencies are officially available.
    • Include some common reports for maven site generation
    • Embed Jetty web server in the build process for quick deployment and testing
  2. Document how to get the sample up on Eclipse for development
  3. Use Mate as Flex framework
    • Restructure ToDoList sample to leverage Mate framework
    • Factor out Mate as RSL and integrate it with Maven build process via Flex-mojo plugin.

What is in the coming articles…

  1. In part 2 of this series, I will show you how to use flex-mojo to build a modular Flex application.
  2. In part 3 of this series, I will show you how to test your Flex app via FlexUnit (unit test) and FlexMonkey (functional test).
  3. In part 4 of this series, I will work on the server side. I am planning to add monitoring, caching and security to the server side.

Review “ToDoList” sample

Before I start my journey, let me highlight what Sebastien has done first:

  1. Sebastien’s sample demonstrates how to use Maven as a build process. There are 3 parts or subprojects in his sample. They are:
    • todolist-config (configuration files shared by other subprojects)
    • todolist-ria (Flex frontend)
    • todolist-web (Server side that supports the Frontend)
  2. All these subprojects are modules of the main project (root POM). Finally, they are combined into a war artifact that is ready to deploy to Tomcat or another J2EE webapp server.
  3. The Flex frontend and the backend communicate through a binary RPC protocol – AMF. AMF is considered to be the simplest and fastest remoting approach available in Flex. Recently, Adobe released BlazeDS as an open source implementation of the AMF spec. In this sample, BlazeDS is used. To use BlazeDS, there are a few things you need to do:
    • Expose your POJO service via BlazeDS. This sample shows you how to integrate BlazeDS with Spring.
    • Make the BlazeDS endpoints available to the Net via a Servlet.
    • Have the frontend and backend share the same BlazeDS configuration files.
  4. In this sample, you can also find out how to use the flex-mojo maven plugin to compile the Flex frontend code into a swf. Apart from the flex-mojo plugin, there are two other good plugins worth mentioning:
    • maven-assembly-plugin - can be used to bundle all the files under a directory into a zip file. It is used by todolist-config to bundle all the configuration files (services-config.xml and remoting-config.xml) into a zip during the package phase.
    • maven-dependency-plugin - can be used to unpack the zip file and move it to the place you want. It is used by todolist-web to unpack the config zip during the generate-resources phase.

Enhancements on maven POM

I have modified the sample’s maven pom as follows:

  • Link to the new repository “Sonatype Forge” in the root POM, so that I can use the new version of flex-mojo and simplify the todolist-ria Adobe framework dependencies. Apart from that, I also removed Sebastien’s private repository because the BlazeDS libraries are available in the official maven repository (Note: the BlazeDS libraries in the official maven repo are version 3.0 instead of 3.0.0.544, so you need to modify the webapp pom correspondingly).

    <repositories>
        <repository>
            <id>flex-mojos-repository</id>
            <url>http://svn.sonatype.org/flexmojos/repository/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>flex-mojos-repository</id>
            <url>http://svn.sonatype.org/flexmojos/repository/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>

  • Because I link to the Sonatype repository, my todolist-ria can depend on a single flex-framework pom dependency instead of all the swc dependencies. Note that the pom dependency is a way to factor out all the Adobe swc dependencies, which makes your pom easier to maintain.

        <dependency>
            <groupId>com.adobe.flex.framework</groupId>
            <artifactId>flex-framework</artifactId>
            <version>3.1.0.2710</version>
            <type>pom</type>
        </dependency>

  • I include the MySQL driver as a dependency in my webapp pom. I think it is cleaner to bundle it in the war. I have also added the jetty plugin to the POM so you have a web server embedded in the build process. With this, you can run this sample application right after you check it out from svn (assuming you have maven 2 installed). To start jetty, build from the project root first and then issue the jetty command under your webapp project.

project_root> mvn clean install
project_root/jp-web> mvn jetty:run-war

  • I have included some reports that will be shown after site generation. You may not be able to do mvn site-deploy because it is linked to my web hosting site. However, you can modify it for your own use.

Get the sample up on Eclipse

To develop on Eclipse, you can follow the steps below:

  1. Create the Eclipse project files by running the command below at the project root. This will create 2 Eclipse projects: one for todolist-ria and one for the webapp. Notice that I use -Declipse.downloadSource=true to include the source files of my dependencies in my Eclipse projects, so I can get to the source code if needed.

mvn -Declipse.downloadSource=true eclipse:eclipse

  2. Import the projects into Eclipse.
  3. Add a new variable M2_REPO and set it to [home]/.m2/repository
  4. If you have installed the Flex Builder plugin in your Eclipse, you can add the Flex Project Nature to the todolist-ria project.
    • Select Application Server Type: J2EE
    • Check “Use remote object access service” with LiveCycle Data Service selected.
    • Set up the path. I have my Tomcat installed under C:\tools with the default port 8080. You should adjust this if you installed it differently.
    • Remove the generated main.mxml under the src folder.
    • Set index.mxml under the src folder as the default Flex application file to run.
    • Use Default Flex SDK in Flex Compiler Configuration instead of Server Flex SDK.
    • Right click and select Recreate HTML Template if you see an error.
    • After all these steps, your Flex application is configured to point to the webapp server and share the BlazeDS configuration files. You can verify this in Flex Compiler Configuration’s Additional Compiler Parameters, which should contain: -services "C:\tools\tomcat-6.0.16\webapps\jp\WEB-INF\flex\services-config.xml" -locale en_US
    • Move the war to your Tomcat’s webapps folder and start Tomcat with remote debugging enabled. If you are using Windows, set DEBUG_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8787,suspend=n in your bin/catalina.bat.
    • Start your webapp via bin/startup.bat
    • Put a breakpoint in the TodoServiceImpl save method and start the remote debugger on localhost:8787
    • Right click index.mxml and select Run As Flex Application.
    • Add a new entry and save it in the Flex app. You should see your remote debugger halt at the breakpoint for you to debug.
    • Now you can change your Flex code and test it without leaving Eclipse. However, if you modify the service in the webapp, you need to run “mvn clean install” and deploy the war to Tomcat before your Flex code can call your server-side code via AMF.

Use Mate as Framework

If you are not familiar with Mate, click the image below that moves you to a nice presentation.

 

What did I do to restructure the todolist sample into a Mate app?

  1.  

Download

I have made my work available at: www.solutionhacker.com/wp-content/uploads/todolist-jp-modified.zip

Reference

Below are the references I used for the article:

  1. Flex mojo compiler user guide
  2. Flex mojo dependency scope rules
  3. Flex 3 feature introduction: Flex 3 RSL
  4. Improving Flex application performance using Flash Player Cache
  5. FNA archetype projects 

 


Data representation

Data can be represented in text format for humans and in binary format for computers. Here my focus is on text representation.

For applications, we commonly use XML because:

  1. Its self-documenting format describes structure and field names as well as specific values, and it is easily digested by both humans and machines.
  2. It is platform-independent, thus relatively immune to changes in technology, and it facilitates data exchange across heterogeneous systems.
  3. It supports Unicode.
  4. It can represent common computer science data structures: records, lists and trees.
  5. It allows validation using schema languages such as DTD and XSD. XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system, allow for more detailed constraints on an XML document’s logical structure, and must be processed in a more robust validation framework.
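As a quick illustration of points 1 to 4, the sketch below parses a tiny XML document with the DOM parser that ships with the JDK and reads a field by its name. The orders/order/customer document and the XmlRecordReader class are made-up examples, and no DTD or XSD validation is performed here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Reads a named field out of a list of XML records using the JDK's DOM parser. */
final class XmlRecordReader {

    /**
     * Returns the text of the given field inside the index-th record element.
     * The field names travel with the data, so no external layout description is needed.
     */
    static String fieldOf(String xml, String recordTag, int index, String field) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element record = (Element) doc.getElementsByTagName(recordTag).item(index);
            return record.getElementsByTagName(field).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read field from XML", e);
        }
    }

    public static void main(String[] args) {
        // A list of records, each a small tree of named fields; one value is non-ASCII.
        String xml = "<orders>"
            + "<order><id>1</id><customer>Zo\u00eb</customer></order>"
            + "<order><id>2</id><customer>Li</customer></order>"
            + "</orders>";
        System.out.println(fieldOf(xml, "order", 0, "customer"));
        System.out.println(fieldOf(xml, "order", 1, "id"));
    }
}
```

The same few lines also hint at the downside of XML: most of the bytes in the sample document are tags rather than data.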

With all the above advantages, XML quickly became the standard for data exchange, especially in the web service world. However, XML also has disadvantages: it is verbose, and its hierarchical model of representation is limited compared to an object-oriented graph.

Other options:

  1. XML vs JSON - JSON is now more attractive than XML for the kinds of data interchange that power Web-based mashups and Web widgets. Why? Look at the articles below:
    • Fixing AJAX: XMLHttpRequest considered Harmful – You don’t see many AJAX examples that access third-party web services like Amazon, Yahoo and Google. That is because all the newest web browsers impose a significant security restriction on the use of XMLHttpRequest: you aren’t allowed to make an XMLHttpRequest to any server except the server your web page came from. If you attempt to do so, XMLHttpRequest will either fail or pop up warnings, depending on the browser you are using… Solutions: an application proxy, an Apache proxy, or the Script Tag Hack (On-demand JavaScript).
    • JSON vs XML: Browser Security Model – This article comments on the solutions proposed above. It indicates that the Script Tag approach is better than the proxy.
    • JSON and Yahoo!’s JavaScript API – This article gives you an example of how to use the Script Tag to communicate with the Yahoo Web service API and bypass the restriction of XMLHttpRequest. The way to bypass the XHR restriction is to not use XHR at all: the cross-site requests are made by adding script tags to a document’s HEAD with DOM methods (i.e. .appendChild()).
    • Is JSON better than XML (a good objective review)
    • In conclusion, JSON lets you use the Script Tag approach to bypass the XHR security restriction because JSON itself is part of JavaScript. That makes JSON popular.
  2. YAML as an alternative of data serialization
  3. Java Serialization will take object to binary representation (versioning headache). XStream is a simple library to serialize objects to XML and back again.
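Coming back to the Script Tag approach from item 1: it works because whatever the remote server returns gets executed by the browser as JavaScript. Below is a rough sketch of what the server side of such an exchange could look like, assuming the service lets the caller name a callback function that receives the JSON (a pattern often called JSONP). The function name wrap_in_callback, the callback name handleResults and the payload are all hypothetical, not taken from Yahoo!'s API.

```python
import json
import re

# Only accept plain (optionally dotted) identifiers as callback names, so a
# caller cannot smuggle arbitrary script in through the callback parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*")


def wrap_in_callback(callback: str, payload: dict) -> str:
    """Return a JavaScript statement that hands the JSON payload to `callback`."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name")
    return f"{callback}({json.dumps(payload)});"


# Hypothetical callback name and payload.
body = wrap_in_callback("handleResults", {"total": 2, "titles": ["a", "b"]})
print(body)
```

The page would define handleResults itself and then append a script tag pointing at the service; when the response loads, the browser simply calls that function with the data. The same trick is not possible with an XML response, because XML is not executable script.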

For the machine, data is represented in binary format:

  1. The art of assembly language (a free book that you can read online)

Plenty of Fish – Cash cow!

A site called “PlentyOfFish.com” is currently getting 30 million hits a day. The number itself doesn’t blow me away. What surprises me is that this site is basically operated by a single man, Markus Frind. How did he achieve that? If you want to hear it from him, you can go to his interview from this link. Otherwise, you can read the summary I took from his interview.

The stuff I learnt from Markus

You may think that Markus must spend a lot of $$ to maintain his site. A picture of a server farm may pop up in your head. Hahaha… all he needs is just 1 web server and 3 database servers. This is a cost that you and I can afford. No need to write a business plan and wait for VC $$ nowadays.

Here are some quick tips from Markus:

  1. You need a lot of RAM. RAM is cheap, so go ahead and power up your box with tons of RAM!
  2. Markus uses the Akamai CDN to offload the bandwidth of serving images across different locales.
  3. Separate read and write database operations.
  4. Markus uses one database as the master for writes and 2 databases as slaves to handle the searches (reads). According to him, radius-based searches demand lots of resources. “If you have one system to do just one thing, it will do it much efficiently.”
  5. Markus puts lots of RAM in both the web and db servers. “If you can load your whole db in the RAM, do it!”
  6. Optimizing db access is the key to handling lots of requests.
  7. Denormalization is necessary if you want to reduce the number of joins that can potentially slow down your queries.
  8. PlentyOfFish.com relies purely on “Word of Mouth” marketing. Do things right and your users will spread the word for you. Cheapest marketing strategy ever!
  9. PlentyOfFish.com is a FREE site. Because it is free, it doesn’t have strict requirements for things like uptime. It can be down without much of an issue.
  10. PlentyOfFish.com is monetized solely through advertising like Google Ads. With just this, Markus is making around 10 million annually. Amazing!
  11. PlentyOfFish.com is built purely on Microsoft solutions like IIS, ASP.NET and SQL Server. In fact, you could build it using other solutions like Apache, Spring and MySQL.
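Tips 3 and 4 boil down to one routing decision per statement: writes go to the master, reads are spread over the slaves. Here is a toy sketch of that idea. It is not how Markus implemented it (the interview doesn't say); the class name, the server names and the simple "first keyword" test are all assumptions for illustration.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the slaves."""

    # Statements starting with one of these keywords are treated as writes.
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin

    def pick(self, sql: str):
        """Return the server that should run this statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._slaves)


# Hypothetical server names; in practice these would be connection objects.
router = ReadWriteRouter("master-db", ["slave-db-1", "slave-db-2"])
print(router.pick("INSERT INTO profiles (name) VALUES ('x')"))
print(router.pick("SELECT * FROM profiles WHERE city = 'Vancouver'"))
print(router.pick("select * from profiles"))
```

With this split the expensive searches never compete with writes on the master, which is the point of the "one system to do just one thing" quote.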

I love to see how people like Markus beat giants like Match.com. One man beats hundreds of people with a simple system setup. Incredible! Folks, there is no excuse to whine about having no $$ to start your business!

Although Markus made it sound easy during the interview, there are areas the interviewer didn’t cover:

  1. The PlentyOfFish.com web front end doesn’t look good. How could it attract the first set of users in the first place? FREE
  2. If you go to a FREE site without data, you may leave it right away. How did PlentyOfFish.com attract its first real users? Did PlentyOfFish.com crawl competitors’ data to bootstrap the site?
  3. PlentyOfFish.com makes $$ purely from Google AdSense. However, according to John Chow, AdSense is not a good place to make $$. Why is that?

What could possibly go wrong with his approach:

His database architecture is the traditional master-slave approach. It can offload the read operations but not the writes. Obviously the master becomes the write bottleneck and a single point of failure. And as load increases, the cost of replication increases as well: replication costs CPU, network bandwidth, and disk IO. The slaves fall behind and serve stale data. The folks at YouTube had a big problem with replication overhead as they scaled. This problem can be tackled by sharding/federation. I will discuss this topic later.
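Just to give a flavour of what sharding means before that later discussion, here is a minimal sketch: each user is assigned to one of several independent databases by hashing the user id, so writes are spread out instead of all landing on a single master. The shard names and the choice of a hash-modulo scheme are my own assumptions, not something from the interview.

```python
import hashlib

# Hypothetical shard servers; each would be its own master (plus slaves).
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard number in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable hash (not Python's built-in hash(), which varies per run).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def server_for(user_id: str) -> str:
    """Return the shard server responsible for this user."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


print(server_for("markus"))
print(server_for("alice"))
```

One known weakness of plain hash-modulo is that changing the number of shards moves most users to a different shard, so real deployments often use a lookup table or consistent hashing instead.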

 


Powerful Full Text Search – Part 3 Solr

Introduction of Solr

Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called "indexing") via XML over HTTP (RESTful). You query it via HTTP GET and receive XML results.

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces – XML and HTTP
  • Comprehensive HTML Administration Interfaces
  • Scalability – Efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture

Set up Solr

To set up Solr, you should follow this guideline. Once Solr is set up, you practically have an indexing service up and running.

The HTTP/XML interface of the indexer has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:

  • http://[hostname:port]/solr/update
  • http://[hostname:port]/solr/select

To add a document to the index, we POST an XML representation of the fields to index to the update URL. In addition, you can delete documents or update them (i.e. re-post a document with the same unique key). All change operations need a commit to be flushed to the file system. On the other hand, once we have indexed some data, an HTTP GET on the select URL does the querying.
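Here is a small sketch of what those calls could look like from a client, using only the Python standard library. It builds the add and commit messages and the query URL but does not send anything, so it runs without a Solr server. The base URL (localhost on port 8983), the field names and the helper function names are assumptions; the fields you can actually use depend on your Solr schema.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: Solr runs locally on port 8983; use your own [hostname:port].
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(fields: dict) -> bytes:
    """Build an <add><doc>...</doc></add> message for the update URL."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def update_request(xml_body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST to the update URL."""
    return urllib.request.Request(
        SOLR_BASE + "/update",
        data=xml_body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def select_url(query: str, rows: int = 10) -> str:
    """Build the HTTP GET URL for a query against the select URL."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return SOLR_BASE + "/select?" + params


# Hypothetical document; field names must exist in your schema.
add_req = update_request(build_add_xml({"id": "doc1", "title": "Hello Solr"}))
commit_req = update_request(b"<commit/>")  # makes the change visible to searches
print(add_req.data.decode("utf-8"))
print(select_url("title:hello"))

# Against a running Solr you would send them with, for example:
#   urllib.request.urlopen(add_req); urllib.request.urlopen(commit_req)
#   urllib.request.urlopen(select_url("title:hello")).read()
```

Re-posting a document with the same id replaces the old one, which is the "update" case mentioned above.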

Powerful features Behind Solr

If you followed the guideline above, you are already familiar with indexing, searching and facet browsing. Now let’s get down to how to make Solr a scalable solution with great performance.

Caching

TBA

Distribution and Replication

For applications that receive large volumes of queries, a single Solr server may not be enough to meet performance requirements. Therefore, Solr provides mechanisms for replicating the Lucene index across multiple servers that are part of a load-balanced suite of query servers. The replication process is handled through a combination of event listeners enabled through the solrconfig.xml file and several shell scripts (located in solr/bin of the example application).

In a replicating architecture, one Solr server acts as the master server, providing copies of the index (called snapshots) to one or more slave servers that handle query requests. Indexing commands are sent to the master server and queries are sent to the slave servers. The master server can create snapshots manually or by configuring the <updateHandler> section of solrconfig.xml to trigger snapshot creation when commit and/or optimize events are received. In either the manual or the event-driven process, the snapshooter script is invoked on the master server, creating a directory on the server named snapshot.yyyymmddHHMMSS where yyyymmddHHMMSS is the actual time the snapshot was created. The slave servers then use rsync to copy only those files in the Lucene index that have been changed.

<listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
    <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
    <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
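The snapshot.yyyymmddHHMMSS naming makes it easy to tell snapshots apart and to find the newest one. The sketch below only illustrates that naming convention; it is not the logic of Solr's own shell scripts, and the function names and sample directory names are made up.

```python
from datetime import datetime

PREFIX = "snapshot."
STAMP_FORMAT = "%Y%m%d%H%M%S"  # yyyymmddHHMMSS


def snapshot_name(created: datetime) -> str:
    """Return the directory name for a snapshot taken at `created`."""
    return PREFIX + created.strftime(STAMP_FORMAT)


def latest_snapshot(names):
    """Return the newest snapshot directory name, or None if there is none."""
    stamped = []
    for name in names:
        if not name.startswith(PREFIX):
            continue
        try:
            when = datetime.strptime(name[len(PREFIX):], STAMP_FORMAT)
        except ValueError:
            continue  # prefix matched, but the rest is not a timestamp
        stamped.append((when, name))
    return max(stamped)[1] if stamped else None


# Hypothetical directory listing on the master.
listing = ["index", "snapshot.20080315143005", "snapshot.20080316010000", "snapshot.bogus"]
print(snapshot_name(datetime(2008, 3, 15, 14, 30, 5)))
print(latest_snapshot(listing))
```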

Reference

Below are some cool references I found:

  1. Search smarter with Apache Solr, Part 1: Essential features and the Solr schema
  2. Search smarter with Apache Solr, Part 2: Solr for the enterprise
  3. Advanced Lucene

 

 


Powerful Full Text Search – Part 2 Nutch

Introduction of Nutch & Hadoop

After Lucene, its author created another powerful tool: Nutch. Nutch is a powerful crawler built on top of Lucene. With Nutch, you can launch a multi-threaded crawler to obtain information from the Net. At the time of writing, Nutch is at version 0.9. Nutch comes with a list of cool features, including whole-Web crawling and local file crawling for the intranet, indexing all the while.

Hadoop was designed to handle the petabytes of data that Nutch could potentially store and process. In fact, Hadoop has its own file system: the Hadoop Distributed File System (HDFS), which can run on any old run-of-the-mill, low-cost hardware.
Hadoop works by storing parts of the file system’s data on each of the servers in the cluster. As new queries come in, HDFS follows the "moving computation is cheaper than moving data" rule, meaning that moving the processing of the query as close as possible to the data will be faster than placing the query at random within the cluster and moving data long distances across the network.
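The rule is easier to see in a toy example. The sketch below picks a node for a task that needs a given block: it prefers a free node that already stores the block, and only falls back to some other node (which would have to pull the block over the network) when no such node is free. This is just an illustration of the idea, not Hadoop's actual scheduler or API; the node names, block names and the place_task function are all made up.

```python
# Hypothetical cluster layout: which nodes hold a replica of each block.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}


def place_task(block, free_nodes, locations):
    """Pick a node for the task that reads `block`.

    Prefer a free node that already stores the block (no data is moved);
    otherwise fall back to any free node (the block must cross the network).
    Returns a (node, is_local) pair.
    """
    for node in locations.get(block, []):
        if node in free_nodes:
            return node, True
    if not free_nodes:
        raise RuntimeError("no free nodes")
    return sorted(free_nodes)[0], False


print(place_task("block-1", {"node-b", "node-c"}, BLOCK_LOCATIONS))  # local copy available
print(place_task("block-1", {"node-c"}, BLOCK_LOCATIONS))            # must fetch remotely
```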

I have searched around to see if anyone can give me some tips on this tool. Surprisingly, I didn’t find much. But don’t worry, I have found a few that can at least get you started playing with it.

Set up Nutch

Here is the guideline written by Peter Wang that I followed to bring my Nutch up. Follow it and bring your Nutch up before going further. By the way, if you want to run Nutch with Solr, this is a good tutorial.

Nutch Architectural Review

 

 

 
