Archive | Scale and Optimize

Function Currying in Scala – Part 4

When I first read about function currying, I had no idea what it was for. Why would someone want to break an argument list into multiple () groups with one argument each? Luckily, I finally came across an article that explains it well. Given what you have learnt so far from my series, you should have a good foundation to understand that article. I will try to walk through its example with you here and hope you will grasp it as well.

Review the example

[Code listing not preserved.]

You can see that some syntax rules are applied here. Let me go through them with you one by one.

  1. The testRule1() method has return type Unit, so the ‘=’ sign is not necessary. Since the function body has more than one statement, we still need the { }.
  2. parserRule must be an instance member of class Parser.
  3. parserRule returns an instance of one of the case classes “Success” or “NoSuccess”, both of which extend the abstract class ParseResult. The result is then used in pattern matching, much like regular function chaining.
  4. The “apply” method name can be omitted, and the author wants to factor out the Success routine as a separate function.

The author creates a method tryMatch and refactors the assertion routine as below:

[Code listing not preserved.]

There are some interesting points I want to bring up:

  1. ParseResult has [T] next to it. I haven’t talked about that in my series. It is Scala’s way of writing generics (a type parameter).
  2. The 2nd parameter is a function type, which lets you pass in the function that handles the success case.

How to use the new function above to make your test cleaner and easier

[Code listing not preserved.]

According to the pattern matching rules, if case Success is matched, the value in the ParseResult is bound to r. That r then becomes the “result” argument of the function literal passed in. So far so good?
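Since the original listings are missing here, the following is only a minimal sketch of the idea, not the author's code. The ParseResult, Success and NoSuccess types below are hypothetical stand-ins for the parser types named above, and the String return value of tryMatch is my own assumption, added just to make the outcome visible.

```scala
// Hypothetical stand-ins for the parser result types mentioned in the text.
sealed trait ParseResult[+T]
case class Success[+T](result: T) extends ParseResult[T]
case class NoSuccess(msg: String) extends ParseResult[Nothing]

// Hypothetical tryMatch: the 2nd parameter is a function that handles the success case.
def tryMatch[T](parsed: ParseResult[T], onSuccess: T => Unit): String =
  parsed match {
    case Success(r) =>
      onSuccess(r) // r is bound to the value carried by Success
      "matched"
    case NoSuccess(msg) =>
      "failed: " + msg
  }
```

A test would then call something like tryMatch(Success(42), (r: Int) => assert(r == 42)), passing the assertion routine as a function literal.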

Function Currying

OK, I finally get to the point where I can talk about function currying. The example in this blog says that if the success case routine has multiple statements, it will look ugly if you try to put all of them inside the function literal like below:

[Code listing not preserved.]

So, the author suggests using function currying, as it transforms a method taking multiple parameters into a chain of methods taking a single parameter each. Here is what he achieves after currying:

[Code listing not preserved.]

However, I don’t really feel the power of function currying if it simply converts () to {}. If the author had formatted the previous code a bit, he could have achieved something like the one below. To me, it is about the same in terms of clarity.

[Code listing not preserved.]

I believe function currying should have more power than this example shows. I picked this example because I like the author’s refactoring process. I have googled more and noticed that this example may be a good one. However, to be honest, I am still not quite clear about the pivot syntax. So, I googled more and found this statement in this article: “… Sometimes it makes sense to let people apply some arguments to your function now and others later”. It sounds like a curried function allows you to partially apply its parameters.

[Code listing not preserved.]
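As a rough illustration of “apply some arguments now and others later”, here is a small sketch. The names add, addTen and repeat are made up for this example; foldLeft is the standard library method, which is itself declared with two parameter lists.

```scala
// Curried form: one parameter list per argument.
def add(a: Int)(b: Int): Int = a + b

// Supplying only the first list gives back a function that waits for the rest.
val addTen: Int => Int = add(10)

// The standard library's foldLeft is curried: start value first, combining function second.
val total = List(1, 2, 3).foldLeft(0)(_ + _)

// Hypothetical helper: because the body is its own parameter list, it can be written as a { } block.
def repeat(n: Int)(body: => Unit): Unit =
  for (_ <- 1 to n) body
```

With repeat, a call such as repeat(3) { println("hi") } reads almost like a built-in control structure, which seems to be the same () to {} effect the refactoring above is after.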

Too bad that I still cannot figure out a real-life example where we need to use a curried function. If you know one, please comment on this post.


Power of Scala – Part 3

Enough comparison between Scala and Java. The first two articles in the series just help you get familiar with Scala syntax; they don’t show the actual power of Scala. In fact, I would have no motivation to pick up Scala if it didn’t give me power big enough to justify sacrificing time with my family. From here onward, I will note down the key features of Scala that differentiate it from Java.

Pattern Matching

Extend the power of switch/cases

  1. Support more than simple primitive type switch/cases
  2. Support case class
  3. Type match and assignment

Simple switch/cases

//primitive
def checkPrime(number:Int) = number match {
case 2 => true

case _ => false
}

// you can condense cases in one line and return other data type
def toYesOrNo(choice: Int): String = choice match {
case 1 | 2 | 3 => "yes"
case 0 => "no"
case _ => "error"
}

//you can compare different data types
def f(x: Any): String = x match {
case i:Int => "integer: " + i
case _:Double => "a double"
case s:String => "I want to say " + s
case _ => "something else" // without a default case, other types would throw a MatchError
}

//just check the type. If it matches, obj is bound to x with type Color, which is then returned and assigned to the cast variable.
var obj = performOperation()
var cast:Color = obj match {
case x:Color => x
case _ => null
}

//with this addition, you can make it look very similar to a mathematical definition:
def fact(n: Int): Int = n match {
case 0 => 1
case n => n * fact(n - 1)
}


Example of case class

case class Number(value:Int)

def checkPrime(n:Number) = n match {
case Number(2) => true
case Number(3) => true
case Number(5) => true
case Number(7) => true
case Number(_) => false
}

checkPrime(Number(12))

  1. The first statement is the key to the whole thing: defining a new case class “Number” with a single property.
  2. Case classes are a special type of class in Scala which can be matched directly by pattern matching.
  3. You can also create a new instance of a case class without the new operator
  4. Each case is a pattern built from Number and a different value; the match itself does not create a new instance.
  5. When Scala sees that the value being matched is of the same type, it looks inside the instance and compares its property values with the pattern.
  6. A case class supports inheritance (although one case class should not extend another case class).
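To make the points above concrete, here is a small sketch with made-up types (Shape, Circle and Rect are not from the original example). It shows case classes extending a common trait, being created without new, compared by value, and taken apart in a match.

```scala
// Hypothetical hierarchy: two case classes sharing a sealed parent trait.
sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rect(width: Double, height: Double) extends Shape

// The patterns pull the property values out of whichever case class matches.
def area(s: Shape): Double = s match {
  case Circle(r)  => math.Pi * r * r
  case Rect(w, h) => w * h
}
```

Because the parent trait is sealed, the compiler can warn when a match forgets one of the cases.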

How does exception handling use this?

Scala does not have checked exceptions, and it does not force you to catch exceptions either. Basically, it is up to you to catch them. To catch various exceptions, you don’t need multiple catch blocks in Scala. See the example below: it uses just one catch block with switch/cases to do the job.


import java.sql._
import java.net._

val conn: Connection = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "")
try {
val stmt: PreparedStatement = conn.prepareStatement("INSERT INTO urls (url) VALUES (?)")
stmt.setObject(1, new URL("http://www.codecommit.com"))

stmt.executeUpdate()
stmt.close()
} catch {
case e:SQLException => println("Database error")
case e:MalformedURLException => println("Bad URL")

// catch-all for anything not matched above
case e: Throwable => {
println("Some other exception type:")
e.printStackTrace()
}
} finally {
conn.close()
}
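The JDBC example needs a database driver to run, so here is a smaller self-contained sketch of the same idea. The function safeDivide is made up for illustration; it relies only on the standard exceptions thrown by String.toInt and integer division.

```scala
// One catch block, several cases: each case matches on the exception type.
def safeDivide(a: String, b: String): String =
  try {
    (a.toInt / b.toInt).toString
  } catch {
    case _: NumberFormatException => "not a number"
    case _: ArithmeticException   => "division by zero"
  }
```

Any exception type not listed in the cases simply propagates to the caller, since nothing forces you to handle it.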


Class in Scala – Part 2

Common knowledge in this space

When I first dealt with classes, the main concept I learnt was that a class encapsulates both related properties and behaviors. Then I started learning about constructors, access scope, mutability, type constraints, inheritance, etc. I would like Scala to address those OO concepts as well, or to provide a better abstraction for them.

Define a regular class

First, let me take the example used by a blog as a starting point.

Regular class definition

//default primary constructor
class Person {
  private var name = "Daniel Spiewak" // private field that could be changed
  val ssn = 1234567890    // public constant field
  var age = 0             // public field

  // public method
  def firstName() = splitName()(0)

  // import with method scope and round() from math can only be used in this method.
  // visible to subclass and enclosing class
  protected def guessAge() = {
    import scala.math._
    round(random() * 20)
  }

  //private method that can be accessed by object of the same class
  private def splitName() = name.split(" ")  

  //access-restricted to both the enclosing class and the enclosing package
  private[mypackage] def myMethod = "test" 

  //access-restricted to this instance but no others.
  private[this] def myMethod2 = "test2"
}


Rules:

  1. The primary constructor for a class is coded in-line in the class definition. Parameters of the primary constructor turn into fields that are initialized with the construction parameters.
  2. The primary constructor can be made private by adding the access modifier private before the parameter list.
  3. Class parameters:
    • val makes them immutable instance values, whereas
    • var makes them mutable.
  4. Class parameters can be preceded by an access modifier such as private or protected. By default, classes and methods are public.
    • protected limits access to subclasses only, unlike Java, which also allows access from other classes in the same package.
    • The modifier[package] notation in Scala gives you more control over visibility.
    • Scala has far fewer method modifiers than Java does, primarily because it doesn’t need so many.  For example, Scala supports the final modifier, but it has no abstract, native or synchronized method modifiers.
  5. Getter and setter methods are auto-generated, but in a different form from Java. To generate setXxx and getXxx as in a Java Bean, you need to use the annotation @BeanProperty.
    • If the field is private, the getter and setter are private. (NOTE: other objects of the same class can access the private field, i.e. it is class-private.)
    • If the field is a val, only a getter is generated.
    • If you don’t want any getter or setter, declare the field as private[this]. (NOTE: other objects of the same class cannot access it, i.e. it is object-private.)
  6. A class can have a companion “object” (i.e. a singleton) with the same name, but both of them must be in the same source file. They can access each other’s private features.
  7. Scala doesn’t force you to declare the return type for your methods.  Once again the type inference mechanism can come into play and the return type will be inferred.  The exception to this is if the method can return at different points in the execution flow (so if it has an explicit return statement).  In this case, Scala forces you to declare the return type to ensure unambiguous behavior. This “returnless” form becomes extremely important when dealing with anonymous methods.
  8. Access scope difference between Java and Scala
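Here is a minimal sketch of several of the rules above in one place. The Account class and its members are made up for illustration; they are not from the blog I borrowed the earlier example from.

```scala
// Private primary constructor; the parameters become fields:
// id is immutable (val), owner is mutable (var), balance is class-private.
class Account private (val id: String, var owner: String, private var balance: Int) {
  def deposit(amount: Int): Unit = balance += amount

  // Class-private: another Account instance's balance is visible here.
  def richerThan(other: Account): Boolean = balance > other.balance
}

// Companion object: same name, same source file, can reach the private constructor and fields.
object Account {
  def open(id: String, owner: String): Account = new Account(id, owner, 0)
  def balanceOf(a: Account): Int = a.balance
}
```

Outside the class and its companion, a.owner = "Anna" goes through the generated setter, while a.id has only a getter and a.balance is not accessible at all.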

An example demonstrating all modifiers in Scala

In Scala

abstract class Person {
  private var age = 0

  def firstName():String
  final def lastName() = "Spiewak"

  def incrementAge() = {
    synchronized {
      age += 1
    }
  }

  @native
  def hardDriveName():String
}

In Java

public abstract class Person {
    private int age = 0;

    public abstract String firstName();

    public final String lastName() {
        return "Spiewak";
    }

    public synchronized void incrementAge() {
        age += 1;
    }

    public native String hardDriveName();
}

Rules

  1. abstract is used at the class level but not at the method level. A method declared without a body is considered abstract.
  2. final can be used at the method level. (It can be put on a field as well, but a val is already immutable, so there it only prevents overriding.)
  3. A native method is specified with an annotation.
  4. There is no synchronized method modifier in Scala, but you can use a synchronized code block to achieve the same. However, we should consider moving to Scala’s actor-based concurrency model instead.

Singleton

Scala doesn’t support static members/methods. But you can use “object” to get the same capability. Basically, you can see an object as a singleton: you never create an instance of it yourself.

Companion object

    class MyString(val jString:String) {
      private var extraData = ""
      override def toString = jString+extraData
    }
    object MyString {
      def apply(base:String, extras:String) = {
        val s = new MyString(base)
        s.extraData = extras
        s
      }
      def apply(base:String) = new MyString(base)
    }
    println(MyString("hello"," world"))
    println(MyString("hello"))


Function in Scala – Part 1

Function Definition

def methodName(paramName: paramType, …): returnType = {//function body}

Rules:

  1. A single statement doesn’t need {}
  2. The semicolon after a statement is optional
  3. The return type is optional, as the Scala compiler infers it from the last expression in the body (no return keyword needed)
  4. The return type and the “=” can be omitted when the type is Unit (i.e. void in Java). Newer Scala versions deprecate this form in favour of writing “: Unit =” explicitly.
  5. A recursive function must specify the return type.
  6. A function with no parameters can be declared without parentheses, in which case it must be called with no parentheses.
  7. Vararg parameters are declared by appending an asterisk to the parameter type
  8. You can specify default arguments

Example

//this function will return true if character 'l' is passed in
def lls(p: Char): Boolean = { p == 'l' }

//returns Unit (void) but has >1 statement, so braces are needed
def box(s : String) {
val border = "-" * s.length + "--\n"
println(border + "|" + s + "|\n" + border)
}

//vararg parameters
def sum(args: Int*) = {
var result = 0
for (arg <- args) result += arg
result
}

//call it:
val s = sum(1, 4, 9, 16, 25)
//1 to 5 is a Range, not the individual Int arguments that sum expects. Use _* to pass it as an argument sequence
val s2 = sum(1 to 5: _*)

//bounded type parameter - T is a subclass of Number, input is a variable number of T
//(renamed so that it does not clash with sum above when both are defined together)
def sumNumbers[T <: Number] (as: T*): Double = as.foldLeft(0d)(_+_.doubleValue)

//default arguments
def decorate(str: String, left: String = "[", right: String = "]") = left + str + right

Function Call

Now you should have a good idea of how to define a function in Scala. Let’s look at some interesting areas related to anonymous functions and compiler syntax sugar.

Let’s start by defining a function:

def lls(p: Char): Boolean = { p == 'l' }

In StringOps, there is a count function that takes the function type specified above:
def count (p: (Char) ⇒ Boolean): Int

StringOps is Scala’s way of extending the capability of Java’s String.
"Hello".count(lls) // outputs 2

Methods with zero or one argument can be called without the dot and ( ). If there is a single argument, you can use {} instead of ( ).
"Hello" count lls  // outputs 2

Syntax Sugar

This kind of syntax sugar is quite confusing for a newbie like me. I believe it was created for a reason, and I notice that it is used quite extensively in DSLs. Hopefully, a syntax style standard will develop on top of this and make life easier. Until I see one, I need to try out all possible combinations and see how far Scala can go.

Here I demonstrate that the rules are correct and that more than one space can be added as long as it doesn’t confuse the compiler.
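A sketch of such combinations follows. The names isL, dotParens and so on are made up for this illustration, and I use a function value instead of the lls method so that every form takes exactly the same argument.

```scala
// A function value of the type that count expects.
val isL: Char => Boolean = c => c == 'l'

val dotParens   = "Hello".count(isL)     // regular call
val dotBraces   = "Hello".count { isL }  // {} instead of () for the single argument
val infix       = "Hello" count isL      // no dot, no parentheses
val infixBraces = "Hello" count { isL }  // no dot, braces
val extraSpaces = "Hello"   count   isL  // extra spaces do not confuse the compiler
```

All five values come out the same, because the forms differ only in how the call is written.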

Function Literal

(x:Int, y:Int) => //function body

Use of function literal – it is like an anonymous function in Java.
"Hello" count { p:Char => p == 'l' } // outputs 2

A function literal is shorthand for an object of class FunctionN (where N is the number of input parameters):
new Function1[Char, Boolean] {
def apply(p: Char): Boolean = { p == 'l'}
}

You can bind function literals to variables:
val add = (a:Int, b:Int) => a + b
add(1, 2) // Result is 3

Function literals are useful for passing as arguments to higher-order functions. They’re also useful for defining one-liners or helper functions nested within other functions.
"Hello" count { p => p == 'l' } // outputs 2 (type inference knows p is a Char)
"Hello" count { _ == 'l' } // still outputs 2 (if there is one parameter and its name has no significance, you can use _)


Session Management – Part 1

Session management is one of the key topics that all serious web developers and architects need to master. This article will go through several key topics with you. They are:

  • Persistent vs non-persistent web connections – web performance!
  • Concerns of using cookies – security and size limitations
  • Server-side session management challenges in scalable web applications
  • Achieve linear scalability through stateless servers - start moving the session to the client

Today, I will start walking through all these topics at a high level. A series of articles will be written to develop each topic further if necessary. Let’s start!

Persistent vs non-persistent web connections

  1. HTTP is a stateless protocol, and before HTTP 1.1 it did not keep persistent connections by default: each request made by a Web browser, for an image, an HTML page, or other Web object, was made via a new connection.
  2. HTTP 1.1 made persistent connections (i.e. Keep-Alive) the default, so a Web browser can establish a single connection through which multiple requests can be made.
  3. Either way, each HTTP request is stateless, so how can state be maintained across requests?
    • Normally, we keep the session on the server side and give the client a session id that it can use to link subsequent requests to the same session.
    • Normally, the client (often a web browser) stores the session id in a cookie.
    • However, if cookies are disabled, the session id is normally embedded in the URL (i.e. URL rewriting).
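The server-side half of this can be sketched as a map keyed by session id. SessionStore and its methods are hypothetical names for illustration only; a real container handles this for you, and this version is neither thread-safe nor expiring.

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical in-memory session store: session id -> attributes.
class SessionStore {
  private val sessions = mutable.Map.empty[String, mutable.Map[String, String]]

  // Called on the first request; the returned id is what goes into the cookie or URL.
  def create(): String = {
    val id = UUID.randomUUID().toString
    sessions(id) = mutable.Map.empty[String, String]
    id
  }

  // Later requests present the id to reach the same state. Returns false for an unknown id.
  def put(id: String, key: String, value: String): Boolean =
    sessions.get(id) match {
      case Some(attrs) => attrs(key) = value; true
      case None        => false
    }

  def get(id: String, key: String): Option[String] =
    sessions.get(id).flatMap(_.get(key))
}
```

Only the id travels with each request; everything else stays in the server's memory, which is exactly what makes replication an issue later on.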

Concerns of using cookies

What do we need to pay attention to when we store info in a cookie?

  1. Size limitation and security concerns.
  2. How long can a cookie last? By default, it expires when the browser exits. In Java, you can call cookie.setMaxAge(int) with a large number of seconds if you want the info to last long in the cookie. If you call setMaxAge(0), the cookie is deleted.
  3. Normally, we don't keep all state info in the cookie, as the information could be sensitive and we are not able to protect it because it sits in the client's filesystem. Apart from that, cookies are limited in size as well. Because of these two concerns, we normally just store the session id in the cookie and keep the session on the server side. This approach saves us bandwidth as well.
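To try out the max-age behaviour without a servlet container, this sketch uses the JDK's java.net.HttpCookie as a stand-in for the servlet Cookie class mentioned above (an assumption on my part; the two classes are different, though both take the max age in seconds). The helper name sessionCookie is made up.

```scala
import java.net.HttpCookie

// Builds a cookie that carries only the session id, never the session data itself.
def sessionCookie(sessionId: String, maxAgeSeconds: Long): HttpCookie = {
  val cookie = new HttpCookie("JSESSIONID", sessionId)
  cookie.setMaxAge(maxAgeSeconds) // seconds to live; 0 means "delete this cookie"
  cookie.setHttpOnly(true)        // keep page scripts away from the session id
  cookie.setPath("/")
  cookie
}
```

For example, sessionCookie("abc", 3600) lasts an hour, while sessionCookie("abc", 0) is already expired.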

Server side session management challenges

At first glance, keeping the session on the server side sounds like a great solution. However, when it comes to scale, it always raises concerns. Imagine you need to replicate client session state across multiple servers to achieve high availability. Both the replication time and the memory limit will keep your system from scaling linearly. To solve or minimize this, we selectively pick what kind of info we store in the session, use sticky sessions to avoid replicating every session across all the machines, or even try to store the state on the client if possible, for example with a rich client UI (e.g. Flex or Silverlight). A post will be written about this topic later on.

Transient vs Persistent State

  1. A session on the server can time out (after ~30 minutes of inactivity).
  2. A session on the server can be persisted to a file across Tomcat restarts.
  3. Persistent state should be stored in a database.
  4. Objects put in the session should be Serializable.
  5. Avoid putting too much info in the session, because we don't want to carry too much baggage during session replication. One server crashing because of memory depletion can spread to other servers via session replication. Not good! Should we reconsider storing the session on the client? This article talks about it.
  6. Session replication is needed to support failover. Sticky sessions are simpler but suffer data loss when the box goes down. We can designate one or two servers as backups to avoid losing the session. To go for the sticky session approach, we need to identify the "sticky" part: what can we use to link separate requests? Using the IP address can potentially overload a box, because some Internet service providers use a set of proxy servers to deal with many clients. This subject can be developed further. We will come back to it later!
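One possible "sticky" key is the session id itself rather than the IP address. The sketch below is only an illustration of that idea under my own assumptions: pickServer and backupServer are made-up names, and a real load balancer would also have to cope with servers joining and leaving.

```scala
// Route by session id: the same id always lands on the same server.
def pickServer(sessionId: String, servers: Vector[String]): String = {
  require(servers.nonEmpty, "need at least one server")
  servers(java.lang.Math.floorMod(sessionId.hashCode, servers.size))
}

// The next server in the list acts as the backup that also receives the session.
def backupServer(sessionId: String, servers: Vector[String]): String = {
  require(servers.size > 1, "need at least two servers for a backup")
  val primary = java.lang.Math.floorMod(sessionId.hashCode, servers.size)
  servers((primary + 1) % servers.size)
}
```

With this, a session is replicated to one backup only instead of to every machine in the cluster.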

Speed up your website via caching

Introduction

Caching is a crucial performance tuning strategy, especially when your system has a high read-to-write ratio. You can apply caching at different levels, from the client browser cache all the way to the disk cache on the server side. Let’s take a brief look at where we can cache, following the invocation path of a request:

  1. Client browser cache
  2. CDN network
    • A CDN is a network, like Akamai, to which a web site such as JustProposed.com can offload high-bandwidth static files like photos and videos, so that my web site doesn’t need huge bandwidth to run. Since bandwidth is a major expense, especially as we grow or when we get slashdotted (in which case we run out of bandwidth), a CDN looks interesting. However, Akamai is too expensive for us to use. So, we will go for the free network, Coral CDN.
    • Apart from the bandwidth, JustProposed.com has lots of non-USA users who sometimes find my site slow to use. So, a CDN gives us proximity advantages.
    • To use Coral CDN, you simply append nydu.net:8080 to the end of the hostname in the URL of your expensive resources. For example, http://www.justproposed.com/raydoris/myphoto.jpg becomes http://www.justproposed.com.nydu.net:8080/raydoris/myphoto.jpg
    • Coral looks great; the only problem I have with it is that it runs on a high port, so people behind proxy servers that don’t automatically support HTTP over anything but port 80 will have problems. To use Coral, follow this instruction.
  3. Reverse proxy server and content accelerator – Squid
    •  Why not use Apache as the reverse proxy instead of putting Squid in front of Apache? Here are some of the benefits of this setup. The main reason is that Apache dedicates a process to each request, which eats up lots of resources.
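Going back to the Coral CDN bullet above, the URL rewrite it describes can be sketched as a small helper. The function name coralize is made up, and the default suffix is simply the one given in the text; treat it as an assumption and check it against the Coral instructions before relying on it.

```scala
import java.net.URI

// Appends the CDN suffix to the hostname and keeps the path and query as they are.
def coralize(url: String, suffix: String = ".nydu.net:8080"): String = {
  val uri = new URI(url)
  val path = Option(uri.getRawPath).getOrElse("")
  val query = Option(uri.getRawQuery).map("?" + _).getOrElse("")
  s"${uri.getScheme}://${uri.getHost}$suffix$path$query"
}
```

You would apply this only to the expensive static resources (photos, videos), not to the dynamic pages.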

 

There are several things that you need to look at when you go for caching approach:

  1. What to cache? The data used by most web applications varies in how dynamic it is, from completely static to changing at every request. Everything that has some degree of stability can be cached. However, because resources (i.e. memory) are limited, I always pick the items that are most frequently accessed and/or most expensive to compute and retrieve.
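One simple way to live within a memory limit is a least-recently-used (LRU) cache, which is my own choice of eviction policy for this sketch rather than something prescribed above. It builds on java.util.LinkedHashMap in access order; the class name LruCache is made up.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Keeps at most `capacity` entries; the entry that was used longest ago is dropped first.
class LruCache[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
  // LinkedHashMap calls this after every insertion; returning true evicts the eldest entry.
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}
```

Frequently read entries keep moving to the "young" end of the map, so they survive while rarely used ones fall out. The class is not thread-safe as written.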

Application level caching (for J2EE)

JCS – Java Caching System

  1. Configuration
    • To understand the power of JCS, the best way is to look at its configuration file. To find out what each configurable parameter does, take a look at this article.
  2. Integrate with Spring
    • To use JCS with Spring, take a look at this article. It talks about how to create a wrapper or interceptor for your DAO and inject it into your service for caching purposes. To implement the cache as an aspect with full control of what and how to cache, it doesn’t use the declarative Spring module caching approach. Regular dependency injection can do the trick!
  3. Distributed caching
    • JCS is a front-tier cache that can be configured to maintain consistency across multiple servers by using a centralized remote server (client-server) or by lateral distribution (peer-to-peer) of cache updates. 

Reference

  1. Speed up your LAMP stack with lighttpd
  2. Squid and Apache on the same server – have Squid listen on port 80 and Apache listen on port 8080
  3. Squid configuration variable

 


Streaming data to your grid

Push data to client

A traditional web application is based on the request and response model: information is delivered as a single payload, and then the connection to the client is closed immediately. To keep the client in sync, we normally poll the server periodically. This approach may generate unacceptable load on the server. To solve this problem, we want a push mechanism from server to client. This is why Comet was defined. Comet is a generic term describing various approaches to sending data asynchronously from a Web server to a client without the client explicitly requesting the data. It is an essential technique for any real-time event-driven web application, where the majority of events occur on the server and data must be “pushed” frequently to the client. To achieve this, Comet servers must maintain a continuous connection to each client for the duration of the session.

 

OK. How do we maintain a continuous connection to each client for the duration of the session?

If you try to adapt a traditional server to the Comet methodology, it may not scale and often fails after a few thousand simultaneously open connections. A true Comet implementation requires a very different kind of server architecture to be efficient and scalable, such as Liberator (a solid Comet server that is used by the financial industry; however, it is written in C and is not open source, although a FREE edition is distributed).

To understand this statement a little better, we need to know how traditional web containers handle requests. They follow a one-thread-per-request model.

  1. The client, typically a browser, sends a request for a resource to a web server.
  2. The server has a listening thread that keeps track of incoming connections.
  3. When a request arrives, the server uses one process or thread to process the request.
  4. The resource is returned to the client and the connection is closed.

In this model, the number of requests that can be served in a second depends on two things:

  1. How many threads are there to handle the client requests
  2. How long it takes to serve one request.

If all server threads are busy, incoming requests are put in a queue. The server returns to the queued requests when threads become free. The number of requests handled per second can be greater than the number of allowed simultaneous connections. This is possible because the time required to process a request is very short. In other words, you can serve more requests in a second than you have threads.
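The arithmetic behind this can be written down as a one-line estimate. The function name and the numbers in the note below are made up for illustration, and the formula ignores queueing and any other overhead.

```scala
// Upper bound on throughput: each thread can finish (1000 / serviceTimeMillis) requests per second.
def maxRequestsPerSecond(threads: Int, serviceTimeMillis: Double): Double =
  threads * (1000.0 / serviceTimeMillis)
```

With 200 threads and 50 ms per request, the bound is 4000 requests per second, far more than the 200 threads. If every request holds its thread for 30 seconds instead, the same 200 threads manage fewer than 7 requests per second.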

However, there is one breed of applications that needs to hold onto connections. Think of applications that need real-time data coming to clients (stock tickers), or applications where low latency is required. In the traditional web model above, the browser has to reconnect to get new data (polling). If updates can happen with high frequency (e.g. a chat application), then the polling frequency also has to increase. An alternative to high-frequency polling is a push-based application.

In a push-based application, once the browser connects to the server, the server maintains the connection until the browser times out (the server response stream is not closed) and keeps flushing data down the connection as it becomes available. In a servlet container, to hold the connection, your thread cannot exit the service method. Otherwise, the response stream will be closed. So what you do is block the thread on some condition within the service method. When push data becomes available, the thread writes to the response stream and enters the blocked state again. So as long as you hold onto the connection, you cannot return the thread to the thread pool. And as more and more “push” connections are established, you will run out of threads! To remedy the problem, the possible solutions are:

  1. Increase # of server threads.
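The thread starvation described above can be reproduced with a plain thread pool. This is only a simulation under my own assumptions: the pool stands in for the container's worker threads, each task stands in for a service method blocked on "push data available", and queuedWhenPoolIsFull is a made-up name.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Returns how many "push clients" are left waiting in the queue because every worker thread is blocked.
def queuedWhenPoolIsFull(poolSize: Int, pushClients: Int): Int = {
  val pool = new ThreadPoolExecutor(
    poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue[Runnable]())
  val dataAvailable = new CountDownLatch(1)

  // Each task holds its thread until push data arrives, like a blocked service method.
  for (_ <- 1 to pushClients)
    pool.execute(new Runnable { def run(): Unit = dataAvailable.await() })

  val waiting = pool.getQueue.size // clients that could not get a thread

  dataAvailable.countDown() // release the blocked threads and clean up
  pool.shutdown()
  pool.awaitTermination(5, TimeUnit.SECONDS)
  waiting
}
```

With 2 threads and 5 push clients, 3 clients are stuck in the queue until a thread is released. Adding threads moves the limit but does not remove it.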

Flex Push

There is confusion about whether BlazeDS supports real-time messaging. Yes, it does. In fact, BlazeDS has a full spectrum of channel types ranging from simple polling, to near-real-time polling, to real-time streaming.

  1. Simple polling – ping the server from Flex client using the traditional request and response model
  2. Near-real-time polling (long polling) – Instead of acknowledging right away, the server can hold the polling request until there’s a message for the client. This ensures that messages are delivered to the client as soon as they become available. The caveat of long polling is the thread limitation in most application servers. At this moment, BlazeDS cannot support more than a few hundred long-polling clients on most application servers. However, this problem could be resolved once servers like Tomcat start to support asynchronous, non-blocking connection threads. Update: Now Tomcat 6 supports NIO.
  3. Real-time streaming – BlazeDS supports real-time message streaming over AMF and HTTP. Unlike long polling, which closes and reopens the connection upon receiving a message, streaming keeps the connection open at all times. Streaming suffers from the same thread blocking issue as long polling. A cap must be set so the server is not hung by idle threads.

The reason why people are confused is that Adobe didn’t release its proprietary push solution, RTMP, to BlazeDS. So, RTMP isn’t available as a channel in the BlazeDS configuration files. BlazeDS lives in a servlet container and hence is constrained by the one-thread-per-connection limit, whereas LCDS has NIO-based channels that can scale up to 1000s of requests. On the other hand, BlazeDS has the advantage that it works over port 80/443, whereas LCDS uses some other port for persistent connections, which requires firewall configuration. Once the servlet that implements BlazeDS is revved to support Comet Events under Tomcat 6, and then Jetty Continuations, the long polling technique will be fine.

UPDATE: We are waiting for a solution that supports Comet Events under Tomcat 6. Then BlazeDS can be coupled to the Tomcat NIO HTTP listener and be able to scale as well as any NIO based server software.

I have learnt from this article that you can create a channel set on the client side, so Flex can fail over to other channels until it gets connected or the list is exhausted.

Marc has put effort into building a better data grid, like a spreadsheet, in Flex. (check this out)

Reference

Here are the references I used for this article

  1. Tuning Apache and Tomcat for Web 2.0 comet application
  2. Performance of Grids for Streaming Data. This shows you the performance numbers of various frontend technologies. Again, Flex shows a good result.
  3. Are raining comets and threads? – Comet Daily
  4. Comet & Java: Threaded Vs Nonblocking IO
  5. JDK 1.6 uses epoll to implement NIO
  6. BlazeDS dev guide
  7. Achieve performance breakthrough using BlazeDS. Farata Systems put effort into writing its own NIO channel that runs on Jetty 7 and got promising results.

 


Plenty of Fish – Cash cow!

A site called “PlentyOfFish.com” is currently getting 30 million hits a day. The number doesn’t blow me away. However, what surprises me is that this site is basically operated by a single man, “Markus Frind”. How did he achieve that? If you want to hear how he does it, you can go to his interview from this link. Otherwise, you can read the summary I made from his interview.

The stuff I learnt from Markus

You may think that Markus must spend a lot of $$ to maintain his site. A picture of a server farm may pop up in your head. Hahaha… all he needs is just 1 web server and 3 database servers. This is a cost that you and I can afford. No need to write a business plan and wait for VC $$ nowadays.

Here are some quick tips from Markus

  1. You need a lot of RAM. RAM is cheap, so go ahead and power up your box with tons of it!
  2. Markus uses the Akamai CDN to offload the bandwidth of serving images across different locales.
  3. Separate database read and write operations.
  4. Markus uses one database as the master for writes and 2 databases as slaves to handle the searches (reads). According to him, radius-based searches demand lots of resources. “If you have one system to do just one thing, it will do it much efficiently.”
  5. Markus puts plenty of RAM into both the web and db servers. “If you can load your whole db in the RAM, do it!”
  6. Optimizing db access is the key to handling lots of requests.
  7. Denormalization is necessary if you want to reduce the number of joins that can potentially slow down your queries.
  8. PlentyOfFish.com relies purely on “Word of Mouth” marketing. Do things right and your users will spread the word for you. Cheapest marketing strategy ever!
  9. PlentyOfFish.com is a FREE site. Because it is free, it doesn’t have strict requirements such as high uptime. It can go down without much of an issue.
  10. PlentyOfFish.com is monetized solely through advertising such as Google Ads. From this alone, Markus is making around 10 million annually. Amazing!
  11. PlentyOfFish.com runs purely on the Microsoft stack: IIS, ASP.NET and SQL Server. In fact, you could build it with other stacks such as Apache, Spring and MySQL.
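To make tips 3 and 4 a bit more concrete, here is a minimal Python sketch of read/write splitting. It is not Markus's code (his stack is ASP.NET and SQL Server); the class name, the server names and the rule of looking only at the first SQL keyword are all my own assumptions for illustration.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the read-only servers."""

    # Assumption: the first keyword is enough to tell a write from a read.
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

    def __init__(self, master, replicas):
        self.master = master
        # Round-robin over the read servers; fall back to the master if there are none.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._replicas)


# Hypothetical server names: one master and two read-only copies.
router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
targets = [
    router.route("INSERT INTO profiles (name) VALUES ('a')"),
    router.route("SELECT * FROM profiles WHERE city = 'Vancouver'"),
    router.route("SELECT * FROM profiles WHERE city = 'Toronto'"),
    router.route("select count(*) from profiles"),
]
print(targets)  # ['master-db', 'replica-1', 'replica-2', 'replica-1']
```

A real application would hand back database connections instead of names, but the routing decision stays the same.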

I love to see how people like Markus beat giants like Match.com. One man beats hundreds of people with a simple system setup. Incredible! Folks, there is no excuse for whining about having no $$ to start your business!

Although Markus made it sound easy during the interview, there are areas the interviewer didn’t cover:

  1. The PlentyOfFish.com web front end does not look good. How could it attract the first set of users in the first place? FREE
  2. If you go to a FREE site without data, you may leave right away. How did PlentyOfFish.com attract its first real users? Did Markus crawl competitors’ data to bootstrap his site?
  3. PlentyOfFish.com makes $$ purely from Google AdSense. However, according to John Chow, AdSense is not a good place to make $$. Why is that?

What could possibly go wrong with his approach:

His database architecture is the traditional master-slave approach. It can offload the read operations but not the writes. Obviously the master becomes the write bottleneck and a single point of failure. As load increases, the cost of replication increases as well: replication costs CPU, network bandwidth and disk IO, and the slaves fall behind and serve stale data. The folks at YouTube had a big problem with replication overhead as they scaled. This problem can be tackled by sharding/federation. I will discuss this topic later.
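Until I get to that topic, here is a rough Python sketch of the idea behind sharding: each user lives on exactly one database, chosen by a stable hash of the user id, so writes are spread out instead of all landing on one master. The shard names and the choice of MD5 are assumptions of mine, not something taken from PlentyOfFish or YouTube.

```python
import hashlib

# Hypothetical shard names; each one would be its own database server.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(user_id, shard_count):
    """Map a user id to a shard index with a hash that is stable across runs."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def database_for(user_id):
    """Pick the database that owns all reads and writes for this user."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


print(database_for(42) == database_for(42))  # True: same user, same shard
```

The catch with plain modulo hashing is that changing the number of shards moves most users to a different shard, which is one reason real deployments often use a lookup table or consistent hashing instead.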

 


Amazon Web Service Solutions

When we talk about SOA, I think of Amazon. It is the company that took SOA to the next level, proving to the world that it is a viable solution. Great! I decided to put some time into learning from Amazon by reviewing the web services it provides, reading the related interviews and blogs, studying how to build an application on top of its infrastructure, and developing an application that consumes data from its Web Services. Anyway, I believe the best way to learn SOA is to get a taste of the services provided by a company that relies heavily on it to scale its business.

Before I delve deeper, I need to clarify one thing. Many people use the terms SOA and Web Service interchangeably. To be honest, I was one of them. However, by definition, they are not the same. SOA is about design; Web services are a specific technology set that supports distributed computing. Web services make it easier to create a service-based system, but only if your developers are using SOA design principles, where functions are packaged into modular, shareable, distributable services that can be used and reused by multiple consumers.

At Amazon, each service is independent and encapsulates 3 things: data, business logic and a public service interface. Each service owns its data, and that data is never accessed directly by other services. According to its CTO, this is the core architecture that scales Amazon.
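To picture what “each service owns its data” means, here is a tiny Python sketch. It is only an illustration of the principle, not Amazon's code: the two service names, the item id and the in-process method calls (standing in for web service calls) are all made up by me.

```python
class InventoryService:
    """Owns the stock data; other services may only use the public methods."""

    def __init__(self):
        self._stock = {"book-123": 2}  # private data, never touched from outside

    def reserve(self, item_id, quantity):
        """Public interface: reserve stock if enough is available."""
        available = self._stock.get(item_id, 0)
        if quantity > available:
            return False
        self._stock[item_id] = available - quantity
        return True

    def available(self, item_id):
        """Public interface: report how many units are left."""
        return self._stock.get(item_id, 0)


class OrderService:
    """Owns the order data and talks to inventory only through its interface."""

    def __init__(self, inventory):
        self._inventory = inventory
        self._orders = []

    def place_order(self, item_id, quantity):
        if not self._inventory.reserve(item_id, quantity):
            return None  # not enough stock, no order created
        self._orders.append((item_id, quantity))
        return len(self._orders)  # order number


inventory = InventoryService()
orders = OrderService(inventory)
first = orders.place_order("book-123", 2)
second = orders.place_order("book-123", 1)
print(first, second, inventory.available("book-123"))  # 1 None 0
```

Because OrderService never reads the stock table itself, the inventory team could change how the stock is stored without breaking anyone else.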

Video Presentation

Jinesh Varia is an evangelist from Amazon. In his presentation, he shows how to build a regular-expression based search engine called “GrepTheWeb” on top of the Amazon infrastructure: SQS, SimpleDB, EC2 and S3. The most interesting thing he mentions in this presentation is the on-demand architecture powered by Hadoop and the Amazon infrastructure. “At time t0, you have no infrastructure. At time t1, when regular expression comes in, the system reaches the execution phase and the whole infrastructure is ready for it. At time t2, the request is fulfilled, the whole infrastructure is gone…” This gave me a taste of cloud computing and how powerful it can be.

Web Resource

High Scalability posted an article about Amazon’s architecture. The author follows up on different resources and consolidates the key information he found.


Powerful Full Text Search – Part 3 Solr

Introduction of Solr

Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called "indexing") via XML over HTTP (RESTful). You query it via HTTP GET and receive XML results.

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces – XML and HTTP
  • Comprehensive HTML Administration Interfaces
  • Scalability – Efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture

Set up Solr

To set up Solr, you should follow this guideline. After setting up Solr, you practically have an indexing service up and running.

The HTTP/XML interface of the indexer has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:

  • http://[hostname:port]/solr/update
  • http://[hostname:port]/solr/select

To add a document to the index, we POST an XML representation of the fields to index to the update URL. You can also delete documents, or update one by re-posting it with the same unique key. All change operations must be committed before they are flushed to the file system. Once we have indexed some data, an HTTP GET on the select URL does the querying.
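As a rough illustration of the two access points, the Python sketch below builds the XML body for an add request and the URL for a query. It does not talk to a server. The host and port, the field names and the sample document are assumptions of mine; the field names in particular must match whatever your schema defines.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Assumption: Solr is running locally on the example port.
BASE_URL = "http://localhost:8983/solr"

# Body to POST to the update URL after adding documents, to make them visible.
COMMIT_XML = "<commit/>"


def add_document_xml(fields):
    """Build the <add><doc>...</doc></add> body for one document."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        parts.append(
            "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        )
    parts.append("</doc></add>")
    return "".join(parts)


def select_url(query, rows=10):
    """Build the GET URL for a query against the select URL."""
    return BASE_URL + "/select?" + urlencode({"q": query, "rows": rows})


# Hypothetical document; POST this payload to BASE_URL + "/update" as text/xml.
payload = add_document_xml({"id": "post-1", "title": "Solr & Lucene"})
print(payload)
print(select_url("title:solr"))
```

Sending the payload and then COMMIT_XML to the update URL with any HTTP client (urllib.request, curl and so on) should index the document, and a GET on the printed select URL should return the XML results.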

Powerful features Behind Solr

If you followed the guideline above, you are already familiar with indexing, searching and facet browsing. Now let’s get down to how to make Solr a scalable solution with great performance.

Caching

TBA

Distribution and Replication

For applications that receive large volumes of queries, a single Solr server may not be enough to meet performance requirements. Therefore, Solr provides mechanisms for replicating the Lucene index across multiple servers that are part of a load-balanced suite of query servers. The replication process is handled through a combination of event listeners enabled through the solrconfig.xml file and several shell scripts (located in solr/bin of the example application).

In a replicating architecture, one Solr server acts as the master server, providing copies of the index (called snapshots) to one or more slave servers that handle query requests. Indexing commands are sent to the master server and queries are sent to the slave servers. The master server can create snapshots manually or by configuring the <updateHandler> section of solrconfig.xml to trigger snapshot creation when commit and/or optimize events are received. In either the manual or the event-driven process, the snapshooter script is invoked on the master server, creating a directory on the server named snapshot.yyyymmddHHMMSS where yyyymmddHHMMSS is the actual time the snapshot was created. The slave servers then use rsync to copy only those files in the Lucene index that have been changed.

<listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
    <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
    <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
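Because the snapshot directories carry their creation time in the name, picking the newest one is just a matter of parsing that suffix. The Python sketch below shows the idea only; it is not the logic of the actual Solr shell scripts, and the directory names in the sample list are invented.

```python
from datetime import datetime

PREFIX = "snapshot."


def snapshot_time(name):
    """Return the creation time encoded in a snapshot directory name, or None."""
    if not name.startswith(PREFIX):
        return None
    try:
        return datetime.strptime(name[len(PREFIX):], "%Y%m%d%H%M%S")
    except ValueError:
        return None  # not a snapshot.yyyymmddHHMMSS name


def latest_snapshot(names):
    """Pick the most recent snapshot among the given directory names."""
    dated = [(snapshot_time(n), n) for n in names]
    dated = [(t, n) for t, n in dated if t is not None]
    return max(dated)[1] if dated else None


# Hypothetical contents of the master's data directory.
names = ["index", "snapshot.20080301120000", "snapshot.20080302093000", "snapshot.tmp"]
print(latest_snapshot(names))  # snapshot.20080302093000
```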

Reference

Below are some cool references I found:

  1. Search smarter with Apache Solr, Part 1: Essential features and the Solr schema
  2. Search smarter with Apache Solr, Part 2: Solr for the enterprise
  3. Advanced Lucene

 

 
