Archive | Data Intelligence RSS feed for this section

Flex Hacking Series Part 1 – Event Model

Event Model

Event Flow

The idea is simple. Here is the regular event flow: Users interact with the UI, event is generated, broadcast via the event dispatcher (bubbling up the display hierarchy if enabled) and captured by any registered listeners, and a set of actions is taken in response. To understand it a bit more in detailed, you can check this article and play with its demo. In short, under the hood, it has 3 phases: capturing, targeting and bubbling. In the display list, from the top, it always starts from Stage as the root, then SystemManager, then your Application. The event is created by the flash player and travel down from Stage to the target component and bubble it up back to Stage (if enabled). In this round trip, it will trigger the event listeners'  actions. Most components communicate with others using events which conveys useful information and data but only visual components (objects in the display list) can participate in the event flow described above.

in practice, you normally just need to worry about registering listener to the target component for a particular event propagated from it. You seldom need to understand the event flow above in detailed. However, if you want to exercise more control on the event like stopping the event from propagating, you need to understand how it works first.

Cancel/ Stop the event

Within Flex, by default, events only broadcast themselves to their parent component. If you want the event to broadcast to its parent's parent (and all the way up your component chain in the display object hierarchy), then you tell that event to bubble. If you don't want any component in chain cancels your event via stopPropagaton() or stopImmediatePropagation(), you can make it non-cancelable during event construction. The difference between stopPropagation and stopImmediatePropagation is that stopImmediatePropagation will not only prevent the event from moving to the next node, but it will also prevent any other listeners on that node from capturing their events.

Some events have an associated default behavior. For example, the doubleclick event has an associated default behavior that highlights the word under the mouse pointer at the time of the event. Your event listener can cancel this behavior by calling the preventDefault() method. 

Define your own Custom Event

Now you understand the event flow. Next, you need to know how to create a custom event to carry additional info. This article helps you to achieve this goal.

Leave a comment Continue Reading →

Postgresql – any easier way to create a sample database?

матраци

I wanted to have my new code test against the real data in production. According to the typical company policy, I should not have any of my non-production code run against production data. Period. It makes a lot of sense. So, I decided to build my sample database in test environment and have subset of the production data loaded into it. I just needed one table to play around and I don't want to backup the whole table because the size of the table in production is too big for me. Using the popular postgresql, I expected this kind of request could be fulfilled easily but it eventually took me more than I had thought. Why so complicated? Did I miss anything? Let me show you what I did and please let me know if you have an easier way to achieve this.

Leave a comment Continue Reading →

Learning Hive

Starting to learn Hive

As I mentioned in my last article,  I was getting excited about the potential of Hive. Today, I decide to start my journey to learn this. I found a great introductory video that gives you a nice warm-up of using Hive (A basic knowledge of how hadoop and mapreduce work would be helpful for you to digest the material inside).

Below are some highlights from this video

Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed how close the SQL syntax like the regular SQL for RDBMS. Below are some SQLs used in this tutorial.

//———- Set up your tables in HIVE —————–
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE;

DESCRIBE shakespeare;

//———- Load data into Hive table from Hadoop HDFS ——————-
LOAD DATA INPATH “shakespeare_freq” INTO TABLE shakespeare;

//———- Query against the data using hive sql interface ————–
select * from shakespeare limit 10;
select * from sakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

//show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

//———- Create a merge table and populate it using dataset joining by 2 different tables
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);

//———- Query the merge table ———————
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;

To prepare the data for Hive to load in, the demo uses another mapreduce job to achieve. Remember to delete the log before doing Hive table load.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq ‘\w+’

//remove the mapreduce job log
hadoop fs -rmr shakespeare_freq/_logs

Often time, large scale data processing system always IO bound. So for mapreduce job, your mapper is always waiting for data to load from disk. Hadoop mitigates the problem via during parallel load from lots of hard drives. However, a single hard drive is still max out at 75MB/s read as physical limit and nothing we can do about this. In order to achieve good speed, the key is to eliminate # of hadoop pass

Since Hive is on top of Hadoop’s HDFS, it will have the same restrictions as it. So, you cannot do UPDATE, DELETE and INSERT records as regular RDMS. However, you can do bulk load to add more new files (data) to the table and you can do delete a file from Hive.

Hive needs to store metadata of the tables out from the HDFS. You can use regular rdms to achieve the job. But when you start Hive locally, it will seek for the local metastore. So, in distributed environment, you may need to centralize the metastore in a remote location. There is wiki on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gama, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase
Leave a comment Continue Reading →

Hive on Amazon EC2 cloud

adserving-ec2-hive-system-arch

 

I ever worked for a display ad network company that collects over 400 million of impression/ click logs per day. With this amount of data, my ex-company bought a supercomputer and cross their fingers that it can handle the grow in both volume and analytic demand of the data. It is obviously not a scalable solution. However, what is the best solution?

Although I haven’t worked for this company anymore, it is still an interesting problem to solve. I have a great friend who proposed a shared nothing solution for this company. The solution is to partition the data across a set of Postgresql databases and put Greenplum on top of them to parallelize the query —there is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture). I like this approach. The only thing is that Greenplum is not free and it may be difficult for a startup to face this upfront cost. Apart from that, this setting requires all the databases are running on the same network that hindered us to move this in the elastic cloud like Amazon EC2.

Later on, I joined a great company in the same industry that seeks for a solution in the cloud to host its data warehouse. So, I got a  chance to revisit this problem. During the research, I came across an interesting technology – column-based database (eg. infobright and lucid db). The idea of column-based data store is that traditional database stores and fetches data in row from data files into the memory. It is inefficient if your query only requires few columns for computation. So, column-based data stores your data in column with effective compression algorithm due to all values in it has the same data type. This solution is great but it doesn’t do MPP (ie. massive parallel processing) and it is also not ready for cloud yet.

Here comes another solution. That is Hive on top of Hadoop on top of Amazon cloud. It is an interesting idea. Check out this video to learn about this.


If you are not sure what Hadoop is and want to get some warm up in massive computing. I suggest you go through the following 5 excellent Google lectures.


Leave a comment Continue Reading →

How to build data warehouse

Operational databases are most commonly designed using normalized modeling, often using third-normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.Example: Order-Management Schema for operational database

relatonalmodel.JPG

Data warehouses differ from operational databases in the way they are designed; they are optimized for efficient querying and not for updating. Data warehouses provide a read-only version of the data in the operational databases, which is optimized for querying. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas produced are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit to a web page, etc. Example: Order-Management Dimension Schema

dimensionmodeling.JPG

The key benefits of data warehouse are simplication and consolidation of data. It normally gathers data from different operational databases into single dimensional model for reporting and analysis purpose. On the other hand, dimensional modeling offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. The approaches we take to reduce the complexity are (1) We try to model one aspect of the system for each DM schema. (2) We can denormalize the schema to reduce number of joins. ETL Process Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse. There are some areas you need to pay attention when you perform the ETL:

  1. During extraction, you will put a lot of strains to the operational database. To deal with this problem we can replicate a low-cost copy of the operational database on the warehouse machine before doing extraction. The SQL output of the extraction process can be a CSV file.
  2. Transformation can be computing summary data, converting postal code into geo-code (ie. lat and long) that powers"within X miles" queries. You can use Perl to do this job. The output of transformation may be another CSV file.
  3. Finally, you load the data into CSV into dimensional model. To speed up the load, in MySQL, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load, we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading via TRUNCATE command.
  4. You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it's full we swap it into place with RENAME.This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.

Quick Tips

  1. CSV format isn't a standard. Use XML can solve character issue but it might not perform as well due to formatting overhead.
  2. Transform is not always needed. If not, use "SELECT … INTO TABLE" to provide a straight database-to-database extract-and-load.
  3. Incremental load is highly desirable. Use trigger can achieve that.
  4. Operational database uses MySQL's InnoDB backend, providing referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse for better performance as it is read-only and transactional feature is not needed.
  5. MySQL does not support for bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.
Leave a comment Continue Reading →

Flex Startup Sequence

Magic behind the scene

I always wonder how my Flex application displayed on the Flash Player in browser. Why decompile Flex SWF will give me 2 frames movie? What is SystemManager and how can I get a handle of it? Many of these kind of questions are at the lower level. The level that makes Flex application possible. Normally, we don’t need to dive into this to write a Flex application. However, it would make me feel more comfortable to understand this before I advocate Flex as the main part of our company’s UI strategy. 

To understand how your Flex application got loaded and display on Flash Player, you can read this great article. I am going to put down the sequences in steps below:

Flex is a 2-frame movie

The first frame of a Flex SWF contains the SystemManager, the Preloader, the DownloadProgressBar and some glue helper classes. Remember, the Preloader is what creates the DownloadProgressBar control which displays the progress of a Flex application downloading and being initialized. The second frame of a Flex SWF contains the rest of the Flex framework code, your application code and all of your application assets like embedded fonts, images, etc.


By creating a 2-frame movie, Flex applications can take advantage of the streaming support built into the Flash Player and a preloader can appear before all of the Flex framework code and your application code are downloaded.

The .swf format is a progressive download format which means that Flash player can access content on frames as they download without having to wait for the entire file to download.

Here are the steps:

  1. First, enough bytes for frame 1 are streamed down to the Flash Player.
  2. The Flash Player executes those bytes by creating a SystemManager instance.
  3. SystemManager instruct the Flash Player to stop at the end of frame 1.
  4. SystemManager then goes on to create the Preloader which creates the DownloadProgressBar control and pops that up on the client screen.
  5. The Preloader then starts tracking the rest of the bytes streaming in from the Flex SWF. 
  6. Once all the bytes for the Flex framework and application code are in, the System Manager goes on to frame 2 and instantiates the Application instance.
  7. Once the Application instance has been created, the SystemManager sets Application.systemManager to itself. This is how you, the application developer, can access the SystemManager at a later time.
  8. The Application dispatches the preinitialize event at the beginning of the initialization process.
  9. Application goes on to create its children. The method createChildren() is called on the application. At this point each of the application’s components is being constructed, and each component’s createChildren() will be also called. For detail, look at component lifecycle section.
  10. The Application dispatches the initialize event, which indicates that all application’s components have been initialized. However, at this state, all the components are not yet laid out.
  11. Eventually, once all the Application child controls and containers have been created, sized and positioned, the Application dispatches the creationComplete event.
  12. Once the creationComplete event has been dispatched, the Preloader removes the DownloadProgressBar control and the SystemManager adds the Application instance to the Flash Player display list. (The Flash Player display list is basically the tree of visible or potentially visible objects that make up your application. When you add and remove child components to your application, your basically just adding and removing them from the display list).
  13. Once the Application is added to the Flash Player display list, the Application dispatches its applicationComplete event
  14. The Application has been created and is up on the screen ready to be interacted with.
     

Component Architecture

 This is the best video I have found that discuss the component lifecycle in detail – Thanks for Deepa.

Below is summary I took from the video above, in case you don’t want to spend an hour to listen to this. However, I really think you should. To start out, Deepa introduced component and skinning architecture “Spark” that is part of Flex 4 Gumbo. Spark is built on top of Halo (ie. Flex 3 component architecture) and components using Halo or Spark can co-exist. Then,  Deepa put her focus to talk about how to develop her custom video component on top of Halo and Spark for comparison. Spart is great that it can factor out the layout code from component.

Halo component lifecycle can be separated into 3 phases:

Phase 1 – Initialization

  1. Construction
    • Choose the right base class and provide default constructor (ie. zero argument). By extending UIComponent, you inherit all of the lifecycle methods, events and properties. Also, since UIComponent extends from EventDispatcher, your component inherits the ability to listen and dispatch events.
    • Best practice: Call super() and add event listener. Minimal work should occur here.
  2. COnfiguration
    • setter and getter for properties.
    • Involve in invalidation and validation cycle. Detail later.
    • Best practice: Setter should not expect the internal children have been created and you want your setter to be fast. Use _xxx for storage variable and have dirty flag for your variable. Throw events out when the internal state of your component get changed.
  3. Attachment
    • Component are added to the flash display list by its parent via addChild() call. Without attachment, component lifecycle will be stalled. Nothing is going to happen to your component. So, this is an important step.
    • Display list is a tree of visible or potentially visible objects in your application. At the root of the display list is your main application. Thing like containment hierarchy and rendering order are all maintained by the display list.
  4. Initialization- 5 lifecycle actions occur here.
    • preinitialize event is dispatched. It signifies that you as component that has been added to the display list by your parent via addChild().
    • createChildren() – walk through all your children, create, configure and attach them to the display list. Best practice: Call super.createChildren(), construct if not exist and attach your children via addChild(). If your child component is dynamic and data driven, use commitProperties() b/c it gets called at every invalidation and validation cycle. Halo rules: Container –> nested structure of UIComponents –> MovieClip, Video, Shape and Sprite.
    • initialize event is dispatched. It signifies that you and all your children are created and attached.
    • First invalidation/ validation pass occurs,
    • creationComplete event is dispatched. Ready for prime time.

Phase 2 – Update

At this phase, your component is fully initialized and ready for usage. Now, it needs to know how to update. And update occurs when its internal state has been changed by like user interactions. To respond to the changes, component uses the invalidation and validation cycle. The key here is to flag the changed variable dirty during invalidation and later handles it during validation right before rendering. So, you can have many invalidation and one validation that gives better performance via avoiding repetitive work. To better understand this approach, Deepa talks about Flash Player Elastic Racetrack that has 2 parts: code execution and rendering. If either part taking too long, Flash player cannot get its job done faster than 1 frame per second. You will see your application with lag and halt, that is bad! 

  1. Invalidation/ Validation can be split into 3 phase.
    • InvalidateProperties –> commitProperties
    • InvalidateSize –> measure (measure may not be called. Don’t have your code depends on it)
    • InvalidateDisplayLIst –> updateDisplayList

Phase 3 – Destruction

  1. Detachment
  2. Garbage Collection

 

What is Mixin?

When you put the [Mixin] metadata just above your class definition and add a static init function to the class like so,

[Mixin]
public class Model {   
  public static function init (systemManager : ISystemManager)   {     
    trace ("I get called first")   
  } 
}

The static init function will be called as soon as your application loads (assuming this class is referenced somewhere in the app), much like static initialization blocks in Java. This is useful if you have some code that you want to run before any of the other code in the class.

Reference

Below are some interesting articles I found related to this article

  1. Create Custom Preloader, Ted has an article related to this too
  2. Speed up startup loading time
  3. Dynamic loading a new custom theme without fraction seconds delay
  4. Deepa presentation in MAX
  5. Introduction to mixins

 

Leave a comment Continue Reading →

Flex Annotated Charting

Recently, I want to extend the LineChart in Flex. I want to have line chart with event annotated like Google Finance.

 
First of all, I googled the Net to see whether anyone had already done it. It was even better if I could find any open source project related to this. Below are the interesting things I found:
  1. Dow Jone Interactive Chart (commerical – it is exactly what I am looking for)
  2. Interactive Bubble Chart (open – although it is not exactly want I want, but if I believe the code can benefit me if I need to customize line chart. :cool: I may just need to draw the interactive small bubble on the line to get my job done!)
  3. This demo gives you tons of chart samples. They are all great example although none of them satisfy my current need.
  4. This demo is close to what I want. From this demo, I notice I can use “annotationElement” to draw on top of the data series. However, the trick is to convert the data points to pixel coordinate in order for me to draw something that can move along with the graph even someone stretches it. To make thing easier, Ely Greenfield has created DataDrawingCanvas that helps us draw on the chart with only data points specified instead of pixel coordinates. This class extends the ChartElement like AnnotationElement does (blog). That is amazing!! Thanks!!! :smile:
  5. Google Finance Chart (It is exactly what I want. I wonder I can get the source of it)
    • I have found the blog and powerpoint of this sample (1/7/2009)
    • Google uses the Flash/ JavaScript integrate kit to get it works (blog) – I heard that it is very nice combination of Flash and AJAX. This is similar to MeasureMap‘s use of the kit.
    • It is open source example!!! (code). Thanks for Brendan Meutzner!!
    • Brendan also shows us how he created his demo in 5 steps to help us understand how to build it ourselves.
    • Step 1, Step 2, Step 3, Step 4, Step 5 – enjoy!!

Reference

Useful resources:

  1. Data Visualization by Tom Gonzalez. (Tom created an open source visualization framework named Axiis. It looks great. Once I get a chance, I will dig into it) – 7/31/2009
  2. Building a Flex Component by Ely GreenField
  3. Create component and enforce separation of concern
  4. http://www.edwardtufte.com/tufte/ (Edward Tufte – famous guy in data visualization)
  5. http://www.insideria.com/2008/03/image-manipulation-in-flex.html (Image Manipulation)

 

Leave a comment Continue Reading →

Database concurrency control – MVCC

Concurrency Issue – Lost Update

Lost update is the key concurrency problem that we try to avoid:

From the sequences above, SELECT-UPDATE transaction A will overwrite the update Transaction B made to the balance. .  If the transactions of A and B would serialize properly, the correct balance value after these transactions would be 700. For performance reasons, most RDBMS uses the isolation level “READ COMMITTED” as default. However, this isolation level does not protect any data read by transaction of getting outdated right after reading the value as another transaction can change it. You can use isolation level “REPEATABLE READ” or  “SERIALIZABLE” to protect the values read in the transaction of getting outdated via holding shared read lock on these rows up to the end of the transaction. In case you are new to RDBMS, let me give you some background.

There are 3 things we are trying to avoid in reliable RDBMS:

  1. Dirty read – if transaction A can read uncommitted changes by transaction B, it may be working on the data that may be rolled back if transaction B cannot successfully commit the changes.
  2. Non-repeatable read – Transaction A issues the same select twice during the transaction may get different results as some other transactions may commit its changes in between.
  3. Phantom read – Transaction A issues the same select twice during the transaction may get new rows as some other transactions may insert new records that fits the criteria of the query.

To protect the 3 issues above, RDBMS provides us 4 level of isolations:

  1. READ UNCOMMITTED – lowest level – no protection at all. Dirty, non-repeatable and phantom read are possible.
  2. READ COMMITTED – default setting – cannot protect non-repeatable and phantom read
  3. REPEABLABLE READ – phantom read is still possible
  4. SERIALIZABLE - highest level with largest overhead as it forces all transaction to run in serial fashion (ie. one by one). However, it resolves all the issues above.

Under the hood, RDBMS uses locking scheme to achieve the 4 transaction isolation level above. How? It uses 2 kinds of locks:

  1. Shared read lock – more than one read concurrently but no write is possible
  2. Exclusive write lock – once exclusive write lock is acquired, no one can read and write until it is released.

In other words, read locks block write lock and vice versa. In the above Lost Update problem, transaction A can use shared read lock to make sure no write is possible for the data it read during the course of its transaction. So far so good? :grin:. Lets move on!

Using high level of isolation to solve the lost update issue may give you a new set of issues related to locking:

  1. Long-running transaction – if transaction A takes a long time to commit, other transactions that wants to change the value will be queued up. Normally, if your transaction read info out from db, change it in UI and write it back. You should consider it as long transaction as you never know how long it takes for user to change it in UI (ie. user think time). In this scenario, you will split the transaction into 2. (ie. a transaction to read and a transaction to write back). However, once you split up the transaction, you have no way to guarantee  your data of the previous read up-to-date no matter what isolation level you use. How to solve this problem then? :shock: It can be solved by checking the version or timestamp of the data in database before you write it in the write transaction. If the version is the same as the first read transaction, it means no one has changed it even you leave it wide open, then you are safe to change it. Otherwise, an exception should be thrown. This mechanism is called optimistic locking.
  2. Deadlock – if transaction A needs other resources in order to commit its transaction but these resources are holding up by a transaction that is waiting for A to commit, it will create a typical circular dependency.

Locks are not just for row. It also happens to index. That is to say, for heavy updates to distinct rows of a table, the bottleneck can be index locking.

New Database Currency Control – MVCC

The aim of Multi-Version Concurrency Control (MVCC) is to avoid Writers blocking Readers and vice-versa, by making use of multiple versions of data (ie. No read lock is necessarily). The problem of Writers blocking Readers can be avoided if Readers can obtain access to a previous version of the data that is locked by Writers for modification. That is to say, there may not have concurrent write.  However, while MVCC improves database concurrency, its impact on data consistency is more complex. I will discuss about this later. Now lets look at how to implement MVCC.

Challenges in implementing MVCC

  1. If multiple versions are stored in the database, an efficient garbage collection mechanism is required to get rid of old versions when they are no longer needed.
  2. The DBMS must provide efficient access methods that avoid looking at redundant versions.
  3. The DBMS must avoid expensive lookups when determining the relative commit time of a transaction.

There are essentially two approaches to multi-version concurrency.

  1. The first approach is to store multiple versions of records in the database, and garbage collect records when they are no longer required. This is the approach adopted by PostgreSQL and Firebird/Interbase.
  2. The second approach is to keep only the latest version of data in the database, but reconstruct older versions of data dynamically as required by exploiting information within the Write Ahead Log. This is the approach taken by Oracle and MySQL/InnoDb.

How does Postgresql implement MVCC

In PostgreSQL, when a row is updated, a new version (called a tuple) of the row is created and inserted into the table. The previous version is provided a pointer to the new version. The previous version is marked “expired”, but remains in the database until it is garbage collected. So, to achieve this, each tuple needs 2 additional fields:

  1. creation id/ xmin - The ID of the transaction that inserted/updated the row and created this tuple.
  2. expired id/ xmax - The transaction that deleted the row, or created a new version of this tuple. Initially it is null.

A row is visible if:

  • its creation id is a committed transaction (ensure it doesn’t read uncommitted data from other open transaction) AND its creation id is < current transaction id (ensure non-repeatable read)
  • its expired id is null (ensure it is not stale data) or > current transaction id b/c if it is deleted by the transaction after the current one, the current transaction should ignore it.

To track the status of transactions, a special table called PG_LOG is maintained. Since Transaction Ids are implemented using a monotonically increasing counter, the PG_LOG table can represent transaction status as a bitmap. This table contains two bits of status information for each transaction; the possible states are in-progress, committed, or aborted.

One of the drawback of MVCC in PostgreSQL is that old version tuples are kept in the same table. To garbage collected these, you can “VACCUM” it.

Reference

Below are the references I used for this artice:

  1. Paper – Data Access Pattern
  2. PostgreSQL Ends the Waiting Game
  3. MVCC implementation algorithms in different databases
  4. MarkMail – very good postgresql forum
  5. Postgresql 8.3.6 Manual

  

         

 

Leave a comment Continue Reading →

Concurrent Programming – Part 1 Synchronization

Get yourself familiar with concurrency programming

When I interview my candidates, I like to ask questions related to multi-threading. I found out that it is a good topic to differentiate out a hardcore programmer from application-oriented programmer. I am not saying I am looking for someone who could write the concurrency library as efficient as the one created by Doug Lea. In fact, I am looking for candidates who has solid understanding of this topic. However, I found out that most candidates have little knowledge in this area apart from the meaning of “synchronized” keyword in Java syntax. Therefore I decide to write a series of articles to cover some areas of multi-threading that I feel important to understand. Of course, I would start from the basic first.

Introduction of Synchronization

Synchronization is a way to lock an object, so no 2 threads possibly running on the same code at the same time.

public class SynchronizedCounter {
    private double c = 0;

    public synchronized void increment() {
        c++;
    }

    public synchronized void decrement() {
        c--;
    }

    public synchronized int value() {
        return c;
    }

    public void method2() {
        ...
    }
}

If count is an instance of SynchronizedCounter, then making these methods synchronized has two effects:

  • Mutual Exclusion – It is not possible for two invocations of synchronized methods on the same object to interleave. When one thread is executing a synchronized method for an object, all other threads that invoke synchronized methods for the same object block (suspend execution) until the first thread is done with the object. Remember this rule doesn’t apply to non-synchronized methods. And the thread holds the lock of the object can reenter its synchronized methods (ie. reentrance).
  • Memory-Visibility – When a synchronized method exits, it automatically establishes a happens-before relationship with any subsequent invocation of a synchronized method for the same object. This guarantees that changes to the state of the object are visible to all threads. Most of the interviewers miss this one. :smile: In database, it is like the concept of “commit”. If you don’t commit your changes, others could not see your changes.

All in all, synchronized methods enable a simple strategy for preventing thread interference and memory consistency errors. Interference happens when two operations, running in different threads, but acting on the same data, interleave. This means that the two operations consist of multiple steps, and the sequences of steps overlap. This will result in unpredictable data lost (hard to fix). Memory consistency error occurs when complier and processor reorder statements and uses the cached value for better performance

Problem of Synchronization

Synchronization will serialize the method calls from different threads. At any given time, only one thread can execute the synchronized method and the other threads need to wait until the object lock releases. This will dramatically diminish the liveness of your application. To minimize the impact, you should:

  1. Reduce lock duration – Synchronized statements are useful for improving concurrency with fine-grained synchronization
  2. Reduce lock scope – Mutex variable in the synchronized lock may help you to avoid locking the whole object. In Java 5 concurrency library, there is class called ReentrantLock that provides the same features as synchronized with better performance and flexibility. Here is what is stated in “Java Concurrency in Practice”:

Why create a new locking mechanism (ie. ReentrantLock) that is so similar to intrinsic locking (ie. synchronized)? Intrinsic locking works fine in most situations but has some functional limitations. It is not possible to interrupt a thread waiting to acquire a lock, or to attempt to acquire a lock without being willing to wait for it forever. Intrinsic locks also must be released in the same block of code in which they are acquired; this simplifies coding and interacts nicely with exception handling, but makes non-block structured locking disciplines impossible. None of these are reasons to abandon synchronized, but in some cases a more flexible locking mechanism offers better liveness or performance.

So far so good.? Great! lets me ask you 3 questions:

  1. Question 1: In the example above, if thread A is executing a synchronized method “increment”, can another thread execute method2? Yes. Because method2 is not synchronized
  2. Question 2: If thread A is in the synchronized method xyz , can it invoke another synchronized abc? Yes. The object lock is reentrant.
  3. Question 3: If I want to make the above class thread-safe without using synchronized object lock, are there any other alternatives? Yes.
  4. Question 4: How about declare the variable volatile?
    • Managing Volatility by Brian Goetz
    • Compare Atomic variable, ReentrantLock and Volatile variable.
    • Use int, byte instead of long or double because updating int or byte is an atomic action. Atomic actions cannot be interleaved, so they can be used without fear of thread interference. However, this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible (for example, thread A updated an variable atomically but it hasn’t flushed and sync up the main memory, so it is not visible to other threads). Unless the fields in question are declared [code]]czoxMDpcInZvbGF0aWxlLCBcIjt7WyYqJl19[[/code] the JMM does not require the underlying platform to provide cache coherency or sequential consistency across processors, so it is possible, on some platforms, to read stale data in the absence of synchronization. Look at here for better explanation. However, volatile can only guarantee atomicity and memory consistency for single variable. If you want to guarantee that for compound operations, you need to use synchronized block.(or the new java.util.concurrent classes). It is worth pointing out that increment (i.e. ++) and similar operations are not atomic in Java. So incrementing a volatile variable [code]]czoxMzpcInZvbGF0aWxlVmFyKytcIjt7WyYqJl19[[/code] is NOT thread-safe. If you need thread-safe semantics i.e. no possibility of multiple threads corrupting the variable value by having the updates unexpectedly interfere with each other, then you need to use a synchronized block to increment a variable, e.g. [code]]czoyNzpcInN5bmNocm9uaXplZChMT0NLKXtteVZhcisrfVwiO3tbJiomXX0=[[/code], regardless of the overheads this causes – Java Performance Tuning

More On Thread Safety

All the techniques I discussed so far is to show you how to make your code thread safe. They are applicable only if you have to share resources across multiple threads and those threads may modify the resources. That is to say, if you don’t share any resources or other threads only read but no write, your code are thread safe already. Here are some tips related to not sharing and read-only sharing:

  1. You can use local variables to carry out logic in the methods if possible (not share)
  2. You can use TheadLocal to hold the resources if you want to access it across multiple methods for the same thread (not share)
  3. You can use immutable object (private variable without setXXX methods) for read-only sharing. For example, String and PrimitiveWrappers like Integer. However, make sure the declare final for the reference that holds your immutable object.
  4. Most of the time, you use collection classes like HashMap, ArrayList to hold our objects. Those classes are not thread safe. To make it thread safe, you may use Collections.synchronized wrappers or simply use the synchronized version of them like Hashtable and Vector. However, these approaches have 2 problems
    • They are not performed. Why lock every reads only to protect occasional write?
    • They are just conditionally thread-safe. All individual operations are thread-safe, but sequences of operations where the control flow depends on the results of previous operations may be subject to data races like doing containsKey(), size() and iterator() methods before actually read and write can give you NullPointerException and ConcurrentModificationException if you don’t do external synchronization.
    • Here are the unconditionally thread-safe version like ConcurrentHashMap, ConcurrentLinkedQueue and CopyOnWriteArrayList to achieve thread-safe with good performance number.
    • When you write an unconditionally thread-safe class, consider using private lock object in place of synchronized methods. This protect you against synchronization interference by clients and subclasses and gives you the flexibility to adopt a more sophisticated approach to concurrent control in later release – Joshua Bloch in Effective Java 2nd version p.281.
  5. Deal with lazy initialization
  6. Handle denial of service attack that holds the object lock forever

Java Memory Model

JMM is what causes concurrent programming way more complicated than it should be. Honestly, I am not good to write this part because I cannot understand it in full. All I can do is to provide you a video from Jeremy Manson in Google. Hear what the expert said:

 If you still have questions, make sure go to his blog

http://jeremymanson.blogspot.com/

Reference

Below are some of the articles I use:

  1. What does volatile do?
  2. Sun lesson on concurrency
  3. Fixing Java Memory Model – Brian Goetz – Part 1, Part 2
  4. Rox Java NIO Tutorial
  5. Blocking Queue
  6. Synchronization and Java Memory Model – Doug Lea 

 

Leave a comment Continue Reading →

Common DBA jobs

Export schema/ data out from mysql

To export schema and/or data, you can use mysqldump command:

mysqldump -u [username] -p[password] -d [schema_name] > [filename].sql

  1. -d means no data (just gives me the schema).
  2. -B is needed for multiple schema output
  3. -h (hostname)

Export data out from postgresql

  1. Export table data from postgresql to csv format
  2. Backup and restore database in postgresql

However, if you want to export sql result set to csv in postgresql, you can consider to use COPY functionality.

COPY ( select statement ) TO STDOUT WITH CSV
COPY stock FROM ‘mydir/Stock.csv’;

Run sql script using mysql command

To run the scripts as input, we can use the following command:

mysql [schema_name] -u [username] -p[password] < [filename].sql

SQL Tips

There are times we want to put logic in SQL but not writing store procedure. Here are some of using functions that may get you there:

  • Conditional statement – CASE WHEN xxx THEN abc WHEN yyy THEN bbc …ELSE ccc END

UPDATE Account SET Sales_Location__c =
    CASE WHEN Sales_Country__c != ” THEN Sales_Country__c WHEN Country__c != ” THEN Country__c
ELSE ‘–’ END

  • COALESCE (input1, input2,….) – This function takes in as many parameter as you want and return you the first non-NULL parameter. Suppose we have a table A having 3 columns FullName, CompleteName and DisplayName. Any of these columns can contain null values. Now we want to select the DisplayName from this table, but if it is null, then return FullName, if that is also null then return CompleteName. We can easily perform the same in one select statement as: (COALESCE vs ISNULL)

SELECT COALESCE(DisplayName, FullName, CompleteName) From A

 ETL

In mysql, you can export a table from db1 and import it to db2 remotely. For example, in db2 host, you can issue the following command:

/usr/bin/mysqldump – -force – -compress – -opt -u [username] -p[password] -h [hostname] db1

| /usr/bin/mysql -u [username] -p[password] -D db2

Leave a comment Continue Reading →