Archive | Data Intelligence RSS feed for this section

Database Performance – Indexing

There are 2 main focuses I will take to analyze a database. First, I will find out how it manages the data. Second, I will look at how it scales in term of data volume and traffics. Today, I will talk about the most common indexing scheme that most of the databases use today. It is B-Tree Indexing.

B-Tree Indexing

begin figure description - The paragraph that precedes this figure describes the content of the figure. - end figure description
 
Many people think B-Tree means binary tree. It is not right. If I really use binary tree to structure the index, 1 million index values will have a very deep tree to traverse and each node retrieval is equivalent to a read operation. Then, it may take so many reads to get down to the leaf node. How can it be performed? Instead, B-Tree means balanced tree. A B-tree is said to be balanced because it will never become lopsided as new nodes are added and removed. Apart from that, each node can have many sub-nodes. So, for millions of records, it can be handled by 2-3 levels of balanced tree. So, it is very good in performance. It offers O(log n) performance for a single-record lookups.
 
The fundamental unit of an index is the index item and a node is an index page that stores a group of index items. An index item contains a key value that represents the value of the indexed column for a particular row. An index item also contains rowid information that the database server uses to locate the row in a data page. A node is an index page that stores a group of index items.
 
It is interesting to note that the leaf nodes of the index are actually  a doubly linked list. Once we find out where to start in the leaf node (find the first value), doing an ordered scan of value (index range scan) is very easy. We don’t have to navigate the structure any more; we just go forward through the leaf nodes. That makes solving range-based queries such as "BETWEEN 20 and 30" much easier.
 

When should you use B-Tree Index?

To understand when you should use B-Tree index, you should know there are 2 ways to use an index. First, you can use index as a mean to access rows in a table via rowid. If you use index for that, you want to access a very small percentage of the rows in the table. Otherwise, you need to get into "index then row" cycle many times (implies many IOs) and it will be worse than pulling bunch of rows in batch to reduce the number of IOs (the costly part of database operation). According to the experiment done, full scan is faster if  we access too high % of  rows via index. Second, you can use index as a mean to answer the query if the index contains enough information to answer the entire query. In this case, we don’t need to go to the table at all. The index will be used as a thinner version of the table. So, if you want to access a large % of rows via index, you should consider to get the query answer via the information in the index.

MySQL Indexing

There are several rules to remember for MySQL indexing

  1. MySQL will only ever use one index per table per query (except for UNION b/c it is considered as separated queries).
  2. To get around that, you can create multicolumn indexes.
  3. When there are more than 1 indexes to choose from, MySQL makes an educated guess based on the statistics gathered.
  4. MyISAM has indexes kept in a completely separate file from table rows. And table rows are stored in the random order that are retrieved by the rowid in the index items.
  5. InnoDB uses clustered indexes that has primary key and the record itself clustered and the records are all stored in primary-key order. When your data is almost always searched on via its PK, clustered indexes can make lookups incredibly fast because single lookup can pull out record in question.
  6. Primary key cannot contain NULL whereas unique index can.

 

Leave a comment Continue Reading →

Business Intelligence – Part 1 Pentaho

Getting into Business Intelligent World

When I dig deeper in business intelligence, I found out that it is a huge topic ranging from reporting to data mining. Like all the knowledge acquisition plan, I put a series of milestones for myself. If you are interested, here is my list:

Get and prepare your data

  • Data collection – log processing, web services (SOAP and REST), RSS, screen scraping and more.
  • Data preparation and crunching – ETL (Kettle)
  • Data storage – data warehousing

Visualize your data

  • Reporting and Charting (Pentaho as server and Flex as frontend)

Analysis your data

  • Data modeling
  • Data analysis (OLAP)

Get smart of your data

  • Collective intelligence
  • Data mining

Introduction of Pentaho

Firstly, I want to see whether there is any out of the box  open-source solution that captures what I am trying to do here. If so, I can reach my goal much faster. Yes. It has to be open-source b/c I don’t have $$ and I don’t want to be just a user. After doing my homework a bit, I found out an open-source BI tool named Pentaho that looks pretty solid. So, I decide to dive deep to this. Like all the tools I mess around with, I want to integrate Pentaho as library. However, I don’t find anything on the Net that shows me how to do it. I am looking into its download and checkout its pentaho-sample project. What it shows me is how to use their tools to create a report on their systems using their UI. I definitely need more!

After few days of efforts, I managed to pull out all the unnecessary dependencies from pentaho. The heart of Pentaho is its xaction interpreter. The approach Pentaho uses is to write adapters, plug into its framework and use xaction to wire them up in a workflow fashion. In fact, most of its functionalities come from other open-source projects like quartz for scheduling, shark for workflow engine and jfreereport for reporting. I don’t think their xaction is clean but I do like their architectural approach.

Enough talk, lets start! Here I would use series of articles to cut your learning curve and show you how to get yourself familiar with Pentaho as a developer rather than user. First thing first, follow the articles below to set up your environment. 

  1. http://wiki.pentaho.com/display/PentahoDoc/07.+Debugging+with+the+Standalone+Platform+Project
  2. http://wiki.pentaho.com/display/PentahoDoc/Building+and+Debugging+Pentaho+with+Eclipse
  3. http://wiki.pentaho.com/display/PentahoDoc/Manual+Deployment+of+Pentaho

Leave a comment Continue Reading →

Flex Remoting and Session Management

Power of BlazeDS

Recently, I found out that Adobe has released BlazeDS (subset of LiveCycleDS) that has 4 main advantages:

  1. AS3 to Java object communication (no XML passes back and forth is needed!)
  2. Boost up performance b/c AMF is a binary protocol
  3. Built-in proxy support that gets around the cross domain security issue from Flex in ease.
  4. Allow push messaging

I have followed the guideline and set it up. Now my Flex application can call my Java object method without passing xml back and forth. It is awesome! During the setup process, you may experience your flex cannot find the destination set up in the server.

The error “[MessagingError message=’Destination ‘SomeBean’ either does not exist or the destination has no channels defined (and the application does not define any default channels.)’]”.

The trick here is to add a services argument to the mxmlc call, something of the form below should do the trick! 

-services “[local path to your java project]/WEB-INF/flex/services-config.xml”

Now you may start enjoying how AS3 talks to your Java Object. However, if  we bypass the Servlet layer in the code, how can we carry session across remote method calls? Great that I have found out how to handle it via this article. In short, you can access Session from your Java object via:

FlexContext.getFlexSession()

Here is the quote I got from the BlazeDS developer guide.

The FlexContext class is useful for getting access to the session and the HTTP pieces of the session, such as the HTTP servlet request and response. This lets you access HTTP data when you use a Flex application in the context of a larger web application where other classes, such as JSPs or Struts actions, might have stored information.

The FlexSession class provides access to an ID and also provides setAttribute and getAttribute functionality. This is useful for storing data on the server that doesn’t have to go back to the client. However, FlexSession is not cluster-aware; if a client connects to a different server in the cluster, the client receives a new FlexSession. Nothing stored in the FlexSession attributes is persisted for clustering purposes. The FlexSessionListener class is useful for monitoring who is connected. You add a listener by using the static method to track new connections being made. You receive a reference to the session that was added. Each session can then report when it is destroyed to those same listeners. You use this for monitoring connections that close, and also to clean up resources.

When I looked into the source of FlexContext, I noticed that it leverages ThreadLocal to store context info like request, response and session.

    private static ThreadLocal sessions = new ThreadLocal();  
    /**
     * The FlexSession for the current request.  Available for users.
     */
    public static FlexSession getFlexSession()
    {
        return (FlexSession)sessions.get();
    }

Reference

Below are some of the useful references I have read so far:

  1. Jim Boone’s Blog

 

Leave a comment Continue Reading →

Power up Salesforce UI via Flex

Get started

Follow the steps below to get your first Flex salesforce app up in Salesforce.

  1. Register a developer edition account from Salesforce. Note: Dev account never expires but the account does come with a few limitations. You can only have two users, one an admin account so that you can build and install applications and the other a normal user account so you can test your work from the perspective of a normal user. The account has a 2MB data limit and you can send mass email. However, It is totally fine for playing around all the features that Salesforce provides.
  2. Download the Flex Salesforce Toolkit (ie. force.com-air_flex-1.0.zip). This toolkit provides the needed libraries to communicate directly with your salesforce.com database records from within a Flex application, using native ActionScript packages and returning strongly typed classes.The documentation on the classes can be found here.
  3. In this zip file, there is a library called as3Salesforce.swc in the bin directory just off the root of this zip file. It is the library you need to associate to your Flex project.
  4. Create a Flex project and include the swc library in it.
  5. You can follow this screencast to get your first project up.

What you can do after that?

Now you have your Flex application run locally in Salesforce. Here is my TODO list and the solutions of each one.

  1. Run your app under your own website and pull info from Salesforce using the same api.
  2. Run your app under Salesforce and have it pulled data from your system thru Flex Remoting (HTTPService, WebService, RemoteObject…etc).
  3. Can we use the API to pull Salesforce metadata like SControl?
  4. How can we provide our application via Apex Exchange?
  5. How can we use the Salesforce Flex AIR Toolkit to make have your application deal with Salesforce in offline mode? Look into this article.

What Flex gives you but not original Salesforce UI?

Now you know how to integrate Flex with Salesforce. But what problems that Flex helps us solving but not the original Salesforce UI?

  1. Capture user events on the fly and display additional fields or populate bunch of fields.
  2. Visual the data via Flex charting.
  3. Full control on the layout, and look and feel.

  

Leave a comment Continue Reading →

Adobe AIR vs Google Gear

When you come to the world of offline/disconnected application, you will hear the new buzzwords – AIR and Gear. Before I talked about these technologies, I want to talk about what problem we are trying to solve first. What we need an application to be able to function offline? First, Internet access is not everywhere. When you are on the plane, at the coffee shop, on the Maui beach etc, you may not be able to connect to your favorite Web applications. What if you want to look at some of your reports? I would like to see the traffics statistics for my own little wedding website company – Jusproposed.com. However, I don’t see Google Analytics available offine :)

OK. There is need for it. Although the need may go away once the whole world is wired by WiFi or other technologies (may still take a while), we may want to solve this problem to improve user experience. Here is how the problem get solved in high level?

  1. User request is captured locally.
  2. All your operation is handled locally to the local database.
  3. When Internet access is available, synchronize the local database with the server database.

How Gear solves this? (I learn it here)

  1. LocalServer module captures browser HTTP and HTTPS requests if there is no internet access (Browser plugin works only with Firefox 1.5+ and IE6).
  2. Database module (uses SQLite) stores the updates and provides the information needed. (news). When Internet access is available, it will do the synchronization work for you automatically.
  3. Google gear also provides background processing mechanism taht allows I/O or compute-intensive Javascript to run without blocking the user interface.

How AIR solves this?

  1. Adobe AIR is desktop application that runs without browser.
  2. Apollo provides SQLite as well for storing the updates and provides the information needed.
  3. Sync module will do the synchronization work when Internet access is available.

My preference

I have to pick either one of them. I would select AIR b/c ilike the power of Flex in UI. So, if AIR solves the problem, I may just stick with it. Simple?!! :) Since I am working for the Performance Ad Network right now, if I can pull enough data across the wire and store it locally, then I can run OLAP and other complex queries and visualize the data with the advanced data grid and charting available in Flex, that will be awesome!

Update: Flex and Gear actually can work together! (Someone demonstrates this combination via building a sales force application)

Leave a comment Continue Reading →

Database – Tuning Tips

Understand Query Processing Basic

Before tuning your slow query, you better understand how your database process your query first. Here are the steps:

  1. Each client connections gets its own thread within the server process (MySQL). The connetion’s queries execute within that single thread, which in turn resides on one core or CPU. The database server caches threads, so they don’t need to be created and destroyed for each new connection (ie. has its own thread pool).
  2. Look up the query cache to see whether the resultset of your query is available. MySQL does this via hashing your query and use the hashed value to check for the results in the cache (ie. Map[ hashed query : resultset] ). MySQL uses the exact query text it receives, so the cache is sensitive to the most trivial variations like case, space and etc. Best practice: Have your application generates the queries. If your query cannot be found in the query cache, go to step 3. Once the table is updated, the query in the cache will be flushed and updated.
  3. Parse the query to create an internal structure (the parse tree).
  4. Analysis the query
  5. Optimize the query – Create the query plan that is believed to be the most efficient way to execute a query. The effort center around indexes and table join order. Use Explain to look into the query plan that database uses for your query.
  6. Execution
  7. Send the resultset back to client

Approaches that can speed up your queries

  1. Index tables to allow the database server to look up rows more quickly.
  2. Write effective SQL that takes advantage of those indexes to the fullest extent, and use the EXPLAIN statement to check whether the RDBMS really is doing so.
  3. Tune the hardware

Concept of Indexing

The article from Paul Dubios gives a great summary of the benefit and cost of indexing and when we should use it. If you don’t want to read up 6 pages of his article (but I really suggest you do), here is my summary.

Benefits/Cost of Indexing

  1. The column you indexed are sorted and you can combine few columns together (order is important) as single index (ie. multi-columns index).
  2. Indexed column is sorted. To locate record with last name equals to “HON”, you will look it up via lastname index and pull consecutive entries until it is not equal to “HON” and you can ignore the rest. Therefore, one efficiency gained by using the index is that we can tell where the matching rows end and can skip the rest. Another efficiency comes about through the use of positioning algorithms for finding the first matching entry without doing a linear scan from the start of the index (for example, a binary search is much quicker than a scan). Normally, B-Tree index is applied by default.
  3. If you search your table in different ways, you can have more than 1 index. Remember at most one index per table is picked by RDBMS each time you query a table. You can get around this limitation via composite index.
  4. Index may be stored separately from data. But MySQL InnoDB stores them together.
  5. Indexes are always sorted, but the data on disk is not. Using an index means accessing the rows in index-sorted order (ie. random seek) rather than in the order they reside on the disk (ie. sequential access). The end result is more time spent moving around the disk and less time reading data. You can draw 2 conclusions form this knowledge:
    • If a table is going to remain very small, you may want to leave off the indexes.
    • Always use the production data for your test

Choose columns to index

  1. Index columns that you use for searching, sorting, or grouping, not columns you only display as output. In other words, the best candidate columns for indexing are the columns that appear in your WHERE clause, columns named in join clauses, or columns that appear in ORDER BY or GROUP BY clauses. If RDBMS can optimize a query using joined columns, it cuts down the potential table-row combinations quite a bit by eliminating full table scans.
  2. Consider column cardinality. The cardinality of a column is the number of distinct values that it contains. Indexes work best for columns that have a high cardinality relative to the number of rows in the table (that is, columns that have many unique values and few duplicates). The query optimizer generally skips an index in favor of a full table scan if it determines that a value occurs in a large percentage of a table’s rows. The conventional wisdom for this percentage used to be “30%.” In that case, your index gives you no benefit but cost to maintain!
  3. Favor the column with smaller data type to index. Faster for comparsion, less disk IO required, more can be cached in memory. For the InnoDB and BDB storage engines that use clustered indexes, it’s especially beneficial to keep the primary key short. A clustered index is one where the data rows are stored together with (that is, clustered with) the primary key values. Other indexes are secondary indexes; these store the primary key value with the secondary index values. A lookup in a secondary index yields a primary key value, which then is used to locate the data row. The implication is that primary key values are duplicated into each secondary index, so if primary key values are longer, the extra storage is required for each secondary index as well.
  4. Take advantage of leftmost prefixes. When you create an n-column composite index, you actually create n indexes that MySQL can use. A composite index serves as several indexes because any leftmost set of columns in the index can be used to match rows. Such a set is called a “leftmost prefix.” Suppose that you have a table with a composite index on columns named state, city, and zip. Rows in the index are sorted in state/city/zip order, so they’re automatically sorted in state/city order and in state order as well. This means that MySQL can take advantage of the index even if you specify only state values in a query, or only state and city values. However, it cannot use the index for searches that don’t involve a leftmost prefix. For example, if you search by city or by zip, the index isn’t used

Concept of Concurrency

To provide better concurrency, below are the rules of thumbs that I used

  1. Locking level: Locking at a finer level allows better concurrency (ie. row level is better than page level, and page level is better than table level), because more clients can be using a table at the same time if they use different parts of it.
  2. Table locking does have an advantage over finer levels of locking in terms of deadlock prevention.
  3. MVCC is new mechanism to achieve better concurrency b/c read doesn’t block write and vice versa..

Schema design and structuring

This section will cover the techniques that you can leverage to design your schema with performance in mind.

  1. Normalize first then denormalize later. Don’t over normalize. Use 3rd Normal Form
  2. Use the right data type. Don’t waste space unnecessarily (general)
    • Use INT unsigned for IPv4 address
    • Pay attention on TEXT, BLOB and BIGINT data types. Consider separate table for TEXT, filesystem for BLOB and other numeric data type for BIGINT.
  3. Partitioning
    • Vertical partitioning – split table with many columns into multiple tables (ex. static vs dynamic columns, frequent vs infrequent access pattern)
    • Horizontal partitioning – split table with many rows into multiple tables (ex. user-based)
    • MySQL 5.1 partitioning has issues (Postgresql and Oracle are more mature in this area).
  4. Understand the benefit and disadvantages of different storage engines (mysql)
  5. Leverage query cache effectively
    • Each time an update on the any record in the table, all queries referencing the table are invalidated in the query cache. So if you split out the dynamic fields into another table, you can make your query cache less frequent in thrashing.
  6. Index smartly

Write effective SQL

  1. Use ANSI SQL style that explicitly declare JOIN conditions using the ON clause. This syntax can handle Outer join.
  2. Use Explain plan to find out how RDBMS chose the execution plan for your SQL. Tune it accordingly.
  3. Don’t use indexed field in function call
  4. In explain plan, “Extra: Using Index” means FULL Index Scan. It happens when scan is better than seek and index table is enough to satisfy the query.
  5. Why full table scan?
    • No WHERE condition
    • No index on any type of field in the WHERE clause
    • Poor selectivity/ cardinality on an indexed field
    • Too many records meet WHERE condition (scan favors seek).
    • < MySQL 5.0 and using OR in WHERE clause

Put it all together

  • Think in set instead of iteration.

Example: For each customer, fetch the last payment record in the Payment History table. You can solve this query via correlated subquery. That is:

select p.* from Payment_History p
where p.payment_date = (select max(payment_date) from Payment_History where customer_id = p.customer_id)

The above query can be optimized a bit if you use compound index (customer_id, payment_date). However, if you want to achieve the ultimate performance, you should not use correlated subquery b/c it will execute n times and n = number of rows in the Payment_History table. To avoid that, you can use derived table. The query will be:

SELECT p.* FROM (select customer_id, max(payment_date) as last_order from Payment_History group by customer_id) as last_orders
INNER JOIN Payment_History p
ON p.customer_id = last_orders.customer_id
AND p.payment_date = last_orders.last_order

Optimize the N:M joins

If you write an application that you can tag a project many times. You may model your database schema with Project table, Tag table and Tag2Project join table. When you want to grab all projects that either tag by “mysql” OR “php”. You may write a select statement that have Project inner join Tag2Project with project_id and resultset inner join Tag table filtered by “mysql” or “php” with tag_id. However, optimizer may not join it as the order you specified. It may have Tag table filtered first then inner join Tag2Project table and pass the resultset to inner join Project table. You can force the sequence of join via using derived table.

On the other hand, if you want to grab all project that has tagged by both mysql AND php. You may end up solve it via using derived table from Tag2Project and Tag that “GROUP BY” project_id and having count(*) = 2 with Tag filtered by IN clause.

Configurable variables

Before you start tuning your server for mysql performance, you need to obtain the system information first. To do that, there are some linux command that may be useful. top command

top_display.JPG How to interpret the output from top command? Here you go.

  1. The first line is telling you that the system is up 211 days 9 hr and 38 minutes. There are 2 users using the system (from process table, they are mysql and ec users). For the last 1, 5 and 15 minutes, the load of the system shows 0.01, 0.11 and 0.21 respectively.
  2. The 2nd line shows you that there are 82 processes running in the system, only 1 of them are active (In the process table, S=R means Running) and 81 of them are in sleeping mode (S = S means sleeping).
  3. The 3rd – 5th line are very useful. It gives you the current status of our system. CPU is 99% idle. If it is not, you want to find out who is using it. The process table is sort by %CPU (if you have dual core processor, the % here can reach 200). In term of memory, the total memory in the system is around 4G in total. As you can see, 3.5 G is in used. For our mysql instance, it is using 1.2G of memory that is about 32% of the used memory. Another interesting thing to examine is the swap memory. There is 2G of swap memory allocated in this system and the system barely uses it. It is good because system takes time to swap in and out the data from memory to disk and vice versa.
  4. Troubleshooting: When you see mysql process is taking high %CPU – CPU-bound, there may be queries costing intensive cpu cycles. You can look into it via mysql administrative commands. When you see high % of IO wait – ie. IO-bound, you may have inefficient queries that cause mysql reading too many rows to locate the data you are interested in. In system level, you could reduce the # of IO via better use of memory/buffer or reduce the seek time via faster disk and less random data storage.

Other linux commands

  1. more /proc/version (check linux os version)
  2. more /proc/cpuinfo (for dual processor with speed 2.80MHz: cpu core = 2, processor = 1, cpu MHz = 2.80)

Hardware Tuning Tips

There are several key areas that we normally look at when we try to tune our database in hardware perspective.

  1. Add more memory if you can – reduce # of IOs (primary bottleneck).
  2. Achieve parallelism - multi-processor for processing and parallel write to different disk devices (eg. log and database store in different disk device). MySQL is multi-threaded.
  3. Faster disk to improve IO latency
  4. RAID can give you some advantages of parallelism

In MySQL, most of the memory mysql allocates is used for various internal buffers and caches. These buffers fall into 2 major groups: global vs per-connection buffers. The 2 most important global buffers are MyISAM key buffer (key_buffer_size) and InnoDB’s buffer pool (innodb_buffer_pool_size).

  • For MyISAM, remember that msyql doesn’t cache row data but the index block. So, the less often mysql needs to hit the disk to scan a table’s index, the faster the queries will be. In practice, OS may somehow help you out via caching some row data in cache. Even this, if possible, consider making the key buffer large enough to hold the indexes for your most actively used tables. By adding up the size of the .MYI files for the tables, you will have a good idea to set the buffer.
  • For InnoDB, both cache index and row data are together in its buffer pool via using the clustered indexes. So if the table size is not large, you will achieve the best performance via storing the whole table in cache because you don’t need to do any IO. However, if you modify the data in the table, you need to sync up your disk either per min or per commit.

Allocate enough memory for mysql

min_memory_needed = global_buffers + (thread_buffers * max_connections)

thread_buffers includes:

  • sort_buffer
  • myisam_sort_buffer
  • read_buffer
  • join_buffer
  • read_rnd_buffer

global_buffers includes:

  • key_buffer
  • innodb_buffer_pool
  • innodb_log_buffer
  • innodb_additional_mem_pool
  • net_buffer

When there are large number of concurrent users like 300-400, a particular nasty series of events may happen. The reason is large number of threads need to allocate additional memory, it can cause mysql to allocate so much memory that the OS begins swapping, which causes performance to degrade further, which means that each query takes longer to complete. With queries running more slowly, the odds of more threads needing memory increases. It is a vicious spiral. The only solution is to restore balance between the system’s memory and mysql’s memory needs. That means doing one of the following:

  1. Add more memory
  2. Decrease max_connections
  3. Decrease some of the per-thread buffer sizes

Be proactive, monitor memory use on your server.

Below are some quick tips for MySQL tuning that I have learnt from my colleagues and books. To keep this article short and useful as quick reference, I will just put the common ones here.

  • key_buffer size. (for MyISAM – default at only 8MB)
  • innodb_buffer_pool_size (for InnoDB – default at only 8MB) – reference

You need buffer pool a bit (say 10%) larger than your data (total size of Innodb TableSpaces) because it does not only contain data pages – it also contain adaptive hash indexes, insert buffer, locks which also take some time – Peter Zaitsev

  • Bumping these values up can decrease disk IO and make your database run a lot smoother.Use indexes. Enable the slow query log and look for queries that show up most often, then learn the explain command on them.
  • Check Created_tmp_disk_tables in show status; This value usually goes in hand with huge queries that aren’t using indexes. If it keeps increasing it means you are having to create on disk files to execute the query. Consider increasing tmp_table_size or putting your disk files on a TMPFS by setting tmpdir = /tmp.
  • Use the query cache if it can help you. Check the query_cache_size. Check how well it is performing. You can do this by doing a “show status” and then using this formula:

qcache_hit_ratio = qcache_hits / (qcache_hits + qcache_inserts + qcache_not_cached)

  • If there are a lot of Qcache_lowmem_prunes, that means your query cache size isn’t big enough.
  • If Opened_tables is increasing you probably have table_cache too small.
  • If you have a lot of connections, check Threads_created in show status; If it keeps increasing, you need to up the thread_cache_size in my.cnf. The max connections in MySQL are default at 200.

Reference

Below are some references I used to write this article

  1. MySQL Toolkit
  2. One large table vs many small tables
  3. The Art of SQL Tuning (Jay Pipes has a great presentation on sql tuning)
  4. MySQL Performance Blog
  5. Peteris’s Blog on SQL Tuning
  6. MySQL Query Optimization

 

Leave a comment Continue Reading →

Flex Remoting – AMF

Open up FDS?

Adobe announced that it will open source the strip down version of LiveCycle Data Service called BlazeDS. From what I heard, BlazeDS provides AMF3 remoting without data management and its data push solution is not so scalable as LCDS. (features comparsion chart). The reason is that BlazeDB still uses the blocking IO to handle connections whereas LCDS uses Java NIO. The power of Java NIO is that the server thread will not be held up for the connection during the request is in process. (Detail here). So, BlazeDB is only good for tasting the power of AMG remoting and learning how AMG protocol being implemented.

Future of GraniteDS

However, if all you need a solid AMF solution to connect Flex 2 with Java, take a look at GraniteDS. I heard that it does a great job. However, GraniteDS may become obsolete soon as the founder, Franck Wolff, seemed a bit caught off guard and unsure what the future held for GraniteDS. However, finally he decided to continue GDS development. Here are why GDS is still better than BlazeDS:

  1. BlazeDS does not offer any data management features.
  2. FDS/LCDS has it own (proprietary and closed) data management system, offers (officially supported?) service factories for Spring and Hibernate (but without any lazy-loading support) and is, say, expensive.
  3. GraniteDS integrates with known enterprise J2EE persistence systems such as EJB3 (session and entity beans with full lazy-loading support) and with wildly used and open source frameworks such as Spring (with Acegi security) or JBoss/Seam (with conversation scope support – new, coming in the next release).
  4. GraniteDS provides a fast and powerful ActionScript3 code generator (Gas3): you just have to write your database model (EJB3 entity beans), Gas3 will generate corresponding ActionScript3 beans and Hibernate code generation tools will create your database tables. So, from an enterprise point of view, GraniteDS does not really compare to BlazeDS and, on the other hand, it seems to be still a powerful and free alternative to FDS/LCDS.

Update: GraniteDS 1.0 is released. Here is an article that discusses this release.

How about WebORB?

I also heard that WebORB is a powerful alternative to LCDS (compare them here). Again it is Free! However, it is not open source and its license may constraint you doing some interesting things. I will check this out more. However, here is a promising comment from the director of scrapblog.com – a fancy Flex example on Web. "Scrapblog.com needed a remoting solution to use with Flex 2, as it currently only supports Java. I didn’t think twice when choosing WebORB for this task. Version 3.0 [of WebORB] is simply amazing. You won’t experience the flexibility and responsiveness that WebORB offers with any other remoting solution. My hat goes off to Midnightcoders, keep the great work." – Omar Ramos, Scrapblog’s Director of R&D.

Leave a comment Continue Reading →

Flex – Power of E4X

E4X is similar to XPath that is used to manipulate the structural xml in a scripting language. Here are some examples:

Delete node from XML

<root>
<level>
<detail></detail>
</level>
<level>
<detail></detail>
</level>
</root>

I want to make a copy of this XML (could be either of the three types of XML objects) and remove the <detail> nodes from every node no matter that hiarchy they are in.

private function testXML():void {
var xml:XML =
<root><level><detail></detail></level><level><detail></detail></level></root>
var test:XML = xml.copy();
delete test..*.detail;
trace(test.toXMLString());
}

Add new node to XML

var newItem:XML =
<level id=’1′>
<detail></detail>
</level>
test.appendChild(newItem);

Traverse

var myXML:XML =
<order>
<item id=’1′>
<menuName>burger</menuName>
<price>3.95</price>
</item>
<item id=’2′>
<menuName>fries</menuName>
<price>1.45</price>
</item>
</order>

myXML.item[0].menuName (for element traverse, use index to identify an element)
myXML.item.(@id=2).menuName (search an element using its attribute @)
myXML.item.(menuName==”xxx”).price

var total:Number = 0;
for each (var property:XML in myXML.item) {
var q:int = Number(property.@quantity);
var p:Number = Number(property.price);
var itemTotal:Number = q * p;
total += itemTotal;
trace (q + ” ” + property.menuName + ” $” + itemTotal.toFixed(2))
}
trace (“Total: $”, total.toFixed(2));

Assignment

myXML.item[0].menuName=”regular burger”;
myXML.item[1].menuName=”small fries”;
myXML.item[2].menuName=”medium cola”;

myXML.item.(menuName==”regular burger”).@quantity = “2″;
myXML.item.(menuName==”small fries”).@quantity = “2″;
myXML.item.(menuName==”medium cola”).@quantity = “2″;

var myXML:XML =
<order>
<book ISBN=”0942407296″>
<title>Baking Extravagant Pastries with Kumquats</title>
<author>
<lastName>Contino</lastName>
<firstName>Chuck</firstName>
</author>
<pageCount>238</pageCount>
</book>
<book ISBN=”0865436401″>
<title>Emu Care and Breeding</title>
<editor>
<lastName>Case</lastName>
<firstName>Justin</firstName>
</editor>
<pageCount>115</pageCount>
</book>
</order>

myXML.book –> XMLList (direct children of myXML object that has element name “book”)
myXML..lastName –> XMLList (any descendant that has element name “lastName”)
myXML.book.author.lastName –> XMLList
myXML.book[0]
myXML.book[0].title
myXML.child(“book”).title
myXML.book.(@ISBN=’0942407296′)
myXML.book[0].@ISBN –> output 0942407296
myXML.book.(title==’Emu Care and Breeding’)
myXML.book.(pageCount > 100)
myXML.book.(title.toString().search(‘Emu’) > -1)

var myXML:XML =
<order>
<item id=’1′ quantity=’2′>
<menuName>burger</menuName>
<price>3.95</price>
</item>
<item id=’2′ quantity=’2′>
<menuName>fries</menuName>
<price>1.45</price>
</item>
</order>;

var total2:Number = 0;
for each (var prop:XML in myXML.item) {
total2 += prop.@quantity * prop.price;
}

http://www.devx.com/Java/Article/21383/1763/page/1

Leave a comment Continue Reading →

Art of using database indexes

Need of Indexes

Image of you have a table of user info, if the table contains 50 million of rows. Without index, running a query like below will need a full table scan. Clearly it is not efficient as it is O(n) problem.

SELECT * FROM user_info WHERE last_name = “Tom”

But if we index it on last_name column, the last_name field will be sorted alphabetically. Now if you look up the last_name = ‘Tom’, you can go directly to the one starts with ‘T’. Internally, index table contains the fields you are indexed and the position of the matching records (ie. pointer).

Cost of Indexes

  1. Because database needs to maintain a separate list of indexes’ values, there is cost to keep them updated as your data changes
  2. Indexes cost space. It is a trade-off between space and time. Lets say the user_info table has 2 billions rows and last_name is 8 bytes long, you are looking at roughly 16 GB of space for the data portion of the index. Plus 4-8 bytes for each row pointer, it will go up to 32 GB of space.

With these cost, you don’t want to index every column in a table.

Type of Indexes

 

  1. Multicolumn indexes – reduce the set that matches single column only (ie. more selective). For example, if you have each field as index separately, database may not use them all at once at the same time. Like MySQL, it will only ever use one index per table per query. To choose which index to use, MySQL will make a decision about which index will return fewer rows via index statistics.
  2. Partial indexes – if you don’t have too much space for your index, you can specify a subset of bytes from your index value be used.
  3. Clustered indexes – In MySQL, for MyISAM, the indexes are kept completely separate from the row data. With clustered indexes, the index and the record itself are “clustered” together. InnoDB uses clustered indexes. In Oracle, clustered indexes are known as “index-organized tables”. The type of index will reduce two lookups (index and record data) to one. Internally, clustered index reorders the way records in the table are physically stored. Therefore table can have only one clustered index. The leaf nodes of a clustered index contain the data pages.

Deep look in MySQL: The InnoDB storage engine creates a clustered index for every table. If the table has a primary key, that is the clustered index. If not, InnoDB internally assigns a six-byte unique ID to every row and uses that as the clustered index. (don’t let it generate a useless one for you). All indexes are B-trees. In InnoDB, the primary key’s leaf nodes are the data. Secondary indexes have a pointer to the data at their leaf nodes.

Index Structure

  • B-Tree Indexes – (balanced tree indexes, a tree structure that will never become lopsided as new nodes are added and removed. It gives us O(log n) performance for single-record lookup). Unlike binary tree, B-trees have many keys per node and don’t grow tall or deep as quick as a binary tree. It is very good for range-based queries as well. For example, for quey below, server simply finds the first Ray record and last Robert record. It then knows everything in between are also matches. The same is true of virtually any query that involves understanding the range of values (>, <, MIN, MAX, BETWEEN xx AND yy)

SELECT * FROM user_info WHERE last_name BETWEEN ‘Ray’ AND ‘Robert’

  • Hash Indexes – It gives very fast lookups O(1) but it is less flexible and predictable than other indexes. First, the key is hashed and compare. So, the range-based queries cannot use it. Uniform distribution is the key here.

Index Limitation

  1. Index doesn’t work together with wildcard and regular expression search.
  2. If index selectively is like > 30% of rows (very low), table scan may be better.

 

Leave a comment Continue Reading →

Flex 3 – Advanced DataGrid

Since I am leading the reporting team in Adconion, I would like to spend sometime on the new component “Advanced Data Grid” from Flex 3. Being excited to know that this component has added many cool features on top of the DataGrid, I would like to find out whether it meets all my needs. Before I introduce it to my company, I would like to write an example to exercise its power first. The example below will demonstrate some documented features in Advanced Data Grid and some custom features that I would like in my project. To make the example prettier, I would use Yahoo Flex Skin.

Yahoo Flex Skin

yahooflexskin.JPG

Apart from providing you flex skin, Yahoo also shares with you some custom-made flex components. Among them, I would like to look into Yahoo Map once I have time. (http://developer.yahoo.com/flash/maps/examples.html). However, it is out of my scope today.

What I want to show you in my example
There are some cool features I want to try out in this example. Here is the list:

  1. Grouping
  2. Tree view with “Expand All” and “Shrink All” features.
  3. Multiple columns sorting
  4. Auto-completed search filter
  5. Smart resizing data grid
Leave a comment Continue Reading →