Introduction

Data is the most important asset to the companies. There are many topics related to data such as data visualization, data representation, data processing, data warehouse and data intelligence. This page will dedicate to provide you information related to these topics.

Level 1 - Basic skills

This covers the basic skills you need to be good at before diving into some advanced topics like data warehouse and data mining.

  • Understand key database concepts (Normalization and Join, View, Indexing, Replication, Transaction Management and Storage)
  • Write effective SQL
    • Outer join vs inner join
    • SQL Overview
    • How does database indexing work? B-Tree? Most indexes are too large to fit into memory, which means that they are going to be stored on disk. Since I/O is usually the most expensive thing you can do in a computer system, these indexes need to be stored in an I/O efficient way. A B-tree is a good way to do this. If we make the nodes the size of a physical I/O block, it would take one I/O to move to a lower depth in the tree. In the example below, an index was created on a first name kind of field. If every level were an I/O it would take 3 I/Os to find Mary (or any other leaf). However, if the index you choose has low cardinality, it may cause the index page to split. So, you may end up doing more IOs and pull out lots of records. B-Tree may not do much better than full table scan. 10% selectivity is the minimum selectivity necessary for a b-tree index to be helpful.(selectivity% = unique index/ total # of record * 100%). To handle low cardinality columns, you can either use a different kind of index (e.g. Bitmap index) or combine that column with another to make a highly selective composite index.
    • Composite indexes (sometimes called multicolumn indexes) are generally recommended if queries often contain the same columns joined with AND or if you have keys with skewed domain counts or low cardinality. The left-most column in a compound index should be the one that occurs most often in the queries. The order of the columns in the WHERE clause once had to match the order in the compound index; however, modern optimizers make that rule obsolete.
    • If you can satisfy your query by index value alone, you don’t need to fetch the data page from disk and it is much faster. If the table’s row size is big, using a covering index will dramatically increase query performance because the index will act like a small table (ie. covering index - the indexes that cover what users want). There are limitations, though. The DBMS will never use covering indexes when there are joins or groupings.
    • Clustering an index means storing data rows according to the index order. Searching on a clustered index is much faster. You can have only one clustered index per table; usually, the clustering index is the primary key. You don’t need to explicitly create indexes on primary key columns because the system creates them implicitly. To qualify as a primary key, the value of the columns should not be nullable and should be unique on the row. In order to make the primary key the clustering index, use CREATE TABLE. If you try to add it later using an ALTER TABLE statement, the system might not create it as a clustered index.
    • Index carries its cost. OLTP has operations like insert, delete and update would need to update the indexes as well. Therefore, we should index our table smartly.

DB Index Example

 

Level 2 - Data visualization

I use built-in Flex components like Chart, Grid, Pie to visualize data. There are some 3rd parties useful data visualization libraries out there for your exploration. Take a look at this.

Level 3 - Data Warehouse

  1. How data is stored in data warehouse?
    • Data is stored as dimensional model in DW (normally in star schema with fact and dimension tables). If the data volume is not too large, you may have all data in single db instance. Otherwise, you need to partition your data across different database server instances. In Greenplum, all tables are distributed, which means a table is divided into non-overlapping sets of rows or parts. Each part resides on a single database known as a segment. The parts are distributed across all of the available segments using a sophisticated hashing algorithm. Database administrators choose the hash key (one or more table columns) when defining the table.
  2. How to build a large scale data warehouse? I would suggest you to adopt the share-nothing architecture that can economically scale from terabytes to petabytes with massively parallel query performance. It is not the solution that Oracle RAC provides you because it is using shared-disk architecture.  From I have known, only Teradata, Netezza and Greenplum provide you this solution. Among them, Greenplum looks like the most cost effective. You may ask me why share-nothing is better. Let’s look at the RAC first:
    • Oracle’s scale-out story is RAC, which requires a big and expensive shared-disk infrastructure (i.e. SAN) for coordination. Problem is that SANs only have ~4GBytes/s bandwidth out, so I/O scalability is inherently capped.
    • Oracle is an OLTP warehouse at heart. It has never been good at running large parallel analytical queries. Anyone with experience trying to implement a 50TB Oracle warehouse will tell you that it takes rocket-science tuning to get it to behave.
    • Oracle RAC requires its database servers separated from the storage (i.e. SAN or Exadata Storage Servers), because each database server needs to be able to see all the data in order to function. This means that the processing is happening on an entirely different box than the one accessing the data in disk. For more detail, look at this article.
  3. Look into greenplum that built on top of many Postgresql dbs
    • Greenplum is a distributed database system. This means that the data stored in the database system is physically located on more than one database server (referred to as segments in Greenplum - Postgresql 8.2.5). These individual database servers are connected by a communications network (referred to as the interconnect in Greenplum - communicate via reliable UDP instead of TCP to remove the scalability limit of 1000 segment instances). An essential feature of a true distributed database is that users and client programs work as if they were accessing one single database on a local machine (in Greenplum, this entry-point database is referred to as the master - parses and optimizes query, then creates and dispatches the parallel query plan to segments where the user data sit). The fact that the database is distributed across several machines is seamless to the users of the system.
    • All database operations—such as table scans, joins, aggregations, and sorts—execute in parallel across the segments simultaneously. Each operation is performed on a segment database independent of the data associated with the other segment databases.
    • In order to harness the power of a cluster, query processing software had to evolve to take advantage of multiple disks, processors, and network links operating in parallel. To achieve this, the rows of a table were partitioned across multiple machines with separate disks, enabling parallel I/O scans of big tables. Basic relational query operators like selection, join, grouping and aggregation were reinvented to run in parallel via similar partitioning schemes: the operations undertaken by each node in the cluster are the same, but the data being pumped through the fabric is automatically partitioned to allow each node to work on its piece of the operator independently. Finally, these architectures allowed multiple relational operators to operate at the same time, allowing pipeline parallelism in which an operator producing a data stream runs in parallel with the operator consuming it. 
    • On the other hand, Greenplum has leveraged the idea of MapReduce to achieve parallel processing behind the scene. Here is a great white paper about this.
    • Challenges to load large amount of data loaded into DW in a given maintenance window
    • Greenplum supports fast, parallel data loading with its external tables feature. External tables can also be accessed in ‘single row error isolation’ mode, allowing administrators to filter out bad rows during a load operation into a separate error table, while still loading properly formatted rows. By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system. Greenplum has demonstrated load rates in excess of 2 TB an hour.

Level 4 - Data Mining

Data mining is the most interesting area that I want to grasp. It gives you techniques to find out patterns in your data. With this, you can understand more your problem domain and even predict/ forecast what is next. That is the intelligence that many companies are trying to achieve. Here I will provide you some good articles that could help you to step into this field.

  1. Overview of data mining techniques
  2. Statistical Data mining tutorial by Andrew Moore

Based on practical, real-world experience a CRISP-DM (CRoss Industry Standard Process for Data Mining) has been defined, by the consortium of companies which applied data mining from the days of its infancy.

 

If you are interested in the full process of data mining, you can read this. Here my focus is on “Modeling”.

  1. Classification - Rule induction methods, Decision tree, Neural network, K-nearest neighbors, Case based reasoning
  2. Prediction - Regression analysis, Regression trees, Neural networks, K-nearest neighbors
  3. Dependency analysis - Correlation analysis, Regression analysis, Association Rules, Bayesian networks, Inductive logic programming
  4. Data description and summarization - statistical technique, OLAP
  5. Segmentation or clustering - Clustering techniques, Neural networks, Visualization method.

Level 5 - Operational Analytics

Fully automated decision support, which uses a combination of business rule engines, rules-driven workflow engines, and custom code, takes the human factor completely out of the exploration-insight-action cycle of business intelligence (BI). By doing so, companies improve efficiency by making actions faster, more consistent, and less error-prone.

Figure 1: Automated vs. manual BI architectures.

Reference

Below are some interesting references related to database and data intelligence

  1. Business Intelligence: Bringing Analytics into Operations

 

 

Attached Files:

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • bodytext
  • del.icio.us
  • Reddit
  • Technorati
  • NewsVine
  • Slashdot
  • SphereIt
  • YahooMyWeb
  • BlogMemes
  • Spurl
  • E-mail this story to a friend!
  • Facebook
  • Furl
  • Google
  • Print this article!