Operational databases are most commonly designed using normalized modeling, often using third-normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.Â
Example: Order-Management Schema for operational database
Data warehouses differ from operational databases in the way they are designed; they are optimized for efficient querying and not for updating. Data warehouses provide a read-only version of the data in the operational databases, which is optimized for querying. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas produced are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit to a web page, etc.
Example: Order-Management Dimension Schema
The key benefits of data warehouse are simplication and consolidation of data. It normally gathers data from different operational databases into single dimensional model for reporting and analysis purpose. On the other hand, dimensional modeling offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. The approaches we take to reduce the complexity are (1) We try to model one aspect of the system for each DM schema. (2) We can denormalize the schema to reduce number of joins.
ETL Process
Once you have a data schema for your warehouse, you’ll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse.
There are some areas you need to pay attention when you perform the ETL:
- During extraction, you will put a lot of strains to the operational database. To deal with this problem we can replicate a low-cost copy of the operational database on the warehouse machine before doing extraction. The SQL output of the extraction process can be a CSV file.
- Transformation can be computing summary data, converting postal code into geo-code (ie. lat and long) that powers”within X miles” queries. You can use Perl to do this job. The output of transformation may be another CSV file.
- Finally, you load the data into CSV into dimensional model. To speed up the load, in MySQL, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load, we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading via TRUNCATE command.
- You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it’s full we swap it into place with RENAME.This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
Quick Tips
- CSV format isn’t a standard. Use XML can solve character issue but it might not perform as well due to formatting overhead.
- Transform is not always needed. If not, use “SELECT … INTO TABLE” to provide a straight database-to-database extract-and-load.
- Incremental load is highly desirable. Use trigger can achieve that.
- Operational database uses MySQL’s InnoDB backend, providing referential integrity and transactions. However, we chose MySQL’s MyISAM backend for our warehouse for better performance as it is read-only and transactional feature is not needed.
- MySQL does not support for bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.
Attached Files:






































(4.75 out of 5)
1 Comment Received
Pingback & Trackback
Sorry the comment area are closed for non registered users