Host-based, single-stage (LAN), LAN-based, multi-stage, stationary, distributed, and virtual data warehouses.
There are several different types of data warehouses, which are described below.
Host-Based Data Warehouses
There are two types of host-based data warehouses which can be implemented:
- Host-based mainframe warehouses, which reside on high-volume databases and are supported by robust, reliable, high-capacity platforms such as IBM System/390, Unisys, and Data General/Sequent systems, and by databases such as Sybase, Oracle, Informix, and DB2.
- Host-based LAN data warehouses, where data delivery can be handled either centrally or from the workgroup environment. The size of the data warehouse database depends on the platform.
Data extraction and transformation tools allow the automated extraction and cleaning of data from production systems. It is not advisable to give query tools direct access to these production systems, for the following reasons:
- A heavy load of complex warehousing queries could have a harmful impact on the mission-critical transaction processing (TP) applications.
- These TP systems have evolved with database designs tuned for transaction throughput. In general, a database is designed for either optimal query processing or optimal transaction processing. A complex business query requires the joining of many normalized tables, so performance is usually poor and the query constructs are complex.
- There is no assurance that data in two or more production systems will be consistent.
Host-Based (MVS) Data Warehouses
Data warehouses that reside on large-volume databases on MVS are the host-based type of data warehouse. Often the DBMS is DB2, with a wide variety of original sources for legacy data, including VSAM, DB2, flat files, and Information Management System (IMS).
Before embarking on designing, building, and implementing such a warehouse, some further considerations must be given, because:
- Such databases generally have very high volumes of data storage.
- Such warehouses may require support for both MVS and customer-based report and query facilities.
- These warehouses have complicated source systems.
- Such systems require continuous maintenance, since they are also used for mission-critical purposes.
To build such data warehouses successfully, the following phases are generally followed:
- Unload Phase: Selecting and scrubbing the operational data.
- Transform Phase: Translating the data into an appropriate form and defining the rules for accessing and storing it.
- Load Phase: Moving the records directly into DB2 tables, or into a particular file for moving them into another database or a non-MVS warehouse.
An integrated metadata repository is central to any data warehouse environment. Such a facility is required for documenting data sources, data translation rules, and user access to the warehouse. It provides a dynamic link between the multiple source databases and the DB2 database of the data warehouse.
A metadata repository is necessary to design, build, and maintain data warehouse processes. It should describe what data exists in both the operational system and the data warehouse, where the data is located, how the operational data maps to the warehouse fields, and the end-user access techniques. Query, reporting, and maintenance facilities are also indispensable for such a data warehouse, for example an MVS-based query and reporting tool for DB2.
Host-Based (UNIX) Data Warehouses
Oracle and Informix RDBMSs provide the facilities for such data warehouses. Both of these databases can extract information from MVS-based databases as well as from a large number of other UNIX-based databases. These warehouses follow the same phases as the host-based MVS data warehouses. Data can also be drawn from different network servers, since file attribute consistency is common across the internetwork.
LAN-Based Workgroup Data Warehouses
A LAN-based workgroup warehouse is an integrated structure for building and maintaining a data warehouse in a LAN environment. In this warehouse, we can extract information from a variety of sources and support multiple LAN-based warehouses. Commonly chosen warehouse databases include the DB2 family, Oracle, Sybase, and Informix. Other databases that can also be included, though less frequently, are IMS, VSAM, flat files, MVS, and VM.
Designed for the workgroup environment, a LAN-based workgroup warehouse is optimal for any business organization that wants to build a data warehouse, often called a data mart. This type of data warehouse generally requires a minimal initial investment and little technical training.
Data Delivery: With a LAN-based workgroup warehouse, a customer needs minimal technical knowledge to create and maintain a store of data that is customized for use at the department, business-unit, or workgroup level. A LAN-based workgroup warehouse ensures the delivery of information from corporate resources by providing transparent access to the data in the warehouse.
Host-Based Single Stage (LAN) Data Warehouses
Within a LAN-based data warehouse, data delivery can be handled either centrally or from the workgroup environment, so business groups can meet their data processing needs without burdening centralized IT resources, enjoying the autonomy of their data mart without compromising overall data integrity and security in the enterprise.
Limitations
LAN-based warehousing solutions are generally limited by DBMS and hardware scalability.
Many LAN-based enterprises have not implemented adequate job scheduling, recovery management, organized maintenance, and performance monitoring to provide robust warehousing solutions.
Often these warehouses are dependent on other platforms for source records. Building an environment with data integrity, recoverability, and security requires careful design, planning, and implementation. Otherwise, synchronization of transformations and loads from sources to the server could cause innumerable problems.
A LAN-based warehouse provides data from many sources while requiring a minimal initial investment and technical knowledge. A LAN-based warehouse can also use replication tools for populating and updating the data warehouse. This type of warehouse can include business views, histories, aggregations, versioning, and heterogeneous source support, such as:
- DB2 Family
- IMS, VSAM, Flat File [MVS and VM]
A LAN-based warehouse is frequently driven by a single store and feeds existing DSS applications, enabling business users to locate data in their data warehouse. The LAN-based warehouse can support business users with a complete data-to-information solution. It can also share metadata, with the ability to catalog business data and make it available to anyone who needs it.
Multi-Stage Data Warehouses
This refers to multiple stages of transformation and aggregation before the data is analyzed. In other words, the data is staged multiple times before the load operation into the data warehouse: data is first extracted from the source systems into a staging area, then transformed and loaded into the data warehouse, and finally moved into departmentalized data marts.
This configuration is well suited to environments where end users in numerous capacities require access both to summarized data for up-to-the-minute tactical decisions and to summarized, cumulative records for long-term strategic decisions. Both the Operational Data Store (ODS) and the data warehouse may reside on host-based or LAN-based databases, depending on volume and usage requirements. These include DB2, Oracle, Informix, IMS, flat files, and Sybase.
Usually, the ODS stores only the most up-to-date records, while the data warehouse stores the historical evolution of the records. At first, the information in both databases will be very similar; for example, the records for a new client will look the same. As changes to the client record occur, the ODS will be refreshed to reflect only the most current data, whereas the data warehouse will contain both the historical data and the new information. Thus the volume requirements of the data warehouse will exceed those of the ODS over time; a ratio of 4 to 1 is not unusual in practice.
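To make the contrast concrete, here is a minimal SQL sketch, assuming hypothetical staging_customers, ods_customers, and dw_customers tables: the ODS merge keeps only the current version of each record, while the warehouse load appends a timestamped row and so accumulates history.

    -- ODS: keep only the most current version of each customer record
    MERGE INTO ods_customers o
    USING staging_customers s
    ON (o.customer_id = s.customer_id)
    WHEN MATCHED THEN
      UPDATE SET o.address = s.address, o.status = s.status
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, status)
      VALUES (s.customer_id, s.address, s.status);

    -- Data warehouse: append the change as a new, timestamped row, preserving history
    INSERT INTO dw_customers (customer_id, address, status, effective_date)
    SELECT customer_id, address, status, SYSDATE
    FROM   staging_customers;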
Stationary Data Warehouses
In this type of data warehouse, the data is not moved from the sources, as shown in the figure.
Instead, the customer is given direct access to the data. For many organizations, infrequent access, volume issues, or corporate necessities dictate such an approach. This scheme does generate several problems for the customer, such as:
- Identifying the location of the information for the users
- Providing clients the ability to query different DBMSs as if they were all a single DBMS with a single API.
- Impacting performance since the customer will be competing with the production data stores.
Such a warehouse will need highly specialized and sophisticated 'middleware', possibly presenting a single interface to the client. A facility to display the extracted records to the user before report generation may also be essential. An integrated metadata repository becomes absolutely essential in this environment.
Distributed Data Warehouses
The concept of a distributed data warehouse covers two types of distributed data warehouses and their variations: local enterprise warehouses, which are distributed throughout the enterprise, and a global warehouse, as shown in the figure.
Characteristics of local data warehouses:
- Activity appears at the local level
- Bulk of the operational processing
- Local site is autonomous
- Each local data warehouse has its unique architecture and contents of data
- The data is unique and of prime importance to that locality only
- Majority of the record is local and not replicated
- Any intersection of data between local data warehouses is circumstantial
- Local warehouse serves different technical communities
- The scope of the local data warehouse is limited to the local site
- Local warehouses also include historical data and are integrated only within the local site.
Virtual Data Warehouses
A virtual data warehouse is created in the following stages:
- Installing a set of data access, data dictionary, and process management facilities.
- Training end-clients.
- Monitoring how DW facilities will be used
- Based upon actual usage, creating a physical data warehouse to serve the high-frequency requests.
In this strategy, end users are allowed to access operational databases directly, using whatever tools are enabled on the data-access network. This approach provides the greatest flexibility as well as the minimum amount of redundant data that must be loaded and maintained. A data warehouse is a great idea, but it is difficult to build and requires investment; why not use a cheap and fast approach by eliminating the transformation steps and the repositories for metadata and other databases? This approach is termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of data:
- A data dictionary including the definitions of the various databases.
- A description of the relationship between the data components.
- A description of how users will interact with the system.
- The algorithms and business rules that describe what to do and how to do it.
Disadvantages
- Since queries compete with production transactions, performance can be degraded.
- There is no metadata, no summary data, and no individual DSS (Decision Support System) integration or history. All queries must be repeated, causing an additional burden on the system.
- There is no refreshing process, causing the queries to be very complex.
In our previous posts we learned about Data Warehousing objects, different kinds of Data Warehouse schemas, and Data Warehouse basics. Now it is time to learn how to build or design a Data Warehouse.
Designing or building a Data Warehouse can be done following either one of two approaches. These approaches are notably known as:
* The Top-Down Approach
* The Bottom-Up Approach
These approaches were defined by two pioneers of Data Warehousing, namely Ralph Kimball and Bill Inmon.
The Top-Down Approach
This approach was proposed by Bill Inmon, as he stated "Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form".
In short, Bill Inmon advocated a "dependent data mart structure".
The image above shows how the Top-Down model works.
Here are the steps:
* The data is extracted from different (or the same) data sources. This data is loaded into the staging area, where it is validated and consolidated to ensure accuracy, and then pushed to the Enterprise Data Warehouse (EDW).
* Detailed data is regularly extracted from the EDW and temporarily hosted in the staging area for aggregation and summarization, and then extracted and loaded into the data marts.
* Once the aggregation and summarization of the data are complete, the data marts apply fresh transformations on it so that the data is in sync with the structures defined for each data mart.
The Bottom-Up Approach
This approach was proposed by Ralph Kimball, who stated: "A data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model."
* The bus structure, as you can see above, contains all the common elements used by data marts, such as conformed dimensions, measures, etc.
Basically, the Kimball model reverses the Inmon model, i.e. data marts are directly loaded with the data from the source systems, and the ETL process is then used to load the Data Warehouse.
Here are the steps:
* The data flow in the bottom-up approach starts with the extraction of data from the operational databases into the staging area, where it is processed and loaded into the EDW.
* The data in the EDW is refreshed or replaced by the fresh data being loaded. After the EDW is refreshed, the current data is once again extracted into the staging area, where transformations are applied to fit the data mart structure. The data is then extracted from the data mart to the staging area, aggregated, summarized, and so on, loaded into the EDW, and finally made available to the end user for analysis.
Designing Data warehouse Database
This chapter describes the physical design of a data warehousing environment, and includes the following topics:
· Moving from Logical to Physical Design
· Physical Design
Moving from Logical to Physical Design
Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Oracle Designer before building your data warehouse. Physical design is the creation of the database with SQL statements.
During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements enables Oracle Database to take advantage of partition pruning, a way of narrowing a search before performing it.
Physical Design
During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another.
Figure 3-1 illustrates a graphical way of distinguishing between logical and physical designs.
Figure 3-1 Logical Design Compared with Physical Design
Description of "Figure 3-1 Logical
Design Compared with Physical Design"
During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map:
· Entities to tables
· Relationships to foreign key constraints
· Attributes to columns
· Primary unique identifiers to primary key constraints
· Unique identifiers to unique key constraints
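As a minimal sketch of this mapping, the following SQL uses hypothetical customers and sales entities; the table, column, and constraint names are illustrative only.

    -- Entity -> table, attributes -> columns, primary UID -> primary key,
    -- unique identifier -> unique key constraint
    CREATE TABLE customers (
      customer_id   NUMBER        CONSTRAINT customers_pk PRIMARY KEY,
      customer_name VARCHAR2(100) NOT NULL,
      city          VARCHAR2(50),
      CONSTRAINT customers_name_city_uk UNIQUE (customer_name, city)
    );

    -- Relationship -> foreign key constraint
    CREATE TABLE sales (
      sale_id     NUMBER       CONSTRAINT sales_pk PRIMARY KEY,
      customer_id NUMBER       CONSTRAINT sales_customer_fk
                               REFERENCES customers (customer_id),
      sale_date   DATE         NOT NULL,
      amount      NUMBER(10,2)
    );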
Physical Design Structures
Once you have converted your logical design to a physical one, you will need to create some or all of the following structures:
· Tablespaces
· Tables and Partitioned Tables
· Views
· Integrity Constraints
· Dimensions
Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement:
· Indexes and Partitioned Indexes
· Materialized Views
Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.
Tablespaces should be separated by usage differences. For example, tables should be separated from their indexes, and small tables should be separated from large tables. Tablespaces should also represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and recovery or for the transportable tablespaces mechanism, the logical business design affects availability and maintenance operations.
You can now use ultra large data files, a significant improvement in very large databases.
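For illustration, a hedged sketch of tablespace creation follows; the tablespace names, file paths, and sizes are hypothetical.

    -- Separate tablespaces for fact data and its indexes
    CREATE TABLESPACE sales_data
      DATAFILE '/u01/oradata/dw/sales_data01.dbf' SIZE 10G;
    CREATE TABLESPACE sales_index
      DATAFILE '/u02/oradata/dw/sales_index01.dbf' SIZE 5G;

    -- A bigfile tablespace uses a single, ultralarge datafile
    CREATE BIGFILE TABLESPACE sales_big
      DATAFILE '/u03/oradata/dw/sales_big01.dbf' SIZE 100G;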
Tables and Partitioned Tables
Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse.
Using partitioned tables instead of nonpartitioned ones addresses the key problem of supporting very large data volumes by allowing you to divide them into smaller and more manageable pieces. The main design criterion for partitioning is manageability, though you will also see performance benefits in most cases because of partition pruning or intelligent parallel processing. For example, you might choose a partitioning strategy based on a sales transaction date and a monthly granularity. If you have four years' worth of data, you can delete a month's data as it becomes older than four years with a single, fast DDL statement and load new data while only affecting 1/48th of the complete table. Business questions regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total volume.
Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you partition based on transaction dates in a data warehouse. For example, each month, one month's worth of data can be assigned its own partition.
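A minimal sketch of monthly range partitioning on a hypothetical sales table is shown below; the partition names and dates are illustrative.

    CREATE TABLE sales (
      sale_id     NUMBER,
      customer_id NUMBER,
      sale_date   DATE,
      amount      NUMBER(10,2)
    )
    PARTITION BY RANGE (sale_date) (
      PARTITION sales_2020_01 VALUES LESS THAN (TO_DATE('2020-02-01','YYYY-MM-DD')),
      PARTITION sales_2020_02 VALUES LESS THAN (TO_DATE('2020-03-01','YYYY-MM-DD')),
      PARTITION sales_2020_03 VALUES LESS THAN (TO_DATE('2020-04-01','YYYY-MM-DD'))
      -- ... one partition per month, 48 in total for four years of data
    );

    -- Ageing out the oldest month touches only 1/48th of the table
    ALTER TABLE sales DROP PARTITION sales_2020_01;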
Table Compression
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
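As a hedged sketch (reusing the hypothetical sales table above), compression can be declared at table creation or applied to older, rarely updated partitions.

    -- Create a compressed, heap-organized table
    CREATE TABLE sales_history (
      sale_id   NUMBER,
      sale_date DATE,
      amount    NUMBER(10,2)
    ) COMPRESS;

    -- Compress an existing, rarely updated partition by rebuilding it
    ALTER TABLE sales MOVE PARTITION sales_2020_02 COMPRESS;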
Views
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
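For example, a simple view over the hypothetical sales table used earlier stores only the query definition, not the data:

    CREATE VIEW quarterly_sales AS
      SELECT customer_id,
             TO_CHAR(sale_date, 'YYYY-"Q"Q') AS sales_quarter,
             SUM(amount)                     AS total_amount
      FROM   sales
      GROUP  BY customer_id, TO_CHAR(sale_date, 'YYYY-"Q"Q');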
Integrity Constraints
Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed. In data warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need space in the database. These constraints are in the form of the underlying unique index.
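A brief sketch, again using the hypothetical sales and customers tables: the NOT NULL constraint is enforced, while the foreign key is declared only so the optimizer can trust it for query rewrite (RELY) without the cost of enforcement.

    -- Enforced NOT NULL constraint, common in warehouses
    ALTER TABLE sales MODIFY (customer_id NOT NULL);

    -- Foreign key declared for query rewrite only: trusted (RELY),
    -- not enforced (DISABLE) and not checked against existing rows (NOVALIDATE)
    ALTER TABLE sales
      ADD CONSTRAINT sales_customer_rely_fk
      FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
      RELY DISABLE NOVALIDATE;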
Indexes and Partitioned Indexes
Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments. Bitmap indexes are optimized index structures for set-oriented operations. Additionally, they are necessary for some optimized data access methods such as star transformations.
Indexes are just like tables in that you can partition them, although the partitioning strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the data warehouse during refresh and improves query performance.
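A short sketch on the hypothetical partitioned sales table: a bitmap index on a fact-table foreign key column (useful for star transformation) and a local B-tree index on the partitioning key.

    -- Bitmap index on a fact-table foreign key column; LOCAL aligns it with the table partitions
    CREATE BITMAP INDEX sales_cust_bix ON sales (customer_id) LOCAL;

    -- Local B-tree index on the partitioning key
    CREATE INDEX sales_date_ix ON sales (sale_date) LOCAL;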
Materialized Views
Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes in that they are used transparently and improve performance.
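A minimal sketch of a materialized view over the hypothetical sales table; ENABLE QUERY REWRITE lets the optimizer answer matching aggregate queries from the precomputed result.

    CREATE MATERIALIZED VIEW monthly_sales_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
      SELECT customer_id,
             TRUNC(sale_date, 'MM') AS sales_month,
             SUM(amount)            AS total_amount
      FROM   sales
      GROUP  BY customer_id, TRUNC(sale_date, 'MM');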
Dimensions
A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and country.
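For instance, assuming a hypothetical locations table with city, state, and country columns, the hierarchy could be declared as follows:

    CREATE DIMENSION geography_dim
      LEVEL city    IS (locations.city)
      LEVEL state   IS (locations.state)
      LEVEL country IS (locations.country)
      HIERARCHY geo_rollup (
        city CHILD OF
        state CHILD OF
        country
      );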
Database Design Methodology for Data Warehouses
There are two different methodologies normally followed when designing a Data Warehouse solution and based on the requirements of your project you can choose which one suits your particular scenario. These methodologies are a result of research from Bill Inmon and Ralph Kimball.
Bill Inmon - Top-down Data Warehouse Design Approach
Bill Inmon is sometimes also referred to as the "father of data warehousing"; his design methodology is based on a top-down approach and defines data warehouse in these terms
· Subject oriented - The data in a data warehouse is categorized on the basis of the subject area and hence it is "subject oriented".
· Integrated - Data gets integrated from different disparate data sources and hence universal naming conventions, measurements, classifications and so on used in the data warehouse. The data warehouse provides an enterprise consolidated view of data and therefore it is designated as an integrated solution.
· Non-volatile - Once the data is integrated\loaded into the data warehouse it can only be read. Users cannot make changes to the data and this practice makes the data non-volatile.
· Time Variant - Finally data is stored for long periods of time quantified in years and has a date and timestamp and therefore it is described as "time variant".
Bill Inmon saw a need to integrate data from different OLTP systems into a centralized repository (called a data warehouse) with a so-called top-down approach. Bill Inmon envisions a data warehouse at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI), business analytics, and business management capabilities.
Ralph Kimball - Bottom-up Data Warehouse Design Approach
Ralph Kimball is a renowned author on the subject of data warehousing. His design methodology is called dimensional modeling or the Kimball methodology. This methodology focuses on a bottom-up approach, emphasizing the value of the data warehouse to the users as quickly as possible. In his vision, a data warehouse is a copy of the transactional data specifically structured for analytical querying and reporting in order to support the decision support system. As per his methodology, data marts are first created to provide reporting and analytical capabilities for specific business\functional processes, and later these data marts can be unioned together to create a comprehensive enterprise data warehouse. The bottom-up approach focuses on one business process at a time, so the return on investment can arrive as quickly as the first data mart is created. However, if not carefully planned, you might lose the big picture of the enterprise data warehouse by missing some dimensions or by creating redundant dimensions when you are too focused on an individual business process.
For a person who wants to make a career in the Data Warehouse and Business Intelligence domain, I would recommend studying Bill Inmon's books (Building the Data Warehouse and DW 2.0: The Architecture for the Next Generation of Data Warehousing) and Ralph Kimball's book (The Microsoft Data Warehouse Toolkit).
Data Warehousing design Using Oracle
Creating an Oracle Data Warehouse
Oracle Warehouse Builder is a flexible tool that enables you to design and deploy various types of data management strategies, including traditional data warehouses. This chapter provides a brief introduction to the basic, minimal steps for creating an Oracle data warehouse. It provides a starting point for using Warehouse Builder for the first time and serves as a road map to the documentation.
This chapter includes the following topics:
· Understanding the Basic Concepts
· General Steps for Creating an Oracle Data Warehouse
· About Locations
· About Modules
Understanding the Basic Concepts
Oracle Warehouse Builder is comprised of a set of graphical user interfaces to assist you in implementing complex data system designs. Your designs are saved as metadata in a centralized repository.
The centralized repository, known as the Warehouse Builder repository, is hosted on an Oracle Database. The Design Center is the interface that provides a visual representation of the Warehouse Builder repository. Use the Design Center to import source objects such as tables and views, design ETL processes such as mappings, and ultimately design the target warehouse.
A mapping is an object in which you define the flow of data from sources to targets. Based on a mapping design, Warehouse Builder generates the code required to implement the ETL logic. Warehouse Builder can generate PL/SQL, SQL*Loader, or ABAP code for mappings.
After you complete the design of a mapping, for example, and prompt Warehouse Builder to generate the code, the next step is to deploy the mapping. Deployment is the process of copying the relevant metadata and code you generated in the Design Center to a target schema. The target schema is generically defined as the database which will execute the ETL logic you designed in the Design Center. Specifically, in a traditional data warehousing implementation, the data warehouse is the target schema and the two terms are interchangeable.
Figure 2-1 illustrates the deployment process. For the purposes of this illustration, the target schema and the repository exist on the same Oracle Database; however, in practice, target schemas often exist on separate databases. To deploy design objects and subsequently execute the generated code, use the Control Center Manager, which is the client interface that interacts with the target schema through the control center service.
As previously noted, the Design Center is the primary user interface. It is also a centralized interface in that you can launch from it all the client based tools, including the Control Center Manager. A secondary user interface is the web-based Repository Browser. In addition to browsing design metadata and auditing execution data, you can view and create reports. You can also perform limited lineage impact analysis in the Repository Browser. However, the Design Center has more advanced, interactive capabilities for lineage and impact analysis.
Figure 2-1 Warehouse Builder Components
General Steps for Creating an Oracle Data Warehouse
Use Warehouse Builder to create a data warehouse in the following recommended stages:
· Before You Begin
· Stage 1: Preparing the Warehouse Builder Design Center
· Stage 2: Importing the Source Metadata
· Stage 3: Designing the Oracle Data Warehouse
· Stage 4: Deploying the Design and Implementing the Data Warehouse
Before you can use any of the Warehouse Builder client components, first ensure access to a Warehouse Builder repository, which is an Oracle Database schema that stores the metadata for the system you design.
To begin using Warehouse Builder, take the following steps:
1. Install the Warehouse Builder software and create a repository as described in the Oracle Warehouse Builder Installation and Administration Guide.
If an administrator has previously completed the server and client installation, contact that person for the connection information required to log on to the repository.
2. Launch the Design Center, shown in Figure 2-2.
On a Windows platform, from the Start menu, select Programs. Select the Oracle home in which Warehouse Builder is installed, then Warehouse Builder, and then Design Center.
On a Linux platform, launch owbclient.sh located in the owb/bin/unix directory in the Oracle home for Warehouse Builder.
Figure 2-2 shows the Design Center with the top level folders in each of its 3 explorers expanded.
Stage 1: Preparing the Warehouse Builder Design Center
To integrate data and design a data warehouse, you primarily utilize the Project Explorer and the Connection Explorer shown in Figure 2-2.
In the Project Explorer, Warehouse Builder creates a single default project, MY_PROJECT. As a project, it contains nodes for each type of design object that you can create or import.
The Connection Explorer is the window you use to establish connections from the Warehouse Builder repository to databases, data files, and applications.
To prepare the Design Center, complete the following steps:
1. Adjust the client preference settings as desired or accept the default preference settings and proceed to the next step.
From the main menu in the Design Center, select Tools and then Preferences.
As a new user, you may be interested in setting the Environment Preferences, the locale under Appearance Preferences, and the naming mode under Naming Preferences. For information on all the preferences, see Setting Preferences.
2. Connect to the source and target data objects.
You establish these connections by defining locations in the Connection Explorer. Expand the Location node and the nodes within it to gain a general understanding of the types of source and targets you can access from Warehouse Builder.
To create a location, right-click the appropriate node and select New. Fill in the requested connection information and select Test Connection. In this step, you merely establish connections to sources and targets. You do not move data or metadata until subsequent steps.
For more information about locations, see "About Locations".
3. Organize the design environment.
If you are satisfied with the default project, MY_PROJECT, continue with the next step.
Alternatively, you can define more projects at any time. Each project you define is organized in the same fashion as shown in MY_PROJECT with nodes for databases, files, applications, and so on. For a different organization, consider creating optional collections as described in "Organizing Design Objects into Projects and Collections".
4. Identify the Oracle Database target.
Although you can use a flat file as a target, the most common and recommended scenario is to use an Oracle schema as the data warehouse target as described in these steps.
To define the Oracle target, begin by creating a module. Modules are grouping mechanisms in the Project Explorer that correspond to locations in the Connection Explorer. The Oracle target module is the first of several modules you create in Warehouse Builder.
In the Project Explorer, expand the Databases node. Right-click Oracle and select New. The Create Module wizard displays. Set the module type to target and specify whether the module will be used in development, quality assurance, or production. This module status is purely descriptive and has no bearing on subsequent steps you take.
When you complete the wizard, the target module displays with nodes for mappings, transformations, tables, cubes and the various other types of objects you utilize to design the target warehouse.
5. Create a separate Oracle module for the data sources. (Optional)
At your discretion, you can either create another Oracle module to contain Oracle source data or simply proceed to the next step.
6. Identify the execution environment.
Under the Connection Explorer, notice the Control Centers node. A control center is an Oracle Database schema that manages the execution of the ETL jobs you design in the Design Center in subsequent steps.
During installation, Warehouse Builder creates the DEFAULT_CONTROL_CENTER schema on the same database as the repository.
If you choose to utilize the default execution environment, continue to the next step. Alternatively, you can define new control centers at any time. For more information and instructions, see "Deploying to Target Schemas and Executing ETL Logic" .
7. Prepare development, test, and production environments. (Optional)
Thus far, these instructions describe the creation of a single project corresponding to a single execution environment. You can, however, reuse the logical design of this project in different physical environments such as testing or production environments.
You can deploy a single data system to several different host systems or to various environments, as described in "Creating Additional Configurations".
Stage 2: Importing the Source Metadata
1. Import metadata from the various data sources.
In the Project Explorer, select a node and determine the locations from which you intend to ultimately extract data. Now create a module for each relevant location. After you create a module, right-click the module and select Import to extract metadata from the associated location. Warehouse Builder launches a wizard to guide you through the process of importing data.
For an example and additional information on importing data objects, see "Identifying Data Sources and Importing Metadata".
2. For the metadata you imported, profile its corresponding data. (Optional)
Before continuing to the next step, consider using the data profiling option to ensure data quality as described in "Understanding Data Quality Management".
Stage 3: Designing the Oracle Data Warehouse
1. Create and design the data objects for the Oracle target module.
In previous steps, you may have already imported existing target objects. For new target objects, design any of the dimensional or relational objects listed in Table 4-1.
To create data objects, you can either launch the appropriate wizard or use the Data Object Editor. To use a wizard, right-click the node for the desired object and select New. After using a wizard, you may want to modify the object in the editor. In that case, right-click the object and select Open Editor.
For additional information, see Chapter 4, "Designing Target Schemas".
2. As you design your data warehouse, be sure to frequently validate the design objects.
You can validate objects as you create them, or validate a group of objects together. In the Project Explorer, select one or more objects or modules, then click the Validate icon.
Examine the messages in the Validation Results window. Correct any errors and try validating again.
To redisplay the most recent validation results at a later time, choose Validation Messages from the View menu.
3. When satisfied with the design of the target objects, generate the code.
Generation produces a DDL or PL/SQL script to be used in subsequent steps to create the data objects in the target schema.
In the Data Editor, you can generate code for a single object by clicking the Generate icon.
In the Project Explorer, select one or more objects or modules, then click the Generate icon. Examine the messages in the Generation Results window. To redisplay the most recent generation results at a later time, choose Generated Scripts from the View menu.
You can save the generated script as a file and optionally deploy it outside Warehouse Builder.
4. Design mappings that define the flow of data from a source to target objects.
In the Project Explorer, expand the Oracle target module, right-click the Mappings node and select New.
The Mapping Editor enables you to define the flow of data visually. You can drag-and-drop operators onto the canvas, and draw lines that connect the operators.
Follow the Instructions for Defining Mappings, concluding with generating the code for the mapping.
5. To manage dependencies between mappings, refer to "Designing Process Flows".
Stage 4: Deploying the Design and Implementing the Data Warehouse
Recall that deployment is the process of copying the relevant metadata and code you generated in the Design Center to a target schema. This step is necessary to enable the target schema to execute ETL logic such as mappings.
To deploy and execute the data warehouse design, complete the following steps:
1. Deploy objects from either the Design Center or Control Center Manager.
In this step, you define the objects in the target schema. You need do this only once.
The simplest approach is to deploy directly from the Design Center by selecting an object and clicking the Deploy icon. In this case, Warehouse Builder deploys the objects with the default deployment settings.
Alternatively, if you want more control and feedback on how Warehouse Builder deploys objects, select Tools from the Design Center menu and select Control Center Manager.
Whether you deploy objects from the Design Center or the Control Center Manager, be sure to deploy all associated objects. For example, when deploying a mapping, also deploy the target data objects such as tables that you defined and any associated process flows or other mappings.
For more information, see "Deploying to Target Schemas and Executing ETL Logic".
2. Execute the ETL logic to populate the target warehouse.
In this step, you move data for the first time. Repeat this step each time you want to refresh the target with new data.
You have two options for executing the ETL logic in mappings and process flows. You can create and deploy a schedule as described in "Process for Defining and Using Schedules". Or you can execute jobs manually as described in "Starting ETL Jobs".
About Locations
Locations enable you to store the connection information to the various files, databases, and applications that Warehouse Builder accesses for extracting and loading data. Similarly, locations also store connection information to ETL management tools and Business Intelligence tools. For a detailed listing, see "Supported Sources and Targets".
Oracle Database locations and file locations can be sources, targets, or both. For example, you can use a location as a target for storing temporary or staging tables. Later, you can reuse that location as a source to populate the final target schema.
In some cases, such as with flat file data, the data and metadata for a given source are stored separately. In such a case, create a location for the data and another for the metadata.
About Locations, Passwords, and Security
Considering that all Warehouse Builder users can view connection information in a location, note that the passwords are always encrypted. Furthermore, Warehouse Builder administrators can determine whether or not to allow locations to be shared across users and persisted across design sessions. By default, locations are not shared or persisted.
To implement either of these scenarios, see the topic "Managing Passwords in Warehouse Builder" in the Oracle Warehouse Builder Installation and Administration Guide.
Automatically Created Locations
Warehouse Builder automatically creates an Oracle location for each user you register in the repository. For example, if you register user SCOTT, then the repository includes an Oracle location named SCOTT_LOCATION that stores the connection details for the SCOTT schema.
During installation, Warehouse Builder creates an Oracle location named OWB_REPOSITORY_LOCATION. This location provides the connection details to the Warehouse Builder repository. You cannot rename or delete the repository location. Only a database administrator can change the password. To prevent unauthorized access to the database administrator password, all users are restricted from deploying to the repository location.
To delete a location, right-click the location in the Connection Explorer and select Delete. If the delete option is not available here, this indicates that the location has been registered in a control center and is likely being utilized. Verify that the location is not in use, unregister the location in the control center, and then you can delete the location from the Connection Explorer.
About Modules
Modules are grouping mechanisms in the Project Explorer that correspond to locations in the Connection Explorer. A single location can correspond to one or more modules. However, a given module can correspond to only a single location at a time.
The association of a module to a location enables you to perform certain actions more easily in Warehouse Builder. For example, you can re-import metadata by reusing an existing module. Furthermore, when you deploy ETL processes in subsequent steps, modules enable you to deploy related objects together such as process flows.
To create a module:
1. Expand the Project Explorer until you find the node for the appropriate metadata type.
For example, if the source data is stored in an Oracle Database, then expand the Databases node to view the Oracle node. Or if the source data is in an SAP R/3 system, expand the applications node to view the SAP node.
2. Right-click the desired node and choose New.
The Create Module wizard opens. The wizard determines the correct integrator to use to enable access to the data store you selected.
3. Follow the prompts in the wizard.
The wizard prompts you to name and describe the module and associate it with a location.
During the course of using Warehouse Builder, you may need to associate a module with a new location. For example, assuming your production environment utilizes different locations than your development environment, you need to reassociate the modules.
To change the location associated with a module:
1. In the Project Explorer, select the module.
2. Click the Configure icon.
The Configuration Properties dialog box appears.
3. In the Identification folder, select a new value for the Locations property.
OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to it. This chapter covers the types of OLAP servers, operations on OLAP, and the differences between OLAP, statistical databases, and OLTP.
Types of OLAP Servers
We have four types of OLAP servers:
· Relational OLAP (ROLAP)
· Multidimensional OLAP (MOLAP)
· Hybrid OLAP (HOLAP)
· Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following:
· Implementation of aggregation navigation logic.
· Optimization for each DBMS back end.
· Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed data to be stored, while the aggregations are kept separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:
· Roll-up
· Drill-down
· Slice and dice
· Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
· By climbing up a concept hierarchy for a dimension
· By dimension reduction
The following diagram illustrates how roll-up works.
· Roll-up is performed by climbing up a concept hierarchy for the dimension location.
· Initially the concept hierarchy was "street < city < province < country".
· On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
· The data is grouped into countries rather than cities.
· When roll-up is performed, one or more dimensions from the data cube are removed.
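In SQL terms, a roll-up corresponds to grouping by a higher level of the hierarchy. A minimal sketch follows, assuming a hypothetical sales_cube table that carries both city and country columns:

    -- Before roll-up: aggregated by city
    SELECT city, SUM(amount) AS total_amount
    FROM   sales_cube
    GROUP  BY city;

    -- After roll-up: aggregated by country (the city level is removed)
    SELECT country, SUM(amount) AS total_amount
    FROM   sales_cube
    GROUP  BY country;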
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
· By stepping down a concept hierarchy for a dimension
· By introducing a new dimension.
The following diagram illustrates how drill-down works:
· Drill-down is performed by stepping down a concept hierarchy for the dimension time.
· Initially the concept hierarchy was "day < month < quarter < year."
· On drilling down, the time dimension is descended from the level of quarter to the level of month.
· When drill-down is performed, one or more dimensions from the data cube are added.
· It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works.
· Here Slice is performed for the dimension "time" using the criterion time = "Q1".
· It will form a new sub-cube by selecting one or more dimensions.
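Expressed as SQL against the same hypothetical sales_cube table, a slice simply fixes one dimension with a single predicate:

    -- Slice: fix the time dimension to Q1
    SELECT location, item, SUM(amount) AS total_amount
    FROM   sales_cube
    WHERE  sales_quarter = 'Q1'
    GROUP  BY location, item;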
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram, which shows a dice operation based on the following selection criteria involving three dimensions:
· (location = "Toronto" or "Vancouver")
· (time = "Q1" or "Q2")
· (item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram, which shows the pivot operation.
OLAP vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
Data Mining
There is a huge amount of data available in the Information Industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we would be able to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications −
· Market Analysis
· Fraud Detection
· Customer Retention
· Production Control
· Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains −
· Market Analysis and Management
· Corporate Analysis & Risk Management
· Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
Market Analysis and Management
Listed below are the various fields of market where data mining is used −
· Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
· Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
· Cross Market Analysis − Data mining performs association/correlations between product sales.
· Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
· Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.
· Providing Summary Information − Data mining provides us various multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
· Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.
· Resource Planning − It involves summarizing and comparing the resources and spending.
· Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. In fraudulent telephone calls, it helps to find the destination of the call, its duration, the time of day or week, etc. It also analyzes patterns that deviate from expected norms.
An Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Data mining techniques
Several major data mining techniques have been developed and used in data mining projects recently, including association, classification, clustering, prediction, sequential patterns, and decision trees. We will briefly examine these data mining techniques in the following sections.
Association
Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is the reason why the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify sets of products that customers frequently purchase together.
Retailers use the association technique to research customers' buying habits. Based on historical sales data, retailers might find that customers often buy crisps when they buy beer, and therefore they can place beer and crisps next to each other to save customers time and increase sales.
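A first step toward such association analysis can be sketched in plain SQL as a co-occurrence count, assuming a hypothetical transaction_items table with transaction_id and product_id columns; full association-rule mining (support, confidence, lift) is normally done with dedicated algorithms such as Apriori.

    -- Count how often two products appear in the same transaction
    SELECT a.product_id AS product_a,
           b.product_id AS product_b,
           COUNT(*)     AS times_bought_together
    FROM   transaction_items a
    JOIN   transaction_items b
      ON   a.transaction_id = b.transaction_id
     AND   a.product_id     < b.product_id
    GROUP  BY a.product_id, b.product_id
    ORDER  BY times_bought_together DESC;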
Classification
Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups. The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we develop software that can learn how to classify data items into groups. For example, we can apply classification in an application that, given all records of employees who left the company, predicts which current employees will probably leave in a future period. In this case, we divide the records of employees into two groups named "leave" and "stay", and then ask our data mining software to classify the employees into these groups.
Clustering
Clustering is a data mining technique that automatically groups objects with similar characteristics into meaningful or useful clusters. The clustering technique discovers the classes and puts objects in each class, while in the classification technique, objects are assigned to predefined classes. To make the concept clearer, we can take book management in a library as an example. In a library, there is a wide range of books on various topics. The challenge is how to keep those books in a way that readers can take several books on a particular topic without hassle. By using the clustering technique, we can keep books that have some kind of similarity in one cluster or on one shelf and label it with a meaningful name. If readers want books on that topic, they only have to go to that shelf instead of searching the entire library.
Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables. For instance, the prediction technique can be used in sales to predict future profit if we consider sales an independent variable and profit a dependent variable. Then, based on historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover similar patterns, regular events, or trends in transaction data over a business period.
In sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times of the year. Businesses can then use this information to recommend those items to customers with better deals, based on their purchasing frequency in the past.
Decision trees
A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the decision tree is a simple question or condition that has multiple answers. Each answer then leads to a further set of questions or conditions that help us narrow down the data so that we can make the final decision based on it. For example, we might use the following decision tree to determine whether or not to play tennis:
We often combine two or more of those data mining techniques together to form an appropriate process that meets the business needs.