Abstract
Background/Aims A stable, automated ETL environment is essential to maintain a consistent virtual data warehouse (VDW) refresh schedule. This stability is threatened at each refresh cycle because VDW source table schemas can vary over time, causing extract programs to fail when they encounter a schema that is different from what is expected. Such a failure requires manual intervention to change the extract program to conform to the new schema, disrupting the refresh schedule.
Methods We are initiating a process to dynamically modify extract code when source table schemas change, through a real-time comparison of current source table schemas against previously extracted table schemas. The process requires a metadata database (MDDB) containing schema metadata from all the sources, as well as metadata about our ETL packages. The first step in a refresh cycle is to acquire the current source schema metadata in real time. Then, using a combination of T-SQL and .Net code, we compare the current schema metadata against the metadata from the prior extract to identify source table changes. If there are discrepancies, the metadata about our extract packages provides the means to identify those packages of extract code requiring modification. We have created a maintenance program that uses the MDDB information to identify those packages requiring update, generate the needed data definition language (DDL) code segment changes, and dynamically modify the data flows within those extract packages. This is essentially application code that uses metadata to generate and modify other application code. Built into this process are logging, notification, historical archiving, and other documentation for process assurance.
Results This methodology allows us to generate corrections and changes to ETL packages quickly and easily without manual intervention. Since all tables are stored in the MDDB, the extract program can be constructed according to the schema specifications found there. It then can be dynamically updated should the schema change, saving valuable programming time in developing new extract packages for data mining/exploration.
Conclusions This approach saves both programming time and computer resources, and allows us to execute more frequent ETL refresh cycles.




