Continuous Integration in a Data Warehouse

By Chris Mills – Development Manager – Cobalt Intelligence Center of Excellence

Over the last two years, we have almost tripled the number of developers working on Intelligence applications at Cobalt, gone from 20 to over 70 production releases per year, removed our separate QA team, and improved production quality from roughly one “must fix” production issue per month to one every six months. All of this has come in the face of the scalability challenges that accompany rapidly increasing data volumes and processing complexity. Read on if you’d like to know more about how we’ve done this.

Cobalt’s engineering organization made a transition from “waterfall” to Agile development in 2009. At the time, testing of warehouse and reporting applications was an expensive process: highly manual and error-prone. Often the testing phase of a warehouse project would take more time than the design and implementation phases.

There were plenty of disadvantages to this approach, including:

  • Inconsistent testing from release to release. The quality of the testing was influenced by the quality of documentation and handoffs from team to team, as well as by human factors like system knowledge and attention to detail.
  • All testing, even routine testing, was expensive because it was so manual and because it depended on coordination and communication across teams and environments.
  • Testing was late in the process. By the time issues were found, developers were often days removed from having made their changes. Another round of handoffs and deployments from Dev to QA was required before issues could be resolved.

Cobalt’s transition to Agile was driven by a desire to provide our customers with incremental releases and improved support for the fast-paced worlds of online commerce and digital advertising.  Monolithic releases were out, and smaller, more frequent releases, often with short lead times, were in.  The existing testing approach was clearly incompatible with this, so the team began to pursue an automation strategy.

The system we have developed for Continuous Integration of all database and ETL changes relies on a variety of technologies: Anthill for nightly builds, Ant for orchestration, SQLUnit for unit testing, and Informatica web services for remotely launching workflows.  Test suites are managed via Ant scripts, which orchestrate the following tasks for each ETL workflow (a sketch of such a build script follows the list):

  1. Set up the testing environment with seed data.
  2. Ensure that any system-level preconditions for the ETL being tested are met.
  3. Execute the ETL.
  4. Execute a series of unit tests.
  5. Clean up the environment so that any data changes made during these steps are removed.
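
For illustration, here is a minimal sketch of what one of these per-workflow Ant build files could look like. The target names, file names, and delegation structure below are assumptions based on the description above, not the actual Cobalt scripts:

    <project name="workflow-ci" default="ci" basedir=".">
      <!-- Hypothetical per-workflow CI driver; names and structure are illustrative -->
      <target name="setup">
        <ant antfile="setup.xml" />      <!-- seed tables, stage input files -->
      </target>
      <target name="precheck" depends="setup">
        <ant antfile="precheck.xml" />   <!-- verify system-level preconditions for the ETL -->
      </target>
      <target name="execute" depends="precheck">
        <ant antfile="execute.xml" />    <!-- launch the ETL workflow remotely -->
      </target>
      <target name="test" depends="execute">
        <ant antfile="test.xml" />       <!-- run the SQLUnit suite for this workflow -->
      </target>
      <target name="teardown" depends="test">
        <ant antfile="teardown.xml" />   <!-- remove any data created by the run -->
      </target>
      <target name="ci" depends="teardown" description="full setup/execute/test/teardown cycle" />
    </project>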

A central application (Anthill) controls the scheduling of test runs and provides an online reporting interface where results can be reviewed, along with a history of past runs.  Test results are also delivered to the development team via email, and the team treats automated testing failures as the top priority in its daily work.  At any one time the team will have multiple warehouse releases in flight, each of which gets its own Continuous Integration test runs set up in Anthill.

At the time of this writing, more than 10,000 tests are run under automation against various versions of Cobalt’s BI codebase.  Database level tests confirm that DB structures, indexes, grants, and other objects are appropriate after DB deployments.  ETL tests confirm that processing rules and policies are enforced, and that dependencies between ETLs are accounted for.  In the spirit of Test Driven Development, new tests are added to the suite early in new projects rather than after coding is complete.
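
As an illustration of a database-level check, a SQLUnit test along the following lines could confirm that an expected index survives a deployment. The table, index name, and attribute values here are hypothetical, not taken from our suite:

    <!-- Hypothetical test: verify that an expected index exists after the DB deployment -->
    <test name="Index EMPLOYEES_PK exists on EMPLOYEES">
      <sql>
        <stmt>select count(*) as index_count from user_indexes where index_name = 'EMPLOYEES_PK'</stmt>
      </sql>
      <result>
        <resultset id="1">
          <row id="1">
            <col id="1" name="index_count" type="INTEGER">1</col>
          </row>
        </resultset>
      </result>
    </test>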

This automated testing “safety net” has enabled a number of major changes for the Intelligence product team, all of which have had a direct and very positive business impact.  Our ability to execute thousands of tests against any change in an hour or two has shortened project turnaround time dramatically.   Developers have an easy way to get near real time feedback on the impact of their changes, which has improved their efficiency.  Production quality has improved through more test coverage, and because executing the tests via software ensures consistency.

Finally, testing has moved far enough upstream in the development process that the need for a separate testing team has been removed.  Headcount that used to be allocated to QA is now fully devoted to Intelligence roadmap development.

We are now approximately two years into our automated testing initiative.  The successes enjoyed by the DB/ETL team from Continuous Integration have spread to the rest of Cobalt’s Intelligence product stack.   Team culture has evolved to the point that testing is an initial consideration for any new work, rather than an after-thought.  We continue to learn and refine, but the initial project goals of improving quality and team velocity have been achieved.   Our team of Intelligence developers did over 70 production releases last year.  Even though we no longer have a separate QA team, our production quality is higher than ever.

In an era of “big data” and increasingly complex and prominent BI applications, the ability to rapidly evolve a data warehouse is more important than ever.  The solution described here demonstrates not only that robust automated testing is possible in a BI environment, but also that it can bring similarly large business impacts to other organizations that follow a similar approach.

While automated testing is commonplace in the software world, we have found it to be quite rare in the data warehousing world.  Heavy system integration, large data volumes, and the variety of technologies in a typical BI environment pose special challenges.  We believe that the degree to which we have automated our testing process is unique, and something that other organizations seeking to improve the quality and pace of their BI development could learn from.

If you would like to share your own experiences with test automation in a warehouse setting, or if you’d like more detail on the above, please comment and we’ll get the conversation going!

About collectivegenius
Everyone has a voice and great ideas come from anyone. At Cobalt, we call it the collective genius. When technical depth and passion meet market opportunity, the collective genius brings its best to the table and our customers win.

9 Responses to Continuous Integration in a Data Warehouse

  1. HenryM says:

    Very interesting read, and even though I’m a tester (working in the BI world), I very much like the sound of what you’re doing and would like to influence the teams I work with to adopt a similar approach – at the expense of my role becoming obsolete! 😉

    One key question: traditional testing has the benefit of tester independence from the developer, essentially resulting in a second set of eyes, a second way of looking at things (especially interpreting the requirement and envisaging how to test it), and a second level of verification/validation of the original system requirement.

    With your setup, how do you achieve that additional layer of validation? How do you have comfort that your developers are writing sufficient unit tests of adequate quality?

    Basically, in the environments I’ve worked in, I wouldn’t be comfortable accepting only the developer’s word that they’ve tested sufficiently before releasing to production. Just because it doesn’t break the unit tests in question doesn’t mean it’s fit for production.

    Thanks in advance for your reply
    Henry.

  2. Thanks, and great question. A second set of eyes is definitely important, and we’ve had a number of interesting team discussions over the years about how to put the right checks in place.

    At this point we’re reviewing our tests just like we review code. That lets us leverage the larger team’s understanding of the codebase and requirements, and ensures that things don’t go out to production having been seen by just one person. In these reviews we’re looking for things like redundant tests, missing tests, tests that are guaranteed to pass, tests that don’t test anything useful, etc.

    Also, in the spirit of Agile and TDD we’re trying to get better at expressing the requirements themselves as test cases. This part is still a work in progress, but to the extent we can do this we will bring testing even farther forward in the process.

  3. Very interesting read indeed! I wondered whether the technologies applied will work with every type of ETL tool (IBM DataStage, BO Data Integrator, Oracle Data Integrator, Informatica PowerCenter, etc.), or only with specific ones?

    • Bruce Szalwinski says:

      The ETL tool needs to support remote invocation of workflows. For Informatica, this is done via Informatica Web Services. I’m guessing these other ETL tools have a similar feature. Setup of the database is done via SQL client tools (sqlplus, psql, etc.). Testing is done via SQLUnit, which connects to the database via JDBC.

      If your environment meets these conditions, then I think this approach would work.
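
      For context, a SQLUnit suite declares its JDBC connection inside the test file itself, so any JDBC-accessible database can be targeted. A minimal sketch follows; the driver class, URL, and credentials below are placeholders, not our real configuration:

          <sqlunit>
            <!-- Hypothetical connection block; substitute your own driver, URL, and credentials -->
            <connection>
              <driver>oracle.jdbc.driver.OracleDriver</driver>
              <url>jdbc:oracle:thin:@dbhost:1521:DWDEV</url>
              <user>ci_test</user>
              <password>ci_test</password>
            </connection>
            <!-- individual <test> elements for the workflow under test go here -->
          </sqlunit>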

  4. MatthewG says:

    I would be interested to know more details regarding step 1 and 2. Specifically, how you were capturing seed data necessary for testing, how you were orchestrating loading seed data into the test databases, etc.

    • Bruce Szalwinski says:

      While Anthill is the scheduler and reporting interface, it delegates all of the dirty work to Ant build scripts. For each workflow, there is a build.xml that implements the setup, execute, test, teardown and report tasks.

      The setup task delegates work to its own Ant file which, by convention, is called setup.xml. Work within setup.xml is further sub-divided into two tasks: setting up the database and setting up the ETL environment. To set up the database, we have one file for each table that needs to be set up for the workflow under test. The setup build script will invoke each of these files. By convention, the file name is based on the table name, so if we were setting up the employees table, we would have a file called employees.sql. The employees.sql file would contain the set of DML statements required to initialize the employees table.
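
      To make that concrete, a hypothetical employees.sql could be nothing more than plain DML that puts the table into a known state (the columns and values below are invented for illustration):

          -- employees.sql: hypothetical seed data for the workflow under test
          delete from employees;
          insert into employees (employee_id, first_name, last_name, hire_date)
            values (1001, 'Ada', 'Lovelace', date '2010-01-15');
          insert into employees (employee_id, first_name, last_name, hire_date)
            values (1002, 'Alan', 'Turing', date '2010-02-01');
          commit;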

      The ETL environment is initialized by the ETL target within the setup.xml build file. Within this target, we deploy any required operating system files (parameter files, indicator files, etc).
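
      Putting the two halves together, a setup.xml along these lines would match that description. The target names, the sqlplus invocation, and the scp copy below are assumptions used for illustration, not our actual build files:

          <project name="setup" default="setup" basedir=".">
            <!-- Hypothetical setup.xml; structure follows the description above -->
            <target name="db">
              <!-- one seed script per table, named after the table it initializes -->
              <exec executable="sqlplus" failonerror="true">
                <arg value="${db.user}/${db.password}@${db.tns}" />
                <arg value="@setup/employees.sql" />
              </exec>
            </target>
            <target name="etl">
              <!-- deploy parameter files, indicator files, etc. to the ETL server -->
              <scp file="setup/workflow.par" todir="${etl.user}@${etl.host}:${param.dir}"
                   keyfile="${ssh.keyfile}" trust="true" />
            </target>
            <target name="setup" depends="db, etl" />
          </project>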

  5. ChrisO says:

    Hi, I realize that the article is a bit old, but I hope that you guys are still actively contributing. I am trying to implement TDD in our DW team; however, I am having some difficulty creating scenarios to test, in the sense of finding tests to write. Currently I have it at basic tests like testing for the object and testing that the schema is as per the spec, but after that I am a bit stuck, and we really want to test the ETL to its fullest. I was trying to apply tSQLt to the DW environment; however, it consists of faking 10 tables and populating each of those tables with relevant data to test one aspect, which feels like killing a fly with a stick of dynamite. Also, like MatthewG mentioned, capturing seed data to do testing on would be interesting. It would be great to get some real-world examples of tests specifically for ETL, as all the articles I find around TDD for DW and ETL are about the benefits and the idea behind TDD :-/ Any advice would be greatly appreciated.

    • Bruce Szalwinski says:

      I’ll use an example from a recent subject area that we have been developing, Widget Events. A dealer website is composed of one or more widgets. We want to track metrics about consumer behavior for each widget (hover time, time on page, did they click, basically, how did they interact with each widget).

      Widget Events is a transactional fact table, storing one record for every widget event. The incoming data comes from an Apache log. One log record can have information about multiple widgets, but the grain of the fact table is one record per widget event, and that brings us to our first test case: ensure that the grain of the table is being respected. The ETL is responsible for pivoting the input data, creating multiple rows for each widget.

      To test that scenario, we need to set up the workflow so that it can process an Apache log where there are multiple widgets in the same log record. We create these Apache log entries and put them into a file called widget.access_log. That file lives in source control, but we need to get it to the Informatica machine for processing. The setup task is responsible for getting it there.

          <property name="file1" value="widget.access_log" />
      
          <!-- copy files -->
          <ant target="etluserscp">
            <property name="source.file" value="setup/${file1}" />
            <property name="target.file" value="${src.file.dir}${file1}" />
          </ant>
      
      

      Now that the file is in place, we can execute our workflow to parse the file and load the results into a staging table. We can then verify that the use case was successfully completed using this unit test.

      <!-- Validating unique records loaded into the table stage_widget_events based on the keys site_id, event_id, widget_instance_id -->
        <test name="Unique records loaded into the table stage_widget_events based on the keys site_id, event_id, widget_instance_id and date">
          <sql>
            <stmt><![CDATA[
              select event_id,site_id,time_stamp,widget_instance_id
              from stage_widget_log
              group by event_id,site_id,time_stamp,widget_instance_id
              having count(*) > 1]]>
            </stmt>
          </sql>
          <result>
            <resultset id="1" />
          </result>
        </test>
      

      Other scenarios that have come up include but are not limited to:

      * boundary conditions (ensure string fields aren’t being truncated, numeric values are being stored properly)
      * fields map correctly from source to target (we construct an input record with known values and ensure that the values wind up in the proper target fields)

      For each business requirement, we follow a similar pattern: construct the appropriate input data, write the test that exercises the requirement, see that it initially fails, write enough ETL to make it pass, refactor as necessary.
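
      As one more illustration, a field-mapping test in the same SQLUnit style could select the row produced from a known input record and compare it to the expected values. The table, columns, and values below are hypothetical:

        <!-- Hypothetical test: a known input record should land in the expected target fields -->
        <test name="Known source record maps to the correct target fields">
          <sql>
            <stmt><![CDATA[
              select site_id, widget_instance_id, event_type
              from stage_widget_events
              where event_id = 42]]>
            </stmt>
          </sql>
          <result>
            <resultset id="1">
              <row id="1">
                <col id="1" name="site_id" type="INTEGER">1001</col>
                <col id="2" name="widget_instance_id" type="INTEGER">7</col>
                <col id="3" name="event_type" type="VARCHAR">hover</col>
              </row>
            </resultset>
          </result>
        </test>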

    • Javier says:

      Hi ChrisO, you could try it with QuerySurge; I believe it is a great tool for TDD on DW, ETL, and data migration testing.

      If you want, you can write to me to jimoralesg@gmail.com.
