Winning ETL battles
April 24, 2012
By Sergey Alentyev – Senior Database Architect – Cobalt Intelligence Center of Excellence
By the end of another long day of ETL project development I feel pretty tired. Looking at my ETL tool mapping, I no longer see Sources, Transformations, Lookups and Targets connected by many arrows. My imagination shows me a historical battlefield plan with troop locations, phalanxes, fortresses, cavalry attacks.
Obviously I need some rest.
But this picture didn't pop up without a reason. As a military commander maneuvers the right parts of his troops at the right time, destroys the enemy's forces, and occupies his positions, fortresses or cities, an ETL architect or developer moves and processes a lot of data from the sources to the stage, fact and dimension targets. In both cases it is all about efficient utilization of existing resources at the right time. Let's try to find analogies in military history that can help us better understand such resource utilization. Perhaps we will be able to draw some useful conclusions. Sounds crazy? Let's see.
Battle of Thermopylae
We have a powerful database server, a very strong ETL server, plenty of memory and CPUs. Our ETL mappings and workflows are written by very experienced developers. But why does everything take so long to run? First, we ship all this massive data from the source database to the ETL server. On the ETL server we process all our data record by record. Or, in a slightly better case, we run a few parallel threads, where each thread also processes the data row by row. And finally, we ship all the results back to the target database, which is the same as the source database. No matter how much more memory and CPU we add, we won't scale. It reminds me of something…
480 BC. The Persian emperor Xerxes had amassed a huge army and set out to conquer all of Greece. The Athenian general Themistocles had proposed that the allied Greeks block the advance of the Persian army at the pass of Thermopylae. A relatively small army of Spartans and other Greeks, led by King Leonidas, blocked the narrow pass. You might have seen the movie.
If the Persian army had met the Greeks in an open field, it would have smashed them very quickly, simply because it significantly outnumbered them (estimates vary widely between historians: 70,000 to more than a million Persian warriors against 5,000–7,000 Greeks). For two days all the Persian attacks ended with nothing but Persian casualties. On the third day a traitor showed the Persians a secret path, and the Greeks were surrounded. We know the end from books and movies.
Besides the unquestionable courage, military skill and patriotism of the Greeks, there was another reason for the initial failure of the Persian army: well-chosen terrain. The Persians had made a long journey to the battlefield, yet they could not bring the massive force of their army to bear. They could not process in parallel, so to speak. The pass was 100 meters wide at most, so they had to process almost in serial mode.
If we can't utilize the resources available to us at the right moment, we can lose our ETL battle. Or at least not win it.
When I see ETL tool row-by-row processing, an INSERT … VALUES … SQL script, or row-by-row looping like this in PL/SQL (source_test stands in here for the real source query, which the original sketch left out):
DECLARE
  CURSOR c IS SELECT * FROM source_test;
BEGIN
  FOR r IN c LOOP
    INSERT INTO target_test VALUES r; -- one row per loop iteration
  END LOOP;
END;
I imagine two enemy armies standing against each other on the battlefield. One side starts the attack – one warrior starts running toward the enemy line with a battle cry, then another, then another. Not a very powerful attack.
Bulk processing in PL/SQL and ETL tools is better and should make a big difference.
Like PL/SQL bulk INSERT:
DECLARE
  CURSOR c IS SELECT * FROM source_test; -- source_test again stands in for the real source
  TYPE t IS TABLE OF c%ROWTYPE;
  l t;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l LIMIT 1000;
    FORALL i IN 1 .. l.COUNT
      INSERT INTO target_test VALUES l(i);
    EXIT WHEN c%NOTFOUND; -- safe here: the last partial batch is already inserted
  END LOOP;
  CLOSE c;
END;
We send 1000 warriors together to attack. A scary picture! But when they get close to the enemy line, they again attack one by one, just at a faster pace.
INSERT … SELECT … is more effective and faster, closer to the situation where at least the first line attacks at the same time. INSERT … SELECT … in parallel is the situation where many warriors attack together.
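In SQL terms, the whole first line attacking at once is a single set-based statement; a minimal sketch (source_test is an illustrative name, not from the original):
INSERT INTO target_test
SELECT * FROM source_test;
One statement, one pass over the data, and the engine decides how to move the rows instead of the loop doing it one row at a time.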
When I see massive parallel processing of a huge partitioned table, something like:
INSERT /*+ append parallel(t) */ INTO target_table t
SELECT /*+ parallel(s) */ …
FROM source_partitioned_table s
WHERE …;
…it resembles a Macedonian phalanx, probably the first known use of parallelism. Each phalangite carried his primary weapon, a sarissa, a double-pointed pike up to 6 m (about 20 ft) in length. The pikes of the first five rows (partitions?) of men all projected beyond the front of the formation. Five men could fight at the same time. The degree of parallelism was 5! We can rewrite our pseudo-query like this:
INSERT /*+ append parallel(t 5) */ INTO target_table t
SELECT /*+ parallel(s 5) */ …
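One practical detail worth noting here (my addition, not part of the original mapping): in Oracle, the INSERT side of such a statement only truly runs in parallel if parallel DML is enabled for the session; otherwise only the SELECT side is parallelized:
ALTER SESSION ENABLE PARALLEL DML;
Run this before the INSERT … SELECT, and commit immediately afterwards, since direct-path parallel DML leaves the table unreadable in the same transaction until you do.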
Besides the courage and the military and political talent of Alexander the Great, the Macedonian phalanx was one of the main reasons this young man conquered the known world. It was a very effective use of limited resources. Even though his troops were usually fewer than his enemies' armies, they could apply more force concurrently. They ran in parallel!
Napoleon on the Borodino field
I'm trying to improve performance for big aggregation ETL processes where we do a lot of I/O, joins and sorts. We have a lot of CPU and fast disks, and we have added a bunch of memory. At the moment nothing but our big query is running on the database instance.
But I see that we're not using all this memory. All the parameters show that we should be able to use much more. Instead we dump intermediate results into temporary segments and then read them back one or two blocks at a time. Why are we not allowed to use more of the resources of our superb database?
Back to history!
1812. The French army (the best army in the world at that time), supported by many European troops and led by Emperor Napoleon (possibly the best commander of all time), invaded Russia. For several months the Russians avoided a decisive battle, fighting locally and exhausting the French army. Napoleon was thirsting for such a battle. He wanted to destroy the entire enemy force in one huge battle, as he had done many times before with his brilliant military talent. Finally, on September 7, 1812, the biggest and bloodiest single-day action up to that time took place.
Some 250,000 troops were involved on both sides, with at least 70,000 casualties. Napoleon's generals saw victory close at hand and thought that a final push would be enough to decide the fate of the Russian army and of the war. They begged Napoleon to deploy his final and best reserve, the Imperial Guard. But during the battle Napoleon had seen how hard the Russians fought and how dearly his progress had been bought. Napoleon changed his mind.
“I will most definitely not; I do not want to have it blown up. I am certain of winning the battle without its intervention.”
He refused to commit the Guard as a final reserve so far from France and reliable logistical support. What if the Russians found fresh resources and attacked tomorrow?
In the end, both sides claimed victory. The French took an empty, burned Moscow but soon left, without provisions, harried from all sides by the Russians and a bitter winter.
Everybody who knows anything about history knows the final result. Of the half-million-strong army of French and other Europeans that invaded Russia, only about 22,000 crossed the border back.
But why am I telling this story now? Because Napoleon at Borodino reminds me of how our best-in-the-world database management system (one that was initially designed for OLTP) behaves when it comes to huge queries that need a lot of resources: a lot of I/O, joins, sorts and so on. We have a lot of memory, but the database does not use all of it, or even a big portion of it, automatically. We have to beg Napoleon, our database, to allocate it, using manual memory allocation, hidden parameters and other tricks. But our OLTP-based database says: I can't allocate such huge resources to one battle (query, report, ETL process). What if other sessions need these resources a second later? "I was always successful with this OLTP approach in my previous campaigns, and I'm going to win this battle anyway, even without my best reserves."
Sound familiar? “I will most definitely not; I do not want to have it blown up. I am certain of winning the battle without its intervention.”
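In Oracle terms, the begging usually means switching the session from automatic to manual workarea sizing; a hedged sketch (the sizes are illustrative, and the practical limits are version- and platform-dependent):
ALTER SESSION SET workarea_size_policy = MANUAL;
ALTER SESSION SET sort_area_size = 1073741824; -- 1 GB per sort workarea
ALTER SESSION SET hash_area_size = 1073741824; -- 1 GB per hash join workarea
With automatic PGA management a single workarea is capped at a small fraction of pga_aggregate_target; manual settings like these let one big ETL session commit the reserves that the automatic policy holds back for sessions that may never come.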
Winning the guerilla war
To win the major battles we switch from our brilliant but OLTP-based database to one of the best MPP databases. Everything we struggled with before (all the huge analytic queries, big ETL scans, joins and sorts) is flying now. But guess what: all the small queries, online reports, everything that selects and processes small, narrow data sets, now run longer. Even after tuning the SQL for MPP, the average small query runs much slower than before. Instead of doing just a few I/Os, it has to send work to all segments, most likely do full table scans with physical reads, and then probably ship some data between segments (servers). So we have to pay a minimum MPP price for any operation.
Imagine you are a Roman general who has just won a big battle against Rome's rival. The enemy's troops are destroyed, and the major cities are occupied by Roman forces. But the rebels' small units are still fighting for their homeland. They harass your troops daily with very small raids. And each time you respond by sending several centuries of heavy infantry to destroy each group of rebels. And each time, you win the individual battle.
But how much time and how many resources do you spend? When many bands attack simultaneously, you send out all your reserves, leaving the cities unprotected. At such moments the rebels can take the cities even with relatively small forces. And you can lose everything that was won with such enormous effort.
Would it be better to have special light cavalry small units that can do the job much faster and with fewer resources?
What happens to our MPP super system when we need to run online Web applications with hundreds or thousands of really small queries running concurrently? We can quickly exhaust our I/O, CPU and memory. Probably we would need some data marts on OLTP-based databases for such applications.
No doubt the future of Data Warehouses is Big Data, and the future of ETL is Big Processing of Big Data. For Big Data processing we need Big Resources. But even when we have them, we need to utilize them right.
I hope this small article helps ETL and Data Warehouse developers better visualize the ETL process and its bulk nature in the new era, and understand the importance of right resource utilization a little better. I hope it helps you marshal your forces and win your own ETL battles in the future.