Monday, August 15, 2016

ETL: Friendly Robot or Ticking Time Bomb?

An ETL is a friendly robot that converts data into information
This friendly robot was drawn by HyperPunch84 on Newgrounds

Overview


This post outlines a way I structure ETLs to keep them friendly and maintainable instead of ticking time bombs waiting to blow up my week.  Note that this is a fairly technical post.  If this isn't up your alley, you have my blessing to skip this one. Next week we'll get back to some non-technical riffraff.


What is an ETL?



Despite what it looks like in the image, information is what you want
https://en.wikipedia.org/wiki/Digesting_Duck
If you remember my last post, data is the raw material that must be processed before it has any analytical value.  An ETL is the process that converts that data into precious information.  This is a critical step for business analytics because analysis has to be done on information, not raw data.

ETL: Extract, Transform, Load


  • Extract: the script takes in data from somewhere.
    This is kind of like downloading a report to your computer.  Maybe it's from a cloud service like Salesforce. Maybe it's a log file, like the high-quality, artisanal log files lovingly generated by FreeWheel (plug).
  • Transform: changes the data's format and applies business logic.
    This is kind of like automating a pivot table or calculations from an Excel report.
  • Load: deposits the formatted information into an accessible place.
    This is kind of like saving the results of your Excel "transformation" into a master Excel file you use for analysis.
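The three steps above fit in a very small script. Here's a minimal, hypothetical sketch in Python: the file names, the fake report contents, and the "+1" business logic are all made up for illustration.

```python
import csv

# a tiny fake "report" standing in for a real download (made-up data)
with open("report.csv", "w", newline="") as f:
    f.write("metric\n1\n2\n3\n")

def extract(path):
    # Extract: take the raw rows in from somewhere
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: apply some business logic -- here, just add 1 to each metric
    return [{"metric": int(row["metric"]) + 1} for row in rows]

def load(rows, path):
    # Load: deposit the finished information somewhere accessible
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["metric"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("report.csv")), "master.csv")
```

Swap the fake CSV for an SDK call or a log file and the shape stays the same.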


How I structure ETLs


When first learning to write ETLs, I didn't know how to make them readable until someone else was hired to read them.  My most maintainable and extensible ETLs all now follow this 3-file format.  It's been effective for me and the other analysts I've shared it with.  


The gist: there is one file for the main flow of the program, another for all the functions that take more than one line, and a third for all predefined variables. When something goes wrong, the main acts as a table of contents to isolate the error. Using this structure, when the ETL needs to be extended I'm usually just a config change away.

Main

The main file is for the flow of your ETL.  This is where you'll call the functions in the right order and iterate over the objects being transformed.  It's ok for this file to be small and straightforward.  

Pseudocode main example:

# etl_object_list is an iterable list of things to extract
from configuration_file import etl_object_list

from function_file import extract_function, transform_function, load_function

for etl_object in etl_object_list:
    extracted_file = extract_function(etl_object)
    transformed_file = transform_function(extracted_file)
    load_function(transformed_file, etl_object)


Function

The function file is where I put all the supporting functions.  This makes my code MUCH easier to read.  When something is off, I can easily follow the program flow in the main file then jump into the details here.  

Pseudocode functions example:

import platform_sdk

def extract_function(etl_object):
    # get the request parameters to call the proper extract function
    sdk_request_parameters = etl_object.request_params

    # call the sdk with the right parameters to get the data
    extracted_file = platform_sdk.get_report(sdk_request_parameters)

    return extracted_file


def transform_function(extracted_file):
    # apply the business logic: add one to each value, for some reason
    transformed_file = [value + 1 for value in extracted_file]
    return transformed_file


def load_function(transformed_file, etl_object):
    # get the location that the file should be loaded to
    load_location = etl_object.load_location

    # put the transformed_file into load_location
    with open(load_location, "w") as f:
        f.write(str(transformed_file))

    return True


# this is the class for etl_objects.  It stores all the object variables
# in a nice place.  This could probably live in the config file if you want.
class etl_object:

    def __init__(self, request_params, load_location):
        # the parameters for the SDK's request
        self.request_params = request_params

        # the location to load the transformed data
        self.load_location = load_location


Config


The config file is where you keep the defined variables that your program needs to run. I played around with throwing this all in the function file at first, but it got too messy.

When done right, this is the only file I need to update.  Need to add more objects?  Define them and add them to the list.  Need to halt the ETL on an object?  Cut it from the list.  Ever need to handle a new data type?  If I've coded the functions correctly, I can manage it all from the config file.

Pseudocode config example:

from function_file import etl_object

object_1 = etl_object(request_params_1, load_location_1)
object_2 = etl_object(request_params_2, load_location_2)

etl_object_list = [object_1, object_2]
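To see the three files working together, here's a self-contained, runnable sketch of the whole pattern. The FakeSDK, the "+1" transform, and the table names are all made up; in real life the three sections below would live in their three separate files.

```python
# --- function file ---
class FakeSDK:
    """A stand-in for a real platform SDK; get_report just echoes data back."""
    def get_report(self, request_params):
        return request_params["values"]

platform_sdk = FakeSDK()

class etl_object:
    def __init__(self, request_params, load_location):
        self.request_params = request_params  # parameters for the SDK request
        self.load_location = load_location    # where the result should go

def extract_function(obj):
    # call the sdk with the right parameters to get the data
    return platform_sdk.get_report(obj.request_params)

def transform_function(extracted_file):
    # toy business logic: add one to each value
    return [value + 1 for value in extracted_file]

warehouse = {}  # stands in for a database or file system

def load_function(transformed_file, obj):
    warehouse[obj.load_location] = transformed_file
    return True

# --- config file ---
object_1 = etl_object({"values": [1, 2]}, "table_a")
object_2 = etl_object({"values": [10]}, "table_b")
etl_object_list = [object_1, object_2]

# --- main file ---
for obj in etl_object_list:
    extracted_file = extract_function(obj)
    transformed_file = transform_function(extracted_file)
    load_function(transformed_file, obj)
```

Note that extending this ETL to a third report really is just two config lines: define object_3 and append it to etl_object_list.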

BONUS: Some ETL "Things I Should Have Known Earlier" (TISHKE)

http://www.newyorker.com/cartoons/a16089

I'm not sure these will be applicable to anyone else, but here's a list of things that would have saved me a lot of time if I had known them when I started.

TISHKE 1: Pick a sensible programming language


Don't try to impress your friends with some fancy language like Perl.  Just pick something that works, like Python.  In fact, just pick Python.  

  1. Python code is human-readable.  Readability is always more important than you think.  Remember, your scripts will be read more times than they are written.
  2. Python is very easy to learn.  Not only is it accessible, but there are a TON of valuable resources.
  3. Two Python libraries (numpy and pandas) will probably do most of your work for you. I put off learning pandas because I always thought it would be faster to write it myself. I was wrong. The turning point was realizing every function that took me longer than an hour to write already existed in pandas. I was literally wasting time making a shittier version of something I already had.
  4. The Python community is passionate and friendly.
    This is Ravi.
    We became friends geeking out over Python at a greek restaurant.
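Point 3 is easy to see in a few lines. Here's a hypothetical comparison (the campaign/impressions data is made up) between a hand-rolled aggregation and the pandas one-liner that replaces it:

```python
import pandas as pd

# made-up report rows: impressions per campaign
rows = [
    {"campaign": "a", "impressions": 100},
    {"campaign": "b", "impressions": 50},
    {"campaign": "a", "impressions": 25},
]

# the hand-rolled version I used to spend an hour on
by_hand = {}
for row in rows:
    by_hand[row["campaign"]] = by_hand.get(row["campaign"], 0) + row["impressions"]

# the pandas version: one line
with_pandas = pd.DataFrame(rows).groupby("campaign")["impressions"].sum().to_dict()
```

Both produce the same totals; one of them also handles missing values, sorting, and joins for free when you need them later.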

TISHKE 2: Stop being cute and get er done


When I started coding ETLs, I would take incredible pride in doing something in a clever way when a simpler way would have sufficed. More than once I got so hung up on doing something clever that I lost track of the goal entirely and had to throw away what I wrote once it was done. The point is, I'm much more productive when I stay goal-oriented, and there's always time for fun projects after work.


TISHKE 3: Python Requests are easy, don't be afraid of them


After relying heavily on partner SDKs to handle my HTTP requests, I ran out of options one day and had to implement a GET request in Python myself. It turns out the requests library I'd been avoiding for so long is dirt simple. All told, the entire ETL took 4 hours from reading the documentation to deployment, about half the time a simple ETL usually takes me.
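For the curious, a GET with requests really is a line or two. The endpoint and parameters below are hypothetical; the last part builds the request without sending it, so you can see exactly what would go over the wire without touching the network.

```python
import requests

# a hypothetical reporting endpoint and parameters
url = "https://api.example.com/reports"
params = {"report_id": "42", "format": "csv"}

# the real call is this simple (commented out so nothing is sent here):
# response = requests.get(url, params=params)
# response.raise_for_status()   # blow up early on HTTP errors
# extracted_file = response.text

# prepare the same request without sending it, just to inspect the URL
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
```

That's the whole "E" of many ETLs: one get(), one raise_for_status(), and you have your raw data.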

TISHKE 4: Write readable, commented code you doofus!


Think you don't have time for comments? Well, you do. I'll tell you who doesn't have time: the you from the future who just caught a bug in your code.  Remember to comment your code and write unit tests while you still remember what the code does.
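That habit can start small. Here's a sketch of what "comment it and test it while it's fresh" looks like, using the toy "+1" transform from earlier (purely illustrative):

```python
def transform_function(values):
    """Apply the business logic: add one to each value (toy example)."""
    return [value + 1 for value in values]

# the unit test you write while you still remember what the code does
def test_transform_adds_one():
    assert transform_function([1, 2, 3]) == [2, 3, 4]
    assert transform_function([]) == []  # don't forget the empty report case

test_transform_adds_one()
```

Future-you, mid-bug-hunt, will thank present-you for both the docstring and the test.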
