Airflow: A Flexible and Extensible Workflow Management Platform

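Airflow is a workflow management platform written in Python and open-sourced by Airbnb. The previous article showed how to manage data flows with crontab, but that approach has obvious drawbacks. Against those drawbacks, Airflow stands out for being flexible and extensible.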

The table below compares Airflow (based on version 1.7) with Oozie (based on version 4.0):

Feature                           Airflow                 Oozie
Workflow description              Python                  XML
Data trigger                      sensor                  datasets, input-events
Workflow node                     operator                action
Complete workflow                 DAG                     workflow
Periodic scheduling               DAG schedule_interval   coordinator frequency
Task dependencies                 >>, <<
Built-in functions and variables  template macros         EL functions, EL constants

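I mentioned earlier that Oozie cannot express complex DAGs. The reason is that Oozie can only declare downstream dependencies, not upstream ones. Airflow, by contrast, can represent complex DAGs. Airflow also does not separate workflow from coordinator the way Oozie does: trigger conditions and workflow nodes are both treated as operators, and operators together make up a DAG.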
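To make the upstream/downstream point concrete, here is a minimal sketch of how `>>` and `<<` can record dependencies in both directions. It does not import Airflow: the `Task` class and the task names are hypothetical stand-ins that only mimic the operator-overloading idea, not Airflow's actual implementation.

```python
class Task:
    """Hypothetical stand-in for an operator; not Airflow's implementation."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        # Record the edge self -> other on both endpoints.
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def __rshift__(self, other):
        # a >> b: b runs after a. Return b so chains read left to right.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # a << b: a runs after b. Return b so chains keep working.
        other.set_downstream(self)
        return other


sensor = Task("sensor")
job_a = Task("job_a")
job_b = Task("job_b")
report = Task("report")

# Fan out from the sensor, then fan in to the report (a diamond).
sensor >> job_a >> report   # declared from the upstream side
report << job_b << sensor   # the same shape declared from the downstream side
```

Because each edge can be declared from either end, a node with several parents (here `report`) is as easy to write as a node with several children.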

Commonly used Airflow commands are as follows:

The rest of this post shows how to build a data pipeline task with Airflow.

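First, some brief background. A scheduled (weekly) job checks whether a partition of a Hive table has been generated. If it has, a Hive job is triggered to write to Elasticsearch. Once the Hive job finishes, a Python script queries Elasticsearch and sends a report. However, Airflow has problems with Python 3 support (its dependencies are written for Python 2), so I had to write my own HivePartitionSensor: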

# -*- coding: utf-8 -*-
# @time : 2016/11/29
# @author : rain
import logging

from airflow.operators import BaseSensorOperator
from impala.dbapi import connect


class HivePartitionSensor(BaseSensorOperator):
    """Waits for a partition to show up in Hive.

    :param conn_host, conn_port: the host and port of HiveServer2
    :param table: the name of the table to wait for, supports the dot
        notation (my_database.my_table)
    :type table: string
    :param partition: the partition clause to wait for. This is passed in
        notation such as ``ds='2016-12-01'``.
    :type partition: string
    """
    # Fields rendered by Jinja before the task runs.
    template_fields = ('table', 'partition',)
    ui_color = '#2b2d42'

    def __init__(
            self,
            conn_host, conn_port,
            table, partition="ds='{{ ds }}'",
            poke_interval=60 * 3,
            *args, **kwargs):
        super(HivePartitionSensor, self).__init__(
            poke_interval=poke_interval, *args, **kwargs)
        if not partition:
            partition = "ds='{{ ds }}'"
        self.table = table
        self.partition = partition
        self.conn_host = conn_host
        self.conn_port = conn_port
        self.conn = connect(host=self.conn_host, port=self.conn_port,
                            auth_mechanism='PLAIN')

    def poke(self, context):
        logging.info(
            'Poking for table {self.table}, '
            'partition {self.partition}'.format(**locals()))
        cursor = self.conn.cursor()
        cursor.execute("show partitions {}".format(self.table))
        # Each row is a 1-tuple holding a spec such as 'ds=2016-12-01',
        # so the check below is an exact string match.
        partitions = [row[0] for row in cursor.fetchall()]
        return self.partition in partitions
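The sensor pattern itself is simple: call `poke` repeatedly, sleeping `poke_interval` between attempts, until it returns true or a timeout expires. The following is a minimal sketch of that loop without Airflow or Hive. The helpers `wait_until`, `make_partition_poke`, and `fake_fetch`, as well as the sample partition names, are hypothetical and only illustrate the idea; this is not Airflow's scheduler code.

```python
import time


def wait_until(poke, poke_interval=1.0, timeout=10.0, sleep=time.sleep):
    """Call poke() until it returns True or the timeout elapses.

    Hypothetical helper illustrating the sensor pattern; not Airflow's code.
    """
    waited = 0.0
    while not poke():
        if waited >= timeout:
            raise TimeoutError("condition not met within timeout")
        sleep(poke_interval)
        waited += poke_interval
    return True


def make_partition_poke(fetch_partitions, partition):
    """Build a poke() that checks whether `partition` is currently listed."""
    def poke():
        return partition in fetch_partitions()
    return poke


# Simulated metastore: the wanted partition shows up on the third check.
responses = [
    [],
    ["day_time=2016-12-01"],
    ["day_time=2016-12-01", "day_time=2016-12-08"],
]
calls = []


def fake_fetch():
    calls.append(1)
    return responses[min(len(calls) - 1, len(responses) - 1)]


poke = make_partition_poke(fake_fetch, "day_time=2016-12-08")
# A no-op sleep keeps the demonstration instant.
found = wait_until(poke, poke_interval=1.0, timeout=5.0, sleep=lambda s: None)
```

Keeping the check (`poke`) separate from the waiting loop is what lets a custom sensor implement only the check and leave the retry and timeout handling to its base class.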

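Under Python 3, the connection to HiveServer2 goes through the impyla module, and HivePartitionSensor checks whether a partition of a Hive table exists. Writing a custom operator is a bit like writing a UDF for Hive or Pig. The finished operator has to be placed under the directory ~/airflow/dags so that DAGs can import it. The complete workflow DAG then looks like this: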

# tag cover analysis, based on airflow v1.7.1.3
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators import BashOperator
from impala.dbapi import connect
from operatorud.hivepartitionsensor import HivePartitionSensor

conn = connect(host='192.168.72.18', port=10000, auth_mechanism='PLAIN')


def latest_hive_partition(table):
    """Return the value of the last partition listed for `table`."""
    cursor = conn.cursor()
    cursor.execute("show partitions {}".format(table))
    partitions = [row[0] for row in cursor.fetchall()]
    return partitions[-1].split("=")[1]


# Jinja template rendered at run time. The exact expression is an
# assumption; '{{ ds }}' is the execution date as YYYY-MM-DD.
log_partition_value = """{{ ds }}"""
tag_partition_value = latest_hive_partition('tag.dmp')

# Placeholder default_args; the owner and start_date are assumed values.
args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 12, 6),
}

# Runs once a week. Note that '@weekly' is cron '0 0 * * 0' (Sunday
# midnight); use '0 0 * * 2' to run on Tuesdays instead.
dag = DAG(
    dag_id='tag_cover', default_args=args,
    schedule_interval='@weekly',
    dagrun_timeout=timedelta(minutes=10))

ad_sensor = HivePartitionSensor(
    task_id='ad_sensor',
    conn_host='192.168.72.18',
    conn_port=10000,
    table='ad.ad_log',
    partition="day_time={}".format(log_partition_value),
    dag=dag
)

ad_hive_task = BashOperator(
    task_id='ad_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad_tag.hql --hivevar log_partition={} '
                 '--hivevar tag_partition={}'.format(log_partition_value, tag_partition_value),
    dag=dag
)

ad2_hive_task = BashOperator(
    task_id='ad2_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad2_tag.hql --hivevar log_partition={} '
                 '--hivevar tag_partition={}'.format(log_partition_value, tag_partition_value),
    dag=dag
)

report_task = BashOperator(
    task_id='report_task',
    bash_command='sleep 5m; python3 /path/to/cron/report/tag_cover.py {}'.format(log_partition_value),
    dag=dag
)

# The sensor gates both Hive jobs; the report waits for both of them.
ad_sensor >> ad_hive_task >> report_task
ad_sensor >> ad2_hive_task >> report_task
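The `latest_hive_partition` helper above takes the last row returned and splits on every `=`, so it depends on the row order and on a single partition column whose value contains no `=`. Below is a slightly more defensive sketch of the same parsing step, assuming rows arrive as 1-tuples such as `('day_time=2016-12-01',)`. The function name `latest_partition_value` and the sample rows are hypothetical.

```python
def latest_partition_value(rows):
    """Return the largest value from rows such as ('day_time=2016-12-01',)."""
    if not rows:
        raise ValueError("table has no partitions")
    # Split on the first '=' only, so values containing '=' stay intact.
    values = [row[0].split("=", 1)[1] for row in rows]
    # max() avoids depending on the order in which rows were returned.
    # It compares strings, which suits zero-padded dates like 2016-12-01.
    return max(values)


# Sample rows deliberately out of order.
sample_rows = [
    ("day_time=2016-12-08",),
    ("day_time=2016-11-24",),
    ("day_time=2016-12-01",),
]
latest = latest_partition_value(sample_rows)
```

This only covers single-column partition specs. A table partitioned by several columns would return rows like `a=1/b=2`, which would need to be split on `/` first.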
