Event triggering in Oozie: input-events and done-flag

2022-09-07 01:33:34

I. Theory

Workflow execution conditions

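When a workflow specified by a coordinator enters its execution time window, Oozie first checks whether all input-events have "occurred" (been satisfied). The check covers two things:

Does the specified file or directory already exist?

If a done-flag is specified, does the done-flag file exist?

The workflow enters the RUNNING state if and only if all input-events have occurred. Otherwise, Oozie keeps monitoring the specified files or directories; as soon as they are created, or the done-flag file is generated, the workflow enters the RUNNING state.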
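The wait-then-run behaviour described above can be pictured as a polling loop. The following is only a minimal sketch on the local filesystem, not Oozie's actual implementation; the function name `wait_for_inputs`, its arguments, and the retry limit are hypothetical and chosen purely for illustration.

```shell
#!/bin/sh
# Illustrative only: mimics "wait until every input exists, then run".
# Usage: wait_for_inputs MAX_TRIES INTERVAL_SECONDS PATH...
# Returns 0 once every PATH exists, 1 if MAX_TRIES polls pass without that.
wait_for_inputs() {
  max_tries=$1
  interval=$2
  shift 2
  try=1
  while [ "$try" -le "$max_tries" ]; do
    missing=0
    for p in "$@"; do
      # Local stand-in probe; against HDFS this could be: hdfs dfs -test -e "$p"
      if [ ! -e "$p" ]; then
        missing=1
      fi
    done
    if [ "$missing" -eq 0 ]; then
      return 0    # all inputs present: the action may start
    fi
    try=$((try + 1))
    if [ "$try" -le "$max_tries" ]; then
      sleep "$interval"
    fi
  done
  return 1        # inputs still missing: keep waiting
}
```

For example, `wait_for_inputs 10 60 /data/in/_SUCCESS` would poll once a minute for up to ten attempts; the path here is a placeholder.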

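About done-flag

From an application's point of view, if the thing being watched is a file or directory, the data may not be fully written at the moment it is created. Running the action immediately could then lose data or fail. A better approach is to wait until all files have been written and then write a marker file, and use that marker file to start the action. That is what done-flag is for.

The official documentation describes done-flag as follows: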

If the done-flag is omitted the coordinator will wait for the presence of a _SUCCESS file in the directory (Note: MapReduce jobs create this on successful completion automatically).
If the done-flag is present but empty, then the existence of the directory itself indicates that the dataset is ready.
If the done-flag is present but non-empty, Oozie will check for the presence of the named file within the directory, and will be considered ready (done)
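The three rules quoted above can be sketched as a small shell function. This is a hedged illustration, not Oozie code: `dataset_ready` and `path_exists` are hypothetical names, and the probe runs against the local filesystem as a stand-in for HDFS.

```shell
#!/bin/sh
# Stand-in existence probe. On a cluster this could be replaced by:
#   hdfs dfs -test -e "$1"
path_exists() {
  [ -e "$1" ]
}

# Usage: dataset_ready DIR [DONE_FLAG]
#   DONE_FLAG omitted   -> look for DIR/_SUCCESS
#   DONE_FLAG empty ("") -> the directory itself is enough
#   DONE_FLAG non-empty -> look for DIR/DONE_FLAG
dataset_ready() {
  dir=$1
  # ${2-_SUCCESS} applies the default only when the second argument is
  # absent; an explicitly passed empty string is kept as empty.
  flag=${2-_SUCCESS}
  if [ -z "$flag" ]; then
    path_exists "$dir"
  else
    path_exists "$dir/$flag"
  fi
}
```

Calling `dataset_ready /some/dir`, `dataset_ready /some/dir ""`, and `dataset_ready /some/dir trigger.dat` corresponds to the omitted, empty, and non-empty cases respectively; the directory name is a placeholder.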

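For an input event whose dataset specifies a done-flag, Oozie "reads" (checks for) the flag file before executing the action to decide whether the downstream steps may start. Note that an input event does not have to be referenced by an action as a parameter (configuration property), although it usually is. An input event that no action references effectively acts as an input check: before any action starts, Oozie checks that the specified file or directory exists and, if a done-flag is specified, that the done-flag file exists as well.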

For an output event whose dataset specifies a done-flag, Oozie does not "write" the flag file after the action has executed.

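done-flag is a neat way to mark data as ready, and it has arguably become one of the most common patterns in the big data field. A typical example is the _SUCCESS file that MapReduce generates when a job finishes.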

Creating a done-flag is such a common need that the HDFS CLI provides a command for it directly. Here is an example:

# Create a flag file named _SUCCESS under a certain input folder.
hdfs dfs -touchz "/path/to/input/folder/_SUCCESS"
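On the producer side, the pattern only works if the flag is created last. Below is a minimal sketch of that ordering, run on the local filesystem so it can be tried without a cluster; `publish_dataset` is a hypothetical helper, and the HDFS commands in the comments are the assumed equivalents.

```shell
#!/bin/sh
# Usage: publish_dataset SRC_DIR DEST_DIR [FLAG_NAME]
# Copies every file from SRC_DIR into DEST_DIR, then creates the flag LAST,
# so a consumer waiting on the flag never sees a half-written dataset.
# If the copy fails, the function returns early and no flag is written.
publish_dataset() {
  src=$1
  dest=$2
  flag=${3:-_SUCCESS}
  mkdir -p "$dest" || return 1        # HDFS: hdfs dfs -mkdir -p "$dest"
  cp "$src"/* "$dest"/ || return 1    # HDFS: hdfs dfs -put "$src"/* "$dest"/
  touch "$dest/$flag"                 # HDFS: hdfs dfs -touchz "$dest/$flag"
}
```

Because Oozie does not write the flag for output events, a step like this at the end of the producing job is one way to signal downstream coordinators.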

II. Worked example

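The job is scheduled hourly with Oozie, and the files may not yet have been collected into HDFS when a run is due, so Oozie is used to poll HDFS for the file flag. For testing, I pointed it at a path that already contained data, yet the action stayed in the WAITING state.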

<coordinator-app name="test_job" frequency="${coord:days(1)}" start="${job_start}" end="${job_end}"
    timezone="GMT+08:00" xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <concurrency>1</concurrency>
    </controls>
    <datasets>
        <dataset name="input_data" frequency="${coord:days(1)}"
            initial-instance="${job_start}" timezone="GMT+08:00">
            <uri-template>${monitor_workflow_run_status_path}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="input_data">
            <instance>${coord:current(-1)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${application_path}</app-path>
            <configuration>
                <property>
                    <name>nominalformateDate</name>
                    <value>${coord:formatTime(coord:nominalTime(), "yyyyMMdd")}</value>
                </property>
                <property>
                    <name>user_name</name>
                    <value>${coord:user()}</value>
                </property>
                <property>
                    <name>nominal_date</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), "yyyy-MM-dd")}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

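When a coordinator job starts executing, Oozie first checks whether all input-events are satisfied, mainly the following:

1. Whether the file or directory at the path given by uri-template already exists;
2. Whether the file named by done-flag exists.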

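The workflow switches to the RUNNING state only when the input-events meet the configured conditions. Until then it stays in the WAITING state while Oozie keeps watching the specified files or directories; as soon as the input-events are satisfied, the workflow enters the RUNNING state.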

done-flag is typically configured in one of three ways:

1. Omit the done-flag element entirely, as follows:

<dataset name="input_data" frequency="${coord:days(1)}" initial-instance="${job_start}" timezone="GMT+08:00">
    <uri-template>${monitor_workflow_run_status_path}</uri-template>
</dataset>

Oozie then defaults the done-flag to '_SUCCESS', so a _SUCCESS file must exist in the directory given by uri-template before the job is triggered.

2. Include the done-flag element but leave its value empty, as follows:

<dataset name="input_data" frequency="${coord:days(1)}" initial-instance="${job_start}" timezone="GMT+08:00">
    <uri-template>${monitor_workflow_run_status_path}</uri-template>
    <done-flag></done-flag>
</dataset>

Oozie then only checks whether the file or directory at the uri-template path exists; as soon as it does, the job is triggered.

3. Include the done-flag element with a non-empty value, as follows:

<dataset name="input_data" frequency="${coord:days(1)}" initial-instance="${job_start}" timezone="GMT+08:00">
    <uri-template>${monitor_workflow_run_status_path}</uri-template>
    <done-flag>trigger.dat</done-flag>
</dataset>

Oozie then checks whether the file named by done-flag (trigger.dat in this example) exists in the directory given by uri-template; as soon as it does, the job is triggered.

Author: Corwien
