Do you know where your data comes from? Apache Airflow does, and it’s getting updated to advance data orchestration


Getting data from where it is created to where it can be used effectively for data analytics and AI isn’t always a straight line. It’s the job of data orchestration technology like the open-source Apache Airflow project to help enable a data pipeline that gets data where it needs to be.

Today the Apache Airflow project is set to release its 2.10 update, marking the project’s first major update since the Airflow 2.9 release back in April. Airflow 2.10 introduces hybrid execution, allowing organizations to optimize resource allocation across diverse workloads, from simple SQL queries to compute-intensive machine learning (ML) tasks. Enhanced lineage capabilities provide better visibility into data flows, crucial for governance and compliance.

Going a step further, Astronomer, the lead commercial vendor behind Apache Airflow, is updating its Astro platform to integrate the open-source dbt-core (Data Build Tool) technology, unifying data orchestration and transformation workflows on a single platform.
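The announcement doesn’t detail what that integration looks like in code. As a hypothetical illustration of dbt-core running under Airflow, here is a minimal sketch using Astronomer’s existing open-source astronomer-cosmos package, which renders a dbt project as an Airflow DAG; the project path, profile name and target below are assumptions, not details from the announcement.

```python
# A minimal sketch using astronomer-cosmos, not the Astro feature itself.
# Paths, profile and target names are illustrative assumptions.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

transform_orders = DbtDag(
    dag_id="transform_orders",
    # A dbt-core project shipped alongside the Airflow code (assumed path).
    project_config=ProjectConfig("/usr/local/airflow/dbt/orders"),
    # Connection details come from an ordinary dbt profiles.yml (assumed path).
    profile_config=ProfileConfig(
        profile_name="orders",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Cosmos maps each dbt model to its own Airflow task, which is what makes a single orchestration-plus-transformation view possible in one platform.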

The enhancements collectively aim to streamline data operations and bridge the gap between traditional data workflows and emerging AI applications. The updates offer enterprises a more flexible approach to data orchestration, addressing challenges in managing diverse data environments and AI processes.

“If you think about why you adopt orchestration from the start, it’s that you want to coordinate things across the entire data supply chain, you want that central pane of visibility,” Julian LaNeve, CTO of Astronomer, told VentureBeat.

How Airflow 2.10 improves data orchestration with hybrid execution

One of the big updates in Airflow 2.10 is the introduction of a capability called hybrid execution.

Before this update, Airflow users had to select a single execution mode for an entire deployment: typically either the Kubernetes executor or Airflow’s Celery executor. Kubernetes is better suited to heavier compute jobs that require more granular control at the individual task level, while Celery is more lightweight and efficient for simpler jobs.

However, as LaNeve explained, real-world data pipelines often mix workload types. Within a single Airflow deployment, an organization might need only a simple SQL query to fetch data in one place, while a machine learning workflow connected to the same pipeline requires a heavier-weight Kubernetes environment to run. With hybrid execution, both can now coexist in one deployment.

The hybrid execution capability significantly departs from previous Airflow versions, which forced users to make a one-size-fits-all choice for their entire deployment. Now, they can optimize each component of their data pipeline for the appropriate level of compute resources and control.

“Being able to choose at the pipeline and task level, as opposed to making everything use the same execution mode, I think really opens up a whole new level of flexibility and efficiency for Airflow users,” LaNeve said.
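As a sketch of what that task-level choice can look like: in Airflow 2.10, an individual task can name the executor it should run on. The snippet below assumes a deployment configured with both the Celery and Kubernetes executors, and the task bodies are placeholders.

```python
# A minimal sketch, assuming airflow.cfg registers both executors, e.g.:
#   [core]
#   executor = CeleryExecutor,KubernetesExecutor
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def hybrid_execution_demo():
    # Lightweight work runs on the deployment's default (Celery) executor.
    @task
    def fetch_rows():
        return [1, 2, 3]  # stand-in for a simple SQL query

    # Heavier work requests the Kubernetes executor for this task only.
    @task(executor="KubernetesExecutor")
    def train_model(rows):
        return sum(rows)  # stand-in for a compute-intensive ML step

    train_model(fetch_rows())


hybrid_execution_demo()
```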

Why data lineage in data orchestration matters for AI

Understanding where data comes from is the domain of data lineage, a capability that is critical for traditional data analytics as well as for emerging AI workloads.

Before Airflow 2.10, the project’s data lineage tracking had notable gaps. LaNeve said that with the new lineage features, Airflow can better capture the dependencies and data flow within pipelines, even for custom Python code. That improved lineage tracking is crucial for AI and machine learning workflows, where the quality and provenance of data are paramount.
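One way Airflow already records such dependencies is through datasets, which mark a task as the producer of a given data asset and can schedule downstream DAGs as consumers. A minimal sketch follows, with an illustrative URI and placeholder task bodies; the dataset API itself predates 2.10, which builds on it.

```python
# A minimal sketch of dataset-based lineage in Airflow; the URI is illustrative.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

raw_orders = Dataset("s3://example-bucket/raw/orders.parquet")


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def produce_orders():
    # Declaring the dataset as an outlet records this task as its producer.
    @task(outlets=[raw_orders])
    def write_orders():
        ...  # stand-in for the code that actually writes the data

    write_orders()


# Scheduling on the dataset records the consumer side of the lineage graph
# and triggers this DAG whenever write_orders completes successfully.
@dag(schedule=[raw_orders], start_date=datetime(2024, 1, 1), catchup=False)
def consume_orders():
    @task
    def read_orders():
        ...  # stand-in for downstream analytics or model training

    read_orders()


produce_orders()
consume_orders()
```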

“A key component to any gen AI application that people build today is trust,” LaNeve said.

As such, if an AI system provides an incorrect or untrustworthy output, users won’t continue to rely on it. Robust lineage information helps address this by providing a clear, auditable trail that shows how engineers sourced, transformed and used the data to train the model. Additionally, strong lineage capabilities enable more comprehensive data governance and security controls around sensitive information used in AI applications. 

Looking ahead to Airflow 3.0

“Data governance and security and privacy become more important than they ever have before, because you want to make sure that you have full control over how your data is being used,” LaNeve said.

While the Airflow 2.10 release brings several notable enhancements, LaNeve is already looking ahead to Airflow 3.0.

The goal for Airflow 3.0, according to LaNeve, is to modernize the technology for the age of gen AI. Key priorities include making the platform more language-agnostic, allowing users to write tasks in any language, and making Airflow more data-aware, shifting the focus from orchestrating processes to managing data flows.

“We want to make sure that Airflow is the standard for orchestration for the next 10 to 15 years,” he said.


