Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey, and today I'm interviewing Anna Geller about incorporating both code and UI-driven interfaces for data orchestration. So Anna, can you start by introducing yourself? Yes, of course. I'm Anna Geller. I'm a data engineer and technical writer turned product manager. I've worked in many data engineering roles, including consulting, engineering, and later also dev rel.
And currently I work as a product lead at Kestra. And yeah, that's the subject of today's podcast. And do you remember how you first got started working in data? Yes. So I think I started working with data during an internship at KPMG,
processing data for year-end audits. So there were a lot of Excel spreadsheets and queries to SQL Server. Yeah, that was how I started. I actually also studied kind of data engineering as my master's, so yeah.
In terms of the scope of this conversation, can you start by giving your definition of what constitutes data orchestration and what is necessary for a system to be able to orchestrate data effectively? Yeah, so it's always a bit difficult to agree in the industry on definitions. The way I see data orchestration is that it's automated coordination of workflow nodes that touch data.
This means that essentially any workflow nodes that interact with data, whether they produce data or consume data, they all fall into this category. I think one misconception I see is that many people associate data orchestration only with ETL and analytics.
And instead, I think that we should see it a bit more as a broader concept that covers how data moves across your entire business.
So I think every company has their internal APIs that need to exchange data. You need to react to events, like sending an email and maybe updating inventory anytime there's a new shipment. You need to process data across ERP, CRM, PLM, all kinds of internal systems. And you often need to do that in real time rather than just in nightly ETL jobs. Yeah, so I think...
The distinction is whether you want to automate workflows for the entire IT department, with multiple teams, environments, and internal systems, or whether you just do it for the data team. And another aspect of the challenge of trying to really pin down what data orchestration means and what you should use to execute those workflows is that
In the technical arena and in organizations, there are numerous different scheduling systems, workflow systems, automation systems, in particular things like CI/CD for software delivery. There is a scheduler in Kubernetes and other container orchestrators. There are things like Cron and various other time-based scheduling or event-based systems such as Kafka or different streaming engines.
And a lot of times because something already exists within the organizational context, when a new task or requirement comes up, the teams will naturally just reach for what they already have, even if it's maybe not necessarily designed for the specific task at hand. And I'm wondering if you can talk to some of the ways that those tendencies can lead to anti-patterns and some of the limitations in the approach of using what they already have for data specific workflows.
Yeah, so I believe there is a lot of overlap in functionality between all those CI/CD, scheduling, and orchestration tools. If we think about it, they all have a trigger, right? So for example, when a new pull request is opened or merged, you need to do something. They all have a list of jobs or tasks to run when some event is received.
They also all have states. So they are all state machines in the end. If a given step fails, you want to maybe restart the entire run from a failed state. And also many CI tools...
maybe in the data space we don't realize it, but they also have things like notifications on failure. They have ways to maybe pause after a build step to validate if the build was correct and to approve or reject a deployment, right?
So there's quite a large overlap, and I think it's quite natural for companies, instead of directly considering a dedicated orchestrator, to first try to use what they have and see if they can expand it to use cases like data workflows, automation of microservices, or automation of business processes. I think the limitations usually show up when you have
true dependencies across workflows, across repositories, even across teams and infrastructure. And also when you start running workflows at scale, because then you just lack visibility. It's kind of the same as with AWS Lambda, when you have...
tons of those different functions, at some point you are just confused. You have no overview of what the actual health of your platform is. And let's take GitHub Actions as one concrete example. GitHub Actions is great, but the moment you have complex dependencies or custom infrastructure requirements,
GitHub Actions starts becoming maybe not the right solution. For example, you want to run this job on ECS Fargate, run this job on Kubernetes, and run another job on my on-prem machine to connect to my on-prem database to perform some data processing. Then you have patterns like run this job only after those four jobs complete successfully,
or run things at scale, where you want to manage concurrency and manage multiple code bases from multiple different teams. Just managing secrets across all those repositories, as you would have to do with GitHub Actions, can become a bit painful when you have multiple teams that maybe want to share them. This kind of visibility and governance at scale is where I believe you may consider a true orchestrator. Yeah.
Another challenge in the opposite direction is that teams that do invest in data orchestration will say, again, I already have something for doing orchestration. Why don't I also use that for CI/CD or whatever other task automation I have? And I'm curious what you have seen as some of the challenges in that opposite direction of using a data orchestrator for something that is not a
data-driven workflow? It depends on what we, in the end, consider a data orchestrator, because many data orchestrators will not be able to perform this task, like triggering a CI/CD pipeline to deploy some containers.
For example, dbt Cloud. If you consider dbt Cloud to be an orchestrator, you will not be able to start some Terraform apply from dbt. It's obviously not built for this use case. For Python orchestrators like, you know, Airflow and
all the tools in that space, I think it's more feasible, but it can be a bit clunky to orchestrate CI just from Python, because mostly in CI, what you do is run CLI commands. If you do it from Airflow, you would maybe need to have some HTTP sensor that listens to some webhook event, maybe after your pull request was merged or something like this. So it would be feasible, but it can be quite complicated,
quite clunky and not easy to maintain. In Kestra, we try to make this pattern really easy.
You can simply add a list of tasks with your CLI commands, then you add a webhook trigger that can react to your pull request event. And then it's very simple. I actually have one quote, I don't know if I should just read it out loud, from one user who is doing CI/CD in Kestra. And he mentioned that it was really refreshing. It's so simple yet powerfully flexible.
It really does allow you to create pretty much any flow you require. I have been migrating our pipelines from GitHub Actions to Kestra and it's been so simple to replicate the logic. The ability to mix and match plugins with basic shell scripting or scripts in any language is just amazing. So I think we have some good testimonials that kind of prove that the transition was fairly seamless.
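To illustrate the pattern described above, here is a minimal sketch of what such a CI/CD flow could look like in Kestra's YAML, assuming the shell Commands task and the Webhook trigger from Kestra's core and script plugins; exact plugin type names vary between Kestra versions, so treat this as an approximation rather than a verbatim recipe.

```yaml
id: ci_on_pull_request
namespace: company.ci

tasks:
  # Run the usual CI shell commands, as you would in a GitHub Actions job.
  - id: build_and_test
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - ./gradlew build
      - ./gradlew test

triggers:
  # Fire when an external system (e.g. a GitHub webhook on merged pull requests)
  # calls the flow's webhook URL with this key.
  - id: github_webhook
    type: io.kestra.plugin.core.trigger.Webhook
    key: replace-with-a-hard-to-guess-key
```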
Another element of data orchestration is the way in which it's presented and controlled. There have been a number of generations of data orchestration, each focusing on the specific problems of the overall ecosystem at that time.
And one of the main dichotomies that has existed throughout is the question of whether it's largely a UI-driven or low-code approach, where you're dragging and dropping different steps and connecting them up in a DAG, or whether it's largely a code-driven workflow, where there's also a spectrum of how code-heavy it is. Maybe it's a YAML description of what the different tasks are, or maybe it's pure code, where a lot of times that will lock you into a particular language environment. And I'm wondering what you see as some of the main motivators for those UI versus code-driven workflows at the technical and the organizational level. The main motivation to combine code and UI-driven paradigms is to
close the market gap. The way we see the orchestration and automation tool market is that, on the one hand, you have all those code-only frameworks, often requiring you to build your workloads in Python, JavaScript,
or Java. And on the other end of the spectrum, you have all those drag-and-drop ETL or automation tools. And in both of those categories, there are many solutions you can pick from. There are a bunch of orchestration frameworks. There are a bunch of no-code drag-and-drop solutions. But there are very few tools in the middle. And this is the gap that Kestra tries to fill. And in general, we believe that Kestra is the best among low-code orchestration solutions.
And if we make this claim that we are the best, why are we the best? With most tools in this no-code UI space, you would first build something in the UI, and they will create a dump of JSON schema and call it code. So...
In the end, I believe what Kestra does differently is that with every new feature, we start with code and API first, and all those UI components come later. And as a result, the YAML definition is readable. It has full auto-completion, syntax validation,
You have great UX in the sense that you have built-in documentation, revision history, and Git integration, so that you can iteratively start building everything in the UI. You can then push it to Git when you are ready, and you cover this whole spectrum of being able to
have this nice, intuitive UI to iteratively build workflows without compromising the engineering benefits of a framework. To summarize, existing solutions are usually either too rigid, like all the no-code tools, or too difficult, like all the frameworks. To some extent, with Kestra, you have all the benefits of a code-based orchestration framework without the complexity of a framework. So you don't have to
deploy and package your code. You can just go to the UI, you quickly edit it, you run it to check if it's working and you are done in just a few minutes. - One of the challenges of having a low code interface
even if there is a code-driven workflow available, is that it imposes necessary constraints to be able to ensure that even if you do have a code element, you're able to visually represent it for people who are using that UI-driven approach. And a lot of times I've seen that lock you
into a specific technology stack, where maybe it is UI-driven and it will generate code for you, which you can then edit, and it will translate that back to the UI, but only if you're running on Spark or only if you're running on Airflow. And I'm wondering if you can talk to some of the ways that that
bimodality and the requirement to be able to move between those different interfaces and maintain parity between them imposes constraints as far as the interfaces or the workflow descriptions or the types of tasks or runtime environments that you're able to execute with.
There are no constraints in terms of what you can orchestrate, in terms of technology you want to integrate it with. The only constraint is that Kestra has built-in syntax validation, which means that the API doesn't allow you to save the flow if it's invalid. So this is one constraint.
There are obviously tons of benefits with this. There are no surprises at runtime because the flow is validated during its creation at build time. If you have invalid indentation in your Kestra YAML, Kestra won't let you save that flow. In contrast, we can compare it to how it's handled in Python, because I believe a lot of your audience use tools like Airflow.
So with a DAG defined in a Python script, your workflow logic can be potentially more flexible, but a wrong indentation in your Python script
will be detected at runtime. So in the end, it's more flexible, but also it's more fragile. And in the end, as with pretty much everything in technology, it comes to the trade-off of constraints and guarantees that we can offer. With Python, you can have potentially a bit more flexibility in how you define this workflow logic.
but at the risk of having additional runtime issues if something is incorrect. And you have also this downside that you have to actually package and deploy that code. With the benefits of being in YAML, Kestra is a bit more constrained.
But it's also portable and self-contained. It's quite painless to deploy. It's validated at build time and you can be sure that everything is working. So yeah, pretty much the only constraint is that you cannot save an invalid flow. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
DataFold's AI-powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com slash datafolds today for the details. So in order to explore a little bit further as far as the constraints and benefits, I think it's also worth discussing what the overall architecture of Kestra is and some of the primitives that it
assumes for those orchestration tasks, and then we can dig more into some of the ways that you're able to use those primitives to build the specific logic that you need. So if you can just give a bit of an overview about how Kestra is implemented and some of the assumptions that it has about the level of granularity of the tasks and the types of inputs and outputs that it supports.
Yes, so maybe let's start with the architecture. Kestra started with an architecture that relies on Kafka and Elasticsearch. It was really great in terms of scalability, with no single point of failure, but at the same time, it made it
more difficult for new users to get started with the product and to explore it. Many listeners probably know that maintaining Kafka in production can be difficult. So that's why Kestra added an architecture with a JDBC backend in the open source version. This means that you can use Postgres, MySQL, SQL Server, or H2 as your database. And on top of that, you have
the typical server components you can expect from an orchestration tool: an executor, a scheduler, a web server, and workers. All of those components can be scaled independently of each other because all of those are kind of like microservices.
And if you need more schedulers or more executors, you can just increase the number of replicas in your Kubernetes deployment and everything just works. So that is the architecture from the, let's say, DevOps backend perspective. In terms of user experience, Kestra relies heavily on the API. We are an API-first product.
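As a rough sketch of the JDBC-backed setup mentioned here, a Kestra application configuration pointing at Postgres might look something like the following; the exact configuration keys are assumptions based on the documented pattern and can differ between releases.

```yaml
# Illustrative Kestra configuration (keys approximate, not authoritative).
datasources:
  postgres:
    url: jdbc:postgresql://postgres:5432/kestra
    driverClassName: org.postgresql.Driver
    username: kestra
    password: change-me

kestra:
  # Use the JDBC backend instead of the Kafka/Elasticsearch architecture.
  repository:
    type: postgres
  queue:
    type: postgres
  # Internal storage for files passed between tasks (could also be S3, GCS, etc.).
  storage:
    type: local
    local:
      basePath: /app/storage
```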
This is not an orchestration framework where you would just define your code, run it locally, then deploy the code. Instead, everything interacts through the API. The tasks and triggers you can use are determined by the plugins that you have in your Kestra instance. You can have as many plugins as you want. By default, Kestra comes prepackaged with all plugins, so you don't need to install anything.
This is kind of the main benefit you already get with an orchestration platform like Kestra: there's no need to pip install every dependency you need to use all those different integrations. Everything is prepackaged by default. And if you need a bit more flexibility, you can cherry-pick which plugins are included.
So let's say you are an AWS shop. You don't use Azure or GCP. You don't want those extra plugins for those other cloud vendors. You simply don't include them in your plugins directory in Kestra, and you just cherry-pick the plugins that you need. On top of that, you can build your custom plugins.
The entire process is fairly easy. You have a template repository that you can simply fork and build your code on top of. Then you build your JAR file, include it in the plugins directory, and you have your custom plugin. Then in terms of the governance
that you can have on top of this: as a Kestra administrator, you can set plugin defaults for each of those plugins that you added, to, for example, ensure that everybody is using the same AWS credentials. Or if you want to globally enforce some pattern, maybe that everybody should use certain properties in a certain way, you can enforce them globally using plugin defaults.
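A hedged sketch of what flow-level plugin defaults can look like; the pluginDefaults block and the AWS property names below follow the documented pattern but are approximate and version-dependent.

```yaml
id: s3_ingest
namespace: company.data

# Apply the same credentials and region to every AWS S3 task in this flow,
# so individual tasks don't have to repeat them.
pluginDefaults:
  - type: io.kestra.plugin.aws.s3
    values:
      accessKeyId: "{{ secret('AWS_ACCESS_KEY_ID') }}"
      secretKeyId: "{{ secret('AWS_SECRET_KEY_ID') }}"
      region: eu-central-1

tasks:
  - id: download_orders
    type: io.kestra.plugin.aws.s3.Download
    bucket: my-company-raw
    key: raw/orders.csv
```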
And this pluggable infrastructure has some constraints, in that if you don't have a plugin for something, you will not be able to use it. But the benefit is that you have a lot of governance, and it scales really well with more plugins that you can always add. We also have the possibility to create custom script tasks. So if some plugin is missing and you don't want to
touch Java to build a custom plugin, you can write your custom script in, for example, Python, R, or Node.js, and just run it as a container.
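For example, a custom Python step run as a container, instead of a Java plugin, might look roughly like this sketch; the Script task type and the containerImage property are assumptions that may need adjusting to your Kestra version.

```yaml
id: custom_python_step
namespace: company.data

tasks:
  # An ad-hoc Python script executed in its own container image.
  - id: transform
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-slim
    script: |
      import json

      rows = [{"id": 1, "value": 42}]
      print(json.dumps(rows))
```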
That's kind of like how Kestra can support all those different kinds of integrations. And so in terms of the level of granularity of the tasks or the data assets that you're operating over, what are the assumptions of Kestra as far as
the, I guess, scale of data, the types of inputs and outputs, and in particular, the level of detail that you're able to get to as far as what a given task or plugin is going to execute and how that passes off to the next task or plugin. That's mostly coordinated through inputs and outputs. So each workflow can have as many inputs as you want, and all inputs are strongly typed.
So you can say, okay, this input is a boolean, this input should be an integer, and this input is a select, so you can only select the value from a dropdown. Maybe this input is a multi-select, so you can only choose from the predefined values.
You can have JSON, URL, all kinds of different inputs. And that's already the benefit that they are strongly typed. So the end user who may not be as technical will know what are the valid values they can input into the workflow. Then the communication between tasks to pass data between each other is mostly operating in terms of metadata and internal storage.
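A short sketch of the strongly typed inputs described here; the input types and properties shown are approximations of Kestra's documented input schema and may differ slightly by version.

```yaml
id: typed_inputs_example
namespace: company.data

inputs:
  - id: full_refresh
    type: BOOLEAN
    defaults: false
  - id: batch_size
    type: INT
    defaults: 100
  - id: environment
    type: SELECT
    values: ["dev", "staging", "prod"]   # end users pick one value from a dropdown
  - id: payload
    type: JSON
    required: false

tasks:
  - id: show_inputs
    type: io.kestra.plugin.core.log.Log
    message: "Running in {{ inputs.environment }} with batch size {{ inputs.batch_size }}"
```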
If you want to pass some data objects directly, you can do that if your plugin specifies that some data should be output. Indirectly, you also have input files and output files for all script tasks. So you need to explicitly declare that, let's say, this Python task should output those two files or maybe all JSON files, and then they will be captured and automatically persisted in Kestra's internal storage.
You can think of internal storage as an S3 bucket. It can be S3, GCS, or just local storage. People familiar with Airflow can think of internal storage as Airflow's XComs without the complexity of having to do XCom push and pull. So yeah, that's how tasks can pass data between each other. And you can even pass data across workflows.
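To make the internal-storage handoff concrete, a hedged sketch: the first task declares output files so Kestra persists them, and the downstream task pulls them back in through the outputs expression; property names are approximate.

```yaml
id: pass_files_between_tasks
namespace: company.data

tasks:
  # Produce a file and declare it so Kestra captures it into internal storage.
  - id: extract
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-slim
    outputFiles:
      - orders.json
    script: |
      import json
      with open("orders.json", "w") as f:
          json.dump([{"order_id": 1, "amount": 9.99}], f)

  # Consume the persisted file via an expression; no shared local disk is needed.
  - id: load
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-slim
    inputFiles:
      orders.json: "{{ outputs.extract.outputFiles['orders.json'] }}"
    script: |
      import json
      with open("orders.json") as f:
          print(len(json.load(f)), "orders loaded")
```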
I think this is huge for governance. We have many, many users who use, for example, subflows to compose their workflows in a more modular way: you can have one parent flow that triggers multiple processes, each of them encapsulated in a subflow, and the subflows can output some data as well and pass it between each other, so that
you have this way of exchanging data between different teams, different projects, without having to hard code any dependencies and without having to rely on implicitly stored files somewhere locally. Another trend that's been growing in the data orchestration space is the idea of rather than data as tasks, treating data as assets where
one task might produce multiple assets. The canonical example is largely dbt, where you might have one dbt command-line execution that produces tens or hundreds of different tables as an output, and being able to track those independently, particularly if there are downstream triggers that depend on one of those tables being updated or materialized. And I'm wondering how Kestra
addresses or some of the ways that Kestra is thinking about that level of granularity in terms of a task producing multiple different outputs or assets as a result? Yeah, that's totally feasible. Each task can output as many results as you want to.
Maybe I wouldn't recommend outputting, like, a thousand files, because the UI could potentially break. But overall, you can output as many things as you wish. And Kestra doesn't introduce any restrictions in terms of what your specific outputs can be. There is one really great feature that people really appreciate in Kestra, which is outputs preview. So if your
task run returns maybe a CSV file or a JSON file, you can easily preview it in the UI so that you know if the data format is right, if everything looks good. In the same way, if something fails, you can maybe preview the data
and see, okay, maybe what I have in this downstream task is some error in my code; maybe you didn't capture some edge cases. You can redeploy your workflow, so essentially you create a new revision, and you can rerun it only for this last task. This is a feature called replay. It's super useful for failure scenarios. If you process data and you have some things that are unexpected,
And you don't want to rerun all those previous things, right? Because everything else worked. Only this single thing didn't work. So you can very easily reprocess things that don't work simply by fixing the code and pointing the execution to the updated revision.
In terms of the audience that you're targeting, given the fact that it has this UI and code driven approach, I'm wondering how you think about who the target market is, the types of users and some of the ways that that dual modality appeals to different team or technical boundaries across the organization. Yeah, that's,
That's a great question. Our target audience currently is mostly engineers who build internal platforms. So usually you would build some workflow patterns and you want to expose some workflows maybe to less technical users, to external stakeholders. We have lots of architects, software architects, coming to Kestra to support them in replatforming. This usually means they want to maybe move from on-prem to cloud, or
There is also this completely reverse pattern. There are many companies who these days move from cloud back to on-prem because of some additional compliance reasons. So yeah, a lot of people using Kestra are those platform builders who then expose those workflows to less technical users for a variety of use cases. Kestra is not focused exclusively on data pipelines. We also support infrastructure and API automation, business process orchestration,
You have things like approval workflows. One very common scenario is that there are some IT automation tasks that, for example, provision resources and some DevOps architect or manager needs to approve if those resources can be deployed. So you have this approval process implemented in Kestra that the right person can approve the workflow to continue. We have also all those event-driven data processing use cases implemented.
where you can have events. You receive events, for example, from Kafka, SQS, or Google Pub/Sub, and you want to trigger some microservice in response to this event. That's also a perfect use case for Kestra. So it's not restricted to data pipelines. And I would still say it's data orchestration because you react to some data changes in the business and you want to run some data processing in response.
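As an illustration of the approval pattern mentioned earlier, a minimal sketch using a pause-style task that waits for a human to resume (approve) the execution from the UI or API; the Pause task type name is an assumption that may differ by version.

```yaml
id: provision_with_approval
namespace: company.infra

tasks:
  - id: log_request
    type: io.kestra.plugin.core.log.Log
    message: "Resource request received, waiting for approval."

  # Execution pauses here until an authorized person resumes it in the UI or via the API.
  - id: wait_for_approval
    type: io.kestra.plugin.core.flow.Pause

  - id: provision
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "Provisioning resources after approval..."
```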
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, you should listen to Data Citizens Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale, among others.
In particular, I appreciate the ability to hear about the challenges that enterprise-scale businesses are tackling in this fast-moving field.
While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. At the organizational level too, I'm interested in some of the ways that Kestra is able to implicitly bridge these different workflows without the different teams needing to know
every detail of what the available data is and how it's produced. Where, for instance, I have a workflow that's taking a file from an SFTP server, processing it, generating some table in the data warehouse as a result. And then somebody else's workflow depends on the contents of that data warehouse table, does some analysis, produces some data visualization or report generation, and...
where the completion of that first task will implicitly trigger the execution of the next task without the person who controls each task having to explicitly communicate between them or requiring that the workflows are directly built on top of each other. - So Kestra supports that pattern using a flow trigger.
This was in fact like one of the most popular patterns already from the beginning of the product. The use case typically looks as follows. You have multiple teams that don't have tight dependencies between each other. So you would say, run this flow
only when those three other workflows from different teams have successfully completed within the last 24 hours. So you can easily define it as a condition for this flow so that it only runs
after those preconditions are true. And you can additionally add conditions for the actual data. So you can say only if this data returned maybe 200 status code, or if this data has this number of rows, do something like trigger this workflow. Kestra doesn't introduce any new concept for this. We already have the concept of triggers. So implementing those kinds of patterns is a matter of explicitly declaring in your YAML
what the expectations are to trigger this workflow. And you can explicitly list
all of those flow executions that should be completed within this given timeframe, and then it will run. So I think in the mindset, it's quite similar to how many other data orchestrators do that, but without restricting it directly to only being data.
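A hedged sketch of the kind of flow trigger described here, where a flow runs only once upstream executions have finished successfully; the trigger and condition type names are approximations, and the exact schema (including time-window preconditions) differs between Kestra versions.

```yaml
id: downstream_report
namespace: company.analytics

tasks:
  - id: build_report
    type: io.kestra.plugin.core.log.Log
    message: "Upstream flows finished, building report."

triggers:
  # Fire when the listed upstream flow finishes with a SUCCESS state;
  # additional conditions (more flows, a time window) can be added similarly.
  - id: upstream_done
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - SUCCESS
      - type: io.kestra.plugin.core.condition.ExecutionFlow
        namespace: company.ingestion
        flowId: load_orders
```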
Another aspect of what you're building with Kestra is the fact that it is an open core product with a paid service available on top of it. And I'm wondering if you can talk to some of the ways that you think about what elements are available as open source, what elements are paid, and some of the ways that you think about the audiences across those two and how you are working to keep the open source elements of it sustainable and active?
That's a question every open core company is asking themselves every day, I'm pretty sure. We have this framework that all features that are about security, scalability, and governance go into the Enterprise Edition, and all features that are single-player, core orchestration capabilities go into the open source version. And that's how we try to split it.
I believe there is no single answer; every company tries to find the best solution. What we found out so far is that we have some prospects, some people coming to Kestra, who would prefer to have a fully managed service. And currently Kestra doesn't offer that. We have the open source version and a self-hostable enterprise solution. So that's something we will be working on next year. It will be a big priority, especially to enable even more people to try the product,
see how it's working, including trying even those paid enterprise-grade features without having to first talk to sales and start an official POC. And as you have been building and working with Kestra, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
Yeah, one of the interesting ones is we have one solopreneur who was automating their entire business with Kestra, including payment automation and categorizing customer support tickets using OpenAI. So a super interesting use case, and great to see that Kestra can be applied by solopreneurs. For more surprising and unexpected things,
I would have expected more people to write custom code. And what we have found out is that
There are many, many users who purely use our plugins. If they need to have some transformations, they would often just add custom Pebble expressions. This is like Python Jinja, where they transform some data on the fly without writing dedicated code, like extra Python functions. So I was frankly a bit surprised. Sometimes it seemed to me personally easier to
maybe write custom code for this aspect, but I see users just prefer to keep things simple: just a simple transformation function, and move to the next task. I also was a bit surprised how many users actually leverage the low-code aspect of Kestra. Our default UI is the code interface, so you need to write your workflow. We have beautiful auto-completion and syntax validation when you just type things in the UI, but
many users still explicitly opt in to the topology view and just add things from the low-code UI forms. So that's one aspect which was also surprising to me. And overall, I think it's always surprising to see how broad a spectrum of users is coming to us. We have some who, as I mentioned, just prefer to keep things simple. They only use our plugins.
And there are other people who just write custom code for everything. So like every task is maybe Ruby or JavaScript or Python task. So the spectrum is really wide and it's really interesting to see this.
Another aspect that I forgot to touch on earlier is given that Kestra is by default a platform service, what does the local development experience look like for people who are maybe iterating on a script or trying to test out a workflow before they push it to their production environment? Yeah, I believe the local development is really great. We have feedback from one user who mentioned that writing workflows in Kestra is fun,
which is unheard of in the world of orchestration, that building workflows can be fun. So essentially to get started, you run a single Docker container, you open the UI and you hit a single button to create a flow. From here, you add your ID for the flow, the namespace to which it belongs,
the list of tasks that you want to orchestrate, and the triggers, so whether this should run on a schedule or based on some event, when a new file arrives in S3, et cetera. And then when you start typing your tasks, you get this auto-completion and built-in documentation. You have also blueprints
that will guide you through examples of how to leverage some usage patterns. So I believe the local development experience is really unique to Kestra. And as I mentioned, some users even consider this fun, which is very refreshing.
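For reference, the kind of minimal first flow described here, with an id, a namespace, a task list, and a schedule trigger, might look something like this sketch; type names are approximate.

```yaml
id: hello_world
namespace: company.team

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log
    message: "Hello from Kestra!"

triggers:
  # Run every day at 9:00; an event-based trigger (e.g. a new file in S3) could be used instead.
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 * * *"
```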
And in your own work of building and using and communicating about Kestra, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? One of the most challenging or interesting lessons we've learned is about following the common VC advice of, you know, just start with a niche, then land and expand. I think this approach didn't work for Kestra as well as we would have wished. So at first, Kestra targeted mostly data engineering for analytics use cases.
And over time, we expanded to operational teams, and we focused on engineers who are orchestrating custom applications, reacting to events, building scheduled backup jobs, or building infrastructure with Kestra. And this is where the adoption really took off.
So the lesson learned is that you don't always need to follow the VC advice. Sometimes following your own vision can be better. Then in terms of product building,
another lesson we learned is from when we were trying to use the VS Code editor within Kestra. In one release, we launched an embedded VS Code editor within the UI. Over time, we found it was really difficult, and in the end, it was much easier to build our own custom editor than to keep maintaining the one from VS Code, because you have so little control over how everything looks and how you interact between the VS Code extension and the UI.
Yeah, so I think this was something that was surprising. We thought it would be easier. We also thought that VS Code would be more open and not as restricted. So if you want to, for example, use GitHub Copilot in your product, you cannot do that. It's really restricted to Microsoft only. And for individuals or teams who are evaluating orchestration engines, they're trying to decide what fits best into their stack, what are the cases where Kestra is the wrong choice?
Yeah, so Kestra is the wrong choice if you build stateful workflows that implicitly depend on side effects produced by other tasks or by other workflows. To give you an example, let's say you have one Python function that writes data to a local file and there is another task in another workflow that tries to read this local file. Technically speaking, if you use worker group feature in Kestra, you could make this work, but we consider this
implicitly stateful approach a bad practice. We prefer that you declaratively configure that this task outputs a file, and this file will then be persisted in internal storage. And then it can be accessed transparently by other tasks or even by other flows.
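To contrast with the local-file anti-pattern, a hedged sketch of the declarative alternative: a child flow declares an output backed by internal storage, and a parent flow calls it as a subflow and reads that output. The Subflow task and the flow-level outputs block are assumptions based on recent Kestra versions and may not match older releases.

```yaml
# Child flow: declares its result explicitly instead of leaving a file on a worker's disk.
id: extract_orders
namespace: company.ingestion

tasks:
  - id: extract
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-slim
    outputFiles:
      - orders.csv
    script: |
      with open("orders.csv", "w") as f:
          f.write("order_id,amount\n1,9.99\n")

outputs:
  - id: orders_file
    value: "{{ outputs.extract.outputFiles['orders.csv'] }}"
---
# Parent flow (a separate flow definition): runs the child and consumes its declared output.
id: orders_pipeline
namespace: company.analytics

tasks:
  - id: run_extract
    type: io.kestra.plugin.core.flow.Subflow
    namespace: company.ingestion
    flowId: extract_orders
    wait: true

  - id: report
    type: io.kestra.plugin.core.log.Log
    message: "Extracted file: {{ outputs.run_extract.outputs.orders_file }}"
```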
In general, we try to bring infrastructure-as-code best practices to all workflows. So we assume that your local development environment should be the same as what you will in the end run in production. In prod, you usually run things in a distributed fashion, so you cannot guarantee that those two tasks will run on the same worker to access this local file.
That's why we consider this an anti-pattern. And each execution in Kestra is by default considered stateless. And only if your tasks explicitly output some results
those results are persisted and can be processed. And as you continue to build and iterate on and explore the market for orchestration engines in the data context, what are some of the things you have planned for the near to medium term or any particular problem areas or projects you're excited to dig into?
Yeah, we are really excited about the feature we will be releasing on December 3rd this year. And this will be apps. This will allow you to build custom applications directly from Kestra. So you can treat your workflows as a backend and you build custom UIs directly from Kestra.
So let's imagine that you want to have some business stakeholders who want to request some data. They can go to the UI, they can select from the inputs what type of data they want to request. Then your workflow can fetch and process and transform all the data in the way this end stakeholder needs it. And it can then output this data directly from this custom application.
So this eliminates this need, you know, and often I think as data engineers, we know this use case where a stakeholder comes in and asks like, could you fetch this data for me? I just need this report. So effectively they can fully self-serve with this approach. Similarly, if you have patterns that need approval, right? So let's say somebody wants to request compute resources,
you can fill those inputs in a custom form. Then this will go to the manager or to the DevOps engineer who can look at the request, they can approve it, and then you can maybe see the result. So in the end, those custom applications, I think this will be a feature that will unlock tons of different use cases. And we are very excited about this one.
Similarly, since we follow this approach of everything as code, we are building a feature which is custom dashboards. So you can build custom dashboards that visualize your execution data, and you can do that as code. Similarly to how you have your workflow as YAML, you also have your custom dashboard as YAML, which you can version control and track revision history for.
This is also another feature that will be launched in December. And long term, in terms of what is on our roadmap, it's a cloud launch. We need this fully managed service, as I mentioned before, and also some improvements to human-in-the-loop. I think to accommodate the AI-driven world where AI generates some data, you need to have
human-in-the-loop processes where humans can approve the outputs generated by AI. So that's also something that we will work on even more. Are there any other aspects of the work that you're doing at Kestra or the overall space of
UI and code driven orchestration that we didn't discuss yet that you'd like to cover before we close out the show? No, I think we've covered a lot of ground. Thank you so much for inviting me to the show. It's been great. Yeah, I'm really grateful. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest
of the Kestra team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. We briefly mentioned the topic of everything as code. And so far, dbt has brought this approach to analytics.
We see BI tools are catching up. So slowly you can start building dashboards as code, which can follow the same engineering practices. I think we are still far away from the world where you can really have everything in the data engineering process managed as code. And I think we should probably close this gap at some point.
All right. Well, thank you very much for taking the time today to join me and share the work that you and the Kestra team are doing on bridging the gap between code and UI driven workflows and expanding beyond data only and ETL only execution. So appreciate the time and energy that you're all putting into that. And I hope you enjoy the rest of your day. Thanks so much.
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.