
#269 Governing Data Models with Sarah Levy, CEO and Co-Founder at Euno

2024/12/12

DataFramed

People
Richie
Sarah Levy
Topics
Richie: Discusses the challenges of data governance, in particular the problem of different teams defining key metrics inconsistently, and the role of the semantic layer in solving it. He stresses how important trust in data is for AI analytics tools, and wants to know how to create a semantic layer and where it sits in a larger governance strategy.
Sarah Levy: Points out that many business leaders do not trust the numbers their data products report, which is the biggest challenge facing data governance. She explains the semantic layer as a repository of all certified, governed calculation definitions, which resolves the large discrepancies in key metrics that arise when different systems calculate them differently, and argues it is essential for building trustworthy AI analytics tools because it gives model training reliable definitions. She walks through the work of building and managing a semantic layer, including curating metrics, writing the code, establishing workflows, and handling multiple versions of a metric, and discusses the role of analytics engineers in building and maintaining it, how to balance governance with innovation, and concrete case studies and success metrics.

Deep Dive

Key Insights

What is the primary challenge in data governance according to Sarah Levy?

The biggest challenge is that business leaders often cannot trust the numbers reported by their data products, despite significant investments in data tools and teams.

What is a semantic layer in the context of data governance?

A semantic layer is a centralized store for certified definitions of calculations and metrics, providing a single source of truth for data context and ensuring consistency across an organization.

Why is a semantic layer important for organizations?

It ensures consistency and alignment across an organization by providing a central source of truth for metrics, which is crucial for trust in data-driven decisions and for training AI tools.

What are the main steps in implementing a semantic layer?

The steps include curating and mapping metrics, resolving inconsistencies, coding them, and building a workflow to maintain the layer over time.

Who needs to be involved in creating and maintaining a semantic layer?

Business analysts, data engineers, and analytics engineers are all involved. Business analysts define the logic, analytics engineers implement it, and engineers ensure it is coded and governed properly.

What is the role of analytics engineers in data governance?

Analytics engineers bridge the gap between business and data teams, owning the semantic layer and ensuring it is well-architected, coded, and maintained.

How does AI impact the role of analytics engineers?

AI tools require a governed semantic layer to function effectively, making analytics engineers critical for implementing and maintaining the layer that AI models rely on.

What is a governance score and why is it important?

A governance score measures how much of an organization's data assets rely on governed metrics and tables. It helps organizations understand how much they can trust their data and where improvements are needed.

How should organizations prioritize which metrics to govern first?

Organizations should prioritize metrics and dashboards that are highly utilized or where significant resources are invested, as these are indicators of high business value.

What is the relationship between a semantic layer and AI-driven analytics?

A semantic layer is essential for enabling AI-driven analytics by providing a certified set of definitions that AI tools can use to generate trustworthy insights.

How can data governance encourage innovation while maintaining control?

Governance should allow analysts to create and experiment freely in their preferred tools, while observability tools identify which creations gain traction and should be added to the semantic layer.

What is Sarah Levy's mission with her company, Euno?

Her mission is to help large-scale data teams easily understand and derive value from their data by facilitating the creation of a central governed semantic layer and enabling AI adoption.

Transcript


If you want to adopt AI tools, if you want to have like a ChatGPT for analytics and ask, you know, what's the number of daily active users, what's the interest in the past month? And you want to trust the number that this tool gives you, you have to rely on some sort of truth, some certified set of definitions to train those models. Welcome to DataFramed. This is Richie.

One of my biggest gripes as a data scientist is when you get to the end of a project, you're presenting your results, and someone pipes up, "I don't think you should have calculated it like that. We use a different formula on our team." There's a surprisingly large amount of creativity that can go into calculating a lot of business metrics, and naturally that means it's far too common to have 11 definitions of customer acquisition costs floating about.

At best, this simply causes lost productivity because you keep reinventing data wheels, and at worst makes data-driven decision making impossible because you don't trust your analyses. The solution is better data governance, and making your metric definitions consistent by using a semantic layer. I want to know how you go about creating one of these semantic layers, or at least which colleagues you'd badger to create one, and how they fit into a larger governance strategy.

Our guest is Sarah Levy, the CEO and co-founder at Euno.ai, a data governance platform company. Sarah founded Euno after being VP of Data Science and Analytics at Pagaya and CTO at Sight Diagnostics. The challenges she encountered in running data teams led her to try and solve them for others. Since she spent nearly two decades battling data governance problems, I'm keen to know her tips on how to solve them.

Hi, Sarah. Welcome to the show. Hi, Richie. Happy to be here. Cool. So to begin with, just talk me through what are the big challenges in data governance right now? Wow. So I think the biggest challenge is that so many business leaders cannot trust the numbers that their data products report, right?

I think that's the biggest. You invest so much money in the data stack, in building data pipelines, data engineering teams, a warehouse, BI tools, all of that just to make data-driven decisions. And in the end, you don't trust the numbers. I think that's the biggest issue right now, if you ask me. Sure, yeah. So I can see there's a big problem if you're spending all this money on your data solutions, and then you go, well, actually, I don't really trust the answer at all. Then it's a complete waste of time.

So I know one of the solutions you're sort of interested in in order to get better trust in data is the use of a semantic layer. Can you talk me through what is a semantic layer? Yes. So it's actually, you can consider it a mart or a store.

where you park all the certified or governed or official definitions of your calculations. For example, if I led a real estate department in a huge fintech company, and one of the major KPIs was the number of assets that we own.

So number of assets can be calculated from various systems in different ways. And we had actually experienced, I experienced it myself. We had about 300 assets that we managed, and the numbers that we got for total number of assets from different systems ranged between 270 and 320 assets.

That's a huge mistake. So a semantic layer is a sort of mart where you will have the official definition of total number of assets. And if I want to know it, I will use the semantic layer to get the right context from the data. So there's lots of data out there in tables. And if you want to get the right context for this data, you use a semantic layer. It provides you the context for that data so that you can know which table you need to query to get the answer that you want.
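
To make the idea concrete, here is a minimal sketch of what one certified definition in such a mart might look like. This is an editor's illustration, not Euno's or any vendor's actual schema, and the table, expression, and owner names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CertifiedMetric:
    """One governed, official definition stored in the semantic layer."""
    name: str
    description: str
    source_table: str    # the governed table the metric must be computed from
    sql_expression: str  # the single official way to calculate it
    owner: str
    certified: bool = True

# Hypothetical "official" version of the total-assets KPI from Sarah's story.
TOTAL_ASSETS = CertifiedMetric(
    name="total_assets",
    description="Number of assets currently owned, per the real-estate domain.",
    source_table="analytics.fct_assets",
    sql_expression="COUNT(DISTINCT asset_id)",
    owner="analytics-engineering",
)
```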

Okay, so the idea is that you've got an official definition of important metrics that you need to calculate. Okay, I like that. And I guess if you don't care like 10% either way, then it maybe matters less on like how many assets you have. You said 270 to 320. It's like if that's good, about 300 is good enough, then you maybe care less. But if you want the exact answer, then you need an official definition.

So talk me through, like, are there any other benefits beyond just having a single source of truth? Like, why would you want this semantic layer?

Actually, right now, almost every BI tool has a semantic layer. It's not a new concept. If you use Tableau, you build metrics in Tableau. You build them in workbooks. You can build them in data sources. If you use Looker, you have semantics in LookML. You have the equivalent in almost every BI tool. So it's like the bread and butter of analysts to create calculations, to create definitions, to define new terms as they go. So every data system already has lots of semantic definitions.

This is what captures business logic.

The reason you want a semantic layer is because quite often there are lots of duplicated and inconsistent definitions. Many of the definitions are siloed or trapped in some analyst's workbook or a spreadsheet or something. So if you want to reach consistency and alignment across an organization, you want everyone to speak the same language, you want to build this central source of truth, this semantic layer where you have the official definition that everyone can trust. This is certified.

This is the right definition for total revenues for daily active users. There might be lots of other copies out there for experiments, for ad hoc analysis, for things that were built and abandoned. But that's where you find the truth. So this is why it's so important. And it's important for two things, because you want to know that the number that you get is the right number.

And when we're facing the future, and I guess we'll touch on that a bit more, if you want to adopt AI tools, if you want to have like a ChatGPT for analytics and ask, what's the number of daily active users? What's the interest in the past month? And if you want to trust the number that this tool gives you, you have to rely on some sort of truth, some certified set of definitions to train those models.

Yeah, I can certainly see how if you've got lots of different analysts working on similar problems, they're all going to calculate things in similar but not quite the same way. In fact, I think I've even done it myself. I've had to go back and calculate something I knew I calculated last year, and then I've probably done it the same, but maybe not. So having that single standard definition is going to reduce that sort of duplication of work.

Actually, this leads to a question: how do you make sure that you only have one version? The implementation seems like the hard part, stopping analysts doing all this duplicate work. I mean, you touched on the most important part. I think today almost every organization understands they need a semantic layer. But at the same time,

you rarely see well-built, managed semantic layers. And the reason is it's actually a hard implementation process. And let me try to, you know, summarize what it consists of. So first, you need to curate the right metrics. You have...

thousands of metrics that were built over the years in a large scale enterprise. They're all buried everywhere in BI tools, in data applications, in data science notebooks. So you need to find, map them, understand which ones matter, which ones are the most important KPIs. You need to resolve all inconsistencies and duplications. If you have three versions of something, understand what is the right version that you want.

to add to this. So after this curation, mapping, you know, understanding which measures capture business value, which you can delete just to keep the environment clean and clear, then you need to code them.
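
As a rough sketch of that curation step, an editor's illustration under assumed inputs rather than a description of any specific product, duplicate candidates can be surfaced by grouping the definitions discovered across BI tools by a normalized metric name:

```python
from collections import defaultdict

def find_duplicate_candidates(discovered_metrics):
    """Group metric definitions found across BI tools and notebooks by a
    normalized name, so near-duplicates can be reviewed and resolved.

    `discovered_metrics` is assumed to be a list of dicts such as
    {"name": "Daily Active Users", "tool": "Tableau", "expression": "..."}.
    """
    groups = defaultdict(list)
    for metric in discovered_metrics:
        key = metric["name"].strip().lower().replace(" ", "_")
        groups[key].append(metric)
    # Only names defined in more than one place need a human decision.
    return {name: defs for name, defs in groups.items() if len(defs) > 1}
```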

All right. That's a big migration process. But once they're coded, what does the workflow look like? So you cannot just tell the organization, listen, now we have a semantic layer. These are your definitions. That's what you're going to use from now on. Stop creating new stuff in your notebooks. That's now your dictionary. And that's what you can use because things change all the time.

So a day later, there are already 20 new metrics out there buried in all those BI tools. So if you want to really build and manage a semantic layer, you also need to develop a workflow.

that takes into account that things will change all the time. I think these are the main challenges. So curating, creating it, and then building a workflow that keeps maintaining it, up to date and consistent, as you go. Yeah, I can certainly see how there's a lot of subtle process and organizational challenges in there. Maybe we'll back up and say, who needs to be involved in creating all these definitions?

It sounds like you're going to need someone technical in order to create this, but also someone with business knowledge as well. That means different teams working together. So yeah, talk with everyone who needs to be involved in this and what their different roles are.

I mean, everyone, I mean, things like business logic is created by the business, not some back office of engineers who will decide what's the important metric. So the business logic is created, it's evolving, it keeps changing, and this happens on the business side. And, you know, analysts, business analysts that work closely with the business, they're usually embedded in business domains.

That's where the inception of new semantics, new logic happens. And at the same time, although there were attempts to teach analysts how to code or try to turn them into engineers, there is an engineering effort involved. Today, when you say governed, you mean coded, version controlled, documented, tested.

It's not just, you know, I'm not just writing the definition. There is a way to manage this like code so that it's actually governed. So to have version controlled coded metrics, you need engineers to be involved.

And they need to understand what analysts want them to code there. Now, there are ways to bridge the gaps a bit better. Now, especially with AI, you have co-pilots and auto-code generators. I mean, these gaps, just coding things, these gaps will become less and less big. But still, you need to design, to architect this properly. You need to make sure that the data that those metrics rely on

The transformations, the tables, they're also built and designed well. So there is a huge effort here that combines the business as the creators, analysts implementing those or writing them in a data language, and then engineers coding them.

So everyone's involved. Okay, so yeah, a lot of different teams there. Does this have an implication that you need some analysts embedded within those business teams in order to be able to sort of write the definition down in a technical way that's come from the business logic? It sounds like you need someone with both the data skills and the business skills.

So you need business understanding and you need technical understanding. Now, there is a new role in the data space that was invented, I think, by dbt, called analytics engineer.

And the more, I mean, the more we see how this role evolves and, you know, the people in charge, it reminds me of product managers in the software development world. They are kind of the bridge between the business and the data team. They understand the technicalities of data. They can write and code things like engineers, but they're also closer to the business. They work closely with the business analysts.

So analytics engineers, I think they became officially the owners of the semantic layer. They're the ones that build it, that maintain it, and they should be able to manage this conversation. Okay, that's very cool. And it seems like analytics engineer is one of these sort of hot new data roles. Yeah. So yeah, can you maybe get more into depth on like, how does one become an analytics engineer? What sort of skills do you need to do this role?

I've interviewed in the past year over 300 analytics engineers. I mean, I really spoke to so many. And there are different stories. Sometimes it's engineers that really express interest in the business and they're keen to, you know, see the impact of their work. So, and you see the same thing with product managers. This is why I like the comparison. So it's sometimes it's engineers that have a very

strong business understanding and interest and they can speak to business people and they become analytics engineers. But I think more often it's analysts that want to, you know, skill up and become engineers. And it's like a natural path from the world of analytics to the world of data engineering. Like it goes through analytics engineering.

So I've seen both things. Interesting. So you get some very business-focused people, but also there are some where they're kind of halfway in between a data engineer and a data analyst. All right. Yeah. And so is the role of this job, is it basically just like grinding out lots of metrics all the time? Is it like vast amounts of just creating these definitions for how the business wants to run, or is there more to it? So in fact, they are kind of the ones that really understand the importance of business logic governance.

And in a way, you could say they own business logic governance. It started with just modeling in dbt, so writing the dbt transformations instead of having joins and, you know, computed columns built in Tableau. They do it in dbt, and they're the ones coding these things in dbt. And then, you know, the natural next step would be, I mean, transformations capture logic, and semantics, metrics capture logic too. So they're coding this stuff.

But like every engineering role, it's beyond coding. It's really architecting it the right way. It's really understanding how you build the processes and the workflows, how you determine that something's a duplicate. How do you know which ones are the certified things, which are not? How do you design the system? And many of them are actually pretty senior in their skill set, in their impact. I mean, the level of impact that they have is significant.

They can sometimes work with like 20 engineers on the data platform, hundreds of engineers in the business side, and there are like four or five analytics engineers that really design the whole interface.

So you mentioned tools like dbt and I guess there's a lot of SQL in the background and BI tools like Tableau, whatever. So beyond that, I mean, because generative AI is sort of working its way into absolutely everything. Is there an AI angle here? Is that changing how the analytics engineer role works?

Everything that will impact semantic layers and governance will change how analytics engineers work, or their role. I think, I mean, maybe five years ago, the companies that introduced semantic layers, and now the data visionaries, everyone said semantic layers are important. If you want to trust your data, if you want to have this source of truth, you need to build a semantic layer. I think with AI, this becomes clear. Without a semantic layer, it's not going to work.

If you build a central governed semantic layer, it might work.

And you need to do it the right way. So I think that's where analytics engineers will become key, and also, you know, the data leadership, how it owns this and builds the roadmap for that. But they will be the ones implementing this. And I think good or well-performing data teams will make AI work and others will fail. And the question is whether they will be able to manage a centrally governed semantic layer, right?

Okay, so I guess since some people are going to succeed, some people are going to fail, we need to figure out how to be in the success group. Maybe just for some motivation, do you have any examples of companies where they've built a semantic layer, they've seen some good results? Like talk me through some case studies.

I think it's still in the early stages. I'm working with a big customer that started working with the semantic layer in dbt early on. They built all the metrics there. They really have a source of truth for metrics. They let analysts create things

So they have their playground, a place where they do things, but as things become mature, they get centralized. It's a very big company, like 5,000 people, hundreds of analysts. And there are like a dozen analytics engineers that really managed to centralize the metrics for each business domain. And their data is really contributing lots of value to their decisions, to the business decisions. Their whole business relies heavily on data.

It's a big European unicorn company, a micro-mobility company. So I've just seen this. And they were one of, I mean, the first adopters of the dbt semantic layer. I guess the tricky part is going to be measuring success. Like what constitutes success? It sounds like there's some productivity benefits from not duplicating work. And there's some more nebulous things about making less stupid decisions because the numbers are wrong. Can you talk me through, like, how would you go: we've implemented a semantic layer, this is how we know it's successful?

If you invest in governing your data model or your business logic and semantics, and I would say also transformations, so tables, fact tables, this altogether captures your governed business logic. So we actually introduced something we call a governance score. If you could say for an organization, for example, what percentage of their dashboards

rely on governed metrics, meaning metrics that are in the official semantic layer or governed tables, tables that are coded in dbt. What percentage of the queries are from governed resources?

This gives you already a first indication of how well you can trust the results there, because they don't rely just on any join that someone did in an external table, with a raw table, with a CSV, but on data that is actually version controlled, that is coded, that is governed. So that's one way. And we took this concept of governance score, like the simplest way, and expanded it to more sophisticated governance insights. So what's the duplication?

How many duplicates you have in your official semantic layer? Is this like zero duplications or close to like 20% of it is duplicated, 30% is duplicated? How well is it documented? What percentage of your metrics are poorly documented or well documented? And you can think where we can take it. So if you use those like governance scores, you can actually see how close you are to actually...

using governed, controlled logic and not just anything that someone creates. So I think this will become more and more useful.
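
A back-of-the-envelope version of the governance score and duplication insight Sarah describes could be computed like this. This is only a sketch with assumed input fields, not Euno's actual scoring:

```python
def governance_score(dashboards):
    """Share of dashboard queries that read from governed resources, i.e.
    semantic-layer metrics or tables coded and version-controlled in dbt.

    `dashboards` is assumed to be a list of dicts like
    {"name": "revenue_overview", "queries_total": 120, "queries_governed": 90}.
    """
    total = sum(d["queries_total"] for d in dashboards)
    governed = sum(d["queries_governed"] for d in dashboards)
    return governed / total if total else 0.0

def duplication_rate(semantic_layer_metrics):
    """Fraction of official metrics flagged as duplicates of another definition."""
    if not semantic_layer_metrics:
        return 0.0
    duplicates = sum(1 for m in semantic_layer_metrics if m.get("is_duplicate"))
    return duplicates / len(semantic_layer_metrics)
```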

Okay, so I like the idea of just tracking what percentage of your metrics are actually governed and what percentage is just sort of ad hoc analyses. I guess this has a sort of knock-on implication that you want to just gradually start shifting things to a governed approach. So where do you start? Is there a specific order? Should you start with one area of the business? Should you do a few metrics from every area of the business? What's the plan for actually getting all your data governed?

If I needed to say it in simple words, I would say where the business value lies. I would want to start with the thing that brings the highest value to the business.

That's where you want to start. So many practices are like, let's start with the main KPIs and pick the most important business domains where all the focus is. But you can actually use, and that's also something that we introduced, you can use utilization as a very strong indicator of value.

The measures that are currently used, the dashboards that are currently used, that people actually watch them and use them and refresh them. That's where the business value is right now from all the data assets that you build. Let's start there. Let's make sure that the highly used data assets, data products are governed. And then, you know, you can prioritize based on that. That's a very strong indicator for value. And then you have cost where you spend money.

if you spend a lot of money on these measures, on these tables, maybe you want to go there because you want to make sure you spend money on the right things. I love that idea of using utilization to see where value lies in your datasets. Because obviously, like, so many dashboards are just like, "I created it," and then maybe someone looks at it, maybe they don't. But some, it's like, "Well, yeah, okay, the C-suite's looking at this, like, every day just to track something important." That's obviously going to be a much higher value.

I can share with you a statistic. For almost every customer I'm working with, over 50% of their dashboards have zero utilization in the past two, three months. Over 50%. It's crazy. You have all those dashboards that usually sit on extracted tables, and you waste money on that, and everyone gets lost there, and it's not even used by anyone. So yeah, it's super important.
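
A simple way to operationalize "govern where the value is", sketched here by the editor with assumed fields for recent views and cost, might look like this:

```python
def prioritize_for_governance(dashboards):
    """Rank ungoverned dashboards by recent utilization, with cost as a
    tiebreaker, so governance effort goes where business value already is.

    Each dashboard is assumed to look like
    {"name": "...", "views_90d": 412, "monthly_cost_usd": 35.0, "governed": False}.
    """
    candidates = [d for d in dashboards if not d["governed"]]
    return sorted(candidates,
                  key=lambda d: (d["views_90d"], d["monthly_cost_usd"]),
                  reverse=True)

def zero_utilization_share(dashboards):
    """Fraction of dashboards with no recent views; Sarah's observation is
    that this is often above 50%."""
    if not dashboards:
        return 0.0
    return sum(1 for d in dashboards if d["views_90d"] == 0) / len(dashboards)
```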

Absolutely, yeah. So tracking that utilization does seem an important facet of governance. Okay, so I'm wondering, since the big point of the semantic layer is to reduce the amount of chaos, so you're not having to track individual datasets, you're just tracking metrics.

As this scales, do you have the problem that you've then got to track all the metrics you've just created? I mean, my personal take on that, and there are different pieces, but my thesis is that every data application and every business intelligence tool will have its own local semantic layer. You will have the Looker semantic layer, the Tableau semantic layer, the Hex semantic layer, the Shing semantic layer, and we can go on and on and on.

And it will be like the place where things are created fast, analyzed and stored locally, if you want to experiment with something or try something. And then there will be a sort of shift from the local semantic layer to the universal semantic layer. And this is something that is consistent and aligned across every data application and all data users. And to do that, you have to own very powerful observability and mapping tools.

You have to see what's created everywhere. You cannot just expect analysts to say, well, you know, this is an important measure, let's open a ticket for the analytics engineers that maintain the universal semantic layer to add that. They will create it in their tool, add this to a dashboard. This dashboard will gain traction and no one will bother, because no one has time, right? Everyone's working so hard to deliver their product on time.

So you have to own and obtain powerful observability tools that map everything that exists, that identify duplicates, that indicate: this and this and this should go to the universal semantic layer, it's time, shift them. They're already highly used, they are still trapped, you want to add them, you want to align them with everyone.

And on top of this powerful observability capability, you can build a workflow. Okay. All right. So I guess the intermediate stage, before everything's in this universal sort of metric store, is you've got these sort of local stores, you're dealing with like one department at a time, and then gradually you can sort of shift them to this sort of central place. Okay. The next thing is maintenance. So once you've created these metrics,

I know, especially at DataCamp, the business is always like, well, you know, are we calculating this in the right way? And you're going to want to update the metrics. Then you've got, I guess, multiple versions of how you calculate, I don't know, your customer lifetime value or your customer acquisition cost. And I guess you want the new version, but also you want the old version just for consistency with previous reporting. How do you deal with multiple versions of metrics?

So thank God we've got Git, right? Introduced into data, finally. So I think, I mean, today, just as you manage dbt transformations in the warehouse, in dbt, in a Git repo, you do the same for metrics. It's managed like code with version control. You can roll back. You always test new versions and you run regressions and everything you do with code,

now you can do with metrics. And you know, for each report, which version of the metric is used; it's part of the system. It has to be. Otherwise, as you mentioned, you will create a duplicate whenever you want to change something, and a duplicated dashboard. And again, this chaos is formed, you know, just like that.

Okay, that seems to make sense: as long as you capture all your business logic in code, then you've got access to Git and other version control tools, and that way maintenance just happens using the natural software development life cycle.
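
One way to picture "metrics managed like code", again only an illustrative sketch with made-up names rather than a specific tool's syntax: every definition carries an explicit version, and each report records which version it uses, so a change never silently rewrites historical reporting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricVersion:
    name: str
    version: str      # e.g. a git tag on the metrics repository
    expression: str   # the governed calculation for that version

# Both definitions live in the version-controlled semantic layer.
CAC_V1 = MetricVersion("customer_acquisition_cost", "1.0.0",
                       "total_marketing_spend / new_customers")
CAC_V2 = MetricVersion("customer_acquisition_cost", "2.0.0",
                       "(total_marketing_spend + sales_costs) / new_customers")

# Each report pins the version it was built against; rolling back is a git revert.
REPORT_METRIC_PINS = {
    "board_pack_2024_q3": CAC_V1,   # historical report stays on the old definition
    "weekly_growth_review": CAC_V2, # new reporting uses the updated definition
}
```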

Okay. So I guess the other thing is just about who in leadership needs to be involved in this? Because you've got data teams, you've got business teams. Should managing all this, is this the responsibility of the chief data officer? Is it your chief revenue officer? Is it someone from IT? Like who needs to take charge of this?

So it's obviously depending on the scale, on the scale of the organization, right? Most organizations, smaller organizations, usually don't even have a CDO role. They will have like a VP of Data and Analytics. In a good case, often it's like director level, director of data platform, director of data analytics. So they usually own the implementation, but then I think it's split, because those data folks and the leadership of the data, I think they understand quite well

why you want to build a semantic layer that you need to govern. They have a pretty hard education role: to educate the business leadership on what semantic layers are, what they have to do with AI and with the pace at which they're getting reports, and why they cannot trust the numbers. And it's a pretty difficult role to educate everyone on that. So they usually are the champions. I mean, they buy the tools, they implement them, they own them.

But they need to get the business buy-in for that. And it's on them to teach them, and we help them, but to teach them why those things are related in the first place. Yeah, I can certainly see how there's going to be this big education component to make sure you've got data people and business people talking to each other productively. Okay, so you mentioned that for smaller businesses, you're not going to have this chief data officer role. And so I'm now wondering:

Is there a difference in how you go about implementing this if you're a small business versus a large enterprise? I've been speaking and working with organizations of like 200 people organizations, 1,000 people organizations, 10,000 people organizations. I think that, I mean, what changes entirely is the level of chaos.

When there are 100 people, 5, 6 data people, they just know everything by heart. They can talk to each other. They know where to find things. They know which metrics exist and how they were defined and when and by whom. It's just easy. It's still solvable without all those tools. They might even say, well, we don't need a semantic layer. We just don't have conflicts. We don't have duplicates. We don't have all of that. We control it. We manage it so well.

And then there is like a phase transition. When the number of data practitioners exceeds like 25 people, then you lose control.

And often if you don't build it right from the beginning, then you start replatforming, migrating, changing everything. Almost every data team is in some sort of replatforming project. Now the most popular I hear about is replatforming to governance. Everything was about democratization, access, access, democratization. Now we replatform to embrace governance. So if you don't do it when you're small, you replatform. I think as you become large,

It's just the pace at which new things are created and the amount of logic that you already have across the business domain. It's something that you cannot control just by aligning everyone and speaking with everyone. And that's where it becomes critical.

Okay, yeah. So it sounds like one of the main benefits to this then is that it allows you to scale your data teams. It allows you to scale the usage of data because there is just less chaos and you don't need to spend more time worrying about consistency because things are sort of guaranteed to work that way. And maybe the biggest benefit, and that's something that I think business leaders will find super relevant and interesting, is really AI.

Because today they depend on those dozens of data people to just create a report for them that tells them, you know, how much new revenue was gained in the past quarter based on, you know, territory, campaign, business product, and so on. And this reality of just asking and getting an answer that you can trust, that's a reality that every business leader, I think, dreams of. It just seems still too far away. But this really is what's enabled

by building a centrally governed semantic layer. This dream will no longer be a dream.

Ah, okay. So this is really, if you want to get to self-service analytics and just have that AI chatbot that's going to give you the answers to all your data questions, then you need this semantic layer to be built first. That's cool. So can you talk me through how this all fits together then? So you've got a generative AI layer, you've got a semantic layer. Is there anything else that needs to happen in order to realize this self-service dream? So I like to draw it like that.

The journey to AI, okay, for this chatbot AI analytics tool. So the first step, I think, is beginning to build this semantic layer and getting this cross-ecosystem observability of all the metrics that are being created everywhere, whether it's in this semantic layer or in the local semantic layers or in notebooks or wherever. So you start by getting observability and mapping the utilization.

And then if you think about how you train those AI tools, you need to really tell them, you know, this metric is certified for your training model. This is not. This is just an experiment. This is just a duplicate. So this is sort of you can think about it as a layer of governance insights that help you mark that's certified, that's not. That's certified, that's not.

And once you're there, once you have everything mapped, you have a semantic layer, you have this certified labeling mechanism that is smart. It's not just a stupid manual mechanism. It relies on governance insights. From that point on, and we've run a few POCs on that,

The tools that we already have in AI will just make it work at the data model level. So you'll be able to ask questions in natural language like, show me the dashboards that report the daily active users, numbers that were used in the past two months by the product department and are governed. You can get the exact data. And once you have that, you can ask any question you want on the data, because it knows how to find the right places to query and generate the right queries.

So the building blocks are creating the semantic layer, getting observability and utilization so that you can actually build a workflow to manage things, to decide what goes there, what needs to be deleted, what's duplicated and needs to be resolved, to build the tools to do that. And then some governance insights that allow you to tag, this is going to the training model, this is not going there. From there on, it's almost a plug and play thing.
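
The "certified labeling" step, in its simplest form, can reduce to filtering what an AI analytics tool is allowed to see. A hedged sketch with assumed metadata fields:

```python
def build_ai_context(metrics):
    """Keep only certified, non-duplicate definitions when assembling the
    context (or training data) for a natural-language analytics tool, so the
    model never learns from abandoned experiments or duplicates.

    Each metric is assumed to carry metadata like
    {"name": "daily_active_users", "certified": True, "is_duplicate": False,
     "definition": "COUNT(DISTINCT user_id) ..."}.
    """
    return [
        {"name": m["name"], "definition": m["definition"]}
        for m in metrics
        if m["certified"] and not m["is_duplicate"]
    ]
```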

Okay, plug and play sounds wonderful. But there's a lot of work at the end of that stage, right? A lot to be done, yeah. One of the big pushbacks whenever you mention data governance is that it's going to stifle innovation. So can you talk me through how do you do data governance in a sensible manner that still encourages innovation?

So you can call it innovation, creativity, freedom. And it's a built-in challenge because governance is usually associated with slowing things down, creating like ticketing workflows, open a ticket, wait for a priority, wait for things to be built for you, just then start using that. So the problem is clear. When you're biased towards governance, everyone experiences friction and bottlenecks and everything's

slowing down because the problem is that freedom for analytics and creativity and innovation is critical.

Because that's how you really solve business questions. You cannot just rely on what exists. You have to get the freedom to build new terms, to create a new analysis as you go. Even if 90% will be garbage, that's how analytics works. And this is why I'm emphasizing so much this observability piece.

You have to let analysts create things independently, creatively in their native environments, preferred tools, preferred language at their pace as they like. That's where the magic happens. But as I said, 90% is garbage. And you will see that through the usage. It's not going to be used. They're just going to create it. No one's going to use it. Create it, try it. It will be local, their notebook. But then they create a report or a data product and it gets traction.

And that's when you understand that this creation needs to be added to the semantic layer. But you have to maintain this level of creativity. Otherwise, again, you're stuck in place and no one wants that.

Okay, so it sounds like you need to distinguish between this is an ad hoc analysis on something new, and this is something we need to reuse. I guess analysts shouldn't be allowed to redefine how a company metric like total revenue is calculated. You need an official definition for that. But if they're just playing with something new, then that needs to be less governed. So let me give you a real world example. So I was working with a customer. They had their definition for engaged user.

And a user that signed up for the, I mean, signed into the application once a week was defined as an engaged user by marketing, by sales, whatever. And they always track the number of engaged users because, you know, usually when the number of engaged users drops, eventually it translates to churn, and no one wants to experience churn, right?

But then they questioned the definition. So the product people, they ran an analysis and they figured that this definition, once a week, is not a good indicator. It's actually twice every three weeks. That's a much better indicator. And they did their experiment and they realized that. Now think about all the dashboards that rely on the once a week definition.

And now they need to go and figure out who is using that. And now we want to change the terminology. And I want to use twice every three weeks. And it's the new engagement user definition. And this becomes like a nightmare. So they keep it in product. And we know where this goes. So in the world of a semantic layer, they would actually be able to create a new version in this certified place. And they will be able to introduce a new concept, a new company-wide concept.

And it will not just be buried in their notebooks, but only when they gain confidence.
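
Sarah's engaged-user story, rendered as a hypothetical versioned definition (field names and thresholds are illustrative only): the product team's new threshold becomes an explicit new certified version instead of living only in their notebooks.

```python
# Two governed versions of "engaged user"; the new one is certified,
# the old one is kept so existing dashboards remain interpretable.
ENGAGED_USER_DEFINITIONS = {
    "v1": {"rule": "at least 1 session per week",
           "predicate": lambda row: row["sessions_last_7d"] >= 1,
           "status": "deprecated"},
    "v2": {"rule": "at least 2 sessions per 3 weeks",
           "predicate": lambda row: row["sessions_last_21d"] >= 2,
           "status": "certified"},
}

def count_engaged_users(rows, version="v2"):
    """Count engaged users under a named, governed version of the definition."""
    predicate = ENGAGED_USER_DEFINITIONS[version]["predicate"]
    return sum(1 for row in rows if predicate(row))
```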

So you have to get both. You cannot just limit that. But then once the official definition changes, you have to allow and enable updates and versioning and all that. I really like the story. And it just shows that there is a lot of value that can be got there just as long as you're governing the right things and maybe giving freedom to analysts to do what they want in other places. All right. So just to wrap up, what are you most excited about in the world of data governance?

So, well, I am the co-founder and CEO of a data governance company called Euno. I think after working almost 20 years with data teams in a lot of fields, in cybersecurity, in healthcare, in fintech, all trying to make sense of data, I figured there were so many challenges there. So, yeah.

My mission is to really help large scale data teams understand data easily and get the value that they can from data. So that's why I chose to do that and build this company and try to solve some of the problems that we just touched.

Yeah, helping people get value from data is a very worthwhile cause. Well, let's be more precise. Helping people, or facilitating the creation of this central governed semantic layer, and taking organizations all the way to AI. That would be the more precise way of defining this mission. Yeah.

Okay, nice. Yeah, semantic layers certainly sound very exciting and I love that it enables that all that sort of fun generative AI use case as well. Excellent. All right. Thank you so much for your time, Sarah. Thank you. Happy to be here. Thanks for inviting me. Bye-bye.