
Open Source Data Analytics with Sameer Al-Sakran

2024/12/3

Software Engineering Daily

People
Sameer Al-Sakran
Sean Falconer
Topics
Sean Falconer: This episode discusses the challenges of data access in data analytics and business intelligence, and how open-source tools like Metabase address them. Metabase, an open-source business intelligence tool, focuses on data exploration, visualization, and analysis, and lets users interact with data with or without SQL. The discussion also covers the evolution of the data analytics field, Metabase's development history, and the reasons for choosing the Clojure language.

Sameer Al-Sakran: Metabase's goal is to let ordinary employees easily access and analyze data without depending on analysts or engineers. What distinguishes Metabase from other business intelligence tools is its emphasis on sparking users' curiosity and desire to explore, rather than merely providing pre-built dashboards. Metabase is simple to install and configure, and supports many databases. Clojure was chosen to simplify deployment and installation and to improve code maintainability. Metabase commercializes through cloud services, value-added features, and white-label licensing. Its permission model is fairly complex, with several kinds of permission controls to meet different users' needs. Metabase chose open source mainly because it considers that the best way to consume software, and because open source fosters community participation and improvement. On AI in data analytics, Sameer Al-Sakran believes natural language will become an increasingly common interface for analytics tools, but LLM-generated queries or analyses can be risky, because analytics demands very high accuracy.

Sean Falconer: The episode explores trends in the data analytics field, including advances in natural language processing and improvements in tool usability. It discusses how to design schemas that are easy to understand, and how to lower technical barriers so that more people with domain expertise can make better use of data. It also explores the possibility of using generative AI to simplify data access, and how Metabase is adapting to that trend. Metabase's permission model aims to balance data security with user access, offering several kinds of permission controls. Finally, the episode discusses the viability of open-source business models and how to create commercial value by offering value-added services.

Deep Dive

Key Insights

Why does Metabase focus on making data accessible to non-technical users?

Metabase aims to empower users with real jobs to explore and answer their own data questions without needing help from analysts or engineers. It reduces the bottleneck created by technical teams and allows users to satisfy their curiosity independently.

What is the primary goal of Metabase in the analytics space?

Metabase is designed to be the 'last mile' in analytics, enabling organizations to get data into the hands of the people who need it most, without requiring extensive technical expertise or large data teams.

How does Metabase differentiate itself from tools like Tableau or Looker?

Metabase focuses less on creating static dashboards and more on sparking curiosity and enabling users to explore data iteratively. It allows users to refine and answer follow-up questions easily, making it more interactive and discovery-driven.

What trends has Sameer Al-Sakran observed in the evolution of the analytics stack?

Sameer notes that tools have become significantly easier to use over the years, reducing unnecessary complexity. He also highlights the rise of natural language processing (NLP) as a major trend, though he believes deterministic tools will remain critical for accurate analytics.

Why does Metabase use the Clojure programming language?

Metabase chose Clojure for its ability to manage parse trees and transpilers effectively. It also leverages the robust JDBC driver ecosystem in the Java world, which was a key factor in their decision to use a JVM-based language.

What are the challenges of using LLMs for generating analytics queries?

Sameer believes that LLMs may struggle with generating accurate queries due to the high stakes of analytics accuracy. He suggests that deterministic tools, combined with LLM-driven agents, are a more promising approach for ensuring correct data outputs.

How does Metabase handle data caching to improve performance?

Metabase offers in-memory caching and pre-computation of models and metrics. It also supports caching data from multiple sources into a centralized data warehouse, which acts as a read-only cache for faster access.

What is the motivation behind open-sourcing Metabase?

Sameer argues that open-source software is the best way to consume software, especially for critical data stacks. It allows for easier audits, security, and customization, and fosters community contributions and feedback.

How does Metabase monetize its open-source product?

Metabase monetizes through cloud services, paid features for larger-scale deployments, and white-labeling options for embedding Metabase in other applications. The company focuses on providing value to the installer and their bosses, with free features for installers and paid features for higher-level needs.

What does Sameer think about the future of software development with AI assistance?

Sameer believes that while AI will reduce the mechanical skills required for software development, the value will shift to those who can identify what to build and how to solve problems effectively. Prompt engineers or product creation specialists will still be needed to guide AI in building better products.

Chapters
Metabase is an open-source business intelligence tool designed to make data accessible to non-technical users. It focuses on ease of use and exploration, allowing users to interact with data with or without SQL, unlike other tools that focus on pre-built dashboards. The goal is to empower everyday users to answer their own data questions.
  • Metabase is an open-source business intelligence tool.
  • It emphasizes ease of use and data exploration for non-technical users.
  • It allows interaction with data using or without SQL.
  • It aims to reduce reliance on dedicated data analysts for simple data questions.

Shownotes Transcript


Data analytics and business intelligence involve collecting, processing, and interpreting data to guide decision-making. A common challenge in data-focused organizations is how to make data accessible to the wider organization without the need for large data teams. Metabase is an open-source business intelligence tool that focuses on data exploration, visualization, and analysis.

It offers a lightweight deployment strategy and aims to solve common challenges around data-driven decision-making. A key aspect of its interface is that it allows users to interact with data with or without SQL. Sameer Al-Sakran is the founder and CEO of Metabase. He joins the show to talk about the challenge of data accessibility, the evolution of the data analytics field, key lessons from his 14 years leading Metabase, why the platform uses the Clojure language, and much more.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

Sameer, welcome to the show. Thank you. Thank you. It's great to be here again. Yeah, thanks so much. So we're talking analytics today. I think various approaches to analytics have been around for a long time. There's been a bunch of different generations of BI tools, things like Tableau and Looker. And there's also, I think, different takes on the problem of analyzing data, things like driving insights,

using things like Streamlit where you're actually building your own dashboards. And then there's frameworks like dbt that are focused on data transformation and preparing the data before it reaches the visualization stage. And Metabase has been around for nearly a decade now. You've probably seen a few evolutions of products in the space. So what's the background of the company and where does Metabase kind of fit into this world of analytics? - Yeah, I think we're the last mile. So in general, in most companies, there's data that people are interested in

That lives in one or more places, depending on how bundled up you are. It could all live in one really gleaming, perfect data warehouse, where everything is perfectly organized and everything's right. Or it could just be this complete defederated mess. I think we are the thing that lets you get data into the hands of the people that have real jobs. And so I think one of our core kind of center points has always been the poor sucker with the day job. And really, it's about how to get as much of their curiosity satisfied by them, without help from anyone else, as possible.

So there's been, like you say, a lot of really compelling ways to instrument your company, to have better understanding what's happening, to have better awareness of what certain segments of your customers are doing, what people are clicking on, who's signing up for what, when. All these things have been pretty dialed in for decades. I think the playground that we're trying to really do well in is just there's a normal person that has questions, right?

And we don't want there to have to be an analyst or an engineer that has to deal with every single question that comes up.

So a way to think about us is that most times you see a dashboard, there's a certain set of questions that dashboard answers. And then there's an easy anticipatory set of, like, questions three through 20. And we're trying to make it really easy for someone to answer those. So, you know, compared to a Looker or a Streamlit or a Tableau, we're less about creating a dashboard that's consumed as is and more about creating a dashboard that really just sparks some amount of interest or curiosity,

and that the subsequent clicks and subsequent iterations and refinements are where a lot of the magic of Metabase shows up.

So essentially the typical target user, then, is a non-technical user that needs to be able to not only analyze the data, but perform essentially this discovery process, because they might not even know what they're looking for or what questions they have. They want to be able to kind of mix and match and explore. Yeah, like that's the final person, the final constituent. I do think that analytics is often a multiplayer game.

And there are different roles people fill. And we are generally set up by engineers. So for the most part, an analyst is not the person who is setting us up. It's usually an engineer, and it's usually someone that has a database that's lying around. And there's people out there in the company that need stuff from that database.

And so usually the person that, like, downloads a Docker image and then runs it is not the final user. It's the person that is serving the final user. So we do separate, in kind of our own internal lingo, the installer persona versus the end user versus analysts and pro users. But in general, our heart and soul is helping the poor sucker with a day job get their questions answered themselves, or at least a subset of that, and taking the burden off of the engineer that's currently a bottleneck. Okay. Yeah.

Yeah. And I definitely want to get into some of the details on some of the configuration and setup process. But maybe before jumping there, since you've been in this space for a long time, I think you founded multiple companies related to analytics. What are your thoughts on the evolution of this analytics stack through your career? And what surprised you? What trends, I guess, are you paying attention to today that you maybe were not on your radar a few years ago?

I mean, I think the big one and the easy one is just like natural language processing finally kicked up a viable solution to many things. And it turns out to be even more broad ranging than previously feared. So I think that's the kind of easy knee jerk response. I do think there is something, I think that's part of a larger secular trend that I've been following for, I don't know, decades at this point, which is just that tools have gotten easier to use.

And that there is an intrinsic complexity in analytics. And there are certain things that are just naturally annoying about how to calculate net revenue retention, for example. There's some math you got to be aware of. There's some choices you have to make. And the actual equations are kind of annoying. And encoding those in SQL or Python or what have you is just kind of a pain.
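To make the flavor of that concrete, here is a minimal sketch in Clojure of one common convention for net revenue retention; the numbers and key names are invented, and real definitions vary by company, which is exactly the annoyance being described:

```clojure
;; One common convention: NRR = (starting MRR + expansion - contraction
;; - churn) / starting MRR. Companies disagree on the inputs (e.g. how
;; plan upgrades are attributed), which is where the real pain lives.
(defn net-revenue-retention
  [{:keys [starting-mrr expansion contraction churn]}]
  (/ (+ starting-mrr expansion (- contraction) (- churn))
     (double starting-mrr)))

(net-revenue-retention
 {:starting-mrr 100000 :expansion 12000 :contraction 5000 :churn 7000})
;; => 1.0 (exactly 100% retention in this made-up example)
```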

But there's also a lot of just unnecessary complexity that has over the years been chipped away at. If you were trying to calculate anything in the 70s, you had to write a bunch of code. And gradually, SQL took over. And you kind of walk that forward to Excel, then to SQL, Tableau, and kind of the last couple of generations of more post-iPhone software, where everyone just has a much higher bar for interaction quality and ease of use. And I do think that there has been a very...

palpable sense of simplification of the tools themselves. And it's not that what users are doing is getting simpler. It's just like there's less...

self-inflicted annoyance. And so I do think that just the general user experience has improved dramatically over the last 10 or 20 years. I do think there's also a lot of interesting things happening around data shaping itself. I think there's always been a question how you should store data, what the appropriate format is, how to deal with consistency. There's all kinds of textbooks about data warehousing. But I do think one of the things that I... I don't say it's surprising, but

I don't think I would have given it as much weight as I currently do, but I think the success of a self-service data organization largely revolves around what schema you present to users.

And so, given a choice of where to spend time, spend it on getting the schema cleaned up, specifically in a way that lets a normal person with a normal cognitive model of their business look at it and recognize what they're looking at. So I do think there's often data-at-rest formats that make a ton of sense from the efficiency or consistency or just convenience perspective that essentially make it impossible for anyone that's not eyeballs deep in the actual code base of that application to make sense of it. Yeah. So...

What does that look like, in order to present a schema that's understandable by someone who's not in the weeds of the database or the data warehouse? I think it's fundamentally about resisting the urge to normalize everything, to have workhorse tables that are both enriched and manageable in size. So anything over 20-ish columns becomes harder and harder to use.

The columns should have names in English, or whatever language your company runs on. You should be able to understand what's in a column without having to look something up. And there should be a relative, let's just call it simplicity, to how concepts are represented. So users have addresses, and ideally it's not like two tables with a foreign key from an address into the user.

And while that may be accurate, it may represent the fact that certain people have addresses historically. That sort of thing makes it really difficult for someone that's just trying to look up customer data to make heads or tails of it. And there's a lot of things like that where...

You probably want to have specialized data sets that are just views on whatever data looks like at rest. And you probably want to iterate those based on like department or use cases. And so there's a lot of things that are very brilliant ideas that an average database designer would have that essentially make that data set unusable by anyone who is not as smart as them.
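As a rough illustration of that advice, here is what such a workhorse view might look like, sketched with HoneySQL (a common Clojure SQL DSL); the table and column names are invented, and the same view could just as well be written in plain SQL:

```clojure
;; Pre-join users and addresses into one wide, readably-named view so a
;; non-technical user never has to chase a foreign key.
(require '[honey.sql :as sql])

(sql/format
 {:select    [:u.id :u.full_name :u.email :u.signup_date
              :a.city :a.country]
  :from      [[:users :u]]
  :left-join [[:addresses :a] [:= :a.user_id :u.id]]})
;; => ["SELECT u.id, u.full_name, u.email, u.signup_date, a.city, a.country
;;      FROM users AS u LEFT JOIN addresses AS a ON a.user_id = u.id"]
```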

I do think there's a need to, I don't want to say smart, because I think people in tactical roles are generally fairly intelligent in my experience, but they know their world. They don't know the world of relational databases. Yeah.

And making them learn the world of relational databases to get their stuff done kind of puts up an initial barrier. If you'll forgive me, I've often used the blogging metaphor for this, where once upon a time, to write something on the internet, you had to learn PHP and how to do various command line shenanigans and set up this, that, and the other thing. And at some point, the amount of effort it took to set up a blog was more about configuration and installation than it was about the quality of writing.

And hence, you had really, really good writers that weren't able to get their word out. And when you reduce the technical burden of getting the written word out there, the people that win are the ones actually skilled at writing, versus skilled at setting up Unicorn or Nginx or what have you. And so I think there's something similar that happens with most organizations where

Inside of a company, the people who have the most nuanced view of revenue retention or active users or the specific mechanics of a checkout funnel are not necessarily people that know how to write Python or SQL. The people that are running that funnel or that retention analysis are the ones actually talking to users and have a fairly specific understanding of what people are doing. And they're the ones that know whether you should count a plan upgrade as part of retention, and whether you should attribute that to the originating plan or the final plan.

I think also, besides having sort of that domain-specific knowledge that the person who's maybe writing the query doesn't have, like sort of the business side of it, different people are also going to have different perspectives and experiences, where they might be able to solve a problem by bringing in new information from this other domain that feels disconnected. Because they have experience in it, they're able to essentially recognize those patterns across things that on the surface maybe seem disconnected. Yeah, for sure. Yeah, I mean, I mostly agree with that, but I do think that, in terms of just the data set shape, just to kind of pop the stack a little bit, I do think that one of the critical things is not prematurely abstracting

and letting the different usage of datasets have different dataset shapes. I know it's not exactly what you're talking about, but I just, sorry, my brain just went off on a rail there. One of the things you talked about there, I mean, you had the analogy about blogging, like if we can essentially reduce the friction to getting up and allowing someone who's good at writing to write and not have to deal with sort of these technical hurdles, then you're going to end up with a lot more people just, you know, writing. So if we can reduce the sort of technical hurdles and configuration steps involved with accessing data,

then we're going to have a lot more people who are maybe good at actually driving insights for the business from the data available to do that. Now, you mentioned essentially all the interest of course around LLMs. And I think there's a number of companies that are trying to leverage generative AI now as essentially this like

interface to democratize access to data. And I'm curious, what are your thoughts on that? And how does sort of Metabase fit into that world? Yeah, I mean, I think there's maybe two different angles on that, the data cut and then the remainder. So kind of carve off two pieces and there's some like residual. I think that it's pretty clear to me at least that some subset of people want to talk to the computer.

And the idea of unstructured, just natural language as an interface for existing functionality is pretty much written into the timeline. I suspect that there's going to be some set of things where people will just naturally and organically want to start talking or typing in a way that's conversational natural language. And so I think that's going to increasingly be just the hard expectation for all tools in analytics: to support that as a UX paradigm, much the same way as when mice showed up, all of a sudden you needed to have a menu system. And if you don't have a menu system and everything is command shortcuts, you're kind of weird and you have to kind of explain yourself. Now, there's still tools that are 95-plus percent driven by keyboards today, even in a world with phones and touchpads and mice. And I do think there will be, going forward, a need to figure out

where squishy natural language is the right user interface. I think there's a separate notion of kind of using LLMs to generate queries or generate analyses or generate deep dive execution plans or whatever you call them. I think I'm somewhat less bullish on that.

Let me caveat that and say I'm fairly excited about what you can do with agents that are wielding deterministic tools. And I think that there is going to be a lot of ways to push forward what a malleable, squishy agent, one that is basically working in LLM land, with hallucinations and all the usual caveats you have there, is able to give users, if it is able to then invoke tools that return absolutely correct numbers.

So I think one of the things with analytics is it's a very harsh place in terms of expected accuracy.

And that if something is wrong 2% of the time at organizational scale, it just doesn't work. Like if 2% of your numbers in your company are wrong and you just can't tell which 2%, that really doesn't fly. Especially if that 2% changes randomly on you. So I think that trying to generate SQL or generate whatever target language you have is probably a rocky road. And that will work well after I think the game has been played and won. Yeah.

I do think that what is exciting and what I think will start taking hold is, you know, I have this toolbox of deterministic stuff. I have agents or, you know, single or multiple agents can like use that. And then a lot of the heavy lifting is going to come from the actual deterministic tools themselves.
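A hand-wavy sketch of that division of labor, with every name invented for illustration: the model only decides which tool to invoke, and the numbers always come from deterministic code.

```clojure
(defn monthly-revenue [month]
  ;; stand-in for a real, deterministic warehouse query
  (get {"2024-11" 125000, "2024-12" 127000} month 0))

(def tools {:monthly-revenue monthly-revenue})

(defn llm-pick-tool
  "Stub for the squishy part: a real system would call a model here to
  map a natural-language question to a tool name and arguments."
  [question]
  {:tool :monthly-revenue :args "2024-12"})

(defn answer [question]
  (let [{:keys [tool args]} (llm-pick-tool question)]
    ;; the model routes; the deterministic tool produces the number
    ((get tools tool) args)))

(answer "What was revenue last month?") ;; => 127000
```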

And just to like kind of bring it all in, I do think it's important that if the machine produces a number, that number is right. And I think that a world where the number is just like, eh, it's kind of cool, rapidly falls apart once you're talking about like real operations with real stuff that people care about. But I do also think that

Most people that are not in analytics underestimate how much time goes into understanding why number X is not the same as number Y. So my revenue number here is like 1.25. My revenue number here is 1.27. Which one's right? And working analysts tend to spend a disgusting amount of time dealing with that. And so I think things that make that harder are net-net a larger burden on analytics. But in terms of what's happening with Metabase, I just think that for us...

Our target persona has always been the non-engineer, the non-analyst. And I do think those people are rightfully going to want to talk to computers. And so, you know, we've had two different iterations of a chatbot. We're constantly playing around with stuff. We have a couple of dark alphas. I do think that we're also playing a lot with how LLMs can interact.

You know, we've had various classification, clustering, and recommendation algorithms woven into the code base for ages. We've gradually played around with replacing some or all of those with LLMs and LLM invocations. But I think that there's some very, very interesting stuff around, again, the new idiom being: I can talk to the computer. And so I think that's where we're putting a lot of our chips in.

And I think that an LLM as an analyst is still, I think that'll happen. I just think that'll happen well after a bunch of really, really cool stuff gets produced in other ways. Yeah. I mean, I think that what you're saying is right. You need to start sort of with the types of tasks that LLMs are reliable for today, especially when we're talking about analytics. Like you can't get the wrong revenue number. You can't get these numbers wrong or it's going to lead to all kinds of problems, but

But back to Metabase, if I'm using this product, I want to get started with it. What is that process? So I'm assuming that an engineer is sort of the first person that's working with Metabase to get the setup. What is that setup and configuration process?

Yeah. So our whole bag has been that we're the laziest possible option. And I think that we've tried to make it very easy for someone to spin us up alongside a very early-stage project. And so you just pull up a Docker image, you run it, you point us to your data warehouse. At that point, you just add the database and give people accounts. There's a couple of options in the open source version, there's some better ones in the pro version, but generally: just download a jar if you run jars, a Docker image if you don't, or, you know, we have a cloud service if you don't want to do either. But I think it's literally a couple of minutes.
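For reference, the quick start being described looks roughly like this (the image name and port are Metabase's documented defaults at the time of writing; verify against the current docs):

```sh
# Option 1: run the jar directly (requires a JVM)
java -jar metabase.jar

# Option 2: run the official Docker image
docker run -d -p 3000:3000 --name metabase metabase/metabase

# Then open http://localhost:3000 and point it at your database.
```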

And for folks that are super early in the cycle of their product or their project inside of a larger company, we actually suggest that you don't do anything else. Don't make dashboards, don't write reports. Let that happen organically. But getting us set up before there is an analyst is usually our very strong recommendation,

because it can delay the need for analysts by just having there be a controlled place where people have accounts. They can run SQL questions if they know how to write SQL. You can give them SQL templates to run. There's a query builder they can use on their own. There's potentially lots of easy ways to click and hunt and peck their way to Nirvana. But I do think that for us, the primary thing that we're trying to do is delay the need to get serious about data. So I think that there's a certain point where people have to set up a data warehouse, set up dbt, set up a bunch of other stuff. And that all makes a ton of sense, but you should probably have something that lets the normal humans in your company ask questions months or years before that moment. Yeah.

And then is the cloud service the main way you commercialize? It's one of the main ways. I think that there's three ways to commercialize. One of those is just, hey, you don't want to run it yourself, we'll run it for you. We do have an open core model. So there are some features that will help you at a larger scale that you can buy from us.

And then if you want to slap your logo on it and embed it in your application, there's a separate license for that. So potentially, if you want to white label us in your application, that is also a thing for you. Okay. And then as a user interacting with this, or the front end of this, what's that experience like? And then what is going on behind the scenes to essentially pull the data? Yeah, I mean, there's a couple of different folks that I'll talk about. I think the person who's setting this up is probably going to be smashing SQL together.

So you kind of show up, you hit a button, you can write SQL, you can save that, you can write dashboards. And so there's a power user mode effectively where if you know what you're doing,

You can do all kinds of rich dashboards, templated SQL, data transformations, model things, persist models, etc. I think there's also, from the end user perspective, just the ability for me to click on stuff. When I click on stuff, it changes. And then I can use a simple query tool where I just click on buttons and I get answers. And so for that, we have a target language called MBQL. It's just kind of a pre-parsed

pseudo-SQL-ish kind of thing. Our user interface generates MBQL. MBQL then gets transpiled to various SQL dialects or Mongo or some other... Basically, we have a couple other community drivers for non-SQL-based languages.

That gets executed. So everything that is run runs on your database or data warehouse, and then it gets pulled back. And there's a bit of post-processing that gets chucked over to the client. So for the most part, for a whole host of reasons, we don't want to generate SQL directly. And we don't want to force people to have to write SQL directly. So the heart of the application is a transpiler.
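Concretely, an MBQL query is just a data structure. Here is an illustrative sketch of the kind of tree the front end might emit for "count of orders over 100 per month"; the exact keys and field-reference format vary across Metabase versions, so treat this as the general shape, not the wire format:

```clojure
{:database 1                                 ; which connected database
 :type     :query                            ; structured, not native SQL
 :query    {:source-table 42                 ; e.g. the orders table
            :filter       [:> [:field 9 nil] 100]
            :aggregation  [[:count]]
            :breakout     [[:field 7 {:temporal-unit :month}]]}}
;; The transpiler walks this tree and emits the right SQL dialect (or a
;; Mongo pipeline, etc.) for whichever driver backs database 1.
```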

So who's writing the MBQL statements? The computer is. So I click on stuff, and then we have essentially React components to do some stuff. They invoke an MBQL lib library that's in ClojureScript. ClojureScript manipulates this parse tree effectively, and then that gets kicked over the wire, and that's how most of our queries get represented. How many different languages do you have to transcompile into? Yeah.

I always get this wrong, but I want to say there's something like 20 first-party drivers and then maybe another 10 third-party drivers. So we wrote a bunch of drivers for common databases. And then every once in a while, someone in the community writes something for a database we don't support. But on the order of 30 different databases or targets of MBQL. Okay. And why did you choose Clojure as the development language?

I mean, originally it was Python. So the first version of this was random Python. And then when we thought about the deployment installation story, so I kind of glibly mentioned we made installation and configuration really easy. We actually went through a lot of trouble for that. And we use Clojure to do that in many ways. So we wanted to have a single atomic binary you can download. We

We wanted to have mature database drivers. And so we really didn't want to be forced to run lots of weird processes in a Python-style Docker image, where there are just multiple modes of failure. And so we ended up deciding to use a JVM language, and tried to port the Python to Scala. That didn't go all that well. And then we decided to move to Clojure after a week of banging our head against Scala. And it was...

It was specifically the ability to manage the transpiler and just dealing with parse trees that made the choice of Clojure specifically compelling. Was that the main advantage over, say, writing the code directly in Java against the JVM? Yeah, so we knew we wanted JDBC drivers. I still think the driver ecosystem in the Java world is, in general, pretty robust and pretty reliable, especially compared to Go or JavaScript at the time. JavaScript's gotten better. Go is still what it is. It's all right.
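The appeal is that one call shape works across vendors. A minimal sketch using next.jdbc, a common Clojure JDBC wrapper (the connection details are invented):

```clojure
(require '[next.jdbc :as jdbc])

(def ds
  (jdbc/get-datasource
   {:dbtype "postgresql" :dbname "analytics"
    :host "localhost" :user "reader" :password "secret"}))

(jdbc/execute! ds ["SELECT count(*) AS n FROM orders WHERE total > ?" 100])
;; Swap :dbtype for "mysql", "sqlserver", "oracle", ... and the same
;; call runs against a different vendor's JDBC driver.
```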

And yeah, so there was the choice between writing it in Java or Scala, but the decision to use a JVM language was probably the first decision we made. Has that been, you know, that choice around that language, has that been a challenge in terms of bringing in new engineers to the company? Like, is it harder to find people that know the language? No. I mean, I think it's actually been beneficial.

A lot of people want to write in Clojure. It's one of those languages which just has a specific set of ergonomics. And if you don't like parentheses, sorry, it's really not going to make you have fun. But I do think it's given us a pretty concrete advantage where lots of people just want to write Clojure for a living. And we have that as a benefit of working on our code base. So I think that from that perspective, it's been very, very beneficial. I also think that, I mean, this is my personal opinion, but there are good engineers and bad engineers.

And good engineers can pick up new languages. Not really as a consequence of being good engineers, but I think that if you're a good engineer in C# or F#, you can probably learn Clojure. And so in general, we have been very cool with people coming in wanting to learn Clojure who don't necessarily have it dialed in yet. Are there certain advantages or disadvantages to running on the JVM for this particular application? The main advantage we have, specifically for the open source self-hosted world, is that it's just a single file.

Like you download an uber jar, it's a single download. Either it works or it doesn't work. You hit java -jar and run it, and either it works or it doesn't work. And there's just a certain predictability and atomicity to the installation.

So that's been a huge, huge thing. And I really don't think that we have grown as fast or as well had we had a 20-page installation process that required compiling native extensions and scouring some repository for the right version of something. So our ability to build that single binary has been critical. And so I still think that was a category of the right thing to do all along. Dealing with JVM...

is a dark art. And there are certain times when we've had to deal with strange memory issues, like,

debugging some things in Clojure land and JVM land has been challenging at times. The ecosystem is definitely leaps and bounds beyond where it was when we started. And yeah, I'd say that it's probably a bigger, fatter binary than we might've gotten in other places. And because it's an uber jar, because it has everything bundled in, it is just a heavier file than if it was just, here's a

stripped-down codebase, go pull in all your dependencies. With the transpiling to different versions and flavors of SQL and different DBMSs, were there particular hard engineering challenges with creating that? I mean, it was a pain, yeah. So

It's a lot of code. I've lost track of exactly how much it is, but I want to say it's like 50,000 to 70,000 lines of just fairly dense Clojure. There's a ton of just adjacent stuff we use. So it's highly non-trivial. I think it's fairly gnarly, complicated code. It was a difficult task. I think the folks on the team did it really well. We've gone someplace really cool with it. And I do think it is a fairly difficult undertaking that people managed to pull off. And I do think we've gotten a lot of benefit from it.

But it was probably a dumb idea. Looking back on it, I was like, hey, we're going to write a compiler. Probably a more sensible measured person might have been like, yeah, let's try to figure out a way to win without doing that. In some ways, it was taking the hard way down the mountain. What do you think, if you do it again and you go a different direction, what's that direction look like? I still think I would make the big decisions the same way given what I knew.

And I still think that having there be a target-independent intermediate language is the right way to do it. I think, doing it all over again, I'd probably really change the level of granularity and abstractness of the language. And I would have it be even further away from SQL than it actually is. And I do think that one of the things that has been challenging has been

Every once in a while, there's a set of conceptual domain models we have about user land, metrics and models, these things that live in that world that are hard to map to MBQL primitives. And so there is a tension with the primitives MBQL is built off of. I'd liken it as: if SQL is assembly, MBQL is like C, right?

And if I were to do it all over again, rather than creating a C compiler, I would create a Lisp compiler, where there's the ability to have a higher-level DSL closer to what actual user-land concepts are, rather than having to express user-land things down at a C-like degree of abstraction, a language with the abstractness of C on top of an assembly. I'd rather have had more scaffolding and more,

in some ways, more abstract concepts that build up the target language. Do you think, if you wanted to go in that direction and basically build this different level of abstraction, is that something that would be a reasonable project to take on now? Or is it essentially that too much time and too many product dependencies exist around the MBQL system? I think it's one of those things where there's a lot that's working that we don't want to mess up.

And so rewriting the target language, which I want to say is at the center of like on the order of 200,000 lines of code. Like, is that additional benefit worth it? I'm not sure. I think that given where we got to, things worked out. I think we probably could have gotten here faster.

So some of this is not just, you know, are we at a place that is good; it's also that getting here took a while. And I think that we could have speed-run a lot of it by having better abstractions. So I think, you know, for things like metrics and models and some of the higher concepts we now have,

and the way we deal with dimensions, the way we deal with column abstractions, unifying those across different databases when they point to the same thing. So for example, latitude really means the same thing in any database. It's not like a column is latitude; there's just a latitude concept. I think we could have speed-run to where we got in maybe half the time by having a higher level of scaffolding. But I don't know if I would rip it all out now.

Is there some level of caching of the data that's happening within Metabase as well? Yeah, so a couple variants of caching. The simplest one is just like, hey, you run a query, we'll cache it for you. And that has some speedup at some level. Like, I don't know, this is caching, right? Different vendors have different ways of saying, we can speed up your stuff by 2000% by whatever. So we have in-memory caching. We do a fair amount of pre-computation, especially models and metrics, where we will...

essentially pre-compute on some schedule or some push nature. And so those are two different ways of viewing it. And then there's sort of like a more manual version where as you start thinking about cross-database data sets, just having those live in a centralized data warehouse or some sort of centralized place. And depending on how you structure things, you can do that as a cache where you're pulling things from a

like a database of record, you're stuffing them into this other place that's much faster, and then you're using that as kind of a read-only cache. But then it's pulled, or usually pushed, from the centralized source databases. So two-ish layers of caching, and arguably that third level as well.
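The first of those layers can be as simple as a TTL cache keyed by the query. A toy sketch using clojure.core.cache; run-query-for-real is an invented stand-in for the actual warehouse round trip:

```clojure
(require '[clojure.core.cache.wrapped :as cache])

(def query-cache (cache/ttl-cache-factory {} :ttl 60000)) ; 60-second TTL

(defn run-query-for-real [query]
  ;; pretend this is the expensive trip to the warehouse
  {:rows [[42]] :query query})

(defn run-query [query]
  ;; recompute on a miss; reuse the cached result inside the TTL window
  (cache/lookup-or-miss query-cache query run-query-for-real))
```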

Do you run into any challenges with essentially like the data getting out of sync? So what the user's pulling is being pulled from the cache, but the actual underlying data has changed in some significant way? In theory, yes. In practice, not that often. I think that usually manifests when something's busted. So I think data staleness is usually...

the way that this stuff comes up as opposed to the cache itself being a problem. So in general, you know, we cache things for n seconds or n days. But I think that often a lot of analytics still is not fully real-time analytics.

So you don't have a single database that is consistently and always and forever up to date. There's often multiple writers into it that have different schedules. And so it's not uncommon to either have daily numbers for some data sets or to have, for example, every 20 minutes you pull Salesforce and you get some stuff. And so the underlying data set often has...

a distribution of data freshness. And I think that the overall analytics profession has just kind of like learned to absorb this and to try to find ways to like both live with the fact that there's gonna be different data freshness and try to propagate freshness through lineage or through whatever tools you have, as well as try to make the way that you calculate numbers that matter, have it be done in a way that doesn't require you to be able to hit a fully fresh data set that's completely consistent.

So just as an example, you'll often be pulling... I think we pull from 20 different data sources into our data warehouse. We have stuff in Stripe, stuff in our CRM, stuff in different services we run. And those are all happening on different schedules. And they're not all exactly happening on the minute that the data point gets generated. So there is often a little bit of soft inconsistency. But for the most part...

you can kind of get around it, get around the implications of that most of the time. How does the permissioning model work and how fine-grained is that? So permissioning is kind of the bane of my existence. And if you were to ask me, what did we mess up? A lot of those roads go to permissions. I do think it's actually really, really hard to construct a permission system that gives everyone the knobs they need without creating a monster.

So I think that we've had very different perspectives on this over the years. And so maybe just to make this somewhat entertaining, people can like

have a good time off our misery. Once upon a time, we were just really centered around this idea that you give people access to data, and then the actual products of the data figure out whether someone has access to a given report or not. That didn't really go down very well. We rapidly, or not rapidly, but after a lot of kicking and screaming, were pulled over into a world where we have a parallel system of folders, where you have collections, collections have permissions, they have sub-collections,

And so there's a mixture of the ability to lock things down department by department or function by function. But anything you put in collections, you can use that folder metaphor, and people have read/write and admin access to those. We simultaneously have the ability to lock things down by datasets. So for example, you can say, "These three tables have PII,

and these eight groups can't touch them." So you're not able to look up user addresses, for example, if you're an intern. And then on the kind of more paid side, we also have data sandboxing, where you have the ability to lock things down by column or row, where you can basically say interns are allowed to see aggregate metrics based on users, but they're not allowed to look up phone numbers of customers.

There's effectively three different permission systems: data access, collection permissions, and lastly, more bespoke and more complicated conditional ways of either creating hierarchical permissions or column- and row-level controls.
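To make the layering easier to picture, here is a made-up data sketch of how the three systems compose. This is not Metabase's actual internal representation, only an illustration:

```clojure
(def example-permissions
  {;; layer 1: dataset-level access
   :data-access {:interns {:users-table :blocked}}
   ;; layer 2: folder-style collection permissions
   :collections {:finance-reports {:finance-team  :read-write
                                   :everyone-else :no-access}}
   ;; layer 3 (paid): row/column sandboxing for lower-trust groups
   :sandboxes   {:interns {:users-table
                           {:visible-columns [:id :signup-date :plan]
                            :row-filter      [:= :region "US"]}}}})
```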

Can I also control, if I create some sort of view of the data, what level of access someone has to manipulate it? So I could essentially create a view of the data that is maybe a read-only view that I embed in my application? 95% of our usage is read-only. So I think that in general, we do have the ability to do write-back, but that's not a common thing. But yeah, there are definitely lots of ways to create safe little sandboxes for people that have differential trust to play in.

A lot of what we sell is really things that help you in these various scenarios. So I think for most people that are operating a database in a pretty high trust environment where everyone has the same permissions and you're all part of the same team, the open source version is more than good enough. And then at some point, as you have less trust and less homogeneity in your group, the paid features kind of really kick in.

You have around, I think, 40,000 GitHub stars. So tell me about the motivation behind open sourcing Metabase. Yeah, I mean, I'd say we probably have fewer stars than we should given our footprint. I think we've never really played much in the way of the GitHub vanity metric games. I think we're open source first and foremost because...

I think that's the right way to consume software. And I think that if you're running something in your data center and you're touching data warehouses that matter, like I actually think that it being open source is a better format to consume it in. I mean, I think if you want to consume a service, that's great. Those work out really well. But I do think there's something to your data stack being open source first and foremost. I do think there's just a lot of things that

That simplifies. It's easy to do audits. It's easy to be paranoid about security measures. It's easy to fix things that you feel will not be fixed by a vendor at a speed you like. I just think there's a lot... Maybe it's just me talking about my own formative career, but I've often had to run software from vendors that was just not being fixed or was breaking in weird ways. And the ability to go into the source code and muck around was something that I really value. Yeah.

And so just on a personal level, I just think that's how most software should be delivered, at least at this point in time. As the world changes, my opinion there will change. And I think, given that it has a lot of interoperability, so we're targeting 30-ish databases, having people be able to inspect the drivers and be like, actually, the way you're hitting the index here is kind of hokey, you should do it this way instead, is very beneficial. And I do think that

we have gained a ton from being open source, in terms of information, adoption, and usage. And so we still very much appreciate people complaining.

Like, I know it sounds kind of weird, but we get a lot of value from people complaining. It gives us a pretty clear sense of who wants what and how badly they want it. And I think it's an amount of information that in other contexts I would have spent a lot of money to generate. And so having something that is in the public eye is actually very valuable in and of itself. For something like this, where you talked about how you feel like this is

open source is essentially the model through which software should be consumed. So then, from the business side, the value that the business is bringing, where they can charge money, is no longer essentially the lines of code that they've written. They have to find other ways of bringing value. So in companies that are open source first, or really investing in open source, how do you think they need to think about bringing business value? Because at some point, they have to pay bills, essentially. The general...

The frame that I have there is this. You should understand what you're going to charge for very, very early on. I think that it's dangerous to write the project, release it, run it for a year or two and be like, gee whiz, how do I make money off this thing? And so I think that most software ideally has a specific user. It has a specific set of constituents and people that get value from it.

And you should understand who's using it, why they're using it, what they value, what the other cast of characters are. And then, assuming you're going to commercialize it, somehow understand what the lines of commercialization are, and then try to do a really good job of drawing those lines.

So I think we, from very early on, knew we wanted to charge for white labeling, and that if you wanted to embed us in your application, that's great, but we're an application first and foremost. So if you want to white label us, that's going to be a paid thing. We're not building an open source library for building your own analytics applications. We're explicitly building an application that you can embed.

I think that created a lot of clarity. It made it easy to just understand how the roadmap should look. It hopefully made us predictable to our users. So I don't think we've ever pulled any rugs out from anyone, where we, like, took back features or did anything too capricious. And so I think that if you're planning, as an entrepreneur, as a founder, or as

a company trying to release software through open source, to understand what people will eventually pay for. And I think the clearer that vision is and the more justifiable it is, the more likely you are to get the lines right. And I think there's a lot of projects that have tried to commercialize and it's kind of bombed. Like, you know, for a long time, there were no open source companies. Then there was a flurry. And then a lot of them had kind of a come-to-Jesus moment. And I think that,

one of the things that has separated the people that have won has just been some sense of, like, okay, this is why someone pays. And I think it's important to separate out the winning products. I think without a winning product, you're not really playing the open source game. You're just kind of having some weird, half-assed marketing side adventure.

So understanding what you're giving away and why, and why people want it, and making sure that it actually can replace the alternatives and it's not just a crippled version. And secondarily, like, cool, if you win that, what exactly is it you're selling? And for us, I think a lot of that just boils down to understanding the installer and then their boss. And we try to make the things that installers value free, the things their bosses will demand after it's successful, paid.

And that was kind of the general heuristic we ran with. You know, it's worked in some ways and not in others, but I think having something like that from the very, very early days, that you believe in and that you're able to validate somehow, some way, even before you start charging money, is really important.

Yeah, so I mean, I think that what you can charge for, and how people evaluate the value that you're delivering, has changed over time. There was a time where you could write sort of shrink-wrap software and you were explicitly charging for that software. Obviously it was bringing value, but you were in a lot of ways charging for essentially the lines of code that you had written. And I think now, especially with managed services and other different ways of essentially monetizing and commercializing businesses,

it's changed the model, where you can essentially give away the source code and the value is not in the code itself. It's something else, whether that's making it really easy to run, or certain enterprise features that are maybe not available in the open source model, or whatever it is. Do you think now, where more and more code is essentially being written with at least the assistance of AI,

that, in some ways, even lowers the value of the lines of code even more, and makes it make sense to figure out other ways of delivering value to your customer? I mean, I think this depends on what the implicit rate of improvement for AI is. So I think there's a version where no humans have any value, therefore don't bother. I'm not quite that extremist, but I think there's another version where it's like...

You know, it's mostly just going to be like where it is today, with slightly better ergonomics. And somewhere between those two poles is the path that we'll be on. The reason I bring that up is there's some parts of that spectrum

where the ability to turn arbitrary incantations in something resembling natural language into something that works remains very, very valuable. And that, you know, LLMs and co-pilots and all that are really just a higher level language.

but you're still fundamentally working in a higher level language. In some ways, the LLM is really just a compiler, or interpreter, for your super-leveraged DSL. And there's still someone that has to make the incantation. And the people that can make that incantation will have valuable skills. And the people that are able to pull that together to solve actual problems are still valuable. I do think that as...

the level of skill required to build a certain system decreases or changes, it starts to shift value to the people that are able to understand what to build, and the relative value of someone that knows, actually, I need to build this specific Lego to make money, is even more important.

So I still think that, for most of that spectrum of how far AI goes, there still needs to be someone holding a wand, you know, and speaking the incantation.

I just think that the nature of that language will change. And how much of the value is in the prompting versus the actual post-processing or pre-processing? How much of it is in the actual model training? How much is in fine-tuning? There's going to be a lot of stuff that is still high value that has to get done by somebody. And unless you assume that LLMs and the agentic systems you build around them get so advanced so fast that all that gets done by them...

then humans will still be doing all this. And whether we call them a software engineer or a prompt engineer or a product creation specialist or a magician, it doesn't really matter. There's still going to be some number of people. I do think that it will probably change the leverage. So you will not need a thousand software engineers to build something. You might only need 10 prompt engineers to build something of equal scale.

Hopefully, this means we do bigger and crazier stuff and that we have better toys in the future and that we're able to tackle bigger projects. But I still think that for quite a portion of that spectrum, there will be someone that... And companies will still need...

to figure out what those Legos are, identify them, build their best version of that Lego, and then somehow find a way to get in front of people and have people want to buy from them. I think this kind of is a nice way to tie things back to what we were talking about even at the beginning, where you used the analogy about blogging, where if we can reduce essentially the

configuration setup steps to help people who want to write and put their stuff out there, then you're going to get a lot more creative work that's going on. And I think it's similar where if you can essentially lower the barrier to entry, being able to create a lot of code and eventually products, then I don't think it's that

there are fewer people doing that stuff; there are actually more people doing that stuff, because now, it's not that anybody could do it, but someone who has some level of skill can essentially create some kind of product experience, or at least we'll get there at some stage. If I can give maybe a concrete example, which might crystallize this: I think, again, barring some weird singularity, we're probably going to still want iPhone workout apps.

And someone's going to have to build the best workout app. And the question is whether the primary skill behind the person building that app will be having at least this level of proficiency with iOS development and Objective-C and blah, blah, blah, or having the best idea for a workout app. I think that what's going to happen is that

the value has shifted from having those mechanical skills, which were critical when the iPhone launched, when the best workout app of the first generation was just whoever was able to write a bug-free app, to who has the best ideas around how to structure the thing. But there's still a market for it. You still got to build it. You still got to build a better one than the next person. And there's still going to be people that build that app.

Again, they just might have a different title and might be working in a different editor. Well, Sameer, thanks so much for being here. I really enjoyed the conversation. We ended up going deep at the end, which I like. I think there's a lot to digest, especially when we're talking about products that are really focused on reducing the barrier to entry, or the friction involved with accessing, analyzing, and driving value from data. Likewise. I had a great time. Thank you for having me on here. All right. Thanks and cheers.

Thank you.