AI agents can automate machine learning tasks, participate in Kaggle competitions, and assist with exploratory data analysis. They are particularly useful in machine learning but may struggle with broader data science tasks due to the need for creativity and communication skills.
Data science involves both technical and creative aspects, including problem-solving and communication. AI agents are better suited for narrow, repetitive tasks rather than broad, open-ended problems that require human creativity and reasoning.
The 'and' data scientist is a highly skilled unicorn who can perform multiple roles, such as statistics, programming, and analysis. The 'or' data scientist specializes in one area, such as regression or scatterplot creation. The 'and' data scientist is rare and often found in startups, while the 'or' data scientist is more common in mature data teams.
The full-stack data scientist is returning due to the need for holistic problem-solving in smaller teams and startups. AI tools can help fill gaps in areas like data engineering and analytics, making it feasible for one person to handle multiple roles.
A great data scientist will need product sense, communication skills, and project management. These skills ensure that data solutions align with business needs, are effectively communicated, and are delivered with stakeholder involvement.
Fortran is seeing a resurgence due to its performance in high-performance computing, such as weather forecasting and deep learning model optimization. Some users are rewriting critical parts of code in Fortran to improve performance over Python-based frameworks like PyTorch.
R is declining in popularity compared to Python, especially among individual learners, because there are fewer R jobs. However, it is expected to stay in industry use for at least the next decade, particularly for data analysis tasks where tools like the tidyverse and ggplot2 are still preferred.
Python is expected to remain dominant due to its widespread adoption and versatility. It has become the de facto standard for data science and programming, much like the QWERTY keyboard, ensuring its continued dominance.
SQL is expected to evolve with improvements in syntax and usability, influenced by tools like DuckDB. Despite being 50 years old, SQL continues to be a cornerstone of data work, and its language is still improving to make it easier to learn and use.
Adel is excited for a correction in the AI space, where hype subsides, and more grounded, valuable applications of AI emerge. This could lead to a more mature industry with less overpromising and more practical AI solutions.
All right, Richie, how are you? Life is good. Great to be recording with you again, Adel. Great indeed. So this is our second industry roundup episode. And what are we going to be talking about today, Richie?
Well, last time we spent a lot of time chatting about AI. Since the title of the show is DataFramed, today I figured we'd talk about data. We should probably try and not talk too much about AI, but I'm sure it'll creep in somewhere. There should be a drinking game: every time we accidentally talk about AI, you have to take a drink. I think we'll be drunk in the first 10 minutes if that happens.
Now, we're going to talk about three stories today. One, no surprise, we're going to mention AI a bit, just a bit. We're going to talk about coding agents for data science specifically and what they could mean for the future of the data team. Two, we're going to talk about the return of the full stack data scientist. Could that be happening? And three, are old programming languages making a comeback? We'll see. I don't know. So maybe the first story. This is something I've been looking at, and I'm particularly excited about it, but also, you know, intrigued by it.
This year, we saw the release of Devin, the coding agent, and Replit's coding agent was released. Google has been experimenting with coding agents, right? Even Langdaw AI has a data science agent specifically for Snowflake, for example. So we're seeing more coding agents, and more research is now focusing on building data science focused coding agents as well. I think there was a Chinese paper that you sent me a week ago, Richie,
on how these Chinese researchers were able to build a data science agent to automatically fit machine learning models. Others, like fractal.ai, are working on Kaggle competitions. Maybe based off of what you've seen, Richie, what are the particular use cases of coding agents in data science specifically? What are you excited about? I think AI agents coming soon has been a persistent theme over the last few months. It really has. Like a lot of DataFramed guests have been talking about it.
And a lot of the use cases they give are like, oh, there'll be something that can help you book your flights and automatically add a calendar invite. That's not very exciting to me. Maybe if you're an executive who flies every week, brilliant. I fly like two or three times a year. Not that useful for me. But data science agents,
I am looking forward to these. And because there are so many different data roles, I think agents can be more useful in some places than others. So machine learning is perhaps the most amenable area here. If you think about it, DataRobot has been pushing automatic machine learning for nearly a decade now. So we know that automatic machine learning is a really good idea. It's not entirely clear how much value the large language models add to this process.
There's been a machine learning agent from Lang.ai, which basically is designed to participate in Kaggle competitions automatically. I suppose great if it wins, you know, you could win some money without doing much effort. But the idea of having some sort of bot doing machine learning in general, I'm in favor of it. I would say data science seems harder, though. So data science involves a lot of communication skills as well as the technical skills. It's very broad.
I don't think broad data science bots are going to be a thing in the near future, but we might see some for narrow tasks. Okay, that last point is the one I wanted to focus on. Actually, we recorded an episode with Benn Stancil this time last year where he talks about how, okay, coding agents or generative AI will be able to do a lot of work on SQL code or boilerplate code. But what he was excited about actually is the use of LLMs to streamline, and potentially automate, the creative parts of data science, like thinking through the problem.
Have you seen agents, whether, you know, coding agents focused on data science or traditional, or say traditional, more like software development coding agents, being able to also work well on the creative side of this type of work from problem solving? How do you scope a solution? What are the different avenues you need to look at to solve a particular problem? So on and so forth.
Absolutely. So there are a lot of attempts at this coming soon or just being released. So Google has this experimental notebook based AI agent that's specifically for data science. And so the idea is that it can do exploratory data analysis and it can do some interpretation of your results. I don't know how well it performs yet because reasoning about data and coming up with good conclusions is like, we know LLMs can do a bit of this.
Can it completely replace a human? Not yet, really, unless you spend a lot of time working on specific problems. I've also seen there's an AI agent built into Snowflake. So this is interesting because rather than trying to do all of data science, it's quite limited in scope, which hopefully means it performs well on what it does.
And the big problem with doing data science on a corporate data set is, oh, you know, you've got this giant corporate database, hundreds or thousands of tables, and the agent doesn't know which bits of data to use to solve the problem. So the idea with this is
you create a Snowflake view, giving it just the data it needs, and then it can answer specific problems like giving you weekly updates on your time series. So limited stuff that works well there. Okay, so still limited, and the creative parts of data science are not yet ready for primetime. Maybe, for those who are building these agents, what are the nuances that you take into account when building a data science specific coding agent versus a general coding agent?
Yeah. So I think it's much the same as any other software project: you need to be really clear on what the scope is and what you're trying to solve, because the broader you go, the harder it's going to be. You'd just have to write an infinite number of tests to see whether it works or not. So I think the most important thing is to start with stuff where
it doesn't require creativity and you can automate things. So, streamline stuff that's going to give you that instant productivity boost. And then you work on, well, can we do reasoning for a limited number of use cases, and then expand and iterate. Do companies even need data science agents? How do you imagine the implementation of a data science agent in data teams today as well? If we think about data analytics, maybe rather than data science,
there are really like a finite number of common business problems that people want to solve. And so that's much more amenable to creating an agent for. So particularly if you have something where you do weekly or monthly reporting, then creating an agent to answer questions about the report, there's only going to be so many variations. So it's probably a lot easier to create something that works really well for that. Yeah. And we've been talking about this theme of AI-assisted coding for a while now on DataFramed and in our webinars, our code-alongs, so on and so forth.
We've talked quite a bit about how AI-assisted coding will change the data role. But if you have an agentic workflow or an agentic tool that takes open-ended tasks and starts tackling them, how do you see those changing the role as well? Yeah, that's really interesting. I think that probably leads
nicely into our next story about how the data science role is changing. So this is something where we had Cassie Kozyrkov on the show last year and she talked about how there are two kinds of data scientists now. So maybe we just play the clip from last year.
The concepts of the 'and' data scientist versus the 'or' data scientist. So the 'and' data scientist would be somebody who is a statistician and they're an analyst and they're an AI engineer and they're a machine learning specialist and, and, and. Somebody who thinks that data science is the everything of data and that you should be expected to be maximally hardcore. You should be paid a lot because you are this amazing unicorn who can do everything.
And the 'and' data scientist is incredibly offended by the 'or' data scientist, somebody who dares to wear the exact same job title as them, a job title that's supposed to come with money. The 'or' data scientist, as in, they just do one of these things.
Maybe not even that well. They touched a regression at some point. They made a scatterplot. Now they call themselves data scientists. You can imagine the horror from somebody's point of view. I spent 20 years becoming this good. How dare you? How actually dare you do this? Okay, I feel for both sides. It is horrible to have had the impression that in order to practice in the first place, you have to have a level of quality that's insane.
And you put yourself through that very difficult schooling and you show up and you're like, right, this is why I should be paid these dollars, I'm rare and highly qualified. And then you see other people jumping onto the scene to pretend to be you, to take your salary and dilute the market. How horrifying. I get that. On the other hand, the 'and' data scientist is a threat to getting things done because there are just so few of them.
If everybody has to be at that truly top hardcore level, nothing's going to get done. And it makes much more business sense to specialize. And even if you are an 'and' data scientist, you will not be doing all of them every day or every week or every month. Even projects come in phases, and during some phases it's mostly statistics, and during other phases a whole lot of programming, and so on.
And does that really all need to be housed in one person if it's not used all the time? Maybe it does make sense to specialize. If you have this expanding universe of data science professional roles, data science adjacent roles, then maybe it's okay if people aren't at the everything of data. And maybe it's okay if they only do some part of it. That means that you can have more people in it and you can get more done with data.
Okay. Cassie Kozyrkov, dropping the wisdom. Yeah. Cassie's point is that there were sort of two different generations of data scientists. So the first generation, people were like, well, I need to hire a statistician. I need to hire a programmer. I need to hire a data analyst. And I can't afford to hire three people. So let's hire one person. And that was how the data scientist role was born. So it was really quite a technical, high profile job. You know, you need a lot of different skills.
And then people realized there aren't many people who can do all these things together. And so data science seemed to sort of change to have a lot more junior people. So you have multiple people with different skills. And so there are these two breeds. There's the 'I can do everything' data scientist, and there's the 'I can just do one specific job' data scientist. And so I wonder whether that's going to continue. So basically, to begin with,
do you think we're going to have more of these unicorn data scientists in the future, now that AI is changing things or now that economic conditions are changing things? Or are we going to have more specialized data scientists? It's hard to answer that question. On the one hand, you can make the case for either scenario happening. And I think it probably depends on the maturity of a data team. I think sometimes
startups, or smaller data teams, will most likely go for a unicorn data scientist moving forward, because they need someone who can plug into the data pipeline, work on analytics engineering, use AI to fill gaps in skills they may not have, and approach problems from a holistic perspective.
But I do think as data teams mature, even if you have full stack data scientists on your team, you will still need data engineering expertise, for example. You will still need functional analysts to join your business teams.
Maybe AI will be able to, you know, reduce the need for a functional analyst, but I don't see that happening in the foreseeable future. So I do see that there will be a place for both in the future. Maybe not the best answer, but I do think that there will be a place for both full stack data scientists and more specialized data teams. But I wouldn't be surprised to see more and more full stack data scientists, given
how AI allows you to fill the gaps that required multiple people on a team before. Absolutely. It's very much an 'it depends' answer. I suppose it depends on who you've got in the rest of your team, who you've got in other teams, and trying to make things play out as a whole. Now, you mentioned the term full stack data scientist. We did an episode earlier this year with Savin Goyal from Outerbounds. He was talking about full stack data science being, well, a regular data scientist plus that kind of engineering knowledge to put models into production.
I think a lot of people define it differently. So do you have a sense of what a full stack data scientist would look like to you? Like what skill set would that involve? I think a full stack data scientist is someone who's able to approach a variety of problems that literally span the full stack of data within your organization, and is able to tackle them.
I'll give you an example of a good full-stack data scientist. Back when DataCamp was in its early days, Ramnath Vaidyanathan, who's no longer at DataCamp and has moved on to another organization, was at DataCamp for about five or six years. He was a full-on full-stack data scientist. He would work on data pipelines. He would build machine learning models. He would deploy them. He would run analysis.
Any data problem, you can put him on it. And now we have a much more mature analytics team with different functional experts for our marketing team, business development team, so on and so forth. So Ramnath is that example of a full stack data scientist because he's able to be
effective on 90% of possible data problems you may encounter on your data team. And this is what I mean. Ramnath was extremely valuable during his first five years of DataCamp because it was a really small organization. And this comes back to that startup point. So for me, a full-stack data scientist in a nutshell is someone who is able to tackle a large majority of problems that a data function may encounter.
Okay, I agree Ramnath was brilliant. It also highlights the problem that Ramnath is not very reproducible. So maybe the flow here is, if you're making your first data hire, you want someone senior with a broad variety of skills, and as your team grows, you gradually get people who are more specialized. Completely agree.
So I think the key thing here is that in addition to data skills, you maybe need some other skill just to make you stand out, especially as data literacy becomes more popular. Like everyone's a data worker now, at least in white collar jobs. So you need data plus something else. What should that something else be? This is something that we talked about at our DataCamp Radar conference a few weeks back.
The importance of product sense, communication skills, and project management cannot be overstated, especially as you mature more as a data function. So what is product sense? Probably one of the biggest traps data science has fallen into is resume-driven development. You want to build the coolest, sexiest, shiniest toy that uses the most advanced algorithms
without necessarily having in mind what you're trying to do. So product sense essentially is: how do I use my data skillset to affect a business problem? Not: how do I make that business problem fit my data skillset
because I want to experiment with certain tools and algorithms and products. Then you end up having stakeholders that are unhappy with your solution. So it's always working backwards from the problem. That's product sense. Communication skills: if you're a data analyst or a data scientist and you're presenting to a C-level executive on your ROC curves, you're probably making a mistake.
You also need to be able to communicate clearly and adopt the language that your audience speaks. So that's where communication skills come in, and I think data storytelling is a big part of that. And thirdly, project management, right? Do you keep your stakeholders involved? Do you get reviews often? Do you make sure there are milestones within the delivery that you're working on? These are, I think, the key business skills that will distinguish a great data scientist from a good data scientist in the future.
I like this. There's quite a wide variety of, well, I mean, they're not quite soft skills, things like project management, but they're not sort of hard technical skills in the same way that coding is.
And I use hard and soft in the traditional sense of how technical it is. I always feel like soft skills is one of those joke phrases, because the soft skills are the ones that are hardest to learn. Anyway, yeah, so you've got communication skills, project skills, and then I suppose the other stuff that we talked about where it really is a sort of technical add-on, like having enough engineering skills to put code into production. So you've got a real choice there for your secondary skill to go alongside data. I couldn't agree more.
And in a lot of ways, you're painting here kind of two visions of a full stack data scientist, right? Like you have a full stack data scientist who's really good technically. They're able to build models, deploy them into production, so on and so forth. But then you also have a full stack data scientist who's able to build models, do analysis, but also manage a project, communicate with stakeholders. So there are also different visions of what a full stack data scientist can be in this context. Does that make sense?
It does seem that the future is fairly bright and that you do have a bit of choose your own career. And there's quite a lovely way to do the things that you're interested in. Like if you don't care about product management, then don't do that. If you do care about product management, then, you know, go for it. And there's going to be a role there for you. So maybe to sum up, the full-stack data scientist is making a comeback, but not in the way that we thought it would. At least that's how I posit it. And speaking of comebacks, I think this also makes a great segue to our third story.
This was not pre-planned at all, which is old languages make a comeback and Python keeps on getting better. So just to get some context here, in the latest TIOBE index for programming language popularity, Python remains number one. SQL is in the top 10. So that was also pretty interesting. However, there are two shocks.
One, Fortran is the eighth most popular programming language on the planet today, and it's been on the rise since the end of 2022. And similarly, MATLAB is also on the rise again. I always thought MATLAB only existed in grad school labs and that's it, and that it's never used anywhere else.
So I'm surprised that it's ranking as the number 13 most used programming language on the planet. So yeah, speaking of comebacks, are programming languages that were previously thought to be out of favor making a comeback?
Yeah, so those results, Fortran becoming more popular and MATLAB becoming more popular, don't make an awful lot of sense to me. I had to ask the internet to try and find some answers to this one. So I asked on the Fortran subreddit, why is Fortran making a comeback? And there are a couple of answers. Some suggest it's a real thing, some suggest it's not. So...
The top answer is, well, it's because PyTorch is too slow. So if you are doing any kind of foundation model building, any kind of deep learning, PyTorch is the standard tool, but it's also written in Python. I think there's a bit of C underneath there, but it also, it's designed to be a general framework. It's not optimized for performance. So if you are building a giant model, you get some benefits in just rewriting some of the important bits of code in Fortran. But
I can't believe that's like an awful lot of people. I feel like that's probably thousands of people, maybe tens of thousands at most doing that. So I'm not totally sure whether that's a real trend or not. And what about MATLAB? Why is MATLAB making a comeback? MATLAB is even less clear.
And I think it might be the same reason. I mean, I do love MATLAB, I have to say. I used to program MATLAB earlier in my career. It's got this nice mix: it's got a great IDE, but you can also write code as well. So you've got that good sort of halfway point between I'm writing code or I'm pointing and clicking. But I suspect the reason might be just a quirk of the TIOBE index. It's very heavily based on search results. So if you've got great documentation, great examples of how to use the code, then
you appear to be more popular than you actually are. If a lot of students are using your tool in grad school and they're searching for homework answers, does that go into the tally? That sort of thing, yeah. So I think it might just be because MATLAB has very good documentation rather than there being a lot of people using it. I've not seen a sharp uptick in MATLAB jobs or people talking about MATLAB in the wild. And it's the same with Fortran. Apparently, they've had a big effort to try and make their documentation clearer. So that might partially explain why it's appearing to be more popular.
Interesting. And, you know, you mentioned kind of that deep learning angle for the rise of Fortran. What are other potential reasons why Fortran may be
also going up in popularity. The other stuff Fortran tends to be used for is high performance computing. So a lot of scientific simulations; it's really big in weather forecasting. They're the only people I know who are super keen on Fortran. Obviously the language is like 70 years old at this point. So yeah, anything to do with engineering, simulation, things like that. For data science, I'm not sure that many people are using Fortran directly. I think if you want to write fast code now, it's all Rust under the hood.
And what about other languages? Richie, I'm going to ask you a sensitive question. Maybe how's R doing these days? This makes me very sad. So, yeah, I mean, my background is in the R community. It is still my favorite programming language by far. It is sliding down the rankings, which is a shame. It's just Python's eating everything. And I think the reason I really love R is that for data analysis tasks,
it's just nicer to write than Python. It just works. Yeah, yeah. Like the tidyverse stack is just nicer than Pandas. ggplot2 is just nicer than any of the plotting libraries Python has. And for anyone who's like, well, Python has Plotly, you can also do Plotly from R. So yeah, it's great stuff, but it's falling out of favor. And this is something we've seen on DataCamp as well, but there's a big difference between B2C and B2B audiences. So if you're an individual learner,
Nobody's using R now. No one wants to learn R because there aren't as many jobs. But on a business level, programming languages have a much longer lifespan. So it's going to be around in industry use for the next decade, at least I would think. But for individual learners, they tend to prefer Python. Yeah, I also started off as an R learner. That's how I broke into the data science space. I do enjoy working in Python much more. I do feel like it's more intuitive to me, Python syntax than R syntax at this point in time.
But it's interesting to see that the R versus Python debate has subsided, let's just say, in the past year, and there's a clear winner. Maybe a third programming language that was always thought to be a dark horse in the data space, and that I don't see talked about a lot today, is Julia. How's Julia doing as well, Richie? It's never quite managed to catch hold of
the mindshare as it should have done. Like there was a lot of hype for it a decade ago when it was first launched. It's never managed to get those sort of big industry sponsors that Python's had. It's not quite managed to get the sort of academic base that R had. So,
it's still ticking away. I think the language is getting better, but it's not quite taken over as a thing that everyone must learn. And I feel like Rust is taking that spot, at least in the ether, based on what I've seen data scientists talk about and kind of people who are
trying to be at the leading edge of the conversation. Do you see Rust taking over from Julia here as that third language in the data space? Well, so Rust is one of the rising stars of the programming language space. It's just gaining an awful lot of popularity, but it's not really trying to replace Python or SQL. It's a much lower level language. So what it's doing is it's competing with C++ and Fortran for lower level code. So one of the big exciting things happening with Rust in the data space is...
Polars, which is a Pandas replacement. So you write your data manipulation code, but it runs much faster because it's built on top of Rust rather than whatever Pandas is built on, which is layers and layers of Python, I think. And we do have a really detailed comparison article that compares Pandas and Polars. So I highly recommend that you check it out. We're going to leave it in the show notes.
And maybe a couple of final questions here as well, Richie. What are your expectations for the programming language space in the next year? Who's still going to be on top? Will there be any surprise dark horse that will come out? What are your predictions?
Oh, man. I don't think I want to bet against Python. I think it's just become too dominant at this point. It is the QWERTY keyboard that you will never get rid of. That sounds too negative. I do actually like Python. So, yeah, Python's going to stay strong. The big thing I'm most excited about is the SQL language improving. So, I mean, SQL's been around for 50 years, but the language is still evolving slightly. And I think,
particularly with DuckDB, which has taken a lot of influence from the R language, there are a lot of ways of making SQL syntax easier for people to learn and easier for people to write. So I'm excited to see changes in the SQL language itself. Yeah, and it's incredible that SQL is still going so strong. How many years has it been at this point?
Yeah, it was the 50th anniversary earlier this year. Yeah, we had Don Chamberlin on the podcast, who was the inventor of SQL. I highly recommend that everyone listens to that episode. And maybe as we wrap up, Richie, you know, I asked you what your predictions are for programming languages, but this is going to be our last Industry Roundup episode of the year. And we are recording during Thanksgiving. So what are you grateful for and what are you looking forward to?
Oh, what am I grateful for? I mean, you know, I have a very privileged life. I have good health. I have good family life. I have a good job. So lots to be grateful for there. Stuff I'm looking forward to. I'm hoping like this AI stuff, this data takes off a bit so I can automate a lot of my job and, you know, just go to the beach. Trust me, you won't be going to the beach. We'll just find...
New stuff to do. I'm sure you'll find more things for me to do. How about yourself? What are you grateful for and what are you looking forward to? I'm also grateful for, one, having amazing colleagues. I'm not just saying this because you're here, but having amazing colleagues. Yeah, also have a very privileged life. Couldn't be more grateful for my family, my friends, my partner, my colleagues. Great job. So grateful for everything and everyone in my life. And then what am I looking forward to in 2025?
So I'm kind of excited for a bit of a correction to happen in the AI space. I'm very excited about the potential for AI, but I also see a risk of the, maybe I'll say it this way, the Bitcoinification of AI, where you see a lot of hype around the AI space that tends to create either false expectations or negative emotion around AI.
I'm excited for that to subside slightly as we mature more with the technology. And, you know, it does seem like we're going to reach some form of plateau, at least in the intelligence of the models. The product experience will get better. The agentic capabilities will get better. And then, yeah, what that means as well is that, you know, maybe certain startups will take a hit, but at least we'll have a much more grounded industry as well. So that's what I'm looking forward to in 2025. Seeing less AI hype and seeing more AI value, essentially.
Less hype, more value. But yeah, talking about AI capabilities plateauing, that's a very controversial take there. Lots of arguments around this. Is AI, particularly generative AI, going to scale much further? Yeah, that sounds like a good follow-up episode. Something to cover in the next Industry Roundup. Something to cover in the Industry Roundup. And with that, I think we'll end it here for today. We will be taking a break on DataFramed, I think, for the last two weeks of December, and we'll come back with new episodes
for the start of the year. Very excited to kick off the new year with you, Richie. All right. Likewise.