Hello hello everyone, welcome to another special episode (some podcasts just call them... episodes, I guess, but here you get AI news every ThursdAI, and on Sunday you get the deeper dives).
BTW, I'm writing these words looking at a 300-inch monitor that's hovering above my usual workstation in the Apple Vision Pro. And while this is an AI newsletter and I've yet to find a connecting link (there are like 3 AI apps in there right now, one fairly boring chatbot, and Siri... don't get me started on Siri), I'll definitely be covering my experience in the next ThursdAI, because, well, I love everything new and technological. AI is a huge part of it, but not the ONLY part!
📖 It's all about the (big) Datasets
Ok, back to the matter at hand. If you've used, finetuned, trained or even just heard about an AI model, you may not realize how important the dataset it was trained on is. We often talk about this model or that model, when often the only difference between them is additional data that folks (who I sometimes refer to as alchemists) have collected, curated and structured, and creating/curating/editing those datasets is both an art and a science.
For example, three friends of the pod, namely LDJ with Capybara, Austin with OpenChat and Teknium with Hermes, have been consistently taking off-the-shelf open source models and making them smarter, more instruction tuned, and better for specific purposes. These datasets are paired with different techniques as well. For example, the so-called DPO (Direct Preference Optimization) is a technique that has lately shown promise: instead of only showing a model the correct answer for a specific query, it also shows an incorrect answer, and trains the model to prefer one over the other (see the recent Capybara DPO improvement by Argilla, which improved model metrics across every evaluation).
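For the curious, here's a minimal sketch of the core DPO loss, just to make the "prefer one answer over the other" idea concrete. This is not Argilla's (or anyone's) actual training code; the beta value and the log-probabilities are made up for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy to prefer the 'chosen' answer over the 'rejected'
    one, measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary preference loss on the reward margin between the two answers
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a single preference pair
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-12.5]), torch.tensor([-14.0]))
print(loss)  # prints the preference loss for this pair
```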
These datasets can range from a super high quality 16K rows to millions of rows (Teknium's recently released Hermes, one of the higher quality datasets out there, comes in at just a tad over 1 million rows), and oftentimes a dataset is an amalgamation of several other datasets combined into one.
In the case of Hermes, Teknium compiled those 1 million chats from at least 15 different datasets, some his own, some by folks like Jon Durbin, Garage bAInd, and ShareGPT from LMsys.org, which was compiled by scraping the very popular sharegpt.com website, where folks who used the ShareGPT extension shared their GPT-4 conversations. It's quite remarkable how many of these datasets are just conversations that users had with GPT-4!
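If you're wondering what "amalgamating" datasets looks like in practice, here's a rough sketch using the Hugging Face datasets library. The dataset names and the shared column are placeholders, not the actual Hermes recipe.

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder dataset names -- NOT the actual Hermes sources
sources = [
    "someone/example-chat-dataset-1",
    "someone/example-chat-dataset-2",
]

parts = []
for name in sources:
    ds = load_dataset(name, split="train")
    # Keep only the column every source shares so the pieces can be merged
    ds = ds.select_columns(["conversations"])
    parts.append(ds)

merged = concatenate_datasets(parts).shuffle(seed=42)
print(f"{len(merged):,} rows after merging")
```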
Lilac brings Garden
With that backdrop of information, today on the pod we've got the co-founders of Lilac, Nikhil Thorat and Daniel Smilkov, who came on to chat about the new thing they just released called Lilac Garden.
Lilac is an open source tool (you can find it RIGHT HERE) which is built to make dataset creation, curation and classification more science than art, and to help visualize the data, cluster it and make it easily explorable. In the case of Hermes, that means more than a million rows of data.
On the pod, I talk with Nikhil and Daniel about the origins of Lilac in the work they both did at Google on TensorFlow.js and then on something called "Know Your Data", and how they eventually realized that in this era of LLMs, open sourcing a tool that can understand huge datasets, run LLM-based classifiers on top of them, or even train specific ones, is important and needed!
To strengthen the point, two friends of the pod (Teknium was in the crowd sending us 👍), LDJ and Austin (aka Alignment Lab), were on stage with us and basically said that "it was pretty much the dark ages before Lilac", since something like the OpenOrca dataset is a whopping 4M rows of text.
Visualizations in the Garden
So what does Lilac actually look like? Here's a quick visualization of the top categories of text from OpenOrca's 4 million rows, grouped by category title and showing each cluster. You can see here that translation requests make up 66% (around 200K rows) of the Translation category, and you can scroll on and on, add filters, and really dissect this whole thing up and down.
The categorization is created by running Lilac on your dataset, which uses embedding algorithms and other neat tricks to quickly chunk and put labels on the categories (AKA classifying them).
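To give a rough sense of what "embed, then cluster, then label" means, here's a short illustration of the general idea. This is not Lilac's actual pipeline; the embedding model and cluster count are arbitrary choices for the example.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "Translate the following sentence to French: ...",
    "Write a short story about a robot learning to paint.",
    "Summarize this article in two sentences.",
    "Translate this paragraph into German: ...",
    # ...in practice, millions of rows
]

# 1. Embed every row into a vector
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# 2. Group similar rows together
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)

# 3. A human-readable label (e.g. "Translation") would then be assigned
#    per cluster, for example by asking an LLM to title a sample from each group
for text, cluster in zip(texts, kmeans.labels_):
    print(cluster, text[:45])
```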
Btw, you can see this view and play around with it yourself here.
But running this on your own local machine can be a drag, taking hours if not days for bigger datasets, and sometimes hanging and not finishing at all. So the Lilac folks created Lilac Garden, a hosted solution where you provide a dataset and they classify something like 4M rows in 4-5 hours or so.
That kind of speed is definitely not possible on local machines. If you're into that kind of thing: again, Lilac is open source, so you don't have to sign up or pay them, but if speed and this view matter to you, definitely check Lilac Garden out!
RWKV with Eugene (Pico Creator)
On the news segment of ThursdAI we mentioned Eagle, the 5th version of RWKV, an attention-free, potential alternative to Transformers that's being developed fully in the open. Later in the show we had the honor of having PicoCreator, one of the front-running folks in the RWKV effort, which is an attempt to see if Transformers can be beaten with a different type of architecture (an RNN) that doesn't require the attention mechanism, which carries the problem of quadratic attention scaling, making LLMs harder and more expensive to run the more context you provide.
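To see why quadratic scaling hurts, here's a toy comparison: full self-attention materializes a score matrix whose size grows with the square of the context length, while an RNN-style model only carries a fixed-size state per step. The recurrence below is a made-up placeholder, not the actual RWKV update.

```python
import torch

seq_len, dim = 4096, 64
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

# Full self-attention: a (seq_len x seq_len) score matrix,
# so compute and memory grow quadratically with context length.
scores = (q @ k.T) / dim**0.5            # shape: (4096, 4096)
attn_out = torch.softmax(scores, dim=-1) @ v

# RNN-style: a fixed-size state, so each new token costs the same
# amount of work no matter how long the context already is.
state = torch.zeros(dim)
for t in range(seq_len):
    state = 0.9 * state + 0.1 * v[t]     # toy recurrence, NOT the real RWKV formula
```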
Eugene had some technical issues and joined in the middle of the pod, so we didn't get a full deep dive; however, I figured it's important to bring this info to you guys, as these efforts may yield AI that runs 10-100x cheaper, and potentially faster on devices, with almost infinite context lengths.
RWKV and other efforts like StripedHyena (Together AI) and Mamba (from Tri Dao) are worth watching, as they may supersede or join with Transformers to create the next jump in LLM capabilities.
That's all for this Sunday. Needless to say, with the Vision Pro releasing on Friday, it's been a full weekend of future exploration, which is the main driver in my personal life!
P.S. - if you read through to here, you get a gift! A teaser: I've done something different on the pod and recorded a human interest x AI episode for the first time. I mostly bring the news, and sometimes deep dives like this one, but this story I couldn't ignore, so stay tuned if you're into dating x AI, how technology disrupts our lives, and whether this is all moral or not. I recorded an episode with Sasha Jadan and his new fiancée Karina, whom his AI bot picked out for him after swiping and matching with over 5200 girls on Tinder. The AI also... suggested he propose, which he did. It was a very interesting conversation that I plan to upload soon!
That's it from me this week, see you all on ThursdAI, and don't forget: if you liked this, do me a solid, listen to the pod and then leave a review or a 5-star rating (at least a 4?) on Apple Podcasts 🙏