Sam Lehman: What the Reinforcement Learning Renaissance Means for Decentralized AI

The Delphi Podcast

Conversations with Crypto Gigabrains. Hosted by Tommy, Co-Founder and Founding Partner at Delphi Ventures

April 30, 2025 • The Delphi Podcast

Join Tommy Shaughnessy from Delphi Ventures as he hosts Sam Lehman, Principal at Symbolic Capital and AI researcher, for a deep dive into the Reinforcement Learning (RL) renaissance and its implications for decentralized AI. Sam recently authored a widely discussed post, "The World's RL Gym", exploring the evolution of AI scaling and the exciting potential of decentralized networks for training next-generation models. The World's RL Gym: https://www.symbolic.capital/writing/the-worlds-rl-gym

🎯 Key Highlights

The three phases of AI scaling: Pre-training, Inference Time Compute, and the RL Renaissance.

How DeepSeek's novel RL approach (using GRPO) created powerful reasoning models with minimal human data.

Understanding "reasoning traces" and how models learn to "think" longer and more effectively.

The potential downside of human preference data inhibiting model creativity, drawing parallels to AlphaGo.

Exploring the "World's RL Gym" concept: Decentralizing RL through open environments, diverse tasks, and verified data.

Why open, collaborative RL environments might outperform closed-source labs in generating diverse AI strategies.

The critical role of high-quality base models for successful RL fine-tuning.

Future AI architectures: Continuous learning and the potential of modular Mixture-of-Experts (MoE) models.

Current landscape: Open-source vs. proprietary AI, the challenge of model lock-in, and the role of crypto networks.

Debunking recent claims that "RL is dead" and understanding its true impact.

💡 Want to stay updated with the latest in crypto & AI? Hit subscribe and the notification bell! 🔔

🧠 Follow the Alpha

Tommy's Twitter: @Shaughnessy119

Sam's Twitter: @SPLehman

Symbolic Capital’s Twitter: @symbolicvc

🔗 Connect with Delphi

🌐 Portal: https://delphidigital.io/

🐦 Twitter: https://twitter.com/delphi_digital

💼 LinkedIn: https://www.linkedin.com/company/delphi-digital

🎧 Listen on

Spotify: https://open.spotify.com/show/62PR1RigLG2YN5Pelq6UY9?si=18ac7ccf36ab4753

Apple Podcasts: https://podcasts.apple.com/us/podcast/the-delphi-podcast/id1438148082

Youtube: https://www.youtube.com/channel/UC9Yy99ZlQIX9-PdG_xHj43Q

Timestamps

00:00 - Introduction: Sam Lehman, Symbolic Capital & "The World's RL Gym"

01:30 - History of AI Scaling: Pre-training Era

03:30 - Phase 2: Inference Time Compute Scaling

09:30 - Phase 3: The RL Renaissance & DeepSeek Moment

14:30 - How DeepSeek Trained R1 without Human Preferences

16:30 - AlphaGo Analogy: Human Data Inhibiting Creativity?

20:30 - Generalizability of RL Training: How Far Does It Go?

22:30 - The "Aha Moment": Models Learning to Think Longer

25:30 - Concept: Decentralized RL & The World's Gym

31:30 - Why Decentralize RL? Open Collaboration vs. Closed Labs

35:00 - Understanding Reasoning Traces

39:00 - Current Decentralized RL Projects (Prime Intellect, General Reasoning)

41:30 - Future Architectures: Continuous Improvement & Modular Models

46:30 - Open Source vs. Proprietary AI: Landscape & Challenges

50:30 - The Lock-In Problem with Foundational Models

52:30 - Is AGI Here? Experiences with GPT-4o

56:30 - Investment Focus in Decentralized AI

59:00 - Modular MoE Models & Gensyn's HDEE Paper

1:03:00 - Debunking "RL is Dead" Claims

1:06:00 - Importance of Performant Base Models for RL

Disclaimer

This pod

SPEAKER_01 0:00

You're now plugged into the Delphi Podcast. Hey everyone, it's Tommy from Delphi Ventures, and today I'm thrilled to have on Sam Lehman, who uh wrote one of the absolute best posts on AI called The World's RL Gym. Uh, what the RL Renaissance means for uh decentralized AI. Sam, how are you? How's it going?

SPEAKER_00 0:27

Doing well. Uh really happy to be here. Thanks for having me on.

SPEAKER_01 0:30

Sam, you're uh you're a partner at Symbolic Capital. You're a crazy good AI researcher. Tell us a bit about yourself before we dive into the post for everyone.

SPEAKER_00 0:39

Yeah, so I've been at Symbolic for about three years now. Um, sort of started with the fund around inception. Have been investing across Web3 broadly over those three years. Um, in a past life, I worked in TradFi. And even before that, I actually uh opened a craft brewery. So I've done a bunch of different things. Uh, but now with Symbolic, I've been very focused on the world of decentralized AI. It's a huge part of our fund's thesis. We do pre-seed and seed investing mainly. Um, and I've just really been enjoying exploring this world of Web3 that's for me provided the most excitement that I've ever had investing in this category.

SPEAKER_01 1:21

Yeah, I mean, our funds are definitely aligned on the thesis there and the focus. So I'm excited to talk to you on this. Um, let's start with the beginning of the post and see where the conversation takes us. But let's just open up with you give a really good brief history of AI and ML scaling, right? The pre-training scaling laws, the importance of data in this training, scaling up these models. Maybe just like walk us through the genesis of the history of the AI scaling stuff.

SPEAKER_00 1:48

Yeah, definitely. So I think about this history in three phases. Um, phase one being focused on pre-training, phase two on inference-time, or test-time, compute scaling, and then sort of this new um emergent phase, which is RL scaling, which is what the post is focused on. And essentially, to maybe zoom out initially, my goal with the piece was to, you know, starting with this new RL renaissance that's upon us, trace my way back to figure out how we got here, what came before, and specifically, like, you know, what levers were AI researchers pulling to scale and make more performant models? And how has that changed over time? How do we get to this new paradigm, which feels like the paradigm of RL? Um, so we started with pre-training, which is essentially being able to throw more data and compute into models at pre-training. And as long as we were able to increase those two factors, we would end up with more performant models. Um, and there was a lot of work back in 2020 and then 2022, first by very early researchers at OpenAI, some of whom, you probably know, went on to uh found Anthropic, and then later by DeepMind researchers. They basically sort of codified these rules and found a very clear scaling law, which is essentially the ratio of data to the eventual parameters you want that model to end up with. Um, and essentially we found that, you know, most of these models at the time, way back when, um, had too many parameters relative to the data that was going in. So we needed to basically increase the amount of data, and then in addition the amount of compute, to get to better models. So that was sort of the pre-training era. And then um, you know, fast forward to 2023, 2024, we had these reasoning models come out, and we realized that there's now this sort of new paradigm upon us, uh, where by allowing models more time to think, uh, we could get more performant models. 
So now it wasn't just that you had to put more compute and data in at the beginning, in pre-training, but actually, once you had the model, you could allow the model to think longer, to use more compute resources while it's solving problems, and get better performance. That was the second phase. And that took us to the RL phase.

SPEAKER_01 4:26

Yeah, it's pretty crazy. So basically, you found out that in the pre-training phase where these AI labs are just throwing, you know, mass compute, basically the the zero and one parameters in these models, there was too many parameters for the amount of data going in. So there wasn't enough data to to effectively change those zeros and ones to make a performant model. Is that what happened?

SPEAKER_00 4:47

Yeah, exactly. By um codifying what were called the Chinchilla scaling laws, they found that the optimal ratio of data to parameters was roughly 20 tokens per parameter. But before they codified these scaling laws, that ratio was off. There were too many parameters relative to the data going in. And that sort of opened researchers' eyes to this idea that we had to get more data going into these models.
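
As a rough illustration of the ratio Sam mentions, the arithmetic can be sketched in a few lines of Python. The 20-tokens-per-parameter figure is the approximate rule of thumb quoted in the conversation; the function names and the example model size are hypothetical and only there to show the calculation.

```python
# Rule-of-thumb ratio quoted in the conversation (approximate, not exact).
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Approximate number of training tokens suggested for a model of n_params."""
    return n_params * tokens_per_param


def is_undertrained(n_params: int, n_tokens: int) -> bool:
    """True if the model saw fewer tokens than the rule of thumb suggests."""
    return n_tokens < compute_optimal_tokens(n_params)


# Hypothetical example: a 100B-parameter model trained on 300B tokens.
params = 100_000_000_000
tokens = 300_000_000_000
print(compute_optimal_tokens(params))   # tokens the rule of thumb would suggest
print(is_undertrained(params, tokens))  # whether the model has "too many parameters" for its data
```

Under this rule of thumb, the hypothetical model above would want roughly 2 trillion tokens, so it would count as having too many parameters relative to its data.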

SPEAKER_01 5:19

That's awesome. So that's that's phase one, which you just covered. And then phase two is this inference time, you know, test time compute scaling, like where the model is sort of thinking in in real time. Walk us through that a little bit more because obviously everyone's seen like ChatGPT think, but I don't think they have like the nuance that that you have with it.

SPEAKER_00 5:38

Yeah, so basically um up until, you know, roughly the end of '23, '24, you had more of these sort of down-the-middle, uh straightforward LLMs where you would put in a query and get, you know, your immediate response back. Uh the model had sort of one shot to think, solve the problem, give you your output. Um, and then through, you know, techniques that have become more apparent with time, researchers were able to develop these reasoning models: models that actually had the ability to use more compute at the time of inference, to spend more compute to think longer, to check their work, to explore different pathways of solving a problem. Um, and we found, sort of similar to how there were pre-training scaling laws, where you scale up data and compute and get a better pre-trained model, you can also scale up the amount of compute a reasoning model can use to think, to allow it to basically think harder, think longer, and get better answers. Um, and again, you know, thanks to Google DeepMind researchers, uh, there was a paper called Scaling LLM Test Time Compute that showed that you could have a smaller model, but allow it to think longer, and it could outperform a larger model. Um, and this showed us, okay, now there's a new uh lever we can pull to get better performance, and that's allowing these reasoning models to think longer.
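
One simple way to picture "spending more compute at inference" is best-of-N sampling: draw several candidate answers and keep the one a scorer likes best. The sketch below is only an illustration of that general idea, not a description of how any particular reasoning model works internally; `best_of_n`, the toy generator, and the toy scorer are all hypothetical stand-ins.

```python
from typing import Callable


def best_of_n(generate: Callable[[int], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidate answers and return the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: the "model" proposes the integers 0, 1, 2, ... as answers to
# "which number squared is 49?", and the scorer prefers answers whose square
# is closest to 49.
def toy_generate(sample_index: int) -> str:
    return str(sample_index)


def toy_score(candidate: str) -> float:
    return -abs(int(candidate) ** 2 - 49)


print(best_of_n(toy_generate, toy_score, n=3))   # small inference budget
print(best_of_n(toy_generate, toy_score, n=10))  # larger inference budget
```

With a budget of 3 samples the best available answer is still wrong; with 10 samples the correct answer is among the candidates and gets selected, which is the sense in which more inference compute can buy better answers.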

SPEAKER_01 7:16

Yeah, it's so cool. It it kind of came out of left field a bit because everybody was just thinking like, let's just build bigger data centers, let's throw more hardware, let's throw more data. And that was sort of, you know, I think where everybody was just going to spend the next 10 years, and then inference time or test time compute just came out and kind of shocked the world that this you could just think at the time you asked the question.

SPEAKER_00 7:37

Yeah, and there were some pretty, I mean, uh pretty amazing sound bites that sort of came out of this new paradigm. And that's roughly around the time when uh Jensen was on the BG2 podcast and said something like, you know, inference is gonna be a billion times bigger than pre-training. Uh and so with the combo of, you know, having more compute at the time of inference and also just having more consumers hitting these models, you know, demanding more inference, there was just like a perfect confluence of, oh my gosh, okay, all the compute is gonna be going towards inference now. Um, and that again sort of marked a bit of a paradigm shift.

SPEAKER_01 8:17

Yeah, it is pretty crazy because originally before the test time compute stuff came out, I assumed inference would be huge or bigger than training in a sense of just raw quantity of requests to the models, like just pure asks. I did not think it would ever be in the the thinking time per request.

SPEAKER_00 8:34

Um yeah, exactly. And so this was a really cool new vector, again, in which we could scale up models. And in the paper, I sort of show this graph where you have um scores on the ARC-AGI eval, uh, which is a popular way to test the abilities of new models. And they're pretty flat as you get, you know, uh every new GPT-2, GPT-3, GPT-4 release. And then there's this sort of near-parabolic shift when the reasoning models come out and they just crush this eval because these models are now able to expend so much more compute when they're actually solving the problems. And again, that resulted in a real fundamental shift in the whole industry.

SPEAKER_01 9:21

Yeah, it is pretty crazy. I mean, for those just to describe the graph a bit, you have the ARC-AGI score and you have GPT-4o at, you know, let's call it, I don't know, maybe this is five or six percent, and then you have o3 at 90% a couple months later.

SPEAKER_00 9:36

Yeah, it's crazy.

SPEAKER_01 9:37

So walk us through phase three. So we talked about pre-training, we talked about inference time. What happens at the uh phase three here?

SPEAKER_00 9:46

Yeah, and so what I use is sort of the DeepSeek moment as a catalyst for a lot of this. And it is, I think, when talk about RL for LLMs uh became a lot more mainstream. I want to be really clear, RL has been around for a long time, and all the frontier labs were doing RL, specifically RLHF, on their models around this time. And DeepSeek, uh, they weren't the first to use RL, they just used it in an innovative and very elegant way, I would say. Um so essentially what we saw with this DeepSeek moment was they took a very performant base model and, through their very uh elegant, sort of GRPO-focused RL process, which we can get into more if you want, got a base model to develop extremely powerful reasoning capabilities. And it showed that you could use RL on LLMs to take one model and make it much, much more performant, to sort of incentivize and elicit this really powerful reasoning behavior. And to do so in a way, and this is like the really important thing, with limited human intervention. And that was a huge, huge uh sort of breakthrough with what DeepSeek published: that the processes they used had minimal human intervention.

SPEAKER_01 11:21

Yeah, it's crazy reading your post because I remember when DeepSeek um not came out, because it came out like one or two weeks before, but that Sunday, when the NVIDIA crash was about to happen that Monday morning and DeepSeek really went viral, everyone just claimed it was distilled off OpenAI. But it sounds like from what you're saying, like they took a pretty novel approach to train R1 in the end. Would that be fair?

SPEAKER_00 11:46

Yeah, from what I understand, most of the industry was using PPO, and then DeepSeek showed the viability of GRPO. And since then I've seen a lot of other researchers and engineers implementing GRPO in their own RL processes. Um I think it's important to note here, like, so much of this is extremely new. In the wake of DeepSeek, I just saw this massive explosion of different RL algorithms um on my timeline. Everyone's like, oh, there's, you know, there's the nth PPO-type thing. Um, you know, what's going to be next? So there's been a lot of experimentation, but uh GRPO was a really exciting new um uh algorithm that the DeepSeek team implemented for their RL process. Um, and a lot of teams have been using it.
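
The conversation does not spell out what distinguishes GRPO from PPO. As described in DeepSeek's published papers, GRPO samples a group of answers for the same prompt and scores each answer relative to the group's average reward, which removes the need for the separate learned value model that PPO uses. A minimal sketch of just that normalization step follows; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, and the rest of the algorithm (the clipped policy-ratio objective and KL penalty) is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds the rewards of several sampled answers to ONE prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, graded 1 (correct) or 0 (incorrect).
print(group_relative_advantages([1, 0, 0, 1]))
# If every answer gets the same reward, no answer stands out from the group.
print(group_relative_advantages([1, 1, 1, 1]))
```

Correct answers in a mixed group receive positive advantages and incorrect ones negative, so the model is pushed toward whatever it did in the better-than-average samples.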

SPEAKER_01 12:37

It's really cool. So I wasn't really too aware of this until I read your post, but you lay out sort of the DeepSeek stack where you have DeepSeek V3, which is their 671 billion parameter, I guess we'll call it a base model. Then you have R1-Zero, which is the model they trained using V3, and then you have DeepSeek R1, which is like this cleaned-up version. Um I'd love to maybe walk through this process a bit, like, you know, maybe your key takeaways on going from that base model to this smart reasoning model, like just that flow.

SPEAKER_00 13:11

Definitely, and I think this went maybe a bit under the radar because basically DeepSeek, um, they published V3, which was this mixture-of-experts sparse model, I think around December. And then it was about a month or so later that R1 and R1-Zero came out. So there were actually staggered releases. Um, and so it's confused some people about uh what was actually going on, I think. Um the way I would think about it is essentially the DeepSeek team had their base model, V3, and then they performed two, you know, similar but slightly different RL processes on that base model to end up with these reasoning models. Um, with one process you end up with R1-Zero, with the other process you end up with R1. What was really exciting is what they did with R1-Zero, um, which is where you get that removal of human intervention. With R1-Zero, they essentially took the base model V3, they fed the model a bunch of questions, mainly around math and coding, and were able, because those were pretty verifiable domains, to just say, you know, the model gets a one if it's correct or a zero if it's incorrect. And they allowed the model through trial and error to solve a bunch of those problems and get that binary reward and learn how to become fairly performant at solving the types of problems that were being fed to it. Usually you would feed the model um a lot of highly curated human-generated data to show how it should mimic the way that humans might want to uh, you know, get their outputs. But with R1-Zero, it was literally just: here's a bunch of problems, start thinking, figure it out. And through that process, the model learned to think longer and to use more compute while it's thinking, and with that became much more performant. And, like, I don't want that to go under the radar. That's a really exciting innovation.
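
The "one if it's correct, zero if it's incorrect" signal Sam describes can be sketched as a tiny verifier. This is a simplified illustration, not DeepSeek's actual reward code: the function names and the whitespace and case normalization are assumptions made for the example.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial formatting differences don't matter."""
    return " ".join(text.split()).lower()


def binary_reward(model_answer: str, reference_answer: str) -> int:
    """Return 1 if the model's final answer matches the reference, else 0."""
    return 1 if normalize(model_answer) == normalize(reference_answer) else 0


# Hypothetical question/answer pairs paired with hypothetical model outputs.
graded = [
    binary_reward(" 42 ", "42"),      # correct despite extra whitespace
    binary_reward("x = 3", "X = 3"),  # correct despite casing
    binary_reward("41", "42"),        # incorrect
]
print(graded)
```

Note that the reward only looks at the final answer; nothing here tells the model how to reason its way there, which is the point made in the conversation.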

SPEAKER_01 15:27

Yeah, no, that is so just to rehash this a bit. So what you're saying is like they gave R1 a bunch of questions at, you know, and then they opened up the back of the textbook and gave it all the answers, but it did not tell it the process on how to go from question to answer. And that's how these models are usually trained. Is that fair?

SPEAKER_00 15:49

That's exactly right. Um, and then this is, you know, also very important to note. What you end up with in R1-Zero is a very performant model. It's very good at solving problems and getting the right answer, but it's not very human-legible. So in the actual answers it gives and the reasoning traces it generates, sort of the chain of thought that you can look at, it would mix languages, it wouldn't use proper syntax. It's not a model that's great for everyday interaction with humans. You basically have this sort of like wild child that was somewhat uninhibited by human preference, that got super smart, but hasn't exactly uh, you know, been conditioned to fit what humans would like to interact with.

SPEAKER_01 16:45

But but but Sam, so the but even though the model is switching languages uh while it's thinking, it's still getting to the answer. That's right. That's nuts. I yeah, it's it's it's hard to understand like thinking in English and Chinese and getting to an answer is is nuts.

SPEAKER_00 17:01

Um exactly. And I think, you know, something I've been spending a bit more time researching is how the uh injection of human data, of human preference, into the RL process could potentially inhibit uh the creativity, expressivity, and ultimately the performance of models, right? Um I was actually recently learning more about the history of AlphaGo and AlphaZero. And the Google researchers, you know, initially trained this model to be a great Go player, which is a very complex game, but one with verifiable rewards. And at first pass, they gave the model a bunch of examples of how humans play Go, because they felt like they needed to have some reference to start the RL process and then become a performant model at Go. They later went back and removed that interjection of human data, and they let the model know the rules and have verifiable rewards, and just learn to play, learn to win at Go, without that human interjection of data. And you got an even better model, a model that would make moves that no human would think of themselves. And that's how you get this, I think it was called like move 37, which was this move that no human player would ever make, that blew the minds of all Go players because this model is doing something that humans wouldn't think to do. And so I worry sometimes, or I wonder, you know, maybe we should just let these models be a bit more weird, uh, be a bit more curious, uh, explore a bit more, and not, you know, not chain them to human preference, human data.

SPEAKER_01 19:01

Yeah, it's crazy that move 37 was so long ago. And here you and I are talking about this, like I mean, you mentioned it's not new, but yeah, I don't know. Like, why is it such a big deal now? Is it because we're all able to use it and understand it?

SPEAKER_00 19:15

Yeah, I think, you know, one, RL is very well suited to games, to chess, to Go, because the constraints are very clear. The rewards are really easy to calculate, right? You either win or you lose the game. It's been much harder to fit that to LLMs, to everyday human interactions. And it's taken some time, I think, for the industry to figure out how to fit these RL methods that were great in certain contexts into this new chat, tool use, problem-solving era that we're in now. Um and it's also, I think, why we've seen so much of the gains be made in coding and math, because those are the easiest to verify in a binary way. And I think there's a lot of challenge right now. Okay, how do we use these RL techniques in non-verifiable domains, in creativity, poetry, writing, um, domains like that?

SPEAKER_01 20:24

So, Sam, you mentioned that DeepSeek was trained on like having this big supply of questions and answers, right? Does that imply that these models can't get good at answering new questions for which they don't have the answers? Like I'm trying to figure out where the gap is between, you know, training these things on question and answer and then solving unique and new things for which they've never seen an answer before.

SPEAKER_00 20:51

Yeah. Um they're good at the domains in which they were trained. So I want to be clear, a lot of this work is still reliant on having really high-quality question-to-answer pairs to start with, and the correct results. There's been limited work on how training a model to be great at math and coding uh generalizes. So if you give a model a bunch of data on math and coding and give it those question-answer pairs, will that model then become good at creative writing? Um I know you've had Travis from Ambient on the podcast recently. I think he's very bullish on generalizability. I think I'm personally a bit more skeptical. I think it's a bit under-researched, um, and it's a big problem to solve now. I think uh generalizability is gonna be a huge new theme, as well as being able to adapt the RL process to uh new, not easily verifiable domains. And I think there's a lot of uh unsolved questions there.

SPEAKER_01 22:05

Yeah, definitely out of my my range, but I'll definitely maybe get you both on to chat through it. Um and then and then just one of the things you mentioned during the post that I really want to spend a little bit of time on was you mentioned it's it's about like halfway down your post that DeepSeek, um the length of the response, I think it thinks more and it thinks longer, right? Like that just seems weird, right? Like, so like if it gets the the 10 questions on the 10th one, it's thinking longer for that one. Like, why is that? Or am I describing that wrong?

SPEAKER_00 22:39

Yeah, you know, essentially what would happen was, as it was getting uh more and more questions and it was being rewarded for getting the answer right, it started to sort of learn that the longer it was thinking, the more right the answer was. And so there's this beautiful chart in the DeepSeek paper where on one axis you have steps, so basically, you know, how many questions have been fed to the model, how long it's been post-trained for, and on the y-axis, you have the length of response. Basically, as we're going out on the x-axis, uh, it's learning to think longer. And I think, if I recall right, there's what the researchers called the aha moment, where there's a reasoning trace the model is generating. You know, it's thinking with text essentially, and it says, you know, wait, wait, I need to think about this longer, or I should think about this another way. And it's literally telling itself, wait, if I think more about this, I can get the answer right. And you see that um play out in this chart where uh the more it learns, the more it learns to think longer, because it tends to get answers right more often with these longer traces. Um that's a really beautiful moment um in the training process that the researchers called out.

SPEAKER_01 24:06

It is kind of crazy. It sounds like, you know, historically, I always thought that proprietary data or just access to, you know, large swaths of data or go-forward new data was like one of the biggest moats. But from what you're describing, it really sounds like these reasoning traces or this flow of thinking is the most important sort of data source or something. Is that fair?

SPEAKER_00 24:30

I go back and forth on this a lot. Um in the piece, you know, I talk a lot about how essentially the RL process is: you get the model to generate synthetic data, you verify that data, and you take those verifications and allow the model to learn from them and become more performant. Um, and you can get this virtuous cycle where you coax the model to generate this data, you verify it, it feeds back, it makes the model more performant. Um but at the same time, I've maybe adjusted my thinking a bit, because to get the model going, you still need really high-quality question and answer pairs. Um, and you know, there was talk of the DeepSeek team just spending hours and hours on end generating, you know, math and coding pairs themselves. Um, and so, you know, I think the next push is gonna be figuring out how we can get away from this reliance on really high-quality human data to get this flywheel started, and just allow more of uh this virtuous cycle, synthetic data all the way down, to play out.

SPEAKER_01 25:50

So you brought us to uh a fun part of the the podcast in your post. It's just I guess like that next step to creating this decentralized reinforcement learning type network. Um I don't want to give any details, but love your take first, but what is this like this convergence here that you're talking about? Because this is an interesting avenue.

SPEAKER_00 26:12

Yeah, so I get to this part in the post where my goal is to map out how all this fits together, how you would decentralize the RL process, but also why you might want to. And I break it into these three parts that I name in a bit of a cutesy way, you know, the first part being the foundation, the second being the gym, and the third being the refinery. Um and I can explain these pieces, but essentially the foundation is: you need a great foundation to do RL on top of, and that foundation is a performant base model. And I think that if we're gonna decentralize this whole thing, we might as well decentralize uh the pre-training for that base model. You need a gym to uh elicit the different types of reasoning behavior, the different LLM cognitive strategies. Uh, you need to generate that high-quality data. And then the last part, the refinery: you actually need a network to do the optimization, the actual post-training itself. Uh, and so that's how I map things out.

SPEAKER_01 27:32

No, it's a really helpful mental model. You have your base model, right, at the bottom, and you want to do decentralized training, like Nous, Prime Intellect, Gensyn, take your pick, to do that. Then in the middle, you have the gym to generate all these high-quality reasoning traces, which we just discussed, and then the refinery to use those to then, you know, go back and train the model again. Um I'm really curious about your take on the gym part. Um, just like an environment to generate diverse high-quality reasoning data, like, is that, you know, uh, a university? Is that Microsoft's research team? Is that the Delphi research team? Like, what are these environments? Like, how do you see them operating?

SPEAKER_00 28:14

Yeah, so um the idea of an RL gym is not new. Um essentially, what it is is a uh digital environment where you can have a model try out different strategies and solve a given problem, a challenge, in many different ways, to have that sort of um arena with enough guardrails around it to steer the model to find the best strategy. And so, you know, one of the first examples of an RL gym was the OpenAI Gym. It was an old project they had where it was literally an environment for devs to test out different RL strategies on basic tasks. Um there's something called CARLA, which is an environment to try out different strategies for self-driving cars. Um, and so these have existed before. What I call for is, now that we're seeing how RL can work for LLMs, and LLMs have um applications across many, many very diverse domains, we need to spin up environments to allow the RL process to play out across many different domains. So there could be, you know, an environment that's great for math, you know, a set that's focused on creative writing, on medicine, on drug discovery, on physics. Um, you know, the idea is to have many environments that can elicit the best strategy for a given domain. And then you need to pair that with robust verifiers. So you need environments eliciting all this uh behavior amongst the models to solve problems. And then you also need verifiers which say if this was right or wrong, or, you know, pick the best strategy and therefore elicit the most performant behavior.
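
To make the environment-plus-verifier pairing concrete, here is a toy sketch. It is not the API of OpenAI Gym, CARLA, or any decentralized RL project; every class and function name (`Task`, `Trace`, `ToyEnvironment`, `exact_match`, and the two policies) is hypothetical. The environment holds tasks, any policy can attempt them, and a verifier marks each resulting trace as right or wrong.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    prompt: str
    reference: str  # known-correct answer used by the verifier


@dataclass
class Trace:
    prompt: str
    reasoning: str
    answer: str
    verified: bool


# A policy maps a prompt to (reasoning text, final answer).
Policy = Callable[[str], Tuple[str, str]]


class ToyEnvironment:
    """Holds tasks for one domain and a verifier that checks final answers."""

    def __init__(self, tasks: List[Task], verifier: Callable[[str, str], bool]):
        self.tasks = tasks
        self.verifier = verifier

    def run(self, policy: Policy) -> List[Trace]:
        traces = []
        for task in self.tasks:
            reasoning, answer = policy(task.prompt)
            ok = self.verifier(answer, task.reference)
            traces.append(Trace(task.prompt, reasoning, answer, ok))
        return traces


def exact_match(answer: str, reference: str) -> bool:
    return answer.strip() == reference.strip()


def adding_policy(prompt: str) -> Tuple[str, str]:
    """A stand-in 'model' that actually solves 'a + b' prompts."""
    a, b = (int(part) for part in prompt.split("+"))
    total = a + b
    return f"{a} plus {b} is {total}", str(total)


def guessing_policy(prompt: str) -> Tuple[str, str]:
    """A stand-in 'model' that always answers 0."""
    return "just guess", "0"


env = ToyEnvironment([Task("2 + 3", "5"), Task("10 + 7", "17")], exact_match)
good = [t for t in env.run(adding_policy) if t.verified]
bad = [t for t in env.run(guessing_policy) if t.verified]
print(len(good), len(bad))
```

Only the verified traces would be kept as training data, which is the filtering role the verifier plays in the setup Sam describes.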

SPEAKER_01 30:18

So, Sam, how do I contribute a reasoning trace to the gym? Like let's say I'm a you know, I'm a venture capitalist, I I invest in projects, like I wanna I want to contribute to this gym. Do I like upload my investment committee doc and like the outcomes of those those projects on if I made money or not? Like, like what is the the process for adding, you know, for for adding to a gym?

SPEAKER_00 30:43

Yeah, so the concept um would allow you to contribute in different ways. My idea would be you have an open platform where anybody can create environments that then allow other people to bring models to that environment and to generate synthetic data, generate these reasoning traces, um have those traces get verified, and then um, you know, take that corpus of data that has been verified and contribute it to the training of new models. Um my hope is that this can be done in a very open, decentralized way.

SPEAKER_01 31:30

That was uh my million-dollar question for you. Like if we're looking at the, you know, you're an OpenAI, you're an Anthropic, um, you know, take your pick of an AI lab, and you're looking at this specific issue where, hey, we need a lot of reasoning traces to perfect our models. Like, I guess my main question for you is, why is it better to do that decentralized than to do it from their perspective?

SPEAKER_00 31:57

Yeah. And so this gets into a bit of the philosophical, but I think it's also a more fundamentals-driven argument. A lot of my belief in this domain comes from a concept that the USV team uses, where they believe that the best companies come out of areas where you're unlocking innovation at the edge, where there's the ability for people to try out different things in a very open way. RL, I think, is uniquely well suited to that idea because the whole process is about allowing models to explore different strategies, to get those different attempts verified, and to learn and improve from that. And I personally think that you want a platform where as much experimentation across as many different domains can happen as humanly possible, because only then will you elicit the absolute best possible strategies. I've been working on this analogy in my head: when I think about a closed-source frontier lab trying to elicit the best RL strategy, it's like taking a super, super smart kid and putting that kid in a room alone. You give them the best tutors and the best textbooks and resources, and you ask them to solve the world's hardest problems. That's kind of what it feels like the frontier labs do. It's very closed, but they have so many resources that they might be able to get to the best answers. The flip side, the more open side of this, would be a school where the smartest people in the world can attend and they can collaborate with each other, they can share ideas, they can figure out the best way to solve problems together. And, oh, I tried it this way, you did it that way. I think yours is more interesting. Let's go down this path. And I'm very confident that's going to lead to the most progress, the best possible strategies, when you have that open collaboration and exploration.

SPEAKER_01 34:32

No, it's a fair take. Maybe something that's a little confusing for me, I'll run by you to see if I'm on the right path here. But when I'm thinking about reasoning traces, I get a little confused when I'm thinking through the data that an OpenAI has to use from just people globally asking questions and reaching conclusions, right? And there is sort of this supervised human involvement of, here's my question, no, no, no, here's more data, this is the way I want to solve it, and then bam, there's the answer. Would you consider that a reasoning trace in its own right? Or is that sort of separate from what we're talking about here?

SPEAKER_00 35:11

Yeah, so a reasoning trace is just the chain-of-thought string that the model produced to end up at its eventual answer. So if you've ever seen, I mean, DeepSeek was the one that sort of pulled back the curtain, they showed the model thinking, where it would say, oh, the user is asking this question. I think that I should solve it with this strategy. If I were to do that, I would add these two numbers together and then I would divide by this. And it's basically that full chain of thought. And those traces are important feedback into the model to continue to incentivize and elicit better and better reasoning behavior to get to the eventual right answer.

SPEAKER_01 35:55

Yeah, that is cool. It I I get a little cloudy on trying to figure out how people globally will contribute to the gym. Like what you're saying makes sense. I'm just like, you know, having hundreds of millions contribute, I get a little rough on how it how it happens, you know?

SPEAKER_00 36:11

Yeah, so it's not about like I'm gonna sit down and I'm gonna hand write out a bunch of reasoning traces. It's more I you know, the the the sort of the world's gym is this idea that you can allow people to create different environments to put models into, to have those models generate the reasoning traces, for those traces to be verified if they're right or if they're wrong. Um, and for that sort of that platform to allow for the elicitation of the best possible reasoning strategies. It's about having like many, many diverse environments created and creative ways to verify what's right or what's wrong or what's most preferable in an open way.
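
The loop described here (models generate traces inside an environment, a verifier accepts or rejects them, and only accepted traces join the corpus) can be sketched roughly as follows. Everything in this block is an illustrative assumption: `collect_verified_traces`, the two stand-in "models", and the addition verifier are hypothetical and only show the shape of the filter, not any project's actual pipeline.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is just a function: prompt -> (reasoning trace, final answer).
Model = Callable[[str], Tuple[str, str]]
# A "verifier" checks a final answer against the prompt.
Verifier = Callable[[str, str], bool]


def collect_verified_traces(
    prompts: List[str], models: Dict[str, Model], verifier: Verifier
) -> List[dict]:
    """Run every model on every prompt and keep only traces whose answer verifies."""
    corpus = []
    for prompt in prompts:
        for name, model in models.items():
            trace, answer = model(prompt)
            if verifier(prompt, answer):
                corpus.append(
                    {"prompt": prompt, "model": name, "trace": trace, "answer": answer}
                )
    return corpus


def add_verifier(prompt: str, answer: str) -> bool:
    """Toy verifier for prompts of the form 'a + b'."""
    a, b = (int(x) for x in prompt.split("+"))
    return answer.strip() == str(a + b)


def careful_model(prompt: str) -> Tuple[str, str]:
    a, b = (int(x) for x in prompt.split("+"))
    return (f"Add {a} and {b} to get {a + b}.", str(a + b))


def sloppy_model(prompt: str) -> Tuple[str, str]:
    a, b = (int(x) for x in prompt.split("+"))
    return (f"Guess that {a} plus {b} is {a + b + 1}.", str(a + b + 1))
```

Calling `collect_verified_traces(["2 + 3"], {"careful": careful_model, "sloppy": sloppy_model}, add_verifier)` keeps only the careful model's trace. Nobody hand-writes traces; contributors supply environments, models, and verifiers.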

SPEAKER_01 36:56

That is pretty cool. I guess if you had to take the other side of the argument, is there any way that you think an OpenAI could use their funding, or offer people free access to the models in return for these traces, or anything that you think would sort of go against the thesis of a global decentralized RL gym?

SPEAKER_00 37:17

I mean, I think what they have and what they're clearly doubling down on is distribution. They have so many users on their platform, and with O3 coming out now, getting these models to use tools, to start to move into different parts of reasoning, different types of problems, they now have this platform that feeds them so much very, very valuable data. I worry about that being a real advantage for them, just having the most users generating this great synthetic data without the users really even knowing it. And so they are gonna have the volume for sure, but I still think, at the same time, the downside is that you're hoping that one giant centralized company can go after and try to solve the right problems, go after the right domains. I'm sure they're working on non-math and non-coding focused stuff. Obviously, they're doing a lot of multimodal innovation as well. But I'm sitting here still waiting for, you know, when are we gonna get real progress in non-verifiable domains? Where is the progress in creative writing and things outside of math and coding going to come from? If there was a platform where anybody could explore those things and we could figure out different strategies to verify those domains, I think we're more likely to get to the best models for those areas faster than one centralized player.

SPEAKER_01 39:03

That's cool. And it's not really hypothetical, right? Like I know you mentioned that there are a couple of projects that have done this in the past. And I think in the post, you also mentioned Prime Intellect has SYNTHETIC-1, which has two million reasoning traces. So it sounds like this is very much moving forward as a concept.

SPEAKER_00 39:24

Absolutely. Yeah. So there are two companies that I call out in the piece that are starting to make efforts in this area. Prime Intellect's been at the forefront of a lot of innovation in decentralized AI. I'm assuming a lot of people listening might know of them, but they've been working on decentralized RL. They just had a big release on Synthetic 2, which is their new decentralized RL run. But earlier they'd announced this project called Genesis, which was an open source library of many, many reasoning traces. So anybody can get access to these types of traces that can be used to post-train models. The other project I called out was called General Reasoning. It's not a crypto company, it's an open source platform where anyone can contribute and verify reasoning traces in many different domains. So across math, medicine, physics, writing, etc. I think it's a super cool project. It's sort of for the greater good right now, but I mentioned in the post, could this be monetized some way? Could people be getting rewarded either for generating traces with models or for creating new environments that elicit more performant model behavior?

SPEAKER_01 40:47

That's really cool. So, Sam, let's take this a step further. Like, let's say that we have this global decentralized network, this gym, right? We're crowdsourcing diverse tasks, verified reasoning traces, we have all these new environments, and it's a huge, valuable moat of data, right? How are we feeding that back into the world? Like, do you envision that goes into one open source model? Do you envision it goes into many open source models? Like, how does that train AI for the betterment that we're all using?

SPEAKER_00 41:20

Yeah, um that's a that's a great question. That's a very big one. And I think you're gonna get a lot of different answers. And you know, there's this concept of one global open source decentralized world model that I know a couple of different teams are working on, where you could end up with a with an architecture that would support this of continuous improvement. So you could constantly be generating new synthetic data, new traces, always feeding that back into the model. And instead of doing sort of these one-off pre-training runs or you know, post-training RL runs, have that data always going back into the model and constantly improving and having sort of this one, this one model that sort of one decentralized network supports that anybody could access. I think that would be really cool. I think I have sort of this soft hunch that that's where the this industry is going to go, whether it's decentralized or centralized. There's gonna be sort of new architecture that can support that. I've been a bit recently geeking out on the idea of highly modular sparse models and sort of the ability to have different, very small, more specialized models that are constantly being trained and improved in their one narrow domain, get fit together and plugged together. And I'm happy to talk about that at some point, but I think that there'll be there will be systems where you can continually get a bunch of really sort of specific data, use that to train these sort of niche expert models, then fit them together and uh to make a sort of supermodel. Um, but uh that's sort of some new stuff that I've been slowly looking into.

SPEAKER_01 43:08

No, it's exciting, right? I mean, historically, I always thought the world would sort of fracture into millions of fine-tuned models and then your handful of open source and proprietary mega models. And then I recently became a RAG maxi because I built out a vector database for our fund to reference past plays, and it sort of shifted me a little bit away from fine-tuned models to just large models with that vector call to that database. So this seems a little different, but yeah, it's hard to form a new worldview and keep up.

SPEAKER_00 43:49

Yeah, I don't know. I mean, I don't know if it will derail us too much, but it seems pretty relevant. I might not commit myself to it right now, but I have been planning to write my next piece on modular mixture-of-experts models. With the idea being, right now there's this model architecture, MoE, or in shorthand, sparse models versus dense models, where you basically have these areas of the model, these subsets of parameters, that are focused on one domain, like math, writing, etc. It's still one unified model, but only a certain subset of the parameters of the model gets activated when it's inferencing. So it roughly, definitely not one-to-one, but roughly mirrors the human brain, where there are different parts of your brain that are focused on different types of tasks, different types of problem solving. And the idea has been, can we end up at a place where we can have these experts within the model be more plug-and-play? Can we pull one out, update it, improve it, plug it back in? If somebody develops a new expert that is phenomenal at, I don't know, long-term world context planning or coding in a very, very niche language, can we develop that expert and then plug it back into the model, almost like a Lego block? And that would allow for a lot of interesting experimentation, for these smaller, more nimble models, experts, to get developed by specific research teams and then plugged into other people's models. Because right now you can't do that, and they all need to be trained at once and fit together in unison. But there's been a lot of cool, smart people I've seen slowly talking about this, where I think that might be where the industry is going, and that would therefore unlock this new paradigm where there's all of this experimentation and innovation happening around these sub-experts, where people with different specialties could make their expert really, really good and then have it get used by other people.
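
As a rough illustration of the sparse-activation idea, here is a toy top-k mixture-of-experts forward pass in plain Python. It is a sketch under simplifying assumptions, not how any production MoE is implemented: real models route per token with a learned gating network, whereas here the router scores are passed in by hand, the "experts" are ordinary functions, and all names (`moe_forward`, `router_scores`) are made up for the example.

```python
import math
from typing import Callable, Dict, List

Expert = Callable[[float], float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float,
    experts: Dict[str, Expert],
    router_scores: Dict[str, float],
    top_k: int = 2,
) -> float:
    """Evaluate only the top_k experts by router score and mix their outputs."""
    # Experts outside the top_k are never called: that is the sparse activation.
    chosen = sorted(router_scores, key=router_scores.get, reverse=True)[:top_k]
    weights = softmax([router_scores[name] for name in chosen])
    return sum(w * experts[name](x) for name, w in zip(chosen, weights))
```

In this toy, the "Lego block" idea is just `experts["math"] = better_math_expert`: the dictionary entry is swapped and nothing else changes. The open problem described above is that in real MoE models the experts and the router are trained jointly, so a swap like this is not yet that simple.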

SPEAKER_01 46:16

Yeah, that's a phenomenal thesis. I mean, my first-principles view is just having that narrower and/or more modular endgame, where the models just don't have to generalize about the world and could think in just that one narrow domain, is where you sort of get the juice per area. But we'll see. I'm curious, the window has shifted so many times. Like a year ago, it was, you know, we cannot beat OpenAI, decentralized training and crypto AI and open source is cooked. And now you have Sam and the Anthropic founders literally talking about decentralized data. You have these Prime Intellect runs, the Gensyn runs, the Nous releases, you have your report, you have this huge shift from pre-training now to inference-time compute, right? Reasoning traces. Has all of this changed your worldview on open source versus proprietary AI?

SPEAKER_00 47:13

I wish I could sit here and tell you, like, yes, 100%. It's decentralized AI all the way. I'm not quite there. I think I'm a bit more measured. Where I'm at is that distributed AI is inevitable. And when I was writing the piece, everybody kind of knew the frontier labs were doing distributed runs, where you would have multiple data centers being able to do pre-training in a distributed way. It's not full-on decentralized, where you have heterogeneous compute in a trustless environment where anybody across the world can contribute to a training run with anything from an H100 all the way down to a MacBook or an iPhone. So we're not quite there, but distributed AI is very much a thing. And it's also, to be clear, particularly well suited to the RL context because you can get away with a lot more asynchronicity, and it also needs more parallelization. So it's great for lots of workers working in unison on different types of hardware, and Prime Intellect and Gensyn have both shown the efficacy of that with their recent launches. I still wonder about the need for tokens for speculative incentivization with all of this. I do think there is something to what Alex is doing at Pluralis, where you can shard a model, keep a piece of the model on someone's hardware, and then be afforded rewards based on model usage, given how much compute you contribute to a run. I think there is something there. It's just, will the model that this network can train be used enough for it to be valuable? And then sometimes I just worry, are we really gonna be able to beat the models that these frontier labs are putting out? Because they're pretty effing good. But I would like to hope that with this focus on creativity, exploration, diversity of contributors, we could get better, more interesting models.

SPEAKER_01 49:39

Yeah, I think that's a fair take. I mean, I don't want to give my whole thesis because I have more questions for you, but we're definitely bullish on the open source crypto AI side. It is hard to contextualize that part of the thesis, though, right? Like the openness and global access of people to use, iterate, and own logically should lead to a better outcome in the open, token-owned route. But there obviously are these Goliaths that have clearly better tech right now, like the OpenAIs of the world.

SPEAKER_00 50:13

Yeah, and the incentives for them are to pull up the ladder behind them, create moats wherever they can, right? They're not gonna want to allow you to port your data wherever you want, right? They're trying to lock you in to their models, and they want to make it as hard as humanly possible for anyone to leave their ecosystem. That is what worries me. And for a long time, I'm sure you heard this too, everybody was saying that no one should be in the model game. Well, now with OpenAI rolling out memory, oh wait, maybe there actually is something to this model game. And if we can get you completely loyal to this one model that knows everything about you, you're never gonna want to leave. And so with how fast this is changing, the incentives leading to a specific type of behavior by these companies, I definitely do worry that maybe we're not gonna be able to get people to try more open source models, at least at the retail level.

SPEAKER_01 51:18

Yeah, I mean the lock-in is a real issue, right? Especially with memory. I mean, I get annoyed when I run out of Claude instances and have to copy and paste the entire convo to my Claude instance, right?

SPEAKER_00 51:30

Yeah, yeah. Like I used to be a Claude maxi, and then OpenAI has got me pretty locked in right now. And something would have to be really, really good to make me want to leave.

SPEAKER_01 51:44

Yeah, it is crazy though. Like if you build with these models, everyone just thinks, oh, you can just swap one model out for the other. And I mean, I do very basic versions with Bob, my analyst, but that's an AI agent, and it's really not that easy, right? Like, you can't just set the temperature and the top-p and top-k and throw your prompt in, and it's exactly the same. These really do become, to your point, locked into that model. Like everything you throw in that prompt is sort of fine-tuned for that experience and that model.
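
For readers unfamiliar with the sampling knobs mentioned here (temperature, plus what are usually called top-p and top-k), this is a small plain-Python sketch of how they reshape a next-token distribution. It is an illustration under assumptions, not any provider's implementation: `filtered_distribution` is a hypothetical helper, and the order of operations (temperature, then top-k, then top-p on the un-renormalized probabilities) is one common choice that real libraries may vary.

```python
import math
from typing import Dict


def filtered_distribution(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Return the renormalized token distribution a sampler would draw from."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(
        ((tok, e / z) for tok, e in exps.items()), key=lambda kv: kv[1], reverse=True
    )
    # Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}
```

The knobs themselves are portable across providers, which is why swapping looks easy on paper. What does not carry over is the logits: two models given the same prompt produce different distributions, so identical settings can still yield very different behavior.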

SPEAKER_00 52:14

Yeah, for the again, for the longest time, people literally thought about this industry like that. Oh, well, you're just gonna basically hot swap the newest and greatest model, and there's gonna be no loyalty to any model provider, whether it's open source, whether it's closed sourced, you know, who cares? You're just gonna go after and swap in the best one at that point in time. I don't think any of the executives at these model companies are dumb. They don't want to live in that world, right? They want to lock you in and they want to make it as hard as possible for you to swap your model. And so, again, we see that with memory, it's like that's sort of one of the first uh moves we've seen made to try to increase the ability for lock in. And I think it's only gonna get uh sort of more extreme as we move forward.

SPEAKER_01 53:01

Sam, do you have views on, so I had two hard questions for you. The one was gonna be moats for foundational models, the second one was gonna be when AGI. I feel like considering they're hard, you can take whichever one you want.

SPEAKER_00 53:13

I'll take the second one. With O3, I saw all of these "AGI is now" posts, and I was so jealous of people who had that experience because I had the most frustrating experience with O3 possible. I'm not gonna fully describe it, but I basically have one task that I use for every agentic, operator-style model that I can get my hands on, where I essentially tell the model to go to a company's website, go to the team page, and make a spreadsheet with every team member's name and their title. And an old Gemini Deep Research model was terrible, just full-on hallucinations, couldn't figure it out. And then I got to try it with Manus, which was going to solve it, but it was gonna take almost the entire day. And I ripped through my free credits in 10 minutes, so I just couldn't finish it. But it was gonna get there. And then O3, I had to prompt it like six times. It kept failing, it couldn't generate a CSV. I saw other people had this problem too, where you get stuck in a task loop where it just keeps sending you emails that the task is gonna get retried. And so I got like 50 emails in a couple hours. It did eventually get there, but it was still really buggy. And so for me, that doesn't feel like AGI, it wasn't that captivating, magical experience that maybe some other people had. It's still one of the coolest, best models I've ever used. It's amazing, but I'm definitely not saying that AGI is here right now. I do think, though, with the ability to have it learn from all this new, interesting tool-use behavior and figure out how to string together different tools and products and interfaces that it can use, that's where we're running into extremely useful, real-world performant model behavior. And I think that would be getting into AGI territory pretty soon.

SPEAKER_01 55:28

Yeah, I mean, it's kind of funny you bring up the AGI experience. Like I felt like I had that with the original ChatGPT, and then I read Stephen Wolfram's ChatGPT explainer, which was excellent, like two years ago, and I was like, okay, this is just sort of next-token prediction, right? Like everybody knows this. And then it kind of faded a little bit for me. But with O3, I didn't have that, you know, oh shit moment like I had with the original ChatGPT, but the concept of these reasoning traces and just figuring out how to solve problems on its own from start to finish, that to me is obviously way more AGI-like than next-token prediction.

SPEAKER_00 56:09

Yeah, and again, that's where we're gonna get to really, really exciting, near-exponential improvement in model capabilities. I don't know if you read the Era of Experience paper recently, but it's talking about that shift, moving from the more hands-on, human-intervention RLHF environment to just letting these models out into the world and allowing them access to as much data as possible, as many tools as possible, and just allowing them to figure things out themselves with much, much less human oversight. I think that's where we're gonna get some very, very cool innovation happening.

SPEAKER_01 56:51

Yeah, that is that is wild. I so I mean, on the venture side, like knowing what you know now, are there specific areas you're looking to invest in, like whether it's crypto AI or whether it's just full-blown traditional AI, like any underinvested areas that you'd be really interested in meeting teams with?

SPEAKER_00 57:12

So we don't do any straight traditional AI investments with Symbolic. So it's all within the world of decentralized AI, crypto AI. I think there are teams out there right now that are doing incredible work that I'm continually in awe of. Pluralis, Prime Intellect, Gensyn, Nous, Ambient, Exo. I'm sure I'm forgetting a couple in there, but those are just a hell of a list. Some of my favorites, yeah.

SPEAKER_01 57:42

All incredible teams.

SPEAKER_00 57:43

I know and love all of those teams, and they're really pushing the frontier. It's not really a crypto company, but they're doing cool things around financializing GPU compute: SF Compute is really cool. And there's a new team that I've been chatting with who's trying to bring that on-chain, and then also build a really easy-to-use platform, with an easy developer experience, for linking data centers together and doing distributed inferencing and pre-training and post-training. And I love the big-picture kind of stuff that Pluralis is trying to do, fully model-parallel, consumer-hardware pre-training runs. But, maybe because it's easier to imagine a world where they can be realized, there are also the more in-the-middle type companies. So maybe it's distributed but not fully decentralized. You can use stables for settlement, for payment. Those types of companies, I think, are appealing to me because they can be used by anybody right now. But if you want to really swing for the fences, I think the stuff that Pluralis is doing is really, really cool.

SPEAKER_01 59:01

Yeah, he Alex Long is such a great founder. Um, really smart, really strong, like well-researched views. So excited to see um what he comes out with and what else his team brings to the table, which I'm excited for.

SPEAKER_00 59:14

Yeah, and I would just say, what I'm interested in, this modular MoE stuff, I can't get it out of my head. And so if anybody is working on those problems... So Gensyn's actually published on it. Gensyn had this week where they announced a bunch of stuff, and RL Swarm was the flagship thing that people got excited about. But they had another paper, HDEE, which is working on, can we train sub-experts in parallel on heterogeneous hardware and then fit them back together to have this semi-modular model. They published on that. It's really cool stuff. If you're into this, you should go read that HDEE paper. But yeah, if anybody is doing more work there, I want to see it.

SPEAKER_01 1:00:00

Yeah, I actually just had Ben from Gensyn on the pod. I mean, what was your take on RL Swarm from Gensyn? Like, were you impressed? Were you happy? Is it solving a key issue that we just talked about?

SPEAKER_00 1:00:10

I think it's phenomenal. I think it basically proves the thesis that collaboration, like that school idea I was talking about, where you can get models to learn from each other and to say, oh, you know, I'm humanizing them a little bit, but it makes it maybe easier to understand. Models sharing their strategies with each other and saying, oh, actually, the way that you're doing this is the best way to solve this problem. I'm gonna learn how to do it that way. So that cross-pollination, that collaboration leading to more performant models, is the exact decentralized RL thesis, right? The more collaboration you can have on an open platform, the better models you're gonna end up with. And so yeah, I'm a huge fan of what they're doing with RL Swarm.

SPEAKER_01 1:00:55

I I mean I think they had, I mean, I had them on the podcast the end of March, but I think they had they got up to what, like a million, was it a million models or a million users?

SPEAKER_00 1:01:03

Yeah, I don't have the numbers off the top of my head, but I was just super impressed by how many people were actually using the technology that they've created, and the number of models that went up on Hugging Face was really cool to see.

SPEAKER_01 1:01:18

Yeah, it's also, I mean, it's tangential, but I think it's worth mentioning that Nous, Gensyn, Prime Intellect, Exo, all the projects that you mentioned and others that are out there that have incredible teams, all those deep tech or hard tech crypto AI companies are doing this without tokens and they still have all this interest. So it's just awesome to see that organic interest in these projects, right?

SPEAKER_00 1:01:42

Yeah, I mean, because they're solving real problems, right? You know, the teams you mentioned, Nous has been collaborating with Diederik Kingma, who created one of the optimizers that's used in like every single Web 2 training run. Prime Intellect's been working with researchers at DeepMind who created DiLoCo, and have implemented it in the real world. Like, this isn't crypto-land niche stuff that only applies to terminally on-chain people. They're actually solving real problems that are relevant to all of AI. And it just so happens that at some point we might want this tech to exist in the decentralized context. And for that, we're gonna need blockchains to coordinate behavior, potentially incentivize certain types of behavior, or reward people for their contributions. But they're starting with really fundamental, important technological challenges that not just the crypto people, but all the AI people have worked on as well.

SPEAKER_01 1:02:47

That is really cool. Sam, any last thoughts on open source AI, on moats, on China versus the US? I know I'm throwing a lot out there to see what piques your interest, but just curious if you have anything that you feel strongly about that we didn't cover or you want to talk about?

SPEAKER_00 1:03:05

Yeah. I think there's maybe a couple things on my mind. The first would be, very recently there was some talk online about RL not actually...

SPEAKER_01 1:03:21

I'm so glad you brought that up.

SPEAKER_00 1:03:22

Yeah, new behaviors in models. I'm still trying to fully wrap my head around it, but my initial take is essentially, so there's this paper, I think it's out of Tsinghua, but don't quote me on that, essentially showing that there's not a big difference, at least as the authors tried to present it, between the performance of base models versus reasoning models. What it really showed is that reasoning models are more likely to get the right answer on their first try at a problem than a base model. But if you give the model a ton of tries, the base model will roughly get the right answer at the rate the reasoning model does, or in some cases be more likely to get the right answer. And people are sort of touting this as, like, RL's all fake, there's nothing good here. And that's not true. At least right now, what RL does is it teaches models, it elicits a type of behavior, a type of reasoning behavior that we want the model to do. And that reasoning behavior is really good at getting you the right answer fast. So I don't understand how this got construed as RL is dead. RL is definitely not dead. RL got us O3, which is the best model that we've ever had. So I think that's maybe one thing that's been on my mind.

SPEAKER_01 1:04:47

Let me just feed that back to you, Sam. So the thought is that RL can get you the right answer on its first try, but it clearly thinks for longer. But if you ask a base model the question, you know, 10 times, you'll eventually get the right answer somewhere in there. So you could spend the same aggregate thinking time between the two, the 10 tries or the one try, but on the base model side, you don't know which answer is the right one.

SPEAKER_00 1:05:12

Yeah. And so with that, you can imagine a scenario where, if you had all these answers and you had an oracle that is all-knowing and can always pick the best answer from all of the ones that the base model generates, you wouldn't really need RL to do the reasoning behavior. But we don't exist in that world. So there's definitely still a need for RL to elicit this reasoning behavior to allow us to get the right answers at pass@1.
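
The arithmetic behind this exchange can be sketched with the usual pass@k idea: if each independent attempt succeeds with probability p, then at least one of k attempts succeeds with probability 1 - (1 - p)^k. The independence assumption and the specific probabilities below are illustrative only; they are not figures from the paper being discussed.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k independent attempts is correct."""
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical numbers, chosen only to show the shape of the argument.
base_p = 0.2       # base model: right 20% of the time on a single try
reasoning_p = 0.6  # RL-trained reasoning model: right 60% of the time on a single try

base_pass_1 = pass_at_k(base_p, 1)
base_pass_10 = pass_at_k(base_p, 10)
reasoning_pass_1 = pass_at_k(reasoning_p, 1)
```

With these made-up numbers, the base model's ten-try coverage (about 0.89) exceeds the reasoning model's single-try rate (0.6), yet its own single-try rate is only 0.2. Cashing in that ten-try coverage requires the all-knowing oracle to pick the one correct answer out of ten, which is the gap RL closes at pass@1.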

SPEAKER_01 1:05:43

Yeah, it seems like a dumb take to call RL dead. I gotta read the paper though.

SPEAKER_00 1:05:47

So yeah, I mean, I think I shared it with you, but there was one guy on Twitter who was like, you know, "RL is dead, everything's a sham," and then an OpenAI researcher, just for fun, was like, "This must go crazy if you're dumb." Which I thought was pretty funny. So yeah, RL is definitely not dead. There is something very important in there, which I definitely want to make clear in all of this, which is performant base models. You need a performant base model to have great RL. There's this amazing clip of Dario from Anthropic talking about their experiences with RL at Anthropic and at OpenAI previously, and he has this line where he essentially says, you know, the models were too dumb, the base models were too dumb to do RL on top of. And what we found is that the better your base model is, the better results you're gonna get from the RL process. And so there's been, I think, a trend of trying to say that this RL renaissance means the death of pre-training, and that's not exactly true. If we can keep doing better pre-training and getting more performant base models, we're gonna get even better rewards from RL. So we want both of those things to work together in unison. So that's really important too.

SPEAKER_01 1:07:14

Yeah, that is super important. Sam, thank you so much for coming on the show. I'll leave the listeners with this, and I'm absolutely sure we'll have you on again soon for your next post or for a debate with one of our other AI experts. But I really, really think people should check out your post. I'm gonna link it in the show notes, but for those that may have missed it in the beginning, it's called "The World's RL Gym." It's definitely on my top five list of reads in this space, and I definitely think everyone should check it out. So, Sam, thank you for making the time.

SPEAKER_00 1:07:45

Thanks. Really appreciate that. Happy to chat with anybody on any of these topics more. Hit me up on Twitter. I hope you enjoy the piece.

Tommy Shaughnessy

Host