Unpacking decentralized training
My attempt at highlighting the necessity of training LLMs in a distributed / decentralized manner, explaining a web of topics, and articulating the potential synergies between crypto + AI
Where have I been?
First things first, SPECIAL SHOUTOUT to sam lehman, rodeo, haus, yb, smac, ronan, and ibuyrugs for all of the comments, edits, feedback, and suggestions - you helped bring this to life and I really appreciate it.
Also, some of these arXiv links open up as browser-based PDFs, so just a warning in case you don’t want to deal with that.
As I’m writing this it’s been three months since my last post. What have I been up to since then?
I don’t know. I’ve been doing a lot of reading, trying to work out like five days a week, and generally making the most of my last semester attending college.
My mind gets a bit antsy anytime it’s been a month or two since I’ve written a long report, so this is my attempt at returning to baseline and getting back in the groove of things.
If you couldn’t tell by the title, this is a report that’s largely about distributed/decentralized training, accompanied by some info covering what’s been happening in the world of AI, and some commentary about how all of this fits together / why I believe it’s valuable.
This won’t be as technical as other reports that have been written on the subject, and I’m certain it won’t be entirely accurate either.
It will, however, be the most digestible report on the subject that you can find.
Pretty much everything here is explained in short detail and if it isn’t, there’s a hyperlink or two that provide a lengthier explanation.
This is a report about both decentralized and distributed training, which might sound interchangeable, but these are two very different things.
When an AI lab sets out to train an LLM, they’re tasked with managing a number of obligations that contribute to a finished and working LLM.
Researchers and developers have to juggle data collection/curation, pre-training/fine-tuning, post-training, reinforcement learning, and configuration/deployment.
This isn’t the entirety of what goes into building a foundation model, but I split it up in my own way that’s hopefully more easily understood. All you need to know is that LLMs take in massive amounts of data, teams decide on a specific architecture for the model, training and refining follows this, and finally some post-training and polishing comes in prior to a model’s release. Oh, and most LLMs use the transformer architecture.
This process can be generally referred to as centralized training.
Sam Lehman described distributed training as a “process of training via hardware that is not physically co-located” while decentralized training is “similar to distributed training in that the hardware being used for pre-training is not co-located, but it differs in that the hardware being used is heterogeneous and not trusted.”
The distinction is made because even though most of this report references distributed training, there’s a ton of value that can be found in creating and scaling it with crypto incentives, aka tokens. That’s probably what most people reading this will care about.
This concept of paying out tokens to network contributors in exchange for work is very well known and documented.
Even without looking at the more intricate examples seen throughout DePIN (decentralized physical infrastructure networks) you can find this in Bitcoin’s PoW model.
It’s difficult to argue whether or not this model scales for most DePIN projects, but in my view, the stakes and potential economic value of decentralized training are high enough that the behavior can be incentivized for an extended period of time. Compared to many other projects that exist under the DePIN umbrella, decentralized training is easily the most significant.
Put more simply, I’m not worried over a hypothetical scenario where a crypto project miraculously achieves AGI and people don’t want to buy the token or contribute. I’ve seen people do far worse when the stakes are infinitely lower.
Just wanted to get that out of the way, and don’t worry - the differences will be elaborated on throughout the report, so it’s fine if you don’t have the full picture just yet.
I had a lot of fun writing this, so hopefully you have just as much fun reading. I do all of this for free, for some reason. No one paid me to do any of it.
If you enjoyed in any way, please consider subscribing to this blog as well (it’s still free) and/or sharing with a friend or reposting on X.
Have fun.
Defining decentralized ai and articulating the distributed training value proposition
Key takeaways of this section:
Distributed training = geographically separated hardware; Decentralized training = heterogeneous, untrusted hardware
Tokens can power participant rewards in a network, similar to Bitcoin or DePIN projects - distributed training could see massive demand and drive attention back to DePIN & DeAI
Decentralized training can unlock collaboration at a global scale & large-scale computational power at a fraction of centralized training costs
* Note: If you’re only reading this post to learn more about distributed/decentralized training, you can probably skip this section. *
You can’t write about something called decentralized training without writing about crypto, or more specifically, Decentralized AI (DeAI for short).
I’d originally placed this section towards the end of the report but decided it was best to move it to the forefront, before all of the boring stuff.
Want a TLDR?
Distributed training isn’t a complicated science project masquerading as a business opportunity, but an increasingly feasible set of steps towards uprooting how we train AI models.
Not only that, but distributed training offers an alternative to a) the hundreds of billions in big tech capital expenditures on data centers, b) the numerous pieces of pesky middleware designed for localized clusters, and c) ultimately presents an opportunity for the little guy (all of us) to take a crack at building ASI.
As much as the broader crypto community likes to say otherwise, the reality is that crypto needs AI far more than AI currently needs crypto. What do I mean by this?
Some could say it’s because crypto attracts a lower quality pool of developer talent than the traditional AI industry would, resulting in less ambitious and generally more lackluster ideas and products.
Others might say it’s because all tokens that aren’t Bitcoin or Monero are vaporware, so DeAI is no different. You hear this one a lot. It’s most commonly used when memecoin valuations are discussed, but sometimes it extends over into discussions of stickier sectors like DeFi or DePIN and the apps that live in these subsets of crypto.
It isn’t a secret that until recently, there hasn’t been much innovation from the DeAI sector and the countless companies that have raised venture funding on the promises of decentralizing AI through some type of novel, crypto-enabled enhancement(s).
This market map from Galaxy was already crowded in Q1 2024, struggling to incorporate every protocol. If another were made today, you couldn’t even fit 70% of the protocols, let alone arrange them all in a way that’s visually appealing:
Most of what we’ve seen out of these teams can be viewed as a type of preparation for the future - one where AI interacts with blockchains, a world where we’ll suddenly need all of this AI-adjacent, crypto-enabled tech.
But what about the right now?
When I say there hasn’t been much innovation, I’m mostly saying there hasn’t been anything released that’s made an impact on DeAI adoption or the non-crypto AI industry. This is fine, and the intention isn’t to dunk on these projects as it’s likely a handful eventually gain adoption.
What I mean is that as a sector, DeAI is kind of twiddling its thumbs and waiting instead of acting.
These protocols are banking on the fact that AI gets incorporated into every aspect of technology and business - not a bad bet btw, just look at one of a16z’s hundreds of enterprise AI blog posts - but struggle to articulate why they’ve raised money and/or (mostly and) why they’re relevant to the DeAI industry today.
It’s my belief that DeAI has yet to experience any semblance of “takeoff” because a) the usage of blockchains by a large majority of the global population still hasn’t occurred, b) some of the problems being solved in DeAI aren’t entirely necessary at this point in time, and c) a lot of the proposed ideas just aren’t possible.
More than anything, I believe DeAI isn’t catching a ton of attention from outside our bubble because it’s difficult enough to get people interested in anything else involving crypto outside of maybe memecoins and stablecoins.
This isn’t a knock on the industry, just an observation. And it’s pretty obvious. Even something as universally respected (trusted?) as Circle is struggling to sustain the weight of a suggestion it might IPO at a $5 billion valuation.
But in my opinion, the third point (that proposed ideas aren’t possible) has done the most damage to DeAI in its short lifespan.
This is just one example that should be fairly clear to most DeAI researchers or general skeptics, but if you’re attempting to create fully on-chain, fully autonomous agents that interact without human intermediaries, there really isn’t even a centralized counterpart to benchmark your progress against.
In fact, there isn’t even a fully autonomous agent that can interact persistently without human intermediaries outside of the context of blockchains. It’s like trying to build a house on Mars before we’ve even landed any humans there.
Fully autonomous agents have yet to be released or even excessively teased from major AI labs, but we saw coins like ai16z and virtuals reach peak valuations of over $2.6 billion and $4.6 billion, respectively.
There were a number of agentic frameworks being pushed by these projects as well, but very little came out of them (imo). This isn’t me trying to be overly negative - as it was a lot of fun trading these coins for a while - but none of this really contributed anything to the non-crypto AI industry.
The frameworks proposed by these web3 teams haven’t gone on to gain adoption from Anthropic or OpenAI, or even the broader open source community.
Even worse than not gaining traction is a potentially ugly truth that all of these antics only reaffirmed web2/TradFi/big tech’s collective belief that crypto still remains a fundamentally unserious space.
Maybe the frameworks don’t suck, and the marketing is just poor because these projects launched tokens - which can stand out as a negative to those outside the industry - but it’s hard to believe something supposedly so innovative wouldn’t be adopted solely because the founding teams decided to launch a token.
“Every agent I know, know I hate agents.” - Ye, the artist formerly known as Kanye West
From some basic digging and general interacting online, things like MCP (model context protocol) have seen an infinitely larger adoption rate than these frameworks, with some even claiming MCP has already won. Why is that? Well, it works, it’s (mostly) free, and people enjoy software that they can incorporate into their day-to-day lives, with apps they already use.
What do people get out of agent frameworks? More often than not, literally only the ability to “build” or deploy more agents, with this description already being a stretch in 99% of web3’s cases. Most people don’t want to buy our coins, so what value do you imagine they’d get from deploying agents that have nothing to do with workflows and everything to do with launching new tokens?
* Note: No shade to @diego_defai it’s just that yours was the easiest thread to find and popped up first. *
But what even is decentralized AI, and why are we being told we need it?
Lucas Tcheyan wrote in 2024: “The driving force behind ongoing experimentation and eventual adoption at the intersection of crypto and AI is the same that drives much of crypto’s most promising use cases - access to a permissionless and trustless coordination layer that better facilitates the transfer of value.”
Sam Lehman wrote a section in his report about crypto-enabled incentives, pointing out “crypto has shown that decentralized networks can achieve massive scale through providing thoughtfully designed incentives.” I mean, just look at Bitcoin.
Even if we can be honest with each other and admit the Bitcoin model is at the very least a bit weird on paper, this does not discount the fact that net-new incentives (receiving BTC in exchange for work) changed the world and propelled us into a timeline where the United States government is actively exploring a strategic reserve for BTC.
This mindset has also been the guiding belief or modus operandi (if I’m allowed to get fancy with it) behind decentralized physical infrastructure (or DePIN for short) which 0xsmac and myself wrote about back in September 2024.
We have a few different definitions of what Decentralized AI is, but nothing is definitive. This is understandable considering it’s a nascent sector within an already somewhat nascent industry, but we should at least be able to identify the 5 W’s of DeAI - the who, what, when, where, and why.
Who is going to use this? What problems are better solved with the integration of crypto? When will this be used? Where would a product like this capture the most attention or largest user base? Why does this need venture funding (jk) and/or why does it need to exist?
In my opinion, Vincent Weisser of Prime Intellect lays out the challenges & problem areas succinctly for almost anyone to understand:
Vincent also provides a list of potential use cases for DeAI and what can/should be built. I won’t drone on about all of them, but it spans almost every layer of the stack and sums up the sector in a way that hasn’t really been done.
Distributed (or P2P) compute networks, decentralized/federated training methods, decentralized inference, on-chain agents, data provenance, on-chain verifiability, and a handful of others.
DeAI is more than just the compute that trains models, scraped data that gets purchased by large labs, or services that verify model outputs are correct. It’s an entire ecosystem of product innovations built to disrupt an industry that’s almost perfectly suited for decentralization.
It seems most in the industry are attracted to the challenge of decentralizing AI because they love decentralization, but more than that, it’s a pressing issue for a lot of humans.
If AGI or ASI ends up in the hands of a single entity, that isn’t really fair.
It would suck.
None of us would be able to fully take advantage of these superintelligent, digital aliens, because corporations would own the model weights, code, bespoke training methodologies, and technology used to create these models.
Assuming someone like OpenAI or Deepseek gets to it first, it actually becomes a major national security threat, too, if it hasn’t already.
If distributed training works at scale (which we’re already seeing) and integrates with other DeAI tech like zero-knowledge proofs or other privacy-preserving mechanisms, maybe we’ll have a good chance at defending against a monopoly on superintelligence.
In a world where distributed training researchers continue to understand an entirely new set of scaling laws and subsequently scale up distributed training operations, it’s unlikely we ever turn back and optimize for more localized training methods of the past.
If you’re a large lab or big tech corporation like Google / Meta / Amazon, it’s in your best interest to research distributed training and make it a priority. Dylan Patel spoke about this in 2024, but if you still want further confirmation that this is actively being explored by big tech companies and major players, consider that the DiLoCo paper was written by DeepMind (acquired by Google for $650 million in 2014). It’s also worth mentioning Dylan Patel wrote about multi-datacenter training here.
Rodeo pointed something out to me that feels quite obvious in hindsight - the smartest minds and the largest tech companies in the world are actively pursuing how to create a massive network of nodes through decentralized principles.
Doesn’t that sound familiar to you?
If you had to argue one thing that Bitcoin did in its almost two decades of existence, it’s proven that when a decentralized network of individuals with aligned interests are given the proper incentives, legitimate change can happen.
First we decentralized money, and now we can leverage this experiment to decentralize intelligence. The odds are stacked against everyone working in this field, but you could have argued the same in Bitcoin’s early days.
A comparison could be made between the earlier days of Bitcoin adoption and the current DeAI community, though there are a number of differences, most notably a broader, more provable market demand and the presence of venture funding which doesn’t really signify we’re “early” as it once was with Bitcoin.
And the benefits of distributed / multi-datacenter training aren’t exclusive to big labs, either, but actually the complete opposite.
A technical innovation like distributed training makes it possible for groups of individuals from anywhere on the globe to pool their resources and train competitive models. Minimizing communication requirements is just one part of the equation.
What about lowering the hurdles for at-home training with consumer hardware?
What about using a token as an initial wedge for bootstrapping innovation without significant capital outlays?
This will be covered later on in some short analysis of Exo Labs’ work, but here’s a recent tweet from Alex Cheema describing this exact concept in relation to Apple’s M3 Ultras and the new Llama models from Meta.
Distributed training doesn’t just unlock more efficient training, but an entire global community of researchers, hobbyists, and enthusiasts who were previously locked out of working on frontier models. What happens when a few dozen individuals with hundreds or even thousands of GPUs are given the golden ticket to competing with centralized frontier labs?
Overview of some AI basics, compute, and scaling laws
Key takeaways of this section:
Modern AI training relies on GPUs for data parallelism, making them an industry bottleneck and simultaneously a very hot commodity
Increasing compute and data typically leads to higher performance but scaling compute cluster sizes introduces its own set of challenges
DeepSeek’s progress has shown creativity in model creation (not just more GPUs) and proven you can achieve state-of-the-art results at lower cost with a bit of outside-the-box thinking
Centralized training is very expensive and difficult; distributed training is too, but has more positive externalities if executed correctly
It’s best to start with a refresher on what’s been happening across the AI industry and use this as a jumping point for the more complex topics that follow.
Most reading this are hopefully somewhat plugged into what’s been happening with recent LLMs (Sonnet 3.7, GPT 4.5, Grok 3), the Magnificent 7’s AI expenditures, and the increasingly capable models being released almost every week.
There are some good reports that describe the work that goes into training an LLM, so I’ll reference a few of these throughout:
Training LLMs is a very capital-intensive venture, and you can see below just how much big tech companies have spent on infrastructure. The details will be covered shortly, but most (if not all) of this goes towards things like GPUs, data center buildout, maintenance, and other hardware requirements that contribute to the final product.
By the way, this list is limited to just three big tech corporations:
You might wonder why GPUs are used instead of CPUs, or even what the difference between the two is.
Citrini highlighted that the distinction between GPUs and CPUs comes from which type of parallelism is used for computation. GPUs are optimized for something called data parallelism while CPUs are better suited for task parallelism.
The machine learning industry realized that GPUs - initially designed for rendering graphics - are also quite good at performing calculations rapidly. I won’t get into the weeds on the speed of things, but they’re very fast.
Data parallelism is a process where “the same operation is performed on many data elements in parallel” while task parallelism is when “different operations are performed on the same or different data.”
For training an LLM, data parallelism just makes more sense due to the highly repetitive nature of parsing large datasets and performing simple operations on them, which is why GPUs became and remain such a hot commodity.
Something like task parallelism didn’t make sense because AI datasets are highly variable - you wouldn’t want to over index on a single piece of data within a massive set because you’d never finish training a model or it would take so long that it’d be costly and/or highly inefficient.
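If it helps, here’s a tiny toy sketch of the difference - nothing to do with real GPU kernels, just the concept of “same operation, many data chunks” versus “different operations at once”:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.random.rand(8, 100_000)

# Data parallelism: the SAME operation (a sum) applied to many slices at once.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(np.sum, data))       # one worker per slice of data

# Task parallelism: DIFFERENT operations running side by side on the same data.
with ThreadPoolExecutor() as pool:
    total = pool.submit(np.sum, data)
    spread = pool.submit(np.std, data)
    results = (total.result(), spread.result())
```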
People like to say the word compute, and they’re referring to GPUs when they do this. If someone asks “how much compute does Meta have” or “how much is Elon spending on compute next year” they’re talking about GPUs.
The Carnegie Endowment wrote a nice summary on what compute means, how it functions, and why it’s mattered so much. It’s helpful if you’re still a little lost and want a more general overview before reading the rest of this.
Compute has been the main focus of AI labs because of something known as a scaling law, particularly the power-law relationship or correlation between more performant models and the larger numbers of GPUs and data that go towards training them.
To be precise, the specific law being referenced here is referred to as the pre-training scaling law. The graphic below goes a bit beyond that, but I found it helpful for framing where we’re at in model development today and where we’re headed:
Brief aside, but OpenAI’s 2020 paper on scaling laws is said to be one of the more foundational analyses on the relationship between compute, data, and model parameter count.
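If you want the rough shape of the relationship that paper describes, it looks something like this - a hedged paraphrase of the compute term’s power-law form, not the exact published fit:

$$
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
$$

where L is pre-training loss, C is total compute, and C_c is a fitted constant. More compute means lower loss, but with diminishing returns, since the exponent is tiny.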
Scaling laws have held up.
It’s difficult to find accurate GPU counts for newer models, but here’s a rough estimate of scaling laws in action for some of OpenAI’s models over the years:
GPT-1: 117m params & about 8 Nvidia V100s
GPT-2: 1.5b params & tens to a few hundred Nvidia V100s
GPT-3: 175b params & 1k-2k+ Nvidia V100s
GPT-4: Trillions of params & 8k-30k Nvidia A100s/H100s
You might remember Sam Altman calling for trillions of dollars to build larger and larger data centers, or the proposed $500 billion Stargate project, or even Zuck’s 2GW+ data center ambitions - these are initiatives that came about due to a (perceived) need for extremely large, power-hungry data centers.
It was actually announced on March 31st that OpenAI completed a new funding round, receiving a $40 billion capital injection (with 75% of this coming from Masayoshi Son and SoftBank).
Since scaling laws remained in the picture for so long, everyone that wanted to build a good model was forced to amass larger and larger amounts of compute, as well as even more performant types of compute (aka better GPUs). Most of these come from Nvidia, though it’s worth exploring the potential of Apple Silicon.
Everyone became trapped in a massive race to buy up these GPUs and train larger models, but things have gotten complicated. Models get smarter when you train them with more GPUs, but it becomes increasingly difficult to train them because of disturbances, errors, cooling requirements, interconnects, and a bunch of other issues.
Later sections will cover more of the details, but most of these training algorithms are already quite capable and bottlenecks exist almost entirely in the implementation and scaling phase. It’s already possible to achieve a fully distributed training run, the only challenge remains in taking this from 0.5 → 1.
Distributed training is actually a step towards getting around a lot of this, which would be huge.
If we can eventually train state-of-the-art models across a number of distinct data centers, from a variety of continents and countries, all without these burdens, we could get even better models with less hassle and far more performant training runs.
That’s why it matters so much - it can be just as good as centralized training if proven scalable, but better in almost every other way if successful. And if you think about it, these centralized corporations and labs have to bend their operations towards the trend of distributed training, not the other way around.
If you already own a large data center it’s difficult for you to work backwards and redesign the infrastructure to accommodate distributed training methods. But if you’re a smaller, scrappier team of researchers that set out from day one to pioneer work on distributed training, you’re much better positioned to benefit from the technology.
Epoch AI wrote a report on scaling back in 2024, describing not only traditional (compute-centric) scaling laws, but some of the other potential bottlenecks that might plague labs in pre-training runs to come (which will be covered).
The most important thing to highlight here is that the number of GPUs or the size of a data center isn’t the only bottleneck. Beyond just acquiring these GPUs - which is difficult enough - labs need to stress over power constraints, the latency wall, chip manufacturing capacity, and even geopolitical tensions.
And this is just the laundry list of concerns for centralized training runs - distributed training has its own set of issues, mainly getting around the communication bottleneck and scaling training runs.
Many of the other constraints are relevant to distributed training because of the obvious reality that distributed training is inherently sensitive to factors like geography, location, and - not sure if this is a word - locality.
Distributed training isn’t just the study of how to train models residing in multiple locations, but an all-encompassing field that takes the most difficult problems in centralized training and pairs them with even more challenging, unproven theories from distributed training research.
That’s one of the reasons this topic has stood out to me so much - the stakes are incredibly high and this is one of those areas where so many disciplines overlap, it’s almost impossible to get a full picture of what’s occurring.
If you think about significant leaps in technology across time, distributed training fits the bill and deserves a chance at success even if there aren’t currently any tokens for me to shill in this.
This idea of scaling laws “ending” or experiencing diminishing returns has been greatly contested, and it isn’t really my place to offer an opinion because for the most part, no one is entirely sure.
Beyond pre-training scaling laws, there’s much to be said about post-training and test-time compute (TTC) laws. Post-training is concerned with topics like fine-tuning, reinforcement learning, and some other more advanced mechanisms that are covered in the next section.
TTC, on the other hand, is a lot more complex.
But am I the one to write about these? This report has been nothing short of exhausting to write, as it’s felt like I’m constantly taking one step forward and three steps back, struggling to make sense of new information or learning I wrote an entire section and was sadly mistaken about all of it. I’ve struggled immensely, but for what cause?
I don’t even make any money from writing these.
Keeping it short, post-training laws are currently in vogue thanks to the ludicrous rate of improvement measured in OpenAI’s “o” models relative to GPT-4 and the non-reasoning models released in the years before:
Post-training research is red hot right now because it’s working (duh) and it’s a more cost-effective way of scaling model performance, assuming you’re a large lab that already has the GPUs. Put simply, post-training is majorly additive and has the potential to redefine how large labs are pushing towards AGI.
I mentioned how reinforcement learning in combination with reasoning models has certainly challenged the industry’s perception of scaling laws, but it hasn’t necessarily defeated any arguments against these scaling laws holding.
If anything the advancements being made in post-training only stand to benefit the entire lifecycle of creating a model, as this new data can eventually feed into better models. There might come a time where 99% of the innovation in model creation and curation comes from post-training optimizations, if this isn’t already being pursued.
But that’s enough for now. I’ll take a few steps back and run through the pre-training process and some of the more critical functions aside from GPUs.
Compute is obviously crucial to a training run, but like I hinted at earlier, there’s an entirely separate set of storage, memory, energy, and networking requirements that are just as important as GPUs.
Energy: It should be obvious that large data centers require large amounts of energy, but what about cooling infrastructure? What about actually sourcing the necessary energy requirements and ensuring consistent power output?
Storage: LLMs are made up of large datasets and parameters, so you can imagine the storage requirements for this are high.
Memory: Pre-training runs can take a while, and sufficient memory is needed to maintain state across GPUs and nodes.
Networking: Citrini’s interconnects report gives you more info than you’ll ever need to know about networking, but data centers need high-speed and low-latency interconnects to actually facilitate a run.
All of these models are pre-trained with interconnected, large, geographically constrained clusters consuming large amounts of energy, comprised of expensive and highly capable technologies.
Tens of billions of dollars have gone towards data center buildout, fundraising rounds for labs, and countless other expenditures as companies have thrown their hats into the ring in the race towards superintelligence.
But things got complicated earlier this year.
DeepSeek-R1 and its accompanying paper were released January 22nd, 2025 and remained somewhat under the radar for maybe a week before everyone caught on. Unless you’ve been on a digital hiatus or have a poor short term memory, R1 was a massive shot from left field for almost everyone in the industry.
It’s been said that R1 was trained with 2,048 Nvidia H800 GPUs which comes out to roughly $61 million worth of GPUs assuming a cost of $30,000/GPU - give or take $5,000 depending on where and when DeepSeek acquired these. However, there’s also a difference in reporting based on numerous internet sources for the above and a report from semianalysis which estimates 10,000 H800s and 10,000 H100s.
I think regardless of the actual number of GPUs used to train the model, what DeepSeek has been able to accomplish is the true story here. Not its supposed cutting of costs or ability to dodge GPU import regulations, but creativity in model construction and reinforcement learning advancements.
The news about DeepSeek’s GPU shenanigans came as a shock to many, considering every major lab had prioritized amassing more and more compute in the past 2-3 years and almost zero indication this wasn’t the “correct” way to build highly capable models. DeepSeek’s process and strategies will be covered with more detail in the following section.
Here are some of the other foundational models and their respective costs, not accounting for training times or other hang-ups of the pre-training process:
OpenAI’s GPT 4o: 25,000 Nvidia A100 @ $8-20k/GPU
xAI’s Grok 2: 20,000 Nvidia H100 @ $25-30k/GPU
Google’s Gemini 2.0: 100,000 Trillium chips @ $2.7/hour/chip
Meta’s Llama 3.1: 16,000 Nvidia H100 GPUs
Anthropic’s Claude 3.5 Sonnet: unspecified but estimated tens of thousands
OpenAI’s GPT o1: unspecified but supposedly very many GPUs
* Note: I wanted to include citations here but there were too many different resources used, and as I’m editing this, it would take too much energy to go back and find these. Sam Lehman also pointed out to me that employee salary + compensation can factor into these costs, so that’s worth considering if you want to explore absolute cost of training runs. *
Even though we don’t have costs or GPU counts for some of the older models (and for many of the newer models like Claude 3.7 and GPT 4.5, understandably) we can assume these have stuck to the scaling laws of AI and amassed larger and larger amounts of GPUs or more performant GPUs.
This is a good spot to mention that not all pre-training runs are created equally.
The Llama-3 technical report is a good resource for understanding just how many variables go into this, and the table below shows how easy it can be for something simple to trip up a run or create an issue leading to idle training time.
As you can see from the list, it could be an issue related to GPUs, networking, dependency, maintenance, or even something unknown - you can’t count anything out. Just owning the GPUs doesn’t give you the golden ticket to a flawless pre-training run.
I could take time here to examine some of the proposed equations for measuring training efficiency, like MFUs, MAMMFs, SFUs, and Continuity, but Ronan already did a good job on that and it might drag this report on even longer than it should already be.
TLDR?
Many different variables go into determining the efficiency of a training run, involving both software and hardware, though most of it comes down to FLOPs and some fairly lengthy ways of measuring them.
Anyways.
This next section will expand on our knowledge of LLMs and dissect the training process, particularly the post-training phase and some of the innovations occurring here.
Exploring reasoning models and reinforcement learning
Key takeaways of this section:
Reasoning models have quickly taken over as the dominant structure for newly released state-of-the-art models (across almost every lab)
Reinforcement learning is a very technical problem space that’s quickly becoming one of the major vectors for innovation in model optimization
DeepSeek did a bunch of impressive shit and deserves a lot of credit for pushing the boundaries of model design at a time when many labs were seemingly plateauing
We can pivot here to the more recent popularity and deployment of reasoning models, which have proven themselves to be extremely competent and even led to Sam Altman claiming these models are OpenAI’s focus for the foreseeable future (after GPT-4.5):
Reasoning models are a unique type of language model trained with reinforcement learning to perform more complex reasoning, models that can think before producing an output. These reasoning models were developed to better resemble humans and the ways we solve problems in our everyday lives, producing a chain of thought detailing their internal ideas prior to answering a user’s query. Here’s what it looks like:
Sebastian Raschka wrote in this report that there are two main methods of improving reasoning models: increasing training compute or increasing inference compute, the latter also referred to as inference-time scaling or test-time scaling, with the distinction coming down to when the scaling happens. In this case, inference-time refers to a period occurring after the training has already been done.
Ronan’s report highlights an underappreciated aspect of scaling under the reasoning paradigm, referencing a tweet from samsja over at Prime Intellect:
It’s my fault for not explaining the whole forward/backward pass thing sooner, but now is as good a time as ever to use it in support of the distributed training thesis.
A forward pass occurs when a neural network processes data inputs, working layer by layer and running the model forward from input → output. Backward passes are calculations measuring how far off a model’s output was from the supposed correct answer, with this info getting routed backwards through the model to inform which weights need adjusting.
The fun part with reasoning models and improvements in the post-training phase stems from the reality that these processes are inherently less intensive in communication requirements than the pre-training phase. Samsja points out that RL and normal training differ by an order of magnitude in forward pass count.
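If you want to see a forward and backward pass in the flesh, here’s a minimal PyTorch sketch - a toy model and made-up data, not anything any of the labs mentioned here actually run:

```python
import torch
import torch.nn as nn

# A tiny stand-in for a "model": two layers, nothing fancy.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 16)          # a batch of toy inputs
y = torch.randint(0, 4, (32,))   # toy "correct answers"

logits = model(x)                # forward pass: input -> output, layer by layer
loss = loss_fn(logits, y)        # how far off the output was
loss.backward()                  # backward pass: gradients flow back through the layers
optimizer.step()                 # weights get nudged using those gradients
optimizer.zero_grad()
```

In a data-parallel run, it’s the gradients produced by that `loss.backward()` call that have to be shared between devices, which is why forward-pass-heavy workloads (like RL rollouts) are friendlier to slow connections.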
Which kind of brings us to DeepSeek.
DeepSeek-R1 came as a shock because the 2,048 H800s came across as small by comparison, made possible through a few different techniques used simultaneously:
Mixture-of-Experts (MoE)
Multi-Head Latent Attention (MLA)
Supervised Fine-Tuning (SFT) & Reinforcement Learning (RL)
* Note: You might remember hearing about MoE from my November 2023 report covering Bittensor. *
Putting all of these techniques together resulted in a model that beat out many of the most highly capable, commercially available LLMs, which sparked speculation and debate (back in January) that DeepSeek-R1 was the best state-of-the-art model currently live:
Haus explained to me another, more underappreciated engineering feat pulled off by DeepSeek. Here’s a short explanation straight from the source (Stratechery):
“Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.”
I did a little more digging and realized this gets very into the weeds, but it’s super impressive how DeepSeek was able to innovate not only on the software side (MoE, SFT, and RL stuff) but extremely challenging hardware problems like the PTX conversion from CUDA, a parallel computing platform created by Nvidia.
It’s my understanding that most model development and GPU-related work makes use of CUDA, if only because it’s the standard to use with Nvidia GPUs (which are the most widely demanded by far).
If there’s one thing to learn from DeepSeek, it’s that hard problems are meant to be solved and the entire industry is still discovering newer and more crucial hard problems to solve.
This entire moment is important because for the first time in a while, the AI community woke up and learned that maybe just throwing more compute at new models isn’t the most optimal way of scaling model performance.
What if more creativity throughout the model training process was the key to unlocking a more likely path to AGI? What if the secret sauce wasn’t so obvious?
This logic will eventually be extended out to the idea of training models in a distributed manner, specifically challenging the belief that GPUs must be located in a singular geographic location in order to train a SOTA model, along with some ideas concerning whether or not the amount of GPUs being used is even necessary, or if there’s a way to combine these scaling methods and achieve something truly exceptional.
There are a few other ways to further improve reasoning capabilities, including inference-time compute scaling, pure reinforcement learning, RL + SFT, and Pure SFT - I won’t get too into the weeds on any of this. Like I said in the previous section, we have a few of these scaling laws that are all equally seeing their fair share of advancement, and the industry is still learning how to make sense of them in tandem.
Sebastian provides a list of reasoning papers, showing just how creative researchers are getting. Just to showcase how strange all of it gets, here are some snippets if you want to look deeper down the rabbit hole:
Terms like underthinking and methods like the Thought Switching Penalty are used to help reasoning models “think better”
Researchers explore the associative memory of LLMs and its effects on information absorption
Backtracking methods get used to help reasoning models explore alternative solutions
It’s important to cover reinforcement learning, a subset of machine learning / AI concerning the relationship between an agent and its environment.
Reinforcement learning is a process where rewards and penalties are assigned to an agent for its behavior, gradually steering the agent towards the correct answer(s).
Even though the definition refers to the model as an agent, this isn’t necessarily the same thing as what people describe when referring to autonomous agents. In case it comes up again, just imagine an agent as a model that receives an input from a user and desires to provide the correct output.
When combined with reasoning models, reinforcement learning minimizes the need for human feedback (or RLHF) in the process of training new models. Instead, these models iteratively improve, reasoning through sets of problems in a step-by-step approach to obtain correctness.
And while this does reduce the need for a constant human-in-the-loop, it’s important to say here that some human feedback still makes sense.
With the training of DeepSeek-R1-Zero (as pointed out here by Sam), which notably used zero human data and avoided supervised fine-tuning, reinforcement learning carried them to the finish line but ultimately needed improvement after “it had issues with generating human-readable outputs and would often mix its output languages.”
On a fundamental level this should make sense, as a consumer-focused chatbot should ideally avoid mixing languages for whatever target audience it’s communicating to. This is just one example but it shows us that even with advancements being made, we’re still in the early innings of cracking post-training and removing human intervention entirely.
It’s okay to be confused. None of this is extremely necessary towards developing an understanding of distributed training, it’s only additive and used to provide an explanation for what the broader industry is thinking about when training new models.
I’m trying to paint a picture of the current model development landscape and how these larger labs are currently reasoning (lol) about the types of models to come.
And it’s not just a bunch of hype and big words - the results speak for themselves:
The above chart comes from a December 2024 evaluation from ARC, a team working on tests and experiments to measure existing model capabilities to better document the arrival of AGI. OpenAI’s initial o1 models did very well, but the spread between even o1 high and o3 low represents a difference of 44% on ARC’s score.
This caught a lot of attention for obvious reasons, but we can compare these reasoning models to non-reasoning models to even better visualize the differences in performance:
Every single model on this leaderboard possesses some type of reasoning ability. We’ve rapidly entered a scenario where each subsequent, state-of-the-art model from a major lab will be a reasoning model.
When we explore distributed training methods in the next section, all of this will become relevant as the advent of reasoning models actually benefits these methodologies and has the potential to make distributed training even more efficient.
One of the more interesting side effects of each new model possessing reasoning capabilities is the realization that existing models can generate their own data.
Aligned wrote: “In reinforcement learning, the model learns by trial-and-error and feedback, generating its own training signals. This diminishes demand for conventional large-scale annotation projects.”
They also point out DeepSeek-R1 and its two-model process, which took one model’s RL-generated reasoning traces as training data for the second model. This sounded pretty significant, considering the previous charts detailing new model intelligence and the consistent performance improvement with each subsequent reasoning model.
If models keep getting smarter and are trained with data generated by themselves, it makes sense we could eventually achieve some type of almost flawless model that everyone agrees is AGI.
It was once believed that the real bottleneck facing model development concerned data, but this wouldn’t be an issue once models can think for themselves and use their own reasoning to learn from mistakes and improve recursively.
However, I’ve also been told data remains an issue and we still need higher quality data, though this feels like less of a problem compared to compute scaling laws and how much money that requires. I’m sure the data problem will remain, but I don’t see how it’s a bottleneck when these new models are doing all the work - if anything it only requires a bit more infrastructure to produce the data at scale.
Sam Lehman points out in another report (I like this one a lot) that DeepSeek’s great leap forward didn’t just showcase the power of ingenuity in model development, but the need for “vast amounts of reasoning data and environments in which to generate this data.”
He proposed the diagram below as a potential method for gathering up this data:
Another aspect of reinforcement learning that caught my attention comes from the distinction between GRPO (group relative policy optimization) and PPO (proximal policy optimization).
PPO is comprised of a policy model, reward model, and value model. The post-training process under PPO relies on using the reward model (trained on human preferences) to steer the policy model (the one being trained) towards a more ideal state. The value model is “a neural network that estimates the expected sum of future rewards for a given state” (via sam) and acts kind of like a tutor or a critic, observing the work done by policy and reward models.
GRPO is a method of RL post-training that simplifies things by only utilizing the reward model and policy model. DeepSeek-R1 made use of GRPO to reduce memory and compute overhead in the post-training process, which, as you might guess, is quite valuable in this dynamic where we’re creating distributed training algorithms that do exactly that.
DeepSeek’s ability to make post-training less reliant on a complex system like PPO and achieve basically the same results for less work is massive.
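To make the “group relative” part less abstract, here’s a rough sketch of the core idea - just the advantage calculation and a policy-gradient style loss, leaving out the clipping and KL penalty the real objective uses, and with made-up rewards:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Score each sampled answer against its own group, so no separate
    value network is needed as a baseline (the thing PPO's value model does)."""
    mean = rewards.mean()
    std = rewards.std() + 1e-6               # avoid dividing by zero
    return (rewards - mean) / std

# Sample a group of completions for one prompt, score them, and weight
# their log-probs by how much better or worse they were than the group average.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0])   # e.g. 1 if the math answer was right
log_probs = torch.randn(5, requires_grad=True)       # stand-in for per-completion log-probs
advantages = group_relative_advantages(rewards)
loss = -(advantages.detach() * log_probs).mean()     # nudge the policy toward above-average answers
loss.backward()
```

The whole trick is that the baseline comes from the group itself, which is why GRPO can drop the value model and save the memory that PPO would spend on it.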
Instead of trying to fully jump into distributed training for reasoning models with reinforcement learning implemented as well, the idea is to leverage the strengths of reinforcement learning to bolster future development of more capable models.
If Sam’s thesis plays out and teams begin to develop these environments purpose-built for generating reasoning traces at scale, then GRPO + decentralized synthetic data generation might quickly become the next big thing - not only in the DeAI space, but for the entire industry of researchers and teams building superintelligent models.
It’s a cool problem space, and one of the first genuinely new ideas I’ve found involving decentralized training and crypto since I started writing this.
In the first section I listed some of the key problem spaces and infrastructure being built to fill these gaps. Sam proposes an entirely new problem space, one that crypto incentives might be best suited to fix.
Maybe in the future I’ll tackle some of the specific methodologies that go towards boosting reasoning model efficiency, but that might make this even more of a slog to read through. For now, I’ll move on and explore the reason(s) you came here in the first place.
Understanding optimizers, parallelism and distributed training methods
Key takeaways of this section:
Three main methods of parallelism are used - data, tensor, and pipeline parallelism - serving a unique purpose depending on the task at hand
Optimizers are a very crucial piece of the tech stack, as many distributed training algorithms manipulate these or work to further optimize them
DiLoCo, OpenDiLoCo, DisTrO, DeMo, DiPaCo and SWARM Parallelism are covered in detail
DiLoCo and the Streaming DiLoCo paper optimize for more infrequent sync to achieve reduced bandwidth
DisTrO optimizes for smaller parameter changes
DeMo deals with cutting out momentum states
SWARM Parallelism & DiPaCo are modular pieces of infrastructure for other distributed training algorithms
Hopefully you understand why it was necessary to write about all of that. Even if you’re an avid user of LLMs and generally aware of tech trends, the following section is somewhat dense.
Unfortunately there’s still a bit of preliminary homework required before getting into DiLoCo, Distro, DiPaCo, and all of the other mechanisms.
Specifically, we need to talk about parallelism and optimizers.
Luckily these aren’t as strange as they sound, though they’re heavier on technical jargon. When the word parallelism is used, it’s referring to the act of using multiple processors or cores simultaneously to accomplish a task - similar to when we compared GPU and CPU differences in parallelism efficiencies (data versus task).
Parallelism can come in many forms, and the three most common - or most relevant to this report - are data parallelism, tensor parallelism, and pipeline parallelism.
This stems from the reality that model training involves large amounts of data and many parameters, with these two hurdles necessitating a few different methods of handling them as efficiently as possible depending on the task.
It’s important to highlight that the types of parallelism being discussed all come with their own sets of tradeoffs.
There isn’t a singular approach that’s definitively better and there isn’t one that’s definitively worse. Here’s a helpful graphic showing each of these processes side-by-side:
Data parallelism involves keeping identical copies of the entire language model across a number of GPUs, with each device performing their own calculations and computing a different batch of data. After each “training step” the model weights are synced across all devices to ensure consensus, repeating until the work has been considered complete.
Ronan pointed out that data parallelism has low communication requirements but comes with high “on-chip” memory demands, due to storing the full model on each device rather than splitting it like the other techniques. This is fine if speed or accelerated training is the goal, but poor for “extremely large models” as you would probably struggle to fit a high parameter model onto every device.
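Here’s a stripped-down sketch of what that per-step sync looks like, simulating the gradient averaging (“all-reduce”) by hand instead of using torch.distributed - purely illustrative:

```python
import torch
import torch.nn as nn

replicas = [nn.Linear(8, 2) for _ in range(4)]        # identical model copy per "GPU"
for r in replicas[1:]:                                # start everyone from the same weights
    r.load_state_dict(replicas[0].state_dict())

batches = [torch.randn(16, 8) for _ in replicas]      # each device gets a different batch

# One training step: local forward/backward on each replica...
for replica, batch in zip(replicas, batches):
    replica(batch).sum().backward()                   # toy loss, real runs use a proper one

# ...then average gradients across replicas (the "all-reduce" / sync step).
for params in zip(*(r.parameters() for r in replicas)):
    avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg_grad.clone()
# Every replica then applies the same optimizer step, keeping the weights identical.
```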
When I discuss DiLoCo later on in this section, the topic of reduced communication overhead is central to the argument. Data parallelism pairs well with DiLoCo because of this shared periodic weight synchronization.
Tensor parallelism works to solve the memory issues created by data parallelism, splitting the model vertically and separating weights and activations (outputs) between devices. This makes it so each device takes responsibility for completing a portion of the operations, rather than the whole thing.
However, because these devices are now responsible for less work, communication requirements increase with tensor parallelism and devices must synchronize more frequently. Comparing tensor parallelism to data parallelism is easier if we just use the DiLoCo applicability again.
Because tensor parallelism regularly exchanges partial outputs of its split devices, communication requirements are increased significantly more than data parallelism. If paired with DiLoCo, we’d just be shooting ourselves in the foot trying to make it fit.
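And a toy sketch of the vertical split, with everything on one machine just to show the mechanics: one weight matrix cut into column shards, each “device” computing its slice, and the partial outputs getting stitched back together (which is where the extra communication comes from):

```python
import torch

hidden, out_dim = 16, 8
x = torch.randn(4, hidden)                 # activations entering the layer
W = torch.randn(hidden, out_dim)           # the full weight matrix

W_shard_0, W_shard_1 = W.chunk(2, dim=1)   # column-wise split across two "devices"

partial_0 = x @ W_shard_0                  # device 0 computes half of the outputs
partial_1 = x @ W_shard_1                  # device 1 computes the other half

y = torch.cat([partial_0, partial_1], dim=1)    # the communication step: gather the partials
assert torch.allclose(y, x @ W, atol=1e-5)      # same answer as the unsplit layer
```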
Pipeline parallelism divides the model horizontally (rather than vertically or not at all) across multiple devices in order to create a sequential, assembly-line kind of system.
A device processes the layers assigned to it, handing off activations to the next device, and so on. This reduces memory required per device but also introduces pipeline bubbles or idle times that are pretty much exactly what they say they are.
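Same deal for the horizontal split - each “device” owns a few layers, chews through micro-batches in order, and hands its activations to the next stage (the waiting around between hand-offs is exactly the bubble problem):

```python
import torch
import torch.nn as nn

stage_0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # layers owned by "device" 0
stage_1 = nn.Sequential(nn.Linear(32, 4))                # layers owned by "device" 1

micro_batches = torch.randn(8, 16).chunk(4)              # split the batch into micro-batches

outputs = []
for mb in micro_batches:
    activations = stage_0(mb)                # device 0 finishes its layers...
    outputs.append(stage_1(activations))     # ...then hands the activations to device 1
result = torch.cat(outputs)
```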
As you can see, all of these have their own tradeoffs and possess some type of unique advantage over one or the other.
Instead of choosing between one of these three parallelism methodologies, they are typically combined or integrated together to form the 3D parallelization approach.
Big Brain Holdings discussed the Megatron-LM implementation - specifically its structure and training process - as an example of 3D parallelism and one of its implementations out in the wild. Megatron-LM’s GPUs are arranged into a 3D grid, with each dimension (out of three, obviously) representing one of the three types of parallelism. It’s complicated.
Even though 3D parallelism has its benefits, there are still unanswered questions, mostly concerning its communication requirements and the potential lack of availability for proper interconnects. Oh, and it’s also quite difficult to implement, considering it morphs three distinct approaches to parallelism into one.
Overall, parallelism is very relevant to the discussion because these methods present themselves as potential bottlenecks, but nevertheless bottlenecks that distributed training might help minimize or eliminate.
Moving on, optimizers are methods or algorithms used to minimize the loss function during the training phase, iteratively adjusting a model’s parameters to achieve more accurate predictions.
I haven’t discussed it yet but a loss function is a type of equation or mathematical model that measures how well a model is performing, usually in the form of a value displaying the difference between an accurate prediction and the model’s answer.
It’s unfortunate we have to blanket optimizers under a single categorization, because there are a ton of interesting rabbit holes that you can fall into. There’s gradient descent, AdamW, RMSProp, Momentum, FTRL, and a bunch of others that offer some set of advantages and disadvantages depending on what you’re looking for.
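If “iteratively adjusting parameters to minimize a loss” still feels abstract, here’s the entire idea on a one-parameter toy problem - plain gradient descent, nothing model-specific:

```python
# Minimize the loss (w - 3)^2, whose minimum is obviously at w = 3.
w = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (w - 3)    # derivative of the loss with respect to w
    w -= lr * grad        # step against the gradient
print(round(w, 4))        # ~3.0 after enough steps
```

Every optimizer listed above is some flavor of this loop with extra bookkeeping (momentum, adaptive step sizes, and so on).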
Understanding optimizers is important because most of this section involves distributed training methods that take creative liberties with optimizers. The first method I’ll cover - DiLoCo - makes use of a dual optimizer system, leveraging AdamW and Nesterov Momentum to create communication efficiency workarounds.
DiLoCo was created by Arthur Douillard in collaboration with DeepMind just two years ago in 2023. The original paper is titled “DiLoCo: Distributed Low-Communication Training of Language Models” and is pretty much just like the title says. The paper examines a new type of algorithm and how the team managed to get around the communication requirements of centralized training.
DiLoCo’s largest advantage comes from its focus on reducing the frequency at which machines communicate with each other. Each island of devices manages its own copy of the model parameters and runs many local optimization steps using the inner optimizer, AdamW. More precisely, the paper notes in its conclusion that “workers only need to send data once every 500 steps,” which eliminates many of the issues (or, technically, features) of centralized training runs.
I’m not sure if I defined what an island is, but it’s basically just a smaller cluster or set of GPUs within a larger set or datacenter. It also sounds cooler than saying cluster or maybe even mini-cluster.
These islands share their outer gradients to align the parameter replicas with a single global parameter. DiLoCo vertically slices training time into relatively long, localized work phases, punctuated with brief communication phases that synchronize the replicas to the global parameter.
One of the main benefits of DiLoCo is that each individual machine or island can function almost independently for a large fraction of the training run, lowering the communication overhead (synchronization frequency) compared to standard data-parallel or fully synchronous methods.
By communicating infrequently, DiLoCo makes it more feasible to train large models even when machines are separated by these low-bandwidth or high-latency connections, which is the main appeal considering we want to work towards a world where anyone on the planet can collaborate with others to train a massive model.
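Here’s a heavily simplified sketch of that two-level loop, just to make the inner-AdamW / outer-Nesterov structure concrete. Treat the numbers and the loss as placeholders; the real DiLoCo (and OpenDiLoCo) handles far more than this:

```python
import copy
import torch
import torch.nn as nn

H = 500                                    # local steps between syncs (the paper cites ~500)
global_model = nn.Linear(16, 4)            # stand-in for the shared global parameters
workers = [copy.deepcopy(global_model) for _ in range(4)]         # one replica per island
inner_opts = [torch.optim.AdamW(w.parameters(), lr=1e-3) for w in workers]
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)          # outer optimizer

for outer_round in range(3):
    # Local phase: each island trains independently for H steps with AdamW.
    for worker, inner_opt in zip(workers, inner_opts):
        for _ in range(H):
            x = torch.randn(32, 16)
            loss = worker(x).pow(2).mean()                        # placeholder loss
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

    # Communication phase: average the "outer gradient" (global minus local)
    # across islands and apply it with the outer Nesterov-momentum optimizer.
    outer_opt.zero_grad()
    for g_param, *w_params in zip(global_model.parameters(),
                                  *(w.parameters() for w in workers)):
        deltas = torch.stack([g_param.data - p.data for p in w_params])
        g_param.grad = deltas.mean(dim=0)
    outer_opt.step()

    # Reset every island to the freshly updated global parameters.
    for worker in workers:
        worker.load_state_dict(global_model.state_dict())
```

The thing to notice is that the expensive part (the inner loop) involves zero communication; the only thing crossing the network is one averaged delta every H steps.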
There’s been other work done on DiLoCo more recently by Douillard and other members of the DeepMind team, leading me to believe it’s one of the more promising approaches out there if your goal is to work towards scaling up distributed training runs.
To illustrate just how useful DiLoCo is, here’s a table from an Exo Labs blog post showing pre-DiLoCo, internet training speeds. Without distributed training methods, attempting to train even smaller scale projects over the internet is significantly slower than if you’d just done it the regular way:
And before getting into the other methodologies, I’ll briefly cover the aforementioned work on DiLoCo recently published by DeepMind.
The "Streaming DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch" paper is mainly concerned with DiLoCo's remaining trouble around peak bandwidth, despite its ability to get away with fewer synchronizations between workers.
Three methods are used to improve DiLoCo:
More selective synchronization of parameters
Allowing training during synchronization
Quantizing worker-exchanged data
For the record, I didn't know what the word quantize meant, and its definition is kind of wordy, but it refers to restricting the possible values of something to a predefined (smaller) set.
In the context of worker-exchanged data, quantizing means the data being transmitted is smaller, so it's less burdened by bandwidth constraints and moves through the system more freely.
I enjoyed reading this paper because it’s a good example of really valuable research despite the reality that we aren’t yet creating SOTA models with distributed training methods.
Streaming DiLoCo detailed how the model was now broken up into fragments, reducing peak bandwidth by synchronizing the fragments in a staggered sequence instead of requiring everything to be synchronized at once.
One of DiLoCo's issues came from needing to synchronize all of the model's parameters at the same time; the new paper's improvement syncs only one fragment at a time while still ensuring every parameter eventually gets exchanged across the worker set.
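Here's a toy schedule showing the fragment idea, with completely made-up numbers: instead of one big sync of every parameter at step 500, each of four fragments gets its own staggered slot, so only a quarter of the model is ever on the wire at once:

```python
# Simplified view of a staggered, fragment-wise sync schedule.
# The step counts and fragment count are illustrative, not from the paper.
H = 500                      # inner steps between two syncs of the SAME fragment
NUM_FRAGMENTS = 4
offset = H // NUM_FRAGMENTS  # stagger fragments so their syncs never pile up

for step in range(1, 1501):
    for frag in range(NUM_FRAGMENTS):
        if step >= frag * offset and (step - frag * offset) % H == 0:
            # Only this fragment's parameters are exchanged right now;
            # the other fragments keep training locally.
            print(f"step {step:>4}: sync fragment {frag}")
```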
The ability to train during synchronization is huge, because previously, the workers would have to pause and exchange gradients. Streaming DiLoCo implemented a type of overlap between communication and computation to limit idle time.
Quantized communication details are more complicated, but the TLDR is that the outer gradients (the Nesterov momentum in DiLoCo's case) are compressed down to 4-bit values, shrinking the volume of data exchanged.
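As a rough illustration of what 4-bit compression buys you, here's a generic uniform quantize/dequantize sketch - not the exact scheme from the Streaming DiLoCo paper, just the general idea that 4-bit codes take a quarter of the bandwidth of 16-bit values with a tolerable loss of precision:

```python
# Generic uniform 4-bit quantization of a tensor, for illustration only.
import numpy as np

def quantize_4bit(x: np.ndarray):
    # Map each value onto one of 16 evenly spaced levels between min and max.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)   # values in [0, 15], 4 bits each
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

outer_grad = np.random.randn(1_000_000).astype(np.float16)   # stand-in for an outer gradient
codes, lo, scale = quantize_4bit(outer_grad.astype(np.float32))

bytes_fp16 = outer_grad.nbytes          # 2 bytes per value
bytes_4bit = codes.size // 2 + 8        # two 4-bit codes packed per byte, plus lo/scale
error = np.abs(dequantize_4bit(codes, lo, scale) - outer_grad).mean()
print(f"fp16: {bytes_fp16} bytes, 4-bit: ~{bytes_4bit} bytes, mean abs error: {error:.4f}")
```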
* Note: I don’t know if that’s an accurate image, but if you imagine compressing from 16→8→4 bits, it might be helpful to visualize the difference between the waves’ peaks and troughs. Doesn’t really matter that much.*
OpenDiLoCo was developed by the Prime Intellect team, and is just as it says - it’s an open-source implementation of DiLoCo. The team defined a few of the key challenges still facing decentralized training researchers:
Slow interconnect bandwidth
Ensuring fault-tolerant training
Non-homogenous hardware settings
As discussed in the previous DiLoCo section, this is a training method that combats the first challenge - "fixing" the bandwidth problem. Prime Intellect was able to reproduce DiLoCo and apply it at the billion-parameter scale. This is a major achievement, though maybe you're a little confused, considering the models you're most familiar with and use every day (like Claude 3.7 or OpenAI's o1) are estimated to be orders of magnitude larger - hundreds of billions, maybe even trillions, of parameters.
The reason OpenDiLoCo's model was only one billion parameters comes down to memory constraints, the increased cost of training runs at higher parameter counts, and general engineering challenges around parallelism. Regardless of parameter count, the point isn't any perceived shortcoming but the success of achieving this at all.
Here are the replication results in practice:
There are a lot of useful anecdotes from the previously linked blog, but I found this to be most helpful in contextualizing progress:
“Due to DiLoCo’s significant reduction in communication time, the all-reduce bottleneck only accounts for 6.9% of the training time, minimally impacting the overall training speed.”
Overall the OpenDiLoCo paper is great, and Prime Intellect has done a ton of other work since then, so definitely check out their blog if you’re interested in their more recent research.
* Note: As I’m reading back, I realize they did the INTELLECT-1 training run for a 10b parameter model (and are planning on a 70b training run). Link here, but not sure I have it in me to write more, though it is very impressive. *
Distributed Training Over-The-Internet (DisTrO) was developed by the Nous Research team and is defined as a method to reduce the amount of data being exchanged in distributed training, though the process differs from DiLoCo (and Streaming DiLoCo).
In its most basic form, DisTrO synchronizes every step (as is standard for all training runs) but transmits just a small fraction of the information. This is kind of similar to the Streaming DiLoCo quantization stuff, but the DisTrO paper highlights how the “AdamW + All-Reduce” optimizer is replaced, which is different from reducing the bit size.
The main finding from this research is that DisTrO manages to reduce the amount of data transmitted by a factor of 500-1000x, which is remarkable on its own, but even cooler if you consider that it might one day be possible to combine the benefits of DiLoCo with DisTrO to increase communication efficiency by an even larger magnitude. I’m assuming this could be the case, based on this excerpt from Big Brain Holdings:
“Both of these approaches are not mutually exclusive and could be used in combination to achieve almost a 500,000x reduction in communication overhead–making it feasible to train LLMs in a decentralized manner.”
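The exact DisTrO mechanism isn't something I can reproduce here, so as a stand-in, here's a generic top-k gradient-compression sketch that only shows where a 500-1000x reduction in exchanged data can come from. The shapes and the 0.1% keep-ratio are illustrative, and this is not Nous's actual algorithm:

```python
# Generic top-k gradient compression - an illustration of the bandwidth math,
# NOT the DisTrO algorithm itself.
import numpy as np

def compress_top_k(grad: np.ndarray, keep_ratio: float = 0.001):
    k = max(1, int(grad.size * keep_ratio))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest-magnitude entries
    return idx.astype(np.uint32), flat[idx].astype(np.float16)

grad = np.random.randn(10_000_000).astype(np.float32)   # ~40 MB of raw gradient
idx, vals = compress_top_k(grad)

full_bytes = grad.nbytes
sent_bytes = idx.nbytes + vals.nbytes
print(f"full all-reduce: {full_bytes/1e6:.0f} MB, compressed: {sent_bytes/1e6:.2f} MB, "
      f"reduction: {full_bytes / sent_bytes:.0f}x")
```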
Decoupled Momentum Optimization (DeMo) is described as “a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.”
What does this actually mean?
DeMo differs from the other methods discussed because it's an optimizer-level innovation. The momentum part refers to extra state the optimizer keeps around (a running average of past gradients), and DeMo gets around the requirement that every GPU hold an identical, fully synchronized copy of those momentum states by letting each accelerator keep its own decoupled version. Importantly, keeping momentum in sync is part of what slows the training process down and introduces communication bottlenecks. It's very unfortunate but there aren't any cool graphics I can include for this section, except for this one (which isn't really necessary but it says Hellaswag so I had to):
Basically, it's an architecture-independent / topology-agnostic way of improving how optimizers function across various distributed training methods. DiLoCo uses both AdamW and Nesterov optimizers - what if there were a way to change the fundamentals of how the optimizer itself behaves?
The DeMo paper works through this question and proposes two major modifications: removing the full all-reduce operation and extracting the "fast components" of momentum prior to synchronization. Together these reduce memory and communication overhead, since each node only retains its own local momentum and exchanges the much smaller fast components.
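Here's a loose sketch of that structure with two big simplifications: the paper extracts the fast components with a DCT, while this just grabs the largest-magnitude momentum entries, and it fakes two workers in a single process instead of using real accelerators - all names and numbers are mine:

```python
# Loose, simplified sketch of decoupled momentum with shared "fast components".
import numpy as np

def extract_fast_components(momentum: np.ndarray, k: int) -> np.ndarray:
    # Stand-in for the paper's DCT-based extraction: keep the k biggest entries.
    idx = np.argpartition(np.abs(momentum), -k)[-k:]
    fast = np.zeros_like(momentum)
    fast[idx] = momentum[idx]
    return fast

PARAMS, K, LR, BETA = 10_000, 100, 0.01, 0.9
params = np.zeros(PARAMS)
local_momentum = [np.zeros(PARAMS) for _ in range(2)]   # each worker keeps its own momentum

for _ in range(10):
    shared = np.zeros(PARAMS)
    for w in range(2):
        grad = np.random.randn(PARAMS)                  # stand-in for the worker's local gradient
        local_momentum[w] = BETA * local_momentum[w] + grad
        fast = extract_fast_components(local_momentum[w], K)
        local_momentum[w] -= fast                       # the slow residual never leaves the worker
        shared += fast / 2                              # only the fast components are exchanged
    params -= LR * shared                               # every replica applies the same small update
```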
SWARM Parallelism is a model-parallel training approach created in 2023, designed "for poorly connected, heterogeneous and unreliable devices." A lot of the SWARM parallelism paper leans on something known as the square-cube law, which describes how, as model size scales up, computation time grows faster than communication time because of their different complexities (roughly O(n^3) vs. O(n^2)).
The main idea of this relationship is that even though bandwidth doesn't really improve, you can keep scaling up model size and compute and still achieve quality results, because communication becomes a relatively smaller share of the total time.
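A quick back-of-the-envelope run, using the same O(n^3)-vs-O(n^2) framing from above (constants dropped, numbers illustrative and not measurements from the SWARM paper), shows how the compute-to-communication ratio keeps improving as the model grows:

```python
# Square-cube intuition: "volume" (compute) outgrows "surface" (communication).
for n in [1_024, 4_096, 16_384]:
    compute = n ** 3       # computation time scaling, per the framing above
    comm = n ** 2          # communication time scaling
    print(f"n={n:>6}: compute/communication ratio ~ {compute // comm:>6}x")
```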
SWARM parallelism also makes use of the same splitting idea: it takes a model, breaks it into blocks (pipeline stages), and distributes those amongst smaller swarms of workers.
One of the biggest advantages of SWARM parallelism is its ability to manage worker downtime or unreliable nodes - instead of shutting down completely, different workers from various other sub-swarms get positioned to fill in for the lost worker or unreliable node.
This is a bit different from methods like DiLoCo because SWARM's workers each handle their own stage of a shared pipeline, whereas DiLoCo's islands each train a full, isolated replica of the model.
The paper has a good amount of test runs and experiments laid out, so go read that if you want to look deeper. I know that a few DeAI projects have recently been exploring SWARM parallelism, most notably Gensyn though I’ve yet to look into the details of their algorithm.
They did reference the DeepSeek / GRPO stuff in a paper, which is interesting:
The research on Distributed Path Composition (DiPaCo) was also written by Arthur and the DeepMind team, laying out the structure of a co-designed modular architecture and training approach for models.
Throughout the training process DiPaCo distributes computation through a set of paths and "shared modules," while using DiLoCo-style local-SGD optimization to get the same reduction in communication/synchronization.
This was published in early 2024, and even though I haven't seen much discussion of it lately, DiPaCo was worth including because it's less of a training method than it is a type of infrastructure for other training methods, in this case DiLoCo.
DiPaCo is capable of creating a much larger, virtual model through this path-splitting architecture, dividing up a model's parameters across any number of heterogeneous compute providers and heterogeneous hardware (aka different types of GPUs).
The architecture also diverges from DiLoCo in how its outer + inner synchronizations are structured, with DiPaCo only running synchronizations if the paths share the same module. With DiLoCo, every single worker stores a copy of the entire model and synchronizes every 50-500 steps.
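Here's a toy illustration of the shared-module idea with completely made-up paths: each "path" is a small model assembled from a few modules, and only modules that appear in more than one path ever need to be synchronized across the workers training them:

```python
# Toy path/module bookkeeping - purely illustrative, not DiPaCo's actual routing.
paths = {
    "worker_0": ["embed", "block_A", "block_C", "head"],
    "worker_1": ["embed", "block_B", "block_C", "head"],
    "worker_2": ["embed", "block_B", "block_D", "head"],
}

# Figure out which workers own each module.
module_owners: dict[str, list[str]] = {}
for worker, modules in paths.items():
    for m in modules:
        module_owners.setdefault(m, []).append(worker)

for module, owners in module_owners.items():
    if len(owners) > 1:
        print(f"sync '{module}' across {owners}")       # shared module -> periodic sync
    else:
        print(f"'{module}' is private to {owners[0]}")  # unique module -> no cross-worker sync
```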
DiPaCo is more “mixture-of-experts” than it is distributed training, which is arguably cooler because you can theoretically scale out the DiPaCo path/module architecture across any number of islands, but don’t quote me on this because this is only my understanding of it. Here’s a bit more info on DiPaCo from Big Brain.
And that’s most of it. I think this section gives a much clearer picture of what these methods actually enable compared to most writing on the subject. I’m not an extremely technical person and don’t have a hard science or math degree, so all of this explanation was just the result of reading and trying to understand the implications from a high level.
From what I can tell, most of this research is at most a few years old - no one knows what might get released next or what's actively being developed. Distributed training feels extremely promising to me considering how little brain power is currently being directed towards it.
It isn’t that hard to imagine teams continuing to crack larger and larger distributed training runs, incorporating some type of decentralized RL data generation, and combining these advances to create even a single model capable of going toe-to-toe with centralized incumbents.
That’s kind of why it felt necessary to describe DeepSeek’s breakthroughs, because even though these achievements were spawned out of centralization, it shows that true innovation is possible even if it’s considered unlikely.
It’s crucial to realize that even with the already bustling ecosystem of training methods laid out here, the key isn’t to overanalyze what exists, but what might soon be possible.
Like I've said more than a few times, if even a fraction of this stuff works out, the implications for both the centralized and decentralized AI communities are so large that it's almost impossible not to feel optimistic.
“The one important thing to note is that this depicts a static point in time – today – while we are primarily focused on peaking around the corner into the future. It’s not enough to understand where these technologies reside currently but rather how realistic are the paths for each of these to move toward the top right corner [of this matrix]. That will dictate startup formation, strategy and ultimately success.” - Smac & Knower
Implications of distributed training, next steps, and final thoughts
"I imagine some future where there are a thousand different minds being grown, each having its roots in a thousand or more distinct computers separated by sometimes great distances, swapping information surreptitiously with one another, below the waterline of the monitoring systems designed by many AI policy control regimes." - Jack Clark, Import AI
I really struggled coming up with a decent conclusion to all of this.
There are already so many high quality reports that cover distributed training in far greater detail than what I’ve done - so how was I going to bring it home and offer a new perspective?
The goal of this wasn’t to try and larp as being technical or write about a bunch of ideas I only understand on a surface level, but to gradually unveil some of the hidden mysteries behind a dense subject matter.
It’s difficult because in the past I would use this blog to write about investment opportunities in crypto or theses that could usually be expressed in some type of investment vehicle. With distributed training, most of this stuff is being worked on by a) very large corporations, b) private companies, and/or c) research teams that you can’t directly invest in.
This leaves most of us in a difficult spot as it isn’t immediately obvious how to not only identify winners and losers, but actively place concentrated bets on the fastest horses.
It’s probably useful to look towards the eventual intertwining between reasoning models & traces, reinforcement learning, and distributed training for a taste of what’s to come, but I’ll also say that there are decent opportunities to provide compute to a handful of these protocols and networks at the moment.
This is just a rough list of teams I’m pretty sure are currently accepting, have been accepting, or have accepted compute in the past:
Even though the bulk of the work will continue to come from proving distributed training is feasible, there's real value to be found in developing the applications or services that generate novel combinations of crypto incentives & machine learning infrastructure.
It would be nice to review some of the DeAI protocols here, but that would take too much time. Shoutout to YB for the source (Topology), but here’s a more updated market map of DeAI teams currently building in the space.
I'm sure there are many interesting plans around tokenomics, balancing incentives, and ensuring integrity amongst these dozens of protocols, but that's a task for someone else.
Some of the areas I’m personally excited about include the verification of compute, distributed reinforcement learning as a trojan horse for data generation, and squeezing utility out of readily available consumer hardware to more easily train quality models.
Somewhat unrelated, but Grass is supposedly doing quite well on the data scraping side to the tune of >$50 million in revenue. I’ve had my issues with Grass’ model in the past but maybe there’s more to it than I’d thought. If the team can roll out some additional features or product lines related to decentralized data generation, I think they’re best positioned right now to benefit from the previously described synthetic data boom.
There’s also Vana which was given a shoutout in the GSR report, though I admittedly haven’t looked too deeply at Vana just yet.
I think there’s something to the model, because instead of more general scraped data like Grass has prioritized, Vana incentivizes user-owned / private data - things like social media posts or even personal journal entries. I’m not sure how well this scales in regards to training net-new model types or vastly differentiated models built around niche data, but it was at the very least unique enough to deserve a mention.
I decided to include verification here because I was re-reading an older report from Haseeb on decentralized inference/verification and it got me thinking more about the utility of existing DeAI protocols.
These verification protocols are either solving a niche problem whose solution will eventually be used at scale by someone, or building something more general for the here and now.
The easiest way to frame this comes from maybe 3-4 years ago, when optimistic rollups like Optimism and Arbitrum were dominating while zero-knowledge rollups were still struggling to make it to mainnet. If you remember, Ethereum mainnet was actually dying to offload some of its execution, and L2s served a purpose then.
Blockspace is readily available these days and isn't as hotly contested, but optimistic rollups winning out over zero-knowledge L2s remains a good example of how sometimes good enough is more than just good enough and wins the race.
It isn’t necessary to describe the differences between optimistic ML and zkML, but some of the more technical specs stood out to me as I was thinking about distributed RL data generation stuff. Projects like EZKL, Panthalia, and Gensyn are all working towards some type of generalized verification (whether optimistic, zk, or something in-between) and fit into the thesis quite nicely.
We need verification because compute offered on a P2P or decentralized compute marketplace doesn't come with any cryptographic guarantee that the work was done correctly. If more money, research, and attention flow towards distributed/decentralized training, we can't expect serious industry professionals to take us seriously unless we can offer the most basic assurance - trust.
On the consumer hardware side, one of my favorite experiments to watch has been EXO Labs which is developing software/infra for anyone to run highly performant models on GPUs they have available at home.
This relates to the distributed training thesis because if we can more easily manipulate optimizers, parallelism, and squeeze even more juice out of cheap/accessible GPUs, it follows that we can all collaborate to train models that compete with the largest labs.
I’ll admit a lot of it does feel like a pipe dream, but it isn’t impossible.
Even if only a fraction of these ideas and techniques ever get implemented at scale, it’s hard to imagine the effects not being felt across every level of the centralized AI stack.
Regardless of where we go from here, I think the report you’ve just read did a good job of pointing out some of the non-obvious implications and a few second/third order effects of these distributed training advancements.
But I have been told that this report is very long and a conclusion needs to operate as a kind of reward for the readers, especially for something like this which reached well over 10,000 words.
Instead of listing out more protocols that have caught my attention or sharing other hyperlinks, it’d be easier to just explain in simple terms why all of this is so exciting to me.
I’m not sure how much science fiction you guys have read, but there are a lot of well-written novels that incorporate AI either directly into the main plot or leverage it as a main feature of a novel’s world building. One of the books I’ve recently been enjoying is Accelerando by Charles Stross.
I won’t drone on about the plot points, novel structure, or characters within Accelerando, but I’m including it because of how unique its depictions of futuristic technology are. It’s less about the actual devices or things people are actually using, but the ways individuals within the story use technology everyday and how their lives are fundamentally different because of it.
It’s an almost completely alien setting contrasted against where we’re at today, which is odd to say considering most humans in the developed world spend a majority of each day staring at a screen.
It’s equal parts dystopian and inspiring, similar to how I feel broadly about AI.
There are more than a handful of posts on my Twitter about AI, job automation, UBI, and loosely related topics relevant to this societal fear of replacement. I sympathize with everyone who’s ever expressed concern over being automated away by AI, and I’d be lying if I said I didn’t occasionally worry about job security.
Accelerando does a good job of highlighting some of the more fun or “out there” integrations between AI and humans, but equally displays more often than not just how challenging the world has gotten.
The tech will be built and all we can do is hold on for dear life.
It’s my belief that decentralized/distributed training - when coupled with other advancements mentioned - is one of the only methods of self defense we have against an overwhelmingly likely dystopian set of outcomes.
If we can build our own AGI at home, run state-of-the-art models locally, and do all of this on affordable hardware, that’s a good enough outcome for me.
One of the easier ways to absorb the decentralized thesis might just be thinking a bit harder on why you got into crypto in the first place. Addressing the elephant in the room, I’m sure many of us joined the space because it seemed easy to make money - but that isn’t the only reason, or at least I’d hope for most.
The obvious example to use when asked about the utility of crypto is to respond and ask if this other individual has ever tried wiring money. It’s quite simple, but if you’ve been unfortunate enough to need to interact with the modern financial system, you’d understand immediately just how much easier crypto is.
Beyond that, it should be fairly consensus that your money is your money, and regardless of whether or not Bitcoin is actually a CIA psyop, it's still vastly better than ceding control to higher powers and storing all of your money in a bank under surveillance.
Just like money, AI should be decentralized.
The stakes are so much higher, especially if we eventually create AGI or god forbid an ASI that turns us all into bugs. Wouldn’t you rather have the optionality of at least prompting the ASI before it turns you into just another energy source?
I originally wanted to write a more conclusive thesis or wrap-up commentary here, but it might not be necessary. After re-reading, editing, and working on this for so long I realize most of my opinions and beliefs are scattered throughout.
You could say my argument is a bit distributed on its own. I hope the Accelerando analogy works, as that’s kind of the first science fiction story I go to anytime the topic of decentralized training comes up.
That’s about all I have to say for now. If you enjoyed reading this please consider sharing it online or messaging me to talk more. Peace.