5 Facts on Real World DevSecOps in 2026 (Sponsored)Datadog analyzed tens of thousands of production applications to reveal where risk is actually showing up and what it means for teams handling security today. Download the full report to dive into the key findings, including:
In December 2024, an AI lab called DeepSeek released a 671-billion-parameter language model along with a technical report describing exactly how they built it. Six months later, a different team called Moonshot AI used that report as a starting point. They scaled the design to a trillion parameters, ran into a training instability problem that emerged at that scale, invented a new optimizer to solve it, and shipped their own model. Eight months after that, a third team called Zhipu AI integrated a different DeepSeek innovation into their architecture and contributed a new training framework of their own. These three teams work for different organizations. However, they were indirectly collaborating in public, through model releases where each company was learning from what its predecessors had done before. This has been made possible by the rise of open-weight models, where even competitors get to learn from each other. The pace of that kind of collaboration has changed over the past eighteen months, and the reasons trace back to the architecture and training choices these teams made in the open. In this article, we will look at how open-weight models have transformed the AI landscape. Disclaimer: This post is based on publicly shared details from various sources. Please comment if you notice any inaccuracies. Open Weight vs Closed WeightEvery modern large language model has two important things behind it:
A closed-weight model is one that the company keeps behind an API. The user sends some text to the official API endpoint, the company’s servers run the model on their hardware, and a response comes back. The parameters stay with the organization, and running the model on personal hardware or adjusting it for a specific task remains out of reach. An open-weight model is one where the company has published the trained parameters. Anyone can download them, run the model on their own hardware, and adjust it for their own data. The training data and the full training code, however, usually stay private. The term is “open weight” rather than “open source” for this reason. In traditional software, “open source” means the full source code is available to inspect and reproduce. Most AI models marketed as open source are actually open weight, where the trained model is public, while the process that produced it remains closed. This distinction matters because the published weights, paired with detailed technical reports, are what allow other teams to study a design and build on top of it. Different open-weight models also use different licenses, ranging from very permissive ones like MIT and Apache 2.0 to custom licenses with specific commercial restrictions, so the practical freedoms vary across the ecosystem. See the diagram below that shows the difference between accessing a closed-weight model and an open-weight model: The MoE ArchitectureEvery major open-weight LLM released at the frontier in 2025 and 2026 shares the same architectural skeleton. It is called a Mixture-of-Experts transformer, or MoE for short. Modern LLMs are built from stacked “blocks.” Each block has two main parts, an attention layer that figures out which previous words matter for the next one, and a feed-forward layer that does the actual computation. In a regular (”dense”) model, every parameter activates for every word the model processes. Adding more parameters to make a smarter model means the cost of running it scales linearly with that count. With hundreds of billions of parameters, this becomes impractical. MoE solves this by replacing the single feed-forward layer in each block with several smaller “expert” sub-networks, plus a small routing component that picks which experts to use for each word. The model can store knowledge across many experts while only computing a few of them per word. This is why two numbers matter for every MoE model:
DeepSeek V3, for example, has 671 billion total parameters but only 37 billion active per word. Kimi K2 has a trillion total, but 32 billion active. Qwen3 has 235 billion total and 22 billion active. When comparing the cost of running these models, the active counts are what matter, rather than the totals. A trillion-parameter model and a 235-billion-parameter model can cost roughly the same per query if their active counts are similar. See the diagram below that shows how an MoE block works: Beginners often assume that experts in MoE specialize by topic, with a math expert, a code expert, and so on. The reality differs from that picture. The router picks experts per word, rather than per question, and the patterns experts specialize in are mostly outside human interpretation. The routing is fine-grained, and a single sentence will pass through many different combinations of experts as it generates. The MoE architecture explains why every frontier open-weight team is using roughly the same approach. The interesting differences lie in the design choices teams make inside that design approach, and three places are where those choices show up most clearly. Tips to take AI agents from playground to production (Sponsored)Every organization is exploring how to adopt AI agents or MCP servers, but how many of them are in production? And if they aren’t in production, how likely is it that authentication, access control, and agentic identity concerns are the reason? Watch this on-demand webinar from Descope to learn:
Attention StrategiesThe first place is in how teams handle attention. Every time the model generates a word, it looks back at every previous word in the conversation to figure out what comes next. To avoid recomputing this lookback at every step, the model caches information from earlier words. This cache is called the KV-cache, short for “keys and values,” and it grows as the conversation grows. For long conversations, the KV-cache becomes the main memory bottleneck. Three different strategies have emerged for managing it.
Each strategy is a rational choice depending on what the team is optimizing for, whether that is engineering simplicity, memory efficiency, or context length. See the diagram below that shows the three attention strategies side by side: Expert Count and SparsityThe second point where the teams diverge is in how aggressively they use the MoE pattern. Across the major open-weight models released in 2025 and 2026, the number of experts ranges from 16 to 384. That wider spread reflects a real disagreement about how far to push sparsity. At a fixed compute budget, increasing the number of experts can lower training and validation loss, meaning the model learns better from the same amount of compute. The tradeoff is memory. More total experts mean more total parameters that need to live in memory, even if only a few of them fire per word. Kimi K2’s trillion total parameters require a multi-GPU cluster regardless of how few experts activate per token, while Llama 4 Scout’s 109 billion total parameters fit on a single high-memory server. Both belong to the same architectural family, although the deployment realities differ significantly. A separate disagreement exists around whether to include a “shared expert” that processes every word and provides a baseline capability floor. DeepSeek V3, Llama 4, and Kimi K2 include one. Qwen3 dropped it after using one in their previous Qwen2.5-MoE, and their technical report stays silent on why. Consensus in the field on shared experts remains elusive, which is worth flagging because beginners often assume technical questions in well-resourced labs are settled. Many of them remain open. Llama 4 takes a particularly distinctive approach. Rather than making every layer in the model an MoE layer, Llama 4 alternates between dense and MoE layers. It also routes each word to only one routed expert, plus the shared one, instead of eight as in DeepSeek V3. The result is fewer active experts per word, with each expert being larger, which represents a different architectural bet from the rest of the field. See the diagram below that shows how the expert count varies across these models: Training ApproachesThe third point at which teams diverge is in training. Architecture is one-half of how a model behaves. Training is the other half, and lately it has become where the more meaningful differences live. Pre-training is the part where the model learns to predict the next word across trillions of tokens of text. Pre-training gives the model its base knowledge of language and the world. The scale varies between teams, with DeepSeek V3 trained on 14.8 trillion tokens and Qwen3 trained on up to 36 trillion. The general approach, however, remains similar across teams. Post-training is everything that happens after, and post-training is where models now diverge the most. Three techniques deserve attention here.
Beyond these techniques, the training infrastructure itself has become a meaningful contribution. Two examples from the recent ecosystem stand out.
Both contributions may end up being more reusable across the ecosystem than any specific architectural choice. Architecture is converging while training is now where teams place their different bets. See the diagram below that shows where these training approaches sit in the overall pipeline: The Borrow-and-Build PatternA pattern runs through all three of these approaches when looked at together.
Each team built on the previous team’s published innovations, and each added something the next team could build on in turn. These innovations all depend on published weights and detailed technical reports. See the diagram below that shows how innovations have traveled between teams over the past eighteen months: This is an observation about the open-weight ecosystem rather than a verdict that open-weight is “winning” or that closed-weight teams have fallen behind. Closed-weight teams are doing different work, optimized for different things, and many of their innovations stay private by choice. The open-weight ecosystem, however, has produced a kind of technical conversation that ran on a smaller scale before. ConclusionThe specific models covered here will likely be overtaken in months. The framework for reading them, however, will probably hold. Modern open-weight LLMs share a common skeleton, the MoE transformer, where total parameters and active parameters represent two different costs. Within that skeleton, teams place distinctive bets in three places:
The license under which a model is released determines what teams and individuals can actually do with it, and “open weight” remains technically narrower than the traditional notion of open source. The most interesting development in this period of AI engineering is the innovations that open-weight models are inspiring through their published designs References
|
#6636 WWE 2K26 v1.06 + 6 DLCs [Monkey Repack] Genres/Tags: Arcade, Fighting, Sports, 3D Companies: 2K Games, Visual Concepts Languages: ENG/MULTI6 Original Size: 129.2 GB Repack Size: 107.8 GB Download Mirrors (Direct Links) .dlinks {margi… Read on blog or Reader FitGirl Repacks Read on blog or Reader WWE 2K26, v1.06 + 6 DLCs [Monkey Repack] By FitGirl on 28/03/2026 # 66 3 6 WWE 2K2 6 v1.0 6 + 6 DLCs [Monkey ...








Comments
Post a Comment