Understood. I will research and estimate the total research and development cost from DeepSeek V3 through DeepSeek R1, taking into account technical details such as GPU usage, model size, and training data scale, and including R&D personnel costs, infrastructure (data centers, cloud costs, etc.), and other operating expenses. I will also supplement the analysis with a comparison against the development costs of other models of similar scale (GPT-4, Gemini, etc.).
I will share the detailed findings once the research is complete.
DeepSeek V3 to R1: Research & Development Cost Analysis
1. Technical Overview of DeepSeek V3 and R1
Model Scale: DeepSeek-V3 is a massive Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37B are active per token (arxiv.org). DeepSeek-R1 (and R1-Zero) is architecturally identical to V3 (martinfowler.com), so R1 inherits the same 671B-parameter, 37B-active MoE setup established in V3.
Training Data: V3 was pre-trained on an extremely large dataset of 14.8 trillion tokens drawn from diverse, high-quality sources (arxiv.org). This dataset is orders of magnitude larger than those used for earlier models like GPT-3 (which trained on ~300 billion tokens), and the huge corpus enabled V3 to attain strong general capabilities. R1's development built on V3's pre-trained model and further fine-tuned it (with reinforcement learning and some supervised data) to enhance reasoning. The R1 paper introduced a "zero-SFT" RL training approach (R1-Zero) and a refined RLHF-style pipeline for the final R1 model, focused on chain-of-thought reasoning. These RL fine-tuning phases use custom reward setups (unit tests for code, LLM-based answer grading, etc.) (epoch.ai), but they involve far fewer tokens than the massive pre-training – on the order of a couple of trillion generated tokens versus 14.8 trillion for pre-training (epoch.ai).
GPU Usage & Training Duration: Thanks to its MoE design and extensive software/hardware optimizations, DeepSeek-V3's full training (including pre-training on 14.8T tokens, context-length extension, and alignment fine-tuning) consumed approximately 2.788 million GPU-hours on NVIDIA H800 GPUs (arxiv.org). This is remarkably low given the data scale – it equates to roughly 116,000 GPU-days. In practical terms, on a cluster of ~2,048 GPUs (the scale some reports indicate was used; therecursive.com), the pre-training could be completed in roughly 6–8 weeks. Indeed, analysis suggests DeepSeek's team likely utilized a ~2,048-H800 cluster for training (epoch.ai). The training process was noted to be highly stable, with no irreversible loss spikes and no training restarts needed (arxiv.org) – a testament to the team's engineering.
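For orientation, the arithmetic behind these figures is straightforward. In the sketch below, the 2,048-GPU cluster size and the $2/GPU-hour rental rate are the assumptions cited in this section, not official numbers.

```python
# Back-of-envelope check of DeepSeek-V3's reported training compute.
TOTAL_GPU_HOURS = 2_788_000   # full V3 run: pre-train + context extension + post-train
CLUSTER_GPUS = 2_048          # assumed H800 cluster size
RENTAL_RATE_USD = 2.0         # assumed market rate per GPU-hour

gpu_days = TOTAL_GPU_HOURS / 24
wall_clock_weeks = TOTAL_GPU_HOURS / CLUSTER_GPUS / 24 / 7
rental_cost_musd = TOTAL_GPU_HOURS * RENTAL_RATE_USD / 1e6

print(f"GPU-days:          {gpu_days:,.0f}")                         # ~116,000
print(f"Wall clock:        ~{wall_clock_weeks:.1f} weeks on 2,048 GPUs")  # ~8 weeks
print(f"Rental-equivalent: ~${rental_cost_musd:.2f}M at $2/GPU-hr")  # ~$5.6M
```

The ~8-week wall-clock result for the full run is consistent with the 6–8 week pre-training estimate above, and the rental-equivalent figure matches the ~$5.6M compute cost discussed later.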
After pre-training, the additional fine-tuning steps for R1 were shorter. The reinforcement learning phase that produced R1-Zero and then R1 involved on the order of thousands of gradient-update steps (estimated ~8,000) and the generation of roughly 2 trillion tokens of model outputs during training exploration (epoch.ai). This translated to a much smaller compute requirement: rough estimates put the RL training for R1-Zero at ~6.1×10^23 FLOPs, roughly $1M worth of GPU time assuming similar efficiency to pre-training (epoch.ai). Even if efficiency was lower, the RL stage's cost is far less than the gigantic pre-training. A second RL phase (after a "cold-start" supervised fine-tune) likely added a similar order of compute (epoch.ai). In total, the R1 fine-tuning (SFT + two RL phases) likely took on the order of several hundred thousand GPU-hours, perhaps an extra $1–2M in compute cost. In wall-clock time, the reinforcement learning stages were short – on a 2,000-GPU cluster, the needed rollouts and updates can be completed in around a week or two (epoch.ai).
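A minimal sketch of how the ~$1M figure can be reproduced, assuming the RL stage ran at roughly the same effective throughput as pre-training. The standard ~6 FLOPs per active parameter per token is used to estimate pre-training FLOPs; RL mixes token generation with gradient updates, so this is only an order-of-magnitude check.

```python
# Rough sanity check of the "~$1M for R1's RL phase" estimate.
ACTIVE_PARAMS = 37e9            # active parameters per token (MoE)
PRETRAIN_TOKENS = 14.8e12       # V3 pre-training corpus
PRETRAIN_GPU_HOURS = 2_788_000  # reported V3 GPU-hours
RL_FLOPS = 6.1e23               # Epoch AI estimate for R1-Zero's RL phase
RATE_USD_PER_GPU_HOUR = 2.0     # assumed rental rate

# ~6 FLOPs per parameter per token for forward + backward passes.
pretrain_flops = 6 * ACTIVE_PARAMS * PRETRAIN_TOKENS
flops_per_gpu_hour = pretrain_flops / PRETRAIN_GPU_HOURS   # effective throughput

rl_gpu_hours = RL_FLOPS / flops_per_gpu_hour
rl_cost_musd = rl_gpu_hours * RATE_USD_PER_GPU_HOUR / 1e6

print(f"Effective throughput: ~{flops_per_gpu_hour / 3600 / 1e12:.0f} TFLOP/s per GPU")
print(f"RL stage: ~{rl_gpu_hours/1e3:.0f}K GPU-hours, ~${rl_cost_musd:.1f}M at $2/GPU-hr")
```

Under these assumptions the R1-Zero RL phase lands around half a million GPU-hours and roughly $1M, broadly consistent with the estimates quoted above.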
Summary: DeepSeek-V3/R1's development was compute-intensive but highly optimized. The model's enormous scale (671B parameters, 37B active, 14.8T tokens) was made feasible by clever techniques (Multi-Head Latent Attention, optimized MoE routing, FP8 precision, etc.) that maximized hardware utilization and minimized the GPU-hours needed (martinfowler.com). The final GPU usage was ~2.8M GPU-hours for one full V3 training run, plus on the order of a few hundred thousand to one million GPU-hours for the R1 fine-tuning stages. This gives a baseline for direct compute costs, which we analyze next.
2. Research & Development Costs
Developing DeepSeek V3 and R1 required not just raw compute, but extensive human effort and experimentation. Key R&D cost components include:
- **Personnel (Researchers & Engineers):** A project of this scale requires a large, highly skilled team. DeepSeek's author lists and external reports indicate a substantial group of researchers and engineers collaborating. SemiAnalysis reports that DeepSeek has around 150 employees (many recruited from top universities) working on these AI efforts (semianalysis.com). Such experts are extremely sought-after; DeepSeek has reportedly offered very high salaries (>$1.3M USD for top talent) to recruit AI researchers, well above typical industry rates (semianalysis.com). Assuming a more average fully loaded cost per technical staff member (including benefits) in the low-to-mid six figures, annual personnel expenditure could easily exceed $20–30 million (a rough sizing sketch follows this list). Development from V3 through R1 spanned roughly one year of intense work (2024 into early 2025), so labor costs in that period likely total in the tens of millions of dollars. This includes model designers, engineers optimizing the distributed training stack, data curators, and researchers working on novel RL techniques. In other words, human capital was a major investment, possibly on par with or exceeding the pure compute cost.
- **Research Effort & Experimental Trials:** The $5–6M compute figure often cited for DeepSeek-V3's training accounts only for the final, successful training run (reddit.com). In reality, the team undoubtedly spent much more compute on experiments, prototyping, and iterative research. They likely trained smaller-scale models, ran ablation studies, and tuned hyperparameters extensively (e.g. the MoE gating strategy, FP8 training stability) before committing to the full run. This "hidden" compute usage is part of the R&D cost. As one commenter pointed out, ignoring the cost of experimentation is misleading (reddit.com) – many trial runs and variations are needed to develop a state-of-the-art model. We can safely assume multiple runs (or portions of runs) were executed to refine DeepSeek's methods; even a couple of additional partial training runs could easily double or triple the raw compute expenditure. Martin Vechev of INSAIT emphasized that the true cost must account for "running this training (or variations of it) many times, and also many other experiments", not just one run (therecursive.com).
Additionally, the R1 paper's advancements (novel RL algorithms like GRPO and the "zero-SFT" approach) required their own research and tuning. Developing a robust reward model, rule-based evaluators (for code/math rewards), and cleaning the "cold-start" supervised dataset all take time and possibly some labeling cost. We should also consider data sourcing and curation: assembling a 14.8-trillion-token corpus and filtering it for quality is a massive project in its own right. While much of the data may come from open sources (web crawls, etc.), processing and storing it is non-trivial. Vechev notes that "data collection and other things… can be very expensive" in such projects (therecursive.com); this could include costs for web-crawling infrastructure or purchasing/licensing datasets.
- **Academic Research & Publication:** DeepSeek's team published multiple technical reports (DeepSeek-LLM, V2, V3, R1). Conducting literature research, collaborating across a large author list, and writing these papers also represents effort (and thus cost). While hard to quantify separately, it is essentially part of the researchers' jobs; conference and workshop expenses are minor relative to other costs. The main cost here is the opportunity cost of talented engineers spending months on experimentation and paper-writing. In summary, the R&D phase (problem exploration, algorithm development, prototype training, and documentation) likely added several multiples of the final-run compute cost in both compute-hours and labor-hours. The value of the expertise and time invested is immense, even if it is less visible than a cloud bill.
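To make the personnel assumption concrete, the sketch below multiplies headcount by a fully loaded cost per person-year. Both inputs are assumptions drawn from this report (the ~150-person headcount and a $150k–$300k per-person-year range bracketing the "low-to-mid six figures" average above and the $200k–$300k figure used later), not disclosed numbers.

```python
# Rough personnel-cost sizing for the V3 -> R1 effort.
HEADCOUNT = 150                             # SemiAnalysis estimate of DeepSeek's team size
COST_PER_PERSON_YEAR = (150_000, 300_000)   # assumed fully loaded cost range, USD
PROJECT_DURATION_YEARS = 1.0                # roughly 2024 into early 2025

low, high = (HEADCOUNT * c * PROJECT_DURATION_YEARS for c in COST_PER_PERSON_YEAR)
print(f"Estimated labor cost: ~${low/1e6:.0f}M to ~${high/1e6:.0f}M")   # ~$22M to ~$45M
```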
3. Infrastructure and Operational Costs
Building and running DeepSeek’s models required heavy infrastructure investment. We break down the key infrastructure and ops costs:
- **Data Center / Compute Infrastructure:** Rather than relying on public cloud services, DeepSeek appears to operate its own high-performance computing (HPC) clusters. According to analyses, the organization (backed by the Chinese quant fund High-Flyer) amassed on the order of tens of thousands of top-tier GPUs for AI work (semianalysis.com). SemiAnalysis estimates DeepSeek/High-Flyer collectively have around 50,000 Hopper-generation GPUs, including roughly 10k NVIDIA H800s and 10k H100s (semianalysis.com). This is an enormous capital expense – potentially >$500 million worth of GPUs alone; one estimate puts DeepSeek's total server capex at ~$1.6B across all its clusters (semianalysis.com). For the specific training of V3/R1, the team likely utilized a fraction of this capacity (e.g. a cluster of 2,000–2,500 H800 GPUs) for a couple of months. Considering just that hardware, 2,048 H800 cards have an aggregate asset value of roughly $50–100M (therecursive.com). Owning the hardware means DeepSeek does not pay cloud rental fees, but it must front the cost to buy or finance the equipment and absorb its depreciation over time.
To put this in perspective, had DeepSeek rented equivalent compute from a cloud provider, the direct training of V3 would have cost about $5.6M (at a market rate of ~$2/hr for an H800/H100-class GPU) (arxiv.org). Owning the data center shifts this into an upfront investment. Economies of scale likely favor DeepSeek, since the GPUs serve many projects (trading algorithms, other AI models, etc., shared with the parent fund) (semianalysis.com). But the infrastructure cost remains a major component of the overall R&D cost – essentially, DeepSeek had to have ~$50M+ of hardware at the ready to even attempt training a model like V3. Amortized over projects, the V3/R1 training might "use up" a few million dollars' worth of the hardware's life (since the GPUs can be reused many times).
- **Cloud Services vs. On-Premise:** DeepSeek ran the work on its own cluster, likely for reasons of both cost and control. For comparison, using a cloud like AWS or Google Cloud would have incurred charges not only for GPU-hours but also for storage and networking at cloud rates; at DeepSeek's scale, on-premise likely saved money. As noted, the $5.6M compute figure assumes a ~$2/GPU-hour cloud rate (arxiv.org), which is a rough market price. DeepSeek's actual out-of-pocket electricity cost for the training run would have been much lower (likely well under $1M; a quick sanity check follows this list). SemiAnalysis estimates roughly $944M in operating costs associated with running DeepSeek/High-Flyer's large clusters over time (semianalysis.com), which includes power and facilities; for a single model run, the proportional share is small but not negligible. We can estimate that V3's training (2.788M GPU-hours) consumed roughly 0.84 GWh of electricity (assuming ~300W per GPU) – at typical industrial electricity rates that is under $100K in power. Networking hardware and data-center cooling also factor in, but again are spread over many uses.
- **Storage and Data Processing:** Handling 14.8 trillion tokens of training data requires a significant storage system. At a few bytes per token, the raw text dataset is on the order of tens of terabytes (50–60+ TB). DeepSeek would need high-speed distributed storage to feed this data to thousands of GPUs continuously, which likely meant investing in a large-scale storage cluster (a distributed file system or object store) and a high-bandwidth network fabric. High-performance networking is crucial: MoE training requires coordinating many experts across GPUs, which entails extensive inter-GPU communication (epoch.ai). DeepSeek's H800 GPUs have reduced interconnect bandwidth (relative to the H100) due to export restrictions, which the team mitigated through careful algorithmic scheduling (epoch.ai). They still likely deployed InfiniBand or NVLink-class networking to connect the GPUs – an expensive but necessary investment for scaling. In summary, the networking and storage infrastructure (hardware, setup, and maintenance) adds a few more millions of dollars to the overall project cost. These are largely one-time or fixed costs for DeepSeek's data center, but they were essential to the project's success.
- **Other Operational Costs:** Beyond the big items, there are other recurring costs: software licenses (if any proprietary tools were used, though much of the work builds on open-source frameworks), cloud services for auxiliary tasks (e.g. experiment tracking, backup storage), and support staff. Data cleaning and labeling may have required hiring annotators or contracting services (for example, to label some rewards or to filter undesirable content from the dataset). Once the model is developed, serving it (as DeepSeek does via an API and app) incurs inference infrastructure costs, but those are outside the scope of R&D. It is worth noting that operating at this scale is typically feasible only for well-funded organizations: DeepSeek's parent fund invested in building an AI supercomputer ahead of time, giving DeepSeek a platform on which to train models at marginal cost. Most academic labs or startups cannot afford such infrastructure, which highlights DeepSeek's unusual position.
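As a quick sanity check on the power and storage figures cited in the list above, here is the arithmetic. The per-GPU power draw, electricity rate, and bytes-per-token are illustrative assumptions, not measured values.

```python
# Rough power and storage estimates behind the figures in this section.
GPU_HOURS = 2_788_000
WATTS_PER_GPU = 300            # assumed average draw per H800 during training
USD_PER_KWH = 0.08             # assumed industrial electricity rate

energy_kwh = GPU_HOURS * WATTS_PER_GPU / 1000
print(f"Energy: ~{energy_kwh/1e6:.2f} GWh, ~${energy_kwh * USD_PER_KWH / 1e3:.0f}K in power")

TOKENS = 14.8e12
BYTES_PER_TOKEN = 4            # rough average for mixed web text; depends on the tokenizer
print(f"Raw corpus size: ~{TOKENS * BYTES_PER_TOKEN / 1e12:.0f} TB")
```

This reproduces the ~0.84 GWh / under-$100K power estimate and the ~50–60 TB corpus-size estimate quoted above.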
Summary: The infrastructure costs for DeepSeek V3/R1 split into capital expenditures (cluster hardware worth from ~$50–100M for a 2,048-GPU cluster up to hundreds of millions of dollars across all of DeepSeek's GPUs; therecursive.com, semianalysis.com) and operational expenditures (power, cooling, maintenance, staff, data management, etc., likely millions of dollars per year for a large cluster). Because this infrastructure is shared across many projects, only a fraction should be attributed to V3→R1 development. The compute for the final training run was ~$5.6M if rented (arxiv.org), but the true infrastructure cost behind it is larger once hardware amortization is included. In effect, DeepSeek traded cloud rental fees for owning a data center, which is economically sensible at their scale (cloud costs would have been higher in the long run).
4. Cost Comparison with GPT-4, Google Gemini, and Others
To put DeepSeek’s R&D expenditures in context, we compare them to known or estimated costs of other state-of-the-art models:
- **OpenAI GPT-3 (175B, 2020):** GPT-3's training was famously expensive for its time: training it on then-current hardware (V100 GPUs) is estimated to have cost around $4.6 million USD in compute (en.wikipedia.org). Some estimates ran several times higher, but $4.6M is often cited assuming an optimized run. Notably, GPT-3 used about 300 billion training tokens – roughly 1/50th of DeepSeek-V3's 14.8T – yet cost a similar order of magnitude in compute, because GPT-3 is a dense model trained with lower hardware efficiency. By 2023–24, the techniques DeepSeek employed (sparsity, lower precision, better parallelism) had dramatically improved the cost-per-token of training large models. In short, DeepSeek-V3 (open-source) achieved in 2024 a training cost on par with GPT-3 (a closed model from 2020) despite processing ~50× more data – highlighting the efficiency gains (see the cost-per-token sketch after this list).
- **OpenAI GPT-4 (2023):** OpenAI has not fully disclosed GPT-4's architecture, but it is widely believed to be significantly larger than GPT-3 and possibly to use MoE or other advanced setups. Sam Altman (OpenAI's CEO) stated that GPT-4 cost more than $100 million to train (en.wikipedia.org), and other analyses (e.g. the Stanford AI Index) estimate GPT-4's compute bill at around $78 million (pureai.com). Either way, GPT-4's training cost was on the order of $100 million – likely 15–20× higher than DeepSeek V3's $5–6M official compute cost. Part of this gap is scale (GPT-4 may have on the order of 1 trillion parameters (en.wikipedia.org) and possibly more training tokens), and part is brute-force dense training on massive clusters. The closed-model development style also incurs costs for extensive safety testing, multiple experimental runs, and keeping everything proprietary. Compared to that, DeepSeek's approach appears far more cost-efficient for the performance achieved: DeepSeek-R1 is reported to attain performance comparable to OpenAI's o1 model in many areas (github.com), yet was developed at a fraction of GPT-4's budget.
- **Google Gemini (2023):** Google's Gemini is a next-generation multimodal model. According to industry reports, Gemini Ultra (the largest version) cost roughly $191 million in compute to train (pureai.com). This staggering figure – more than double GPT-4's estimate – likely reflects enormous scale (possibly multiple trillions of parameters across modalities) and training on Google's TPU accelerators. Even the smaller "Gemini Pro" variant would have required a substantial compute budget of its own. Clearly, companies like Google are willing to spend nine figures on flagship models. In comparison, DeepSeek's entire operation from V3 to R1 likely did not approach even one-tenth of that cost. This highlights a philosophical difference: DeepSeek engineered for efficiency and open-sourced its work, whereas Gemini aimed to push absolute capability boundaries with virtually no expense spared. Gemini's capabilities (especially multimodal ones) may exceed DeepSeek's, but from a pure cost-performance standpoint DeepSeek demonstrates a more economical path to a high-quality model.
- **Other Open Models:** Meta's LLaMA 2 (70B, 2023) was another open model touted as high-performance. Meta did not publicize its training cost, but given the ~2 trillion training tokens, experts estimate it in the single-digit millions of dollars (similar to GPT-3's range). OpenAI's smaller ChatGPT variants (like GPT-3.5 Turbo) also benefited from iterative optimization – one analysis noted that by late 2023 the cost of training a GPT-4-level model had dropped sharply from the original ~$100M (team-gpt.com). DeepSeek's result aligns with this trend of plummeting training cost: they claim ~$5–6M for V3 (arxiv.org) and perhaps ~$1–2M more for the R1 fine-tuning, totaling under $10M in headline compute – far below the "$250 million estimate required to [train comparable models]" cited elsewhere (aiwire.net). While that $250M figure may describe an extreme case, it underscores that open projects like DeepSeek have shown it is possible to get 90% of the capability for a fraction of the cost.
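One way to see the efficiency gap highlighted in the GPT-3 comparison above is cost per trillion training tokens, using the rough figures quoted in this section. Both inputs are third-party estimates of final-run compute only, so treat the ratio as an order-of-magnitude signal.

```python
# Cost-per-token comparison implied by the figures cited in this section.
models = {
    # name: (estimated training cost in $M, training tokens in trillions)
    "GPT-3 (2020)":       (4.6, 0.3),
    "DeepSeek-V3 (2024)": (5.6, 14.8),
}

for name, (cost_musd, tokens_t) in models.items():
    print(f"{name}: ~${cost_musd / tokens_t:.2f}M per trillion training tokens")
# GPT-3 comes out around $15M per trillion tokens vs. ~$0.4M for V3, a ~40x improvement.
```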
In summary, DeepSeek R1's development budget is an order of magnitude lower than that of GPT-4 or Gemini. GPT-4 reportedly cost >$100M (en.wikipedia.org) and Gemini Ultra ~$191M (pureai.com); even Anthropic's Claude 2 and other rivals likely spent tens of millions. DeepSeek's clever R&D allowed it to spend perhaps ~$10M on compute (including experiments) plus additional tens of millions in other costs – still well below the nine-figure sums seen at big-tech labs. This cost advantage could make DeepSeek and similar open models very competitive in terms of return on investment for model development.
5. Estimated Total Cost and Economic Evaluation
Bringing together all the above components, we can estimate the total R&D cost from DeepSeek V3 through R1:
- **Compute/Cloud Costs:** Considering the final training runs only, DeepSeek-V3's training was ~$5.6M in GPU rental value (arxiv.org), and the R1 reinforcement learning stages perhaps another ~$1–2M (epoch.ai). Including experiments and trial runs, perhaps $10–15M worth of GPU time was expended in total to develop and refine V3 and R1 (multiple runs, tuning, etc.). This is the "burn" of running thousands of GPUs over many months cumulatively. Notably, the team themselves emphasize how economical the training was, at 2.8M GPU-hours per run (martinfowler.com), achieved via co-design of algorithms and hardware. Even if they ran two or three full-scale runs during development, the compute bill stays under $20M – very impressive for a model of this size.
- **Personnel Costs:** With ~150 highly qualified team members working for roughly a year (2024 into early 2025) on this effort, staff costs plausibly total $30–45M (assuming an average fully loaded cost of $200k–$300k per person-year, knowing some key researchers earn much more; semianalysis.com). This is a significant investment in human resources, reflecting not only salaries but the opportunity cost of these experts focusing on this project. That said, this spending bought the innovations (MLA, DeepSeekMoE, GRPO, etc.) that slashed the compute requirements. One might say DeepSeek traded additional R&D labor for lower training compute cost – a conscious decision that paid off in reaching a state-of-the-art model more cheaply. In industry terms, it is a classic engineering trade-off: spend more time and talent on optimization to save money on raw compute.
- **Infrastructure Capital Cost:** DeepSeek's hardware was likely already in place (thanks to High-Flyer's prior investments; semianalysis.com), but we can assign a portion to this project: using 2,048 GPUs for ~2 months is roughly 1/6 of a year of that hardware. If the 2,048 GPUs cost ~$50M to purchase (therecursive.com), then ~2 months of use "costs" roughly $2.8M in depreciation, assuming a ~3-year hardware lifespan (a short amortization sketch follows this list). This is a rough way to attribute capital cost; in any case, amortized hardware usage worth several million dollars was consumed by V3/R1. If the data center was purpose-built, one could also amortize construction and facility costs, but those are relatively small per project. The storage and networking gear dedicated to the training might add another ~$5–10M.
- **Data and Miscellaneous:** As a smaller component, data collection and curation might have cost on the order of $1M+ (storage for the corpus, possibly purchasing some datasets or contracting data cleaning). Any annotation for supervised fine-tuning or RL reward modeling (e.g. humans ranking outputs or writing prompts) would also factor in, though DeepSeek's papers suggest much of this was automated or mined from the web rather than manually labeled. We still include a minor allowance (perhaps a few hundred thousand dollars) for such activities.
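As a quick check on the depreciation figure cited in the infrastructure item above, here is the amortization arithmetic under the stated assumptions (~$50M cluster of 2,048 H800s, ~2 months of use, ~3-year useful life). The straight-line schedule is an assumption for illustration.

```python
# Straight-line amortization of the training cluster over an assumed 3-year life.
CLUSTER_COST_USD = 50_000_000   # ~2,048 H800s, per the ~$50M estimate cited above
LIFESPAN_MONTHS = 36            # assumed ~3-year useful life
USAGE_MONTHS = 2                # approximate duration of the V3 training run

attributed = CLUSTER_COST_USD * USAGE_MONTHS / LIFESPAN_MONTHS
print(f"Hardware cost attributable to the run: ~${attributed/1e6:.1f}M")   # ~$2.8M
```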
Adding these up, a plausible total R&D cost for DeepSeek V3 + R1 is in the tens of millions of dollars – roughly $50–75M if compute, personnel, and amortized infrastructure are all charged fully to this project, and somewhat less if only part of the year's payroll and hardware is attributed to V3/R1 specifically (a roll-up sketch follows below). Even at around $50M, this would be considered very cost-effective given the model's caliber. For comparison, GPT-4 reportedly cost $100M+ in compute alone (en.wikipedia.org), and with OpenAI's larger team and infrastructure the all-in total was likely well over $150M; against that, DeepSeek's figure is easily 3–5× cheaper overall.
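The roll-up below simply sums the per-component estimates from this section. Every line item is an estimate or assumption quoted above rather than a disclosed figure, and it charges a full year of payroll and two months of hardware life to this one project, so the total should be read as "order of tens of millions", not a precise sum.

```python
# Rolling up the per-component cost estimates discussed in this section ($M, low/high).
components_musd = {
    "final training runs (V3 + R1)":        (6, 8),
    "experiments / extra trial runs":       (4, 7),
    "personnel (~150 staff, ~1 year)":      (30, 45),
    "hardware amortization (2 mo, ~$50M)":  (2.8, 2.8),
    "storage / networking share":           (5, 10),
    "data collection & misc.":              (1, 2),
}

low = sum(lo for lo, _ in components_musd.values())
high = sum(hi for _, hi in components_musd.values())
print(f"Estimated total R&D cost: ~${low:.0f}M to ~${high:.0f}M")   # ~$49M to ~$75M
```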
Cost-Effectiveness: From an economic perspective, DeepSeek R1's development appears highly cost-effective if the goal is to approach cutting-edge AI capabilities without breaking the bank. By investing in research-driven efficiency, DeepSeek achieved a model that rivals the best for perhaps one-tenth the cost of the most expensive projects. Its strategy of sparse models (MoE) and aggressive hardware optimization paid off: the GPU-hours needed were dramatically reduced relative to a dense model of similar scale (martinfowler.com). This suggests an attractive ROI, where each dollar spent on research or compute yielded significant performance gains. It is also a validation of open research – many of DeepSeek's techniques built on prior academic ideas (Mixture-of-Experts, low-precision training, etc.), and DeepSeek in turn open-sourced its results to benefit others.
However, it is important to temper the popular "only $6M to train a GPT-4 competitor" headline, because, as detailed above, that figure omits substantial hidden costs. Experts have noted that the real cost is many times higher when all factors are included (therecursive.com); the engineering hours and multiple trial runs translate into dollars that push the total well beyond $6M. In essence, DeepSeek saved money on compute by spending more on R&D – a great trade for them, but the total spend was still significant. The difference is that here "significant" means tens of millions rather than hundreds of millions of dollars for a model of this caliber, and that is indeed a breakthrough for the community.
From a competitive standpoint, DeepSeek R1 demonstrates that a leaner operation can challenge the tech giants. OpenAI and Google may outspend DeepSeek by large factors, but the marginal performance gains from that extra budget are relatively small (closed models still hold an edge, but not an overwhelming one). For example, closed models held a median ~24% performance advantage on certain benchmarks as of early 2024 (pureai.com), despite often costing 10× as much. This implies diminishing returns from simply throwing more money at the problem, and a growing premium on smart techniques. If open models like DeepSeek continue to narrow the quality gap while keeping costs low, it bodes well for more accessible AI development.
Conclusion: The total investment from DeepSeek-V3 through R1 can be estimated at several tens of millions of dollars all-in. Compared to the $100M+ cost of GPT-4 (en.wikipedia.org) or the ~$191M for Gemini Ultra (pureai.com), DeepSeek's approach was highly economical. It shows that with innovative design (sparsity, efficient training) and a focused team, cutting-edge AI can be achieved at a fraction of the traditional cost. This dramatically improved cost-efficiency has positive implications: it lowers the barrier for academia and smaller companies to participate in frontier AI research, as evidenced by the recent surge in open-source models (pureai.com). In summary, DeepSeek R1's development was expensive in absolute terms (tens of millions of dollars), but relative to its competitors and capabilities it was a cost-effective endeavor. The model's success validates the economic viability of investing in research-driven optimization to reduce the brute-force costs of AI development – a strategy likely to be emulated by others aiming to build advanced models under budget constraints.
Sources:
- DeepSeek-V3 Technical Report (2024) – model size, data, and GPU-hours (arxiv.org)
- DeepSeek-R1 overview – model architecture and training approach (martinfowler.com, epoch.ai)
- Reddit discussion on DeepSeek cost – clarifying the ~$5.5M figure vs. real costs (reddit.com)
- Martin Vechev (INSAIT) commentary – true training cost is much higher than $6M once multiple runs and data work are included; notes 2,048 H800 GPUs cost ~$50–100M (therecursive.com)
- SemiAnalysis report – DeepSeek's GPU assets (~50k Hopper GPUs) and talent pool (~150 employees, high salaries) (semianalysis.com)
- Epoch AI analysis – estimated FLOPs and ~$1M cost for R1's RL phase; ~2 trillion tokens generated during RL (epoch.ai)
- Stanford HAI (AI Index 2024) – training cost estimates for GPT-4 (~$78M) and Gemini Ultra (~$191M) (pureai.com)
- Fortune / Lambda (via Wikipedia) – GPT-3 training cost ~$4.6M on V100 GPUs (en.wikipedia.org)
- OpenAI (Sam Altman) – GPT-4 training cost exceeded $100M (en.wikipedia.org)