Basically - "any model trained with ~28M H100 hours, which is around $50M USD, or any cluster with 10^20 FLOP/s, which is around 50,000 H100s, which only two companies currently have" - hat tip to nearcyan on Twitter for this calculation.
Specific language below.
"(i) any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations; and
(ii) any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI."
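For anyone who wants to check the arithmetic behind those round numbers, here is a rough sketch. The per-GPU throughput figures (about 989 TFLOP/s dense BF16 and about 1,979 TFLOP/s dense FP8 for an H100 SXM) and the $1.80 per GPU-hour price are my assumptions, not anything from the quoted text, and the default of 100% utilization is a best case that real training runs don't reach.

```python
# Back-of-envelope check of the quoted thresholds.
# ASSUMPTIONS (not from the quoted text): H100 SXM peak throughput
# figures and a hypothetical rental price per GPU-hour.
H100_BF16_DENSE_FLOPS = 989e12    # ops/s, assumed peak, dense BF16
H100_FP8_DENSE_FLOPS = 1979e12    # ops/s, assumed peak, dense FP8
ASSUMED_USD_PER_GPU_HOUR = 1.80   # hypothetical price

MODEL_THRESHOLD_OPS = 1e26           # total operations, clause (i)
CLUSTER_THRESHOLD_OPS_PER_S = 1e20   # operations per second, clause (ii)


def gpu_hours(total_ops, ops_per_second, utilization=1.0):
    """GPU-hours needed to perform total_ops at a given utilization."""
    return total_ops / (ops_per_second * utilization) / 3600


def gpus_needed(cluster_ops_per_s, ops_per_second):
    """Number of GPUs whose combined peak reaches cluster_ops_per_s."""
    return cluster_ops_per_s / ops_per_second


hours = gpu_hours(MODEL_THRESHOLD_OPS, H100_BF16_DENSE_FLOPS)
cost_usd = hours * ASSUMED_USD_PER_GPU_HOUR
gpus = gpus_needed(CLUSTER_THRESHOLD_OPS_PER_S, H100_FP8_DENSE_FLOPS)

print(f"GPU-hours at peak: {hours / 1e6:.1f}M")
print(f"Cost at assumed price: ${cost_usd / 1e6:.1f}M")
print(f"GPUs for cluster threshold: {gpus:,.0f}")
```

Under these assumptions the results land close to the ~28M hours, ~$50M, and ~50,000 GPUs quoted above. Passing something like `utilization=0.4` to `gpu_hours` scales the hours and the cost up by 2.5x.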
You have the connection speed between phones to worry about, as well as a different architecture. There’s a big difference between running the kernel over a new layer and its inputs locally within a GPU chip, and copying that data into packets, filling in all of the rest of the information associated with the packets, sending it to the phone’s radio, having it turned into radio waves, transmitting that to a cell tower, routing it through the network to the cell co, routing it on to the receiving phone’s cell tower (maybe via a satellite or two), transmitting it to the destination phone, decoding the radio waves, etc. I’m deliberately leaving out some details (like the BSD socket layers and encryption and decryption), and I’m sure I’m missing many other complications.
BUT it’s conceivable in the future, as tech improves and the gap between consumer hardware and what’s needed to run AGI narrows, and so on.
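To put a rough number on the phone comparison above, here is a small sketch. Every figure in it is an assumption for illustration only: a hypothetical 64 MiB block of activations, about 3.35 TB/s of on-GPU memory bandwidth, and a phone link with 10 Mbit/s uplink and 50 ms one-way latency. It also ignores all the per-packet overhead described above, so the real gap would be wider.

```python
# Rough comparison: moving one layer's activations on-GPU vs. between phones.
# ALL numbers below are illustrative assumptions, not measurements.

def transfer_seconds(num_bytes, bytes_per_second, latency_s=0.0):
    """Time to move num_bytes over a link: fixed latency plus size / bandwidth."""
    return latency_s + num_bytes / bytes_per_second


# Hypothetical activation block: 4096 tokens x 8192 hidden dims x 2 bytes (fp16)
ACTIVATION_BYTES = 4096 * 8192 * 2

GPU_MEMORY_BYTES_PER_S = 3.35e12     # assumed on-GPU memory bandwidth
PHONE_UPLINK_BYTES_PER_S = 10e6 / 8  # assumed 10 Mbit/s cellular uplink
PHONE_LATENCY_S = 0.05               # assumed 50 ms one-way latency

on_gpu = transfer_seconds(ACTIVATION_BYTES, GPU_MEMORY_BYTES_PER_S)
over_phone = transfer_seconds(
    ACTIVATION_BYTES, PHONE_UPLINK_BYTES_PER_S, PHONE_LATENCY_S
)

print(f"on-GPU:     {on_gpu * 1e6:.0f} microseconds")
print(f"phone link: {over_phone:.1f} seconds")
print(f"slowdown:   {over_phone / on_gpu:,.0f}x")
```

With these made-up inputs the phone link comes out more than a million times slower per transfer, which is the gap that would have to narrow.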