At WWDC 2024, Apple announced a team-up with OpenAI. ChatGPT will be embedded in many places in iOS 18, serving up writing assistance, creating custom memojis, and making Siri smarter.
And it costs you nothing!
However, does it cost Apple nothing?
According to reports, Apple is not paying OpenAI one thin dime for this integration. So why would OpenAI do this? After all, it requires enormous processing power, management time, development work, etc. It’s easy to understand Apple’s motivations: it gets leading-edge AI integrated into its platforms.
But what does OpenAI get?
The Data Wall
In his landmark series of essays, Situational Awareness, Leopold Aschenbrenner explains the “data wall”:
There is a potentially important source of variance for all of this: we’re running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.
Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data). Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public github repos are estimated to be in low trillions of tokens.
Training LLMs requires vast amounts of data, and we’re effectively out of it. All the large publicly available datasets have already been trained on to death, and clever moves like sucking in every Reddit post or every tweet have already been made. There are ebooks, YouTube, and other sources, but the core problem is that there is only so much data.
Now Apple is presenting OpenAI with a vast new storehouse of data: everything its hundreds of millions of customers do on their iPhones, iPads, Watches, and Macs.
Remember, Sam Altman is a Liar
Now, of course, they won’t be training on Apple user data. Of course not.
Also remember that, at every turn, Sam Altman has ignored safety, rules, and limits. That’s why OpenAI’s board canned him in November 2023 (he was back within days).
If you honestly believe that your Apple-originated data is not going to be used to improve ChatGPT’s models, I have a bridge in New York I’d like to sell you.
The official story is that Apple is acting as a brand ambassador for OpenAI…wink, wink.