~ Encouraged by enlightening conversations with Prof Leon Bergen. All views are solely mine and do not reflect anyone else’s.

Background

Generally, monetary betting is frowned upon and is illegal in India. It is seen as disrespectful to the Goddess of Wealth. However, I realized that if I am to succeed as a researcher, I need to learn how to identify research problems and argue for initial hypotheses. As an engineer, I built my value function by simply using my tools and having others use my tools and give me feedback. Unlike for products, I believe there is a lot of noise in researchers’ value functions, especially in a nascent field such as Machine Learning.

Prof Bergen and I often discussed hosting bets to gauge if our predictions were correct. As such, this will be an evolving post, formatted as follows:

  • I will be making bets every 6 months
  • Each bet will be written / rewritten over that week to allow me to finalize my predictions
  • I will be making predictions on the following timescales: 6 months / 1 year / 5 years
  • When a bet’s duration expires, I will evaluate whether my predictions were successful

Week 28th June 2024 to 5th July 2024

6 Month Bets

Current Background: Working on educational apps, thus my evaluation of models is biased by this work.

Bet 1: Claude 3.5 Opus will be the LLM frontier model

I believe the best model for the next 6 months on most tasks will be the Claude 3.5 Opus model. There are a few reasons for me to believe this:

  • Claude 1 / Claude 2 were clearly behind the latest GPT models at the time of their release
  • Claude 3, when released, was one year behind the GPT-4 models but was the only comparable model
  • Claude 3.5 Sonnet is the first model to conclusively beat the current OpenAI flagship models
  • Claude 3.5 Opus > Claude 3.5 Sonnet
  • The above progression makes me believe that OpenAI’s pace of developing frontier models has comparatively slowed down
  • Google released Gemini 1.5 Pro only a few months ago, and Claude 3.5 Sonnet seems to be better
  • I believe Google / OpenAI are increasingly focused on new ways of deployment and possibly on post-training of models
  • As Meta’s Chameleon work pointed out, training end-to-end multimodal models has additional challenges, which, combined with my next point, I believe further slowed the pace of developing frontier models
  • Given the instability at OpenAI, I believe OpenAI’s focus has shifted more to the deployment of AI models rather than to intensely developing frontier capabilities
  • This has happened before: GPT-2 came in 2019, GPT-3 in 2020, but GPT-4 came in 2023 (I believe OpenAI had a setback of a year due to figuring out Mixture-of-Experts and researchers leaving to create Anthropic)

Note: By frontier, I don’t necessarily mean best at public benchmarks. It’s quite possible that OpenAI / Google develop a new post-training / distillation methodology that boosts performance, similar to how InstructGPT was better than GPT-3 despite being 100x smaller. For example, GPT-4o has better conversational abilities but is worse at teaching (in my view) than GPT-4-turbo or GPT-4, because it likely can’t compute higher-order pedagogical reasoning.

So while GPT-4o is arguably better to talk to and follows instructions better, its foundational frontier capabilities are not better than GPT-4’s.

Here are the capabilities which I consider foundational:

  1. Zero-shot reasoning
  2. In-context learning
  3. Long-context reasoning

Verifiability: If I am using the OpenAI API for developing most of my applications at the end of the 6-month period, then I would say I was wrong. Right now my usage is 100% OpenAI. It is unclear to me what benchmarks exist to evaluate in-context learning and long-context reasoning as frontier tasks.

Note: I am unclear what the frontier model is for other modalities.

Bet 2: OpenAI will shift from the ChatGPT product to an agent deployment

OpenAI will showcase new post-training and/or scaffolding / “unhobbling” in 6 months.

We have already seen OpenAI introduce the notion of memory in ChatGPT. It is clear they want to develop a desktop assistant agent app. However, assistant tasks have a longer horizon than what the current ChatGPT is capable of. In my view, it is a post-training problem to figure out and maintain the right context to get the models to work; there is not enough long-horizon pre-training data. Furthermore, no matter how capable the model is at long-horizon tasks, OpenAI will still need to develop scaffolding techniques to handle memory. It is unlikely that we deploy models where the context window contains all previous tokens for every small task we would want to do; a minimal sketch of what such memory scaffolding could look like follows below.
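
To make the scaffolding point concrete, here is a minimal sketch, assuming a retrieval-style memory where only the most relevant past exchanges are placed into the context window. This is my own toy illustration, not OpenAI’s method; the lexical-overlap scoring is a stand-in for a real embedding-based retriever.

```python
# Toy sketch of retrieval-based memory scaffolding (hypothetical illustration).
# A real system would use embeddings and a vector store; word overlap stands in here.
from collections import Counter

memory = []  # past (user_turn, assistant_turn) pairs accumulated over many sessions

def remember(user_turn: str, assistant_turn: str) -> None:
    memory.append((user_turn, assistant_turn))

def relevance(query: str, text: str) -> int:
    # crude lexical overlap as a stand-in for an embedding similarity score
    return sum((Counter(query.lower().split()) & Counter(text.lower().split())).values())

def build_prompt(query: str, k: int = 3) -> str:
    # retrieve only the k most relevant past exchanges instead of the full history
    top = sorted(memory, key=lambda turn: relevance(query, turn[0] + " " + turn[1]), reverse=True)[:k]
    context = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in top)
    return f"Relevant past exchanges:\n{context}\n\nCurrent request: {query}"

remember("Plan my calculus study schedule", "Here is a weekly plan...")
remember("What is my dog's name?", "You told me it is Argo.")
print(build_prompt("Update my calculus plan for the midterm"))
```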

Furthermore, this fall will mark two years since ChatGPT, which arguably kick-started this revolution. It’s a good time to introduce the next big thing, in my view.

Verifiability: OpenAI releases a blog post talking about their new technique and we see the new desktop app in action

Bet 3: If LLAMA3-400B releases, there will be an end-to-end finetuned multimodal model

If LLAMA3-400B releases, there will be an end-to-end finetuned multimodal LLAMA3-400B (text/image guaranteed, possibly text/image/audio).

GPT-4 was the first model that could use tools, which is crucial for applications. LLAMA3-400B, in my view, would be the first open-source model useful for end-user assistant applications (for someone who is not a prompt engineer), and I fundamentally believe that end-to-end multimodality is the better way to interact with an application. For example, how else can someone ask for help while working on a CAD design?

Verifiability: If it’s open-source, then it’s easily verifiable. It is difficult to verify if it’s closed-source. I can only confirm the closed-source case if such an application exists and someone in the industry tells me they are using LLAMA3 without making me sign an NDA.

Bet 4: OpenAI will not release end-to-end multimodal models freely on their API

This means one cannot just call the end-to-end model without prior permission from OpenAI. In my view, OpenAI will focus on enterprise customers, with whom they can get longer-term usage agreements, rather than on individual developers. I believe this for the following reasons:

  • Right now people can easily switch APIs every month or every few months depending on whichever model is better. This was fine when OpenAI reigned supreme. As I mentioned, I don’t believe that will be the case in the future, and for most hidden backend use-cases smaller models and open-source models work fine.
  • It reduces the risk of incorrect deployment of models causing harm to the OpenAI brand. Since OpenAI is the public face of AI right now, anytime I see articles on how AI is bad, there is often an association with OpenAI. Thus, OpenAI would want to reduce this going forward, given how easily an end-to-end multimodal model can be misused.
  • OpenAI would want to reduce competition for its desktop agent app, which I believe has the potential to become the interface for the internet, upending Google (I don’t mean that OpenAI develops a search engine, just that it issues calls to a search engine and becomes the interface)

Verifiability: Easily verifiable

End Date: January 5th 2025


1 Year Bets

Current Background: Working on educational apps, thus my evaluation of models is biased by this work.

Bet 1: GPT-5 generation frontier model will release or be announced

  • GPT-3 released in 2020 and GPT-4 in 2023; a GPT-5 level model should be announced or released in 2025
  • Anthropic will develop a GPT-5 level model before OpenAI, but they might not announce it before OpenAI. Anthropic’s rate of progress on frontier models seems to be faster than OpenAI’s; this could be the latecomer’s advantage, but I believe Anthropic has been making consistent research progress due to their lack of product diversity and their researcher retention (as Dario Amodei put it, all the founders are still at Anthropic)
  • I truly believe Anthropic’s work on interpretability is important and improves our understanding of scaling, which in turn allows faster scaling
  • Mira Murati went on record saying GPT-5 will come a year and a half later (putting OpenAI’s timeline around December 2025, which is around OpenAI’s 10-year anniversary)
  • DeepMind’s Gemini Pro released in Dec 2023 and Gemini 1.5 in Feb 2024, so it is highly likely Gemini 2 will release in Dec 2024. But I don’t believe Gemini 2 will be GPT-5 level (apart from the context window length claims). I think Google wants to quickly release AI products to protect search, while DeepMind wants to do research, and there is a conflict of interests in my view. This is why Gemini’s main selling point has been its context window length.

I will predict what I believe the model will be capable of by listing tasks it can do:

  1. Web Research: Given a task, such as “write a report on underperforming stocks based on relevant metrics”, the model (only one model) would be able to figure out the relevant things to search, reason across the long context of its actions, and compose a highly detailed report that includes figures. A standard report, but one where the relevant data needs to be identified and collected to write such automated reports.

  2. Education: Solves all undergraduate-level tutoring. Based on task 1, it is highly likely it can come up with lesson plans and deliver lectures. But it would not be able to do so for new graduate topics classes that haven’t been taught before, for example, any course at UCSD taught under the code CSE 291.

  3. Software Engineering: Gets a score greater than 80% on SWE-bench (the current highest is 19%)

Verifiability: Easily verifiable by announcements. The Anthropic claim is hard to verify if they don’t announce it, but I will count my claim as true if their release comes within a month of OpenAI releasing GPT-5. For my prediction regarding tasks, I will need to look up or create a benchmark.

Bet 2: 2025 will mark the highest GenAI penetration in consumer products

Basically, the first app that people will use won’t be the browser.

  • While all I can think about these days is generative models, most ordinary people don’t use, or wouldn’t think to use, ChatGPT on a daily basis. It’s probably just something they heard about in the news or a comedy sketch.
  • GPT-4’s tool-use ability made it the first model that can be directly, somewhat, useful to an ordinary human; GPT-3 was useless. It is highly likely that a GPT-5 level model plus OpenAI’s desktop agent will change how everybody uses their computer. Microsoft and Apple have also released their own visions of how LLMs will be used. But I truly believe a GPT-5 level model and new scaffolding techniques are required to truly change how people interact with their computers.
  • A GPT-5 level model requires rethinking how human workflows are carried out in companies, but that is harder, in my view, than giving people a chatbot subscription and asking them to use it. Thus, if predictions of an “impending recession” materialize (in my view recessions / growth are always impending), businesses would likely incorporate AI due to worker cuts.

Verifiability: Easily verifiable

Bet 3: Some data distributions make language models easier to interpret mechanistically

This is just my wish. A lot of people write about how GPT can’t do this or that. Then they scoff when others point to transformers trained on reasoning tasks (arguing that it took too much data). I believe the following:

  • Let task T be a relation between languages L_I and L_O, where a language is a set of strings composed of elements of an alphabet
  • If we have a verifier V such that V(y | x) is true if T(x) = y, then for any given input x from L_I, we can run V(y_i | x) on all y_i from L_O. However, this is an inefficient brute-force algorithm: the complexity of generating an answer is the number of candidate strings times the cost of verifying an individual string (which itself scales with string length). A toy sketch of this contrast follows the list.
  • We can argue that a smarter algorithm would need to run the verifier on fewer strings
  • If an algorithm can reason over a task, it generates only one string, which is accepted by the verifier
  • Assumption: Reasoners are easier to interpret than other algorithms since it is the same algorithm for all instances
  • For some tasks, the space spanned by transformers of a particular size contains reasoners
  • During training, we eliminate some hypotheses based on empirical risk
  • If we have a data distribution, then with sufficiently many samples, we could narrow down to the hypotheses that reason (I wonder how the number of samples varies depending on the data distribution)
  • It would be desirable to know the relationship between distributions of data and the abilities of models trained on them
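
As a toy illustration of the brute-force versus reasoner contrast above (my own construction, with a string-reversal task standing in for T):

```python
# Toy illustration: generating an answer for a task T by brute-force verification
# versus by "reasoning" (one proposed string per instance).
from itertools import product

ALPHABET = "01"

def T(x: str) -> str:           # the task: reverse a binary string
    return x[::-1]

def V(y: str, x: str) -> bool:  # verifier: accepts y iff T(x) = y
    return y == T(x)

def brute_force(x: str) -> tuple[str, int]:
    # enumerate every candidate string of the same length and verify each one
    calls = 0
    for candidate in product(ALPHABET, repeat=len(x)):
        calls += 1
        y = "".join(candidate)
        if V(y, x):
            return y, calls
    raise ValueError("no accepted string")

def reasoner(x: str) -> tuple[str, int]:
    # a "reasoner" runs the same procedure for every instance and proposes one string
    y = x[::-1]
    return y, 1  # only one verifier call needed: V(y, x)

x = "10110"
print(brute_force(x))  # ('01101', 14) -- up to |ALPHABET|**len(x) = 32 verifier calls
print(reasoner(x))     # ('01101', 1)
```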

Note: It is unclear to me whether reasoning is simply pattern matching or not. After all, one can say that, given a formal system with resolution rules, one only needs to identify the relevant rule for each deduction, and we can keep doing this until we have deduced all statements. If reasoning is pattern matching, then I think a model that reasons can be more easily interpreted.

By properties, I mean properties that are verifiable by mechanistic interpretability, not by task performance.

Verifiability: Easily verifiable if a paper exists. I would not be surprised if such a paper already exists; I know there is an entire body of work on just manipulating data to see its effect on models.

Jul 1st Undermind result:

No papers were found that empirically compare different preprocessing techniques and their impact on the interpretability of transformer-based language models.

Most retrieved papers focus on the performance impact of preprocessing techniques on transformer models. For instance, one explores how different preprocessing steps influence BERT’s sensitivity and classification outcomes in detecting offensive language, but does not focus on interpretability. Papers delve into improving the interpretability of transformers using attention-based methods but do not compare preprocessing techniques. Comprehensive studies directly addressing the impact of preprocessing on both performance and interpretability of transformer-based models are needed for thorough empirical analysis.

End Date: July 5th 2025


5 Year Bets

Bet 1: Smarter LLMs can be trained from data generated by dumber LLMs without leading to nonsensical outputs

Basic Argument: Humans generated all their language based on observations of the world. Children learn from this language to gain world knowledge. AI can get observations from the world. Any AI that claims to be human level must be able to learn from its own outputs.

Assumption: On average, newer generations are smarter than older generations. I would say economic output is an indicator?

The basic argument depends on the notion that AI has to be human-level intelligent. We don’t require AI to be human-level intelligent. Referring to my informal definition of smartness above: if AI comes close to human intelligence, then we can expend more compute and still generate human-quality outputs. We may just not be human-level efficient in generating those outputs.

We do not currently see this since current AIs are too dumb and do not narrow the search space enough for us to find high-quality generations. Furthermore, it could be that current models cannot perform verification for such human-level tasks.
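
Here is a minimal sketch of the generate-and-verify loop I have in mind, using a toy arithmetic task; weak_generator and verifier are stand-ins of my own for a weaker LLM and a task verifier, not any published pipeline.

```python
# Toy sketch: build a training set from a weaker generator plus a verifier by sampling
# many candidates and keeping only the verified ones.
import random

def weak_generator(question: tuple[int, int], n_samples: int = 16) -> list[int]:
    a, b = question
    # a "dumb" model: often wrong, occasionally right
    return [a + b + random.choice([-2, -1, 0, 0, 1, 2]) for _ in range(n_samples)]

def verifier(question: tuple[int, int], answer: int) -> bool:
    a, b = question
    return answer == a + b

def build_training_set(questions: list[tuple[int, int]]) -> list[tuple[tuple[int, int], int]]:
    data = []
    for q in questions:
        accepted = [y for y in weak_generator(q) if verifier(q, y)]
        if accepted:  # keep only verified generations for training a stronger model
            data.append((q, accepted[0]))
    return data

questions = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(100)]
data = build_training_set(questions)
print(f"kept {len(data)} / {len(questions)} questions with a verified answer")
```

The point of the sketch is that extra compute (more samples) plus verification can recover high-quality data even from a weak generator; the hard part for human-level tasks is having a verifier at all.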

But in my view, a GPT-6 level model (imagine two more jumps in capability, akin to going from GPT-2 to GPT-4) would get us close to human intelligence. I don’t think it will be an accepted AGI model (again, I am fuzzy on the definition of AGI); hence my prediction, which I believe is easier to verify. This would also signal that AIs can then reproduce themselves without the need for humans to generate any new knowledge.

Verifiability: Easy to verify if we see academic results. Hard to know in industry.

Bet 2: We will observe another Chain Of Thought type phenomenon or an upgraded version of it

What does a Chain of Thought type phenomenon mean?

It’s a phenomenon where models can bootstrap their performance with no outside influence except a few tokens such as “Think step-by-step” (a minimal sketch follows the list below), or with no outside tokens at all.

  • They start writing / searching for tools to solve problems without being prompted?
  • The pre-trained models start writing plans / doing chain-of-thought style reasoning to solve tasks without being prompted / trained to do so
  • For a given task T, they can automatically identify relevant skills to learn and get examples of them?
  • Something wilder?
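
For reference, the original zero-shot chain-of-thought trick amounts to nothing more than appending a few tokens to the prompt. A minimal sketch, where query_model is a hypothetical placeholder for whatever LLM API is used:

```python
# Hypothetical sketch of zero-shot chain-of-thought prompting. query_model is a stub;
# only the prompt construction matters here.
def query_model(prompt: str) -> str:
    return f"<model response to: {prompt!r}>"  # stub

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
            "How much does the ball cost?")

direct_answer = query_model(question)                         # plain prompt
cot_answer = query_model(question + "\nThink step-by-step.")  # the few extra tokens
print(cot_answer)
```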

Verifiability: Easily verifiable since everybody will be doing this in their benchmarks

Bet 3: We don’t need any RLHF

RLHF, in short, involves the following (a toy sketch follows the list):

  1. Collecting user feedback on model responses
  2. Finetuning a classifier head on a pre-trained model to create a reward model to rate more responses
  3. Doing RL on the pre-trained model to increase the reward assigned by the reward model
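
As a toy, self-contained illustration of those three stages (my own sketch, not any production RLHF pipeline): the “policy” is a distribution over a handful of canned responses, the “reward model” is a win-rate table fit from pairwise preferences, and the “RL” step is an exponential re-weighting of the policy toward high-reward responses.

```python
import math
from collections import defaultdict

responses = ["rude answer", "helpful answer", "verbose answer"]
policy = {r: 1 / len(responses) for r in responses}  # pre-trained model's output distribution

# 1. Collect human feedback as pairwise preferences (winner, loser)
preferences = [("helpful answer", "rude answer"),
               ("helpful answer", "verbose answer"),
               ("verbose answer", "rude answer")]

# 2. "Train" a reward model: here, a simple win-rate score per response
wins, games = defaultdict(int), defaultdict(int)
for winner, loser in preferences:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1
reward = {r: wins[r] / games[r] for r in responses}

# 3. "RL" step: tilt the policy toward higher reward (beta controls the strength)
beta = 5.0
unnormalized = {r: p * math.exp(beta * reward[r]) for r, p in policy.items()}
z = sum(unnormalized.values())
policy = {r: w / z for r, w in unnormalized.items()}

print(reward)  # {'rude answer': 0.0, 'helpful answer': 1.0, 'verbose answer': 0.5}
print(policy)  # probability mass shifts toward 'helpful answer'
```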

Why do we not need RLHF in the future?

  • If you believe that zero-shot and in-context learning are core model abilities that get better with scale, it stands to reason that a smarter pre-trained model can assign rewards to responses without the need for additional fine-tuning (see the sketch after this list)
  • It is also possible that we don’t need to do RL on the pre-trained model, but rather simply give it examples of good responses for it to generate good responses
  • Constitutional AI demonstrates this in a limited manner to remove harmful responses
  • Mechanistic interpretability improves enough to identify undesirable behaviour, and we don’t need to rely on fine-tuning
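
A hypothetical sketch of the first two alternatives: zero-shot reward assignment by a stronger pre-trained model, and in-context conditioning on good responses instead of RL fine-tuning. query_model is again a placeholder stub for whatever strong model is available; a real system would parse its output more carefully.

```python
def query_model(prompt: str) -> str:
    return "4"  # stub completion

def zero_shot_reward(prompt: str, response: str) -> int:
    # ask the strong model to act as the reward model, no fine-tuning involved
    judge = (f"Rate the following response to the prompt on a scale of 1-5.\n"
             f"Prompt: {prompt}\nResponse: {response}\nRating:")
    return int(query_model(judge))

def in_context_generate(prompt: str, good_examples: list[tuple[str, str]]) -> str:
    # condition on examples of good responses instead of doing RL on the model
    shots = "\n\n".join(f"Prompt: {p}\nGood response: {r}" for p, r in good_examples)
    return query_model(f"{shots}\n\nPrompt: {prompt}\nGood response:")

print(zero_shot_reward("Explain gravity to a child", "Gravity pulls things toward the ground..."))
```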

Bet 4: College becomes more expensive and fewer people attend college

Verifiability: Easy

Bet 5: Median household incomes in the US fall from 2022 levels (accounting for inflation), but credit interest rates fall to near 0

Verifiability: Easy

End Date: July 5th 2029


Last Updated: July 7th (as I forgot to do this during the week, this series of bets will not be updated further)