How do we design benchmarks that are resistant to shortcut learning and that actually evaluate how strong a model is at, say, NLP?
Should AI progress be made by chasing SOTA on benchmarks or
are there alternative routes?
How can we measure the “sentience” of an AI? Parrots are nowhere near as good at language as GPT-3, yet we intuit that they have more consciousness than GPT-3 does.
Why is that?
Is this intuition correct?
Has the AI Spring/Winter pattern cropped up in any
other fields? Is this a pattern unique to AI, or is there
some underlying cause for this over-optimism?
Why is symmetry so prevalent in nature? Can symmetry
be harnessed for neural network design?
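One place symmetry is already harnessed in network design is weight sharing that hard-codes an invariance (convolutions for translation, set encoders for permutation). Below is a minimal sketch of a permutation-invariant set encoder in the DeepSets style; the encoder phi, the shared weights W, and the toy data are all invented for illustration.

```python
import numpy as np

def phi(x, W):
    """Per-element encoder with weights shared across all set elements (illustrative)."""
    return np.tanh(x @ W)

def set_encoder(X, W):
    """Permutation-invariant pooling: summing over elements means shuffling the rows
    of X cannot change the output. The symmetry is enforced by the architecture,
    not learned from data."""
    return phi(X, W).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # a "set" of 5 elements with 3 features each
W = rng.normal(size=(3, 8))
perm = rng.permutation(5)

assert np.allclose(set_encoder(X, W), set_encoder(X[perm], W))
```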
Could you do policy iteration on a neural network? How do you adapt it to continuous state and action spaces?
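A hedged sketch of what policy iteration could look like when the value table is replaced by a network: alternate fitting Q under the current policy (evaluation) with acting greedily against it (improvement). The toy MDP, network width, and learning rate below are all made up; the point is the structure of the loop. For continuous actions the greedy argmax is no longer a cheap enumeration, which is roughly why actor-critic methods fit a separate policy network for the improvement step instead.

```python
import torch
import torch.nn as nn

# Toy discrete MDP, invented for illustration: 6 states, 3 actions.
n_states, n_actions, gamma = 6, 3, 0.9
torch.manual_seed(0)
P = torch.softmax(torch.randn(n_states, n_actions, n_states), dim=-1)  # transition probabilities
R = torch.randn(n_states, n_actions)                                   # rewards r(s, a)

q_net = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-2)
states = torch.eye(n_states)                      # one-hot state features
policy = torch.zeros(n_states, dtype=torch.long)  # start from an arbitrary policy

for _ in range(20):                               # outer policy-iteration loop
    # Policy evaluation: regress Q(s, a) onto r(s, a) + gamma * E_{s'}[Q(s', pi(s'))].
    for _ in range(200):
        q = q_net(states)                                 # (S, A)
        v_next = q.detach().gather(1, policy[:, None])    # (S, 1), value under the current policy
        target = R + gamma * (P @ v_next).squeeze(-1)     # expected Bellman target, (S, A)
        loss = ((q - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Policy improvement: act greedily w.r.t. the fitted Q. With continuous actions this
    # argmax is intractable, so one would fit an actor network to approximate it instead.
    policy = q_net(states).argmax(dim=1)

print(policy)  # greedy policy after the final improvement step
```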
What comes after tokenization?
The Bitter Lesson will probably consume it and replace it with something lower-level and more learnable.
Why is it that a transformer seems infinitely scalable?
Why doesn’t curriculum learning really do anything? It’s mad that a model is learning calculus and basic algebra at the same time, and that’s seemingly okay?!
How do you combine AI with humans in a productive
manner?
How do you ground AI in the real world?
What is wrong with the classic AI agent formulation?
What does AI for love look like?
Can we use AI for paper replication and verification, along with finding papers that it disagrees with?
Really, this is like accelerating Kuhnian revolutions: AIs should just accumulate and make extremely clear what the contradictions and open problems in a field are.
How could AI help us avoid monoculture?
Do we care more about Pass@K than about single-attempt (Pass@1) performance?
This has to do not only with creativity but with error correction: the model should be able to correct itself. It ties into the papers showing that training an LLM on wrong reasoning paths that are then corrected is actually more beneficial, because the model learns to backtrack. It isn’t enough to show it the right way to do things; you have to let it explore and then figure out how to self-correct. What does this look like in text modelling?
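Coming back to the Pass@K question above: the unbiased estimator commonly used in code-generation evaluation is short enough to sketch. Sample n completions, count the c correct ones, and estimate the chance that at least one of k draws (without replacement) succeeds; the example numbers below are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n sampled attempts, c of which are correct:
    the probability that at least one of k attempts drawn without replacement succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that only solves a task ~7.5% of the time per attempt still clears pass@10 over half the time.
print(pass_at_k(200, 15, 1))   # 0.075
print(pass_at_k(200, 15, 10))  # ~0.55
```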
Can you train a model purely on not outputting the wrong
answers? Does this not converge to the same thing as positive
training? What is the difference?
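One way to make “training purely on not outputting the wrong answers” concrete is an unlikelihood-style objective: maximize log(1 − p(wrong)) instead of log p(correct). The toy gradients below (made-up logits over a 4-token vocabulary) suggest why the two are not obviously equivalent: positive training says exactly where the probability mass should go, while negative-only training just spreads the freed mass over all other tokens in proportion to their current probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy 4-token vocabulary, made-up values
p = softmax(logits)
correct, wrong = 1, 0                       # illustrative token indices

# Positive training: gradient of -log p[correct] w.r.t. the logits is p - onehot(correct),
# so gradient descent moves mass onto the single correct token.
grad_pos = p.copy()
grad_pos[correct] -= 1.0

# Negative-only ("unlikelihood"-style) training: gradient of -log(1 - p[wrong]).
# The wrong token is pushed down, and the freed mass is spread over *all* other tokens
# in proportion to their current probability -- nothing singles out the correct one.
grad_neg = -p * p[wrong] / (1.0 - p[wrong])
grad_neg[wrong] = p[wrong]

print(grad_pos)
print(grad_neg)
```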
Does training on verifiable rewards lead to overall better
performance on unverifiable fields as well?
Chiang, T. (2023, February 9). ChatGPT Is a Blurry JPEG of the
Web. The New Yorker.
https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J., Rytting,
C., & Wingate, D. (2022). Out of One, Many: Using Language
Models to Simulate Human Samples (arXiv:2209.06899). arXiv.
http://arxiv.org/abs/2209.06899
Benton, G. W., Maddox, W. J., Lotfi, S., & Wilson, A. G.
(2021). Loss Surface Simplexes for Mode Connecting Volumes and
Fast Ensembling (arXiv:2102.13042). arXiv.
http://arxiv.org/abs/2102.13042
Chan, S. C. Y., Santoro, A., Lampinen, A. K., Wang, J. X.,
Singh, A., Richemond, P. H., McClelland, J., & Hill, F.
(2022). Data Distributional Properties Drive Emergent In-Context
Learning in Transformers (arXiv:2205.05055). arXiv.
http://arxiv.org/abs/2205.05055
Cong, Y., & Zhao, M. (2022). Big Learning: A Universal
Machine Learning Paradigm? (arXiv:2207.03899). arXiv.
http://arxiv.org/abs/2207.03899
Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T.,
Wenliang, L. K., Catt, E., Hutter, M., Legg, S., & Ortega, P.
A. (2022). Neural Networks and the Chomsky Hierarchy
(arXiv:2207.02098). arXiv. http://arxiv.org/abs/2207.02098
Dohan, D., Xu, W., Lewkowycz, A., Austin, J., Bieber, D.,
Lopes, R. G., Wu, Y., Michalewski, H., Saurous, R. A.,
Sohl-dickstein, J., Murphy, K., & Sutton, C. (2022). Language
Model Cascades (arXiv:2207.10342). arXiv.
http://arxiv.org/abs/2207.10342
Ha, D., & Tang, Y. (2022). Collective Intelligence for
Deep Learning: A Survey of Recent Developments (arXiv:2111.14377).
arXiv. http://arxiv.org/abs/2111.14377
Haluptzok, P., Bowers, M., & Kalai, A. T. (2022). Language
Models Can Teach Themselves to Program Better (arXiv:2207.14502).
arXiv. http://arxiv.org/abs/2207.14502
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai,
T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J.,
Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G.
van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen,
E., … Sifre, L. (2022). Training Compute-Optimal Large Language
Models (arXiv:2203.15556). arXiv.
http://arxiv.org/abs/2203.15556
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O.,
Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled
Neural Interfaces using Synthetic Gradients (arXiv:1608.05343).
arXiv. http://arxiv.org/abs/1608.05343
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., &
Stanley, K. O. (2022). Evolution through Large Models
(arXiv:2206.08896). arXiv. http://arxiv.org/abs/2206.08896
Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M.,
& Williams, M. (2022). Towards Understanding Grokking: An
Effective Theory of Representation Learning (arXiv:2205.10343).
arXiv. http://arxiv.org/abs/2205.10343
Power, A., Burda, Y., Edwards, H., Babuschkin, I., &
Misra, V. (2022). Grokking: Generalization Beyond Overfitting on
Small Algorithmic Datasets (arXiv:2201.02177). arXiv.
http://arxiv.org/abs/2201.02177
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y.,
Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker,
A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A.,
Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud,
R., Pack, C. C., … Kording, K. P. (2019). A deep learning
framework for neuroscience. Nature Neuroscience, 22(11),
1761–1770. https://doi.org/10.1038/s41593-019-0520-2
Sejnowski, T. (2022). Large Language Models and the Reverse
Turing Test (arXiv:2207.14382). arXiv.
http://arxiv.org/abs/2207.14382
Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W.,
Rao, J., Narang, S., Tran, V. Q., Yogatama, D., & Metzler, D.
(2022). Scaling Laws vs Model Architectures: How does Inductive
Bias Influence Scaling? (arXiv:2207.10551). arXiv.
http://arxiv.org/abs/2207.10551
The Bitter Lesson. (n.d.). Retrieved September 30, 2021, from
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Vogelstein, J. T., Verstynen, T., Kording, K. P., Isik, L.,
Krakauer, J. W., Etienne-Cummings, R., Ogburn, E. L., Priebe, C.
E., Burns, R., Kutten, K., Knierim, J. J., Potash, J. B., Hartung,
T., Smirnova, L., Worley, P., Savonenko, A., Phillips, I., Miller,
M. I., Vidal, R., … Yang, W. (2022). Prospective Learning: Back to
the Future (arXiv:2201.07372). arXiv.
http://arxiv.org/abs/2201.07372
Zador, A. M. (2019). A critique of pure learning and what
artificial neural networks can learn from animal brains. Nature
Communications, 10(1), 3770.
https://doi.org/10.1038/s41467-019-11786-6
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D.
(2018). mixup: Beyond Empirical Risk Minimization
(arXiv:1710.09412). arXiv. http://arxiv.org/abs/1710.09412