How do we design benchmarks that resist shortcut learning and
actually measure how capable a model is at, for example, NLP?
Should AI progress be driven by chasing SOTA on benchmarks, or
are there alternative routes?
How can we measure the “sentience” of an AI? Parrots are nowhere
near as good at language as GPT-3, yet we intuit that they have more
consciousness than GPT-3 does.
Why is that?
Is this intuition correct?
Has the AI Spring/Winter pattern cropped up in
other fields? Is this pattern unique to AI, or is there
some underlying cause for this recurring over-optimism?
Why is symmetry so prevalent in nature? Can symmetry
be harnessed for neural network design?
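One concrete way to harness symmetry is to bake it into the architecture itself. A minimal sketch (weights and sizes are made up for illustration) of a Deep Sets-style layer, where sharing one per-element transform and sum-pooling makes the output invariant to the permutation symmetry of sets:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # weights shared across all set elements

def set_embed(x):
    # x: (n_elements, 3) -> permutation-invariant embedding of shape (8,)
    # Applying the same map to every element, then sum-pooling, means
    # reordering the inputs cannot change the output.
    return np.tanh(x @ W).sum(axis=0)

x = rng.normal(size=(5, 3))
perm = rng.permutation(5)
assert np.allclose(set_embed(x), set_embed(x[perm]))  # order doesn't matter
```

The same idea generalizes: convolutions hard-code translation symmetry, and equivariant networks extend this to rotations and other group actions.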
Could you do policy iteration with a neural network?
How would you adapt it to continuous state and action spaces?
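For reference, tabular policy iteration on a toy MDP (the transition and reward numbers below are invented for illustration). The neural variant would replace the exact table with a regression fit (fitted policy iteration), and continuous actions typically swap the argmax for a learned actor, as in actor-critic methods:

```python
import numpy as np

# Toy 2-state, 2-action deterministic MDP:
# P[s, a] = next state, R[s, a] = reward (made-up numbers)
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9
pi = np.zeros(2, dtype=int)  # start with action 0 in both states

for _ in range(20):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
    R_pi = R[np.arange(2), pi]
    P_pi = np.zeros((2, 2))
    P_pi[np.arange(2), P[np.arange(2), pi]] = 1.0
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily w.r.t. one-step lookahead
    Q = R + gamma * V[P]
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break  # policy stable -> optimal
    pi = new_pi

print(pi)  # → [1 0]: cycle between the states to collect both rewards
```

The evaluation step is where a network would enter: instead of solving the linear system, fit V (or Q) by minimizing a Bellman regression loss over sampled transitions.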
What comes after tokenization?
The Bitter Lesson will probably consume it, replacing it with
something lower-level that is more learnable.
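One candidate for that lower level is raw UTF-8 bytes, as explored by byte-level models such as ByT5: every string maps onto a fixed vocabulary of 256 symbols, with no merge rules to learn or to break on out-of-distribution text.

```python
# Byte-level "tokenization": no learned vocabulary, just UTF-8.
text = "naïve"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # → [110, 97, 195, 175, 118, 101]

# The trade-off: sequences get longer ('ï' alone costs two bytes),
# shifting the burden from the tokenizer to the model.
assert len(byte_ids) == 6   # 6 bytes for 5 characters
assert all(0 <= b < 256 for b in byte_ids)
```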
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J., Rytting,
C., & Wingate, D. (2022). Out of One, Many: Using Language
Models to Simulate Human Samples (arXiv:2209.06899). arXiv.
http://arxiv.org/abs/2209.06899
Benton, G. W., Maddox, W. J., Lotfi, S., & Wilson, A. G.
(2021). Loss Surface Simplexes for Mode Connecting Volumes and
Fast Ensembling (arXiv:2102.13042). arXiv.
http://arxiv.org/abs/2102.13042
Chan, S. C. Y., Santoro, A., Lampinen, A. K., Wang, J. X.,
Singh, A., Richemond, P. H., McClelland, J., & Hill, F.
(2022). Data Distributional Properties Drive Emergent In-Context
Learning in Transformers (arXiv:2205.05055). arXiv.
http://arxiv.org/abs/2205.05055
Chiang, T. (2023, February 9). ChatGPT Is a Blurry JPEG of the
Web. The New Yorker.
https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Cong, Y., & Zhao, M. (2022). Big Learning: A Universal
Machine Learning Paradigm? (arXiv:2207.03899). arXiv.
http://arxiv.org/abs/2207.03899
Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T.,
Wenliang, L. K., Catt, E., Hutter, M., Legg, S., & Ortega, P.
A. (2022). Neural Networks and the Chomsky Hierarchy
(arXiv:2207.02098). arXiv. http://arxiv.org/abs/2207.02098
Dohan, D., Xu, W., Lewkowycz, A., Austin, J., Bieber, D.,
Lopes, R. G., Wu, Y., Michalewski, H., Saurous, R. A.,
Sohl-dickstein, J., Murphy, K., & Sutton, C. (2022). Language
Model Cascades (arXiv:2207.10342). arXiv.
http://arxiv.org/abs/2207.10342
Ha, D., & Tang, Y. (2022). Collective Intelligence for
Deep Learning: A Survey of Recent Developments (arXiv:2111.14377).
arXiv. http://arxiv.org/abs/2111.14377
Haluptzok, P., Bowers, M., & Kalai, A. T. (2022). Language
Models Can Teach Themselves to Program Better (arXiv:2207.14502).
arXiv. http://arxiv.org/abs/2207.14502
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai,
T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J.,
Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G.
van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen,
E., … Sifre, L. (2022). Training Compute-Optimal Large Language
Models (arXiv:2203.15556). arXiv.
http://arxiv.org/abs/2203.15556
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O.,
Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled
Neural Interfaces using Synthetic Gradients (arXiv:1608.05343).
arXiv. http://arxiv.org/abs/1608.05343
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., &
Stanley, K. O. (2022). Evolution through Large Models
(arXiv:2206.08896). arXiv. http://arxiv.org/abs/2206.08896
Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M.,
& Williams, M. (2022). Towards Understanding Grokking: An
Effective Theory of Representation Learning (arXiv:2205.10343).
arXiv. http://arxiv.org/abs/2205.10343
Power, A., Burda, Y., Edwards, H., Babuschkin, I., &
Misra, V. (2022). Grokking: Generalization Beyond Overfitting on
Small Algorithmic Datasets (arXiv:2201.02177). arXiv.
http://arxiv.org/abs/2201.02177
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y.,
Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker,
A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A.,
Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud,
R., Pack, C. C., … Kording, K. P. (2019). A deep learning
framework for neuroscience. Nature Neuroscience, 22(11),
1761–1770. https://doi.org/10.1038/s41593-019-0520-2
Sejnowski, T. (2022). Large Language Models and the Reverse
Turing Test (arXiv:2207.14382). arXiv.
http://arxiv.org/abs/2207.14382
Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W.,
Rao, J., Narang, S., Tran, V. Q., Yogatama, D., & Metzler, D.
(2022). Scaling Laws vs Model Architectures: How does Inductive
Bias Influence Scaling? (arXiv:2207.10551). arXiv.
http://arxiv.org/abs/2207.10551
Sutton, R. S. (2019, March 13). The Bitter Lesson. Retrieved
September 30, 2021, from
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Vogelstein, J. T., Verstynen, T., Kording, K. P., Isik, L.,
Krakauer, J. W., Etienne-Cummings, R., Ogburn, E. L., Priebe, C.
E., Burns, R., Kutten, K., Knierim, J. J., Potash, J. B., Hartung,
T., Smirnova, L., Worley, P., Savonenko, A., Phillips, I., Miller,
M. I., Vidal, R., … Yang, W. (2022). Prospective Learning: Back to
the Future (arXiv:2201.07372). arXiv.
http://arxiv.org/abs/2201.07372
Zador, A. M. (2019). A critique of pure learning and what
artificial neural networks can learn from animal brains. Nature
Communications, 10(1), 3770.
https://doi.org/10.1038/s41467-019-11786-6
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D.
(2018). mixup: Beyond Empirical Risk Minimization
(arXiv:1710.09412). arXiv. http://arxiv.org/abs/1710.09412