How do we design benchmarks that resist shortcut learning and
actually measure how capable a model is at, for example, NLP?
Should AI progress be driven by chasing SOTA on benchmarks, or
are there alternative routes?
How can we measure the “sentience” of an AI? Parrots are nowhere
near as good at language as GPT-3, yet we intuit that they have more
consciousness than GPT-3 does.
Why is that?
Is this intuition correct?
Has the AI Spring/Winter pattern cropped up in
other fields? Is this pattern unique to AI, or is there
some underlying cause for this recurring over-optimism?
Why is symmetry so prevalent in nature? Can symmetry
be harnessed for neural network design?
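One concrete way to harness symmetry is to bake it into the architecture itself. A minimal sketch (weights and sizes are made up for illustration) of a Deep Sets-style layer, where sharing one per-element transform and sum-pooling makes the output invariant to the permutation symmetry of sets:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # weights shared across all set elements

def set_embed(x):
    # x: (n_elements, 3) -> permutation-invariant embedding of shape (8,)
    # Applying the same map to every element, then sum-pooling, means
    # reordering the inputs cannot change the output.
    return np.tanh(x @ W).sum(axis=0)

x = rng.normal(size=(5, 3))
perm = rng.permutation(5)
assert np.allclose(set_embed(x), set_embed(x[perm]))  # order doesn't matter
```

The same idea generalizes: convolutions hard-code translation symmetry, and equivariant networks extend this to rotations and other group actions.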
Could you do policy iteration with a neural network?
How would you adapt it to continuous state and action spaces?
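For reference, tabular policy iteration on a toy MDP (the transition and reward numbers below are invented for illustration). The neural variant would replace the exact table with a regression fit (fitted policy iteration), and continuous actions typically swap the argmax for a learned actor, as in actor-critic methods:

```python
import numpy as np

# Toy 2-state, 2-action deterministic MDP:
# P[s, a] = next state, R[s, a] = reward (made-up numbers)
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9
pi = np.zeros(2, dtype=int)  # start with action 0 in both states

for _ in range(20):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
    R_pi = R[np.arange(2), pi]
    P_pi = np.zeros((2, 2))
    P_pi[np.arange(2), P[np.arange(2), pi]] = 1.0
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily w.r.t. one-step lookahead
    Q = R + gamma * V[P]
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break  # policy stable -> optimal
    pi = new_pi

print(pi)  # → [1 0]: cycle between the states to collect both rewards
```

The evaluation step is where a network would enter: instead of solving the linear system, fit V (or Q) by minimizing a Bellman regression loss over sampled transitions.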
What comes after tokenization?
The Bitter Lesson will probably consume it, replacing it with
something lower-level that is more learnable.
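One candidate for that lower level is raw UTF-8 bytes, as explored by byte-level models such as ByT5: every string maps onto a fixed vocabulary of 256 symbols, with no merge rules to learn or to break on out-of-distribution text.

```python
# Byte-level "tokenization": no learned vocabulary, just UTF-8.
text = "naïve"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # → [110, 97, 195, 175, 118, 101]

# The trade-off: sequences get longer ('ï' alone costs two bytes),
# shifting the burden from the tokenizer to the model.
assert len(byte_ids) == 6   # 6 bytes for 5 characters
assert all(0 <= b < 256 for b in byte_ids)
```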
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J., Rytting,
C., & Wingate, D. (2022). Out of One, Many: Using Language
Models to Simulate Human Samples (arXiv:2209.06899). arXiv.
http://arxiv.org/abs/2209.06899
Benton, G. W., Maddox, W. J., Lotfi, S., & Wilson, A. G.
(2021). Loss Surface Simplexes for Mode Connecting Volumes and
Fast Ensembling (arXiv:2102.13042). arXiv.
http://arxiv.org/abs/2102.13042
Chan, S. C. Y., Santoro, A., Lampinen, A. K., Wang, J. X.,
Singh, A., Richemond, P. H., McClelland, J., & Hill, F.
(2022). Data Distributional Properties Drive Emergent In-Context
Learning in Transformers (arXiv:2205.05055). arXiv.
http://arxiv.org/abs/2205.05055
Chiang, T. (2023, February 9). ChatGPT Is a Blurry JPEG of the
Web. The New Yorker.
https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Cong, Y., & Zhao, M. (2022). Big Learning: A Universal
Machine Learning Paradigm? (arXiv:2207.03899). arXiv.
http://arxiv.org/abs/2207.03899
Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T.,
Wenliang, L. K., Catt, E., Hutter, M., Legg, S., & Ortega, P.
A. (2022). Neural Networks and the Chomsky Hierarchy
(arXiv:2207.02098). arXiv. http://arxiv.org/abs/2207.02098
Dohan, D., Xu, W., Lewkowycz, A., Austin, J., Bieber, D.,
Lopes, R. G., Wu, Y., Michalewski, H., Saurous, R. A.,
Sohl-dickstein, J., Murphy, K., & Sutton, C. (2022). Language
Model Cascades (arXiv:2207.10342). arXiv.
http://arxiv.org/abs/2207.10342
Ha, D., & Tang, Y. (2022). Collective Intelligence for
Deep Learning: A Survey of Recent Developments (arXiv:2111.14377).
arXiv. http://arxiv.org/abs/2111.14377
Haluptzok, P., Bowers, M., & Kalai, A. T. (2022). Language
Models Can Teach Themselves to Program Better (arXiv:2207.14502).
arXiv. http://arxiv.org/abs/2207.14502
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai,
T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J.,
Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G.
van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen,
E., … Sifre, L. (2022). Training Compute-Optimal Large Language
Models (arXiv:2203.15556). arXiv.
http://arxiv.org/abs/2203.15556
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O.,
Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled
Neural Interfaces using Synthetic Gradients (arXiv:1608.05343).
arXiv. http://arxiv.org/abs/1608.05343
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., &
Stanley, K. O. (2022). Evolution through Large Models
(arXiv:2206.08896). arXiv. http://arxiv.org/abs/2206.08896
Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M.,
& Williams, M. (2022). Towards Understanding Grokking: An
Effective Theory of Representation Learning (arXiv:2205.10343).
arXiv. http://arxiv.org/abs/2205.10343
Power, A., Burda, Y., Edwards, H., Babuschkin, I., &
Misra, V. (2022). Grokking: Generalization Beyond Overfitting on
Small Algorithmic Datasets (arXiv:2201.02177). arXiv.
http://arxiv.org/abs/2201.02177
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y.,
Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker,
A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A.,
Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud,
R., Pack, C. C., … Kording, K. P. (2019). A deep learning
framework for neuroscience. Nature Neuroscience, 22(11),
1761–1770. https://doi.org/10.1038/s41593-019-0520-2
Sejnowski, T. (2022). Large Language Models and the Reverse
Turing Test (arXiv:2207.14382). arXiv.
http://arxiv.org/abs/2207.14382
Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W.,
Rao, J., Narang, S., Tran, V. Q., Yogatama, D., & Metzler, D.
(2022). Scaling Laws vs Model Architectures: How does Inductive
Bias Influence Scaling? (arXiv:2207.10551). arXiv.
http://arxiv.org/abs/2207.10551
Sutton, R. S. (2019, March 13). The Bitter Lesson. Retrieved
September 30, 2021, from
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Vogelstein, J. T., Verstynen, T., Kording, K. P., Isik, L.,
Krakauer, J. W., Etienne-Cummings, R., Ogburn, E. L., Priebe, C.
E., Burns, R., Kutten, K., Knierim, J. J., Potash, J. B., Hartung,
T., Smirnova, L., Worley, P., Savonenko, A., Phillips, I., Miller,
M. I., Vidal, R., … Yang, W. (2022). Prospective Learning: Back to
the Future (arXiv:2201.07372). arXiv.
http://arxiv.org/abs/2201.07372
Zador, A. M. (2019). A critique of pure learning and what
artificial neural networks can learn from animal brains. Nature
Communications, 10(1), 3770.
https://doi.org/10.1038/s41467-019-11786-6
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D.
(2018). mixup: Beyond Empirical Risk Minimization
(arXiv:1710.09412). arXiv. http://arxiv.org/abs/1710.09412