9 Comments

I have been waiting for someone to take on this approach. Shrinking a model, slowing down its processing, etc., in order to "replicate" human processing. I like your question about the "blank slate." It cuts to the quick. All of this is analogy. Very interesting analogy. But still, we are dealing with models of models, no? It opens up a larger field of questions about using AI as a vehicle for research in lieu of work with actual humans. I am thinking of folks who are safety testing systems on crowds of AI agents -- using those scenarios to predict the integrity of systems in the face of different kinds of stressors. Again, models of models. In essence, science has always worked this way. But there is something interesting and new happening here. Definitely something to write and think about...

Thanks Nick! Yes, sometimes these AI models can feel like magic. It's tricky, isn't it? Ideally we want to build models that tell us something about the real world. I agree -- AI of this type is an analogy of sorts, but like all good analogies (and bad ones) it can only go so far. The trick is trying to figure out what the models are actually explaining.

Hi Suzi, first author of the Science paper here! I just wanted to say thanks for this fantastic write-up of our work and for really diving into the details of our training and evaluation set-up. I’m mostly in agreement with the concerns you’ve listed here, but I did want to add a few comments and provide some additional context for our choices for your readers:

- You mention a language acquisition device, or a universal grammar, as an example of the nature view. Since our work focuses mostly on the problem of lexical acquisition rather than grammar, there are other beliefs more in line with the nature view that are worth mentioning as well, such as knowing from birth the set of visual referents that a concept could refer to (e.g. an innate inventory of concepts), or having strong constraints on how to extend or generalize a word (e.g. based on its shape, or only considering whole objects rather than parts), or assuming that each word has a different meaning (the principle of contrast). We do not explicitly build any of these mechanisms into our model; its primary method of learning these mappings is purely through the co-occurrences between visual and linguistic information (the first sketch after this list illustrates what learning from co-occurrence alone can look like).

- Concern 1: I agree that we do remove some layers of complexity compared to the raw multi-sensory stream a baby receives, by only considering the paired moments between spoken utterances and visual input, and by passing in the transcribed utterances. But I would argue that many of the other well-known models that purport to study cross-situational word learning make even stronger assumptions, such as removing the visual aspect of the problem entirely! See https://www.sciencedirect.com/science/article/pii/S0010028512000795?casa_token=UOzXoHdHMEMAAAAA:1jPkoacJooE6UIB7uDU_ah8GPBnCZYLbDqKMLHhA5p37gJ-ZQ-x2JVZAnZEThagP9Sj2UlU or https://journals.sagepub.com/doi/full/10.1111/j.1467-9280.2009.02335.x?casa_token=wT08hOSACawAAAAA%3APrxJEByZR7clWbKTHhZPXlu7aG-aBXl_4ZdraIOZ039pzYL8bGlE-Ss7Silk22-tLzalF4nUJGBN. Additionally, this kind of pre-processing is common in multimodal machine learning, where it is standard to feed some amount of pre-processed or annotated data into the model (see examples from the Ego4D challenge, which also uses an egocentric video dataset: https://ego4d-data.org/docs/challenge/). Nevertheless, even with these pre-processing steps, our work still provides the largest and most representative subset of one child’s experience to date, and I think demonstrating that learning is possible from this is still a huge advance for the field.

- Concern 2: We only study the problem of learning word-referent mappings because this problem is so big it has its own subfield within cognitive science! Of course, we recognize that grammar and the compositionality of language are also very important and acquired by children later on, but we believe that it made sense to start at the very beginning. Nevertheless, my co-author Wentao Wang led a separate project looking at learning these other aspects of language from the same dataset, which you can find here: https://onlinelibrary.wiley.com/doi/abs/10.1111/cogs.13305, demonstrating that some aspects of syntax and semantics can also be learned from one child’s linguistic input. Therefore, I think it would be relatively straightforward to train a model that learns both word-referent mappings and grammar from this single dataset. Although we have not explicitly pursued this just yet, perhaps we might need to do so given how many people have raised this issue as a potential gotcha! Separately, your example of differentiating between “the butterfly on the block” vs. “the block on the butterfly” is actually very challenging! The Winoground paper (https://openaccess.thecvf.com/content/CVPR2022/html/Thrush_Winoground_Probing_Vision_and_Language_Models_for_Visio-Linguistic_Compositionality_CVPR_2022_paper.html) looks at this problem in great detail and finds that even state-of-the-art multimodal models trained on millions of datapoints still fail to solve it.

- Concern 3: This concern might be a bit of a conceptual difference. I think of the blank slate as Step 0, when the model’s weights are randomly initialized, but I treat both Steps 1 and 2 (as you’ve laid them out) as separate pre-training stages (rather than pre-training and training), since children do receive and learn from visual experience in the first few months of life prior to acquiring any language (Step 1), and that then gets refined with the addition of language input (Step 2). So I don’t think that the visual representations of the model after Step 1 necessarily constitute innate concepts, since those representations were still learned from data. They would only be innate if they were there at Step 0 (in the random weights of the neural network prior to receiving or learning from any data, either vision or language).

- Concern 4: These are all really good points regarding overlap! Note that in addition to a different sampling process for extracting the evaluation frames (which was done independently of the spoken language, based only on the visual content), the pre-training task in Step 2 (matching video frames to entire utterances) is slightly different from the evaluation task (matching single words to video frames), and the evaluation dataset was not used during any of the pre-training stages, so this is not the standard kind of overfitting you see in machine learning (the second sketch after this list illustrates the single-word evaluation). However, as our overlap analysis shows, the fact that the frames are sampled from the same set of videos may have raised the performance of the model slightly, but that doesn’t take away from the fact that the model does generalize to images outside of its training distribution. It’s this finding that gives me strong confidence that it is learning a generalizable mapping between words and images, and not merely overfitting in the traditional sense (which would fail to display this kind of generalization). I agree that its performance is a bit lower in the out-of-distribution evaluation, but that’s because we wanted to evaluate on as many words as possible (even those the model may have seen only a handful of times!), which did end up lowering the overall average performance. Even then, performance for half of the concepts (32 out of 64) is above chance, which I think is quite promising! One last point: we’re working on training a version of CVCL using an independent set of videos from Baby S, distinct from the evaluation frames, thus removing the potential overlap issue, so stay tuned!
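
To make the co-occurrence point in the first bullet more concrete, here is a minimal sketch (in PyTorch) of a CLIP-style contrastive objective over paired video frames and transcribed utterances. This is not the actual CVCL code: the encoders, dimensions, and toy data below are hypothetical stand-ins, and in the real set-up the vision encoder would already have gone through the Step 1 visual pre-training before this joint stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoders standing in for the model's vision and language modules.
class VisionEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Step 0: these weights are randomly initialized -- the only truly "innate" state.
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))

    def forward(self, frames):
        return F.normalize(self.net(frames), dim=-1)

class UtteranceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools word embeddings

    def forward(self, token_ids):
        return F.normalize(self.embed(token_ids), dim=-1)

def contrastive_step(vision_enc, text_enc, frames, utterances, temperature=0.07):
    """One training step: pull co-occurring (frame, utterance) pairs together and
    push non-co-occurring pairs apart. Note there are no built-in constraints about
    shape, whole objects, or mutual exclusivity -- only co-occurrence statistics."""
    v = vision_enc(frames)            # (batch, dim)
    t = text_enc(utterances)          # (batch, dim)
    logits = v @ t.T / temperature    # similarity of every frame to every utterance
    targets = torch.arange(len(frames))
    # Matching pairs lie on the diagonal; everything else serves as a negative example.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random data standing in for video frames and transcribed utterances.
vision_enc, text_enc = VisionEncoder(), UtteranceEncoder()
frames = torch.randn(8, 3, 32, 32)            # 8 video frames
utterances = torch.randint(0, 1000, (8, 6))   # 8 transcribed utterances, 6 tokens each
loss = contrastive_step(vision_enc, text_enc, frames, utterances)
loss.backward()
```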
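
And a second sketch of the kind of single-word evaluation mentioned in Concern 4, assuming a forced-choice set-up in which the model must pick the target frame from a small set of candidates (so chance is one divided by the number of candidates). Again, the function and data are hypothetical stand-ins rather than the paper's actual evaluation code, and it reuses the toy encoders from the sketch above.

```python
import torch

def evaluate_word(word_ids, candidate_frames, vision_enc, text_enc, target_index):
    """Hypothetical forced-choice trial: embed a single word and several candidate
    frames, then check whether the target frame is the most similar one.
    Chance accuracy is 1 / number of candidates."""
    with torch.no_grad():
        w = text_enc(word_ids)                # (1, dim) embedding of the word
        f = vision_enc(candidate_frames)      # (n_candidates, dim)
        similarities = (f @ w.T).squeeze(-1)  # cosine similarity (encoders normalize)
        return similarities.argmax().item() == target_index

# Toy trial: one single-token word against 4 candidate frames (chance = 25%).
word_ids = torch.randint(0, 1000, (1, 1))
candidates = torch.randn(4, 3, 32, 32)
correct = evaluate_word(word_ids, candidates, vision_enc, text_enc, target_index=0)
```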

I hope my responses clarify some of the concerns you’ve raised about our work, and I’m happy to answer any other questions you might have!

Thanks, Wai Keen, for providing valuable context on your work. This must be a busy time for you, so I appreciate you taking the time to respond to my review.

You've raised so many really good points, and given valuable insights into the difficult decisions that important research requires scientists to make.

I'm fascinated by the work you and your team are doing. I'm eagerly awaiting the results of the CVCL model on an independent set of videos from Baby S. Good luck!

Thanks for covering this topic, Suzi. Appreciate it. Excellent work and congratulations on this one! I've read dozens of articles about this paper since it truly fascinates me. I've heard that babies are the source of inspiration for many researchers. These little creatures learn by trying things and seeing what happens. As we grow up, we keep learning and getting smarter. Experts who study how children develop say that babies can predict what will happen next. Yann LeCun has argued that teaching AI systems to observe like children might be the way forward to more intelligent systems. That's why I find this work so fascinating.

I really enjoyed reading this article. I write heavily on this subject on my own Substacks, so it was good to find another writer who shares this interest.

Hi! Thank you. Yes, it does look like we share similar interests :)

As an NLP researcher and father of two little girls, the question of human intelligence and learning has always captivated me. Is the human approach to learning the optimal strategy that we should attempt to emulate, or is it merely one of many strategies evolution has discovered? This debate extends to machine learning as well – should we strive to replicate human learning methods or focus on devising the most effective learning strategies for computers?

When it comes to children's language acquisition, I'm definitely far from an expert in human language development, but I am reasonably familiar with how computers learn language. Observing my girls, aged six months and nearly two years, I've been amazed by their ability to quickly associate words with meanings and grasp simple grammatical structures. My older daughter, at just two years old, can already conjugate verbs and use words in context that she has never heard before. She even makes understandable mistakes, such as applying regular conjugation patterns to irregular verbs.

From my personal and admittedly biased experience, it's clear that human language learning is quite different from, and more robust than, learning in our best language models. While I cannot definitively say whether this supports nature or nurture, it does suggest that humans possess a sophisticated learning ability that is more efficient and robust than current statistical machine learning methods.

This line of research intrigues me because it opens up questions about how we can make AI systems more robust by designing learning strategies that are both highly efficient and easily generalizable.

Thanks, Suzi, for such an amazing write-up, and thanks to the paper authors for championing a brave new approach to a centuries-old question.

Thanks Alejandro!

I'm no expert on language acquisition, either. But I do think it's fascinating. I agree, human language learning and machine language learning (i.e., LLMs) seem to be quite different. But this makes a lot of sense to me. Currently, there are too many ways that human brains differ from machines. And I think those differences are the type of differences that make a difference. For now, of course. Who knows what sorts of advances the future holds.
