Discussion about this post

Suzi Travis

Thanks Nick! Yes, sometimes these AI models can feel like magic. It's tricky, isn't it? Ideally we want to build models that tell us something about the real world. I agree -- AI of this type is an analogy of sorts, but like all good analogies (and bad ones), it can only go so far. The trick is figuring out what the models are actually explaining.

Wai Keen Vong

Hi Suzi, first author of the Science paper here! I just wanted to say thanks for this fantastic write-up of our work and for really diving into the details of our training and evaluation set-up. I'm mostly in agreement with the concerns you've listed here, but I did want to add a few comments and provide some additional context for our choices for your readers:

- You mention a language acquisition device, or a universal grammar, as an example of the nature view. Since our work focuses mostly on the problem of lexical acquisition rather than grammar, there are other beliefs more in line with the nature view that are worth mentioning as well, such as knowing at birth the set of visual referents that a concept could refer to (i.e. an innate view of concepts), having strong constraints on how to extend or generalize a word (e.g. based on its shape, or by considering only whole objects rather than parts), or assuming that each word has a different meaning (the principle of contrast). We do not explicitly build any of these mechanisms into our model; it learns these mappings purely from the co-occurrences between visual and linguistic information (a minimal sketch of this kind of objective follows this list).

- Concern 1: I agree that we do remove some layers of complexity compared to the raw multi-sensory stream a baby receives, by considering only the paired moments between spoken utterances and visual input, and by passing in the transcribed utterances. But I would argue that many of the other well-known models that purport to study cross-situational word learning make even more assumptions, such as removing the visual aspect of the problem entirely! See https://www.sciencedirect.com/science/article/pii/S0010028512000795?casa_token=UOzXoHdHMEMAAAAA:1jPkoacJooE6UIB7uDU_ah8GPBnCZYLbDqKMLHhA5p37gJ-ZQ-x2JVZAnZEThagP9Sj2UlU or https://journals.sagepub.com/doi/full/10.1111/j.1467-9280.2009.02335.x?casa_token=wT08hOSACawAAAAA%3APrxJEByZR7clWbKTHhZPXlu7aG-aBXl_4ZdraIOZ039pzYL8bGlE-Ss7Silk22-tLzalF4nUJGBN. Additionally, this kind of pre-processing is common in multimodal machine learning, where it is standard to feed some amount of pre-processed or annotated data into the model (see examples from the Ego4D challenge, which also uses an egocentric video dataset: https://ego4d-data.org/docs/challenge/). Nevertheless, even with these pre-processing steps, our work still provides the largest and most representative subset of one child's experience to date, and I think demonstrating that learning is possible from it is still a huge advance for the field.

- Concern 2: We only study the problem of learning word-referent mappings because this problem is so big that it has its own subfield within cognitive science! Of course, we recognize that grammar and the compositionality of language are also very important and are acquired by children later on, but we believe it made sense to start at the very beginning. Nevertheless, my co-author Wentao Wang led a separate project looking at learning these other aspects of language from the same dataset, which you can find here: https://onlinelibrary.wiley.com/doi/abs/10.1111/cogs.13305, demonstrating that some aspects of syntax and semantics can also be learned from one child's linguistic input. I therefore think it would be relatively straightforward to train a model that learns both word-referent mappings and grammar from this single dataset. Although we have not explicitly pursued this just yet, perhaps we will need to, given how many people have raised this issue as a potential gotcha! Separately, your example of differentiating between "the butterfly on the block" and "the block on the butterfly" is actually very challenging! The Winoground paper (https://openaccess.thecvf.com/content/CVPR2022/html/Thrush_Winoground_Probing_Vision_and_Language_Models_for_Visio-Linguistic_Compositionality_CVPR_2022_paper.html) looks at this problem in great detail and finds that even state-of-the-art multimodal models trained on millions of data points still fail to solve it.

- Concern 3: This concern might come down to a conceptual difference. I think of the blank slate as Step 0, when the model's weights are randomly initialized, and I treat both Steps 1 and 2 (as you've laid them out) as separate pre-training stages (rather than pre-training and training), since children do receive and learn from visual experience in the first few months of life prior to acquiring any language (Step 1), and that visual learning then gets refined with the addition of language input (Step 2). So I don't think that the visual representations of the model after Step 1 necessarily constitute innate concepts, since those representations were still learned from data. They would only be innate if they were there at Step 0, in the random weights of the neural network, prior to receiving or learning from any data, either vision or language.

- Concern 4: These are all really good points regarding overlap! Note that in addition to a different sampling process used to extract the evaluation frames (done independently of the spoken language, based only on the visual content), the pre-training task in Step 2 (matching video frames to entire utterances) is slightly different from the evaluation task (matching single words to video frames; see the second sketch after this list), and the evaluation dataset was not used during any of the pre-training stages, so this is not the standard kind of overfitting you see in machine learning. However, as our overlap analysis shows, the fact that the frames are sampled from the same set of videos may have raised the model's performance slightly, but it doesn't take away from the fact that the model does generalize to images outside of its training distribution. It's this finding that gives me strong confidence that it is learning a generalizable mapping between words and images, and not merely overfitting in the traditional sense (which would fail to display this kind of generalization). I agree that its performance is a bit lower in the out-of-distribution evaluation, but that's because we wanted to include as many words to evaluate on as possible (even those the model may have seen only a handful of times!), which did lower the overall average performance. Even then, performance for half of the concepts (32 out of 64) is above chance, which I think is quite promising! One last point: we're working on training a version of CVCL using an independent set of videos from baby S, distinct from the evaluation frames, thus removing the potential overlap issue, so stay tuned!
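
For readers who want something more concrete, here is a minimal sketch (in PyTorch) of the kind of co-occurrence-driven contrastive objective described in the first bullet above. This is not our actual CVCL code; the function name, the temperature value, and the use of the other items in the batch as negatives are illustrative assumptions, but they capture the basic idea of matching each frame to the utterance it co-occurred with.

```python
# Illustrative sketch only, not the authors' CVCL implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(frame_embeddings, utterance_embeddings, temperature=0.07):
    """InfoNCE-style loss over a batch of co-occurring (frame, utterance) pairs.

    frame_embeddings:     (B, D) outputs of a vision encoder (e.g. pre-trained in Step 1)
    utterance_embeddings: (B, D) outputs of a language encoder over transcribed utterances
    """
    # Normalize so the dot product below is a cosine similarity.
    f = F.normalize(frame_embeddings, dim=-1)
    u = F.normalize(utterance_embeddings, dim=-1)

    # Similarity of every frame to every utterance in the batch.
    logits = f @ u.t() / temperature              # shape (B, B)

    # The i-th frame co-occurred with the i-th utterance; all other pairings are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match frames to utterances and utterances to frames.
    loss_f2u = F.cross_entropy(logits, targets)
    loss_u2f = F.cross_entropy(logits.t(), targets)
    return (loss_f2u + loss_u2f) / 2

# Example with random embeddings: a batch of 8 co-occurring pairs, 512-dim each.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```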
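
And here is an equally rough sketch of the evaluation set-up mentioned under Concern 4: a single word's embedding is compared against a small set of candidate frames, and the model is scored on whether the most similar frame is the target. The helper names and data layout are assumptions for illustration, not our evaluation code; with N candidates per trial, chance performance is 1/N, which is the baseline the per-concept "above chance" comparison refers to.

```python
# Illustrative sketch only, not the authors' evaluation code.
import torch
import torch.nn.functional as F

def pick_frame(word_embedding, candidate_frame_embeddings):
    """Return the index of the candidate frame most similar to the word.

    word_embedding:             (D,) embedding of a single evaluation word
    candidate_frame_embeddings: (N, D) embeddings of one target frame plus N-1 distractors
    """
    w = F.normalize(word_embedding, dim=-1)
    frames = F.normalize(candidate_frame_embeddings, dim=-1)
    similarities = frames @ w                      # (N,) cosine similarities
    return int(similarities.argmax())

def concept_accuracy(trials):
    """Average accuracy over (word_embedding, candidate_frames, target_index) trials."""
    correct = sum(pick_frame(w, frames) == target for w, frames, target in trials)
    return correct / len(trials)

# A single mock trial: 4 candidate frames, with the correct one at index 0.
trials = [(torch.randn(512), torch.randn(4, 512), 0)]
print(concept_accuracy(trials))  # chance level here would be 1/4
```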

I hope these responses clarify some of the concerns you've raised about our work, and I'm happy to answer any other questions you might have!

