Edit (04/08/18): In hindsight (a couple of weeks' worth) I think I was a little too negative. I may not have understood the questions the ACL community were asking, or how supervised deep learning can lend insight to approaches to NLP.
Money well spent? Yes and no. Having spent the money motivated me to: read An Introduction to Language, skim many of the papers accepted into ACL 2018, … but I found the actual conference disappointing. There were a lot of applications of supervised deep learning, which in my mind is not science; rather, it's engineering. Engineering can certainly be interesting, but it can also be very uninteresting… Collect a dataset, take a guess at a neural net architecture, watch some loss curves, voila (voila: “to suggest an appearance as if by magic”, which seems fitting for DL).
My impression of many of the ACL papers is summed up by a quote from a presenter at ICML who, when asked whether they had done an ablation study to understand the problem/architecture, answered “what's an ablation?”. Sigh. This got me thinking about what makes good science. I don't want to pick out individual papers, so the general patterns that left me unimpressed were: ignorance, a lack of care for truly understanding, uncertain results, and a lack of effort in attempting to answer the big questions.
Ignorance of existing work, or of its relationship(s) to existing fields outside of linguistics. For example, being unaware of: matrix factorisations, sparse bases and signal processing, learning latent features, … (and those are just the things I happen to know of). So while linguists might get frustrated (rightly so) with the ML community for their ignorance (and arrogance), the ML community has just as much right to get frustrated with linguists for theirs. (Recently I have started to notice that this is a problem across all of science, not limited to ML and computational linguistics.)
Relatedly, I expected to see a deeper desire to understand the problem(s) rather than to build solution(s). My impression was that the authors of most papers didn't understand why they did what they did, or why it helps. There were a few papers that explored word/sentence/!^$*# embeddings and attempted to see what they represented (or what they could be used for). And there were a few papers that used ablations to explore what is necessary/sufficient for a good solution. But the majority of papers didn't bother. Understanding is hard; +1% accuracy is easier.
In my mind, science is about removing uncertainty about which theories are incorrect. However, in many of the papers at ACL it didn't seem clear that the proposed method performs as claimed, or better than its competitors. Because optimising NNs is non-convex, you need to be careful when making comparisons: hyperparameters can have a large effect, so you need to be thorough. Also, the significance (in the sense of statistical significance, and of whether I should care) of +1% accuracy is dubious. I also think there is a strong case for non-trivial baselines; a lot more effort should be spent here, designing something we understand and seeing how it performs. Finally, if no uncertainty is removed, then what is the contribution!?
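To make the +1% point concrete, here is a minimal sketch (my own, not from any ACL paper) of the kind of check I have in mind: a paired bootstrap over per-example correctness. The function name and the toy numbers are invented for illustration.

```python
import numpy as np

def paired_bootstrap_p(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap: resample the test set with replacement and count
    how often system A's accuracy beats system B's. correct_a[i] is 1 if
    A got test example i right (same examples for both systems)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples, paired
        if a[idx].mean() > b[idx].mean():
            wins += 1
    # Fraction of resamples where A fails to beat B: a rough one-sided p-value.
    return 1.0 - wins / n_boot

# Toy check: a 1% accuracy gap on 1,000 examples is often not significant.
rng = np.random.default_rng(1)
a = rng.random(1000) < 0.81
b = rng.random(1000) < 0.80
print(f"p ~= {paired_bootstrap_p(a, b):.3f}")
```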
I would have loved to see some/more work on the big questions:
- what is necessary for complex language to emerge from the need to collaborate and the subsequent interaction of learners?
- can we construct a biologically inspired model of language acquisition?
- how is working memory computationally useful?
- how can we imbue our NLP systems with the same inductive biases we see in children (e.g. shape bias) to aid generalisation?
Papers
Interesting-looking papers (pre-conference):
- On the Limitations of Unsupervised Bilingual Dictionary Induction
- Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking
- A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss
- Learning Structured Natural Language Representations for Semantic Parsing
- Straight to the Tree: Constituency Parsing with Neural Syntactic Distance
- Did the Model Understand the Question?
- Training Classifiers with Natural Language Explanations
- StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing
- The Role of Syntax during Pronoun Resolution: Evidence from fMRI
- Compositional Morpheme Embeddings with Affixes as Functions and Stems as Arguments
- Language Generation via DAG Transduction
- Composing Finite State Transducers on GPUs
- Improving Knowledge Graph Embedding Using Simple Constraints
- Object-oriented Neural Programming (OONP) for Document Understanding
- Backpropagating through Structured Argmax using a SPIGOT
- … and many more
[Image: too many papers, not enough focus.]
Interesting-looking papers (post-conference):
- The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing
- Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing
- Linear-Time Constituency Parsing with RNNs and Dynamic Programming
- Finding Syntax in Human Encephalography with Beam Search
(sigh, I'll add them to the end of my list: positions 1075, 1076, …)
Interesting notes
Co-reference resolution (and entity linking/resolution)
(inspired by/from Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking)
You can frame entity resolution as a matrix factorisation problem. There is a set of n entities and we need to be able to uniquely index them, but we are lazy. Can we find a set of other labels, let's call them types, that allow us to uniquely index the entities in natural examples? That is, can we find a matrix (Entities x Types) that allows us, within a sentence, to specify a set of types – a vector of shape (Types) – and use it to look up elements of our matrix?
Q: If the goal is to efficiently distinguish entities, can this be formulated as a loss function!?
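Here is one hypothetical way it could be: treat the (Entities x Types) matrix as learnable and minimise the cross-entropy of recovering each entity from its type query. Everything below (the names, the softmax choice, the toy sizes) is my own sketch, not anything from the paper.

```python
import numpy as np

# M[e, t] = 1 if entity e carries type t. A mention supplies a query q
# over types, and M @ q scores every entity.
rng = np.random.default_rng(0)
n_entities, n_types = 1000, 50
M = rng.random((n_entities, n_types))  # relaxed to [0, 1] so it is learnable

def distinguishability_loss(M, q, target):
    """Cross-entropy of picking the target entity given its type query.
    Low iff the query's types point (near-)uniquely at the target, so
    minimising it over many (q, target) pairs pushes M towards type
    assignments that uniquely index the entities."""
    scores = M @ q                                    # (n_entities,)
    log_probs = scores - np.logaddexp.reduce(scores)  # log-softmax
    return -log_probs[target]

q = (M[42] > 0.9).astype(float)  # types strongly associated with entity 42
print(distinguishability_loss(M, q, target=42))
```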
Approximate entailment
(inspired by/from Improving Knowledge Graph Embedding Using Simple Constraints)
This paper formulates an approximate version of entailment (similar to Evans et al.).
I found this intriguing as I recently asked myself, “could there be an advantage to approximate reasoning?”, and this question led me to the idea of regularisation for consistency (blog post is w.i.p). I wonder how much further we can take this approach, using hand-crafted regularisers to impose logical priors on a knowledge graph.
P.S. Regularising for approximate entailment made a large (~5-10%) difference on some of the test results!
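For what it's worth, here is roughly how I picture “regularising for consistency” in code. This is a generic sketch, not the paper's actual constraints (as I understand it, they impose theirs directly on the relation embeddings); the DistMult-style scorer, the toy embeddings and the rule “relation 3 entails relation 7” are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 16))  # toy entity embeddings
R = rng.normal(size=(10, 16))   # toy relation embeddings

def distmult(h, r, t):
    """DistMult-style triple score: sum_d E[h,d] * R[r,d] * E[t,d]."""
    return np.sum(E[h] * R[r] * E[t], axis=-1)

def entailment_penalty(h, t, r_premise, r_conclusion, margin=0.0):
    """Soft penalty for r_premise => r_conclusion: if (h, r_premise, t)
    outscores (h, r_conclusion, t), we pay for the violation; otherwise
    nothing. Only violations cost, so the entailment is approximate.
    Added to the usual KG loss as lam * entailment_penalty(...)."""
    gap = distmult(h, r_premise, t) - distmult(h, r_conclusion, t)
    return np.maximum(gap + margin, 0.0).mean()

# e.g. a (hypothetical) rule: relation 3 entails relation 7
h = rng.integers(0, 100, size=32)
t = rng.integers(0, 100, size=32)
print(entailment_penalty(h, t, r_premise=3, r_conclusion=7))
```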
Relational formalisations
(inspired by/from Towards Understanding the Geometry of Knowledge Graph Embeddings)
We have a relationship between two concepts. How can we formalise this relationship, and what are the pros and cons of different formalisms? What types of relationships are there, and how should they be formalised?
These questions take us (quite quickly) to some deep and pure mathematics. Should the composition of two relationships be transitive, associative, commutative, …? Well, it depends on the relationship we want to capture.
Simple candidates for modelling relationships between dense vectors seem to be additive and multiplicative.
But these are just a small set of the possible ways we can formalise these relationships. Which representations are best!?
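For concreteness, the two standard instances (TransE-style and DistMult-style scorers; my labelling, not from the talk):

```python
import numpy as np

def additive_score(h, r, t):
    """TransE-style: a relation is a translation, h + r ~ t."""
    return -np.linalg.norm(h + r - t)

def multiplicative_score(h, r, t):
    """DistMult-style: a relation rescales each coordinate of h."""
    return np.sum(h * r * t)

# Note what each formalism commits us to: composing two additive relations
# is vector addition, and composing two multiplicative ones is an
# elementwise product -- both commutative. So neither can represent a pair
# of relations whose order matters, e.g. mother_of then spouse_of versus
# spouse_of then mother_of.
h, r, t = np.ones(4), np.full(4, 0.5), np.ones(4)
print(additive_score(h, r, t), multiplicative_score(h, r, t))
```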
Favorite presentations
On the Complexity and Typology of Inflectional Morphological Systems by Ryan Cotterell. Not only was this talk interesting, but the presenter actually made an effort to engage his audience: an unusual occurrence. I knew that there were languages (such as Turkish, Arabic, Latin, …) with greater morphological complexity than English (many more cases/genders that can augment the meaning of a word). I knew that English has many irregular forms (past tense: consume-consumed versus eat-ate; plurals: car-cars versus foot-feet). But I didn't know that languages with high morphological complexity tend to have few irregular forms, and vice versa. This hints at some fundamental trade-off between the resources required to learn and/or compute a complex case system and those required for many irregular forms. Cool.
‘Lighter’ Can Still Be Dark: Modeling Comparative Color Descriptions by Olivia Winn. How can comparative adjectives (such as ‘lighter’ in “the male duck's plumage is lighter than the female's”) be reasoned with? It turns out that these comparative adjectives, aka relationships between two colors, can be easily grounded as vectors. I found it interesting that the grounding was dependent on the position of the color in RGB space; in hindsight this makes sense, as “more neon” presumably affects different colors differently. I would love to see a compositional version of this, where “more neon” could be composed with “lighter” (built into the architecture, not done after). Really good work, getting to the core of grounding and relational reasoning.
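A toy version of the idea as I understood it (entirely my own construction, not the paper's model): a comparative is a displacement in RGB whose direction depends on the reference color.

```python
import numpy as np

def apply_comparative(rgb, direction, strength=30.0):
    """Apply a grounded comparative as a reference-dependent move in RGB."""
    rgb = np.asarray(rgb, dtype=float)
    return np.clip(rgb + strength * direction(rgb), 0, 255)

# 'lighter' pushes towards white, with more room to move for dark colors:
# one way the displacement can depend on the reference color.
lighter = lambda rgb: (255.0 - rgb) / 255.0
print(apply_comparative([70, 90, 40], lighter))  # a lighter olive green
```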
On the Practical Computational Power of Finite Precision RNNs for Language Recognition by Gail Weiss.
RNNs have been proven to be universal approximators, but that proof requires infinite precision and unbounded time. With those limitations taken into account, is it still true that RNNs are universal approximators? No. But even more interestingly, Weiss et al. focus on the equivalence of different RNN architectures and show that LSTMs are strictly more powerful than GRUs, because LSTMs can count. Just awesome!
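To see the counting claim concretely, here is a hand-wired, one-cell LSTM in its saturated regime (my own toy sketch, not the construction from the paper):

```python
def lstm_counter(tokens):
    """A hand-wired single-cell LSTM with idealised saturated gates:
    forget gate = 1 (keep the count), input gate = 1, and the candidate
    is +1 on 'a' and -1 on 'b'. The additive, unbounded cell state c
    implements a counter. A GRU's state is a convex interpolation
    squashed into (-1, 1), so it cannot grow without bound -- the crux
    of why LSTMs can count and GRUs cannot."""
    c = 0.0
    for tok in tokens:
        c = 1.0 * c + 1.0 * (+1.0 if tok == "a" else -1.0)
    return c

print(lstm_counter("aaabbb"))  # 0.0: #a == #b, necessary for a^n b^n
print(lstm_counter("aaabb"))   # 1.0: counts are off by one
```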
Research directions for the future
I am especially interested in learning to manipulate, query and transform structured representations (trees, graphs, ?) in the hope of integrating symbolic and neural approaches (aka integrating reasoning and learning). I am hopeful that some of the approaches to generating/manipulating syntax trees might be insightful.
NLP for law and legal analytics. A lot of effort (or not enough effort) is spent checking contracts, verifying compliance with regulations, monitoring fraud, … Imagine if we could automate these!!!
Integrated systems for the comprehension and generation of language (as writing and/or speech and/or sign). I am curious about the sharing of knowledge between these two tasks and how we can cleverly design an efficient integrated system.
NLP for everyone. The ability to record all sentences spoken, all paragraphs typed, … The ability to analyse your own language and find patterns; bias, lies, quirks, habits, common sayings, moods, … A scope into our own thoughts.
Language grounded in the real world. This seems a long way away, mainly as I cannot see a clear path forward. In my mind this quest is why the field of AI was first formed and why it is still pursued: a computer with common sense and the ability to “understand”.
Other happenings
It was cool to see FB, Google, Amazon, IBM, Microsoft, Baidu, Tencent, … It made me feel wanted/useful/prestigious… just by association.
[Image: heh, only at an academic conference. There was a singles group chat started and the meaning of ‘single’ was lost in translation.]
Ryan Cotterell polled the audience: “raise your hand if English is not your first language”. About 90% of the people in the room raised their hands. I thought that was really cool.
[Image: if you look carefully at the small screen on the right you can see that the stage manager was watching football (World Cup replays) for the entire presentation.]
Final thoughts
Not sure which conference I am going to pick next, but I think I am keen for something with more science, more math, less hype and fewer people. I would also like to engage more with the attendees; maybe I'll go to the workshops next time.
Otherwise, I met some nice people and learnt a few things. Worth.