Just to keep track of the recommended reading Johanna gave me:
1) Semantic Role Labelling: With the release of Propbank, semantic role labelling is all the rage right now. The real question is whether our system should do this, and to what extent it already can. I'll have to take a look at what ccg2sem does, but I would guess that unless Johan added WordNet features, it doesn't. The paper "The Necessity of Syntactic Parsing for Semantic Role Labeling" shows that semantic role labelling should be divided into two distinct tasks: pruning, which identifies possible arguments, and then matching argument candidates to roles (see the sketch below). And as Punyakanok et al. find, using a full parse helps mainly by identifying the correct constituents as argument candidates. The other paper, "Semantic Argument Classification Exploiting Argument Interdependence", goes even further by arguing that any semantic roles already identified should be used as features, but this produces only a one percent increase in recall.
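To make the two-stage picture concrete, here is a minimal sketch of a prune-then-classify pipeline. Everything here is invented for illustration: the constituent representation is a toy stand-in for a real parse, and the "classifier" is a couple of hard-coded rules where an actual system would use a trained model over syntactic features.

```python
# Toy sketch of the two-stage SRL pipeline: prune argument candidates,
# then assign roles to the survivors. Not the systems from the papers.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Constituent:
    text: str
    label: str               # phrase label from the parse, e.g. "NP", "PP"
    attaches_to_verb: bool    # crude stand-in for the pruning heuristics

def prune(parse: List[Constituent]) -> List[Constituent]:
    """Stage 1: keep only constituents that could plausibly be arguments."""
    return [c for c in parse
            if c.attaches_to_verb and c.label in {"NP", "PP", "SBAR"}]

def classify(candidate: Constituent, predicate: str) -> Optional[str]:
    """Stage 2: assign a role (or None) to each surviving candidate.
    A real system would run a trained classifier here."""
    if candidate.label == "NP":
        return "ARG0" if candidate.text.istitle() else "ARG1"
    if candidate.label == "PP":
        return "ARGM-LOC"
    return None

def label_roles(parse: List[Constituent], predicate: str) -> List[Tuple[str, Optional[str]]]:
    return [(c.text, classify(c, predicate)) for c in prune(parse)]

if __name__ == "__main__":
    toy_parse = [
        Constituent("Nils", "NP", True),
        Constituent("the goose", "NP", True),
        Constituent("on the back", "PP", True),
        Constituent("yesterday afternoon", "NP", False),  # pruned away
    ]
    print(label_roles(toy_parse, "jump"))
```

The point of the split is visible even in this toy version: a good parse mostly earns its keep in `prune`, by handing the classifier the right constituent boundaries to begin with.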
2) The rest of the papers are about "story comprehension" systems, which basically (using a sample corpus from Remedia, which I imagine we could get hold of: sixty children's stories with question-and-answer sets) just try to identify the relevant sentence that contains the "answer". The systems evolved from "Deep Read", which used pure bag-of-words approaches, to a rule-based approach that assigned different scores to different levels of "clues" (Riloff and Thelen), to an interesting system (Grois and Wilkins) that uses word-level transformations (directed by Q-learning) to turn the question ("Who does Nils jump on the back of?") into an answer template ("Nils jumps on the back of ____"). That evolution takes the F-measure from roughly 30 to 40 to 50 percent. The last method seems smart but hampered by working only at the word level - after all, could we not do the same matching over a dependency tree or some other semantic representation (see the sketch below)? One could almost think of a question as an incomplete semantic representation, and then search over the available semantic representations to complete it.
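Here is a toy sketch of that last idea: treat the question as a set of dependency-style triples with a hole in it, and look for a sentence whose triples cover the question's, binding the hole to the answer. The triples below are hand-written illustrations, not the output of any real parser or of the Grois and Wilkins system.

```python
# Toy question-answering by matching dependency-style triples.
HOLE = "?"

def unify(q_triple, s_triple, binding):
    """Match one question triple against one sentence triple,
    treating HOLE as a slot that must bind consistently."""
    for q, s in zip(q_triple, s_triple):
        if q == HOLE:
            if binding is not None and binding != s:
                return None, False
            binding = s
        elif q != s:
            return None, False
    return binding, True

def answer(question_triples, sentences):
    """Return (sentence id, answer) for the first sentence whose triples
    cover all question triples; the value bound to HOLE is the answer."""
    for sent_id, s_triples in sentences.items():
        binding, ok = None, True
        for q in question_triples:
            matched = False
            for s in s_triples:
                b, success = unify(q, s, binding)
                if success:
                    binding, matched = b, True
                    break
            if not matched:
                ok = False
                break
        if ok:
            return sent_id, binding
    return None, None

if __name__ == "__main__":
    # "Who does Nils jump on the back of?" as triples with a hole
    question = [("jump", "nsubj", "Nils"), ("jump", "on_back_of", HOLE)]
    story = {
        "s1": [("jump", "nsubj", "Nils"), ("jump", "on_back_of", "goose")],
        "s2": [("fly", "nsubj", "goose"), ("fly", "over", "Lapland")],
    }
    print(answer(question, story))   # -> ('s1', 'goose')
```

Working at this level rather than over raw word strings means the transformation step largely disappears: the question already is the answer's representation minus one node.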
3) Lastly, the "pyramid model" (Nenkova and Passonneau) is interesting: it has humans identify "summarization content units", bits of frequently occurring content that are weighted by how many human-written model summaries express them. The weights appear to be fairly stable as the number of model summaries grows, which is good news for any standard. This seems like something else one might want to do at the semantic level, as van Halteren and Teufel have apparently been up to. However, they do not weight theirs (like we would), nor is it clear why one would want a human involved anyway if one could just count overlap automatically (a toy version of the scoring arithmetic is sketched below).
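A minimal sketch of the pyramid scoring arithmetic as I understand it: each content unit's weight is the number of model summaries that express it, and a peer summary's score is the weight it actually covers divided by the best total weight achievable with the same number of units. The unit identifiers and summaries below are invented for illustration.

```python
# Toy pyramid-style scoring: weight content units by how many model
# (human) summaries express them, then score a peer summary against the
# best possible total weight for the same number of units.
from collections import Counter

def scu_weights(model_summaries):
    """Weight each content unit by how many model summaries contain it."""
    counts = Counter()
    for scus in model_summaries:
        counts.update(set(scus))
    return counts

def pyramid_score(peer_scus, weights):
    """Observed weight of the peer's units divided by the maximum weight
    achievable with the same number of units drawn from the pyramid."""
    peer = set(peer_scus)
    observed = sum(weights.get(scu, 0) for scu in peer)
    best_possible = sorted(weights.values(), reverse=True)[:len(peer)]
    return observed / sum(best_possible) if best_possible else 0.0

if __name__ == "__main__":
    models = [
        {"scu_boy_rides_goose", "scu_travels_to_lapland"},
        {"scu_boy_rides_goose", "scu_is_shrunk"},
        {"scu_boy_rides_goose", "scu_travels_to_lapland", "scu_is_shrunk"},
    ]
    w = scu_weights(models)
    peer = {"scu_boy_rides_goose", "scu_is_shrunk"}
    print(pyramid_score(peer, w))   # 5 / 5 = 1.0
```

The human labour in the real method goes into deciding which spans count as the same content unit; the counting itself, as above, is trivial, which is exactly why doing the matching at a semantic level looks attractive.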