Powerset and the garden path
Ero Carrera (ero) <erocarrera@gmail.com> Friday, June 22 2007 20:15.00 CDT


I've recently bumped into Powerset again. I had previously heard about them when they hired some people from PARC (if I remember correctly) and set out to build something I had always dreamed about. The guys at Powerset are tackling one of the hardest and, in my opinion, most interesting problems currently known: helping computers process and "understand" natural language, and using those results to make information more accessible. From my humble amateur-linguistics-aficionado point of view, they are doing great work there. Soon I will have a chance to see it live, first hand, and I can't wait.

In one of their latest posts they discuss the ambiguities that arise when a word with several meanings is used in a context that calls for one of its less common meanings, leading to misunderstandings.

To put it in other terms, the problem arises when the lesser-known meaning of a word misleads the brain at the start of a sentence, causing it to misunderstand the subsequent words (which can themselves have several meanings, depending on how one understood the beginning of the sentence).
Normally, once the sentence has been read several times, the brain finally "switches" to the right interpretation of the different meanings of those words, and the whole construct becomes coherent.
I personally see it as resembling the visual phenomenon where the brain interprets specially crafted images in different ways, switching back and forth between the different interpretations, as in the Young Girl-Old Woman illusion or the Rabbit-Duck one.

In the case of these garden path sentences, as they are commonly called, the brain gets confused by the dependencies between the words and their meanings.

As the brain starts reading a sentence, it attempts to predict what follows, and it's amazingly good at that. The trick is to throw it off track by using words with multiple meanings.

In the example they use as their post title, "Search Engines Leaking Oil for Holes", the brain is tricked into taking the most common meaning of the first two words (a compound noun, or collocation) and attempting an interpretation that becomes rather confusing upon reaching "leaking oil".



Re-reading the sentence can lead to a second interpretation.



In their post they ask how hard it would be to find an automated way of generating such garden path sentences, and they describe a pseudo-algorithm along these lines:


You can make your own garden path sentences by following a few simple heuristics (...). The trick is to choose words that can act as both nouns and verbs, or as both adjectives and nouns, words like store, search, and post. Then follow the ambiguous word by another word that can take on more than one form. The hard part is to then add on another noun phrase that makes sense with the less common interpretation of the second word.


Trying to follow their heuristics, the first thing to do is to find sets of words that can be both a noun and a verb, or an adjective and a noun. Thanks to WordNet, PyWordNet and the mash-up of those and more provided by the guys from NodeBox, that's not as hard a task as it would otherwise have been.

Sets of words fulfilling those requirements can be built in a few lines of Python.


# PyWordNet exposes the WordNet index as dictionaries keyed by word
import wordnet

# Collect nouns, verbs and adjectives
verbs = set(wordnet.V.keys())
nouns = set(wordnet.N.keys())
adjectives = set(wordnet.ADJ.keys())

# Pick the ones that can work both as nouns and verbs or as nouns and adjectives
noun_verbs = verbs.intersection(nouns)
noun_adjectives = adjectives.intersection(nouns)

print "Found %d words that are both verbs and nouns" % len(noun_verbs)
print "Found %d words that are both adjectives and nouns" % len(noun_adjectives)

Found 4096 words that are both verbs and nouns
Found 3138 words that are both adjectives and nouns



I will also need some means of knowing which words are likely to follow a given one. For that I will reach into some datasets I collected years ago for some computational linguistics experiments. Using a small corpus of 2,071,007 sentences built out of books from Project Gutenberg and parsing it with some Python code, I obtained 16,057,624 word pairs, 2,365,383 of them unique. That will provide me with some numbers on which words are likely to follow others.

I can now look for frequently used words that can be both nouns and verbs. In the following line, "occurrences" is a list of (word, count) pairs sorted by most frequent first; it is filtered to show only the words that are both nouns and verbs.


# Show, among the 300 most frequent words, the ones that are both nouns and verbs
print [pair for pair in occurrences[:300] if pair[0] in noun_verbs]

{{"be", 10070}, {"have", 7827}, {"like", 6577}, {"will", 6201}, {"out", 5422}, {"still", 4136}, {"even", 4049}, {"man", 3957}, {"can", 3866}, {"down", 3376}, {"see", 3104}, {"do", 3097}, {"time", 2729}, {"people", 2663}, {"well", 2602}, {"last", 2581}, {"back", 2337}, {"white", 2250}, {"make", 2088}, {"till", 2083}, {"come", 2048}, {"black", 2021}, {"general", 2004}, {"found", 1935}, {"light", 1918}, {"round", 1910}, {"go", 1880}, {"better", 1815}, {"face", 1755}, {"saw", 1742}, {"lay", 1740}, {"work", 1682}, {"form", 1678}, {"let", 1673}, {"right", 1654}, {"set", 1647}, {"lord", 1621}, {"look", 1579}, {"take", 1577}, {"hand", 1574}, {"head", 1546}, {"full", 1544}, {"best", 1538}, {"put", 1534}, {"state", 1531}, {"party", 1522}, {"love", 1517}, {"place", 1493}, {"house", 1491}, {"say", 1440}, {"get", 1401}, {"part", 1386}, {"water", 1385}, {"name", 1384}, {"second", 1370}, {"give", 1344}, {"felt", 1342}, {"present", 1327}, {"fell", 1320}, {"land", 1319}, {"use", 1311}}



Now, given a word, it's possible to find other words that often follow it and can also fulfill several functions. For instance, let's see what comes out for "look":


# Pick words following "look" that can be both nouns and verbs
succeeding_words = [p for p in word_sparse['look'].items() if p[0] in noun_verbs]
# Sort them from the most frequently used to the least
succeeding_words.sort(key=lambda p: p[1], reverse=True)
print succeeding_words[:100]

"[(like, 255), (out, 185), (down, 124), (back, 115), (forward, 82), (round, 49), (well, 42), (pale, 26), (better, 16), (black, 9), (right, 7), (full, 6), (white, 5), (blue, 5), (grave, 5), (even, 4), (still, 4), (double, 4), (cross, 4), (close, 3)]



And the results for "form":


[("name", 185), ("part", 18), ("can", 8), ("saint", 8), ("like", 7), ("see", 5), ("will", 5), ("state", 4), ("ice", 3), ("till", 3), ("lay", 3), ("french", 3), ("people", 3), ("found", 2), ("out", 2), ("put", 2), ("well", 2), ("note", 2), ("black", 2), ("starch", 2)]



Although not being a native English speaker makes this a tiny bit more challenging, I can see how one could play with combinations like "look, like", "look, still", "look, well", "form, name", "form, like", etc. to build slightly confusing sentences.

Collocations are also great for misleading the brain whenever one of their words has more than one meaning ("visitor center", "search engines", "meeting point").
A quick hack for spotting some automatically could be to look for pairs of words that often appear together and fulfill more than one grammatical function.
But given the low-quality results shown next, one could, for instance, also take into account the relative frequency of a noun-noun compound compared to other pairings of the same nouns, to see how much more often those two words appear together than with others (a sketch of that follows the results). There's extensive literature on how to improve this, and this was meant as a short-ish blog post after all.



import en  # NodeBox Linguistics part-of-speech lookups
# Frequent pairs where both words can act as noun and as verb
print [p[0] for p in word_pairs_occurrences[:10000] if en.is_noun(p[0][0]) and en.is_verb(p[0][0]) and en.is_noun(p[0][1]) and en.is_verb(p[0][1])]

[("will", "be"), ("can", "be"), ("labor", "force"), ("will", "have"), ("be", "found"), ("can", "do"), ("come", "back"), ("exchange", "rate"), ("will", "do"), ("will", "make"), ("can", "read"), ("prime", "minister"), ("will", "go"), ("will", "give"), ("have", "come"), ("come", "out"), ("go", "back"), ("right", "hand"), ("set", "out"), ("go", "out"), ("find", "out"), ("can", "see"), ("will", "come"), ("come", "down"), ("will", "take"), ("have", "found"), ("short", "form"), ("will", "tell"), ("birth", "total"), ("get", "out"), ("go", "down"), ("land", "use"), ("be", "put"), ("can", "tell"), ("father", "brown"), ("will", "find"), ("white", "man"), ("put", "out"), ("take", "care"), ("can", "get"), ("dare", "say"), ("will", "see"), ("can", "make"), ("be", "well"), ("short", "time"), ("can", "have"), ("found", "out"), ("lay", "down"), ("second", "time"), ("be", "better"), ("be", "read"), ("can", "think"), ("go", "home"), ("lord", "will"), ("birth", "rate"), ("hoist", "side"), ("meter", "gauge"), ("ftp", "program"), ("be", "true"), ("be", "like"), ("last", "time"), ("look", "like"), ("will", "say"), ("man", "can"), ("set", "down"), ("license", "fee"), ("come", "home"), ("can", "find"), ("make", "out"), ("put", "down"), ("give", "notice"), ("can", "say"), ("be", "cut"), ("take", "place"), ("low", "voice"), ("will", "try"), ("cast", "out"), ("get", "index"), ("have", "put"), ("lie", "down"), ("can", "go"), ("radio", "relay"), ("still", "be"), ("will", "get"), ("be", "ready"), ("well", "be"), ("wait", "till"), ("get", "back"), ("tax", "return"), ("free", "copyright"), ("fell", "down"), ("can", "copy"), ("set", "bin"), ("have", "felt"), ("look", "out"), ("be", "out"), ("form", "name"), ("satellite", "earth"), ("burst", "out"), ("will", "keep"), ("be", "free"), ("can", "give"), ("double", "track"), ("people", "have"), ("cut", "down"), ("will", "show"), ("fish", "catch"), ("turn", "out"), ("carry", "out"), ("well", "have"), ("work", "force"), ("be", "set"), ("have", "set"), ("miss", "garland"), ("will", "put"), ("can", "take"), ("do", "well"), ("let", "go"), ("mine", "hand"), ("earth", "station"), ("fell", "back"), ("take", "heed"), ("short", "distance"), ("air", "force"), ("can", "help"), ("will", "help"), ("cry", "out"), ("will", "let"), ("free", "state"), ("feel", "like"), ("will", "cause"), ("present", "time"), ("will", "think"), ("be", "present"), ("will", "return"), ("cast", "down"), ("black", "man"), ("narrow", "gauge"), ("bulletin", "board"), ("man", "be"), ("be", "right"), ("dry", "tree"), ("will", "set"), ("be", "back"), ("point", "out"), ("right", "side"), ("can", "come"), ("look", "down"), ("will", "call"), ("run", "down"), ("file", "size"), ("major", "transport"), ("labor", "party"), ("be", "content"), ("will", "leave"), ("man", "will"), ("will", "look"), ("can", "use"), ("need", "be")]


The problem is definitely very challenging with current tools, but it's always fun to give it a spin. In a few hours and with limited tools I could only come up with some ways to find good candidate words for garden path sentences, nowhere close to actually generating full sentences.

It would be great to expand on this toy research and make it actually useful and interesting. Using larger data sets (like this Google data set) from which to extract word relationships would be a good way to start. Having statistics for trigrams, four-grams, etc. would make things even better: with more contextual information it would be possible to get more meaningful constructs by ensuring that the chosen words occur close together within a small context (a tiny sketch of that idea follows).

I can think of more ways of improving it, most of them involving large datasets and lots of computational power... gosh, I'm getting carried away thinking about this...

Looking forward to Powerset letting people play with their tools; I'm sure that implementing ideas like the ones discussed in this rant will then become much easier.
