python - Absolute position of leaves in NLTK tree
I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following code extracts the noun phrases:
    import nltk

    sent = nltk.word_tokenize(a)
    sent_pos = nltk.pos_tag(sent)
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and adjectives, terminated with nouns
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
        VP:
            {<VBD><PP>?}
            {<VBZ><PP>?}
            {<VB><PP>?}
            {<VBN><PP>?}
            {<VBG><PP>?}
            {<VBP><PP>?}
    """
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(sent_pos)
    nounphrases = []
    # Collect the words of every NP subtree as one space-joined string
    for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
        np = ''
        for x in subtree.leaves():
            np = np + ' ' + x[0]  # each leaf is a (word, tag) pair
        nounphrases.append(np.strip())
For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are
['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].
Now I need to find the span (start position and end position of the phrase) of each noun phrase. For example, the spans of the above noun phrases would be
[(1,3), (9,9), (12, 12), (16, 17), (21, 23), ....].
I'm new to NLTK and I've looked at http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions() but couldn't manage to extract absolute positions using these indices. Any help would be appreciated. Thank you!
There isn't an implicit function that returns the offsets of strings/tokens, as highlighted in https://github.com/nltk/nltk/issues/1214
But you can use the ngram searcher used by the RIBES score: https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#l123
    >>> from nltk import word_tokenize
    >>> from nltk.translate.ribes_score import position_of_ngram
    >>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
    >>> position_of_ngram(tuple('American Civil War'.split()), s)
    1
    >>> position_of_ngram(tuple('Confederate States of America'.split()), s)
    43
(It returns the starting position of the query ngram.)
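Since only the start of the first match comes back, repeated phrases such as 'Civil War' or 'War' need a bit more care if you want a (start, end) span for each one. Here is a rough sketch of one way to do that in plain Python, without NLTK. The helper names `ngram_spans` and `spans_in_order` are made up for illustration and are not part of NLTK, and the sketch assumes the phrases are listed in sentence order and do not overlap. The tokens are hard-coded for the first part of the sentence so it runs without any tokenizer data.

```python
def ngram_spans(tokens, phrase):
    """Return every (start, end) token span where `phrase` occurs in `tokens`."""
    query = phrase.split()
    n = len(query)
    return [(i, i + n - 1)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == query]


def spans_in_order(tokens, phrases):
    """Give each phrase the first match that starts after the previous phrase ends.

    Assumes `phrases` follow sentence order and do not overlap.
    A phrase with no remaining match gets None.
    """
    spans, cursor = [], 0
    for phrase in phrases:
        match = next((s for s in ngram_spans(tokens, phrase) if s[0] >= cursor), None)
        spans.append(match)
        if match is not None:
            cursor = match[1] + 1
    return spans


# Hypothetical input: the first 24 tokens of the example sentence, pre-tokenized
tokens = ("The American Civil War , also known as the War between the States "
          "or simply the Civil War , was a civil war fought").split()
phrases = ['American Civil War', 'War', 'States', 'Civil War', 'civil war fought']

print(spans_in_order(tokens, phrases))
# [(1, 3), (9, 9), (12, 12), (16, 17), (21, 23)]
```

The end index is simply start + length - 1, and the moving cursor is what keeps the second 'War' from being matched inside 'American Civil War'. Matching is exact and case-sensitive, so the phrases must use the same tokens the tokenizer produced.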