python - Absolute position of leaves in NLTK tree


I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following is the code for extracting noun phrases:

    import nltk

    sent = nltk.word_tokenize(a)
    sent_pos = nltk.pos_tag(sent)
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and adjectives, terminated with nouns

        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
        VP:
            {<VBD><PP>?}
            {<VBZ><PP>?}
            {<VB><PP>?}
            {<VBN><PP>?}
            {<VBG><PP>?}
            {<VBP><PP>?}
    """

    cp = nltk.RegexpParser(grammar)
    result = cp.parse(sent_pos)
    nounPhrases = []
    # Collect the words of every NP chunk as a single space-joined string.
    for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
        np = ''
        for x in subtree.leaves():
            np = np + ' ' + x[0]
        nounPhrases.append(np.strip())

For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are

['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].

Now I need to find the span (start position and end position, as token indices) of each noun phrase. For example, the spans of the above noun phrases would be

[(1,3), (9,9), (12, 12), (16, 17), (21, 23), ....].

I'm new to NLTK and I've looked at http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions() but couldn't manage to extract absolute positions using these indices. Any help would be appreciated. Thank you!

There isn't a built-in function that returns the offsets of strings/tokens, as highlighted in https://github.com/nltk/nltk/issues/1214
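Since the chunk tree keeps its leaves in sentence order, the offsets can be computed by hand while walking the tree. Below is a minimal sketch, assuming the tree is an nltk.Tree whose leaves are (word, tag) tuples, as in the question's `result`. `chunk_spans` is a hypothetical helper name, not part of NLTK, and the small tree is built by hand so that no tagger model is needed.

```python
from nltk import Tree


def chunk_spans(tree, label):
    """Return inclusive (start, end) token indices of every chunk with `label`.

    Hypothetical helper (not an NLTK function). Assumes leaves are
    (word, tag) tuples and every subtree is an nltk.Tree.
    """
    spans = []

    def walk(node, start):
        # Returns the index of the first token that follows `node`.
        if not isinstance(node, Tree):
            return start + 1  # a leaf covers exactly one token
        end = start
        for child in node:
            end = walk(child, end)
        if node.label() == label:
            spans.append((start, end - 1))
        return end

    walk(tree, 0)
    return sorted(spans)


# Hand-built fragment shaped like the chunker output for "The American Civil War, also".
tree = Tree('S', [
    ('The', 'DT'),
    Tree('NP', [Tree('NBAR', [('American', 'NNP'), ('Civil', 'NNP'), ('War', 'NNP')])]),
    (',', ','),
    ('also', 'RB'),
])
print(chunk_spans(tree, 'NP'))  # [(1, 3)]
```

With the question's code this would be called as chunk_spans(result, 'NP'). Because it counts leaves rather than searching for text, repeated phrases such as "Civil War" each get their own span.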

One workaround is the n-gram searcher used by the RIBES score: https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

    >>> from nltk import word_tokenize
    >>> from nltk.translate.ribes_score import position_of_ngram
    >>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
    >>> position_of_ngram(tuple('American Civil War'.split()), s)
    1
    >>> position_of_ngram(tuple('Confederate States of America'.split()), s)
    43

(It returns the starting position of the query n-gram.)
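To turn a starting position into a (start, end) span, add the phrase length minus one. A text search also needs care with repeated phrases: "Civil War" occurs inside "American Civil War" before it occurs on its own. The sketch below is plain Python and does not call position_of_ngram; `phrase_spans` is a hypothetical helper that assumes the phrases are given in sentence order and resumes each search after the previous match. The token list is written out by hand to mirror what word_tokenize is expected to produce for this sentence.

```python
def phrase_spans(tokens, phrases):
    """Return inclusive (start, end) token spans for phrases listed in sentence order.

    Hypothetical helper: each search starts after the previous match, so
    repeated phrases map to successive occurrences. Unmatched phrases give None.
    """
    spans = []
    search_from = 0
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for i in range(search_from, len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                spans.append((i, i + n - 1))
                search_from = i + n
                break
        else:
            spans.append(None)
    return spans


# Hand-tokenized sentence (assumed to match word_tokenize output).
tokens = (
    "The American Civil War , also known as the War between the States or "
    "simply the Civil War , was a civil war fought from 1861 to 1865 in the "
    "United States after several Southern slave states declared their "
    "secession and formed the Confederate States of America ."
).split()

noun_phrases = ['American Civil War', 'War', 'States', 'Civil War',
                'civil war fought', 'United States', 'several Southern',
                'states', 'secession', 'Confederate States', 'America']

spans = phrase_spans(tokens, noun_phrases)
print(spans[:5])  # [(1, 3), (9, 9), (12, 12), (16, 17), (21, 23)]
```

The first five spans match the ones listed in the question. The comparison is case-sensitive, which is what keeps 'War' (index 9) apart from 'war' (index 22).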

