python - NLTK - Remove Tags From Parsed Chunks -



    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import os
    import nltk
    import re

    from nltk.tree import *
    from nltk.chunk.util import tagstr2tree
    from nltk import word_tokenize, pos_tag

    text = "yarın, mehmet ile birlikte ankara'da ki nüfus müdürlüğü'ne, aziz  yıldırım ile birlikte, Şükrü saraçoğlu stadı'na gideceğiz.".decode("utf-8")

    tagged_text = pos_tag(word_tokenize(text))
    tagged_text2 = word_tokenize(text)

    # Chunk every run of one or more proper nouns into an NP
    grammar = "NP: {<NNP>+}"

    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tagged_text)

    for tree in result:
        print(tree)

    wrapped = "(ROOT " + str(result) + " )"  # Add a "ROOT" node at the top
    trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])

    for tree in trees:
        print(tree.leaves())

    for tree2 in result:
        print(nltk.Tree.fromstring(str(tree2), read_leaf=lambda x: x.split("/")[0]))

The output:

    (NP yar\u0131n/NNP)
    (u',', ',')
    (NP mehmet/NNP)
    (u'ile', 'NN')
    (u'birlikte', 'NN')
    (NP ankara'da/NNP ki/NNP nufus/NNP mudurlugu'ne/NNP)
    (u',', ',')
    (NP aziz/NNP y\u0131ld\u0131r\u0131m/NNP)
    (u'ile', 'NN')
    (u'birlikte', 'NN')
    (u',', ',')
    (NP sukru/NNP saracoglu/NNP stad\u0131'na/NNP)
    (u'gidece\u011fiz', 'NN')
    (u'.', '.')

    ['yar\\u0131n', ',', 'mehmet', 'ile', 'birlikte', "ankara'da", 'ki', 'nufus', "mudurlugu'ne", ',', 'aziz', 'y\\u0131ld\\u0131r\\u0131m', 'ile', 'birlikte', ',', 'sukru', 'saracoglu', "stad\\u0131'na", 'gidecegiz', '.']

    (NP yar\u0131n)
    (u',', ',')
    (NP mehmet)
    (u'ile', 'NN')
    (u'birlikte', 'NN')
    (NP ankara'da ki nufus mudurlugu'ne)
    (u',', ',')
    (NP aziz y\u0131ld\u0131r\u0131m)
    (u'ile', 'NN')
    (u'birlikte', 'NN')
    (u',', ',')
    (NP sukru saracoglu stad\u0131'na)
    (u'gidece\u011fiz', 'NN')
    (u'.', '.')

I referred to: How can I remove POS tags before slashes in NLTK?

I want to group proper names and remove the tags, but when I use that solution it affects the whole text, and the grouping from the chunk parse is gone. I tried to understand the tree structure, but how can I apply the tag-removing function inside the for statement?

My desired output:

[yar\u0131n] [,] [mehmet] [ile] [birlikte] [ankara'da ki nufus mudurlugu'ne] ... ... 

Also, I can't deal with UTF-8: as you can see, the output is full of escaped non-ASCII characters. How can I deal with that?

Edit:

    for i in range(len(tree)):
        arr.append(nltk.Tree.fromstring(str(tree[i]), read_leaf=lambda x: x.split("/")[0]).leaves())
        print(arr[i])

I found that I should write this in my code, but I get the following error. I think I can't append the punctuation marks to the array.

    ['yar\\u0131n']
    Traceback (most recent call last):
      File "./chunk2.py", line 61, in <module>
        arr.append(nltk.Tree.fromstring(str(tree[i]), read_leaf=lambda x: x.split("/")[0]).leaves())
      File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 630, in fromstring
        cls._parse_error(s, match, open_b)
      File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 675, in _parse_error
        raise ValueError(msg)
    ValueError: Tree.read(): expected u'(' but got ','
                at index 0.
                    ","
                     ^

It's more inefficient than you realize. You're producing a parse tree, converting it to a string, wrapping it as if it were multiple trees (it isn't), and then parsing the wrapped string back into a tree. As soon as you have the parse tree result, stop and just remove the POS tags.

An nltk tree is a kind of list, so just iterate over the branches of the tree and remove the POS tag from the leaf tuples. To get your desired format, you also need to add a level of wrapping around the words that are not NPs:

    ...
    >>> result = cp.parse(tagged_text)
    >>> terms = []
    >>> for e in result:
    ...     if isinstance(e, tuple):
    ...         terms.append([ e[0] ])
    ...     else:
    ...         terms.append([w for w, t in e])
    >>> pprint.pprint(terms)
    [['yarın'],
     [','],
     ['mehmet'],
     ['ile'],
     ['birlikte'],
     ["ankara'da", 'ki', 'nüfus', "müdürlüğü'ne"],
     [','],
     ['aziz', 'yıldırım'],
     ...
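To try the same loop without installing a tagger model, here is a minimal sketch that uses plain Python lists in place of the chunk subtrees. This relies on the point above that an nltk tree behaves like a list. The sample data and the names `parsed`, `strip_tags`, `groups` and `phrases` are made up for illustration; in real use you would pass the result of `cp.parse(tagged_text)` instead.

```python
# Stand-in for the chunker output (an assumption for illustration):
# bare (word, tag) tuples are unchunked tokens, inner lists play the
# role of NP subtrees, which iterate the same way a Tree does.
parsed = [
    [("Mehmet", "NNP")],
    ("ile", "NN"),
    ("birlikte", "NN"),
    [("Ankara'da", "NNP"), ("ki", "NNP")],
    (",", ","),
]


def strip_tags(branches):
    """Return one list of words per top-level branch, dropping POS tags."""
    groups = []
    for branch in branches:
        if isinstance(branch, tuple):
            # Unchunked (word, tag) leaf: wrap the word in its own list.
            groups.append([branch[0]])
        else:
            # Chunk: keep only the words of its (word, tag) leaves.
            groups.append([word for word, _tag in branch])
    return groups


groups = strip_tags(parsed)

# Join each group into one string, matching the bracketed layout
# asked for in the question.
phrases = [" ".join(group) for group in groups]
for phrase in phrases:
    print("[" + phrase + "]")
```

Printing the joined strings rather than the list itself also helps with the escape sequences in the question: on Python 2, printing a list shows the repr of each unicode element (hence `\u0131`), while printing a string writes the characters themselves.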
