Python - NLTK - Remove Tags From Parsed Chunks
This question already has an answer here:
- ne_chunk without pos_tag in NLTK (2 answers)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import nltk
import re
from nltk.tree import *
from nltk.chunk.util import tagstr2tree
from nltk import word_tokenize, pos_tag

text = "yarın, mehmet ile birlikte ankara'da ki nüfus müdürlüğü'ne, aziz yıldırım ile birlikte, Şükrü saraçoğlu stadı'na gideceğiz.".decode("utf-8")
tagged_text = pos_tag(word_tokenize(text))
tagged_text2 = word_tokenize(text)

# Group runs of consecutive proper nouns into NP chunks
grammar = "NP:{<NNP>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged_text)

for tree in result:
    print(tree)

wrapped = "(ROOT "+ str(result) + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])

for tree in trees:
    print(tree.leaves())

for tree2 in result:
    print(nltk.Tree.fromstring(str(tree2), read_leaf=lambda x: x.split("/")[0]))
The output:
(NP yar\u0131n/NNP)
(u',', ',')
(NP mehmet/NNP)
(u'ile', 'NN')
(u'birlikte', 'NN')
(NP ankara'da/NNP ki/NNP nufus/NNP mudurlugu'ne/NNP)
(u',', ',')
(NP aziz/NNP y\u0131ld\u0131r\u0131m/NNP)
(u'ile', 'NN')
(u'birlikte', 'NN')
(u',', ',')
(NP sukru/NNP saracoglu/NNP stad\u0131'na/NNP)
(u'gidece\u011fiz', 'NN')
(u'.', '.')
['yar\\u0131n', ',', 'mehmet', 'ile', 'birlikte', "ankara'da", 'ki', 'nufus', "mudurlugu'ne", ',', 'aziz', 'y\\u0131ld\\u0131r\\u0131m', 'ile', 'birlikte', ',', 'sukru', 'saracoglu', "stad\\u0131'na", 'gidecegiz', '.']
(NP yar\u0131n)
(u',', ',')
(NP mehmet)
(u'ile', 'NN')
(u'birlikte', 'NN')
(NP ankara'da ki nufus mudurlugu'ne)
(u',', ',')
(NP aziz y\u0131ld\u0131r\u0131m)
(u'ile', 'NN')
(u'birlikte', 'NN')
(u',', ',')
(NP sukru saracoglu stad\u0131'na)
(u'gidece\u011fiz', 'NN')
(u'.', '.')
I referred to this question: How can I remove POS tags before slashes in NLTK?
I want to group the proper names and remove the tags. But when I use that solution, it affects the whole text, and the grouping produced by the chunk parse is gone. I tried to understand the tree structure, but I can't see how to apply the tag-removing function inside the for statement.
My desired output:
[yar\u0131n] [,] [mehmet] [ile] [birlikte] [ankara'da ki nufus mudurlugu'ne] ... ...
Also, I can't deal with UTF-8: as you can see, the output is full of escape sequences in place of the non-ASCII characters. How can I deal with that?
Edit:
for i in range(len(tree)):
    arr.append(nltk.Tree.fromstring(str(tree[i]), read_leaf=lambda x: x.split("/")[0]).leaves())
    print(arr[i])
I found that I should write the loop above in my code, but then I get the following error. I think I can't append the punctuation marks to the array.
['yar\\u0131n']
Traceback (most recent call last):
  File "./chunk2.py", line 61, in <module>
    arr.append(nltk.Tree.fromstring(str(tree[i]), read_leaf=lambda x: x.split("/")[0]).leaves())
  File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 630, in fromstring
    cls._parse_error(s, match, open_b)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 675, in _parse_error
    raise ValueError(msg)
ValueError: Tree.read(): expected u'(' got ',' at index 0.
    ","
    ^
It's more inefficient than you realize. You're producing a parse tree, converting it to a string, wrapping it as if it were multiple trees (it isn't), and then parsing the wrapped string back into a tree. As soon as you have the parse tree result, stop and just remove the POS tags.
An NLTK tree is a kind of list, so just iterate over the branches of the tree and remove the POS tag from the leaf tuples. To get your desired format, you also need to add a level of wrapping around the words that are not NPs:
...
>>> import pprint
>>> result = cp.parse(tagged_text)
>>> terms = []
>>> for e in result:
...     if isinstance(e, tuple):
...         terms.append([ e[0] ])
...     else:
...         terms.append([w for w, t in e])
...
>>> pprint.pprint(terms)
[['yarın'],
 [','],
 ['mehmet'],
 ['ile'],
 ['birlikte'],
 ["ankara'da", 'ki', 'nüfus', "müdürlüğü'ne"],
 [','],
 ['aziz', 'yıldırım'],
 ...
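The same loop can be wrapped in a small function and the groups joined into the bracketed form the question asks for. This is only a sketch, written for Python 3. The name strip_tags is made up for illustration, and sample is a hand-built stand-in for cp.parse(tagged_text), with plain lists playing the role of the NP subtrees so that it runs without NLTK or its tagger models.

```python
def strip_tags(chunked):
    """Return one list of words per top-level element, dropping POS tags.

    `chunked` is assumed to iterate like the RegexpParser result above:
    bare (word, tag) tuples for unchunked tokens, and list-like subtrees
    of (word, tag) tuples for NP chunks.
    """
    terms = []
    for element in chunked:
        if isinstance(element, tuple):
            # Unchunked token: keep the word and wrap it in its own list.
            terms.append([element[0]])
        else:
            # Chunk (subtree): keep the word from every (word, tag) leaf.
            terms.append([word for word, tag in element])
    return terms


# Hypothetical stand-in for cp.parse(tagged_text); plain lists are used
# in place of nltk.Tree subtrees (an assumption made for this sketch).
sample = [
    [("yarın", "NNP")],
    (",", ","),
    [("mehmet", "NNP")],
    ("ile", "NN"),
    [("ankara'da", "NNP"), ("ki", "NNP")],
]

groups = strip_tags(sample)
for words in groups:
    # Print the joined string itself rather than the list object.
    print("[" + " ".join(words) + "]")
```

Printing each joined string, instead of printing the list, should also help with the escape sequences in the question: in Python 2, printing a list shows the repr of each element, and that is where the \u0131 escapes come from.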