NLTK for Tagging Software Properties

Created On: May-02-2023 · Author: Sasikanth
One of the problems I set out to solve here is identifying various parts-of-speech forms, specifically noun forms such as locations, organization names, addresses, version numbers, and so on. These can be used to automate the process of cataloguing applications into a database, one of many possible uses. While this can't be an out-and-out solution, it can certainly aid a manual cataloguer by simplifying the job of running over multiple websites and searching extensively, and can probably reduce a lot of the manual cataloguing work.

I will not dig extensively into the coding part, as it is fairly self-explanatory. An ideal solution for a problem like this would be to crawl various web pages, extract all the relevant content, and then run an NLTK algorithm on top of it to extract the tags. Since I have already covered web crawling in one of my previous blogs, I will not elaborate on that; instead, I will use some hard-coded corpora and analyse them to extract the tags.
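Note that NLTK's tokenizer, tagger, and chunker rely on model data that ships separately from the library itself, so a one-time setup along these lines is needed before the code below will run. The package names here assume a recent NLTK release; older releases may bundle the models differently.

```python
import nltk

# One-time download of the model data used by word_tokenize, pos_tag
# and ne_chunk. Package names assume a recent NLTK release.
for package in ["punkt", "averaged_perceptron_tagger",
                "maxent_ne_chunker", "words"]:
    nltk.download(package)
```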


from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    """Return labelled named entities in text, merging adjacent chunks."""
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if isinstance(subtree, Tree):
            # A named-entity subtree: keep its label (ORGANIZATION, GPE, ...)
            # together with its tokens.
            current_chunk.append(
                subtree.label() + ": "
                + " ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            # A plain token ends the current run of entity chunks; flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []

    # Flush a trailing entity if the text ends on one.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "A subsidiary of SAP SE, SAP North America oversees all business operations in the U.S. and Canada, and is headquartered in Newtown Square, PA, in the Philadelphia area. Get to know our management teams, learn about our long-term commitment to community, and find out what the SAP University Alliances program is doing to empower students at home and abroad. "
print("\n".join(get_continuous_chunks(txt)))


Here's the output, with all the noun forms nicely tagged.
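One gap worth noting: the NE chunker covers entities like organizations and locations, but version numbers, one of the goals stated above, are not among its labels. A plain regular expression can pick those up alongside it. This is just a sketch; the pattern is my own assumption about typical "major.minor.patch" version strings, not part of NLTK.

```python
import re

# Matches version strings like "2.0", "10.4.1" or "v3.12.0-rc1".
# The pattern is an illustrative assumption, not part of NLTK.
VERSION_RE = re.compile(r"\bv?\d+\.\d+(?:\.\d+)*(?:-\w+)?\b")

def extract_versions(text):
    """Return all version-number-like substrings found in text."""
    return VERSION_RE.findall(text)

print(extract_versions("nginx 1.25.3 and the v2.0-beta release"))
```

Running both extractors over the same crawled page would give the cataloguer organization and location tags from NLTK plus candidate version strings from the regex.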

