python - NLTK tokenize with collocations -


I am using NLTK and would like to tokenize a text with respect to collocations: for example, "New York" should be a single token, whereas naïve tokenization splits it into "New" and "York".

I know how to find collocations and how to tokenize, but I cannot find how to combine the two...

Thank you.

The approach that seems right for you is called named entity recognition. There are many resources dedicated to named entity recognition with NLTK; I will just give you one example:

    from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk

    def extract_entities(text):
        entities = []
        for sentence in sent_tokenize(text):
            # Tokenize, POS-tag, then chunk named entities in each sentence.
            chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
            # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples.
            # NLTK 3 exposes label(); older releases used the 'node' attribute.
            entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
        return entities

    if __name__ == '__main__':
        text = """
    A multi-agency manhunt is under way across several states and Mexico after
    police say a former Los Angeles police officer, suspected in the killing of a
    college basketball coach at the end of last week, is following through on his
    vow to kill police officers, killing one.
    "In this case, we're his target," Sgt. Rudy Lopez from the Corona Police
    Department said at a press conference.
    The suspect has been identified as Christopher Jordan Dorner, 33, and he is
    considered extremely dangerous and armed with multiple weapons.
    The killings appear to be retribution for his 2009 termination from the
    Los Angeles Police Department for making false statements, authorities say.
    Dorner posted an online manifesto that warned, "I will bring unconventional
    and asymmetrical warfare to those in LAPD uniform whether on or off duty."
    """
        print(extract_entities(text))

Output:

    [Tree('GPE', [('Mexico', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Rudy', 'NNP')]), Tree('ORGANIZATION', [('Lopez', 'NNP')]), Tree('ORGANIZATION', [('Corona', 'NNP')]), Tree('PERSON', [('Christopher', 'NNP'), ('Jordan', 'NNP'), ('Dorner', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Dorner', 'NNP')]), Tree('GPE', [('LAPD', 'NNP')])]
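To get back to the original question, the entity chunks can be collapsed into single tokens. The following is a minimal sketch, not part of the original answer: `merge_entities` is a hypothetical helper name, and the input tree is built by hand as a stand-in for what `ne_chunk` returns, so that no models need to be downloaded to try it.

```python
from nltk import Tree

def merge_entities(chunked):
    """Flatten a chunked sentence, joining each entity subtree into one token."""
    tokens = []
    for node in chunked:
        if isinstance(node, Tree):
            # An entity chunk: join the words of its (word, tag) leaves.
            tokens.append(" ".join(word for word, tag in node.leaves()))
        else:
            # An ordinary (word, tag) pair outside any entity.
            tokens.append(node[0])
    return tokens

# Hand-built stand-in for ne_chunk(pos_tag(word_tokenize("I flew to New York")))
sample = Tree("S", [
    ("I", "PRP"),
    ("flew", "VBD"),
    ("to", "TO"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

print(merge_entities(sample))
```

This only merges what the chunker recognizes as a named entity, so collocations that are not entities (for example "strong tea") stay split.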

Another approach is statistical testing: using various measures of information overlap between two random variables, such as mutual information, pointwise mutual information, the t-test and others. A good introduction is "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze; Chapter 5, Collocations, is available for download. NLTK's own documentation also includes examples of collocation extraction.
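As a rough illustration of combining the two steps, the sketch below scores bigrams by pointwise mutual information with `BigramCollocationFinder` and then feeds the best ones to `MWETokenizer`, which re-joins them in a token list. The toy sentence, the frequency threshold of 2, and the cut-off of 5 bigrams are assumptions made for the example; on a real corpus they would need tuning.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import MWETokenizer

# Toy, pre-tokenized input (assumption: a real pipeline would use word_tokenize).
tokens = "I flew to New York . New York is big . I like New York pizza .".split()

# Step 1: find candidate collocations, ignoring bigrams seen fewer than 2 times.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
bigram_measures = BigramAssocMeasures()
collocations = finder.nbest(bigram_measures.pmi, 5)

# Step 2: merge those bigrams back into single tokens.
tokenizer = MWETokenizer(collocations, separator=" ")
merged = tokenizer.tokenize(tokens)

print(collocations)
print(merged)
```

PMI tends to favor rare pairs, which is why the frequency filter is applied first; other measures on `BigramAssocMeasures`, such as `student_t` or `likelihood_ratio`, can be swapped in the same way.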

