Tokenizing English Text with Python NLTK

I took one sentence from wiki/Natural_language_processing and assigned it to a variable named NLP.

Before tokenizing, I removed the punctuation (commas and periods) and the parentheses together with their contents.

import nltk
import re

NLP = "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."

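#===== Remove punctuation, parentheses, and the text inside parentheses =====
def clean(text):
    text = re.sub(r',', '', text)           # drop commas
    text = re.sub(r'\.', '', text)          # drop periods
    text = re.sub(r'\(.*?\)', '', text)     # drop "(...)" spans, non-greedy
    return text

# Keep NLP as the raw sentence; store the cleaned text under a new name
cleaned = clean(NLP)

#===== Tokenization =====
# word_tokenize needs the Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once (newer NLTK releases ask for 'punkt_tab')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(cleaned)

# Contents of tokens
#['Natural', 'language', 'processing', 'is', 'a',....]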
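
The cleaning step can also be tried on its own, without NLTK. The following is a minimal sketch using only the standard `re` module; the helper name `clean_text` and the short sample string are assumptions made for illustration, and unlike the three `re.sub` calls above it also collapses the extra whitespace left behind by a removed parenthesis.

```python
import re

# Hypothetical helper; the patterns mirror the cleaning rules described above.
_PAREN = re.compile(r"\s*\(.*?\)")  # a "(...)" span plus any whitespace before it
_PUNCT = re.compile(r"[,.]")        # commas and periods


def clean_text(text: str) -> str:
    """Remove parenthesised spans, commas and periods, then normalise spaces."""
    text = _PAREN.sub("", text)
    text = _PUNCT.sub("", text)
    # split()/join collapses any run of whitespace into a single space
    return " ".join(text.split())


# Shortened sample sentence, used here only for illustration
sample = "Natural language processing (NLP) is a subfield of linguistics, computer science."
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

Note that stripping every period this way also removes periods inside abbreviations and decimal numbers, so it suits simple sentences like the one used here.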

Sentence tokenization

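from nltk.tokenize import sent_tokenize

# sent_tokenize needs the raw string with its periods intact, i.e. NLP as
# originally assigned, before cleaning and tokenization.
# NLP2 stands for a second sentence string; it is not defined in this article.
# The space keeps the two sentences from running together at the period.
sent_tokenize(NLP + " " + NLP2)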
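
Because NLP2 is not defined in this article, here is a small self-contained sketch of sentence splitting with two stand-in sentences. The wrapper `split_sentences` and both sample strings are assumptions made for illustration. The fallback to an untrained `PunktSentenceTokenizer` is there only so the example still runs when the pretrained Punkt data has not been downloaded; it knows no abbreviations, so it may split more often than `sent_tokenize` would.

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer


def split_sentences(text):
    """Split text into sentences, preferring NLTK's pretrained Punkt model."""
    try:
        return sent_tokenize(text)
    except LookupError:
        # Pretrained tokenizer data is not installed: fall back to an
        # untrained Punkt tokenizer (no abbreviation knowledge).
        return PunktSentenceTokenizer().tokenize(text)


# Hypothetical two-sentence input, not taken from the article
first = "Natural language processing is a subfield of linguistics."
second = "It is concerned with how computers process human language."

# The sentences keep their periods and are joined with a space
sentences = split_sentences(first + " " + second)
for sentence in sentences:
    print(sentence)
```

Sentence splitting relies on the sentence-final periods, so it has to run on the raw text rather than on text from which the periods were already stripped.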
