Cleaning Text with Regular Expressions


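This post walks through concrete examples of sub(), which replaces matched text, and findall(), which extracts the substrings that match a pattern.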

Replacing with re.sub()

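import re
# Remove half-width commas, then remove each parenthesized span (non-greedy, half-width parentheses only)
text = re.sub(r',', '', text)
text = re.sub(r'\(.*?\)', '', text)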
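As a rough illustration of the two substitutions above, the sketch below applies them to a made-up sample string. The helper name `strip_commas_and_parens` and the sample text are assumptions for illustration only.

```python
import re

def strip_commas_and_parens(text: str) -> str:
    """Remove half-width commas and parenthesized spans from text."""
    # Delete every half-width comma.
    text = re.sub(r',', '', text)
    # Delete each "(...)" span; ".*?" is non-greedy, so each pair is removed separately.
    text = re.sub(r'\(.*?\)', '', text)
    return text

sample = "1,234 yen (tax included) and 5,678 yen (tax excluded)"
cleaned = strip_commas_and_parens(sample)
print(repr(cleaned))  # '1234 yen  and 5678 yen '
```

The spaces around the removed spans are left behind, so a follow-up whitespace cleanup is often needed. The second pattern only targets half-width parentheses; full-width （） would need their own pattern.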

Extracting with re.findall()

# Alphabetic characters (half-width)
p = re.compile(r'[a-zA-Z]+')
# To include full-width letters as well: r'[a-zA-Zａ-ｚＡ-Ｚ]+'
p.findall(textdata)

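# Digits (half-width); the range starts at 0 so that zeros are not dropped
p = re.compile(r'[0-9]+')
# To include full-width digits as well: r'[0-9０-９]+'
p.findall(textdata)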

# Extract annotations of the form ［＃...］ (full-width brackets and full-width ＃)
annotations = re.compile(r'［＃.*?］')
annotations.findall(textdata)
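To make the behavior of these extraction patterns concrete, here is a small sketch run against a made-up string. The sample `textdata` is an assumption, and the character classes used here are the variants that cover both half-width and full-width characters.

```python
import re

# Hypothetical sample mixing half-width and full-width characters.
textdata = "Python3で東京２０２０［＃ここから注記］ABCａｂｃ"

# Runs of letters, half-width or full-width.
letters = re.findall(r'[a-zA-Zａ-ｚＡ-Ｚ]+', textdata)
# Runs of digits, half-width or full-width.
digits = re.findall(r'[0-9０-９]+', textdata)
# Annotations wrapped in full-width brackets; non-greedy so each one is matched separately.
annotations = re.findall(r'［＃.*?］', textdata)

print(letters)      # ['Python', 'ABCａｂｃ']
print(digits)       # ['3', '２０２０']
print(annotations)  # ['［＃ここから注記］']
```

findall() returns each maximal run as one list element, so adjacent half-width and full-width letters end up in the same match.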

Patterns for specific kinds of strings

Emoticons (kaomoji)

URL

re.compile(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+")
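One thing worth checking with this URL pattern: in Python 3, `\w` matches Japanese characters by default, so a URL followed directly by Japanese text can be over-matched. The sketch below compares the pattern with and without the `re.ASCII` flag. The sample sentence and the `example.com` address are placeholders.

```python
import re

URL_CHARS = r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+"

# Default (Unicode) matching: \w also covers kana and kanji.
url_unicode = re.compile(URL_CHARS)
# With re.ASCII, \w is limited to [a-zA-Z0-9_].
url_ascii = re.compile(URL_CHARS, re.ASCII)

sample = "詳細はhttps://example.com/path?id=1を参照"

loose = url_unicode.findall(sample)
urls = url_ascii.findall(sample)
cleaned = url_ascii.sub("", sample)

print(loose)    # ['https://example.com/path?id=1を参照']
print(urls)     # ['https://example.com/path?id=1']
print(cleaned)  # 詳細はを参照
```

Whether `re.ASCII` is appropriate depends on the data: it would cut off URLs that legitimately contain non-ASCII characters.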

Postal codes

re.compile(r'[0-9]{3}-[0-9]{4}')
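The postal code pattern matches any three digits, a hyphen, and four digits, so it can also fire inside longer numbers such as phone numbers. The following sketch shows this on made-up text and adds one possible guard using lookarounds. The stricter pattern is an assumption of this example, not part of the original post.

```python
import re

# Pattern as given above.
postal_loose = re.compile(r'[0-9]{3}-[0-9]{4}')
# Hypothetical stricter variant: no digit or hyphen directly before or after the match.
postal_strict = re.compile(r'(?<![0-9-])[0-9]{3}-[0-9]{4}(?![0-9-])')

sample = "〒100-0001 東京都千代田区, TEL 03-1234-5678"

loose_hits = postal_loose.findall(sample)
strict_hits = postal_strict.findall(sample)

print(loose_hits)   # ['100-0001', '234-5678']
print(strict_hits)  # ['100-0001']
```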

Emoji

Emoticons (Unicode block)
図書館員のコンピュータ基礎講座 (Computer Basics for Librarians) | Miscellaneous Symbols and Pictographs

re.compile("([\U0001F600-\U0001F64F]|[\U0001F300-\U0001F5FF])")
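A minimal sketch of how these two Unicode ranges can be used to find or strip emoji. The two ranges are merged into a single character class here, which is an equivalent rewrite for this purpose, and the sample string is made up. The emoji are written as escapes to avoid encoding problems.

```python
import re

# U+1F600-U+1F64F: Emoticons block; U+1F300-U+1F5FF: Miscellaneous Symbols and Pictographs.
emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]")

# Hypothetical sample containing U+1F31E and U+1F600.
sample = "今日は晴れ\U0001F31E楽しい\U0001F600"

found = emoji_pattern.findall(sample)
cleaned = emoji_pattern.sub("", sample)

print(len(found))  # 2
print(cleaned)     # 今日は晴れ楽しい
```

These two blocks do not cover every emoji. Characters in other blocks, such as Supplemental Symbols and Pictographs (U+1F900 to U+1F9FF), would need additional ranges.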

Japanese characters only

re.compile(r"[ぁ-んァ-ン 一-龥]")
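The pattern above matches one character at a time, so a typical use is to join the matches back together. The sketch below does that on a made-up sentence. It also shows that the katakana range ァ-ン does not include the long vowel mark ー (U+30FC), and compares it with a variant that adds it. That variant is an assumption of this example.

```python
import re

# Pattern as given above: hiragana, katakana (ァ to ン), half-width space, and kanji.
japanese_only = re.compile(r"[ぁ-んァ-ン 一-龥]")
# Hypothetical variant that also keeps the long vowel mark U+30FC.
japanese_with_choon = re.compile(r"[ぁ-んァ-ン\u30FC 一-龥]")

sample = "Hello、コーヒー 3杯 ください!"

kept = "".join(japanese_only.findall(sample))
kept_with_choon = "".join(japanese_with_choon.findall(sample))

print(kept)             # コヒ 杯 ください
print(kept_with_choon)  # コーヒー 杯 ください
```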

Hashtags

re.compile(r"#[\w/:%#\$&\?\(\)~\.=\+\-]+")
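Here is a short sketch of the hashtag pattern applied to a made-up post. In Python 3, `\w` matches Japanese characters by default, so Japanese hashtags are picked up as well. The sample strings are assumptions.

```python
import re

hashtag_pattern = re.compile(r"#[\w/:%#\$&\?\(\)~\.=\+\-]+")

sample = "新作公開 #Python #自然言語処理 です"
found = hashtag_pattern.findall(sample)
print(found)  # ['#Python', '#自然言語処理']

# The character class itself contains '#', so adjacent tags are returned as one match.
joined = hashtag_pattern.findall("#a#b")
print(joined)  # ['#a#b']
```

Because the class reuses the URL character set, the pattern will also match a fragment such as `#section` inside a URL, so it may be safer to remove URLs first.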

At sign (@ mentions)

re.compile(r'@.*? ')
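The pattern above matches from `@` up to and including the next half-width space, so a mention at the very end of the text is missed and each match carries a trailing space. The sketch below shows this and compares it with one possible alternative, `@\w+`. The alternative and the sample text are assumptions of this example.

```python
import re

# Pattern as given above: lazy match up to the first space.
mention_original = re.compile(r'@.*? ')
# Hypothetical alternative: '@' followed by word characters, no trailing space needed.
mention_alt = re.compile(r'@\w+')

sample = "@alice thanks! cc @bob"

original_hits = mention_original.findall(sample)
alt_hits = mention_alt.findall(sample)

print(original_hits)  # ['@alice ']
print(alt_hits)       # ['@alice', '@bob']
```

Note that `@\w+` would also match the domain part of an email address, so which variant fits better depends on the input.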