Programming, Python

An easy way to fix the '</scr" + "ipt>' parse error

by opendata • July 19, 2014

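When you parse HTML with Python's HTMLParser, the parser sometimes raises an error on '</scr" + "ipt>', a script end tag that the page's JavaScript assembles by string concatenation.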

HTMLParser.HTMLParseError: bad end tag: '</scr" + "ipt>', at line ...

Switching to a package that can cope with this kind of JavaScript is one way to solve it, but rewriting the program for that is a hassle.

So, crude as it is, the simple fix is to replace '</scr" + "ipt>' with some arbitrary string before the HTML is parsed.

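<omitted>
import urllib2
from HTMLParser import HTMLParser

<omitted>

f = urllib2.urlopen(req)
# Replace the split end tag before feeding the parser.
# The search string must match the page source exactly, including any spaces around "+".
parser.feed(f.read().replace('</scr"+"ipt>','xxx'))

<omitted>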
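The plain `replace` call only catches one exact spelling of the split tag, and the error message quoted earlier shows spaces around the `+` that the search string lacks. A slightly more tolerant variant could use a regular expression instead. The following is a minimal sketch, not the original program: it is written for Python 3 (where the module is named `html.parser`), and the names `SPLIT_SCRIPT_END`, `neutralize_split_script_end`, `LinkCollector`, and the sample HTML are all hypothetical, introduced only for illustration.

```python
import re
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

# Matches '</scr" + "ipt>' with either quote character and with
# optional whitespace around the plus sign (hypothetical helper pattern).
SPLIT_SCRIPT_END = re.compile(r"""</scr["']\s*\+\s*["']ipt>""", re.IGNORECASE)


def neutralize_split_script_end(html, placeholder="xxx"):
    """Replace split script end tags with a harmless placeholder string."""
    return SPLIT_SCRIPT_END.sub(placeholder, html)


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


# Sample page (an assumption) containing the problematic construct.
sample = (
    '<html><body>'
    '<script>var s = "<scr" + "ipt></scr" + "ipt>";</script>'
    '<a href="http://example.com/">link</a>'
    '</body></html>'
)

parser = LinkCollector()
parser.feed(neutralize_split_script_end(sample))
```

Because the pattern is applied to the whole document before parsing, it shares the drawback of the simple replacement: any legitimate occurrence of the same text elsewhere in the page is rewritten too.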
