
Python Network Programming Basics Notes: Handling Unbalanced and Syntactically Invalid Tags in HTML

zihua 2014-03-11 18:03:35


1. Handling unbalanced or syntactically invalid HTML tags in Python

"""
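Handle unbalanced or syntactically invalid tags in HTML.
Approach: use a stack together with nesting levels.
1. Push every tag whose </tag> has not been seen yet onto the stack; pop a tag once its </tag> arrives.
2. Syntax errors are handled through levels: the tag on top of the stack is the innermost (lowest level). When an outer (higher-level) tag has to be popped, the inner tags above it are popped first, so a missing </tag> is not overlooked.
3. For a stray </tag>, first check whether the matching <tag> is on the stack; if it is not, discard the end tag.
4. Whenever a tag that needs processing is pushed, clear the collected data.
"""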
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

import re
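class UnbalanceParser(HTMLParser):
    def __init__(self):
        # Stack of tags whose </tag> has not been seen yet
        self.taglevels = []
        # Tags we want to process
        self.handledtags = ["title","ul","li"]
        # Tag currently being processed
        self.processing = None
        # Text collected for the tag being processed
        self.data = ""
        HTMLParser.__init__(self)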
    def handle_starttag(self,tag,attrs):
        if len(self.taglevels) and self.taglevels[-1] == tag:
            # The tag on top of the stack is the same as this one: treat the
            # earlier tag as unclosed and close it first
            self.handle_endtag(tag)
        # Push the current tag onto the stack (1)
        self.taglevels.append(tag)

        # If this tag needs processing, clear data and mark it as current (4)
        if tag in self.handledtags:
            self.data = ""
            self.processing = tag
            if tag == "ul":
                print("list started")

    def handle_data(self,data):
        """Append text to data while a handled tag is being processed."""
        if self.processing:
            self.data += data

    def handle_endtag(self,tag):
        if tag not in self.taglevels:
            # The tag is not on the stack: treat it as invalid and ignore it (3)
            return
        while len(self.taglevels):
            # Pop the tag on top of the stack
            starttag = self.taglevels.pop()

            # If it is a tag we process, call the handler
            if starttag in self.handledtags:
                self.finishprocessing(starttag)

            # If the popped tag matches the end tag, this level is finished, so stop;
            # otherwise keep going until every tag nested below this level is closed (2)
            if starttag == tag:
                break
    def cleanse(self):
        """Collapse runs of whitespace into a single space."""
        self.data = re.sub(r"\s+"," ",self.data)

    def finishprocessing(self,tag):
        self.cleanse()
        if tag == "title" and tag == self.processing:
            print("Title = %s" % self.data)
        elif tag == "ul":
            print("list ended")
        elif tag == "li" and tag == self.processing:
            print("list item: %s" % self.data)
        self.processing = None

# Read the test file and close it again once it has been parsed
with open("JCParseHtmlUnbalance.html") as fd:
    tp = UnbalanceParser()
    tp.feed(fd.read())
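
The stack rules (1) to (3) can also be tried on their own, without any HTML parsing. The following is only a minimal sketch for Python 3: `close_tag` is a hypothetical helper name that does not appear in the listing above, and it mirrors the pop loop in `handle_endtag` under the assumption that the stack is a plain list with the innermost tag last.

```python
def close_tag(stack, tag):
    """Pop open tags until `tag` is closed; return the closed tags, innermost first.

    A stray end tag (one whose start tag is not on the stack) closes nothing.
    """
    if tag not in stack:
        return []
    closed = []
    while stack:
        top = stack.pop()
        closed.append(top)
        if top == tag:
            break
    return closed


# Open tags after reading: <html><body><ul><li>Third List Item
stack = ["html", "body", "ul", "li"]
print(close_tag(stack, "p"))     # stray </p>: nothing is closed
print(close_tag(stack, "body"))  # </body> also closes the forgotten li and ul
print(stack)                     # only html is still open
```

Returning the closed tags instead of printing them makes the recovery order easy to inspect: the inner tags are always reported before the tag that was actually named in the end tag.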

2. HTML file used for testing

<!-- Unbalanced tags parsing example, Chapter 7 - utitle.html -->
<HTML>
<HEAD>
<TITLE>welcome to jcodeer.cublog.cn
</HEAD>
<BODY>
This is my text.
<UL>
<LI>First List Item
<LI>Second List Item</LI>
<LI>Third List Item
</BODY>
</HTML>
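
To see how the list part of this file is recovered, the following sketch feeds the body of the sample to a small parser that records events in a list instead of printing them. It is an illustration for Python 3, not the original listing: `CollectingParser`, `SAMPLE`, `finish`, and the event strings are names chosen here. The unclosed `<TITLE>` is left out on purpose, because newer Python 3 releases may treat title content as raw text until `</title>`, so its result can depend on the interpreter version.

```python
import re
from html.parser import HTMLParser

# Body of the test file: the first and third <LI> are never closed,
# and neither is <UL>.
SAMPLE = """<BODY>
This is my text.
<UL>
<LI>First List Item
<LI>Second List Item</LI>
<LI>Third List Item
</BODY>
"""


class CollectingParser(HTMLParser):
    """Hypothetical variant of UnbalanceParser that records events in a list."""

    handled = ("ul", "li")

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags, innermost last
        self.current = None   # handled tag whose text is being collected
        self.data = ""
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Same tag already on top of the stack: close the earlier one first.
        if self.stack and self.stack[-1] == tag:
            self.handle_endtag(tag)
        self.stack.append(tag)
        if tag in self.handled:
            self.data = ""
            self.current = tag
            if tag == "ul":
                self.events.append("list started")

    def handle_data(self, data):
        if self.current:
            self.data += data

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, ignore it
        while self.stack:
            top = self.stack.pop()
            if top in self.handled:
                self.finish(top)
            if top == tag:
                break

    def finish(self, tag):
        text = re.sub(r"\s+", " ", self.data).strip()
        if tag == "ul":
            self.events.append("list ended")
        elif tag == self.current:
            self.events.append("%s: %s" % (tag, text))
        self.current = None


parser = CollectingParser()
parser.feed(SAMPLE)
parser.close()
for event in parser.events:
    print(event)
```

The second `<LI>` closes the first one through the same-tag check in `handle_starttag`, while `</BODY>` closes the third `<LI>` and the `<UL>` through the pop loop in `handle_endtag`.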

Original link: http://blog.chinaunix.net/uid-28559065-id-4140926.html
