Chapter 5. XML Processing

Table of Contents

5.1. Diving in

This chapter assumes that you are already familiar with XML.

  • You should know what an XML document looks like, what makes it well-formed, and what makes it valid.
  • You should know what a DTD looks like.
  • You should be familiar with the standard terminology for the pieces that make up an XML document: element, node, attribute, and so on.

You don't need to be a philosophy major, but if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program more than if you majored in something like computer science.

There are two basic ways to work with XML. One is called SAX (Simple API for XML), which works by reading the XML a little bit at a time and calling a method for each element it finds. (If you have read the chapter on HTML processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (Document Object Model), which works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter only covers using the DOM.
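
To make the DOM approach concrete, here is a minimal sketch, separate from the example program below; the XML fragment and variable names are made up purely for illustration. It parses a small XML string with minidom and walks the resulting tree:

from xml.dom import minidom

# a made-up XML fragment, used only for this illustration
xmldata = "<ref id='greeting'><text>hello</text><text>hi</text></ref>"

# the DOM approach: parse the whole document at once into a tree of nodes
doc = minidom.parseString(xmldata)
root = doc.documentElement
print root.tagName                  # the root element's tag name: 'ref'
print root.attributes["id"].value   # the value of its 'id' attribute
for child in root.childNodes:
    print child.toxml()             # re-serialize each child node as XML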

Here is a complete Python program that generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry if you don't understand what that means yet; you'll examine the XML document that drives this program in the next section.

Example 5.1. kgp.py

If you have not already done so, you can download this and the other examples used in this book.

"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [string...]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -s ..., --source=...    parse specified source file or URL instead of string
  -w #, --wrap=#          hard wrap output to # characters per line
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -w 72 paragraph  generates a paragraph of Kant, wrapped to 72 characters
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kgp.py -s template.xml  reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar=None, source=None):
        self.refs = {}
        self.defaultSource = None
        self.pieces = []
        self.capitalizeNextWord = 0
        self.loadGrammar(grammar)
        if not source:
            source = self.defaultSource
        self.loadSource(source)
        self.refresh()

    def loadGrammar(self, grammar):
        """load context-free grammar
        
        grammar can be
        - a URL of a remote XML file ("http://diveintopython.org/kant.xml")
        - a filename of a local XML file ("/a/diveintopython/common/py/kant.xml")
        - the actual grammar, as a string
        """
        sock = toolbox.openAnything(grammar)
        self.grammar = minidom.parse(sock).documentElement
        sock.close()
        self.refs = {}
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if standaloneXrefs:
            self.defaultSource = '<xref id="%s"/>' % random.choice(standaloneXrefs)
        else:
            self.defaultSource = None
        
    def loadSource(self, source):
        """load source
        
        source can be
        - a URL of a remote XML file ("http://diveintopython.org/section.xml")
        - a filename of a local XML file ("/a/diveintopython/common/py/section.xml")
        - the actual XML to parse, as a string ("<xref id='section'/>")
        """
        sock = toolbox.openAnything(source)
        self.source = minidom.parse(sock).documentElement
        sock.close()

    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer and re-parse entire source file
        
        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node
        
        This is a utility method used by parse_xref and parse_choice.
        """
        def isElement(e):
            return isinstance(e, minidom.Element)
        choices = filter(isElement, node.childNodes)
        chosen = random.choice(choices)
        if _debug:
            print '%s available choices:' % len(choices), [e.toxml() for e in choices]
            print 'Chosen:', chosen.toxml()
        return chosen

    def parse(self, node):
        """parse a single XML node
        
        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types.  Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document).  The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node
        
        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):
        """parse a text node
        
        The text of a text node is usually added to the output buffer
        verbatim.  The one exception is that <p class='sentence'> sets
        a flag to capitalize the first letter of the next word.  If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node):
        """parse an element
        
        An XML element corresponds to an actual tag in the source:
        <xref id='...'>, <p chance='...'>, <choice>, etc.
        Each element type is handled in its own method.  Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an <xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment
        
        The grammar can contain XML comments, but we ignore them
        """
        pass
    
    def do_xref(self, node):
        """handle <xref id='...'> tag
        
        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
        tag.  <xref id='sentence'/> evaluates to a randomly chosen child of
        <ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle <p> tag
        
        The <p> tag is the core of the grammar.  It can contain almost
        anything: freeform text, <choice> tags, <xref> tags, even other
        <p> tags.  If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized.  If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            map(self.parse, node.childNodes)

    def do_choice(self, node):
        """handle <choice> tag
        
        A <choice> tag contains one or more <p> tags.  One <p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):
    grammar = None
    source = None
    wrap = None
    try:
        opts, args = getopt.getopt(argv, "hg:s:w:d", ["help", "grammar=","source=","wrap="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg
        elif opt in ("-s", "--source"):
            source = arg
        elif opt in ("-w", "--wrap"):
            try:
                wrap = int(arg)
            except ValueError:
                print "Warning: ignoring invalid wrap option: %s" % arg
    
    if not grammar:
        grammar = "kant.xml"
    
    if not source:
        if args:
            source = "".join(["<xref id='%s'/>" % arg for arg in args])

    k = KantGenerator(grammar, source)
    if wrap:
        print toolbox.hardwrap(k.output(), wrap)
    else:
        print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])

Example 5.2. toolbox.py

"""Miscellaneous utility functions"""

def hardwrap(s, maxcol=72):
    """hard wrap string to maxcol columns

    Example:
    >>> print hardwrap("This is a test of the emergency broadcasting system", 25)
    This is a test of the
    emergency broadcasting
    system.
    """
    import re
    pattern = re.compile(r'.*\s')
    def wrapline(s, pattern=pattern, maxcol=maxcol):
        lines = []
        start = 0
        while 1:
            if len(s) - start <= maxcol: break
            m = pattern.match(s[start:start + maxcol])
            if not m: break
            newline = m.group()
            lines.append(newline)
            start += len(newline)
        lines.append(s[start:])
        return "\n".join([s.rstrip() for s in lines)])
    return "\n".join(map(wrapline, s.split("\n")))

def openAnything(source):
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.
    
    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except IOError:
        pass
    
    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except IOError:
        pass
    
    # assume source is string, create stream
    import StringIO
    return StringIO.StringIO(source)

Run the kgp.py program by itself, and it will parse the default XML-based grammar in kant.xml and print several paragraphs' worth of philosophy in the style of Immanuel Kant.

Example 5.3. Sample output of kgp.py

     As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction.  The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception.  What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason.  As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge.  I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
    The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies.  We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics.  The thing in itself is a
representation of philosophy.  Applied logic is the clue to the
discovery of natural causes.  However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.

[...snip...]

Of course, this is gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose; Kant was not what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing that Kant might have agreed with), some of it is flagrantly false, and most of it is simply incoherent. But all of it is in the style of Kant.

Let me repeat that: this is much, much funnier if you are now or have ever been a philosophy major.

The interesting thing about this program is that there is nothing Kant-specific about it. All the content comes from the context-free grammar file, kant.xml. (We will look at a much simpler grammar in the next section.) All the kgp.py program does is read through the grammar and randomly decide which words to plug in where. Think of it as Mad Libs™ on autopilot.
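
If you want to drive the generator from your own code rather than from the command line, a minimal sketch looks like this, assuming kgp.py, toolbox.py, and the default grammar file kant.xml are all in the current directory (as they are in the book's downloadable examples):

from kgp import KantGenerator

# load the default grammar; with no source given, the generator picks a
# random top-level <ref> (one that no <xref> refers to) as its starting point
k = KantGenerator("kant.xml")
print k.output()

This is essentially what main() does when you run the script with no arguments, minus the command-line option handling.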