6.10. roman.py, stage 5

Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult than validating numeric input in toRoman, but we have a powerful tool at our disposal: regular expressions.

If you're not familiar with regular expressions and didn't read Regular expressions 101, now would be a good time.

As we saw at the beginning of this chapter, there are several simple rules for constructing a Roman numeral. The first is that the thousands place, if any, is represented by a series of M characters.

Example 6.18. Checking for thousands

>>> import re
>>> pattern = '^M?M?M?$'       
>>> re.search(pattern, 'M')    
<SRE_Match object at 0106FB58>
>>> re.search(pattern, 'MM')   
<SRE_Match object at 0106C290>
>>> re.search(pattern, 'MMM')  
<SRE_Match object at 0106AA38>
>>> re.search(pattern, 'MMMM') 
>>> re.search(pattern, '')     
<SRE_Match object at 0106F4A8>

	This pattern has three parts: `^` - match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the `M` characters were, which is not what we want. We want to make sure that the `M` characters, if they're there, are at the beginning of the string. `M?` - optionally match a single `M` character. Since this is repeated three times, we're matching anywhere from 0 to 3 `M` characters in a row. `$` - match what precedes only at the end of the string. When combined with the `^` character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the `M` characters.
	The essence of the `re` module is the `search` function, which takes a regular expression (`pattern`) and a string (`'M'`) to try to match against the regular expression. If a match is found, `search` returns an object which has various methods to describe the match; if no match is found, `search` returns `None`, the Python null value. We won't go into detail about the object that `search` returns (although it's very interesting), because all we care about at the moment is whether the pattern matches, which we can tell by just looking at the return value of `search`. `'M'` matches this regular expression, because the first optional `M` matches and the second and third optional `M` characters are ignored.
	`'MM'` matches because the first and second optional `M` characters match and the third `M` is ignored.
	`'MMM'` matches because all three `M` characters match.
	`'MMMM'` does not match. All three `M` characters match, but then the regular expression insists on the string ending (because of the `$` character), and the string doesn't end yet (because of the fourth `M`). So `search` returns `None`.
	Interestingly, an empty string also matches this regular expression, since all the `M` characters are optional. Keep this fact in the back of your mind; it will become more important in the next section.
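
To see what each anchor contributes, here is a minimal sketch that tries the thousands pattern with and without `^` and `$`. The `matches` helper is a hypothetical convenience, not part of `roman.py`; it only turns the return value of `re.search` into `True` or `False`. The snippet is written so that it should run unchanged on Python 2 and Python 3.

```python
import re

def matches(pattern, s):
    """Return True if re.search finds pattern in s (hypothetical helper)."""
    return re.search(pattern, s) is not None

# No anchors: search is free to match just the first three M's and stop.
print(matches('M?M?M?', 'MMMM'))    # True
# No anchors: every M is optional, so an empty match succeeds on any string.
print(matches('M?M?M?', 'XYZ'))     # True
# Start anchor only: the match must begin at the start, but may end early.
print(matches('^M?M?M?', 'MMMM'))   # True
# Both anchors: the whole string must be 0 to 3 M characters.
print(matches('^M?M?M?$', 'MMMM'))  # False
print(matches('^M?M?M?$', 'MMM'))   # True
print(matches('^M?M?M?$', ''))      # True
```

Only the fully anchored pattern rejects `'MMMM'`, which is why both `^` and `$` are needed.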

The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.

100 = C
200 = CC
300 = CCC
400 = CD
500 = D
600 = DC
700 = DCC
800 = DCCC
900 = CM

So there are four possible patterns:

CM
CD
0 to 3 C characters (0 if the hundreds place is 0)
D, followed by 0 to 3 C characters

The last two patterns can be combined:

an optional D, followed by 0 to 3 C characters

Example 6.19. Checking for hundreds

>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' 
>>> re.search(pattern, 'MCM')            
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')             
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')         
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')           
>>> re.search(pattern, '')               
<SRE_Match object at 01071D98>

	This pattern starts out the same as our previous one, checking for the beginning of the string (`^`), then the thousands place (`M?M?M?`). Then we have the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: `CM`, `CD`, and `D?C?C?C?` (which is an optional `D` followed by 0 to 3 optional `C` characters). The regular expression parser tries each of these patterns in order (from left to right) and uses the first one that lets the rest of the pattern match; if none of them does, the whole match fails.
	`'MCM'` matches because the first `M` matches, the second and third `M` characters are ignored, and the `CM` matches (so the `CD` and `D?C?C?C?` patterns are never even considered). `MCM` is the Roman numeral representation of `1900`.
	`'MD'` matches because the first `M` matches, the second and third `M` characters are ignored, and the `D?C?C?C?` pattern matches `D` (each of the 3 `C` characters is optional and is ignored). `MD` is the Roman numeral representation of `1500`.
	`'MMMCCC'` matches because all 3 `M` characters match, and the `D?C?C?C?` pattern matches `CCC` (the `D` is optional and is ignored). `MMMCCC` is the Roman numeral representation of `3300`.
	`'MCMC'` does not match. The first `M` matches, the second and third `M` characters are ignored, and the `CM` matches, but then the `$` does not match because we're not at the end of the string yet (we still have an unmatched `C` character). The `C` does not match as part of the `D?C?C?C?` pattern, because the mutually exclusive `CM` pattern has already matched.
	Interestingly, an empty string still matches this pattern, because all the `M` characters are optional and ignored, and the empty string matches the `D?C?C?C?` pattern where all the characters are optional and ignored.
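
As a quick sanity check on the hundreds logic, the following sketch tests the parenthesized part on its own against every legal hundreds value from the table above, plus a few illegal combinations. The name `hundreds_pattern` and the sample lists are illustrative assumptions, not code from `roman.py`.

```python
import re

# The hundreds alternation by itself, anchored at both ends.
hundreds_pattern = '^(CM|CD|D?C?C?C?)$'

# 0 (empty string) through 900, as listed in the table above.
valid = ['', 'C', 'CC', 'CCC', 'CD', 'D', 'DC', 'DCC', 'DCCC', 'CM']
# Combinations that should not be accepted as a hundreds place.
invalid = ['CCCC', 'DD', 'CMC', 'CDC', 'DCD', 'DCM']

accepted = [s for s in valid if re.search(hundreds_pattern, s)]
rejected = [s for s in invalid if not re.search(hundreds_pattern, s)]

print(accepted == valid)    # True: every legal value matches
print(rejected == invalid)  # True: every illegal sample is refused
```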

Whew! See how quickly regular expressions can get nasty? And we've only covered the thousands and hundreds places. (Later in this chapter, we'll see a slightly different syntax for writing regular expressions that, while just as complicated, at least allows some in-line documentation of the different sections of the expression.) Luckily, if you followed all that, the tens and ones places are easy, because they're exactly the same pattern.

Example 6.20. roman5.py

If you have not already done so, you can download this and other examples used in this book.

"""Convert to and from Roman numerals"""
import re

#Define exceptions
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

#Define digit mapping
romanNumeralMap = (('M',  1000),
                   ('CM', 900),
                   ('D',  500),
                   ('CD', 400),
                   ('C',  100),
                   ('XC', 90),
                   ('L',  50),
                   ('XL', 40),
                   ('X',  10),
                   ('IX', 9),
                   ('V',  5),
                   ('IV', 4),
                   ('I',  1))

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("decimals can not be converted")

    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' 

def fromRoman(s):
    """convert Roman numeral to integer"""
    if not re.search(romanNumeralPattern, s):
        raise InvalidRomanNumeralError('Invalid Roman numeral: %s' % s)

    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index+len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

	This is just a continuation of the pattern we saw that handled the thousands and hundreds places. The tens place is either `XC` (`90`), `XL` (`40`), or an optional `L` followed by 0 to 3 optional `X` characters. The ones place is either `IX` (`9`), `IV` (`4`), or an optional `V` followed by 0 to 3 optional `I` characters.
	Having encoded all that logic into our regular expression, the code to check for invalid Roman numerals becomes trivial. If `re.search` returns an object, then the regular expression matched and our input is valid; otherwise, our input is invalid.
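
Because the hundreds, tens, and ones places share one shape, the full pattern can also be assembled from a small helper. This is only a sketch to make the repetition visible; `place_pattern` is a hypothetical function, and `roman5.py` itself writes the pattern out as a single literal string.

```python
import re

def place_pattern(one, five, ten):
    """Build the alternation for one decimal place (hypothetical helper).

    For example, ('I', 'V', 'X') gives '(IX|IV|V?I?I?I?)'.
    """
    return '(%s%s|%s%s|%s?%s?%s?%s?)' % (one, ten, one, five,
                                         five, one, one, one)

pattern = ('^M?M?M?'                      # thousands: 0 to 3 M characters
           + place_pattern('C', 'D', 'M')  # hundreds
           + place_pattern('X', 'L', 'C')  # tens
           + place_pattern('I', 'V', 'X')  # ones
           + '$')

# The assembled string is identical to romanNumeralPattern in roman5.py.
print(pattern)
print(re.search(pattern, 'MCMXCIX') is not None)  # True (1999)
print(re.search(pattern, 'MCMC') is not None)     # False
```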

At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of invalid Roman numerals. But don't take my word for it; look at the results:

Example 6.21. Output of romantest5.py against roman5.py


fromRoman should only accept uppercase input ... ok          
toRoman should always return uppercase ... ok
fromRoman should fail with malformed antecedents ... ok      
fromRoman should fail with repeated pairs of numerals ... ok 
fromRoman should fail with too many repeated numerals ... ok
fromRoman should give known result with known input ... ok
toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok
toRoman should fail with 0 input ... ok

----------------------------------------------------------------------
Ran 12 tests in 2.864s

OK

	One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since our regular expression `romanNumeralPattern` was expressed in uppercase characters, our `re.search` check will reject any input that isn't completely uppercase. So our uppercase input test passes.
	More importantly, our bad input tests pass. For instance, the malformed antecedents test checks cases like `MCMC`. As we've seen, this does not match our regular expression, so `fromRoman` raises an `InvalidRomanNumeralError` exception, which is what the malformed antecedents test case is looking for, so the test passes.
	In fact, all the bad input tests pass. This regular expression catches everything we could think of when we made our test cases.
	And the anticlimax award of the year goes to the word “`OK`”, which is printed by the `unittest` module when all the tests pass.
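
For a feel of what the bad input tests exercise, here is a rough sketch that runs the pattern against a few samples from each category. The `is_valid_roman` helper and the sample strings are illustrative assumptions and are not necessarily the data used by `romantest5.py`.

```python
import re

# Pattern copied from roman5.py.
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

def is_valid_roman(s):
    """Return True if s matches the pattern (hypothetical helper)."""
    return re.search(romanNumeralPattern, s) is not None

# Illustrative samples only, grouped by the kind of error they contain.
bad_inputs = {
    'too many repeated numerals': ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'),
    'repeated pairs': ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'),
    'malformed antecedents': ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV', 'MCMC',
                              'XCX', 'IVI', 'LM', 'LD', 'LC'),
    'lowercase': ('mcm', 'Mcm', 'iv'),
}

# For each category, record whether every sample was rejected.
rejected = {}
for category, samples in bad_inputs.items():
    rejected[category] = all(not is_valid_roman(s) for s in samples)

print(all(rejected.values()))     # True: every sample is rejected
print(is_valid_roman('MCMXCIX'))  # True
print(is_valid_roman(''))         # True: the empty string still matches
```

The last line is a reminder that the empty string still matches the full pattern, since every character in it is optional.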


	When all your tests pass, stop coding.