Porting to Python 3 - The Book Site

Extending 2to3 with your own fixers

The 2to3 command is a wrapper around a standard library package called lib2to3. It contains a code parser, a framework for setting up fixers that modify the parse tree and a large set of fixers. The fixers that are included in lib2to3 are enough to do all of the conversion you need for any normal porting. There are cases, however, where things are slightly beyond normal and you may have to write your own fixers. I first want to reassure you that these cases are very rare: you are unlikely to ever need this chapter, and you can skip it without feeling bad.

When fixers are necessary

It is strongly recommended that you don’t change the API when you port your module or package to Python 3, but sometimes you have to. For example, the Zope Component Architecture, a collection of packages to help you componentize your system, had to change its API. With the ZCA[1] you define interfaces that define the behavior of components and then make components that implement these interfaces. A simple example looks like this:

>>> from zope.interface import Interface, implements
>>>
>>> class IMyInterface(Interface):
...     def amethod():
...         '''This is just an example'''
...
>>> class MyClass(object):
...
...     implements(IMyInterface)
...
...     def amethod(self):
...         return True

The important line here is the implements(IMyInterface) line. It relies on the way metaclasses are done in Python 2, by using the __metaclass__ attribute. However, in Python 3 the __metaclass__ attribute has no special meaning, so this technique no longer works. Thankfully Python 2.6 introduced class decorators, a new and better technique for doing similar things. This is supported in the latest versions of zope.interface, with another syntax:

>>> from zope.interface import Interface, implementer
>>>
>>> class IMyInterface(Interface):
...     def amethod():
...         '''This is just an example'''
...
>>> @implementer(IMyInterface)
... class MyClass(object):
...
...     def amethod(self):
...         return True

Since the first syntax no longer works in Python 3, zope.interface needs a custom fixer to change this syntax, and the package zope.fixers contains just such a fixer. It is these types of advanced techniques that (ab)use Python internals that may force you to change the API of your package. If you change the API, you should write a fixer to make that change automatic, or you will cause a lot of pain for the users of your package.

So writing a fixer is a very unusual task. However, if you should need to write a fixer, you will need all the help you can get, because it can be extremely confusing. So I have put down my experiences from writing zope.fixers, to try to remove some of the confusion and lead you to the right path from the start.

The Parse Tree

The lib2to3 package contains support for parsing code into a parse tree. This may seem superfluous, as Python already has two modules for that, namely parser and ast. However, the parser module uses Python’s internal code parser, which is optimized to generate byte code and is too low level for porting, while the ast module is designed to generate an abstract syntax tree and ignores all comments and formatting.

The parsing module of lib2to3 is both high level and preserves all formatting, but that doesn’t mean it’s easy to use. It can be highly confusing and the objects generated by parsed code may not be what you would expect at first glance. In general, the best hint I can give you when making fixers is to debug and step through the fixing process, looking closely at the parse tree until you start getting a feeling for how it works, and then start manipulating it to see exactly what effects that has on the output. Having many unit tests is crucial to make sure all the edge cases work.

The parse tree is made up of two types of objects: Node and Leaf. Node objects are containers that contain a series of objects, both Node and Leaf, while Leaf objects have no sub-objects and contain the actual code.
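As a rough illustration of the two object types, the sketch below parses a one-line program with the same driver that lib2to3 uses internally and prints the resulting hierarchy. The describe() helper and the variable names are my own, and the sketch assumes an interpreter that still ships lib2to3 (the package was removed from the standard library in Python 3.13).

```python
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver, token

# Build a parser that produces lib2to3's Node/Leaf tree.
parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
source = "total = price * 2\n"
tree = parser.parse_string(source)


def describe(node, depth=0):
    """Return one text line per object in the tree (hypothetical helper).

    Node names come from the grammar symbols, Leaf names from the
    token table.
    """
    indent = "  " * depth
    if isinstance(node, pytree.Leaf):
        return ["%s%s %r" % (indent, token.tok_name[node.type], node.value)]
    lines = [indent + str(pytree.type_repr(node.type))]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(tree)))

# Converting the tree back to a string reproduces the input exactly.
assert str(tree) == source
```

Printing the hierarchy like this for a few small snippets is a quick way to get a feeling for how code is organized before you start modifying it.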

Leaf objects have a type that tells you what they contain. Examples are INDENT, which means the indentation increased; STRING, which is used for all strings, including docstrings; NUMBER for any kind of number (integers, floats, hexadecimal, etc.); RPAR and LPAR for parentheses; NAME for any keyword or variable name; and so on. The resulting parse tree does not contain much information about how the code relates to the Python language. It will not tell you if a NAME is a keyword or a variable, nor if a NUMBER is an integer or a floating point value. However, the parser itself cares very much about Python grammar and will in general raise an error if it is fed invalid Python code.

One of the bigger surprises is that Leaf objects have a prefix and a suffix. These contain anything that isn’t strictly code, including comments and white space. So even though there is a token type for comments, I haven’t seen it in actual use by the parser. Indentation and dedentation are separate Leaf objects, but these only tell you that the indentation changed, not by how much. Not that you usually need to know: the structure of the code is held by the hierarchy of Node objects. If you do want to find out the indentation, you will have to look at the prefix of the nodes. The suffix of a node is the same as the prefix of the next node and can be ignored.
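To make the role of the prefix concrete, here is a small sketch that parses two lines containing comments and odd spacing and then lists the prefix and value of every Leaf. The source string and variable names are only illustrative.

```python
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver

parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
source = "# leading comment\nvalue   =   1  # trailing comment\n"
tree = parser.parse_string(source)

leaves = list(tree.leaves())
for leaf in leaves:
    print(repr(leaf.prefix), repr(leaf.value))

# No leaf carries a comment as its value; comments live in prefixes.
assert not any("#" in leaf.value for leaf in leaves)
assert "# leading comment" in leaves[0].prefix

# Prefix plus value of every leaf, in order, gives back the source.
assert "".join(leaf.prefix + leaf.value for leaf in leaves) == source
```

The last assertion is the reason you must carry the prefix over whenever you replace a leaf: whatever was stored there would otherwise vanish from the output.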

Creating a fixer

To simplify the task of making a fixer there is a BaseFix class you can use. If you subclass from BaseFix you only need to override two methods, match() and transform(). match() should return a result that evaluates to false if the fixer doesn’t care about the node and it should return a value that is not false when the node should be transformed by the fixer.

If match() returns a non-false result, 2to3 will then call the transform() method. It takes two values, the first one being the node and the second one being whatever match() returned. In the simplest case you can have match() return just True or False and in that case the second parameter sent to transform() will always be True. However, the parameter can be useful for more complex behavior. You can for example let the match() method return a list of sub-nodes to be transformed.

By default all nodes will be sent to match(). To speed up the fixer the refactoring methods will look at a fixer attribute called _accept_type, and only check the node for matching if it is of the same type. _accept_type defaults to None, meaning that it accepts all types. The types you can accept are listed in lib2to3.pgen2.token.

A fixer should have an order attribute that should be set to "pre" or "post". This attribute decides in which order you get the nodes: with "pre" the fixer receives a containing node before the leaves inside it (pre-order traversal), and with "post" it receives the leaves before their containing node (post-order traversal). The examples in this chapter are all based on BaseFix, which defaults to "post".
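The sketch below walks a tiny tree with the pre_order() and post_order() generators of the tree objects, to show what each traversal order means in practice. It is only a demonstration on a one-line program, with variable names of my own choosing.

```python
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver

parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
tree = parser.parse_string("x = 1\n")

pre = list(tree.pre_order())
post = list(tree.post_order())

# Pre-order: a container is yielded before anything inside it,
# so the root of the tree comes first.
assert pre[0] is tree

# Post-order: everything inside a container is yielded before the
# container itself, so the walk starts at a leaf and ends at the root.
assert isinstance(post[0], pytree.Leaf)
assert post[-1] is tree

# Both walks visit exactly the same objects.
assert len(pre) == len(post)
```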

You should follow a certain naming convention when making fixers. If you want to call your fixer “something”, the fixer module should be called fix_something and the fixer class should be called FixSomething. If you don’t follow that convention, 2to3 may not be able to find your fixer.
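The convention can be written down as a few lines of code. Note that fixer_names() is a hypothetical helper, not part of lib2to3. It only mirrors the convention, including my understanding that underscore-separated words in the fixer name are each capitalized in the class name.

```python
def fixer_names(name):
    """Return (module_name, class_name) for a fixer called `name`.

    Hypothetical helper that spells out the naming convention.
    """
    module_name = "fix_" + name
    class_name = "Fix" + "".join(part.title() for part in name.split("_"))
    return module_name, class_name


assert fixer_names("something") == ("fix_something", "FixSomething")
assert fixer_names("my_rename") == ("fix_my_rename", "FixMyRename")
```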

Modifying the Parse Tree

The purpose of the fixer is for you to modify the parse tree so it generates code compatible with Python 3. In simple cases, this is easier than it sounds, while in complex cases it can be trickier than expected. One of the main problems with modifying the parse tree directly is that if you replace some part of the parse tree, the replacement not only has to generate the correct output on its own, it also has to be organized correctly. Otherwise the replacement can fail and you will not get the correct output when rendering the complete tree. Although the parse tree looks fairly straightforward at first glance, it can be quite convoluted. To help you generate parse trees that will generate valid code there are several helper functions in lib2to3.fixer_util. They range from trivial ones such as Dot(), which just returns a Leaf that generates a dot, to ListComp(), which will help you generate a list comprehension. Another way is to look at what the parser generates when fed the correct code and replicate that.

A minimal example is a fixer that changes any mention of oldname to newname. This fixer does require the name to be reasonably unique, as it will change any reference to oldname even if it is not the one imported in the beginning of the fixed code.

from lib2to3.fixer_base import BaseFix
from lib2to3.pgen2 import token

class FixName1(BaseFix):
    
    _accept_type = token.NAME

    def match(self, node):
        if node.value == 'oldname':
            return True
        return False
    
    def transform(self, node, results):
        node.value = 'newname'
        node.changed()

Here we see that we only accept NAME nodes, which is the node for almost any bit of text that refers to an object, function, class etc. Only NAME nodes get passed to the match() method, and there we check if the value is oldname, in which case True is returned and the node is passed to the transform() method.
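If you want to try such a fixer without setting up a package, one possibility is to hand it to lib2to3's RefactoringTool directly. The sketch below repeats a compact version of the renaming fixer so it can run on its own. Registering a throwaway module in sys.modules is purely a shortcut for this example, not something a real fixer package should do, and the sketch assumes an interpreter that still includes lib2to3.

```python
import sys
import types

from lib2to3.fixer_base import BaseFix
from lib2to3.pgen2 import token
from lib2to3.refactor import RefactoringTool


class FixName1(BaseFix):
    """Compact version of the renaming fixer."""

    _accept_type = token.NAME

    def match(self, node):
        return node.value == 'oldname'

    def transform(self, node, results):
        node.value = 'newname'
        node.changed()


# Assumption for this sketch only: expose the class through a throwaway
# module named fix_name1 so the tool can look it up by module name.
module = types.ModuleType('fix_name1')
module.FixName1 = FixName1
sys.modules['fix_name1'] = module

tool = RefactoringTool(['fix_name1'])
source = "from foo import oldname\nresult = oldname(1) + other.oldname\n"
tree = tool.refactor_string(source, '<example>')
print(str(tree))
```

Note that the attribute access other.oldname is renamed as well, which shows why the name needs to be reasonably unique for this fixer to be safe.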

As a more complex example I have a fixer that changes the indentation to 4 spaces. This is a fairly simple use case, but as you can see it’s not entirely trivial to implement. Although it is basically just a matter of keeping track of the indentation level and replacing any new line with the current level of indentation there are still several special cases. The indentation change is also done on the prefix value of the node and this may contain several lines, but only the last line is the actual indentation.

from lib2to3.fixer_base import BaseFix
from lib2to3.fixer_util import Leaf
from lib2to3.pgen2 import token

class FixIndent(BaseFix):
    
    indents = []
    line = 0
    
    def match(self, node):
        if isinstance(node, Leaf):
            return True
        return False

    def transform(self, node, results):
        if node.type == token.INDENT:
            self.line = node.lineno
            # Tabs count like 8 spaces.
            indent = len(node.value.replace('\t', ' ' * 8))
            self.indents.append(indent)
            # Replace this indentation with 4 spaces per level:
            new_indent = ' ' * 4 * len(self.indents)
            if node.value != new_indent:
                node.value = new_indent
                # Return the modified node:
                return node
        elif node.type == token.DEDENT:
            self.line = node.lineno
            if node.column == 0:
                # Complete outdent, reset:
                self.indents = []
            else:
                # Partial outdent, we find the indentation
                # level and drop higher indents.
                level = self.indents.index(node.column)
                self.indents = self.indents[:level+1]
                if node.prefix:
                    # During INDENT's the indentation level is
                    # in the value. However, during DEDENT's
                    # the value is an empty string and the
                    # indentation level is instead in the last
                    # line of the prefix. So we remove the last
                    # line of the prefix and add the correct
                    # indentation as a new last line.
                    prefix_lines = node.prefix.split('\n')[:-1]
                    prefix_lines.append(' ' * 4 * 
                                        len(self.indents))
                    new_prefix = '\n'.join(prefix_lines)
                    if node.prefix != new_prefix:
                        node.prefix = new_prefix
                        # Return the modified node:
                        return node
        elif self.line != node.lineno:
            self.line = node.lineno
            # New line!
            if not self.indents:
                # First line. Do nothing:
                return None
            else:
                # Continues the same indentation
                if node.prefix:
                    # This line's indentation is the last line
                    # of the prefix, as during DEDENTS. Remove
                    # the old indentation and add the correct
                    # indentation as a new last line.
                    prefix_lines = node.prefix.split('\n')[:-1]
                    prefix_lines.append(' ' * 4 * 
                                        len(self.indents))
                    new_prefix = '\n'.join(prefix_lines)
                    if node.prefix != new_prefix:
                        node.prefix = new_prefix
                        # Return the modified node:
                        return node                    

        # Nothing was modified: Return None
        return None

This fixer is not really useful in practice and is only an example. This is partly because some things are hard to automate. For example it will not indent multi-line string constants, because that would change the formatting of the string constant. However, docstrings probably should be re-indented, but the parser doesn’t separate docstrings from other strings. That’s a language feature and the 2to3 parser only parses the syntax, so I would have to add code to figure out if a string is a docstring or not.

Also it doesn’t change the indentation of comments, because they are a part of the prefix. It would be possible to go through the prefix, look for comments and re-indent them too, but we would then have to assume that the comments should have the same indentation as the following code, which is not always true.
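The prefix rewriting that FixIndent performs can be tried in isolation. The function below is a sketch that pulls the same three steps (drop the last line of the prefix, append the new indentation, join again) into a helper. The name reindent_prefix and its width parameter are my own and do not exist in lib2to3.

```python
def reindent_prefix(prefix, level, width=4):
    """Replace the last line of a prefix with `level` indentation steps.

    Hypothetical helper; earlier lines of the prefix, such as comments
    and blank lines, are left exactly as they were.
    """
    lines = prefix.split('\n')[:-1]
    lines.append(' ' * width * level)
    return '\n'.join(lines)


# A prefix holding a blank line, a comment indented by two spaces and
# finally the two-space indentation of the code that follows.
old_prefix = "\n  # a comment about the next line\n  "
new_prefix = reindent_prefix(old_prefix, 1)

# Only the last line changes; the comment keeps its old indentation.
assert new_prefix == "\n  # a comment about the next line\n    "
```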

Finding the nodes with Patterns

In the above cases finding the nodes in the match() method is relatively simple, but in most cases you are looking for something more specific. The renaming fixer above will for example rename all cases of oldname, even if it is a method on an object and not the imported function at all. Writing matching code that will find exactly what you want can be quite complex, so to help you lib2to3 has a module that does grammatical pattern matching on the parse tree. As a minimal example we can take a pattern-based version of the fixer that renamed oldname to newname.

You’ll note that here I don’t replace the value of the node, but make a new node and replace the old one. This is only to show both techniques; there is no functional difference.

from lib2to3.fixer_base import BaseFix
from lib2to3.fixer_util import Name

class FixName2(BaseFix):
    
    PATTERN = "fixnode='oldname'"

    def transform(self, node, results):
        fixnode = results['fixnode']
        fixnode.replace(Name('newname', prefix=fixnode.prefix))

When we set the PATTERN attribute of our fixer class to the above pattern, BaseFix will compile this pattern into a matcher and use it for matching. You don’t need to override match() anymore and BaseFix will also set _accept_type correctly, so this simplifies making fixers in many cases.

The difficult part of using pattern fixers is to find what pattern to use. This is usually far from obvious, so the best way is to feed example code to the parser and then convert the resulting tree to a pattern. This is not trivial, but thankfully Collin Winter has written a script called find_pattern.py[2] that does this. This makes finding the correct pattern a lot easier and really helps to simplify the making of fixers.
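You can also compile a pattern yourself with PatternCompiler, which is handy for experimenting with patterns outside of a fixer. The following sketch compiles the pattern used above and tries it against every object in a small tree. The example source and variable names are only for illustration.

```python
from lib2to3 import pygram, pytree
from lib2to3.patcomp import PatternCompiler
from lib2to3.pgen2 import driver

pattern = PatternCompiler().compile_pattern("fixnode='oldname'")

parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
tree = parser.parse_string("x = oldname + other\n")

hits = []
for node in tree.pre_order():
    results = {}
    # match() fills in the results dictionary with the named parts.
    if pattern.match(node, results):
        hits.append(results['fixnode'])

for leaf in hits:
    print(repr(leaf.prefix), repr(leaf.value))
```

Only the oldname leaf ends up in hits. The names x and other are the same kind of leaf but have different values, so the pattern ignores them.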

To get an example that is closer to real world cases, let us change the API of a module, so that what previously was a constant now is a function call. We want to change this:

from foo import CONSTANT

def afunction(alist):
    return [x * CONSTANT for x in alist]

into this:

from foo import get_constant

def afunction(alist):
    return [x * get_constant() for x in alist]

In this case changing every instance of CONSTANT into get_constant will not work as that would also change the import of the name to a function call which would be a syntax error. We need to treat the import and the usage separately. We’ll use find_pattern.py to look for patterns to use.

The user interface of find_pattern.py is not the most verbose, but it is easy enough to use once you know it. If we run:

$ find_pattern.py -f example.py

it will parse that file and print out the various nodes it finds. You press enter for each code snippet you don’t want and you press y for the code snippet you do want. It will then print out a pattern that matches that code snippet. You can also type in a code snippet as an argument, but that becomes fiddly for multi-line snippets.

If we look at the first line of our example, its pattern is:

import_from< 'from' 'foo' 'import' 'CONSTANT' >

Although this will be enough to match the import line we would then have to find the CONSTANT node by looking through the tree that matches. What we want is for the transformer to get a special handle on the CONSTANT part so we can replace it with get_constant easily. We can do that by assigning a name to it. The finished pattern then becomes:

import_from< 'from' 'foo' 'import' importname='CONSTANT' >

The transform() method will now get a dictionary as the results parameter. That dictionary will have the key 'node', which contains the node that matches all of the pattern, and it will also contain the key 'importname', which contains just the CONSTANT node.
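To see what such a results dictionary looks like, the sketch below compiles the pattern by hand and matches it against the import line. The first_match() helper is hypothetical, and seeding the dictionary with the 'node' key imitates what I understand BaseFix to do before it calls the compiled pattern.

```python
from lib2to3 import pygram, pytree
from lib2to3.patcomp import PatternCompiler
from lib2to3.pgen2 import driver

PATTERN = "import_from< 'from' 'foo' 'import' importname='CONSTANT' >"
pattern = PatternCompiler().compile_pattern(PATTERN)

parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
tree = parser.parse_string("from foo import CONSTANT\n")


def first_match(tree):
    """Return the results for the first matching node, or None."""
    for node in tree.pre_order():
        # Seed the dictionary with the 'node' key (assumed BaseFix behavior).
        results = {'node': node}
        if pattern.match(node, results):
            return results
    return None


results = first_match(tree)
print(sorted(results))
print(pytree.type_repr(results['node'].type))
print(repr(results['importname'].value))
```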

We also need to match the usage and here we match 'CONSTANT' and assign it to a name, like in the renaming example above. To include both patterns in the same fixer we separate them with a | character:

import_from< 'from' 'foo' 'import' importname='CONSTANT'>
|
constant='CONSTANT'

We then need to replace the importname value with get_constant and replace the constant node with a call. We construct that call from the helper classes Call and Name. When you replace a node you need to make sure to preserve the prefix, or both white-space and comments may disappear:

node.replace(Call(Name(node.value), prefix=node.prefix))
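The effect of keeping the prefix is easy to check on a small tree. This sketch performs the replacement by hand, outside of a fixer, on a line that has both spacing and a trailing comment. The example source is made up for the demonstration.

```python
from lib2to3 import pygram, pytree
from lib2to3.fixer_util import Call, Name
from lib2to3.pgen2 import driver

parser = driver.Driver(pygram.python_grammar, convert=pytree.convert)
tree = parser.parse_string("total = x * CONSTANT  # scaled\n")

# Collect first, then replace, so the tree is not modified while
# we are still iterating over its leaves.
targets = [leaf for leaf in tree.leaves() if leaf.value == 'CONSTANT']
for leaf in targets:
    # The old leaf's prefix (a single space here) moves to the new call.
    leaf.replace(Call(Name('get_constant'), prefix=leaf.prefix))

result = str(tree)
print(result)
```

Without the prefix argument the space in front of the call would be lost and the line would render as x *get_constant().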

This example is still too simple. The patterns above will only fix the import when it is imported with from foo import CONSTANT. You can also import foo, and you can rename either foo or CONSTANT with an import as. You also don’t want to change every usage of CONSTANT in the file, as you may have another module that also has something called CONSTANT, and you don’t want to change that.

As you see, the principles are quite simple, while in practice it can become complex very quickly. A complete fixer that makes a function out of the constant would therefore look like this:

from lib2to3.fixer_base import BaseFix
from lib2to3.fixer_util import Call, Name, is_probably_builtin
from lib2to3.patcomp import PatternCompiler

class FixConstant(BaseFix):
        
    PATTERN = """
        import_name< 'import' modulename='foo' >
        |
        import_name< 'import' dotted_as_name< 'foo' 'as'
           modulename=any > >
        |
        import_from< 'from' 'foo' 'import'
           importname='CONSTANT' >
        |
        import_from< 'from' 'foo' 'import' import_as_name<
           importname='CONSTANT' 'as' constantname=any > >
        |
        any
        """

    def start_tree(self, tree, filename):
        super(FixConstant, self).start_tree(tree, filename)
        # Reset the patterns attribute for every file:
        self.usage_patterns = []
        
    def match(self, node):
        # Match the import patterns:
        results = {"node": node}
        match = self.pattern.match(node, results)
        
        if match and 'constantname' in results:
            # This is a "from import as"
            constantname = results['constantname'].value
            # Add a pattern to fix the usage of the constant
            # under this name:
            self.usage_patterns.append(
                PatternCompiler().compile_pattern(
                    "constant='%s'"%constantname))
            return results
        
        if match and 'importname' in results:
            # This is a "from import" without "as".
            # Add a pattern to fix the usage of the constant
            # under its standard name:
            self.usage_patterns.append(
                PatternCompiler().compile_pattern(
                    "constant='CONSTANT'"))
            return results
        
        if match and 'modulename' in results:
            # This is an "import" or an "import as"
            modulename = results['modulename'].value
            # Add a pattern to fix the usage as an attribute:
            self.usage_patterns.append(
                PatternCompiler().compile_pattern(
                "power< '%s' trailer< '.' " \
                "attribute='CONSTANT' > >" % modulename))
            return results
        
        # Now do the usage patterns
        for pattern in self.usage_patterns:
            if pattern.match(node, results):
                return results
    
    def transform(self, node, results):
        if 'importname' in results:
            # Change the import from CONSTANT to get_constant:
            node = results['importname']
            node.value = 'get_constant'
            node.changed()
            
        if 'constant' in results or 'attribute' in results:
            if 'attribute' in results:
                # Here it's used as an attribute.
                node = results['attribute']
            else:
                # Here it's used standalone.
                node = results['constant']
                # Assert that it really is standalone and not
                # an attribute of something else, or an
                # assignment etc:
                if not is_probably_builtin(node):
                    return None
                
            # Now we replace the earlier constant name with the
            # new function call. If it was renamed on import
            # from 'CONSTANT' we keep the renaming else we
            # replace it with the new 'get_constant' name:
            name = node.value
            if name == 'CONSTANT':
                name = 'get_constant'
            node.replace(Call(Name(name), prefix=node.prefix))

The trick here is in the match function. We have a PATTERN attribute that will match all imports, but it also contains the pattern any to make sure we get to handle all nodes. This makes it slower, but is necessary in this case. The alternative would be to have separate fixers for each of the four import cases, which may very well be a better solution in your case.

In general, any real world fixer you need to write will be very complex. If the fix you need to do is simple, you are certainly better off making sure your Python 3 module and Python 2 module are compatible. However, I hope the examples provided here will be helpful. The fixers in lib2to3 are also good examples, even though they unfortunately are not very well documented.

Footnotes

[1] http://www.muthukadan.net/docs/zca.html
[2] http://svn.python.org/projects/sandbox/trunk/2to3/scripts/find_pattern.py