SilverCity --- Multilanguage lexical analysis package

SilverCity is a library that can provide lexical analysis for over 20 different programming languages. SilverCity is packaged as both a C++ library and as a Python extension. This documentation applies to the Python extension.

At this point I'd like to acknowledge that this documentation is incomplete. Writing isn't a hobby of mine, so if you need any help, just let me know at <>.

Table of Contents

1 Module Contents
2 LexerModule objects
3 WordList objects
4 PropertySet objects
5 Language Modules
6 Stuff that should be documented better

See Also:

Scintilla is the open-source source editing component upon which SilverCity is built
Python tokenize module
Python's built-in lexical scanner for Python source code.

1 Module Contents

find_lexer_module_by_id (id)
The find_lexer_module_by_id function returns a LexerModule object given an integer constant. These constants are defined in the ScintillaConstants module.

A ValueError is raised if the provided constant does not map to a LexerModule.

find_lexer_module_by_name (name)
The find_lexer_module_by_name function returns a LexerModule object given its name.

A ValueError is raised if no LexerModule has the given name.

class WordList ([keywords])
Create a new WordList instance. This class is used by the LexerModule class to determine which words should be lexed as keywords.

keywords should be a string containing keywords separated by spaces e.g. "and assert break class..."

WordList objects have no methods. They simply act as placeholders for language keywords.

class PropertySet ([properties])
Create a new PropertySet instance. This class is used by the LexerModule class to determine various lexer options. For example, the 'styling.within.preprocessor' property determines if the C lexer should use a single or multiple lexical states when parsing C preprocessor expressions.

properties should be a dictionary of lexer options.

2 LexerModule objects

The LexerModule class provides the following methods:

get_number_of_wordlists ()
Returns the number of WordLists that the lexer requires. This is the number of WordLists that must be passed to tokenize_by_style.

If the number of required WordLists cannot be determined, a ValueError is raised.

tokenize_by_style (source, keywords, propertyset[, func])
Lexes the provided source code into a list of tokens. Each token is a dictionary with the following keys:

Key Value
style The lexical style of the token e.g. 11
text The text of the token e.g. 'import'
start_index The index in source where the token begins e.g. 0
end_index The index in source where the token ends e.g. 5
start_column The column position (0-based) where the token begins e.g. 0
end_column The column position (0-based) where the token ends e.g. 5
start_line The line position (0-based) where the token begins e.g. 0
end_line The line position (0-based) where the token ends e.g. 0

source is a string containing the source code. keywords is a list of WordList instances. The number of WordLists that should be passed depends on the particular LexerModule. propertyset is a PropertySet instance. The relevant properties are dependent on the particular LexerModule.

If the optional func argument is used, it must be a callable object. It will be called, using keyword arguments, for each token found in the source. Since additional keys may be added in the future, it is recommended that unrecognized keyword arguments be collected, e.g.:

            import SilverCity
            from SilverCity import ScintillaConstants
            def func(style, text, start_column, start_line, **other_args): 
                if style == ScintillaConstants.SCE_P_WORD and text == 'import':
                    print 'Found an import statement at (%d, %d)' % (start_line + 1, start_column + 1)
            keywords = SilverCity.WordList(SilverCity.Keywords.python_keywords)
            properties = SilverCity.PropertySet()
            lexer = SilverCity.find_lexer_module_by_id(ScintillaConstants.SCLEX_PYTHON)
            lexer.tokenize_by_style(source_code, keywords, properties, func)
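When the optional func argument is omitted, tokenize_by_style returns the token list, which can be post-processed as ordinary dictionaries. Below is a minimal sketch using hypothetical sample tokens modeled on the dictionary keys documented above; the style numbers follow the Python lexer output shown later in this document.

```python
# Hypothetical sample tokens, modeled on the documented dictionary keys.
sample_tokens = [
    {'style': 5, 'text': 'import', 'start_index': 0, 'end_index': 5,
     'start_column': 0, 'end_column': 5, 'start_line': 0, 'end_line': 0},
    {'style': 0, 'text': ' ', 'start_index': 6, 'end_index': 6,
     'start_column': 6, 'end_column': 6, 'start_line': 0, 'end_line': 0},
    {'style': 11, 'text': 'test', 'start_index': 7, 'end_index': 10,
     'start_column': 7, 'end_column': 10, 'start_line': 0, 'end_line': 0},
]

def tokens_with_style(tokens, style):
    """Return the text of every token lexed with the given style."""
    return [t['text'] for t in tokens if t['style'] == style]

identifiers = tokens_with_style(sample_tokens, 11)
# identifiers is now ['test']
```

Because tokens are plain dictionaries, any filtering or reporting can be done with ordinary list operations rather than the callback.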

3 WordList objects

WordList objects have no methods. They simply act as placeholders for language keywords.

4 PropertySet objects

PropertySet objects have no methods. They act as dictionaries where the names of the properties are the keys. All keys must be strings; values are converted to strings upon assignment, i.e. retrieved values will always be strings. There is no mechanism to delete assigned keys.
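This assignment behaviour can be modelled in a few lines of plain Python. The class below is a hypothetical sketch for illustration only; it is not SilverCity's implementation.

```python
class PropertySetModel(object):
    """Illustrative model of PropertySet's assignment rules (hypothetical)."""

    def __init__(self, properties=None):
        self._props = {}
        for key, value in (properties or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError('property names must be strings')
        # Values are converted to strings on assignment.
        self._props[key] = str(value)

    def __getitem__(self, key):
        # Retrieved values are therefore always strings.
        return self._props[key]

props = PropertySetModel({'styling.within.preprocessor': 0})
# props['styling.within.preprocessor'] is now '0' (a string, not an int)
```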

Different properties apply to different languages. The following table is a complete list of properties, the language that they apply to, and their meanings:

Property Language Values
asp.default.language HTML Sets the default language for ASP scripts:
0 => None
1 => JavaScript
2 => VBScript
3 => Python
4 => PHP
5 => XML-based
styling.within.preprocessor C++ Determines if all preprocessor instructions should be lexed identically or if subexpressions should be given different lexical states:
0 => Same
1 => Different
tab.timmy.whinge.level Python The property value is a bitfield that selects which types of incorrect whitespace cause a token's lexical state to be incremented by 64:
0 => no checking
1 => check for correct indenting
2 => check for literal tabs
4 => check for literal spaces used as tabs
8 => check for mixed tabs and spaces

Example PropertySet usage:

    import SilverCity
    propset = SilverCity.PropertySet({'styling.within.preprocessor' : 0})
    propset['styling.within.preprocessor'] = 1 # changed my mind
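The tab.timmy.whinge.level flags can be combined by OR-ing them together, and a flagged token can be detected from its style number. The flag names below are hypothetical; the values and the +64 arithmetic come from the table above.

```python
# Flag values from the tab.timmy.whinge.level table (names are illustrative).
CHECK_INDENT = 1   # check for correct indenting
CHECK_TABS   = 2   # check for literal tabs
CHECK_SPACES = 4   # check for literal spaces used as tabs
CHECK_MIXED  = 8   # check for mixed tabs and spaces

# Enable several checks at once by OR-ing the flags together:
whinge_level = CHECK_INDENT | CHECK_TABS | CHECK_MIXED   # == 11

def has_whitespace_problem(style):
    """Offending tokens have their lexical state incremented by 64."""
    return style >= 64

def base_style(style):
    """Strip the +64 offset to recover the underlying lexical state."""
    return style - 64 if style >= 64 else style
```

The computed whinge_level would then be assigned as a property value, e.g. propset['tab.timmy.whinge.level'] = whinge_level.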

5 Language Modules

There are submodules available for many of the languages supported by SilverCity.

These submodules offer a slightly more convenient interface for tokenization and have HTML generation code. Here is an example using the Python submodule:

    >>> from SilverCity import Python
    >>> Python.PythonLexer().tokenize_by_style('import test')
    [{'style': 5, 'end_line': 0, 'end_column': 5, 'text': 'import', 'start_line': 0,
      'start_column': 0, 'start_index': 0, 'end_index': 5},
     {'style': 0, 'end_line': 0, 'end_column': 6, 'text': ' ', 'start_line': 0,
      'start_column': 6, 'start_index': 6, 'end_index': 6},
     {'style': 11, 'end_line': 0, 'end_column': 10, 'text': 'test', 'start_line': 0,
      'start_column': 7, 'start_index': 7, 'end_index': 10}]
    >>> import StringIO
    >>> test_file = StringIO.StringIO()
    >>> Python.PythonHTMLGenerator().generate_html(test_file, 'import test')
    >>> test_file.getvalue()
    '<span class="p_word">import</span><span class="p_default">&nbsp;</span><span cl

All of the language modules have the same interface. Here is another example for the C++ language:

    >>> from SilverCity import CPP
    >>> test_file = StringIO.StringIO()
    >>> CPP.CPPHTMLGenerator().generate_html(test_file, 'return 5+5')
    >>> test_file.getvalue()
    '<span class="c_word">return</span><span class="c_default">&nbsp;</span><span cl
    ass="c_number">5</span><span class="c_operator">+</span><span class="c_number">5

6 Stuff that should be documented better

The ScintillaConstants module contains a list of lexer identifiers (used by find_lexer_module_by_id) and lexical states for each LexerModule. You should take a look at this module to find the states that are useful for your programming language.

The Keywords module contains lists of keywords that can be used to create WordList objects.

There are also some modules that package tokenize_by_style into a class that offers a visitor pattern (think SAX). You don't have to worry about these modules if you don't want to. But, if you do, they are all written in Python so you can probably muddle through.
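As a rough illustration of the visitor idea (a hypothetical sketch, not SilverCity's actual helper classes), tokens can be dispatched to per-style handler methods:

```python
class TokenVisitor(object):
    """Dispatch each token to a visit_style_<n> method if one exists."""

    def visit(self, tokens):
        for token in tokens:
            handler = getattr(self, 'visit_style_%d' % token['style'],
                              self.visit_default)
            handler(token)

    def visit_default(self, token):
        pass

class KeywordCollector(TokenVisitor):
    def __init__(self):
        self.keywords = []

    def visit_style_5(self, token):
        # Style 5 is the keyword style in the Python lexer output shown above.
        self.keywords.append(token['text'])

collector = KeywordCollector()
collector.visit([{'style': 5, 'text': 'import'},
                 {'style': 11, 'text': 'test'}])
# collector.keywords is now ['import']
```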

Note that some lexers that are supported by Scintilla are not supported by SilverCity. This is because I am lazy. Any contributions are welcome (and should be pretty easy to make).