SilverCity is a library that can provide lexical analysis for over 20 different programming languages. SilverCity is packaged as both a C++ library and as a Python extension. This documentation applies to the Python extension.
At this point I'd like to acknowledge that this documentation is incomplete. Writing isn't a hobby of mine, so if you need any help, just let me know at <brian@sweetapp.com>.
A ValueError is raised if the provided constant does not map to a LexerModule.
A ValueError is raised if no LexerModule has the given name.
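For example, a lookup by lexer identifier can be guarded like this (a minimal sketch; SCLEX_PYTHON comes from the ScintillaConstants module described later in this document):

```python
import SilverCity
from SilverCity import ScintillaConstants

try:
    # Look up the Python lexer by its Scintilla lexer identifier
    lexer = SilverCity.find_lexer_module_by_id(ScintillaConstants.SCLEX_PYTHON)
except ValueError:
    # The constant did not map to a LexerModule
    lexer = None
```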
keywords should be a string containing keywords separated by spaces, e.g. "and assert break class...".
properties should be a dictionary of lexer options.
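Putting the two constructors together (a small sketch; python_keywords comes from the Keywords module described later, and tab.timmy.whinge.level is one of the lexer options listed below):

```python
import SilverCity

# A WordList built from the bundled Python keyword string
keywords = SilverCity.WordList(SilverCity.Keywords.python_keywords)

# A PropertySet initialized from a dictionary of lexer options
properties = SilverCity.PropertySet({'tab.timmy.whinge.level': 1})
```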
The LexerModule class provides a single method:
If the number of required WordLists cannot be determined, a ValueError is raised.
| Key | Value |
|---|---|
| style | The lexical style of the token, e.g. 11 |
| text | The text of the token, e.g. 'import' |
| start_index | The index in source where the token begins, e.g. 0 |
| end_index | The index in source where the token ends, e.g. 5 |
| start_column | The column position (0-based) where the token begins, e.g. 0 |
| end_column | The column position (0-based) where the token ends, e.g. 5 |
| start_line | The line position (0-based) where the token begins, e.g. 0 |
| end_line | The line position (0-based) where the token ends, e.g. 0 |
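Judging from the sample output later in this document, start_index and end_index are inclusive, so a token's text can be recovered from source with a slice like this (an illustrative sketch, not part of the SilverCity API):

```python
def token_text(source, token):
    # token is one of the keyword-argument dictionaries described above;
    # end_index is treated as inclusive
    return source[token['start_index']:token['end_index'] + 1]
```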
source is a string containing the source code. keywords is a list of WordList instances; the number of WordLists that should be passed depends on the particular LexerModule. propertyset is a PropertySet instance; the relevant properties are dependent on the particular LexerModule.
If the optional func argument is given, it must be a callable object. It will be called, using keyword arguments, for each token found in the source. Since additional keys may be added in the future, it is recommended that unrecognized keyword arguments be collected (e.g. with **other_args):
```python
import SilverCity
from SilverCity import ScintillaConstants

def func(style, text, start_column, start_line, **other_args):
    if style == ScintillaConstants.SCE_P_WORD and text == 'import':
        print 'Found an import statement at (%d, %d)' % \
              (start_line + 1, start_column + 1)

source_code = 'import os'  # any Python source string will do
keywords = SilverCity.WordList(SilverCity.Keywords.python_keywords)
properties = SilverCity.PropertySet()
lexer = SilverCity.find_lexer_module_by_id(ScintillaConstants.SCLEX_PYTHON)
lexer.tokenize_by_style(source_code, keywords, properties, func)
```
WordList objects have no methods. They simply act as placeholders for language keywords.
PropertySet objects have no methods. They act as dictionaries where the names of the properties are the keys. All keys must be strings; values are converted to strings upon assignment, i.e. retrieved values will always be strings. There is no mechanism to delete assigned keys.
Different properties apply to different languages. The following table is a complete list of properties, the languages they apply to, and their meanings:
| Property | Language | Values |
|---|---|---|
| asp.default.language | HTML | Sets the default language for ASP scripts: 0 => None, 1 => JavaScript, 2 => VBScript, 3 => Python, 4 => PHP, 5 => XML-based |
| styling.within.preprocessor | C++ | Determines whether all preprocessor instructions are lexed identically or whether subexpressions are given different lexical states: 0 => Same, 1 => Different |
| tab.timmy.whinge.level | Python | A bitfield that causes different kinds of incorrect whitespace to have their lexical states incremented by 64: 0 => no checking, 1 => check for correct indenting, 2 => check for literal tabs, 4 => check for literal spaces used as tabs, 8 => check for mixed tabs and spaces |
Example PropertySet usage:
```python
import SilverCity

propset = SilverCity.PropertySet({'styling.within.preprocessor': 0})
propset['styling.within.preprocessor'] = 1  # changed my mind
```
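Since values are converted to strings on assignment, reading a property back returns a string (a small sketch continuing the example above, assuming dictionary-style reads behave as described):

```python
value = propset['styling.within.preprocessor']
assert value == '1'  # stored and retrieved as a string, not the integer 1
```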
There are submodules available for many of the languages supported by SilverCity.
These submodules offer a slightly more convenient interface for tokenization and have HTML generation code. Here is an example using the Python submodule:
```
>>> from SilverCity import Python
>>> Python.PythonLexer().tokenize_by_style('import test')
[{'style': 5, 'end_line': 0, 'end_column': 5, 'text': 'import', 'start_line': 0, 'start_column': 0, 'start_index': 0, 'end_index': 5}, {'style': 0, 'end_line': 0, 'end_column': 6, 'text': ' ', 'start_line': 0, 'start_column': 6, 'start_index': 6, 'end_index': 6}, {'style': 11, 'end_line': 0, 'end_column': 10, 'text': 'test', 'start_line': 0, 'start_column': 7, 'start_index': 7, 'end_index': 10}]
>>> import StringIO
>>> test_file = StringIO.StringIO()
>>> Python.PythonHTMLGenerator().generate_html(test_file, 'import test')
>>> test_file.getvalue()
'<span class="p_word">import</span><span class="p_default"> </span><span class="p_identifier">test</span>'
```
```
>>> from SilverCity import CPP
>>> test_file = StringIO.StringIO()
>>> CPP.CPPHTMLGenerator().generate_html(test_file, 'return 5+5')
>>> test_file.getvalue()
'<span class="c_word">return</span><span class="c_default"> </span><span class="c_number">5</span><span class="c_operator">+</span><span class="c_number">5</span>'
>>>
```
The ScintillaConstants module contains a list of lexer identifiers (used by find_lexer_module_by_id) and lexical states for each LexerModule. You should take a look at this module to find the states that are useful for your programming language.
The Keywords module contains lists of keywords that can be used to create WordList objects.
There are also some modules that package tokenize_by_style into a class that offers a visitor pattern (think SAX). You don't have to worry about these modules if you don't want to. But, if you do, they are all written in Python so you can probably muddle through.
Note that some lexers that are supported by Scintilla are not supported by SilverCity. This is because I am lazy. Any contributions are welcome (and should be pretty easy to make).