Simple Syntax Highlighting Using NLTK

Programming is usually done with some kind of syntax highlighting, which makes a program easier to read and reason about. It helps us spot a number or a string in a SQL query, or see where a function's code block starts and ends.

Then why doesn’t something similar exist for, say, essay writing? Is it actually difficult to build a syntax highlighter for the English language? It turns out that building one is extremely simple, but building a good one is something else entirely.

Here I try out a simple version of a syntax highlighter written in Python, using this post as the input, to see whether it is actually useful.

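import nltk
from nltk.tokenize import word_tokenize

# The tokenizer and tagger models must be downloaded once with nltk.download();
# the exact resource names depend on the NLTK version.

# Map Penn Treebank part-of-speech tags to colours.
tag_colour = {}

# verbs
for vb in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
    tag_colour[vb] = 'purple'

# pronouns
for pn in ['PRP', 'PRP$']:
    tag_colour[pn] = 'green'

# nouns
for nn in ['NN', 'NNP', 'NNPS', 'NNS']:
    tag_colour[nn] = 'blue'

# tag_single_word is not shown in the original post; this version is an
# assumption that wraps each coloured word in an HTML span.
def tag_single_word(tagged):
    word, tag = tagged
    if tag not in tag_colour:
        return word
    return '<span style="color: %s">%s</span>' % (tag_colour[tag], word)

# enter your text
text = 'Enter your text here.'
print(' '.join(tag_single_word(x) for x in nltk.pos_tag(word_tokenize(text))))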

This code has various shortcomings; for one, it doesn’t understand punctuation and new lines, so everything ends up incorrectly spaced. Nevertheless, it was an interesting experiment, and it shows how much Python and NLTK can achieve in a couple of lines of code.
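One possible way around the spacing problem is to split the text so that whitespace is kept as its own pieces, tag only the non-whitespace pieces, and then glue everything back together unchanged. The sketch below is an assumption rather than the code used for this post: the function name highlight, the regular expression, and the HTML span output are all made up for illustration, and the tagger is passed in as an argument so the spacing logic can be tried without NLTK installed.

```python
import re

# Same colour scheme as above: verbs purple, pronouns green, nouns blue.
TAG_COLOUR = {}
for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
    TAG_COLOUR[tag] = 'purple'
for tag in ['PRP', 'PRP$']:
    TAG_COLOUR[tag] = 'green'
for tag in ['NN', 'NNP', 'NNPS', 'NNS']:
    TAG_COLOUR[tag] = 'blue'

# Whitespace runs, words (with inner apostrophes), or single other characters.
# Every character matches one alternative, so joining the pieces restores the text.
PIECE_RE = re.compile(r"\s+|\w+(?:['’]\w+)*|[^\w\s]")


def highlight(text, pos_tag):
    """Colour words in text while keeping spacing and line breaks intact.

    pos_tag is any callable that takes a list of tokens and returns a list
    of (token, tag) pairs in the same order, for example nltk.pos_tag.
    """
    pieces = PIECE_RE.findall(text)
    tokens = [p for p in pieces if not p.isspace()]
    tags = iter(tag for _, tag in pos_tag(tokens))

    out = []
    for piece in pieces:
        if piece.isspace():
            # Whitespace is copied through untouched.
            out.append(piece)
            continue
        colour = TAG_COLOUR.get(next(tags))
        if colour:
            piece = '<span style="color: %s">%s</span>' % (colour, piece)
        out.append(piece)
    return ''.join(out)
```

With NLTK installed, calling highlight(text, nltk.pos_tag) should give output with the original punctuation and line breaks preserved. The trade-off is that this regular expression does not tokenize the way word_tokenize does (contractions stay in one piece, for instance), so the tags may be somewhat less accurate.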