Trying to decipher LDA topics is hard. In this post I propose an extremely naïve way of labelling topics, inspired by the (unsurprisingly named) paper Automatic Labelling of Topic Models.
The gist of the approach is that we can use web search, in an information-retrieval sense, to turn the top keywords of an LDA topic into a human-readable label.
Algorithm (Extracting Relevant Information):

Input: top keywords for an LDA topic

for each keyword:
    for each Wikipedia search result for the keyword:
        if any other keyword appears in the result's article summary:
            yield article heading
        end if
    end for
end for
human_topic = keywords(list of article headings)
return human_topic
Of course this can be extended in many ways. The paper suggests using not only article headings but also article summaries and Google search results, in an "ensemble" style of model; a rough sketch of that extension appears at the end of this post.
In my naïve approach, the libraries I used within Python are:

gensim
: for determining the keywords, using gensim.summarization.keywords
wikiapi
: for crawling through Wikipedia

Both can be installed via PyPI.
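To orient ourselves, here is a minimal smoke test of the two APIs as they are used below. The query string is just an example, and this assumes a version of gensim that still ships the gensim.summarization module (it was removed in gensim 4.0):

from wikiapi import WikiApi
import gensim

# wikiapi: search Wikipedia and inspect the first result
wiki = WikiApi()
results = wiki.find('stock market')     # list of search-result identifiers
article = wiki.get_article(results[0])  # article object with .heading and .summary
print(article.heading)

# gensim: pull the top three keywords out of a blob of text
text = article.heading + ' ' + article.summary
print(gensim.summarization.keywords(text, words=3).split('\n'))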
from wikiapi import WikiApi
import gensim


def get_relevant_articles(keywords, search_depth=5, keyword_summary=5):
    """
    Searches through a list of keywords and returns keywords based on the
    headings of matching Wikipedia articles.

    args:
    * keywords: a list of keywords
    * search_depth: how many Wikipedia search results are checked per keyword,
      assumed to be between 1 and 10
    * keyword_summary: the gensim `words` argument, i.e. how many keywords the
      summarization step should return
    """
    if len(keywords) == 0:
        return []

    wiki = WikiApi()
    keywords = [x.lower() for x in keywords]
    info = []
    for keyword in keywords:
        results = wiki.find(keyword)
        other_words = [x for x in keywords if x != keyword]
        if search_depth is not None:
            results = results[:search_depth]
        for result in results:
            article = wiki.get_article(result)
            summary_words = article.summary.lower().split(' ')
            # keep this article's heading if its summary mentions any other keyword
            has_words = any(word in summary_words for word in other_words)
            if has_words:
                info.append(article.heading)
    try:
        info_keyword = gensim.summarization.keywords(' '.join(info),
                                                     words=keyword_summary).split('\n')
    except Exception:
        # keyword extraction can fail, e.g. on empty or very short input
        print("keyword extraction failed, defaulting to article heading output")
        info_keyword = info[:]
    return info_keyword


def lemmatize_all(docs):
    """Lemmatize a list of strings, returning the deduplicated lemmas."""
    from gensim import utils
    import itertools

    def lemmatize_single(doc):
        result = utils.lemmatize(doc)
        # utils.lemmatize returns tokens like 'trading/NN'; drop the '/POS' tag
        return [x[:-3] for x in result]

    return list(set(itertools.chain.from_iterable([lemmatize_single(x) for x in docs])))


all_results = get_relevant_articles(
    "stock market investor fund trading investment firm exchange companies share".split())
print(all_results) # [u'investor', u'investors', u'reform', u'investment', u'exchange', u'trade', u'trading']
print(lemmatize_all(all_results)) # ['exchange', 'reform', 'trade', 'trading', 'investor', 'investment']
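To make the ensemble idea concrete, here is a rough sketch of a hypothetical variant of get_relevant_articles that feeds matching article summaries, and not just headings, into the keyword-extraction step. The function name, the doubling of the headings as a crude weighting scheme, and the fallback are all my own assumptions rather than anything prescribed by the paper:

from wikiapi import WikiApi
import gensim


def get_relevant_articles_ensemble(keywords, search_depth=5, keyword_summary=5):
    """Hypothetical sketch: label a topic from article headings *and* summaries."""
    wiki = WikiApi()
    keywords = [x.lower() for x in keywords]
    headings, summaries = [], []
    for keyword in keywords:
        other_words = [x for x in keywords if x != keyword]
        for result in wiki.find(keyword)[:search_depth]:
            article = wiki.get_article(result)
            if any(word in article.summary.lower().split(' ') for word in other_words):
                headings.append(article.heading)
                summaries.append(article.summary)
    # Crude "ensemble": run keyword extraction over headings and summaries together.
    # Repeating the headings weights them more heavily than the summary text.
    text = ' '.join(headings * 2 + summaries)
    try:
        return gensim.summarization.keywords(text, words=keyword_summary).split('\n')
    except Exception:
        return headings[:]

Google search results could be folded in the same way, by appending result titles and snippets to the two pools before the keyword-extraction step.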