Blog search

Problem statement

I’d like to provide a quick way to retrieve information from this blog.

In terms of constraints, this blog is statically generated and hosted on GitHub Pages, which disallows arbitrary plugins. That limits most solutions to client-side code and/or external vendors. I’d also prefer to keep things free.

Solutions

Tags

Jekyll supports tags, and the Forestry CMS I use enables me to manage tags alongside content, so I can start by including tags in my index of notes.

Client-side search

Ideally, I could provide inline keyword search.

Hosted search providers understandably require control over the results UI, which rules them out for inline search.

Lunr is a convenient JS library for keyword extraction and lookup, and it supports pre-building the search index to improve client performance. However, the index for my content was 500 KB, and the search syntax, although powerful, was unintuitive for my simple needs.

Google has published a list of the most common English words. I could strip these from my content and then include the remainder in my index, e.g.:

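{% raw %}
  {% comment %} Assumption: `post` comes from an enclosing loop over site.posts {% endcomment %}
  {% assign words = post.content | strip_html | split: " " %}
  {% for word in words %}
    {% unless site.data.stop_words contains word %}
      {{ word }}
    {% endunless %}
  {% endfor %}
{% endraw %}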

This still yields more words than I can comfortably display in an index. I’m also limited to Liquid syntax for index generation, which complicates things like excluding code snippets.

So far, the best solution has been constructing a regex from an input string, applying it to the titles and tags of my index and then hiding entries that don’t match.
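
A minimal sketch of that regex filter, not the exact code this blog runs. It assumes each index entry is an object with a `title` string and a `tags` array; those field names, and the function names `escapeRegExp`, `buildQueryRegex` and `filterEntries`, are hypothetical. Requiring every whitespace-separated term to match, in any order, is also my assumption about what makes a reasonable filter.

```javascript
// Escape regex metacharacters so the user's input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive regex that requires every whitespace-separated
// term to appear somewhere in the tested string (one lookahead per term).
// An empty query produces an empty pattern, which matches everything.
function buildQueryRegex(query) {
  const terms = query.trim().split(/\s+/).filter(Boolean);
  const lookaheads = terms.map((term) => `(?=.*${escapeRegExp(term)})`).join("");
  return new RegExp(lookaheads, "i");
}

// Return a copy of each entry with a `hidden` flag set when neither its
// title nor its tags match the query. Assumed entry shape: { title, tags }.
function filterEntries(entries, query) {
  const pattern = buildQueryRegex(query);
  return entries.map((entry) => {
    const haystack = [entry.title, ...entry.tags].join(" ");
    return { ...entry, hidden: !pattern.test(haystack) };
  });
}
```

In the browser, the same `hidden` flag could be copied onto each list item's `hidden` property from an `input` event listener on the search box, so non-matching entries disappear as the reader types.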

Server-side search

I can take advantage of Google’s search indexing by defining a Jekyll sitemap. I can also get closer to inline filtering by using Chrome’s omnibox, which can discover a site’s search endpoint from an OpenSearch description file. Here’s the (old, GitHub-based) blog’s opensearch.xml.

Integrated search

I’m using the term “integrated” to refer to a search feature built into a larger product. For example, WordPress offers full-text search as a feature. I took this path as of 2020-ish, to free up time to focus on non-search topics.