Extensions to search for text processing
1st April 2026
One of the most useful Common Lisp functions for text processing is search; it lets you search for a substring in a string, and returns its position.
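For anyone less familiar with search, here is a minimal sketch of its basic behaviour. The string is made up for illustration. Note that the substring comes first in the argument list, that matching is case-sensitive unless you supply a :test, and that :start2 sets where to begin in the string being searched.

```lisp
;; Illustrative string only; SEARCH itself is standard Common Lisp.
(defparameter *text* "Common Lisp is a Lisp")

;; First argument is the substring, second is the string to search.
(search "Lisp" *text*)                  ; => 7

;; No match returns NIL; the default test is case-sensitive.
(search "lisp" *text*)                  ; => NIL

;; EQUALP compares characters ignoring case.
(search "lisp" *text* :test #'equalp)   ; => 7

;; :START2 begins the search part-way through the second string.
(search "Lisp" *text* :start2 8)        ; => 17
```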
I've recently written two extensions to search to help with processing the HTML in web pages, and I thought they might be useful to others.
search2
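The first routine, search2, searches a target string for text enclosed between an opening string and a closing string. It returns three values: the position of the start of the opening string, the position just after the end of the closing string, and the enclosed string. If either delimiter isn't found it returns nil. An optional :start keyword gives the position in the target at which to start searching.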
For example:
(defparameter *str* "<body><h2>Getting started</h2></body>")
> (search2 *str* "<h2>" "</h2>")
6
30
"Getting started"
Here's a typical application: process all the h2 headings in a string, adding ids to them for use as anchors, and output the result to a stream:
(defun add-ids (page stream)
  (let ((start 0))
    (loop
      (multiple-value-bind (in out head) (search2 page "<h2>" "</h2>" :start start)
        (cond
          (in
           (format stream "~a" (subseq page start in))
           (format stream "<h2 id=\"~(~a~)\">~a</h2>" (substitute #\- #\space head) head)
           (setq start out))
          (t
           (format stream "~a" (subseq page start))
           (return)))))))
For example:
(defparameter *page* "<body><h2>Getting started</h2><p>Welcome to my blog!</p></body>")
> (add-ids *page* t)
<body><h2 id="getting-started">Getting started</h2><p>Welcome to my blog!</p></body>
NIL
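The id in add-ids is built from two standard pieces: substitute swaps each space for a hyphen, and the ~( ... ~) format directive prints its contents in lower case. As a small sketch, the same step can be pulled out into a helper; the name heading-id is hypothetical and isn't part of the code above.

```lisp
;; HEADING-ID is a hypothetical helper, not part of the original post.
(defun heading-id (heading)
  ;; SUBSTITUTE returns a copy with each space replaced by a hyphen;
  ;; the ~( ... ~) directive then prints that copy in lower case.
  (format nil "~(~a~)" (substitute #\- #\space heading)))

(heading-id "Getting started")   ; => "getting-started"
```

Only spaces are handled here, as in add-ids; headings containing other punctuation would need extra treatment to give tidy ids.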
search3
The second function, search3, does the equivalent for three strings. It searches a target string for two pieces of text: one enclosed between an opening string and a middle string, and one enclosed between the middle string and a closing string. It returns four values: the position of the start of the opening string, the position just after the end of the closing string, and the two enclosed strings.
For example:
(defparameter *str2* "<body><h2 id=\"getting-started\">Getting started</h2></body>")
> (search3 *str2* "<h2 id=\"" "\">" "</h2>")
6
51
"getting-started"
"Getting started"
Here's a typical application: process all the h2 headings in a string, removing ids from them, and output the result to a stream:
(defun remove-ids (page stream)
  (let ((start 0))
    (loop
      (multiple-value-bind (in out id head) (search3 page "<h2 id=\"" "\">" "</h2>" :start start)
        (declare (ignore id))
        (cond
          (in
           (format stream "~a" (subseq page start in))
           (format stream "<h2>~a</h2>" head)
           (setq start out))
          (t
           (format stream "~a" (subseq page start))
           (return)))))))
For example:
(defparameter *page2* "<body><h2 id=\"getting-started\">Getting started</h2><p>Welcome to my blog!</p></body>")
> (remove-ids *page2* t)
<body><h2>Getting started</h2><p>Welcome to my blog!</p></body>
NIL
The functions
Here are the functions search2 and search3:
(defun search2 (target open close &key (start 0))
  ;; EQUALP compares characters ignoring case, so "<H2>" also matches "<h2>".
  (let ((in (search open target :start2 start :test #'equalp)))
    (when in
      (let* ((lin (length open))
             (lout (length close))
             (out (search close target :start2 (+ in lin) :test #'equalp)))
        (when out
          (let ((result (subseq target (+ in lin) out)))
            (values
             in
             (+ out lout)
             result)))))))

(defun search3 (target open mid close &key (start 0))
  (let ((in (search open target :start2 start :test #'equalp)))
    (when in
      (let* ((lin (length open))
             (lmid (length mid))
             (lout (length close))
             ;; From here on, mid is the position of the middle string, not the string.
             (mid (search mid target :start2 (+ in lin) :test #'equalp))
             (out (when mid (search close target :start2 (+ mid lmid) :test #'equalp))))
        (when out
          (let ((result1 (subseq target (+ in lin) mid))
                (result2 (subseq target (+ mid lmid) out)))
            (values
             in
             (+ out lout)
             result1
             result2)))))))
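Since search3 repeats the pattern of search2 with one more delimiter, the same idea could be extended to any number of delimiters. The following is only a sketch under that assumption: the name search-delimited and its interface (a list of delimiters in, a list of enclosed strings out) are hypothetical and not part of the functions above.

```lisp
;; Hypothetical generalisation, not from the original post.
;; DELIMITERS is a list of strings that must occur in order in TARGET.
;; Returns three values: the start of the first delimiter, the position
;; just after the last delimiter, and a list of the strings enclosed
;; between consecutive delimiters. Returns NIL if any delimiter is missing.
(defun search-delimited (target delimiters &key (start 0))
  (let ((in (search (first delimiters) target :start2 start :test #'equalp)))
    (when in
      (let ((pos (+ in (length (first delimiters))))
            (results '()))
        (dolist (delim (rest delimiters)
                       (values in pos (nreverse results)))
          (let ((next (search delim target :start2 pos :test #'equalp)))
            ;; Give up as soon as one delimiter can't be found.
            (unless next
              (return nil))
            (push (subseq target pos next) results)
            (setq pos (+ next (length delim)))))))))

(search-delimited "<body><h2>Getting started</h2></body>" '("<h2>" "</h2>"))
;; => 6, 30, ("Getting started")
```

With two delimiters this gives the same positions as search2, and with three the same as search3, except that the enclosed strings come back as a single list rather than as separate values.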
