Extensions to search for text processing

1st April 2026

One of the most useful Common Lisp functions for text processing is search; it lets you search for a substring in a string, and returns its position.
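As a quick reminder of how the standard function behaves, here is a minimal sketch using only search itself. The strings are made up for illustration.

```lisp
;; SEARCH takes the substring first and the string to search second.
(search "started" "Getting started")            ; => 8
(search "missing" "Getting started")            ; => NIL
;; :start2 sets where in the second sequence the search begins.
(search "t" "Getting started")                  ; => 2
(search "t" "Getting started" :start2 4)        ; => 9
```

Note that it returns NIL rather than signalling an error when the substring isn't found.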

I've recently written two extensions to search to help with processing the HTML in web pages, and I thought they might be useful to others.

search2

The first routine, search2, searches a target string for text enclosed between an opening string and a closing string. It returns three values: the position of the start of the opening string, the position of the end of the closing string, and the enclosed string.

For example:

(defparameter *str* "<body><h2>Getting started</h2></body>")
> (search2 *str* "<h2>" "</h2>")
6
30
"Getting started"
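Here is a small sketch of the :start keyword and of what happens when nothing matches. The definition of search2 is repeated from the end of this post, lightly condensed, so that the snippet runs on its own. The variable *two* is just an illustrative name.

```lisp
;; Condensed copy of search2 from the end of this post.
(defun search2 (target open close &key (start 0))
  (let ((in (search open target :start2 start :test #'equalp)))
    (when in
      (let* ((lin (length open))
             (lout (length close))
             (out (search close target :start2 (+ in lin) :test #'equalp)))
        (when out
          (values in (+ out lout) (subseq target (+ in lin) out)))))))

(defparameter *two* "<h2>One</h2><h2>Two</h2>")

;; The second value of one call is a suitable :start for the next call.
(search2 *two* "<h2>" "</h2>")             ; => 0, 12, "One"
(search2 *two* "<h2>" "</h2>" :start 12)   ; => 12, 24, "Two"
;; No further heading: the single value NIL is returned.
(search2 *two* "<h2>" "</h2>" :start 24)   ; => NIL
;; An opening string with no closing string also gives NIL.
(search2 "<h2>Oops" "<h2>" "</h2>")        ; => NIL
```

Feeding the second return value back in as :start is what lets a loop step through every match in turn.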

Here's a typical application: process all the h2 headings in a string *page* adding ids to them for use as anchors, and output the result to a stream:

(defun add-ids (page stream)
  (let ((start 0))
    (loop
     (multiple-value-bind (in out head) (search2 page "<h2>" "</h2>" :start start)
       (cond
        (in
         (format stream "~a" (subseq page start in))
         (format stream "<h2 id=\"~(~a~)\">~a</h2>" (substitute #\- #\space head) head)
         (setq start out))
        (t
         (format stream "~a" (subseq page start))
         (return)))))))

For example:

(defparameter *page* "<body><h2>Getting started</h2><p>Welcome to my blog!</p></body>")
> (add-ids *page* t)
<body><h2 id="getting-started">Getting started</h2><p>Welcome to my blog!</p></body>
NIL

search3

The second function does the equivalent for three strings. It searches a target for two enclosed strings: one between an opening string and a middle string, and one between the middle string and a closing string. It returns four values: the position of the start of the opening string, the position of the end of the closing string, and the two enclosed strings.

For example:

(defparameter *str2* "<body><h2 id=\"getting-started\">Getting started</h2></body>")
> (search3 *str2* "<h2 id=\"" "\">" "</h2>")
6
51
"getting-started"
"Getting started"

Here's a typical application: process all the h2 headings in a string *page* removing ids from them, and output the result to a stream:

(defun remove-ids (page stream)
  (let ((start 0))
    (loop
     (multiple-value-bind (in out id head) (search3 page "<h2 id=\"" "\">" "</h2>" :start start)
       (declare (ignore id))
       (cond
        (in
         (format stream "~a" (subseq page start in))
         (format stream "<h2>~a</h2>" head)
         (setq start out))
        (t
         (format stream "~a" (subseq page start))
         (return)))))))

For example: 

(defparameter *page2* 
  "<body><h2 id=\"getting-started\">Getting started</h2><p>Welcome to my blog!</p></body>")
> (remove-ids *page2* t)
<body><h2>Getting started</h2><p>Welcome to my blog!</p></body>
NIL

The functions

Here are the functions search2 and search3:

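(defun search2 (target open close &key (start 0))
  "Find the first OPEN ... CLOSE pair in TARGET at or after START.
Returns the start of OPEN, the end of CLOSE, and the text between them,
or NIL if either delimiter is missing. Matching ignores case (EQUALP)."
  (let ((in (search open target :start2 start :test #'equalp)))
    (when in
      (let* ((lin (length open))
             (lout (length close))
             ;; Look for CLOSE only after the end of OPEN.
             (out (search close target :start2 (+ in lin) :test #'equalp)))
        (when out
          (values in
                  (+ out lout)
                  (subseq target (+ in lin) out)))))))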

(defun search3 (target open mid close &key (start 0))
  "Like SEARCH2, but with a MID delimiter between OPEN and CLOSE.
Returns the start of OPEN, the end of CLOSE, the text between OPEN and MID,
and the text between MID and CLOSE, or NIL if any delimiter is missing."
  (let ((in (search open target :start2 start :test #'equalp)))
    (when in
      (let* ((lin (length open))
             (lmid (length mid))
             (lout (length close))
             ;; MID-POS avoids shadowing the MID parameter.
             (mid-pos (search mid target :start2 (+ in lin) :test #'equalp))
             (out (when mid-pos
                    (search close target :start2 (+ mid-pos lmid) :test #'equalp))))
        (when out
          (values in
                  (+ out lout)
                  (subseq target (+ in lin) mid-pos)
                  (subseq target (+ mid-pos lmid) out)))))))
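
Both functions pass :test #'equalp to search, so the delimiters match regardless of case, which is convenient for HTML tags. A small sketch of the difference, using only the standard search function on a made-up string:

```lisp
;; By default SEARCH compares elements with EQL, so case matters.
(search "<H2>" "<body><h2>Title</h2></body>")                  ; => NIL
;; With EQUALP, characters are compared ignoring case.
(search "<H2>" "<body><h2>Title</h2></body>" :test #'equalp)   ; => 6
```

If you need case-sensitive matching, one option would be to drop the :test argument, although that changes the behaviour of the functions as given above.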
