A Web Clipper of Sorts for Org-Mode

1 An Org-Mode Web Clipper

1.1 What Is It?

I wanted a way to quickly capture web pages, or selections from web pages, when using either the W3M or EWW Emacs internal browsers (think: Evernote web clipper and the like). Now, there are things out there that allow for capturing web pages, but they weren’t the fast and simple sort of thing that I was looking for.

For instance, one of them requires you to set up a file of links, with URLs and other information as properties for a given headline describing the website being archived. This is really powerful, but it’s multi-step and relies on an intermediate file.

There are also clever methods using org-protocol, but I wanted to work with an internal browser, not an external one. Again, I was looking for speed and simplicity.

So I rolled my own from existing functions and components. It’s the Emacs way.

1.2 Org-Clipper Coding

First, I set up a special org-capture template:

("w" "Website" plain
 (function org-website-clipper)
 "* %a\n%T\n" :immediate-finish t)

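It turned out not to be so easy to get org-capture to call a custom function. The only place I found to do it is the template’s target: a (function ...) target is normally there just to visit the capture file and position point. So I made use of that and essentially ‘overloaded’ it to insert the page content as well.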
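For anyone unsure where the template goes, here is a minimal sketch of registering it. It assumes you keep your capture templates in your init file and add to them with add-to-list; if you set org-capture-templates in one big setq instead, put the same list entry there.

```elisp
;; Sketch only: register the "w" template alongside any existing ones.
;; Assumes org-capture-templates is managed from your init file.
(require 'org-capture)
(add-to-list 'org-capture-templates
             '("w" "Website" plain
               ;; The target function visits the archive file, inserts
               ;; the page content, and leaves point for the header.
               (function org-website-clipper)
               "* %a\n%T\n" :immediate-finish t))
```

With :immediate-finish set, the capture is filed without opening a capture buffer for editing.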

Then I put together the code that makes it work.

;; org-eww and org-w3m should be in your org distribution, but see
;; note below on patch level of org-eww.
(require 'org-eww)
(require 'org-w3m)
(defvar org-website-page-archive-file "~/organize/website/websites.org")
(defun org-website-clipper ()
  "When capturing a website page, go to the right place in capture file,
   but do sneaky things. Because it's a w3m or eww page, we go
   ahead and insert the fixed-up page content, as I don't see a
   good way to do that from an org-capture template alone. Requires
   Emacs 25 and the 2017-02-12 or later patched version of org-eww.el."

  ;; Check for acceptable major mode (w3m or eww) and set up a couple of
  ;; browser specific values. Error if unknown mode.

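  (cond
   ;; Each copy command puts the region (or the whole page, if no
   ;; region is active) on the kill ring with links in org format.
   ((eq major-mode 'w3m-mode)
    (org-w3m-copy-for-org-mode))
   ((eq major-mode 'eww-mode)
    (org-eww-copy-for-org-mode))
   (t
    (error "Not valid -- must be in w3m or eww mode")))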

  ;; Check if we have a full path to the archive file. 
  ;; Create any missing directories.

  (unless (file-exists-p org-website-page-archive-file)
    (let ((dir (file-name-directory org-website-page-archive-file)))
      (unless (file-exists-p dir)
        ;; The second argument also creates missing parent directories.
        (make-directory dir t))))

  ;; Open the archive file and yank in the content.
  ;; Headers are fixed up later by org-capture.
  (find-file org-website-page-archive-file)
  (goto-char (point-max))
  ;; Leave a blank line for org-capture to fill in
  ;; with a timestamp, URL, etc.
  (insert "\n\n")
  ;; Insert the web content but keep our place.
  (save-excursion (yank))
  ;; Don't keep the page info on the kill ring.
  ;; Also fix the yank pointer.
  (setq kill-ring (cdr kill-ring))
  (setq kill-ring-yank-pointer kill-ring)
  ;; Final repositioning.
  (forward-line -1))

This works for both EWW and W3M. You’ll want to change the variable ‘org-website-page-archive-file’ to something suitable for you.
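A small sketch of overriding the archive location from your init file follows. The path shown is purely a placeholder, not anything the clipper expects; use wherever you keep your notes.

```elisp
;; Hypothetical path, for illustration only.
;; If this runs before the clipper code is loaded, the defvar there
;; will not override it; if it runs after, setq simply replaces the
;; default value.
(setq org-website-page-archive-file
      (expand-file-name "~/notes/web-clippings.org"))
```

The clipper creates the containing directory on first use if it does not exist yet.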

1.3 Doing It

It’s simplicity itself. In EWW or W3M, when you’re on a page you want to capture, you can mark a region to capture. If you don’t, the default is to save the whole page. Then invoke org-capture, probably with ‘C-c c’. Select the ‘w’ template and that’s it.
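If org-capture isn’t bound to a key yet, something like the following sketch would do it. The helper command and its name, my/clip-web-page, are my own illustration and not part of the clipper; it just passes the template key to org-capture so you skip the template menu.

```elisp
;; Bind org-capture globally; "C-c c" is a common choice.
(global-set-key (kbd "C-c c") #'org-capture)

;; Hypothetical convenience command (the name is made up for this
;; example): go straight to the "w" template without the menu.
(defun my/clip-web-page ()
  "Capture the current EWW or W3M page (or region) with the \"w\" template."
  (interactive)
  (org-capture nil "w"))
```

You could then bind my/clip-web-page to a key of its own in eww-mode-map or w3m-mode-map if one-keystroke clipping appeals to you.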

On pages with a lot of links, it’s not as speedy as I might wish, as all those links get converted to org-mode compatible links (but you really want that). I imagine that with a truly huge number of links the thing could blow up, but I haven’t seen that yet.

Your pages get continuously concatenated in your single archive file. This might get pretty big after a while (okay, it will get pretty big). Every so often you might want to do a little killing and yanking, moving entries to other files, or getting rid of cruft that you don’t want. Emacs does fine with large files up to a point, but if you’re starting to look at zillions of megabytes, you might want to do something about it.

Author: Bob Newell

Email: bobnewell@bobnewell.net

Created: 2018-03-07 Wed 14:02