From 9355856dc4b99533ab0e0bfc4b9be61dd20d50ac Mon Sep 17 00:00:00 2001
From: Tim Van Baak <tim.vanbaak@gmail.com>
Date: Mon, 27 May 2024 21:07:01 -0700
Subject: [PATCH] Post about bookmarks

---
 src/blog/2024/bookmarks.md | 46 ++++++++++++++++++++++++++++++++++++++
 src/blog/2024/index.md     |  1 +
 src/blog/index.md          |  3 ++-
 3 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 src/blog/2024/bookmarks.md
 create mode 100644 src/blog/2024/index.md
diff --git a/src/blog/2024/bookmarks.md b/src/blog/2024/bookmarks.md
new file mode 100644
index 0000000..664c4f5
--- /dev/null
+++ b/src/blog/2024/bookmarks.md
@@ -0,0 +1,46 @@
+---
+title: Archiving bookmarks
+pubdate: 2024-05-27T21:06:37-07:00
+feed: blog
+---
+
+A [comment on Hacker News](https://news.ycombinator.com/item?id=40397848) got me thinking.
+
+> I realized I was overusing bookmarks. I now save webpages (perhaps as PDF) if it contains information I want to refer to later, such as an insightful article, technical information, a humorous bit, or the like.
+
+> Bookmarks are good only for links to things for which only the most current version is worth accessing. That’s my banking websites, a shopping site, my employer’s remote desktop system, etc.
+
+Right now I have somewhere around 5,500 bookmarks. They are sorted into a few top-level categories. "Ref" contains pages for reference, such as articles I might want to cite later, neat sites I might want to read again, reaction image links, etc. "Util" contains a few utilities but mostly gets opened for a few bookmarklets. "Later" is something like a reading list, containing hundreds upon hundreds of "I should read this later when I have time" links, plus several subfolders by topic. "net" once contained videos or downloads to do at a time when I had slow Internet, but now mostly contains a list of things I need to download and sort. "proj" contains some project-specific folders with reference material, like pages on specific language features some program will need to use.
+
+For most of these, that comment is very accurate. Many links I saved to articles are now dead, broken by site reorganizations or the host going down. Some can be recovered on the Internet Archive; others can't, or weren't archivable to begin with. For a few, the site has become less usable or parts of the page have been removed; the latest version of the webpage is _less_ desirable than the version that was bookmarked!
+
+I've been going through some of the "Ref" bookmarks like boxes of old things never unpacked after a move. Some of them aren't really very interesting any more and I can delete them. Some of them are worth saving, and I archive them. Pages without dependencies are the easiest to just download. For pages with resources, I am using the [SingleFile](https://github.com/gildas-lormeau/SingleFile) extension, which does what it says in the name. (It also operates on the page as it appears in your browser, which means I can delete useless things like comment sections before saving the page.) A few bookmarked sites have multiple pages, so I mirror them with `wget`.
+
+For most webpages, what you really want is a few kilobytes of text. SingleFile is very useful for preserving the style of a page, but it also produces files with sizes in the megabytes. If I just want a few paragraphs, it's much easier to use [`htmlq`](https://github.com/mgdm/htmlq) to cut out the section that has the text I want and just save that. This is the script I'm currently using:
+
+    #!/usr/bin/env bash
+
+    if [ "$#" -lt 2 ]; then
+    	echo "usage: $0 <url> <selector>" >&2
+    	exit 1
+    fi
+
+    URL="$1"
+    SELECTOR="$2"
+
+    PAGE=$(curl -s "$URL")
+
+    echo '<html>
+    <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
+    echo "$PAGE" | htmlq title
+    echo '<link rel="alternate" href="'"$URL"'" />
+    </head>
+    <body>'
+    echo "$PAGE" | htmlq "$SELECTOR"
+    echo '</body>
+    </html>'
+
+This preserves some context by leaving the original bookmark URL in a `rel="alternate"` link.
+
+The paradox of the information age is that copying information is easy, even trivial, but once the information is gone you have almost no chance of finding it again. An old book may turn up in a dusty shop somewhere, but if a webpage is gone and the Internet Archive doesn't have it, it's probably gone forever (so [donate to the Internet Archive](https://archive.org/donate)). Storage is cheap, especially for text; save your own copy before the original is lost to the perpetual rot.
diff --git a/src/blog/2024/index.md b/src/blog/2024/index.md
new file mode 100644
index 0000000..5d3c420
--- /dev/null
+++ b/src/blog/2024/index.md
@@ -0,0 +1 @@
+* [Archiving bookmarks](./2024/bookmarks.md)
diff --git a/src/blog/index.md b/src/blog/index.md
index 07790f4..b24562c 100644
--- a/src/blog/index.md
+++ b/src/blog/index.md
@@ -4,6 +4,7 @@ title: Blog
 
 [RSS](./feed.xml)
 
+* [Archiving bookmarks](./2024/bookmarks.md)
 * [SHLVL PS1](./2023/shlvl.md)
 * [Backing up my ZFS NAS to an external drive](./2023/zfs-nas-backup.md)
-* [The traditional first software engineer blog post](./2023/blog-start.md)
\ No newline at end of file
+* [The traditional first software engineer blog post](./2023/blog-start.md)