Webarchivn'

2025-02-10

Sometimes you just want to preserve something, like a page you visited or are visiting. Sometimes the browser is Good Enough™ and you can save the page for offline viewing. Sometimes, however, you want a proper archive. Thankfully, there is a standard for that: WARC (Web ARChive)!

But how do you actually make one of these so-called .warc files? Well, thanks to the folks at the GNU Project, wget already has it built in! You can just use some options to get an archive:

$ wget --adjust-extension \
       --execute robots=off \
       --convert-links \
       --no-parent \
       --mirror \
       --warc-file=domainname.com \
       --no-warc-keep-log \
       --page-requisites \
       --no-verbose "https://urlhere.com"

Now you’ll get a folder of files that mirror the page or site, plus a .warc file (wget gzips it by default, so expect domainname.com.warc.gz) that contains all of that in a single digestible format suitable for libraries and search engines. Neat!
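If you want to peek at what actually ended up in the archive, here is a rough sketch of a helper that lists the captured URLs. The function name warc_urls is made up for this post, and it assumes the archive is gzip-compressed (wget's default) and that no captured page body happens to contain a line starting with "WARC-Target-URI:". A proper WARC tool would parse records rather than grep for headers.

```shell
# warc_urls: print the unique URLs recorded in a gzip-compressed WARC file.
# Hypothetical helper, not part of wget. Usage: warc_urls FILE.warc.gz
warc_urls() {
    # The archive may consist of several concatenated gzip members;
    # gzip -dc decompresses all of them in sequence.
    gzip -dc -- "$1" \
        | grep -a '^WARC-Target-URI:' \
        | sed 's/^WARC-Target-URI: *//' \
        | tr -d '\r<>' \
        | sort -u
    # grep -a treats binary payloads as text; tr drops the CRLF line
    # endings and the angle brackets some writers put around the URI.
}
```

Running it against the archive from above would look like: warc_urls domainname.com.warc.gz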
