Webarchivn'
Sometimes you just want to preserve something. Like a page you visited, or are visiting. Sometimes, the browser is Good Enough™ and you can save it for offline viewing. Sometimes, however, you want a proper archive. Thankfully, there is a standard for that: WARC, the Web ARChive format!
But how do you actually make one of these so-called .warc files?
Well, thanks to the folks at the GNU Project, wget already has it built in! You can just use some options to get an archive:
$ wget --adjust-extension \
--execute robots=off \
--convert-links \
--no-parent \
--mirror \
--warc-file=domainname.com \
--no-warc-keep-log \
--page-requisites \
--no-verbose "https://urlhere.com"
Now, you’ll get a folder of files that mirror the page or site, plus you’ll get a .warc file (gzip-compressed by default, so it ends in .warc.gz) that contains all of that in a single digestible format suitable for libraries and search engines, neat!
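If you want to check what actually ended up in the archive, here is a minimal sketch. It assumes wget's default gzip compression, so the file is named domainname.com.warc.gz, and it relies on the WARC-Target-URI header that each captured record carries. The warc_urls function name is made up for this example; it is not part of wget.

```shell
# warc_urls is a hypothetical helper, not a wget feature.
# It prints the unique URLs recorded in a gzip-compressed WARC file.
warc_urls() {
    # gzip -dc also handles the concatenated gzip members that wget writes.
    # grep -a treats the binary payloads as text so the headers still match.
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | tr -d '\r' \
        | sed 's/^WARC-Target-URI: *//' \
        | tr -d '<>' \
        | sort -u
}

# Assumed file name, derived from --warc-file=domainname.com above.
if [ -f domainname.com.warc.gz ]; then
    warc_urls domainname.com.warc.gz
fi
```

The tr -d '<>' step is there because some wget versions wrap the URI in angle brackets. If the list is missing pages you expected, the crawl options (for example --no-parent) are the first place to look.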