laserjet,
atzanteol, (edited )

I believe they used Heritrix at one point. The important bit is that they use a special archive format which is a standard. There are several tools that support it (both capturing to it and viewing it) - it allows for capturing a website in a ‘working’ condition with history or something. I’m a bit fuzzy on it since it’s been some time since I looked into it.
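
For anyone curious what such a format looks like: assuming the standard meant here is WARC (the format Heritrix writes; the comment above does not name it), a record is a version line, some named header fields, a blank line, and then a payload of `Content-Length` bytes. Below is a rough sketch of reading one uncompressed record with only the Python standard library. The function name and the sample record are made up for illustration, and real files are usually gzip-compressed and hold many records, so a dedicated WARC library is the better choice for real work.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, payload)."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as URLs contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length says how many payload bytes follow the blank line.
    length = int(headers["Content-Length"])
    payload = rest[:length]
    return version, headers, payload


# Hypothetical hand-made record for illustration; not taken from a real crawl.
body = b"<html>hello</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2024-01-01T00:00:00Z\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Target-URI"], len(payload))
```

Because the captured URL and capture date sit next to the raw payload, a replay tool can serve the page back as it looked at capture time, which is roughly the ‘working condition with history’ idea.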

avidamoeba,

It seems like all of their software is in the parent account of heritrix - github.com/orgs/internetarchive/repositories?type….

ericjmorey,

Does Linkwarden fit your intended use?

avidamoeba,

Kind of. Linkwarden seems to save pages as PDFs. That’s better than nothing, but preserving a functional copy of the pages would be better. ArchiveBox seems to do this.

possiblylinux127,

I don’t know for certain but I’m sure they run lots of different software. They have PBs of data.

kittykittycatboys,

afaik, archive.org isn’t open source. i’d recommend something like archivebox.io

possiblylinux127,

ArchiveBox is a piece of software and the Internet Archive is an organization that is focused on predicting the content on the internet.

The Internet Archive has PBs worth of data. I doubt any home user could manage that.

z00s, (edited )

archive

predicting

?

recapitated,

They’re beating the algorithm

mosiacmango,

Protecting

kittykittycatboys,

i don’t think op is looking to mirror archive.org, my take was that they wanted something like archive.org but selfhosted and for personal / small-scale use

avidamoeba,

Exactly. I’m already running a local wiki, but I don’t want stuff I link to in my wiki to result in 404 in a few years. Or worse, to some AI-ridden ad-infested dumpster fire.
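
One way to get started on that, as a rough sketch: pull every external URL out of the wiki's page text so the list can be handed to whichever archiver ends up being used. The regex, the function name, and the sample page text below are illustrative assumptions, not part of any particular wiki or archiving tool.

```python
import re

# Deliberately simple pattern: stop at whitespace, quotes, and closing brackets
# so Markdown-style links like [text](https://...) are handled.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text: str) -> list[str]:
    """Return the unique http(s) URLs found in text, in first-seen order."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that commonly trails a URL at the end of a sentence.
        url = match.group(0).rstrip(".,;")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical wiki page content for illustration.
page = (
    "See [docs](https://example.com/docs) and https://example.org/a, "
    "also https://example.com/docs."
)

for url in extract_links(page):
    print(url)
```

Printing one URL per line keeps the output easy to save to a file or pipe into an archiving tool, and re-running it on a schedule would pick up links added to the wiki later.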

laserjet,

You can use something as simple as a browser extension like SingleFile, which can automatically download complete, self-contained copies of anything you bookmark, or of only certain URLs.

avidamoeba, (edited )

Oh yes, this looks like a winner. Thanks!

It seems like it’s written in Python too, which means I can maintain it if need be.

Oh boy I wish I had set this up many years ago. I wouldn’t have to resort to scouring !antiquememesroadshow for the top quality memes of the past when I need them…

On a far side of the moon note, I wonder if ActivityPub could be used to federate multiple archiveboxes to create a more resilient Internet Archive alternative. 🤔 Then integrate that with Lemmy to autoarchive links from posts. Aaand lemmy.world ran out of disk space. 🤣

Dehydrated,

+1 for ArchiveBox
