Why Should You Archive Websites?
Over the years, the World Wide Web has enabled people across the globe to share and communicate information easily. One issue with the Web, however, is that websites do not hold up over time. Most websites only stay active for around two to five years. After that, they either go offline completely or are replaced by a different website altogether. For example, few, if any, websites from the 1990s are still online today. If you would rather not install anything, you can also use the Wayback Machine to archive websites.
Archivebox’s Requirements
Before you can install Archivebox, you need to make sure that you have the following resources:
A machine that you can access from outside your home network. This can either be a machine at home that you can port-forward or a rented remote VPS.
An adequate amount of storage space. In most cases, a 1TB disk should be able to store between 100,000 and 1,000,000 individual webpages.
A filesystem that is either EXT4 or ZFS, which Archivebox needs to work properly.
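The storage and filesystem requirements above can be checked before installing anything. The following is a minimal sketch, assuming GNU df as shipped with Ubuntu. The function names fs_type and estimate_pages are made up for this example, and the per-page size is an assumption derived from the 1TB figure: 1TB spread over 100,000 to 1,000,000 pages works out to roughly 1MB to 10MB per page.

```shell
# Print the type of the filesystem that holds the given directory.
# Assumes GNU df, which supports the --output option.
fs_type() {
    df --output=fstype "$1" | tail -n 1
}

# Estimate how many pages fit in the free space under a directory.
# $1 = directory to check, $2 = assumed average page size in kilobytes.
estimate_pages() {
    avail_kb=$(df -k --output=avail "$1" | tail -n 1 | tr -d ' ')
    echo $(( avail_kb / $2 ))
}
```

For example, fs_type "$HOME" should print ext4 or zfs on a suitable machine, and estimate_pages "$HOME" 10000 gives a cautious page count at about 10MB per page.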
Note: this tutorial focuses on installing and configuring Archivebox on a local Ubuntu 22.04 LTS machine.
Installing Archivebox
First, open a terminal and install the program’s dependencies.
Next, install Archivebox through Python PIP.
Then, create a folder where Archivebox will save all of its data. In my case, I am using the “/home/archivebox” directory.
Lastly, finalize your Archivebox instance by running its setup from inside that folder. This downloads and configures the dependencies that the program needs to run on your machine. You will be asked for the details of the first user.
Once the setup finishes, check that Archivebox was installed properly before moving on.
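The commands for these steps are not shown in this copy of the text, so the following is only a sketch of what they could look like. It assumes that pip3 is already installed, uses the ArchiveBox subcommands init --setup and version, and wraps everything in a hypothetical helper named install_archivebox so that nothing runs until you call it. The system packages for the dependency step are not listed in the text and are left out here.

```shell
# Hypothetical helper: install ArchiveBox and set up a data folder.
# $1 = folder where ArchiveBox should keep its data.
install_archivebox() {
    # Install the ArchiveBox package through pip.
    pip3 install archivebox || return 1
    # Create the data folder and move into it, because ArchiveBox
    # works on the folder it is run from.
    mkdir -p "$1" || return 1
    cd "$1" || return 1
    # Initialize the collection; --setup also prepares extra
    # dependencies and asks for the first user's details.
    archivebox init --setup || return 1
    # Print the installed version and the status of the dependencies.
    archivebox version
}
```

Calling install_archivebox "$HOME/archivebox" would install the package, create the data folder and run the setup inside it. The folder path is only an example.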
Preparing the Web GUI
While Archivebox is perfectly usable as a command line utility, you can also access the program through a web interface. This is useful if you want to share Archivebox with other users or access the program from outside your server.
To host the web GUI, you need an Nginx reverse proxy that forwards incoming web traffic to the Archivebox daemon.
Create a new Nginx configuration file for Archivebox and set its server_name to your own domain name.
Enable the Archivebox configuration.
Restart Nginx and start the Archivebox daemon.
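The reverse proxy configuration itself is missing from this copy of the text. A minimal sketch of such a configuration follows, written as a small shell helper (write_archivebox_nginx_conf, a made-up name) that fills in the domain. It assumes that the Archivebox daemon listens on 127.0.0.1:8000 and that Nginx serves plain HTTP on port 80, so adjust both to your setup.

```shell
# Hypothetical helper: write a reverse proxy server block for ArchiveBox.
# $1 = your domain name, $2 = path of the configuration file to write.
write_archivebox_nginx_conf() {
    # The heredoc is unquoted so that $1 expands; Nginx's own variables
    # are escaped with a backslash so they reach the file unchanged.
    cat > "$2" <<EOF
server {
    listen 80;
    server_name $1;

    location / {
        # Assumed address of the ArchiveBox daemon.
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto \$scheme;
    }
}
EOF
}
```

On Ubuntu, the generated file would typically go in /etc/nginx/sites-available/archivebox, be linked into /etc/nginx/sites-enabled/, and take effect after running sudo systemctl restart nginx. The daemon can then be started from the data folder with archivebox server 127.0.0.1:8000.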
Archiving Your First Website
Open your web browser and access the Archivebox instance through your domain name. In my case, I am going to “yetanotherarchivebox.xyz”.
Click the “LOG IN” button in the webpage’s upper-right corner, then enter your user credentials to log in.
Archive your first website by pressing the “Add” button on the page’s upper bar. This loads a large dialog box where you can add a list of web links that you would like to archive. In my case, I am adding “https://maketecheasier.com”.
Next, choose how to archive your website. For example, you can provide a set of tags for your links to sort them properly. You can also tell Archivebox to save the contents of every page that is linked directly from the page you are archiving. This is useful when you want to preserve the context of a website.
Click the “Add URLs and Archive” button to start the archiving process. In most cases, this should only take one to two minutes.
Archiving a Website Using the Command Line
You can archive a webpage from the command line with the add subcommand.
The add subcommand can also archive a list of web links. For example, I can tell Archivebox to save every link in my “bookmarks.txt” file.
Lastly, it is also possible to create a self-contained archive of a single webpage.
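The exact commands are not preserved in this copy of the text. As a sketch of the three cases described above, the helpers below use the ArchiveBox subcommands add and oneshot. The function names, the folder layout and the idea of running oneshot in its own folder are assumptions for illustration, and subcommands can differ between ArchiveBox versions.

```shell
# Hypothetical helper: archive one URL, then every link listed in a file.
# $1 = ArchiveBox data folder, $2 = URL, $3 = absolute path to a text
# file with one link per line (for example bookmarks.txt).
archive_links() {
    cd "$1" || return 1
    # Archive a single page into the collection.
    archivebox add "$2" || return 1
    # Feed the list of links to the add subcommand on standard input.
    archivebox add < "$3"
}

# Hypothetical helper: make a self-contained archive of a single page.
# $1 = folder for the output, $2 = URL.
archive_single_page() {
    mkdir -p "$1" || return 1
    cd "$1" || return 1
    # oneshot archives one URL without needing a full collection.
    archivebox oneshot "$2"
}
```

For example, archive_links "$HOME/archivebox" https://maketecheasier.com "$HOME/bookmarks.txt" covers the first two cases, and archive_single_page "$HOME/single-page" https://maketecheasier.com covers the third.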
Customizing Archivebox
You can also customize how Archivebox obtains the pages that it saves. For example, it is possible to save only a screenshot of every web page that you archive. This is helpful for users who want to save disk space while storing websites. To do this, disable the other formats in Archivebox’s configuration.
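The original commands are not included in this copy of the text. One way to get a screenshot-only setup is sketched below, assuming the SAVE_* configuration keys used by ArchiveBox 0.6 and 0.7 and its config --set subcommand. The helper name keep_only_screenshots and the exact list of keys to switch off are my own choices, so check the output of archivebox config on your version first. The small title and favicon extractors are left enabled here.

```shell
# Hypothetical helper: keep screenshots and switch off the larger formats.
# $1 = ArchiveBox data folder.
keep_only_screenshots() {
    cd "$1" || return 1
    # Disable each of the other output formats in turn.
    for key in SAVE_WGET SAVE_WARC SAVE_PDF SAVE_DOM SAVE_SINGLEFILE \
               SAVE_READABILITY SAVE_MERCURY SAVE_GIT SAVE_MEDIA \
               SAVE_ARCHIVE_DOT_ORG; do
        archivebox config --set "$key=False" || return 1
    done
    # Make sure screenshots themselves stay enabled.
    archivebox config --set SAVE_SCREENSHOT=True
}
```

Running keep_only_screenshots "$HOME/archivebox" changes the configuration of that collection only. Pages that were archived earlier keep the formats they already have.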
Adding a New User in Archivebox
To add a new user, go back to the web GUI and click the “ADMIN” button on the page’s upper bar.
Once inside the Admin Panel, go to the “Authentication and Authorization” category and select “Users.” This lists all the active users in the system.
Select the “Add User +” button in the page’s upper-right corner.
The user creation form in Archivebox can look complicated, much like adding users to a Linux group. Despite that, a new user only requires three things to function properly: a username, a password and a set of user permissions.
Provide a username for the new user account. In my case, I am using the name “alice.”
Next, provide a password, then select the user permissions for that particular user. In most cases, a regular user only needs a few of these options enabled.
Lastly, select the “SAVE” button on the page’s lower-right corner to apply your changes.
Image credit: Unsplash. All alterations and screenshots by Ramces Red.
One way to avoid problems caused by an outdated installation is to make sure that yours is always up to date. Do that by running pip3 install --upgrade archivebox.
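The upgrade can be sketched as a short helper. The pip3 command comes from the text above. Running archivebox init again afterwards, to bring an existing collection in line with the new version, is my assumption about the usual ArchiveBox upgrade routine, and the name upgrade_archivebox is made up for this example.

```shell
# Hypothetical helper: upgrade ArchiveBox and refresh an existing collection.
# $1 = ArchiveBox data folder.
upgrade_archivebox() {
    # Upgrade the package through pip.
    pip3 install --upgrade archivebox || return 1
    cd "$1" || return 1
    # Assumed follow-up: re-run init so the collection matches the new version.
    archivebox init || return 1
    # Confirm which version is now installed.
    archivebox version
}
```

For example, upgrade_archivebox "$HOME/archivebox" upgrades the package and then refreshes the collection stored in that folder.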