November 6, 2024

Automating Downloads from Websites with Dynamic URLs

When working with websites that dynamically generate URLs through PHP scripts, automating the download of specific file types can be tricky. These types of URLs often don’t provide a straightforward directory structure, making manual downloading a time-consuming task. In this article, I’ll show you how to use two powerful tools—wget and pcregrep—to automate the process of downloading Atari 8-bit .exe and .atr files from a website with dynamically generated URLs.

We’ll focus on macOS High Sierra (or greater) to demonstrate the process, but similar techniques can be applied on Linux or through the Windows Subsystem for Linux (WSL).


Goal of This Guide

The guide that follows shows you how to:

  1. Find the dynamically generated folder URLs on a website with wget.
  2. Scrape those folder pages for direct links to the file types you want.
  3. Download all of the matching files in bulk.

This process can be applied to various websites that generate content dynamically, providing a simple and effective solution for bulk downloading files like .exe and .atr.


Prerequisites: Setting Up the Tools

Before diving into the main process, there are two tools we need: wget and pcregrep. Neither comes pre-installed on macOS, so we'll use Homebrew, a package manager, to install them.

Installing Homebrew on macOS High Sierra

Since the latest versions of Homebrew no longer officially support macOS High Sierra, you may need to use an older version, but it's still functional for our needs. Here's how to install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Installing wget and pcregrep

Once Homebrew is installed, you can use it to install both tools by running the following commands in your terminal (pcregrep is installed as part of the pcre formula):

brew install wget

brew install pcre

These tools will allow us to download files and search through web pages efficiently.
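
To double-check that both tools are on your PATH, you can ask each one for its version number (the exact versions you see will vary):

wget --version | head -n 1
pcregrep --version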


Why wget Alone Isn’t Enough

While wget is a great tool for downloading files, it struggles with dynamically generated URLs, like those created by PHP scripts. Websites often don’t present files as direct links, so wget won’t be able to find them on its own.

Additionally, while wget can filter downloads by extension with its accept list (-A), that filtering works on URL filenames and doesn't help much when files sit behind dynamically generated PHP query strings. This is where pcregrep comes in, letting us filter the URLs wget discovers and keep only the files we're interested in. By combining wget and pcregrep, we can automate the process of finding and downloading specific files from these complex web pages.


Step 1: Using wget to Find Folder URLs

Before we can download files, we first need to identify the folder URLs where they are located. Let’s say you’re trying to download Atari 8-bit software files like .exe and .atr. These files are hosted on a dynamic website that lists them under folders. The folder URLs are dynamically generated and displayed on the page.

To retrieve these folder URLs, we can use the wget --spider command. This command will check all the links on the site without actually downloading the content:

wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'

This command performs two key actions:

  1. wget --spider: Checks the website recursively for links without downloading any content.
  2. pcregrep: Filters the output to show only URLs that point to the specific folders containing the files we want.

The result will look like this (URLs pointing to folder pages on the website):

https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%202nd%202004
https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%205th%202005
https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=6502%20Compo%202004

Each of these URLs points to a folder containing multiple files. These URLs, when opened in a browser, will reveal a list of files (such as .atr, .xex, .arc) inside the corresponding folder.

Note: Replace the URL above with the root URL of the website you’re targeting. The provided URL is just an example.
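
For the next step it helps to keep these folder URLs in a text file. A minimal way to do that, assuming you want to call the file folder_links.txt (the name is arbitrary), is to add sort -u and a redirect to the end of the same pipeline:

wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+' | sort -u > folder_links.txt

Here sort -u removes the duplicate entries that wget's log tends to produce, so each folder URL appears only once in folder_links.txt.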


Step 2: Scraping File URLs from the Folder Pages

Once you have the folder URLs, the next step is to extract the actual file links (e.g., .exe, .atr, .xex, .arc). These files are listed under each folder, and we need to scrape the HTML page for the direct file URLs.

You can use wget to download the content of each folder page and then use pcregrep to extract the file URLs.

Here’s how you can do it:

  1. Download the folder page using wget:

wget -q -O folder_page.html "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%202nd%202004"

  2. Extract the file URLs from the folder page using pcregrep:

pcregrep -o 'https://www\.atarionline\.pl/v01/.+\.(atr|exe|xex|arc)' folder_page.html

This will give you URLs that directly point to the .atr, .exe, and other file types you want to download. For example:

https://www.atarionline.pl/v01/files/demos/24h_7th_2005.atr
https://www.atarionline.pl/v01/files/demos/24h_7th_2005.xex
https://www.atarionline.pl/v01/files/demos/24h_7th_2005.arc

Note: If you need to download other file types, such as .pdf or .txt, simply modify the pcregrep regular expression to include those extensions.
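
Repeating those two commands by hand for every folder gets tedious, so here is a minimal sketch that loops over the folder URLs saved in folder_links.txt (from Step 1) and collects every matching file link into download_links.txt. Both filenames are just examples, and the regular expression assumes the same URL pattern as above:

while read -r folder; do
  # print each folder page to stdout and pull out the direct file links
  wget -q -O - "$folder" | pcregrep -o 'https://www\.atarionline\.pl/v01/.+\.(atr|exe|xex|arc)'
done < folder_links.txt | sort -u > download_links.txt

Using -O - makes wget write each page to standard output, so nothing needs to be saved to disk along the way.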


Step 3: Downloading the Files

Now that we have the direct download URLs for the files, we can use wget to download them. If you’ve saved these file URLs in a text file (e.g., download_links.txt), you can download all the files at once:

wget -i download_links.txt

This will automatically download all the .atr, .exe, .xex, and .arc files listed in the text file.
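
If you would like the files to land in their own directory, skip anything you have already downloaded, and go a little easier on the server, a few standard wget options cover that (the downloads/ directory name is just an example):

wget -i download_links.txt -P downloads/ -nc --wait=1

Here -P sets the destination directory, -nc (no-clobber) skips files that already exist locally, and --wait=1 pauses for a second between requests.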


Final Notes

By using wget combined with pcregrep, you can automate the process of downloading files from websites with dynamic URLs. This process can be particularly useful for downloading Atari 8-bit software files or any other files hosted on similar websites. The key steps are:

  1. Find the folder URLs with wget.
  2. Scrape the folder pages to find the actual file links.
  3. Use wget again to download those files.

This process can save a lot of time compared to manually navigating through the website and downloading each file individually.


Troubleshooting Tips

While automating the download process is efficient, you may encounter some common issues:

  • Unquoted URLs: the URLs contain characters like ? and &, which the shell will try to interpret. Always wrap the URL in quotes, as shown in the commands above.
  • Empty pcregrep output: if no URLs are printed, the site probably uses a different URL pattern than the example. Open a folder page in your browser, inspect the links, and adjust the regular expression to match.
  • Interrupted or blocked downloads: some servers throttle rapid automated requests. Adding --wait to wget, or re-running it with -nc so finished files are skipped, usually helps.


Breakdown of the Commands

Breakdown of the wget Command

The following command extracts the URLs for the software folders containing .exe and .atr files:

wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'

Here’s a breakdown of what each part does:

  1. wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp"

    • --spider: Tells wget to perform a dry run, meaning it won’t download files but will check if they exist.
    • -r: Recursively checks all links.
    • -l1: Limits recursion to 1 level deep.
    • -nd: Tells wget not to recreate the website's directory structure locally while it crawls.
  2. pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'

    • -o: Outputs only the matched part of the line (the folder URL).
    • [^ ]+: Matches one or more non-space characters, capturing the rest of the URL (see the example below).

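To see the -o behavior in isolation, you can feed pcregrep a single line shaped like wget's log output (the line below is a made-up example):

echo '--2024-11-06 12:00:00--  https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=Example%20Compo' | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'

Only the matching URL is printed. Because the space in the folder name is percent-encoded as %20, the [^ ]+ part carries the match through to the end of the URL.
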
By following this guide, you should be able to automate the process of downloading specific files from dynamic websites effectively.