November 6, 2024
When working with websites that dynamically generate URLs through PHP scripts, automating the download of specific file types can be tricky. These types of URLs often don’t provide a straightforward directory structure, making manual downloading a time-consuming task. In this article, I’ll show you how to use two powerful tools, wget and pcregrep, to automate the process of downloading Atari 8-bit .exe and .atr files from a website with dynamically generated URLs.

We’ll focus on macOS High Sierra (or greater) to demonstrate the process, but similar techniques can be applied on Linux or through the Windows Subsystem for Linux (WSL).
The guide that follows shows you how to find and download these files using wget and pcregrep. The same process can be applied to various websites that generate content dynamically, providing a simple and effective solution for bulk downloading files like .exe and .atr.
Before diving into the main process, there are a couple of tools we need: wget and pcregrep. These tools aren’t pre-installed on macOS, so you’ll need to install them. On macOS, we can use Homebrew, a package manager, to install these tools.

Since macOS High Sierra no longer officially supports the latest versions of Homebrew, you may need to use an older version, but it’s still functional for our needs. Here’s how to install Homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Once Homebrew is installed, you can use it to install wget and pcregrep by running the following commands in your terminal (pcregrep ships as part of the pcre formula):
brew install wget
brew install pcre
These tools will allow us to download files and search through web pages efficiently.
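Once both formulas are installed, you can quickly confirm that the binaries are on your PATH (the exact version numbers will vary):

wget --version | head -n 1
pcregrep --version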
While wget is a great tool for downloading files, it struggles with dynamically generated URLs, like those created by PHP scripts. Websites often don’t present files as direct links, so wget won’t be able to find them on its own.

Additionally, wget does not have the ability to filter out files based on their extensions (like .exe or .atr) in the way we need. This is where pcregrep comes in, allowing us to filter through the URLs and focus on only the files we’re interested in. By combining wget and pcregrep, we can automate the process of finding and downloading specific files from these complex web pages.
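As a quick illustration of that filtering idea (a toy example, separate from the actual workflow below), pcregrep can pick out only the names ending in the extensions we care about:

# keep only the lines ending in .exe or .atr
printf 'game.exe\nreadme.txt\ndisk.atr\n' | pcregrep '\.(exe|atr)$'
# prints game.exe and disk.atr; readme.txt is filtered out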
Before we can download files, we first need to identify the folder URLs where they are located. Let’s say you’re trying to download Atari 8-bit software files like .exe and .atr. These files are hosted on a dynamic website that lists them under folders. The folder URLs are dynamically generated and displayed on the page.

To retrieve these folder URLs, we can use the wget --spider command. This command will check all the links on the site without actually downloading the content:
wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'
This command performs two key actions:

- wget --spider: Checks the website recursively for links without downloading any content.
- pcregrep: Filters the output to show only URLs that point to the specific folders containing the files we want.

The result will look like this (URLs pointing to folder pages on the website):
https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%202nd%202004
https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%205th%202005
https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=6502%20Compo%202004
Each of these URLs points to a folder containing multiple files. These URLs, when opened in a browser, will reveal a list of files (such as .atr, .xex, .arc) inside the corresponding folder.
Note: Replace the URL above with the root URL of the website you’re targeting. The provided URL is just an example.
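To make the next step easier, you can send that filtered list of folder URLs straight into a text file. A small sketch (the file name folder_urls.txt is just an example):

wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 \
  | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+' \
  | sort -u > folder_urls.txt

The sort -u at the end simply removes any duplicate URLs the crawl may have reported.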
Once you have the folder URLs, the next step is to extract the actual file links (e.g., .exe, .atr, .xex, .arc). These files are listed under each folder, and we need to scrape the HTML page for the direct file URLs.

You can use wget to download the content of each folder page and then use pcregrep to extract the file URLs.
Here’s how you can do it. First, download the folder page with wget:

wget -q -O folder_page.html "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp&tg=24h%202nd%202004"

Then run pcregrep over the saved page to extract the file links:

pcregrep -o 'https://www\.atarionline\.pl/v01/.+\.(atr|exe|xex|arc)' folder_page.html
This will give you URLs that directly point to the .atr, .exe, and other file types you want to download. For example:
https://www.atarionline.pl/v01/files/demos/24h_7th_2005.atr
https://www.atarionline.pl/v01/files/demos/24h_7th_2005.xex
https://www.atarionline.pl/v01/files/demos/24h_7th_2005.arc
Note: If you need to download other file types, such as .txt, simply modify the pcregrep regular expression to include those extensions.
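Rather than repeating those two commands for every folder by hand, you can loop over the folder URLs collected earlier. This is only a sketch: it assumes the folder_urls.txt file from the previous step and writes the results to download_links.txt, which is used in the next step. Extend the extension list in the regular expression if you need more file types:

# fetch each folder page and collect the direct file links it contains
while read -r folder_url; do
  wget -q -O - "$folder_url" \
    | pcregrep -o 'https://www\.atarionline\.pl/v01/.+\.(atr|exe|xex|arc)' \
    >> download_links.txt
done < folder_urls.txt

# drop any duplicate links
sort -u download_links.txt -o download_links.txt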
Now that we have the direct download URLs for the files, we can use wget to download them. If you’ve saved these file URLs in a text file (e.g., download_links.txt), you can download all the files at once:
cat download_links.txt | wget -i -
This will automatically download all the .atr, .exe, .xex, and .arc files listed in the text file.
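If a large batch gets interrupted, or you’d rather keep the downloads in their own folder, wget’s standard flags can help. A minimal sketch (the downloads/ directory name is just an example):

# -c resumes partially downloaded files, -P saves everything into downloads/
wget -c -P downloads/ -i download_links.txt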
By using wget combined with pcregrep, you can automate the process of downloading files from websites with dynamic URLs. This process can be particularly useful for downloading Atari 8-bit software files or any other files hosted on similar websites. The key steps are:

- Find the folder URLs with wget --spider and pcregrep.
- Extract the direct file links from each folder page with wget and pcregrep.
- Use wget again to download those files.

This process can save a lot of time compared to manually navigating through the website and downloading each file individually.
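Putting it all together, here is a minimal sketch that chains the commands from the previous sections into a single script. The file names folder_urls.txt and download_links.txt and the downloads/ directory are just examples, and you should swap in the root URL of the site you’re targeting:

#!/bin/zsh
# Step 1: collect the dynamically generated folder URLs
wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 \
  | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+' \
  | sort -u > folder_urls.txt

# Step 2: pull the direct file links out of each folder page
while read -r folder_url; do
  wget -q -O - "$folder_url" \
    | pcregrep -o 'https://www\.atarionline\.pl/v01/.+\.(atr|exe|xex|arc)'
done < folder_urls.txt | sort -u > download_links.txt

# Step 3: download every file on the list into downloads/
wget -c -P downloads/ -i download_links.txt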
While automating the download process is efficient, you may encounter some common issues:

- Permission errors when saving files: make sure the destination folder is writable, for example with the chmod command.
- Links that appear broken or missing: test them first by running wget with the --spider flag.
- Redirects or captchas: try wget with options like --max-redirect, or consider using a headless browser automation tool like puppeteer to bypass captchas.

A couple of example commands for these checks are shown below.
Let’s take a closer look at the wget command used earlier. The following command extracts the URLs for the software folders containing .exe and .atr files:
wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp" 2>&1 | pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'
Here’s a breakdown of what each part does:
wget --spider -r -l1 -nd "https://www.atarionline.pl/v01/index.php?ct=demos&sub=cp"

- --spider: Tells wget to perform a dry run, meaning it won’t download files but will check if they exist.
- -r: Recursively checks all links.
- -l1: Limits recursion to 1 level deep.
- -nd: Prevents wget from creating a local directory hierarchy while it crawls.

pcregrep -o 'https://www\.atarionline\.pl/v01/index\.php\?ct=demos&sub=cp&tg=[^ ]+'

- -o: Outputs only the matched part of the line (the folder URL).
- [^ ]+: Matches one or more non-space characters (essentially capturing the full URL).

By following this guide, you should be able to automate the process of downloading specific files from dynamic websites effectively.