March 2, 2020
This post covers two methods for filtering URLs from webpages. The curated URLs are processed and saved to a text file, which a bash script then references to coordinate a scrape of the Wayback Machine for specific file types.
Start by navigating to the following URL, changing the example.com root domain to your target domain.
http://web.archive.org/cdx/search/cdx?url=example.com*&output=json
Output: output=json can be added to return results as a JSON array. The JSON output currently also includes a first line which indicates the CDX format.
Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
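As a sketch of what that JSON array looks like and how it might be consumed (the data row below is illustrative, not real API output):

```javascript
// The CDX JSON output is an array of arrays; the first row names the fields.
// This sample data row is illustrative only, not real API output.
const rows = [
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "XXXXXXXX", "1415"]
];
const header = rows[0];
// Turn each data row into an object keyed by the header fields.
const records = rows.slice(1).map(row =>
  Object.fromEntries(row.map((value, i) => [header[i], value]))
);
console.log(records[0].original); // the originally crawled URL
```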
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt
If you need to limit the time frame of the crawl, narrow the range by appending &from=2010&to=2018 to the end of the URL. Both parameters take timestamps in yyyyMMddhhmmss format; a partial timestamp such as a year alone, as here, also works.
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&from=2010&to=2018
You can also decrease or increase the result limit to match your needs by appending &limit=999999
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&limit=999999
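A query like the one above can be wired into the text-file workflow mentioned at the start. A minimal sketch, where DOMAIN, EXT, and urls.txt are placeholder names of my own and the curl line is commented out so nothing hits the live API by accident:

```shell
#!/bin/sh
# Build a CDX query for one domain and keep only URLs with a given extension.
# Field 3 of the default txt output is the original URL.
DOMAIN="example.com"   # placeholder: your target domain
EXT="pdf"              # placeholder: file extension to keep
CDX_URL="http://web.archive.org/cdx/search/cdx?url=${DOMAIN}*&output=txt&from=2010&to=2018&limit=999999"
echo "$CDX_URL"
# Uncomment to run against the live API and save the curated list:
# curl -s "$CDX_URL" | awk '{print $3}' | grep -i "\.${EXT}$" | sort -u > urls.txt
```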
A browser’s console allows developers and designers to instantly try out their code. To extract links from any webpage, copy one of the snippets below, paste it into your browser console, and hit Enter. Hyperlinks will be extracted from the page and printed to the console. This is one of the fastest ways to extract URLs. Several variations of the JavaScript code are given below.
The following is a cross-browser supported code for extracting URLs along with their anchor text.
var urls = $$('a'); // $$ is the DevTools shorthand for document.querySelectorAll
for (var i = 0; i < urls.length; i++) {
    console.log('#' + i + ' > ' + urls[i].innerHTML + ' >> ' + urls[i].href);
}
If you are using Chrome or Firefox use the following code for a styled version of the same.
var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
    console.log('%c#' + i + ' > %c' + urls[i].innerHTML + ' >> %c' + urls[i].href, 'color:red;', 'color:green;', 'color:blue;');
}
Use the following code to extract just the links without the anchor text.
var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
    console.log(urls[i].href);
}
External links are the ones that point outside the current domain. If you want to extract only the external URLs, then this is the code you need to use.
var links = $$('a');
for (var i = links.length - 1; i >= 0; i--) { // i >= 0 so the first link is not skipped
if (links[i].host !== location.host) {
console.log(links[i].href);
}
}
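To save the results rather than just reading them in the console, the same filter can collect the hrefs into an array. A sketch: filterExternal is a hypothetical helper name of my own, and copy() is the DevTools clipboard utility, available only in the console.

```javascript
// Collect external hrefs into an array so the list can be pasted into a text file.
// filterExternal is a hypothetical helper, not part of any library.
function filterExternal(links, pageHost) {
  var external = [];
  for (var i = 0; i < links.length; i++) {
    if (links[i].host !== pageHost) {
      external.push(links[i].href);
    }
  }
  return external;
}

// Console-only usage ($$ and copy are DevTools helpers):
if (typeof $$ === 'function' && typeof copy === 'function') {
  copy(filterExternal($$('a'), location.host).join('\n'));
}
```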
If you would like to extract links having a particular extension, then paste the following code into the console and pass the extension, wrapped in quotes, to the getLinksWithExtension() function. Please note that the following code extracts links from the HTML anchor tag only (<a></a>) and not from other tags such as script or image tags.
function getLinksWithExtension(extension) {
var links = document.querySelectorAll('a[href$="' + extension + '"]'),
i;
for (i=0; i<links.length; i++){
console.log(links[i]);
}
}
getLinksWithExtension('.mp3') // change .mp3 to any extension; the leading dot avoids matching URLs that merely end in the letters "mp3"
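If you also want URLs carried by script and image tags, a small variant can query their src attributes too. This is a sketch under my own naming: extensionSelector and getAllUrlsWithExtension are hypothetical functions, not part of any library.

```javascript
// Build a selector matching anchors, scripts and images whose URL ends with
// the given extension. Both function names are hypothetical.
function extensionSelector(extension) {
  return 'a[href$="' + extension + '"], ' +
         'script[src$="' + extension + '"], ' +
         'img[src$="' + extension + '"]';
}

function getAllUrlsWithExtension(extension) {
  var nodes = document.querySelectorAll(extensionSelector(extension));
  for (var i = 0; i < nodes.length; i++) {
    // Anchors carry href; scripts and images carry src.
    console.log(nodes[i].href || nodes[i].src);
  }
}

// Browser-only call; guarded so the sketch is inert outside a page context.
if (typeof document !== 'undefined') {
  getAllUrlsWithExtension('.js');
}
```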