Search country domains using Google

Here is a small piece of research I did two months ago: searching for country domains with Google, using my own country's TLD, “.ba”. I already wrote about Getting http response headers with python, which is basically a result of the script shown here. My approach does not use the Google API; I just use URL parameters to go through the search results:

http://www.google.com/search?q=site%3A.ba&num=100&start=0
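As a side note, the same URL does not have to be typed by hand. Here is a minimal sketch using urllib.parse.urlencode from the standard library; the function name build_search_url is made up for this example and is not part of my script:

```python
from urllib.parse import urlencode

def build_search_url(tld, start=0, num=100):
    # urlencode percent-encodes "site:.ba" to "site%3A.ba"
    params = {"q": "site:." + tld, "num": num, "start": start}
    return "http://www.google.com/search?" + urlencode(params)

print(build_search_url("ba"))
```

Calling it with a different start value, for example build_search_url("ba", start=100), gives the URL for the next page of results.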

In that URL I am searching for “site:.ba” with 100 results per page, and the start parameter steps through the result pages. Here is the simple Python 3 script:

#!/usr/bin/env python3

__author__ = 'Alen Komljen'

import urllib.request, re, time, argparse, os, platform
from socket import timeout

parser = argparse.ArgumentParser()
parser.add_argument('-r', action='store', dest='results_number', required=True, type=int, \
                    help='number of google results to check')
args = parser.parse_args()

max_results = args.results_number
start = 0
end = max_results - 100
url_list = []
google_url = "http://www.google.com/search?q=site%3A.ba&num=100&start="

pwd = os.getcwd()
system = platform.system()
if system == "Windows":
	ba_domains = open(pwd + "\\domains_google.txt", "w")
elif system == "Linux":
	ba_domains = open(pwd + "/domains_google.txt", "w")
else:
	# Stop here, otherwise ba_domains would be undefined further down
	raise SystemExit("Unsupported system!")

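if max_results % 100 != 0 or max_results < 100:
	print("Number of results must be a multiple of 100!")
else:
	while end >= start:
		request = urllib.request.Request(google_url + str(start))
		request.add_header("User-Agent", "Mozilla/5.0")
		try:
			response = urllib.request.urlopen(request, timeout=10)
			html = response.read()
		except OSError:
			# URLError and socket timeouts are both OSError subclasses; pause, then retry this page
			time.sleep(30)
			continue

		# Raw string; the dot before "ba" is escaped so it only matches a literal dot
		url_match = re.findall(r"url\?q=http://([a-z.]*\.ba)", html.decode("utf-8", errors="replace"))
		url_list.extend(url_match)

		print("Results from: " + str(start) + " - " + str(100 + start) + " finished, wait 30 seconds...")
		start += 100
		time.sleep(30)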

url_list_sorted = sorted(set(url_list))

for x in url_list_sorted:
	ba_domains.write(x + "\n")

ba_domains.close()
print("Completed! Results added to file: " + ba_domains.name)
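To see what the regular expression in the script is meant to catch, here is a small standalone sketch that runs the same kind of pattern on a made-up piece of HTML. The sample string and the name extract_domains are only for illustration, and the real Google markup may well look different:

```python
import re

# The dot before "ba" is escaped so it only matches a literal dot
DOMAIN_RE = re.compile(r"url\?q=http://([a-z.]*\.ba)")

def extract_domains(html):
    # Unique host names, sorted, like the script does before writing the file
    return sorted(set(DOMAIN_RE.findall(html)))

# Invented sample markup, not real Google output
sample = (
    '<a href="/url?q=http://www.example.ba/page&amp;sa=U">one</a>'
    '<a href="/url?q=http://example.ba/&amp;sa=U">two</a>'
    '<a href="/url?q=http://www.example.ba/other&amp;sa=U">three</a>'
)
print(extract_domains(sample))  # ['example.ba', 'www.example.ba']
```

Note that the character class only allows lowercase letters and dots, so host names with digits or hyphens would be skipped by this pattern.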

As I already said, this is a Python 3 script, so you need Python 3 installed to run it. It can be run on Windows or Linux, and the search results are written to the file “domains_google.txt” in the working directory you run the script from. To run the script:

python search_domains.py -r 500

This means the script will go through the first 500 results. After each batch of 100 results it waits 30 seconds so as not to spam Google. I am not a Python expert, so all suggestions are welcome.
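To make the paging concrete: with -r 500 and 100 results per page, the start parameter takes the values 0, 100, 200, 300 and 400. A tiny sketch of that arithmetic (the function name page_offsets is hypothetical, not something from the script):

```python
def page_offsets(max_results, page_size=100):
    # One offset per page: 0, 100, 200, ... up to max_results - page_size
    return list(range(0, max_results, page_size))

print(page_offsets(500))  # [0, 100, 200, 300, 400]
```

So a run with -r 500 makes five requests and, with the 30 second pause after each one, takes at least two and a half minutes.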