Here is small research that I did two months ago to search country domains using Google for my country TLD, “.ba”. I already wrote about Getting http response headers with python which is basically result of script that will be shown here. Approach that I used is without using Google API, so I just used URL parameters to go through search results:
http://www.google.com/search?q=site%3A.ba&num=100&start=0
As you see, I was searching for “site:.ba”, 100 results per page and used start parameter to go through all search results. So here is that simple python3 script:
#!/usr/bin/env python
__author__ = 'Alen Komljen'
import urllib.request, re, time, argparse, os, platform
from socket import timeout
parser = argparse.ArgumentParser()
parser.add_argument('-r', action='store', dest='results_number', required=True, type=int, \
help='number of google results to check')
args = parser.parse_args()
max_results = args.results_number
start = 0
end = max_results - 100
url_list = []
google_url = "http://www.google.com/search?q=site%3A.ba&num=100&start="
pwd = os.getcwd()
system = platform.system()
if system == "Windows":
ba_domains = open(pwd + "\\domains_google.txt", "w")
elif system == "Linux":
ba_domains = open(pwd + "/domains_google.txt", "w")
else:
print("Unsupported system!")
if max_results % 100 != 0 or max_results = start:
request = urllib.request.Request(google_url + str(start))
request.add_header("User-Agent","Mozilla/5.0")
try:
response = urllib.request.urlopen(request, timeout=10)
html = response.read()
except:
continue
url_match = re.findall("url\?q=http:\/\/([a-z\.]*.ba)", str(html))
if url_match != "":
for url in url_match:
url_list.append(url)
print ("Results from: " + str(start) + " - " + str(100 + start) + " finished, wait 30 seconds...")
start+=100
time.sleep(30)
url_list_sorted = sorted(set(url_list))
for x in url_list_sorted:
ba_domains.write(x + "\n")
ba_domains.close()
print("Completed! Results added to file: " + ba_domains.name)
As I already said this is python3 script, so to run it you will need to have python3 installed. Also it can be executed from Windows or Linux and search results will be written to file “domains_google.txt” at same working directory where the script is. To run the script:
python search_domains.py -r 500
This means that script will go to 500 results. After each iteration of 100 results it will wait for 30 seconds to not spam google. I am not an python expert, so all suggestions are welcome.



