Using the Python requests package to set User Agent for scraping a website
Fairly often, if I’m unable to obtain data from a website via an API, I find myself writing a program to scrape the data from the website (a note to all you Little Leaguers: always abide by the robots.txt file and be respectful in your scraping). I traditionally use Python for the scraping, with the marvelous BeautifulSoup package handling the parsing of the HTML.
I’ve done this enough times that I have a template set up, so the major work is untangling the HTML via BeautifulSoup. Recently, however, my template failed me and I couldn’t figure out why.
My standard approach
Traditionally, I have just used urllib to connect to the website, and it had never failed me. (Note that I’m just using one of my own sites, www.datomium.com, in the following examples; it wasn’t the site I had the trouble with.)
import urllib
from BeautifulSoup import BeautifulSoup as Soup

url = "http://www.datomium.com/"
soup = Soup(urllib.urlopen(url))
# Start parsing via BeautifulSoup now
For the website I was scraping, this worked fine at the highest level: for www.datomium.com itself, the soup object above was populated with the expected tree from BeautifulSoup and I could parse away. However, if I tried a lower level, say www.datomium.com/bb/, the soup object was just an empty string. I deconstructed that call and found that urlopen itself was returning nothing, although the return code indicated everything was fine.
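To illustrate the symptom, here is roughly what that deconstruction looked like (a minimal sketch in the same Python 2 setup as above): the return code looks healthy, but the body comes back empty.
import urllib

url = "http://www.datomium.com/bb/"
resp = urllib.urlopen(url)
# The return code suggests everything is fine...
print resp.getcode()
# ...but the body is empty for the problem pages
print len(resp.read())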
A quick look at the robots.txt of the site didn’t turn up anything that would indicate I couldn’t hit the /bb/ subfolder:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
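For what it’s worth, you can confirm this programmatically with the standard library’s robotparser module (Python 2; in Python 3 it lives at urllib.robotparser). A minimal sketch against my example site:
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.datomium.com/robots.txt")
rp.read()
# Prints True: /bb/ isn't covered by the Disallow rules above
print rp.can_fetch("*", "http://www.datomium.com/bb/")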
The successful approach
As it turns out, switching from urllib to the requests package was the solution.
With requests, you can set the ‘User-Agent’ header of the HTTP request, and once that was set, the site finally returned content as expected:
import requests
from BeautifulSoup import BeautifulSoup as Soup

url = "http://www.datomium.com/bb/"
page = requests.get(url, headers={'User-Agent': 'test'})
soup = Soup(page.content)
# Start parsing via BeautifulSoup now
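The placeholder value ‘test’ was enough in my case, but a string identifying your scraper is more polite, and it’s worth sanity-checking the response before parsing. A sketch along those lines (the User-Agent value here is purely hypothetical; substitute your own):
import requests
from BeautifulSoup import BeautifulSoup as Soup

url = "http://www.datomium.com/bb/"
# Hypothetical identifying User-Agent string
headers = {'User-Agent': 'my-scraper/0.1 (contact: me@example.com)'}
page = requests.get(url, headers=headers)

# Check the status code and that we actually got a body before parsing
print page.status_code, len(page.content)
soup = Soup(page.content)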