Collecting available employee counts for public companies

I worked on a project where I wanted to find employee counts for companies to use in a Bayesian prior I was using in a model. There are likely numerous places you can buy this sort of data, but I hoped to use what was publicly available.

Eventually, I was able to obtain a good amount of this data through the Yahoo Finance API. Looking at a page such as

http://finance.yahoo.com/q/pr?s=ORCL+Profile

you can see that they supply a full time employee count. So the problem now becomes to loop through all tickers to retrieve this data programmatically. The site eoddata.com proved to be a good site for obtaining a list of tickers symbols and names for companies traded on both NYSE and NASDAQ. I obtained these in a CSV form, and then had to do a little cleansing prior to interrogating via the Yahoo API. (The code below selects them from a table eoddata_cmp into which the cleansed tickers were loaded)

First off, there are a number of tickers that are essentially duplicates as far as companies go, representing classes of stock. For an example, consider Alcoa, represented by both AA and AA-B on the NYSE. While both are valid tickers, we really only want one when we go to Yahoo to find Alcoa’s employee count. So first, I eliminated such duplicates. Additionally, there are tickers that are funds, and not actual companies employing people. However, as it’s a bit tougher to ascertain which ones fall into this category, I kept them for now, knowing they would just return a 0 count from yahoo.

Finally, list of tickers in hand, I was able to write a Python script to loop through the Yahoo API and return a CSV of companies and employee counts. There are some companies which do not have employee counts available, but for the most part, it worked out well as one source for my purposes.

You can download a fairly recent version of the CSV here pub_comp_nonzero_emp

import urllib
import json
import MySQLdb

db = MySQLdb.connect("127.0.0.1","ID","pwd","bayesmatch" )
cmp_cur = db.cursor()

# URL portions
yurl_pt1="https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.stocks%20where%20symbol%3D%22"
yurl_pt2 = "%22&diagnostics=true&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys"

# Open station file and Distance File
tfile = open("c:/Bayesmatch/cmp_with_cnts.csv", 'w')


cmp_sql = """SELECT  ticker,name
	    from eoddata_cmp
	   	 	      """
cmp_cur.execute(cmp_sql)
comps = cmp_cur.fetchall()

for cmpi in comps:
		# Read JSON 
		yurl = yurl_pt1 + cmpi[0] + yurl_pt2
		jsonurl=urllib.urlopen(yurl)
		text1=json.loads(jsonurl.read())

		# Drill into results
		q1=text1["query"]
		res=q1["results"]
		stck =res["stock"]
		if stck:
			if stck.has_key('FullTimeEmployees'):
				fte = stck["FullTimeEmployees"]
			else:
				fte = 0
					
		tfile.write(stck["symbol"] +"," +cmpi[1]+"," +str(fte))
		tfile.write("\n")


tfile.close()