Facebook photo tags export
A few months ago, I scanned a few thousand old family photos and uploaded them to Facebook. Family members and friends came out of the woodwork from all over the world and helped tag them, identifying faces I barely remembered from childhood or had never known at all.
I wanted to export those photos back out of Facebook, with all those tags intact. Fortunately, https://developers.facebook.com/docs/graph-api[Facebook's API] is incredibly rich and easy to use - Facebook may be a walled garden, but it's one that's really, really easy to copy your data out of. (Note I didn't say remove.)
I used the `facebook-sdk` Python package:
[source,python]
----
import facebook

token = 'CAACEdEoseetcetcmysecrettokenwasherezEwZDZD'
g = facebook.GraphAPI(access_token=token)
----
First, I went to Facebook by hand and got the IDs of the albums I wanted to download from.
[source,python]
----
albumswanted = ['10101527783396627',
                '10101527780352727',
                '10101527672838187']  # in real life there were a lot more
----
Next, I set up a dict where I will store the data. For each album, I want to get all the photos in it. Facebook is super-nice and actually returns all the data I need in this one query (I can imagine an alternative world in which it just returned a bare list of photo IDs that I would have had to run subqueries on).

If the current page returned by `get_object` has data, I print out the photo's `id`, its `source` (a URL for the image), and then, if the photo has any face tags, the tagged names.
[source,python]
----
albums = {}
for album in albumswanted:
    curpage = g.get_object('{}/photos'.format(album),
                           fields='id,source,tags{name}')
----
If the page has data, it will have a length of 2: a `data` key and a `paging` key. When finished, `curpage` will just have an (empty) `data` key, and so it will be of length 1.
[source,python]
----
    while len(curpage) == 2:
        for photo in curpage['data']:
            if 'tags' in photo:
                print('{}\t{}'.format(photo['id'], photo['source']), end='\t')
                for tag in photo['tags']['data']:
                    print('"{}"'.format(tag['name']), end='\t')
                print()
----
When I've run out of photos on the current page, I grab the next page of data, if one exists, and go around again:
[source,python]
----
        curpage = g.get_object('{}/photos'.format(album),
                               fields='id,source,tags{name}',
                               after=curpage['paging']['cursors']['after'])
----
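For reference, the `len(curpage) == 2` test works because of the shape of a Graph API page: a non-final page has a `data` key and a `paging` key, while the final page comes back with only an empty `data` list. Roughly like this - the field names follow the Graph API, but the ids, URL, and cursors below are made-up placeholders:

```python
# Rough shape of a non-final page (placeholder values, not real data):
curpage = {
    'data': [
        {'id': '1111',
         'source': 'https://example.invalid/photo.jpg',
         'tags': {'data': [{'name': 'Some Person'}]}},
    ],
    'paging': {'cursors': {'before': 'AAAA', 'after': 'BBBB'}},
}
assert len(curpage) == 2   # keeps the while loop going

# The last page has just an empty data list:
lastpage = {'data': []}
assert len(lastpage) == 1  # the while loop exits
```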
This script outputs lines like:
----
10101527734988637	https://scontent.xx.fbcdn.net/hphotos-xpf1/t31.0-8/s720x720/10511474_10101527734988637_3611385766771831953_o.jpg	"Dede Shanok Drucker"	"Daniel M Drucker"
----
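Each line is tab-separated: the photo id, the source URL, then each tagged name in double quotes. If you later need those fields back out, a small parser might look like this (a sketch of my own, not part of the original script; the URL in the example is a placeholder):

```python
def parse_tag_line(line):
    """Split a tags.txt line into (photo_id, url, [names])."""
    fields = [f.strip() for f in line.rstrip('\n').split('\t') if f.strip()]
    photo_id, url = fields[0], fields[1]
    names = [f.strip('"') for f in fields[2:]]
    return photo_id, url, names

photo_id, url, names = parse_tag_line(
    '10101527734988637\thttps://example.invalid/photo.jpg\t'
    '"Dede Shanok Drucker"\t"Daniel M Drucker"\t')
# names is now ['Dede Shanok Drucker', 'Daniel M Drucker']
```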
I saved the output of that script to `tags.txt`. Next, I grabbed all of those image files:

[source,shell]
----
awk '{print $2}' tags.txt | wget -i -
----
Note, however, that the files I retrieved are not the full resolution images! Yes, I could have retrieved the full resolution images, but that wouldn't actually help me - Facebook strips original Exif information, and in any case, I want to identify the subset of these images I have on my local disk that have been tagged on Facebook.
I now have a folder `facebook` full of images I just downloaded. I also have a folder `originals` which contains (a larger set of) images, with different filenames and higher resolution, some of which correspond to the images in the `facebook` folder. I want to determine a mapping between the two.
Because the image transform is so simple (just resizing, no other transformation), really any image comparison tool would probably work - even something incredibly naive like histogram matching. I ended up using the first thing I came across, the pHash library, and it worked great:
[source,python]
----
import phash
import glob
import itertools

thumbhash = {}
orighash = {}
for fn in glob.glob('/path/to/facebook/*.jpg'):
    print('hashing {}'.format(fn))
    thumbhash[fn] = phash.image_digest(fn)
for fn in glob.glob('/path/to/originals/*.jpg'):
    print('hashing {}'.format(fn))
    orighash[fn] = phash.image_digest(fn)
----
I use `itertools.product` to check every hash in the facebook images against every hash in the originals. It's a Cartesian product, so it's a large number of comparisons, but it's actually quite fast - the slow part was computing the hashes in the first place.
[source,python]
----
for element in itertools.product(thumbhash, orighash):
    cc = phash.cross_correlation(thumbhash[element[0]], orighash[element[1]])
    if cc > .99:  # it should really be pretty perfect
        print(element)
----
This generates lines like:

----
('10628778_10101527712643417_5250594863317404614_o.jpg', '/home/dmd/Dropbox/scan2014/feb93/Feb93alb_p_026.JPG')
----
Using those and the tags from `tags.txt`, I can then just use ExifTool to add the tags I want to the right image files!
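The ExifTool invocation might look something like this - a sketch only, where the `exiftool_args` helper and the choice of `XMP:Subject` as the keyword tag are assumptions of mine, not the exact command I ran (`+=` appends to a list-type tag, and `-overwrite_original` skips the backup copy):

```python
import subprocess

def exiftool_args(original_path, names):
    # Build an exiftool command that appends each tagged name
    # to the image's XMP:Subject keyword list.
    args = ['exiftool', '-overwrite_original']
    args += ['-XMP:Subject+={}'.format(name) for name in names]
    args.append(original_path)
    return args

# For each (facebook file -> original file) match from the pHash step,
# look up the names from tags.txt and run something like:
#     subprocess.check_call(exiftool_args(original, names))
```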