2015-04-14

Facebook photo tags export

A few months ago, I scanned a few thousand old family photos and uploaded them to Facebook. Family members and friends came out of the woodwork from all over the world and helped tag them, identifying faces I barely remembered from childhood or had never known at all.

I wanted to export those photos back out of Facebook, with all those tags intact. Fortunately, https://developers.facebook.com/docs/graph-api[Facebook's API] is incredibly rich and easy to use - Facebook may be a walled garden, but it’s one that’s really, really easy to copy your data out of. (Note I didn’t say remove.)

I used the facebook-sdk Python package.

import facebook
token = 'CAACEdEoseetcetcmysecrettokenwasherezEwZDZD'
g = facebook.GraphAPI(access_token=token)

First I went to Facebook by hand and got the album ids I wanted to download from.

albumswanted = ['10101527783396627',
                '10101527780352727',
                '10101527672838187'] # in real life there were a lot more

Next, I set up a dict to hold the data. For each album, I want every photo in it. Facebook is super-nice and actually returns all the data I need in this one query (I can imagine an alternative world in which it returned just a bare list of photo ids, and I would have had to run a subquery for each one).

If the current page returned by get_object has data, then for each photo that has face tags I print the photo’s id, its source (a URL for the image), and the names in its tags.

albums = {}
for album in albumswanted:
    curpage = g.get_object('{}/photos'.format(album),
                           fields='id,source,tags{name}')

If the page has data, it will have a length of 2: a data key and a paging key. When there are no more photos, curpage will have just an (empty) data key, and so it will be of length 1.

    while len(curpage) == 2:
        for photo in curpage['data']:
            if 'tags' in photo:
                # one tab-separated line per tagged photo: id, image URL, quoted names
                names = ['"{}"'.format(tag['name']) for tag in photo['tags']['data']]
                print('\t'.join([photo['id'], photo['source']] + names))

When I’ve run out of photos on the current page, I grab the next page of data, if one exists, and do it again.

        curpage = g.get_object('{}/photos'.format(album),
                               fields='id,source,tags{name}',
                               after=curpage['paging']['cursors']['after'])

This script outputs lines like:

    10101527734988637	https://scontent.xx.fbcdn.net/hphotos-xpf1/t31.0-8/s720x720/10511474_10101527734988637_3611385766771831953_o.jpg	"Dede Shanok Drucker"	"Daniel M Drucker"
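
As an aside, the cursor-following loop above can be written as a generator that does not depend on the response having exactly one or two keys. This is only a minimal sketch: `iter_items`, `fetch_page`, `FAKE_PAGES` and `fake_fetch` are hypothetical names, and the fake pages merely imitate the data/paging shape described above so the example runs without a token.

```python
def iter_items(fetch_page):
    """Yield every item from a cursor-paginated endpoint.

    fetch_page(after) is a hypothetical callable: it takes a cursor
    (None for the first page) and returns a dict shaped like the
    responses described in the post ('data' plus optional 'paging').
    """
    after = None
    while True:
        page = fetch_page(after)
        items = page.get('data', [])
        if not items:
            return  # an empty page means we are done
        for item in items:
            yield item
        # follow the 'after' cursor if the page provides one
        after = page.get('paging', {}).get('cursors', {}).get('after')
        if after is None:
            return


# Stand-in for the real API (an assumption for illustration only):
# two pages of fake photos, then an empty page.
FAKE_PAGES = {
    None: {'data': [{'id': '1'}, {'id': '2'}],
           'paging': {'cursors': {'after': 'c1'}}},
    'c1': {'data': [{'id': '3'}],
           'paging': {'cursors': {'after': 'c2'}}},
    'c2': {'data': []},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


photo_ids = [photo['id'] for photo in iter_items(fake_fetch)]
```

With the real client, a small wrapper around the g.get_object call shown earlier could play the role of `fetch_page`.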

I saved the output of that script to tags.txt. Next, I grabbed all of those image files:

awk '{print $2}' tags.txt | wget -i -

Note, however, that the files I retrieved are not the full resolution images! Yes, I could have retrieved the full resolution images, but that wouldn’t actually help me - Facebook strips original Exif information, and in any case, I want to identify the subset of these images I have on my local disk that have been tagged on Facebook.

I now have a folder facebook full of images I just downloaded. I also have a folder originals which contains (a larger set of) images, with different filenames and higher resolution, some of which correspond to the images in the facebook folder. I want to determine a mapping between the two.

Because the image transform is so simple (just resizing, no other transformation), really any image comparison tool would probably work - even something incredibly naive like histogram matching. I ended up using the first thing I came across, the pHash library, and it worked great:

import phash
import glob
import itertools

thumbhash = {}
orighash = {}

for fn in glob.glob('/path/to/facebook/*.jpg'):
    print('hashing {}'.format(fn))
    thumbhash[fn] = phash.image_digest(fn)

for fn in glob.glob('/path/to/originals/*.jpg'):
    print('hashing {}'.format(fn))
    orighash[fn] = phash.image_digest(fn)
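
For comparison, the naive histogram matching mentioned above could look roughly like the sketch below. It is not what this post uses. The "images" here are toy flat lists of grayscale values rather than real files, and `normalized_histogram` and `histogram_intersection` are hypothetical helper names. Because a histogram ignores where pixels are, two different photos with similar tones could score as a match, which is why it counts as naive.

```python
def normalized_histogram(pixels, bins=16):
    """Fraction of pixels falling into each of `bins` brightness ranges."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // 256] += 1
    total = float(len(pixels))
    return [c / total for c in counts]


def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms; larger means more alike."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy data (an assumption for illustration): a 64x64 horizontal gradient,
# a crude 2x downscale of it, and a much darker variant.
full = [(x * 4) % 256 for y in range(64) for x in range(64)]
small = [(x * 4) % 256 for y in range(0, 64, 2) for x in range(0, 64, 2)]
darker = [v // 4 for v in full]

same_score = histogram_intersection(normalized_histogram(full),
                                    normalized_histogram(small))
diff_score = histogram_intersection(normalized_histogram(full),
                                    normalized_histogram(darker))
```

The resized copy keeps essentially the same histogram as the original, so `same_score` comes out far higher than `diff_score`.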

I use itertools.product to check every hash in the facebook images against every hash in the originals. It’s a Cartesian product, so it’s a large number of comparisons, but it’s actually quite fast - the slow part was computing the hashes in the first place.

for thumb, orig in itertools.product(thumbhash, orighash):
    # compare the digest of one facebook image with the digest of one original
    cc = phash.cross_correlation(thumbhash[thumb], orighash[orig])
    if cc > .99:        # it should really be pretty perfect
        print((thumb, orig))

This generates lines like:

('10628778_10101527712643417_5250594863317404614_o.jpg', '/home/dmd/Dropbox/scan2014/feb93/Feb93alb_p_026.JPG')
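
A fixed threshold can in principle report several originals for one thumbnail, or none. One possible refinement, sketched here under assumptions, keeps only the best-scoring original for each thumbnail. `best_matches` and `toy_similarity` are hypothetical names, and the toy digests are plain numbers standing in for real pHash digests; in practice the `similarity` argument would be something like the cross-correlation call used above.

```python
import itertools


def best_matches(thumbs, originals, similarity, threshold=0.99):
    """Map each thumbnail name to its highest-scoring original.

    thumbs and originals are dicts of name -> digest; similarity is any
    function returning a score where higher means more alike. Thumbnails
    with no score above the threshold are left out of the result.
    """
    best = {}
    for t, o in itertools.product(thumbs, originals):
        score = similarity(thumbs[t], originals[o])
        if score > threshold and score > best.get(t, (None, threshold))[1]:
            best[t] = (o, score)
    return {t: o for t, (o, _score) in best.items()}


# Toy stand-ins (assumptions for illustration): digests are single numbers.
def toy_similarity(a, b):
    return 1.0 - abs(a - b)


thumbs = {'fb_a.jpg': 0.50, 'fb_b.jpg': 0.90}
originals = {'scan_1.jpg': 0.501, 'scan_2.jpg': 0.505, 'scan_3.jpg': 0.10}

matches = best_matches(thumbs, originals, toy_similarity)
```

Here both scan_1.jpg and scan_2.jpg clear the threshold for fb_a.jpg, but only the closer one is kept, and fb_b.jpg has no match at all.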

Using those and the tags from tags.txt, I can then just use ExifTool to add the tags I want to the right image files!
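
That last step could be sketched roughly as follows. This is an assumption-laden illustration, not the exact commands used: the post does not say which metadata field received the names, so the sketch picks Keywords and uses ExifTool's `-keywords+=` form, which appends to a list tag. `parse_tag_line` and `exiftool_command` are hypothetical helper names, the URL and path are placeholders, and the command is only built, never executed.

```python
def parse_tag_line(line):
    """Split one tab-separated tags.txt line into (photo_id, url, names)."""
    # drop empty fields, e.g. from a trailing tab at the end of the line
    fields = [f for f in line.rstrip('\n').split('\t') if f.strip()]
    photo_id, url = fields[0], fields[1]
    names = [f.strip().strip('"') for f in fields[2:]]
    return photo_id, url, names


def exiftool_command(original_path, names):
    """Build (but do not run) an exiftool argument list adding keywords."""
    args = ['exiftool']
    for name in names:
        # assumption: store each tagged name as a Keywords entry
        args.append('-keywords+={}'.format(name))
    args.append(original_path)
    return args


# Placeholder URL and path, for illustration only.
line = ('10101527734988637\thttps://example.invalid/photo.jpg\t'
        '"Dede Shanok Drucker"\t"Daniel M Drucker"\t\n')
photo_id, url, names = parse_tag_line(line)
cmd = exiftool_command('/path/to/originals/scan.jpg', names)
```

The resulting list could be handed to subprocess.call once the mapping from each facebook file to its original has been looked up.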