Auto face cropper

Related files

Even if you can’t achieve Facebook-level face detection, it’s still useful to be able to write tailored detection code that meets your needs.

For example, if you’re creating a data site based on U.S. Congress membership, you’ll probably want to show each congressmember’s face: faces add visual impact to a page, and these particular images are free to use and conveniently collected by the Sunlight Foundation.

You can clone their GitHub repo, but be warned: it’s quite massive. You can practice with my sample of 30 photos here:

The photos themselves are nice and clear. But if you lay them out in a grid, you’ll see a visual inconsistency: some images include just the head and shoulders, while others are framed from the waist up:


With 500+ sitting members of Congress, it would be painful to pick out the photos that need cropping and then manually crop them. However, we only need a fairly basic implementation of face detection for a Python script: in this case, every congressmember’s photo has a relatively obvious face, so we can set the detectMultiScale() parameters to be very loose in finding candidates and then just pick the biggest detected face…on the assumption that the biggest detected face is the actual face.
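The loose-parameters-then-pick-the-biggest approach can be sketched in Python with OpenCV. This is a sketch, not the lesson’s actual script: the function names and the crop helper are illustrative, though the bundled `haarcascade_frontalface_default.xml` cascade and the `detectMultiScale()` call are standard OpenCV.

```python
def pick_biggest(faces):
    """Return the (x, y, w, h) rectangle with the largest area, or None."""
    return max(faces, key=lambda r: r[2] * r[3], default=None)


def crop_biggest_face(src_path, dest_path):
    # cv2 is imported here so the pure helper above has no dependencies
    import cv2
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(src_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Very loose parameters: small scaleFactor steps and a low minNeighbors
    # threshold turn up lots of candidate rectangles, false positives included.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=2)
    best = pick_biggest(faces)
    if best is not None:
        x, y, w, h = best
        cv2.imwrite(dest_path, img[y:y + h, x:x + w])
    return best
```

False positives don’t matter much here, because a stray detection on a lapel or a hand will almost always be smaller than the real face.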


Download the sample files:

mkdir -p /tmp/testpeople
cd /tmp/testpeople
# download the file
curl -O $url
# unzip the zip (filename matches whatever $url points to)
unzip "$(basename "$url")"

Run the script:

mkdir -p /tmp/testfaces # a new directory to save the faces
cd /tmp/testpeople # just in case you aren't there already...
for f in unitedstates-images-originals/*.jpg; do
  # facecrop.py is a stand-in name for this lesson's face-cropping script
  python facecrop.py $f -d /tmp/testfaces
done

Pre-Crop (e.g. /tmp/testpeople):


Post-Crop (e.g. /tmp/testfaces):


Random stuff

These are scripts to prepare the data for this lesson. No context is given.

# download the repo archive ($url is elided in the original; the images-gh-pages/
# directory below suggests the gh-pages branch of the unitedstates/images repo)
curl -L -o images-gh-pages.zip $url
unzip images-gh-pages.zip
mkdir -p unitedstates-images-originals
# copy over the original images
cp -r images-gh-pages/congress/original/ unitedstates-images-originals
# copy over necessary text files
cp images-gh-pages/{LICENSE,*.md} unitedstates-images-originals
# create the zip (archive name assumed)
zip -r unitedstates-images-originals.zip unitedstates-images-originals/
# copy using AWS S3
aws s3 cp unitedstates-images-originals.zip s3://YOURBUCKETNAME --acl public-read

Make an excerpt file of the images

# gshuf shuffles things and is part of OS X homebrew coreutils
# sample archive name assumed
zip -r congress-sample.zip \
  unitedstates-images-originals/{LICENSE,*.md} \
  $(ls unitedstates-images-originals/*.jpg |
    gshuf | head -n 30) # don't normally parse ls output like this...
aws s3 cp congress-sample.zip s3://YOURBUCKETNAME --acl public-read

To use:

curl -O $url
unzip "$(basename "$url")"