Sets

One of my absolute favorite things about Python is the rich collection of data containers you have at your disposal: lists, dictionaries, and tuples, of course, but also the collections module (OrderedDicts, namedtuples, Counters) and the topic of this post: sets.

You see, if you’re coming from PHP, there are arrays. And then there are arrays within arrays. You can have any data structure you like, so long as it starts with array(. You know, array(array(array(array(....

So, coming from the impoverished data structures of PHP, Python’s seem both luxurious and overwhelming. Sets, for instance, didn’t make much sense to me at first. Sets are like lists, only they’re unordered and can’t contain duplicates. Those sound like handy features to have — I think I’ll just use lists!

And when lists are needed, use lists. But there are actually many common situations where people often just use lists when sets can solve a lot of problems.

Set Operations

Let’s say you’re validating a CSV file. You want to make sure that it contains some required headers. Here they are as a set:

required = {'id', 'name', 'address', 'title'}

(The brace notation for sets isn’t available in Python 2.6 and below: you have to use set([...]).)

Ok, now let’s get our headers from our csv:

with open('myfile.csv', 'rU') as f:
    reader = csv.reader(f, delimiter=',')
    headers = reader.next()

Rather than doing something silly like looping through required and testing for membership in headers, we can take advantage of the fact that sets are designed for super fast membership lookups like this:

if required.issubset(set(headers)):
    print('Good!')

issubset is just one of the handy set operations available to you: union, difference, etc. Do dir(set()) for a list. These kinds of operations make up for a fairly large amount of the things you often do with lists, and once you’re aware of sets, you’ll keep finding reasons to use them.

Speed

This makes for clean code, but it’s also faster. For instance, lets make up a test case here. Create two sets consisting of 200 random integers between 1 and 1000:

>>> one = {random.choice(range(1, 1000)) for x in range(0, 200)}
>>> two = {random.choice(range(1, 1000)) for x in range(0, 200)}

Let’s say we want to find what item’s they have in common. In one corner, we will take advantage of set’s intersection method:

>>>  one.intersection(two)

In the other corner, we will loop over one set and pull out the items in both sets, shown here with a list comprehension:

>>> [x for x in one if x in two]

Which is faster? Let’s use timeit:

>>> from timeit import timeit
>>> setup = 'from __main__ import one, two'
>>> timeit('one.intersection(two)', setup)
39.309852838516235
>>> timeit('[i for i in one if i in two]', setup)
120.47418093681335

A few notes:

  1. timeit is pretty unfriendly to use isn’t it? If you haven’t used it before, The first argument is a statement that is executed one million times, and the second argument is a setup for the first argument: here, importing one and two from my current module.
  2. set.intersection is 3 times faster!
  3. I actually did this in Pythonista on my iPad, so these times are likely very slow!

Why are sets faster? The implementation of lists basically mirrors the way lists work: items gets added to a list, at a particular position, and the list doesn’t care whether it has 1 or 100 copies of what you’re adding. So if you ask whether a particular item is in a list, Python pretty much has to do exactly what we would have to do: loop through each item in the list and ask, “Is this what you’re looking for?”

Sets, however, are not stored like this. When you add a value to a set, Python hashes the value and stores it at a specific point in the hash based on that hash value. This means that looking up an item means hashing the value you’re asking for and then trying to retrieve that value from the set’s hash table.

This is why sets require unique items (duplicate items have the same hash value), and also why sets are unordered (because sets return their values in their hashed order).

For a more thorough explanation, check out Brandon Rhodes talk, The Mighty Dictionary. Dictionaries? Sets are basically dictionaries with only keys.

You can see the effect of this when you compare the result of asking for membership in a list versus a set. Here we’ll take a random value from our one set (which we’ll convert to a list first: random.choice wants a list) and then ask whether or not it’s in our two set. Then we’ll convert two to a list and try the same thing:

>>> one = list(one)
>>> setup = 'from __main__ import one, two; import random'
>>> timeit('random.choice(one) in two', setup)
8.184441089630127
>>> two = list(two)
>>> timeit('random.choice(one) in two', setup)
28.87601089477539

Again, about 3.5 times as fast to test for membership in a set vs a list…but this isn’t a very big list. Let’s make them bigger. Let’s create two lists/sets with 1,000 random integers between 1 and 10,000. Again, one is a list, and two is first a set, then a list:

>>> one = [random.choice(range(1, 10000)) for x in range(0, 1000)]
>>> two = {random.choice(range(1, 10000)) for x in range(0, 1000)}
>>> timeit('random.choice(one) in two', setup)
8.436138153076172
>>> two = list(two)
>>> timeit('random.choice(one) in two', setup)
138.07221007347107

Game over. It’s basically just as fast to do the lookup on a 1,000 item set as before, but 5 times slower for the 1,000 list than the 200 item list.