Sets in Python are often used for two purposes:
1. Removing the duplicate entries in a collection
2. For membership testing.
By membership, here we mean to find existence of element in a collection
The focus of this post is to evaluate performance of list, tuple and set data structures with respect to each other on the basis of membership testing.
How do I do it? Well, some code, some data collection and some analysis
[sourcecode language=”python”]
from datetime import datetime
listA = range(10000000)
setA = set(listA)
tupA = tuple(listA)
def calc(data, type):
start = datetime.now()
if data in type:
print ""
end = datetime.now()
print end-start
calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)
[/sourcecode]
Below is the average of 10 iterations for the time taken by lists/tuples and sets for testing membership of 9999, 99999, 999999 in 10000000
Search 9999 in 10000000 | ||
list | tuple | set |
0.8 | 0.8 | 1.9 |
Search 99999 in 10000000 | ||
list | tuple | set |
2.6 | 2.8 | 1.7 |
Search 999999 in 10000000 | ||
list | tuple | set |
21.4 | 21.6 | 0.8 |
Conclusions
1. for testing membership of elements at the beginning of the collection, lists or tuples perform more than 100% better than sets
2. As soon as you start testing membership of elements that are in the middle or end of the collection, sets perform 40% – 1800% better than lists or tuples
You now have a fair idea why you should think of using sets for large collections…
Note
Hardware I used: windows 7, x64, 4GB RAM
Python version: 2.7.2
Interesting! could we have a mechanism or an intelligent API where given a list, the membership query function transparently chooses the best construct among set, tuples or list before querying on the same list?
Nice Work, Thanks !