Did a quick comparison of some data serialization options for Python. My requirements for the serialization format were the following:
- Input data is typically either a list or a dictionary.
- Interoperability is important: the format must be usable from at least C.
- A human readable format is desirable but not necessary.
Based on the requirements, I took a look at the following Python packages:
- PyYAML
- python-cjson
- ujson
- u-msgpack-python
- msgpack-python (NOTE: disqualified because I had trouble running it on Windows)
The following Python script was used to test out the various packages:
    # Python 2 script (print statements, time.clock()).
    import os
    import time
    import umsgpack
    import yaml
    import cjson
    import ujson

    # Test payload: 10000 identical records mixing an int, a list and a string.
    DATA = [{'val1':12345, 'val2':[1,2,3,4,5], 'val3':"12345"} for _ in range(10000)]

    def test_serialization(name, encode, decode):
        """Time one encode/decode round trip and check the data survives it."""
        print name
        print " Encoding..."
        t_start = time.clock()
        packed = encode(DATA)
        print " time = %f seconds" % (time.clock() - t_start)
        print " size = %u kilobytes" % (len(packed) / 1024)
        print " Decoding..."
        t_start = time.clock()
        unpacked = decode(packed)
        print " time = %f seconds" % (time.clock() - t_start)
        print " same = %r" % (DATA == unpacked)

    test_serialization("umsgpack", umsgpack.packb, umsgpack.unpackb)
    test_serialization("yaml", yaml.dump, yaml.load)
    test_serialization("cjson", cjson.encode, cjson.decode)
    test_serialization("ujson", ujson.encode, ujson.decode)
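The script above targets Python 2: it uses print statements, and time.clock() was removed in Python 3.8. A minimal Python 3 sketch of the same harness might look like the following. It assumes time.perf_counter() as the timer and uses the standard library json module purely as a stand-in codec so that it runs without third-party packages. The name time_codec and the returned dictionary are illustrative and not part of the original script.

```python
import json
import time

# Same payload as the original script.
DATA = [{'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"} for _ in range(10000)]

def time_codec(name, encode, decode, data=DATA):
    """Time one encode/decode round trip and report size and equality."""
    t_start = time.perf_counter()
    packed = encode(data)
    encode_s = time.perf_counter() - t_start

    t_start = time.perf_counter()
    unpacked = decode(packed)
    decode_s = time.perf_counter() - t_start

    result = {
        "name": name,
        "encode_s": encode_s,
        "decode_s": decode_s,
        "size_kb": len(packed) // 1024,  # floor division, like the Python 2 script
        "same": data == unpacked,
    }
    print("%(name)s: encode %(encode_s)f s, decode %(decode_s)f s, "
          "size %(size_kb)d kilobytes, same = %(same)r" % result)
    return result

# Stand-in codec (assumption): pass another package's encode/decode pair to compare.
result = time_codec("json (stdlib)", json.dumps, json.loads)
```

Returning the measurements instead of only printing them makes it easier to repeat each run a few times and keep the best time, which a single-shot measurement like the one above does not do.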
The result of running this script on my laptop (Intel Core i7 2670QM) is the following:
    umsgpack
     Encoding...
     time = 0.390241 seconds
     size = 341 kilobytes
     Decoding...
     time = 0.430256 seconds
     same = True
    yaml
     Encoding...
     time = 8.266586 seconds
     size = 527 kilobytes
     Decoding...
     time = 15.943908 seconds
     same = True
    cjson
     Encoding...
     time = 0.030977 seconds
     size = 576 kilobytes
     Decoding...
     time = 0.022119 seconds
     same = True
    ujson
     Encoding...
     time = 0.013703 seconds
     size = 478 kilobytes
     Decoding...
     time = 0.018000 seconds
     same = True
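To put the timings in proportion, here is a small sketch that divides each package's time by the fastest one. The numbers are copied from the output above; the helper name slowdown_vs is only illustrative.

```python
# Timings in seconds, copied from the output above.
TIMES = {
    "umsgpack": {"encode": 0.390241, "decode": 0.430256},
    "yaml":     {"encode": 8.266586, "decode": 15.943908},
    "cjson":    {"encode": 0.030977, "decode": 0.022119},
    "ujson":    {"encode": 0.013703, "decode": 0.018000},
}

def slowdown_vs(baseline, times=TIMES):
    """Return each package's time divided by the baseline package's time."""
    base = times[baseline]
    return {
        name: {op: t[op] / base[op] for op in ("encode", "decode")}
        for name, t in times.items()
    }

ratios = slowdown_vs("ujson")
# Print fastest encoder first.
for name in sorted(ratios, key=lambda n: ratios[n]["encode"]):
    print("%-9s encode x%.1f  decode x%.1f"
          % (name, ratios[name]["encode"], ratios[name]["decode"]))
```

On these numbers cjson is roughly 2 times slower than ujson at encoding and only a little slower at decoding, umsgpack is roughly 24 to 28 times slower, and PyYAML is several hundred times slower in both directions.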
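For my particular application, speed is more important than the size of the serialized data. The clear winner for speed is ujson, which encoded about twice as fast as cjson and decoded somewhat faster as well. For size, umsgpack produced the smallest output (341 kilobytes versus 478 kilobytes for ujson), which makes sense since MessagePack is a binary format.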
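A per-record byte count makes the size ordering concrete. The following is a minimal sketch, assuming Python 3. Here pack is a hypothetical hand-written encoder covering only the few MessagePack types this record needs (it is not the umsgpack API), and the standard library json module stands in for the JSON packages.

```python
import json
import struct

def pack(obj):
    """Encode a tiny subset of MessagePack (illustrative sketch only)."""
    if isinstance(obj, bool):
        raise TypeError("bool is outside this sketch")
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return struct.pack("B", obj)                 # positive fixint, 1 byte
        if 0x7F < obj <= 0xFF:
            return struct.pack(">BB", 0xCC, obj)         # uint 8, 2 bytes
        if 0xFF < obj <= 0xFFFF:
            return struct.pack(">BH", 0xCD, obj)         # uint 16, 3 bytes
        raise ValueError("integer range outside this sketch")
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("string too long for this sketch")
        return struct.pack("B", 0xA0 | len(raw)) + raw   # fixstr, 1 + n bytes
    if isinstance(obj, list):
        if len(obj) <= 15:
            header = struct.pack("B", 0x90 | len(obj))   # fixarray, 1-byte header
        elif len(obj) <= 0xFFFF:
            header = struct.pack(">BH", 0xDC, len(obj))  # array 16, 3-byte header
        else:
            raise ValueError("list too long for this sketch")
        return header + b"".join(pack(item) for item in obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("dict too large for this sketch")
        parts = [struct.pack("B", 0x80 | len(obj))]      # fixmap, 1-byte header
        for key, value in obj.items():
            parts.append(pack(key))
            parts.append(pack(value))
        return b"".join(parts)
    raise TypeError("type outside this sketch: %r" % type(obj))

RECORD = {'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"}
DATA = [dict(RECORD) for _ in range(10000)]

spaced = json.dumps(RECORD)                          # ", " and ": " separators
compact = json.dumps(RECORD, separators=(",", ":"))  # no whitespace
binary = pack(RECORD)

print("JSON with spaces: %d bytes per record" % len(spaced))
print("compact JSON:     %d bytes per record" % len(compact))
print("MessagePack:      %d bytes per record" % len(binary))
print("full list as JSON with spaces: %d kilobytes" % (len(json.dumps(DATA)) // 1024))
print("full list as compact JSON:     %d kilobytes"
      % (len(json.dumps(DATA, separators=(",", ":"))) // 1024))
```

Under these assumptions one record takes 57 bytes as JSON with a space after each separator, 48 bytes as compact JSON, and 31 bytes as MessagePack. The two JSON variants scale to 576 and 478 kilobytes for the full list, matching the cjson and ujson figures, so the gap between those two appears to be whitespace only. The measured 341 kilobytes for umsgpack is about 35 bytes per record, slightly more than this sketch; a plausible but unconfirmed cause is that Python 2 str values are packed as byte strings with a two-byte header.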
Overall, I am very impressed by the performance of ujson. Given the ubiquity of JSON for web-based data, it makes sense that ultra-optimized libraries would exist for it. While I love YAML as a data format, the performance of the PyYAML library is not suitable for applications requiring fast encoding/decoding times.