Part 2 Dart vs Go vs Python (and PyPy) Performance

zihua 2014-01-20 23:01:48


UPDATE: In the comments Will pointed out that I had a bug in the Python code, so I posted an update at 22:00 EST (sorry).

UPDATE 2: Anonymous posted a more comparable version of the Go code (http://pastie.org/3772555), so I updated the numbers on 04-12-2012 at 09:00 EST (sorry). Please let me know if there are any other bug fixes!

My previous post, Surprising performance from Dart, inspired me to do some more test runs with Dart, Go, Python and PyPy. For round two of these simple performance comparisons I decided to do some basic file reading and processing. Since the vast majority of the data I process is JSON, I generated some random JSON data and stored it in a file. Each language reads the entire file and, for each line, parses the JSON and aggregates it. Assuming I wrote the code right (big assumption), this gives me an idea of which language is best for processing large amounts of JSON.

The code is broken down into a module (package) that contains an Aggregator class (type) and a main file that reads from the data file. The repository is hosted on Bitbucket.
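For readers who want the shape of the benchmark without opening the repository, here is a minimal Python sketch of the two pieces described above. It is not the code from the repository: the method names, the `aggregate_lines` helper, the `"list"` field name and the record layout are all assumptions made for illustration.

```python
import json


class Aggregator:
    """Accumulates running totals over parsed JSON documents."""

    def __init__(self):
        self.count = 0        # documents seen
        self.list_size = 0    # total number of list elements
        self.list_sum = 0     # sum of all list elements

    def add(self, doc):
        # "list" is a hypothetical field name, not taken from the original data.
        numbers = doc["list"]
        self.count += 1
        self.list_size += len(numbers)
        self.list_sum += sum(numbers)

    def average(self):
        # Guard against dividing by zero when nothing was aggregated.
        return self.list_sum / self.list_size if self.list_size else 0.0


def aggregate_lines(lines):
    """Parse one JSON document per line and fold it into an Aggregator."""
    agg = Aggregator()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            agg.add(json.loads(line))
    return agg


if __name__ == "__main__":
    sample = ['{"list": [1, 2, 3]}', '{"list": [4, 5]}']
    agg = aggregate_lines(sample)
    print("Total Count:", agg.count)
    print("Total List size:", agg.list_size)
    print("Total list sum:", agg.list_sum)
    print("Average list number:", agg.average())
```

Because `aggregate_lines` accepts any iterable of strings, an open file object can be passed in directly, in which case lines are read one at a time.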

Dart code:

I was able to code this in an async style, which in theory would maximize IO performance if Dart supported async IO. I was a bit shaky at first with the terse syntax of the onLine callback, but I am getting more comfortable with it. The duplicate use of StringInputStream seems a bit wordy; I probably could have just written "var stream" instead of "StringInputStream stream" to increase readability (Go got this right with the := syntax). Otherwise the code is very straightforward.

Since Dart is designed for the browser and is still a preview, I expected the IO to be very slow, but I expected it to process JSON very fast since that is a large part of what a modern web application does. This also means that StringInputStream should be pretty fast, since it will be the main source of data for the application. I expected Go to outperform Python, and I was not really sure about Python vs PyPy. I figured PyPy would use more memory but be faster at aggregation.

Runtime comparison:

This is only 10,000 JSON documents in a single file. I have no idea how it does it, but Python just kills it. It is not that surprising that it is 18x faster than Dart, but being roughly 2x faster than Go is pretty amazing. I knew Python 2.x IO was fast, but I didn't realize it was this fast. I think there could be a faster way to process the JSON in Go; it just didn't jump out at me how to do it without using Unmarshal.
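One way to check whether the time really goes to IO is to time the file read and the JSON parsing separately. Below is a minimal sketch, assuming one JSON document per line and Python 3 (for `time.perf_counter`). The function name `time_read_and_parse` is hypothetical, and this is not how the numbers in this post were measured.

```python
import json
import time


def time_read_and_parse(path):
    """Return (read_seconds, parse_seconds, document_count) for a JSON-lines file."""
    # Phase 1: IO only. Read the whole file and split it into lines, no parsing.
    start = time.perf_counter()
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
    read_seconds = time.perf_counter() - start

    # Phase 2: parsing only. The data is already in memory at this point.
    start = time.perf_counter()
    docs = [json.loads(line) for line in lines if line.strip()]
    parse_seconds = time.perf_counter() - start

    return read_seconds, parse_seconds, len(docs)
```

Keep in mind that the operating system may serve a repeated read from its file cache, so the read time of a second run can look much smaller than the first.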

Memory comparison:

I expected Go to use the least memory, but I didn't expect it to be so tiny (2,076; after the update Python is 4,692). Dart was unusable, but that is expected in a preview release. PyPy is not much better; I guess I need to send them a note and ask what they think. Python is again the rock star here.
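How the file is read can have a large effect on memory on the Python side: loading every line up front keeps the whole file in memory, while iterating over the file object holds roughly one line at a time. The sketch below compares the two, assuming one JSON document per line with no blank lines. The `"list"` field and all function names are hypothetical. It uses `tracemalloc`, which only tracks allocations made by the Python interpreter, so its numbers are not comparable to the process memory figures reported above.

```python
import json
import tracemalloc


def sum_eager(path):
    """Load every line first (readlines), then parse."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()              # whole file held in memory
    return sum(sum(json.loads(line)["list"]) for line in lines)


def sum_lazy(path):
    """Iterate the file object so only one line is held at a time."""
    total = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            total += sum(json.loads(line)["list"])
    return total


def peak_bytes(func, path):
    """Peak Python-level allocation while running func(path), via tracemalloc."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling `peak_bytes(sum_eager, path)` and `peak_bytes(sum_lazy, path)` on the same data file gives a rough idea of how much of the footprint comes from holding all lines at once.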

I have chosen to kick the tires at a pretty high level, but I think it is a realistic situation since web apps these days process a lot of JSON. Dart fared as expected for a preview release. Python has rock star status as usual. For PyPy, we will see what happens when I contact them about it. Finally, Go was very disappointing in how much runtime it took to process the JSON; however, it recovered some dignity with its exceptional use of memory.

  1. If I read it correctly, the Python implementation uses file.readlines() which will load the whole file into memory and might explain the high pypy memory usage. I would imagine if you read the file line by line, or using a fixed size buffer, you could reduce the memory footprint drastically at the cost of raw speed (as it will require several IO visits).

  2. PyPy has a high JIT warmup period (if I remember correctly, on the order of 5-10 seconds). Try it with a significantly larger file and I suspect the python/pypy differences will flip-flop.

  3. PyPy's JSON parser is pure python, whereas CPython's is actually a C module. It's the same case for JSON encoding, however we spent some time optimizing that so it's actually faster than the C code: http://morepypy.blogspot.com/2011/10/speeding-up-json-encoding-in-pypy.html . Eventually someone will do this for parsing as well, and I'm confident that PyPy can win there too.

    1. I am using PyPy 1.8, so I assume this optimization was in it? Or was 10K too small a data size to see the performance?

  4. Will: You are totally right, I need to update the code. FAIL! This affected both Python and PyPy. I will try to get these numbers updated!

    1. It's not quite the same. Yes, Python (and pretty much every other language implementation out there?) is written in C. But the specific module in question (json) is a C extension - meaning that all the heavy lifting is done in optimized C code that only gets back to Python in the API level.

  5. You will appreciate you're actually measuring the performance of the JSON libraries used by each, and how you approach the IO, and not the intrinsic speed of the language....

    The thing is, CPython's JSON is *slow*. If you actually do really manipulate a lot of JSON, use the much faster ujson library.

    http://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java

  6. Unfortunately, Go's `json` package is horribly inefficient. `json.Unmarshal` uses `reflect` internally, which is notoriously slow. You can speed up the implementation by unmarshalling to a `map[string]interface{}`, then converting interface types as needed.

    I made a crapton of changes to your implementation, but not passing a struct to `json.Unmarshal` was the only thing that made a difference (and about a 5x speedup, at that).

    Here's the version of the code I managed to squeeze out: http://pastie.org/3772555

    1. You are correct. I think your implementation is a better match for what the Python and Dart code does. Give me a little while and I will write an update.

  7. The text says CPython is 9x faster than Go, but that graph seems to disagree. Also, you seem to be crediting IO speed, but IO isn't the limiting factor here, I don't think.

    1. JSON parsing. That's what is being benchmarked here, not IO.

      With the current stream.go:

      Total Count: 10000
      Total List size: 5000000
      Total list sum: +2.499365e+011
      Average list number: +4.998730e+004
      real 0m3.397s
      user 0m3.360s
      sys 0m0.034s

      With the JSON parsing commented out and a dummy Item with all of the fields just created as a literal instead:

      Total Count: 10000
      Total List size: 5000000
      Total list sum: +1.247500e+009
      Average list number: +2.495000e+002
      real 0m0.219s
      user 0m0.184s
      sys 0m0.017s

      So, looks to me like a trivial fraction of the time is spent doing IO.

  8. The Go version is missing runtime.GOMAXPROCS(runtime.NumCPU());

    Even without making more threads, which you could do, this will increase garbage collection performance.

Original link: http://www.wumii.com/item/1VjA647f