
High Performance Python notes (Python is a fine language; full-stack programmers might as well just use it!)

High Performance Python

Contents

  • 1 Understanding Performant Python
  • 2 Profiling
  • 3 Lists and Tuples
  • 4 Dictionaries and Sets
  • 5 Iterators and Generators
  • 6 Matrix and Vector Computation
  • 7 Compiling to C
  • 8 Concurrency
  • 9 multiprocessing
  • 10 Clusters and Job Queues
  • 11 Using Less RAM
  • 12 Lessons from the Field

Understanding Performant Python

Profiling

Lists and Tuples

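  1. Are both implemented internally as arrays? (In CPython, yes: a list is a resizable array of references that over-allocates as it grows, and a tuple is a fixed-size array.)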
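
The over-allocation mentioned above can be observed directly. The following is a minimal sketch, assuming CPython (sys.getsizeof behaves differently or is unavailable on other interpreters); the helper name growth_profile is made up for this illustration.

```python
import sys


def growth_profile(n_appends):
    """Return the size in bytes reported for a list after each append."""
    items = []
    sizes = []
    for i in range(n_appends):
        items.append(i)
        sizes.append(sys.getsizeof(items))
    return sizes


sizes = growth_profile(64)

# The reported size changes only now and then: each resize reserves spare
# slots, so most appends reuse capacity that was already allocated.
print("appends:", len(sizes), "distinct sizes:", len(set(sizes)))

# A tuple with the same items is allocated once, with no spare capacity.
print("tuple bytes:", sys.getsizeof(tuple(range(64))), "list bytes:", sizes[-1])
```

The exact byte counts depend on the interpreter version and platform, so only the pattern (few resizes, tuple no larger than the list) is meaningful.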

Dictionaries and Sets

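  1. Dictionary keys: __hash__ + __eq__/__cmp__
  2. Entropy (of the hash function)
  3. Namespace lookup order: locals(), then globals(), then __builtin__
  4. List comprehensions vs. generator expressions (one uses [], the other uses ()):
    [<value> for <item> in <sequence> if <condition>] vs (<value> for <item> in <sequence> if <condition>)
  5. itertools:
    1. imap, ifilter, izip, islice, chain, takewhile, cycle (itertools has no ireduce; reduce is a built-in in Python 2 and lives in functools in Python 3)
  6. p. 95: Knuth's online mean algorithm?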
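
On the last question: the online mean algorithm updates a running mean (and, optionally, a running variance) one item at a time, which suits generators because nothing has to be stored. The sketch below uses the update rule usually attributed to Welford and presented by Knuth; whether it matches the book's listing line for line is an assumption, and the function name is made up for this note.

```python
def online_mean_variance(values):
    """Consume an iterable once, yielding (count, mean, sample variance) per item."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the mean from before and after the update
        variance = m2 / (count - 1) if count > 1 else 0.0
        yield count, mean, variance


# A generator expression feeds the data lazily; only three numbers are kept.
readings = (x for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
final = None
for final in online_mean_variance(readings):
    pass
print(final)
```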

Iterators and Generators

Matrix and Vector Computation

  1. The book keeps using "loop invariant" examples; isn't that just the compiler failing to optimize?
  2. $ perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,\
    cache-references,cache-misses,branches,branch-misses,task-clock,faults,\
    minor-faults,cs,migrations -r 3 python diffusion_python_memory.py
  3. numpy
    1. np.roll([[1,2,3],[4,5,6]], 1, axis=1)
    2. ? Can Cython optimize data structures, or can it only handle code?
    3. In-place operations, such as +=, *=
      1. => numexpr
        1. from numexpr import evaluate
        2. evaluate("next_grid*D*dt+grid", out=next_grid)
    4. ?Creating our own roll function
  4. scipy
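    1. from scipy.ndimage.filters import laplace (newer SciPy: from scipy.ndimage import laplace)
    2. laplace(grid, out, mode='wrap')
    3. Do the page-faults counts show that scipy allocates a lot of memory? Does the instructions count show that the scipy function is too general?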
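
To make the wrap-around semantics of np.roll and mode='wrap' concrete, here is a pure-Python sketch. It is not the book's custom roll function and it does not use numpy; the names roll, roll_rows and laplacian_1d are hypothetical, and the Laplacian is the plain 1D second difference with periodic boundaries.

```python
def roll(values, shift):
    """Shift a list to the right by `shift` positions, wrapping around."""
    n = len(values)
    if n == 0:
        return []
    k = shift % n
    return values[-k:] + values[:-k] if k else list(values)


def roll_rows(grid, shift):
    """Roll every row of a list of lists, like np.roll(grid, shift, axis=1)."""
    return [roll(row, shift) for row in grid]


def laplacian_1d(values):
    """Second difference with periodic (wrap-around) boundaries."""
    left = roll(values, 1)    # left[i] == values[i - 1]
    right = roll(values, -1)  # right[i] == values[i + 1]
    return [l + r - 2 * v for l, r, v in zip(left, right, values)]


print(roll_rows([[1, 2, 3], [4, 5, 6]], 1))  # [[3, 1, 2], [6, 4, 5]]
print(laplacian_1d([0, 0, 1, 0]))            # [0, 1, -2, 1]
```

Each call to roll builds a new list, which is the same kind of temporary allocation the notes flag for the numpy version; in-place operations and numexpr exist to avoid it.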

Compiling to C

  1. Compiling to C:
    1. Cython
      1. Also used by zmq?
      2. setup.py
        from distutils.core import setup
        from distutils.extension import Extension
        from Cython.Distutils import build_ext
        setup(cmdclass={'build_ext': build_ext},
              ext_modules=[Extension("calculate", ["cythonfn.pyx"])])
      3. $ python setup.py build_ext --inplace
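      4. Cython annotations (cython -a): the yellower a line, the "more calls into the Python virtual machine"
      5. Adding type annotations
        1. cdef unsigned int i, n
      6. Disabling bounds checking: #cython: boundscheck=False (a file-wide comment directive), or @cython.boundscheck(False) to decorate a single function
      7. The buffer annotation protocol?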
        1. def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs): ...
      8. OpenMP
        1. prange
        2. -fopenmp (for GCC?)
        3. schedule="guided"
    2. Shed Skin: for non-numpy code
      1. shedskin --extmod test.py
      2. The extra 0.05 s is spent copying data from the Python environment
    3. Pythran
      1. #pythran export evolve(float64[][], float)
  2. Numba, based on LLVM: specialized for numpy
    1. Use Continuum's Anaconda distribution
    2. from numba import jit
      1. @jit()
    3. Experimental GPU support is also available?
  3. VM & JIT:PyPy
    1. GC behavior: Whereas CPython uses reference counting, PyPy uses a modified mark and sweep (so objects may not be reclaimed promptly)
    2. Note that PyPy 2.3 runs as Python 2.7.3.
    3. STM: an attempt to remove the GIL
  4. Other tools: Theano, Parakeet, PyViennaCL, Nuitka, Pyston (from Dropbox), PyCUDA (low-level code is not portable?)
  5. ctypes, cffi (from PyPy), f2py, CPython modules
    1. $ f2py -c -m diffusion --fcompiler=gfortran --opt='-O3' diffusion.f90
  6. JIT Versus AOT

Concurrency

  1. Concurrency: avoid wasting time on I/O wait
  2. In Python, coroutines are implemented as generators.
  3. For Python 2.7 implementations of future-based concurrency, ... ?
    1. gevent (suited to mainly CPU-based problems that sometimes involve heavy I/O)
      1. gevent monkey-patches the standard I/O functions to be asynchronous
      2. Greenlet
        1. wait
        2. The futures are created with gevent.spawn
        3. Limiting the number of simultaneously open resources: from gevent.coros import Semaphore (gevent.lock in gevent 1.0+)
          1. requests = [gevent.spawn(download, u, semaphore) for u in urls]
      3. import grequests?
      4. A 69x speedup? Does that mean there was a corresponding amount of unnecessary I/O wait?
      5. The event loop may be either underutilized or overutilized
    2. tornado (by Facebook; suited to mostly I/O-bound asynchronous applications)
      1. from tornado import ioloop, gen
      2. from functools import partial
      3. AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient", max_clients=100)
      4. @gen.coroutine
        1. ... responses = yield [http_client.fetch(url) for url in urls] # yields Future objects?
        2. response_sum = sum(len(r.body) for r in responses)
        3. raise gen.Return(value=response_sum)
      5. _ioloop = ioloop.IOLoop.instance()
      6. run_func = partial(run_experiment, base_url, num_iter)
      7. result = _ioloop.run_sync(run_func)
      8. Drawback: tracebacks can no longer hold valuable information
  4. Python 3.4 introduced new machinery to easily create coroutines and have them still return values
    1. asyncio
      1. yield from: there is no longer any need to raise an exception in order to return a result from a coroutine
      2. very low-level => import aiohttp
        @asyncio.coroutine
        def http_get(url):
            nonlocal semaphore
            with (yield from semaphore):
                response = yield from aiohttp.request('GET', url)
                body = yield from response.content.read()
                yield from response.wait_for_close()
            return body
        return http_get

        tasks = [http_client(url) for url in urls]
        for future in asyncio.as_completed(tasks):
            data = yield from future
      3. allows us to unify modules like tornado and gevent by having them run in the same event loop
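
The semaphore-limited pattern in these notes (in the gevent item and in the asyncio fragment) can be sketched with the standard library alone. This version assumes Python 3.7+ and uses async/await rather than the @asyncio.coroutine/yield from style of the notes; fake_download, run_all and the example.invalid URLs are stand-ins invented here, and asyncio.sleep replaces the real network call.

```python
import asyncio


async def fake_download(url, semaphore, active):
    """Stand-in for an HTTP GET: sleeps instead of touching the network."""
    async with semaphore:  # at most max_concurrent coroutines pass this point
        active["now"] += 1
        active["peak"] = max(active["peak"], active["now"])
        await asyncio.sleep(0.01)  # the simulated I/O wait
        active["now"] -= 1
        return len(url)  # pretend this is the length of the response body


async def run_all(urls, max_concurrent):
    """Schedule every download at once and sum results as they complete."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = {"now": 0, "peak": 0}
    tasks = [asyncio.ensure_future(fake_download(u, semaphore, active))
             for u in urls]
    total = 0
    for future in asyncio.as_completed(tasks):
        total += await future
    return total, active["peak"]


urls = ["http://example.invalid/%d" % i for i in range(10)]
total, peak = asyncio.run(run_all(urls, max_concurrent=3))
print("total bytes:", total, "peak concurrency:", peak)
```

All ten coroutines are scheduled at once, but the semaphore keeps the number of simultaneous "requests" at or below the limit, which is how the event loop is kept from being overutilized.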

multiprocessing

  1. Process, Pool, Queue, Pipe, Manager, ctypes (for IPC?)
  2. In Python 3.2, the concurrent.futures module was introduced (via PEP 3148)
  3. PyPy fully supports multiprocessing and runs faster
  4. from multiprocessing.dummy import Pool (the multithreaded version? Yes: same API, backed by threads)
  5. hyperthreading can give up to a 30% perf gain, if there are enough spare computing resources
  6. It is worth noting that the negative of threads on CPU-bound problems is reasonably solved in Python 3.2+
  7. Using external queue implementations: Gearman, 0MQ, Celery (with RabbitMQ as the message broker), PyRes, SQS or HotQueue
  8. manager = multiprocessing.Manager()
    value = manager.Value(b'c', FLAG_CLEAR)
  9. rds = redis.StrictRedis()
    rds[FLAG_NAME] = FLAG_SET
  10. value = multiprocessing.RawValue(b'c', FLAG_CLEAR) # no synchronization mechanism?
  11. sh_mem = mmap.mmap(-1, 1) # memory map 1 byte as a flag
    sh_mem.seek(0)
    flag = sh_mem.read_byte()
  12. Using mmap as a Flag Redux (? did not quite follow this part; skipped)
  13. $ ps -A -o pid,size,vsize,cmd | grep np_shared
  14. lock = lockfile.FileLock(filename)
    lock.acquire/release()
  15. lock = multiprocessing.Lock()
    value = multiprocessing.Value('i', 0)
    lock.acquire()
    value.value += 1
    lock.release()
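
The mmap flag from item 11 can be exercised in a single process to see the read/write mechanics. This is a minimal sketch assuming Python 3, where read_byte and write_byte work with integers; the helper names and the 0/1 values chosen for FLAG_CLEAR and FLAG_SET are assumptions of this example, not the book's definitions.

```python
import mmap

FLAG_CLEAR = 0  # assumed encoding, chosen for this sketch
FLAG_SET = 1


def write_flag(sh_mem, value):
    """Overwrite the single shared byte."""
    sh_mem.seek(0)
    sh_mem.write_byte(value)


def read_flag(sh_mem):
    """Read the single shared byte back as an int."""
    sh_mem.seek(0)
    return sh_mem.read_byte()


sh_mem = mmap.mmap(-1, 1)  # anonymous memory map, 1 byte long
write_flag(sh_mem, FLAG_CLEAR)
before = read_flag(sh_mem)
write_flag(sh_mem, FLAG_SET)
after = read_flag(sh_mem)
print(before, after)
sh_mem.close()
```

On Unix, worker processes forked after the map is created would see the same byte. Like RawValue, nothing here synchronizes access, so it only suits a flag where an occasional stale read is acceptable.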

Clusters and Job Queues

  1. $462 Million Wall Street Loss Through Poor Cluster Upgrade Strategy
    1. Did the version upgrade cause the inconsistency? But the API should have been versioned...
  2. Skype's 24-Hour Global Outage
    1. some versions of the Windows client didn’t properly handle the delayed responses and crashed.
  3. To reliably start the cluster's components when the machine boots, we tend to use either a cron job, Circus or supervisord, or sometimes Upstart (which is being replaced by systemd)
  4. you might want to introduce a random-killer tool like Netflix's ChaosMonkey
  5. Make sure it is cheap in time and money to deploy updates to the system
  6. Make sure you use a deployment system like Fabric, Salt, Chef, or Puppet
  7. Early warning: Pingdom and ServerDensity
  8. Status monitoring: Ganglia
  9. 3 Clustering Solutions
    1. Parallel Python
      1. ppservers = ("*",) # set IP list to be autodiscovered
      2. job_server = pp.Server(ppservers=ppservers, ncpus=NBR_LOCAL_CPUS)
      3. ... job = job_server.submit(calculate_pi, (input_args,), (), ("random",))
    2. IPython Parallel
      1. via ipcluster
      2. Schedulers hide the synchronous nature of the engines and provide an asynchronous interface
    3. NSQ (a distributed messaging system written in Go)
      1. Pub/sub: Topics -> Channels -> Consumers
      2. writer = nsq.Writer(['127.0.0.1:4150', ])
      3. handler = partial(calculate_prime, writer=writer)
      4. reader = nsq.Reader(message_handler = handler, nsqd_tcp_addresses = ['127.0.0.1:4150', ], topic = 'numbers', channel = 'worker_group_a',)
      5. nsq.run()
  10. Other clustering tools

Using Less RAM

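  1. IPython %memit
  2. The array module
  3. DAWG/DAFSA
  4. Marisa trie (a static trie)
  5. Datrie (needs an alphabet that covers all the keys?)
  6. HAT trie
  7. HTTP microservice (using Flask): https://github.com/j4mie/postcodeserver/
  8. Probabilistic Data Structures
    1. The HyperLogLog++ structure?
    2. Very Approximate Counting with a 1-byte Morris Counter
      1. Stores only an exponent and estimates the count as 2^exponent; the exponent is incremented by a probabilistic rule, when random(0,1) <= 2^-exponent
    3. K-Minimum Values/KMV (remember the k smallest hash values, assuming the hash values are uniformly distributed)
    4. Bloom Filters
      1. This method gives us no false negatives and a controllable rate of false positives (it may wrongly report an item as present)
      2. ? Use 2 independent hashes to simulate arbitrarily many hashes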
      3. very sensitive to initial capacity
      4. scalable Bloom filters:By chaining together multiple bloom filters ...
    5. LogLog Counter
      bit_index = trailing_zeros(item_hash)
      if bit_index > self.counter:
          self.counter = bit_index
      1. Variants: SuperLogLog, HyperLogLog
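
The Bloom filter items above (no false negatives, and two hashes standing in for arbitrarily many) can be sketched as follows. This is an illustration under stated assumptions rather than the book's implementation: the class layout is invented here, and the two base hashes are taken from two halves of one SHA-256 digest instead of two truly independent hash functions.

```python
import hashlib


class BloomFilter:
    """Fixed-capacity Bloom filter using double hashing: index_i = h1 + i * h2."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Assumption: the second half of the digest serves as the second hash.
        # Forcing it odd keeps it non-zero, so the probes do not all collapse.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))


bloom = BloomFilter(num_bits=1024, num_hashes=4)
for word in ["apple", "banana", "cherry"]:
    bloom.add(word)
print("apple" in bloom, "banana" in bloom, "cherry" in bloom)
```

The sensitivity to initial capacity noted above shows up here as num_bits: once too many items are added, most bits are set and nearly every lookup returns True.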

Lessons from the Field

  1. Sentry is used to log and diagnose Python stack traces
  2. Aho-Corasick trie?
  3. We use Graphite with collectd and statsd to allow us to draw pretty graphs of what's going on
  4. Gunicorn was used as a WSGI server and its IO loop was executed by Tornado