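Python multiprocessing notes [3]: Queue

Posted on 2012-05-14 16:05

Continuing the discussion of Python multiprocessing, this part covers Queue, one of the core components of the mp library.

Queue is how the mp library exchanges objects between processes. Object exchange and the object sharing described in the previous part both give several processes access to an object. The difference is that with sharing, several processes access the same object, whereas with exchange, an object is transferred from one process to another.

The Queue in multiprocessing is used much like Queue.Queue from Python's standard library: put places an object on the Queue and get reads an object from it. Unlike Queue.Queue, mp.Queue does not support join() and task_done() by default; for those you need mp.JoinableQueue.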
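To make the difference between Queue and JoinableQueue concrete, here is a minimal sketch. It is written for Python 3 (the listings in this post are Python 2) and assumes Linux with the fork start method; the names worker and square_all are made up for the illustration.

```python
import multiprocessing as mp

# Assumption: Linux with the "fork" start method, as in the rest of this article.
ctx = mp.get_context("fork")


def worker(tasks, results):
    """Square each number taken from `tasks` until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()      # the sentinel was put() too, so acknowledge it
            break
        results.put(item * item)
        tasks.task_done()          # tell the queue this item is finished


def square_all(numbers):
    tasks = ctx.JoinableQueue()    # supports task_done() and join()
    results = ctx.Queue()          # plain Queue: put() and get() only
    proc = ctx.Process(target=worker, args=(tasks, results))
    proc.start()
    for n in numbers:
        tasks.put(n)
    tasks.put(None)
    tasks.join()                   # returns once every put() has a matching task_done()
    out = sorted(results.get(timeout=5) for _ in numbers)
    proc.join(timeout=5)
    return out


if __name__ == "__main__":
    print(square_all([1, 2, 3]))
```

The parent never has to guess when the worker is done: join() on the JoinableQueue only returns after the worker has acknowledged every item.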

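Because a Queue carries objects between processes, the first question is how to share the Queue object itself between two processes. Of the three sharing methods described in the previous part, a Queue can only be shared through inheritance. This is because Queue is built on the Unix Pipe object, and a Pipe has to be shared through inheritance. In a typical application, the parent process therefore creates the Queue and then creates child processes that share it, with parent and child each reading or writing. For example: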

import multiprocessing
 
q = multiprocessing.Queue()
 
def reader_proc():
    print q.get()
 
reader = multiprocessing.Process(target=reader_proc)
reader.start()
 
q.put(100)
reader.join()

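Another approach is for the parent to create the Queue and several child processes, some of which read from the Queue while others write to it. For example: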

import multiprocessing
 
q = multiprocessing.Queue()
 
def writer_proc():
    q.put(100)
 
def reader_proc():
    print q.get()
 
reader = multiprocessing.Process(target=reader_proc)
reader.start()
writer = multiprocessing.Process(target=writer_proc)
writer.start()
 
reader.join()
writer.join()

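Because the Queue is shared through inheritance, the code contains no explicit step that passes the Queue object around, and it looks as though replacing the multiprocessing objects with their threading counterparts would leave the program working. Conversely, given an existing multithreaded program, does changing threading to multiprocessing just work? Perhaps, but more likely you will run into a number of problems.

The first problem is that an mp Queue has to move objects between processes, so every object it carries must be picklable. Otherwise putting the object on the Queue fails with a PicklingError.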
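One way to catch this early is to try pickling an object before it goes anywhere near a Queue. The helper below is a small sketch for Python 3; is_picklable is a hypothetical name, not part of the mp library.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if `obj` survives pickle.dumps, which is what an mp Queue needs."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # Different kinds of unpicklable objects fail with different exceptions.
        return False
    return True


if __name__ == "__main__":
    print(is_picklable({"answer": 42}))     # plain data
    print(is_picklable(lambda: 42))         # anonymous function
    print(is_picklable(threading.Lock()))   # wraps an OS-level resource
```

Plain data such as dicts, lists, and numbers passes; lambdas, locks, open files, and similar objects do not.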

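The other differences are technical details that no high-level logic can abstract away, and not knowing them leads to latent bugs such as deadlocks. While listing these pitfalls, we will also look briefly at how the mp Queue is implemented, which makes the behavior easier to understand. These implementation notes apply to Linux only; the Windows implementation and its problems are not covered here.

mp.Queue is built on a system Pipe, but a process does not write objects directly into the Pipe. It first appends them to a local buffer, and a dedicated feed thread then moves them into the Pipe. The reading side reads objects directly from the Pipe. The feed thread exists to provide the timeout control on put that the Queue interface requires. Because of this thread, mp.Queue offers a few extra functions to control it: close stops the thread, and join_thread joins it. close is also responsible for flushing every object still in the buffer into the Pipe.

The feed thread is also a troublemaker. To make sure that everything put on the Queue eventually reaches the process at the other end, the mp library registers an atexit handler that automatically closes the Queue and joins the feed thread when the process exits. This join causes many problems, including potential deadlocks. Consider a parent that creates two children, one reading and one writing. When it is time to stop them, suppose the parent ends the reader first while the writer has already put so many objects on the Queue that later objects are waiting in the buffer. The writer can then never terminate: the atexit handler waits to move every buffered object into the Pipe, but the Pipe is full, so the process deadlocks.

One might ask: why not always stop the processes in data-flow order? The problem is that many complex systems contain a circular data flow, and then, whatever order you stop the processes in, some process can end up in this situation.

Fortunately, Queue also provides the member function cancel_join_thread, which skips the join when the process stops. This avoids the deadlock, at the cost of losing any objects not yet flushed into the Pipe. Since objects left in the Pipe can still be lost even when join_thread is called, once you choose the mp Queue you should not assume that no objects are ever lost along the way.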
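The scenario can be sketched as follows: a writer puts far more data on the Queue than the Pipe can hold while nobody reads, and still exits promptly because it called cancel_join_thread. This is an illustration for Python 3 on Linux with the fork start method; the function names and the sizes are assumptions chosen for the demo.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")       # assumption: Linux, fork start method


def writer(q, count):
    """Put much more data than the pipe can buffer, then return."""
    q.cancel_join_thread()         # at exit, do not wait for the feed thread
    for _ in range(count):
        q.put("x" * 1024)          # roughly 1 KB per item


def run_writer_without_reader(count=1000):
    """Start a writer that nobody reads from; report whether it got stuck."""
    q = ctx.Queue()
    proc = ctx.Process(target=writer, args=(q, count))
    proc.start()
    proc.join(timeout=10)          # the parent never calls q.get()
    stuck = proc.is_alive()
    if stuck:                      # this is what the exit-time deadlock looks like
        proc.terminate()
        proc.join()
    return stuck


if __name__ == "__main__":
    print("writer stuck:", run_writer_without_reader())
```

Without the cancel_join_thread call the writer would be expected to hang at exit, since its feed thread can never finish flushing the buffer. The price, as described above, is that the items still in the buffer are lost.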

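Another possible option is the SimpleQueue object in the mp library. It is not mentioned in the documentation, but it is defined in the multiprocessing.queues module. It is a Queue without the buffer, so it may avoid the problems above. However, SimpleQueue offers no timeouts for put and get; both operations block.

Besides multiprocessing.Queue, you can also communicate through multiprocessing.Pipe. mp.Pipe is the structure underneath Queue, without the feed thread and without put/get timeouts, which makes it somewhat similar to SimpleQueue. Note that Pipe takes a duplex parameter. When it is True (the default), the Pipe is not implemented with a system pipe but with socketpair, that is, a Unix domain socket. Compared with a pipe this makes a slight difference in performance.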
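For reference, here is a minimal sketch of the Pipe interface in Python 3. To stay self-contained it sends and receives within one process; in real use the two ends would be inherited by different processes. round_trip is a hypothetical helper name.

```python
import multiprocessing as mp


def round_trip(obj):
    """Send `obj` through a one-way pipe and read it back."""
    # duplex=False gives a plain one-way pipe: (receive-only end, send-only end).
    receiver, sender = mp.Pipe(duplex=False)
    try:
        sender.send(obj)               # pickles obj and writes it to the pipe
        # recv() has no timeout argument; poll() is the usual way to wait with one.
        if not receiver.poll(1.0):
            raise TimeoutError("nothing arrived within 1 second")
        return receiver.recv()         # reads and unpickles
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(round_trip({"n": 100}))
```

There is no buffer and no feed thread here: send writes straight into the pipe, so sending a large object blocks until the other end reads.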

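Another way to use a Queue is not built into the mp library. It uses the server process approach from the previous article to share a Queue object. The Queue actually lives in the server process, and every child process connects to the server process over a socket and obtains a proxy for the Queue to operate on. At this point some readers will recall that the mp library has a built-in SyncManager object, available through the multiprocessing.Manager function, whose Queue method returns a proxy for a Queue. Unfortunately, this is not the right way to obtain the Queue. As the previous article explained, each call to SyncManager.Queue returns a proxy for a newly created object, not for a shared one. The correct way to use a Queue in a server process is:

Common part: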

import multiprocessing.managers as mpm
import Queue
 
class SharedQueueManager(mpm.BaseManager): pass
q = Queue.Queue()
SharedQueueManager.register('Queue', lambda: q)

Server process:

mgr = SharedQueueManager(address=('', 12345))
server = mgr.get_server()
server.serve_forever()

Client process:

mgr = SharedQueueManager(address=('localhost', 12345))
mgr.connect()
q = mgr.Queue() # q is now a proxy for the shared Queue object

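Compared with the built-in mp Queue, this approach costs some performance, since it involves several network round trips. In return, it has none of the problems caused by the feed thread, and in theory no data is lost unless the server process crashes. But as the previous article said, the server process itself is not very dependable, so data is only "in theory" never lost.

Speaking of performance, here are two figures I mentioned earlier on Twitter (contact me if you cannot open the two links):

For objects that are 512 bytes after pickling, operating on the Queue through a proxy reaches about 7,000 operations per second (same machine) or 1,100 per second (across machines). With multiprocessing.Queue, throughput can reach 54,000 operations per second.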

Posted in Python|Comments(0)

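Python multiprocessing notes [2]: sharing objects across processes

Posted on 2012-05-14 16:05

Continuing these notes on Python multiprocessing: after the process model covered last time, this part looks at sharing objects across processes.

The mp library offers three ways to share objects across processes. The first applies only to native machine types, that is, the types in ctypes; the mp documentation calls this shared memory, meaning objects are shared through shared memory. The second is called server process: one server process maintains all the objects, and other processes connect to it and operate on its objects through proxies. The last is not presented separately in the mp documentation but is mentioned there many times, and it is the most important sharing method in the library. It is called inheritance: an object is created in the parent process, and after the parent creates a child through multiprocessing.Process, the child automatically inherits the parent's objects, and the child's operations on them affect the same object.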

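Each of the three methods has its own characteristics. Here is a brief comparison.

First, the object types each method handles:

Sharing method	Supported types
Shared memory	Types in ctypes, provided through wrappers such as RawValue and RawArray
Inheritance	System kernel objects and objects built on them, including Pipe, Queue, JoinableQueue, and synchronization objects (Semaphore, Lock, RLock, Condition, Event, and so on)
Server process	All objects; you may need to write the proxy object (Proxy) by hand

The table summarizes the types each method supports. The sections below discuss them one by one.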

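The simplest is shared memory: only ctypes data types can be shared this way. Because the mp library has no naming mechanism, an object created in one process cannot be referenced by name from another, so this method relies on inheritance: the parent creates the object and the child refers to it. For an example, see "Synchronization types like locks, conditions and queues" in the Python documentation, in particular its test_sharedvalues function.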
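A minimal sketch of the shared memory approach, written for Python 3 and assuming Linux with the fork start method. The parent creates a RawValue and a RawArray, the child writes into them, and the parent sees the result. fill and demo are hypothetical names, and the sketch uses no locking, which is acceptable here only because the parent reads after the child has exited.

```python
import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray, RawValue

ctx = mp.get_context("fork")       # assumption: Linux, fork start method


def fill(counter, squares):
    """Runs in the child: writes into the shared-memory objects it was given."""
    counter.value = len(squares)
    for i in range(len(squares)):
        squares[i] = i * i


def demo(n=5):
    counter = RawValue("i", 0)     # a C int living in shared memory
    squares = RawArray("i", n)     # a C int array of length n, initially zeros
    child = ctx.Process(target=fill, args=(counter, squares))
    child.start()
    child.join(timeout=10)
    # The parent reads what the child wrote: same memory, not a copy.
    return counter.value, list(squares)


if __name__ == "__main__":
    print(demo())
```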

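Next is inheritance. One caveat first: inheritance is not really an object-sharing mechanism; sharing is only a side effect. An object a child inherits from its parent is not necessarily shared. In essence, a child forked from a parent automatically inherits the parent's memory state and descriptors. The child therefore really has a copy of the parent's object. Only when that object wraps descriptors of system kernel objects does copying the object (and the descriptors it wraps) amount to sharing it. This is why, in the table above, only system kernel objects and objects built on them can be shared through inheritance. On Linux there are no restrictions on objects shared through inheritance. Windows has no fork, which brings extra restrictions, so inheritance is almost unusable there.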
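The difference between "inherited" and "shared" can be seen in a few lines. In this sketch (Python 3, Linux, fork start method; all names are made up for the illustration) the child inherits both an ordinary list and a pipe end. The list is only a copy, while the pipe wraps a kernel descriptor and therefore really connects the two processes.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")       # assumption: Linux, fork start method

plain = []                         # ordinary object: each process has its own copy


def child_main(conn):
    plain.append("child")          # changes only the child's copy of the list
    conn.send(len(plain))          # the pipe wraps a descriptor, so this reaches the parent
    conn.close()


def demo():
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=child_main, args=(child_end,))
    proc.start()
    if not parent_end.poll(10):
        raise TimeoutError("child did not report back")
    seen_by_child = parent_end.recv()
    proc.join(timeout=10)
    # (length of the list in the child, length of the list in the parent)
    return seen_by_child, len(plain)


if __name__ == "__main__":
    print(demo())
```

The child reports a list of length 1, while the parent's list is still empty.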

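Last is the server process approach. It supports more types than the other two because of its model:

Figure: the server process model

In this model a manager process manages the real objects, which live in the manager's memory space. Every process that needs an object first connects to the manager and obtains a proxy object for it. Normally the proxy exposes the real object's public functions: it pickles the arguments and sends them over the connection to the manager, which unpickles them and forwards the call to the corresponding function of the real object. The return value (or exception) is pickled by the manager, sent back over the connection to the client, and unpickled by the proxy, which returns it to the caller or raises the exception.

This is clearly a typical RPC (remote procedure call) model. Since every client process is really accessing objects inside the manager process, it can be used to share objects.

The connection between manager and proxy can be a socket-based network connection or a Unix pipe. With a socket connection, you must call the manager object's connect function to connect to the remote manager process before using a proxy. Because the manager process opens a port to accept connections, authentication is necessary; otherwise anyone could connect to the manager and tamper with your shared objects. The mp library authenticates with an authkey.

In the implementation, the manager process is provided by multiprocessing.Manager or by a subclass of BaseManager. BaseManager provides register, which registers a function that produces the proxy for a shared object. The function is invoked by the client but runs in the manager process. It can return one shared object (the same object for every call) or create a new object for each call; the former lets several processes share one object. For usage, see the example "Demonstration of how to create and use customized managers and proxies" in the Python documentation.

Typical code for exporting a shared object is: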

import multiprocessing.managers

object_ = ObjectType()  # ObjectType stands for whatever class you want to share
class ObjectManager(multiprocessing.managers.BaseManager): pass
ObjectManager.register("object", lambda: object_)

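Note the words "public functions" in the description of proxy objects above. Each proxy exports only the public functions of the real object. That means two things: "public", meaning members whose names do not start with an underscore, and "functions", meaning members that are callable. This brings restrictions: attributes cannot be exported, and neither can special methods such as __get__ or __next__. The mp library handles this with custom proxy objects. BaseManager's register accepts a proxytype as its third parameter, which determines which members are exported. For details, see the first example in the documentation.

The manager has a few more details to watch. Proxy objects are not thread-safe, so when a proxy is used in a multithreaded program the mp library creates a proxy per thread, each proxy opens its own connection to the server process, and the manager starts a separate thread to serve each connection. As a result, a client process with many threads can easily push the manager's fd count up to the ulimit. Even below the limit, too many threads in the manager process seriously hurt its performance. One solution is an in-process cache: a single thread creates the proxy and accesses the shared object, and the other threads only access the cache.

When the manager hits the ulimit or some other error, it simply exits. Unfortunately, proxies that are already established then try to reconnect to the manager, which no longer exists. The client process hangs in the proxy call, and so far I have found no remedy other than killing the process.

In addition, the proxy's use of sockets is rather tricky and conflicts with the built-in socket library in several ways, for example with socket.setdefaulttimeout (Python Issue 6056). After setdefaulttimeout is called, every socket created through the socket module in that process is put into non-blocking mode, but the mp library does not know this and always assumes blocking sockets. So once setdefaulttimeout has been called, every proxy call raises OSError with error code 11 and the very misleading message "Resource temporarily unavailable", which is simply EAGAIN. This can be worked around with a patch I have provided (the patch also contains some other fixes, so please review and adapt it yourself).

For these reasons, server process is the most flexible way to share objects but also the one with the most problems. You have to weigh this yourself. Our system currently has low requirements for data reliability and can tolerate lost data, yet we still use this mode only for maintaining statistics and do not dare use it for anything more.

That is all on sharing objects across processes. To be continued.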


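Python multiprocessing notes (introduction)

Posted on 2012-05-14 16:04

A while ago I worked on a Python project that needed a background service. Its processing flow was fairly complex and likely to change often. However, the flow could be split into steps, each with its own input, output, and processing, and common changes to the flow could then be made by wiring the inputs and outputs together in different combinations.

The reason for using multiple processes is Python's Global Interpreter Lock (GIL). Because of the GIL, multithreading cannot make effective use of a multi-core CPU in CPU-bound programs, since an interpreter runs only one thread at any moment. To use multiple cores as fully as possible, multiple processes are required.

Bringing in multiple processes raises a few issues. First, this design needs an underlying structure similar to a UNIX pipe, or more precisely a message queue, operated much like Python's Queue.Queue. Second, it needs infrastructure for sharing information between processes.

Queue first. The Queue must meet the following requirements, in decreasing priority:

Lightweight: no dependency on heavyweight software such as a database. The price is that losing a few items is acceptable;
Cross-process: the reading and writing ends need not be in the same process;
Multiple producers and consumers: both the reading and writing ends must support several processes;
Stable and efficient: this needs no explanation;
A Python API, or failing that a C API for which I can write a Python binding.

As for inter-process sharing, system shm is a last resort, because you have to handle pickling and unpickling yourself along with assorted race conditions. An existing library built on top of it would be best.

In the end, for the sake of being lightweight, we chose Python's multiprocessing library rather than IPC MQ or RabbitMQ. The library joined the standard library in Python 2.6, and a backport exists for 2.5. Our system runs 2.5, so what we actually use is the backport. Apart from minor details, its core implementation is identical to the 2.6 standard library.

We ran into quite a few problems along the way, and I record them here one by one.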


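Python debugging techniques

Posted on 2012-05-10 15:25

Debugging with pdb

pdb is a package bundled with Python that provides interactive source-level debugging for Python programs. Its main features include setting breakpoints, single-stepping, stepping into functions, viewing the current code, viewing stack frames, and changing variable values on the fly. The commonly used pdb commands are listed in Table 1.

Table 1. Common pdb commands

Command	Description
break or b	Set a breakpoint
continue or c	Continue execution
list or l	Show the code around the current line
step or s	Step into a function
return or r	Run until the current function returns
exit or q	Abort and quit
next or n	Execute the next line
pp	Pretty-print the value of an expression
help	Help

The following examples show how to debug with pdb.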

Listing 1. Test code example

import pdb
a = "aaa"
pdb.set_trace()
b = "bbb"
c = "ccc"
final = a + b + c
print final

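Start debugging: run the script directly and it stops at pdb.set_trace(). Type n and press Enter to execute the current statement. After the first n + Enter, pressing Enter alone repeats the previous debugger command.

Listing 2. Debugging with pdb
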
[root@rcc-pok-idg-2255 ~]#  python epdb1.py 
 > /root/epdb1.py(4)?() 
 -> b = "bbb"
 (Pdb) n 
 > /root/epdb1.py(5)?() 
 -> c = "ccc"
 (Pdb) 
 > /root/epdb1.py(6)?() 
 -> final = a + b + c 
 (Pdb) list 
  1     import pdb 
  2     a = "aaa"
  3     pdb.set_trace() 
  4     b = "bbb"
  5     c = "ccc"
  6  -> final = a + b + c 
  7     print final 
 [EOF] 
 (Pdb) 
 [EOF] 
 (Pdb) n 
 > /root/epdb1.py(7)?() 
 -> print final 
 (Pdb)

Quit debugging: quit or q leaves the debugger, but it does so abruptly: the program is aborted with a BdbQuit exception, as the traceback below shows.

Listing 3. Quitting the debugger

[root@rcc-pok-idg-2255 ~]#  python epdb1.py 
 > /root/epdb1.py(4)?() 
 -> b = "bbb"
 (Pdb) n 
 > /root/epdb1.py(5)?() 
 -> c = "ccc"
 (Pdb) q 
 Traceback (most recent call last): 
  File "epdb1.py", line 5, in ? 
    c = "ccc"
  File "epdb1.py", line 5, in ? 
    c = "ccc"
  File "/usr/lib64/python2.4/bdb.py", line 48, in trace_dispatch 
    return self.dispatch_line(frame) 
  File "/usr/lib64/python2.4/bdb.py", line 67, in dispatch_line 
    if self.quitting: raise BdbQuit 
 bdb.BdbQuit

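Printing variable values: to print a variable while debugging, use p followed by the variable name. Note that a value can only be printed after the statement that assigns it has been executed; otherwise pdb reports NameError: <exceptions.NameError ...>.

Listing 4. Printing variables while debugging
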
[root@rcc-pok-idg-2255 ~]#  python epdb1.py 
 > /root/epdb1.py(4)?() 
 -> b = "bbb"
 (Pdb) n 
 > /root/epdb1.py(5)?() 
 -> c = "ccc"
 (Pdb) p b 
'bbb'
 (Pdb) 
'bbb'
 (Pdb) n 
 > /root/epdb1.py(6)?() 
 -> final = a + b + c 
 (Pdb) p c 
'ccc'
 (Pdb) p final 
 *** NameError: <exceptions.NameError instance at 0x1551b710 > 
 (Pdb) n 
 > /root/epdb1.py(7)?() 
 -> print final 
 (Pdb) p final 
'aaabbbccc'
 (Pdb)

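Use c to leave the debugger and let the program continue. If a later part of the program contains another set_trace() call, execution enters the debugger again; you can verify this by adding set_trace() before print final.

Listing 5. Leaving the debugger and continuing execution
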
[root@rcc-pok-idg-2255 ~]#  python epdb1.py 
 > /root/epdb1.py(4)?() 
 -> b = "bbb"
 (Pdb) n 
 > /root/epdb1.py(5)?() 
 -> c = "ccc"
 (Pdb) c 
 aaabbbccc

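Showing code: while debugging you may not remember the current block of code. To see it, use list or l. list marks the statement currently being debugged with an arrow (->).

Listing 6. Showing code while debugging
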
[root@rcc-pok-idg-2255 ~]#  python epdb1.py 
 > /root/epdb1.py(4)?() 
 -> b = "bbb"
 (Pdb) list 
  1     import pdb 
  2     a = "aaa"
  3     pdb.set_trace() 
  4  -> b = "bbb"
  5     c = "ccc"
  6     final = a + b + c 
  7     pdb.set_trace() 
  8     print final 
 [EOF] 
 (Pdb) c 
 > /root/epdb1.py(8)?() 
 -> print final 
 (Pdb) list 
  3     pdb.set_trace() 
  4     b = "bbb"
  5     c = "ccc"
  6     final = a + b + c 
  7     pdb.set_trace() 
  8  -> print final 
 [EOF] 
 (Pdb)

Debugging code that uses functions

Listing 7. Example using a function

import pdb

def combine(s1,s2):      # define subroutine combine, which...
    s3 = s1 + s2 + s1    # sandwiches s2 between copies of s1, ...
    s3 = '"' + s3 +'"'   # encloses it in double quotes,...
    return s3            # and returns it.

a = "aaa"
pdb.set_trace()
b = "bbb"
c = "ccc"
final = combine(a,b)
print final

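Stepping through with n treats final = combine(a,b) as an ordinary assignment and moves on to print final. To debug the function itself, use s to step into it. Single-stepping inside a function works as described above. If you do not want to single-step through the function, press r at the breakpoint to run back to the call site.

Listing 8. Debugging a function
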
[root@rcc-pok-idg-2255 ~]# python epdb2.py 
 > /root/epdb2.py(10)?() 
 -> b = "bbb"
 (Pdb) n 
 > /root/epdb2.py(11)?() 
 -> c = "ccc"
 (Pdb) n 
 > /root/epdb2.py(12)?() 
 -> final = combine(a,b) 
 (Pdb) s 
 --Call-- 
 > /root/epdb2.py(3)combine() 
 -> def combine(s1,s2):      # define subroutine combine, which... 
 (Pdb) n 
 > /root/epdb2.py(4)combine() 
 -> s3 = s1 + s2 + s1    # sandwiches s2 between copies of s1, ... 
 (Pdb) list 
  1     import pdb 
  2 
  3     def combine(s1,s2):      # define subroutine combine, which... 
  4  ->     s3 = s1 + s2 + s1    # sandwiches s2 between copies of s1, ... 
  5         s3 = '"' + s3 +'"'   # encloses it in double quotes,... 
  6         return s3            # and returns it. 
  7 
  8     a = "aaa"
  9     pdb.set_trace() 
 10     b = "bbb"
 11     c = "ccc"
 (Pdb) n 
 > /root/epdb2.py(5)combine() 
 -> s3 = '"' + s3 +'"'   # encloses it in double quotes,... 
 (Pdb) n 
 > /root/epdb2.py(6)combine() 
 -> return s3            # and returns it. 
 (Pdb) n 
 --Return-- 
 > /root/epdb2.py(6)combine()->'"aaabbbaaa"'
 -> return s3            # and returns it. 
 (Pdb) n 
 > /root/epdb2.py(13)?() 
 -> print final 
 (Pdb)

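Changing values while debugging: variable values can be changed on the fly, as in the example below. Note the error in the session: b is also the pdb command for setting a breakpoint, so b = "avfe" is parsed as a breakpoint command. To assign to b, prefix the statement with an exclamation mark, as in !b = "avfe".

Listing 9. Changing values while debugging
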
[root@rcc-pok-idg-2255 ~]# python epdb2.py 
 > /root/epdb2.py(10)?() 
 -> b = "bbb"
 (Pdb) var = "1234"
 (Pdb) b = "avfe"
 *** The specified object '= "avfe"' is not a function 
 or was not found along sys.path. 
 (Pdb) !b="afdfd"
 (Pdb)

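An obvious weakness of pdb is its poor support for multithreading and remote debugging. It also has no graphical interface, which makes it a poor fit for large Python projects, where such debugging needs are common and call for more advanced tools. The next section introduces debugging with the PyCharm IDE.

Debugging with PyCharm

PyCharm is a Python IDE from JetBrains with syntax highlighting, project management, code navigation, smart hints, auto-completion, unit testing, and version control, plus support for Django development and Google App Engine. It comes in a personal edition and a commercial edition, both of which require a license; a free 30-day trial is also available. The trial version can be downloaded from the official site: http://www.jetbrains.com/pycharm/download/index.html. PyCharm also provides fairly complete debugging support, including multithreading and remote debugging, with breakpoints, single-stepping, expression evaluation, variable inspection, and more. The layout of the PyCharm debugging window is shown in Figure 1.

Figure 1. PyCharm IDE window layout

The following example shows how to debug a multithreaded program with PyCharm. The code used is in Listing 10.

Listing 10. Code for the PyCharm debugging example
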
#!/usr/bin/python
__author__ = 'zhangying'
import thread
import time

# Define a function for the thread
def print_time(threadName, delay):
    count = 0
    while count < 5:
        count += 1
        print "%s: %s" % (threadName, time.ctime(time.time()))

def check_sum(threadName, valueA, valueB):
    print "to calculate the sum of two number her"
    result = sum(valueA, valueB)
    print "the result is", result

def sum(valueA, valueB):
    if valueA > 0 and valueB > 0:
        return valueA + valueB

def readFile(threadName, filename):
    file = open(filename)
    for line in file.xreadlines():
        print line

try:
    thread.start_new_thread(print_time, ("Thread-1", 2,))
    thread.start_new_thread(check_sum, ("Thread-2", 4, 5,))
    thread.start_new_thread(readFile, ("Thread-3", "test.txt",))
except:
    print "Error: unable to start thread"
while 1:
    # print "end"
    pass

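Before debugging you usually set breakpoints, typically on loop or conditional expressions or at key points in the program. Setting one is simple: move the cursor to the line in the editor and press Ctrl+F8 or choose "Run" -> "Toggle Line Breakpoint" from the menu. More directly, double-click the left margin of the editor and a small red dot appears (Figure 2). Once debugging starts, the line being executed is shown in blue. In the figure below three breakpoints are set, and the line highlighted in blue is the one being executed.

Figure 2. Setting breakpoints

Expression evaluation: while debugging you sometimes need to track the value of an expression to find a problem. PyCharm supports this: select the expression, choose "Run" -> "Evaluate Expression", and click Evaluate in the window that appears to see the value.

PyCharm also provides Variables and Watches panes. The values of the variables involved in the current debugging step can be seen directly in the Variables pane.

Figure 3. Viewing variables

To monitor a variable continuously, select it and choose "Run" -> "Add Watch" to add it to the Watches pane. When execution reaches a statement involving that variable, its value appears in the pane.

Figure 4. Watching a variable

A multithreaded program usually has several threads, and when breakpoints are set in the bodies of different threads, the IDE needs good multithreaded debugging support. When the main thread starts a child thread, PyCharm automatically creates a virtual thread whose name starts with Dummy, and each thread has its own debugging frames. In Figure 5 the example has four threads: the main thread and the three it created, Dummy-4, Dummy-5, and Dummy-6. Dummy-4 corresponds to thread 1, and the other two to threads 2 and 3.

Figure 5. Threads pane

When debugging enters a thread's code, the Frames pane switches to the corresponding frame and the Variables pane shows the variables for that code, as in Figure 6. From there you can debug with the usual controls such as Step Into and Step Over.

Figure 6. Debugging a child thread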

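Debugging with PyDev

PyDev is an open-source plugin that integrates easily with Eclipse and provides convenient, powerful debugging. As a Python IDE it also offers syntax error reporting, source editing assistance, Quick Outline, Globals Browser, Hierarchy View, running programs, and other features. Here is how to integrate PyDev with Eclipse. Before installing PyDev you need Java 1.4 or later, Eclipse, and Python. Step one: start Eclipse, open the Help menu, choose Help > Install New Software, click the Add button, and add the PyDev update site http://pydev.org/updates. Select PyDev and complete the remaining steps to install it.

Figure 7. Installing PyDev

After installation, configure the Python interpreter: in the Eclipse menu bar choose Window > Preferences > Pydev > Interpreter - Python. In this example Python is installed in C:\Python27. Click New and select the interpreter python.exe. A window with many checkboxes opens; select the paths to add to the system PYTHONPATH and click OK.

Figure 8. Configuring PyDev

With PyDev configured, create a Python project from the Eclipse menu bar with File > New > Project > Pydev > Pydev Project and click Next. The rest of this section assumes the project exists and contains a script to debug, remote.py (shown below). The script logs in to a remote machine and runs some commands, and it takes arguments at run time. The following explains how to pass arguments while debugging.

Listing 11. Example code for PyDev debugging
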
#!/usr/bin/env python
import os

def telnetdo(HOST=None, USER=None, PASS=None, COMMAND=None):  # define a function
    import telnetlib, sys
    if not HOST:
        try:
            HOST = sys.argv[1]
            USER = sys.argv[2]
            PASS = sys.argv[3]
            COMMAND = sys.argv[4]
        except:
            print "Usage: remote.py host user pass command"
            return
    tn = telnetlib.Telnet()
    try:
        tn.open(HOST)
    except:
        print "Cannot open host"
        return
    tn.read_until("login:")
    tn.write(USER + '\n')
    if PASS:
        tn.read_until("Password:")
        tn.write(PASS + '\n')
    # Run the command whether or not a password was needed
    tn.write(COMMAND + '\n')
    tn.write("exit\n")
    tmp = tn.read_all()
    tn.close()
    del tn
    return tmp

if __name__ == '__main__':
    print telnetdo()

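Some debugging sessions need arguments, and the run must be configured beforehand to receive them. The script being debugged here, remote.py, needs four arguments: host, user, password, and command. In the Eclipse project tree, right-click the program to debug, choose "Debug As" -> "Debug Configurations", and on the Arguments tab click "Variables", as shown in Figure 9.

Figure 9. Configuring variables

In the "Select Variable" window choose "Edit Variables" to open the window below, click "New", and enter the variable name and value in the dialog that appears. Be sure to leave a space after the value; otherwise all the arguments are read as the first argument.

Figure 10. Adding a variable

Configure all the arguments this way, then in the "Select Variable" window pick the variables in the order the arguments are required. The finished configuration is shown in Figure 11.

Figure 11. Completed configuration

Choose Debug to start debugging. The process is similar to using Eclipse's built-in debugger and supports multithreaded debugging. Plenty has been written on the subject; readers can search for it or refer to the article "Debugging with the Eclipse Platform".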

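Debugging with logging

Logging is a very useful debugging aid, especially in large projects where many people collaborate. By adding code that records events as the software runs, developers can pin down problems more easily. The recorded information may include the time, a description, and the context in which an error or exception occurred. The most primitive debugging method is to embed print statements and locate the problem from their output. This has drawbacks: normal program output and debug output are mixed together, which complicates analysis, and once debugging is over there is no simple way to silence the print output or redirect it to a file. Python's built-in logging module solves these problems. It defines five levels for a logger, set with Logger.setLevel(lvl). The default level is WARNING.

Table 2. Logging levels

Level	When to use
DEBUG	Detailed information, used when tracking down a problem
INFO	Normal information
WARNING	Something unexpected has happened or is about to happen (for example, low disk space), but the program keeps running
ERROR	A more serious problem has affected some of the program's functionality
CRITICAL	A severe error; the program itself may be unable to continue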
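The effect of the level threshold can be checked with a short sketch. It is written for Python 3; the logger name levels_demo and the helper levels_emitted are made up for the illustration. It logs one message at each of the five levels and reports which ones pass the threshold.

```python
import io
import logging

LEVELS = ("debug", "info", "warning", "error", "critical")


def levels_emitted(threshold, name="levels_demo"):
    """Log one message at every level; return the level names that got through."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger = logging.getLogger(name)   # hypothetical logger name
    logger.propagate = False           # keep the demo away from the root logger
    logger.setLevel(threshold)
    logger.addHandler(handler)
    try:
        for method in LEVELS:
            getattr(logger, method)("message at %s", method)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().split()


if __name__ == "__main__":
    print(levels_emitted(logging.WARNING))
    print(levels_emitted(logging.DEBUG))
```

With the threshold at WARNING only WARNING, ERROR, and CRITICAL come through; lowering it to DEBUG lets all five through.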

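The logging library has four main kinds of objects:

logger: the logger is the interface through which the program emits messages. Loggers are spread through the code so that the program can record information as it runs; the logger decides from its level and filters which messages to emit and dispatches them to its handlers. Common methods include Logger.setLevel(), Logger.addHandler(), Logger.removeHandler(), Logger.addFilter(), Logger.debug(), Logger.info(), Logger.warning(), and Logger.error(), together with the module-level function getLogger(). Loggers form a hierarchy: a child logger is named by appending to its parent's name, in the form parent_name.child_name. If you do not create a logger, the default root logger is used; logging.getLogger() or logging.getLogger("") returns the root logger.
Handler: a handler takes care of output and can send messages to the console, a file, or the network. Attach handlers to a logger with Logger.addHandler(). The common ones are StreamHandler and FileHandler: StreamHandler writes log messages to a stream and FileHandler writes them to a file. Both are defined in the core logging module. Other handlers, such as HTTPHandler and SocketHandler, are defined in the logging.handlers module.
Formatter: a formatter determines the layout of a log message. The format is written with %(<dictionary key>)s placeholders, for example '%(asctime)s - %(levelname)s - %(message)s'. The supported keys are listed under LogRecord attributes in the Python documentation.
Filter: a filter decides which messages are emitted. It can be attached to handlers and loggers and respects the hierarchy: a filter created with the name A.B passes messages from logger A.B and its children, such as A.B and A.B.C.

Listing 12. Logging example
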
import logging
LOG1=logging.getLogger('b.c')
LOG2=logging.getLogger('d.e')
filehandler = logging.FileHandler('test.log','a')
formatter = logging.Formatter('%(name)s %(asctime)s %(levelname)s %(message)s')
filehandler.setFormatter(formatter)
filter=logging.Filter('b')
filehandler.addFilter(filter)
LOG1.addHandler(filehandler)
LOG2.addHandler(filehandler)
LOG1.setLevel(logging.INFO)
LOG2.setLevel(logging.DEBUG)
LOG1.debug('it is a debug info for log1')
LOG1.info('normal infor for log1')
LOG1.warning('warning info for log1:b.c')
LOG1.error('error info for log1:abcd')
LOG1.critical('critical info for log1:not worked')
LOG2.debug('debug info for log2')
LOG2.info('normal info for log2')
LOG2.warning('warning info for log2')
LOG2.error('error:b.c')
LOG2.critical('critical')

The example installs a filter named b. b.c is a child logger of b, so it satisfies the filter and its messages are written, while loggers that do not satisfy it (here d.e) are filtered out.
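The name matching that logging.Filter performs can be tried in isolation. The sketch below, for Python 3, builds a LogRecord by hand and asks a filter whether it would pass; passes is a hypothetical helper and "demo.py" is a placeholder path.

```python
import logging


def passes(filter_name, logger_name):
    """Return True if a Filter named `filter_name` lets a record from `logger_name` through."""
    name_filter = logging.Filter(filter_name)
    record = logging.LogRecord(
        logger_name, logging.INFO, "demo.py", 0, "message", None, None
    )
    return bool(name_filter.filter(record))


if __name__ == "__main__":
    for logger_name in ("b", "b.c", "bc", "d.e"):
        print(logger_name, passes("b", logger_name))
```

Matching follows the dotted hierarchy rather than plain string prefixes: b and b.c pass a filter named b, while bc and d.e do not.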

Listing 13. Output

b.c 2011-11-25 11:07:29,733 INFO normal infor for log1
b.c 2011-11-25 11:07:29,733 WARNING warning info for log1:b.c
b.c 2011-11-25 11:07:29,733 ERROR error info for log1:abcd
b.c 2011-11-25 11:07:29,733 CRITICAL critical info for log1:not worked

logging is very simple to use and is also thread-safe. The following multithreaded example shows how to use logging for debugging.

Listing 14. Using logging with multiple threads

logging.conf:

[loggers]
keys=root,simpleExample

[handlers]
keys=consoleHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[logger_simpleExample]
level=DEBUG
handlers=consoleHandler
qualname=simpleExample
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=

Code example:

#!/usr/bin/python
import thread
import time
import logging
import logging.config
logging.config.fileConfig('logging.conf')
# create logger
logger = logging.getLogger('simpleExample')
# Define a function for the thread
def print_time(threadName, delay):
    logger.debug('thread 1 call print_time function body')
    count = 0
    logger.debug('count:%s', count)
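The listing above breaks off before any thread is started. As a self-contained illustration of the same idea, here is a sketch for Python 3 that uses the threading module instead of thread and collects the messages in a list instead of reading logging.conf. ListHandler, run_threads, and the logger name thread_demo are made up for the example.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Collects formatted records in a list; emit() runs under the handler's lock."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def run_threads(n_threads=3, n_messages=5):
    handler = ListHandler()
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
    logger = logging.getLogger("thread_demo")   # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(handler)

    def print_time(count):
        for i in range(count):
            logger.debug("count:%s", i)

    threads = [
        threading.Thread(target=print_time, args=(n_messages,), name="Thread-%d" % (k + 1))
        for k in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    logger.removeHandler(handler)
    return handler.lines


if __name__ == "__main__":
    for line in run_threads():
        print(line)
```

Every thread logs through the same logger and handler without extra locking, and the %(threadName)s field in the format shows which thread produced each line.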

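Summary

This article introduced several ways to debug Python: the pdb module, PyDev integrated with Eclipse, PyCharm, and debug logging. I hope it gives Python users a useful reference. More material on Python debuggers can be found in the references.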


Sending email with smtplib in Python 3

Posted on 2012-03-30 20:14

Plain-text email

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.text import MIMEText
from email.header import Header
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msg = MIMEText('你好', 'plain', 'utf-8')  # Chinese text needs the 'utf-8' argument; single-byte characters do not
msg['Subject'] = Header(subject, 'utf-8')
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msg.as_string())
smtp.quit()

HTML email

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.text import MIMEText
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msg = MIMEText('<html><h1>你好</h1></html>','html','utf-8')
msg['Subject'] = subject
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msg.as_string())
smtp.quit()

HTML email with an embedded image

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msgRoot = MIMEMultipart('related')
msgRoot['Subject'] = 'test message'
msgText = MIMEText('Some HTML text and an image. <img src="cid:image1"> good!','html','utf-8')
msgRoot.attach(msgText)
fp = open('h:\\python\\1.jpg', 'rb')
msgImage = MIMEImage(fp.read())
fp.close()
msgImage.add_header('Content-ID', '<image1>')
msgRoot.attach(msgImage)
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msgRoot.as_string())
smtp.quit()

Email with an attachment

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msgRoot = MIMEMultipart('related')
msgRoot['Subject'] = 'test message'
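# Build the attachment (MIMEText accepts only text in Python 3, so use MIMEApplication for bytes)
from email.mime.application import MIMEApplication
with open('h:\\python\\1.jpg', 'rb') as fp:
    att = MIMEApplication(fp.read())  # defaults to application/octet-stream, base64-encoded
att["Content-Disposition"] = 'attachment; filename="1.jpg"'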
msgRoot.attach(att)
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msgRoot.as_string())
smtp.quit()

Email to multiple recipients

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.text import MIMEText
sender = '***'
receiver = ['***', '****']  # list as many addresses as needed
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msg = MIMEText('你好', 'plain', 'utf-8')
msg['Subject'] = subject
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msg.as_string())
smtp.quit()

Email combining all of the above

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
# Create message container - the correct MIME type is multipart/alternative.
msg = MIMEMultipart('alternative')
msg['Subject'] = "Link"
# Create the body of the message (a plain-text and an HTML version).
text = "Hi!\nHow are you?\nHere is the link you wanted:\nhttp://www.python.org"
html = """\
<html>
<head></head>
<body>
Hi!
How are you?
Here is the <a href="http://www.python.org">link</a> you wanted.
</body>
</html>
"""
# Record the MIME types of both parts - text/plain and text/html.
part1 = MIMEText(text, 'plain')
part2 = MIMEText(html, 'html')
# Attach parts into message container.
# According to RFC 2046, the last part of a multipart message, in this case
# the HTML message, is best and preferred.
msg.attach(part1)
msg.attach(part2)
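# Build the attachment (MIMEText accepts only text in Python 3, so use MIMEApplication for bytes)
from email.mime.application import MIMEApplication
with open('h:\\python\\1.jpg', 'rb') as fp:
    att = MIMEApplication(fp.read())  # defaults to application/octet-stream, base64-encoded
att["Content-Disposition"] = 'attachment; filename="1.jpg"'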
msg.attach(att)
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msg.as_string())
smtp.quit()
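For comparison, the same kind of message (plain text, an HTML alternative, and a binary attachment) can also be built with email.message.EmailMessage, the newer API in Python 3's standard library. This is a different technique from the MIME classes used in the listings above, shown only as a sketch. build_message and the example addresses are made up, and nothing is actually sent.

```python
from email.message import EmailMessage


def build_message(sender, receiver, subject, text, html, attachment, filename):
    """Build a message with a text body, an HTML alternative, and one attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = receiver
    msg["Subject"] = subject                    # non-ASCII subjects are encoded automatically
    msg.set_content(text)                       # text/plain body
    msg.add_alternative(html, subtype="html")   # turns the body into multipart/alternative
    msg.add_attachment(                         # wraps everything in multipart/mixed
        attachment,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg


if __name__ == "__main__":
    message = build_message(
        "sender@example.com", "receiver@example.com", "python email test",
        "Hi!", "<html><body><p>Hi!</p></body></html>", b"\x00\x01\x02", "1.jpg",
    )
    print(message.get_content_type())
```

A message built this way can be passed to the send_message method of an smtplib.SMTP connection in place of sendmail with as_string().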

Email over an encrypted connection (STARTTLS)

#!/usr/bin/env python3
#coding: utf-8
import smtplib
from email.mime.text import MIMEText
from email.header import Header
sender = '***'
receiver = '***'
subject = 'python email test'
smtpserver = 'smtp.163.com'
username = '***'
password = '***'
msg = MIMEText('你好', 'plain', 'utf-8')  # Chinese text needs the 'utf-8' argument; single-byte characters do not
msg['Subject'] = Header(subject, 'utf-8')
smtp = smtplib.SMTP()
smtp.connect('smtp.163.com')
smtp.ehlo()
smtp.starttls()
smtp.ehlo()
smtp.set_debuglevel(1)
smtp.login(username, password)
smtp.sendmail(sender, receiver, msg.as_string())
smtp.quit()


Turning vim into a Python editor on Ubuntu

Posted on 2012-03-21 00:06

" All system-wide defaults are set in $VIMRUNTIME/debian.vim (usually just
" /usr/share/vim/vimcurrent/debian.vim) and sourced by the call to :runtime
" you can find below.  If you wish to change any of those settings, you should
" do it in this file (/etc/vim/vimrc), since debian.vim will be overwritten
" everytime an upgrade of the vim packages is performed.  It is recommended to
" make changes after sourcing debian.vim since it alters the value of the
" 'compatible' option.

" This line should not be removed as it ensures that various options are
" properly set to work with the Vim-related packages available in Debian.
runtime! debian.vim

" Uncomment the next line to make Vim more Vi-compatible
" NOTE: debian.vim sets 'nocompatible'.  Setting 'compatible' changes numerous
" options, so any other options should be set AFTER setting 'compatible'.
"set compatible

" Vim5 and later versions support syntax highlighting. Uncommenting the next
" line enables syntax highlighting by default.
" Enable syntax highlighting
syntax on

" If using a dark background within the editing area and syntax highlighting
" turn on this option as well
" set background=dark

" Uncomment the following to have Vim jump to the last position when
" reopening a file
"if has("autocmd")
"  au BufReadPost * if line("'\"") > 0 && line("'\"") <= line("$")
"    \| exe "normal g'\"" | endif
"endif

" Uncomment the following to have Vim load indentation rules according to the
" detected filetype. Per default Debian Vim only load filetype specific
" plugins.
"if has("autocmd")
"  filetype indent on
"endif

" The following are commented out as they cause vim to behave a lot
" differently from regular Vi. They are highly recommended though.
"set showcmd                " Show (partial) command in status line.
"set showmatch                " Show matching brackets.
"set ignorecase                " Do case insensitive matching
"set smartcase                " Do smart case matching
"set incsearch                " Incremental search
"set autowrite                " Automatically save before commands like :next and :make
"set hidden             " Hide buffers when they are abandoned
"set mouse=a                " Enable mouse usage (all modes) in terminals

" Source a global configuration file if available
" XXX Deprecated, please move your changes here in /etc/vim/vimrc
if filereadable("/etc/vim/vimrc.local")
  source /etc/vim/vimrc.local
endif


if has("autocmd")

  " Detect the file type automatically and load the matching settings
  filetype plugin indent on

  " General settings for Python files, such as using spaces instead of tabs
  autocmd FileType python setlocal et | setlocal sta | setlocal sw=4

  " Settings for Python unittest
  " These let us write Python code and unittest tests without leaving vim:
  " type :make or click the make button on the gvim toolbar to run the test cases
  autocmd FileType python compiler pyunit
  autocmd FileType python setlocal makeprg=python\ ./alltests.py
  autocmd BufNewFile,BufRead test*.py setlocal makeprg=python\ %

  " Use templates for new files automatically
  autocmd BufNewFile test*.py 0r ~/.vim/skeleton/test.py
  autocmd BufNewFile alltests.py 0r ~/.vim/skeleton/alltests.py
  autocmd BufNewFile *.py 0r ~/.vim/skeleton/skeleton.py

endif

" python auto-complete code
" Typing the following (in insert mode):
"   os.lis<Ctrl-n>
" will expand to:
"   os.listdir(
" Python auto-completion: just press Ctrl-N repeatedly
if has("autocmd")
      autocmd FileType python set complete+=k~/.vim/tools/pydiction
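endif


" Set the file type to Python
set filetype=python

" Show line numbers
set nu

" Auto-indent
set autoindent

" Color scheme (my favorite)
color torte

" Highlight search matches
set hlsearch

" Show matches while the search pattern is being typed
set incsearch

" Show the command being typed, which makes it easier to follow
set showcmd

" Use a menu-style match list for command-line completion
set wildmenu

" Smart indenting?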
set smartindent


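optparse

Posted on 2012-03-13 22:11

Python has two built-in modules for handling command-line arguments:

One is getopt, also mentioned in the book Dive Into Python, which handles only simple command lines;

The other is optparse, which is powerful and easy to use and makes it easy to produce standard command-line help that follows Unix/POSIX conventions.

Example

Here is a simple example using optparse: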

				Python代码 

				from optparse import OptionParser  

				[...]  

				parser = OptionParser()  

				parser.add_option("-f", "--file", dest="filename",  

				                  help="write report to FILE", metavar="FILE")  

				parser.add_option("-q", "--quiet",  

				                  action="store_false", dest="verbose", default=True,  

				                  help="don't print status messages to stdout")  

				(options, args) = parser.parse_args()

Now you can type any of the following on the command line:

    <yourscript> --file=outfile -q
    <yourscript> -f outfile --quiet
    <yourscript> --quiet --file outfile
    <yourscript> -q -foutfile
    <yourscript> -qfoutfile

All of these commands have the same effect. In addition, optparse automatically generates a help message for us:

    <yourscript> -h
    <yourscript> --help

Output:

    usage: <yourscript> [options]

    options:
      -h, --help            show this help message and exit
      -f FILE, --file=FILE  write report to FILE
      -q, --quiet           don't print status messages to stdout
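The equivalence of the five command lines above can be checked directly by handing each spelling to parse_args() as an argument list. This is only a small sketch; the helper names build_parser and parse_all are made up for illustration and are not part of optparse.

```python
from optparse import OptionParser


def build_parser():
    # Same two options as in the example above.
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="filename",
                      help="write report to FILE", metavar="FILE")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose", default=True,
                      help="don't print status messages to stdout")
    return parser


# The five spellings from the text, written as argument lists.
FORMS = [
    ["--file=outfile", "-q"],
    ["-f", "outfile", "--quiet"],
    ["--quiet", "--file", "outfile"],
    ["-q", "-foutfile"],
    ["-qfoutfile"],
]


def parse_all(forms=FORMS):
    # Parse every spelling with a fresh parser and collect what it produced.
    results = []
    for argv in forms:
        options, args = build_parser().parse_args(argv)
        results.append((options.filename, options.verbose, args))
    return results


if __name__ == "__main__":
    for result in parse_all():
        print(result)
```

Every entry should carry the same filename, a verbose flag of False, and no positional arguments.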

Basic workflow

First, import the OptionParser class and create an OptionParser object:

    from optparse import OptionParser
    [...]
    parser = OptionParser()

Then use add_option() to define the command-line options:

    parser.add_option(opt_str, ...,
                      attr=value, ...)

Each option is made up of one or more option strings plus a set of option attributes. For example, -f and --file are the short and long names of the same option:

    parser.add_option("-f", "--file", ...)

Finally, once all the options are defined, call parse_args() to parse the program's command line:

    (options, args) = parser.parse_args()

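Note: you can also pass a list of command-line arguments to parse_args(); otherwise it uses sys.argv[1:] by default.

parse_args() returns two values:

options, an object (optparse.Values) that holds the option values. Once you know an option's destination name, e.g. file, you can read its value as options.file.
args, a list of the positional arguments left over after the options have been parsed.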
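To make the two return values concrete, here is a minimal sketch that passes an explicit argument list instead of relying on sys.argv[1:]. The function name parse and the file names are placeholders chosen for this illustration.

```python
from optparse import OptionParser


def parse(argv):
    parser = OptionParser()
    # No dest is given, so the value is stored under the long option name: options.file
    parser.add_option("-f", "--file")
    options, args = parser.parse_args(argv)
    return options.file, args


if __name__ == "__main__":
    # Option value plus two positional arguments.
    print(parse(["-f", "report.txt", "input1", "input2"]))
    # Option omitted: the attribute exists but is None.
    print(parse(["input1"]))
```

An option that never appears on the command line and has no default still exists as an attribute on options, with the value None.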

Actions

action is one of the parameters of add_option(). It tells optparse what to do when it encounters that option on the command line. There is a fixed set of actions to choose from; the default is 'store', which saves the option's argument in the options object.

Example

    parser.add_option("-f", "--file",
                      action="store", type="string", dest="filename")
    args = ["-f", "foo.txt"]
    (options, args) = parser.parse_args(args)
    print(options.filename)

This prints "foo.txt".

When optparse sees '-f', it consumes the following argument, 'foo.txt', and stores it in options.filename. So after parse_args() returns, options.filename is 'foo.txt'.

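You can also set the type parameter of add_option() to other values, such as int or float:

    parser.add_option("-n", type="int", dest="num")

The default type is 'string'. As shown above, the long option name is optional. The dest parameter is optional too: if it is omitted, optparse derives the attribute name on the options object from the option name (the first long option name if there is one, otherwise the first short one).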
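The type conversion and the derived destination names can be seen in a short sketch. The options -r/--ratio and -x, and the function name parse_numbers, are invented for this example.

```python
from optparse import OptionParser


def parse_numbers(argv):
    parser = OptionParser()
    parser.add_option("-n", type="int", dest="num")   # explicit dest
    parser.add_option("-r", "--ratio", type="float")  # dest derived from the long name: ratio
    parser.add_option("-x")                           # default type string, dest derived from -x: x
    options, args = parser.parse_args(argv)
    return options.num, options.ratio, options.x


if __name__ == "__main__":
    print(parse_numbers(["-n", "42", "--ratio", "0.5", "-x", "7"]))
```

The value given to -x stays a string because no type was requested for it.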

store has two other variants, store_true and store_false, for options that take no value on the command line, such as -v and -q:

    parser.add_option("-v", action="store_true", dest="verbose")
    parser.add_option("-q", action="store_false", dest="verbose")

With these definitions, options.verbose is set to True when '-v' is seen and to False when '-q' is seen.

Other available actions include:

store_const, append, count, callback.
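As a rough illustration of three of these actions (callback is left out to keep it short), the sketch below uses option names that are purely illustrative: -I collects values, -v counts occurrences, and --fast stores a fixed constant.

```python
from optparse import OptionParser


def parse_actions(argv):
    parser = OptionParser()
    # append: every occurrence adds its argument to a list
    parser.add_option("-I", action="append", dest="include_dirs")
    # count: every occurrence increments an integer
    parser.add_option("-v", action="count", dest="verbosity", default=0)
    # store_const: the option takes no argument and stores a fixed value
    parser.add_option("--fast", action="store_const", const="fast",
                      dest="mode", default="normal")
    options, args = parser.parse_args(argv)
    return options.include_dirs, options.verbosity, options.mode


if __name__ == "__main__":
    print(parse_actions(["-I", "a", "-Ib", "-vvv", "--fast"]))
    print(parse_actions([]))
```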

Default values

add_option() accepts a default parameter for setting an option's default value. For example:

    parser.add_option("-f", "--file", action="store", dest="filename", default="foo.txt")
    parser.add_option("-v", action="store_true", dest="verbose", default=True)

Alternatively, use set_defaults():

    parser.set_defaults(filename="foo.txt", verbose=True)
    parser.add_option(...)
    (options, args) = parser.parse_args()

Generating help

Another convenient feature of optparse is that it generates the program's help message automatically. You only need to supply the help text through the help parameter of add_option():

    usage = "usage: %prog [options] arg1 arg2"
    parser = OptionParser(usage=usage)
    parser.add_option("-v", "--verbose",
                      action="store_true", dest="verbose", default=True,
                      help="make lots of noise [default]")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose",
                      help="be vewwy quiet (I'm hunting wabbits)")
    parser.add_option("-f", "--filename",
                      metavar="FILE", help="write output to FILE")
    parser.add_option("-m", "--mode",
                      default="intermediate",
                      help="interaction mode: novice, intermediate, "
                           "or expert [default: %default]")

When optparse encounters -h or --help on the command line, it calls parser.print_help() to print the program's help message:

    usage: <yourscript> [options] arg1 arg2

    options:
      -h, --help            show this help message and exit
      -v, --verbose         make lots of noise [default]
      -q, --quiet           be vewwy quiet (I'm hunting wabbits)
      -f FILE, --filename=FILE
                            write output to FILE
      -m MODE, --mode=MODE  interaction mode: novice, intermediate, or
                            expert [default: intermediate]

Note: after printing the help message, optparse exits and does not parse any further command-line arguments.

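Here is how the help message is built, step by step, using the example above:

The custom usage message:
    usage = "usage: %prog [options] arg1 arg2"
This line is printed first, before the option descriptions. optparse replaces %prog with the name of the current program, i.e. os.path.basename(sys.argv[0]).

If you do not supply a usage message, optparse defaults to "usage: %prog [options]".
You do not need to worry about line wrapping when you write the help text for an option; optparse takes care of it.
The metavar parameter of add_option() reminds users what kind of argument the option expects, e.g. metavar="MODE":
    -m MODE, --mode=MODE
Note: when metavar is not given, optparse uppercases the destination name and uses that as the meta-variable, which is why --mode is shown with MODE in the example above.
In the help text, %default inserts the option's default value.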
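The points above can be tried without running a script from the shell, because format_help() returns the help text as a string. This is a minimal sketch: the program name "demo" is a placeholder passed through the prog argument, and build_help is a made-up helper.

```python
from optparse import OptionParser


def build_help():
    # prog="demo" is a placeholder; without it %prog expands to the script name.
    parser = OptionParser(usage="usage: %prog [options] arg1 arg2", prog="demo")
    # Explicit metavar: shown verbatim in the help output.
    parser.add_option("-f", "--filename", metavar="FILE",
                      help="write output to FILE")
    # No metavar: optparse falls back to the uppercased destination name, MODE.
    parser.add_option("-m", "--mode", default="intermediate",
                      help="interaction mode [default: %default]")
    return parser.format_help()


if __name__ == "__main__":
    print(build_help())
```

The exact capitalization and line wrapping of the result depend on the Python version and the terminal width, so it may not match the listing above character for character.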

If the program has many options, you may want to group them. OptionGroup does this:

    from optparse import OptionGroup

    group = OptionGroup(parser, "Dangerous Options",
                        "Caution: use these options at your own risk.  "
                        "It is believed that some of them bite.")
    group.add_option("-g", action="store_true", help="Group option.")
    parser.add_option_group(group)

The help message then looks like this:

    usage: <yourscript> [options] arg1 arg2

    options:
      -h, --help           show this help message and exit
      -v, --verbose        make lots of noise [default]
      -q, --quiet          be vewwy quiet (I'm hunting wabbits)
      -fFILE, --file=FILE  write output to FILE
      -mMODE, --mode=MODE  interaction mode: one of 'novice', 'intermediate'
                           [default], 'expert'

      Dangerous Options:
        Caution: use of these options is at your own risk.  It is believed that
        some of them bite.
        -g                 Group option.

Showing the program version

As with the usage message, you can pass a version argument when creating the OptionParser object to set the version information of the program:

    parser = OptionParser(usage="%prog [-f] [-q]", version="%prog 1.0")

optparse then handles the --version option automatically:

    $ /usr/bin/foo --version
    foo 1.0

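Handling errors

Errors fall into program errors and user errors. The focus here is on user errors, that is, errors caused by invalid or incomplete command-line arguments. optparse detects and handles some user errors automatically:

    $ /usr/bin/foo -n 4x
    usage: foo [options]

    foo: error: option -n: invalid integer value: '4x'

    $ /usr/bin/foo -n
    usage: foo [options]

    foo: error: -n option requires an argument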

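You can also call parser.error() to report errors that your own code detects:

    (options, args) = parser.parse_args()
    [...]
    if options.a and options.b:
        parser.error("options -a and -b are mutually exclusive")

In this example, when both -a and -b are given, optparse prints the usage message followed by "options -a and -b are mutually exclusive" to stderr and exits with status 2.

If these error-handling facilities are not enough, you may need to subclass OptionParser and override its exit() and error() methods.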
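One possible shape for such a subclass is sketched below: error() is overridden to raise an exception instead of terminating the process, which can be handy in tests or in long-running programs. ParseError, RaisingOptionParser and try_parse are hypothetical names introduced only for this sketch.

```python
from optparse import OptionParser


class ParseError(Exception):
    """Hypothetical exception type used only in this sketch."""


class RaisingOptionParser(OptionParser):
    def error(self, msg):
        # The stock error() prints the usage and the message to stderr
        # and exits with status 2; here we raise instead.
        raise ParseError(msg)


def try_parse(argv):
    parser = RaisingOptionParser()
    parser.add_option("-n", type="int", dest="num")
    try:
        options, args = parser.parse_args(argv)
    except ParseError as err:
        return "error: %s" % err
    return options.num


if __name__ == "__main__":
    print(try_parse(["-n", "4"]))
    print(try_parse(["-n", "4x"]))
```

Overriding only error() leaves -h and --version untouched, since those go through exit(); whether exit() should be overridden as well depends on how the parser is used.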

A complete example

    from optparse import OptionParser
    [...]
    def main():
        usage = "usage: %prog [options] arg"
        parser = OptionParser(usage)
        parser.add_option("-f", "--file", dest="filename",
                          help="read data from FILENAME")
        parser.add_option("-v", "--verbose",
                          action="store_true", dest="verbose")
        parser.add_option("-q", "--quiet",
                          action="store_false", dest="verbose")
        [...]
        (options, args) = parser.parse_args()
        if len(args) != 1:
            parser.error("incorrect number of arguments")
        if options.verbose:
            print("reading %s..." % options.filename)
        [...]

    if __name__ == "__main__":
        main()

Posted in Python|Comments(0)

sys.getrefcount(): getting an object's current reference count

Posted on March 7, 2012 03:28

>>> a = 37
>>> import sys
>>> sys.getrefcount(a)
10
>>>

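In most cases the reference count is much higher than you would guess. For immutable data such as numbers and strings, the interpreter actively shares objects between different parts of the program to save memory. The value returned also includes the temporary reference created by passing the object to getrefcount() itself.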
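To watch the count move without the sharing effect described above, one option is to use a freshly created object that nothing else refers to. The sketch below assumes CPython, where reference counts are updated immediately; the function name refcount_demo is made up, and only the differences between readings are meaningful, not the absolute numbers.

```python
import sys


def refcount_demo():
    obj = object()                       # a fresh object that is not shared
    base = sys.getrefcount(obj)          # baseline reading
    alias = obj                          # one more name bound to the object
    with_alias = sys.getrefcount(obj)
    holder = [obj, obj]                  # two more references held by a list
    with_list = sys.getrefcount(obj)
    del alias, holder                    # drop the extra references again
    after = sys.getrefcount(obj)
    return base, with_alias, with_list, after


if __name__ == "__main__":
    print(refcount_demo())
```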

Posted in Python|Comments(0)

Python: distributing work across multiple threads

Posted on March 4, 2012 03:50

Using threading

#!/usr/bin/env python3

import optparse
import os
import queue
import threading

BLOCK_SIZE = 8000

class Worker(threading.Thread):
    
    def __init__(self, work_queue, word, number):
        super().__init__()
        self.work_queue = work_queue
        self.word = word
        self.number = number 

    def run(self):
        while True:
            try:
                filename = self.work_queue.get()
                self.process(filename)
            finally:
                self.work_queue.task_done()
    
    def process(self, filename):
        previous = ""
        try:
            with open(filename, "rb") as fh:
                while True:
                    current = fh.read(BLOCK_SIZE)
                    if not current:
                        break
                    current = current.decode("utf8", "ignore")
                    if (self.word in current or
                        self.word in previous[-len(self.word):] +
                                     current[:len(self.word)]):
                        print("{0}{1}".format(self.number, filename))
                        break
                    if len(current) != BLOCK_SIZE:
                        break
                    previous = current
        except EnvironmentError as err:
            print("{0}{1}".format(self.number, err))

    
def main():
    opts, word, args = parse_options()
    filelist = get_files(args, opts.recurse)
    work_queue = queue.Queue()
    for i in range(opts.count):
        number = "{0}: ".format(i + 1) if opts.debug else ""
        worker = Worker(work_queue, word, number)
        worker.daemon = True
        worker.start()
    for filename in filelist:
        work_queue.put(filename)
    work_queue.join()
        

def parse_options():
    parser = optparse.OptionParser(
            usage=("usage: %prog [options] word name1 "
                   "[name2 [... nameN]]\n\n"
                   "names are filenames or paths: paths only "
                   "make sense with the -r option set"))
    parser.add_option("-t", "--threads", dest="count", default=7,
            type="int",
            help=("the number of threads to use(1..20) "
                  "[default %default]"))
    parser.add_option("-r", "--recurse", dest="recurse",
            default=False, action="store_true",
            help="recurse into subdirectories")
    parser.add_option("-d", "--debug", dest="debug", default=False,
                    action="store_true")
    opts, args = parser.parse_args()
    if len(args) == 0:
        parser.error("a word and at least one path must be specified")
    elif len(args) == 1:
        parser.error("at least one path must be specified")
    if (not opts.recurse and
        not any([os.path.isfile(arg) for arg in args])):
        parser.error("at least one file must be specified: or use -r")
    if not (1 <= opts.count <=20):
        parser.error("thread count must be 1..20")
    return opts, args[0], args[1:]

def get_files(args, recurse):
    filelist = []
    for path in args:
        if os.path.isfile(path):
            filelist.append(path)
        elif recurse:
            for root, dirs, files in os.walk(path):
                for filename in files:
                    filelist.append(os.path.join(root, filename))
    return filelist

main()

Using multiprocessing

#!/usr/bin/env python3

import optparse
import os
import multiprocessing

BLOCK_SIZE = 8000

class Worker(multiprocessing.Process):
    
    def __init__(self, work_queue, word, number):
        super().__init__()
        self.work_queue = work_queue
        self.word = word
        self.number = number 

    def run(self):
        while True:
            try:
                filename = self.work_queue.get()
                self.process(filename)
            finally:
                self.work_queue.task_done()
    
    def process(self, filename):
        previous = ""
        try:
            with open(filename, "rb") as fh:
                while True:
                    current = fh.read(BLOCK_SIZE)
                    if not current:
                        break
                    current = current.decode("utf8", "ignore")
                    if (self.word in current or
                        self.word in previous[-len(self.word):] +
                                     current[:len(self.word)]):
                        print("{0}{1}".format(self.number, filename))
                        break
                    if len(current) != BLOCK_SIZE:
                        break
                    previous = current
        except EnvironmentError as err:
            print("{0}{1}".format(self.number, err))

    
def main():
    opts, word, args = parse_options()
    filelist = get_files(args, opts.recurse)
    work_queue = multiprocessing.JoinableQueue()
    for i in range(opts.count):
        number = "{0}: ".format(i + 1) if opts.debug else ""
        worker = Worker(work_queue, word, number)
        worker.daemon = True
        worker.start()
    for filename in filelist:
        work_queue.put(filename)
    work_queue.join()
        

def parse_options():
    parser = optparse.OptionParser(
            usage=("usage: %prog [options] word name1 "
                   "[name2 [... nameN]]\n\n"
                   "names are filenames or paths: paths only "
                   "make sense with the -r option set"))
    parser.add_option("-t", "--threads", dest="count", default=7,
            type="int",
            help=("the number of threads to use(1..20) "
                  "[default %default]"))
    parser.add_option("-r", "--recurse", dest="recurse",
            default=False, action="store_true",
            help="recurse into subdirectories")
    parser.add_option("-d", "--debug", dest="debug", default=False,
                    action="store_true")
    opts, args = parser.parse_args()
    if len(args) == 0:
        parser.error("a word and at least one path must be specified")
    elif len(args) == 1:
        parser.error("at least one path must be specified")
    if (not opts.recurse and
        not any([os.path.isfile(arg) for arg in args])):
        parser.error("at least one file must be specified: or use -r")
    if not (1 <= opts.count <=20):
        parser.error("thread count must be 1..20")
    return opts, args[0], args[1:]

def get_files(args, recurse):
    filelist = []
    for path in args:
        if os.path.isfile(path):
            filelist.append(path)
        elif recurse:
            for root, dirs, files in os.walk(path):
                for filename in files:
                    filelist.append(os.path.join(root, filename))
    return filelist

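# Guard the entry point so that child processes can import this module safely
if __name__ == "__main__":
    main()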

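By using forking (on systems that support it, such as Unix) or child processes (on systems that do not, such as Windows), the multiprocessing module provides thread-like functionality. Since each worker is a separate process rather than a thread sharing memory, locking is not always necessary, and the processes can run on whatever processor cores the operating system makes available. The package offers several ways of passing data between processes, including a queue that can be used to hand work to processes, just as queue.Queue can be used to hand work to threads.

The main benefit of the multiprocessing version is that on multi-core machines it has the potential to run faster than the threaded version, because its processes can run on any available processor core. Contrast this with the standard Python interpreter (written in C and sometimes called CPython), which has a GIL (Global Interpreter Lock), meaning that only one thread can execute Python code at any given moment. This restriction is an implementation detail and does not necessarily apply to other Python interpreters such as Jython.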
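Both listings above share the same skeleton: workers pull items from a queue and call task_done() for each one, while the main flow fills the queue and then waits in join(). Stripped of the file searching, that skeleton might look like the sketch below. It uses threads for brevity; run_workers and the upper-casing "work" are placeholders, and get() is kept outside the try block here so that task_done() is only called for items that were actually fetched.

```python
import queue
import threading


def run_workers(items, worker_count=3):
    work_queue = queue.Queue()
    results = []
    lock = threading.Lock()   # protects the shared results list

    def worker():
        while True:
            item = work_queue.get()          # blocks until work is available
            try:
                with lock:
                    results.append(item.upper())   # stand-in for real work
            finally:
                work_queue.task_done()       # one task_done() per get()

    for _ in range(worker_count):
        # Daemon threads do not keep the program alive once the work is done.
        threading.Thread(target=worker, daemon=True).start()
    for item in items:
        work_queue.put(item)
    work_queue.join()                        # returns when every item is marked done
    return sorted(results)


if __name__ == "__main__":
    print(run_workers(["alpha", "beta", "gamma"]))
```

The multiprocessing listing follows the same steps with multiprocessing.JoinableQueue and multiprocessing.Process; results would then have to travel back through a queue or similar channel rather than a shared list, because processes do not share memory.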

Posted in Python|Comments(0)

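Python: distributing work across multiple processes

Posted on March 1, 2012 03:41

The example below searches the files in the given directories (optionally recursing into subdirectories) for a given string and prints the names of the files that contain it. It runs the search in multiple processes by means of the subprocess module.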

grepword-p.py

#!/usr/bin/env python3

import optparse
import os
import subprocess
import sys

def main():
	child = os.path.join(os.path.dirname(__file__),
						 "grepword-p-child.py")
	opts, word, args = parse_options()
	filelist = get_files(args, opts.recurse)
	files_per_process = len(filelist) // opts.count
	start, end = 0, files_per_process + (len(filelist) % opts.count)
	number = 1
	
	pipes = []
	while start < len(filelist):
		command = [sys.executable, child]
		if opts.debug:
			command.append(str(number))
		pipe = subprocess.Popen(command, stdin=subprocess.PIPE)
		pipes.append(pipe)
		pipe.stdin.write(word.encode("utf8") + b"\n")
		for filename in filelist[start:end]:
			pipe.stdin.write(filename.encode("utf8") + b"\n")
		pipe.stdin.close()
		number += 1
		start, end = end, end + files_per_process
	while pipes:
		pipe = pipes.pop()
		pipe.wait()
	
def parse_options():
	parser = optparse.OptionParser(
			usage=("usage: %prog [options] word name1 "
				   "[name2 [... nameN]]\n\n"
				   "names are filenames or paths; paths only "
				   "make sense with the -r option set"))
	parser.add_option("-p", "--processes", dest="count", default=7,
					  type="int",
					  help=("the number of child processes to use (1..20) "
							"[default %default]"))
	parser.add_option("-r", "--recurse", dest="recurse",
					  default=False, action="store_true",
					  help="recurse into subdirectories")
	parser.add_option("-d", "--debug", dest="debug", default=False,
					  action="store_true")
	opts, args = parser.parse_args()
	if len(args) == 0:
		parser.error("a word and at least one path must be specified")
	elif len(args) == 1:
		parser.error("at least one path must be specified")
	if (not opts.recurse and 
		not any([os.path.isfile(arg) for arg in args])):
		parser.error("at least one file must be specified; or use -r")
	if not (1 <= opts.count <= 20):
		parser.error("process count must be 1..20")
	return opts, args[0], args[1:]
	
def get_files(args, recurse):
	filelist = []
	for path in args:
		if os.path.isfile(path):
			filelist.append(path)
		elif recurse:
			for root, dirs, files in os.walk(path):
				for filename in files:
					filelist.append(os.path.join(root, filename))
	return filelist
	
main()

grepword-p-child.py

#!/usr/bin/env python3
import sys

BLOCK_SIZE = 8000

number = "{0}: ".format(sys.argv[1]) if len(sys.argv) == 2 else ""
stdin = sys.stdin.buffer.read()
lines = stdin.decode("utf8", "ignore").splitlines()
word = lines[0].rstrip()

for filename in lines[1:]:
	filename = filename.rstrip()
	previous = ""
	try:
		with open(filename, "rb") as fh:
			while True:
				current = fh.read(BLOCK_SIZE)
				if not current:
					break
				current = current.decode("utf8", "ignore")
				if (word in current or 
					word in previous[-len(word):] +
							current[:len(word)]):
					print("{0}{1}".format(number, filename))
					break
				if len(current) != BLOCK_SIZE:
					break
				previous = current
	except EnvironmentError as err:
		print("{0}{1}".format(number, err))
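The way main() in grepword-p.py divides the file list among the child processes can be looked at in isolation. The sketch below lifts the same slicing arithmetic into a stand-alone function; split_work is a name made up for this illustration and does not appear in the listing.

```python
def split_work(filelist, count):
    # Same arithmetic as in main(): every chunk gets len // count files,
    # and the first chunk additionally takes the remainder.
    files_per_process = len(filelist) // count
    start, end = 0, files_per_process + (len(filelist) % count)
    chunks = []
    while start < len(filelist):
        chunks.append(filelist[start:end])
        start, end = end, end + files_per_process
    return chunks


if __name__ == "__main__":
    print(split_work(list(range(10)), 3))
    print(split_work(["a", "b"], 7))
```

One consequence of this scheme is that when there are fewer files than processes, all files go to a single child, because files_per_process is 0 and the first chunk absorbs the whole remainder.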

Next, we will look at distributing the work across multiple threads!

Posted in Python|Comments(0)
