Working with strings

Posted on March 12, 2013, 15:06

Ways to initialize NSString and NSMutableString

NSString *simpleString = @"This is a simple string";
NSString *anotherString = [NSString stringWithString:@"This is another simple string"];
NSString *oneMorestring = [[NSString alloc] initWithString:@"One more!"];
NSMutableString *mutableOne = [NSMutableString stringWithString:@"Mutable String"];
NSMutableString *anotherMutableOne = [[NSMutableString alloc] initWithString:@"A retained one"];
NSMutableString *thirdMutableOne = [NSMutableString stringWithString:simpleString];
 
 

CodeIgniter: Better folder setup

Posted on December 26, 2012, 15:07

This guide shows a better folder setup for CodeIgniter. It separates your application folder from the core libraries and keeps the bulk of your code outside the public HTML folder, which improves security. A separate public folder then holds the images, CSS, and JavaScript.

Step 1: Download CodeIgniter

First, download the latest CodeIgniter. Once you have the zip, unzip it and place it on your server, either on localhost using WAMP or on a PHP-enabled server.

The basic CI installation on most web servers gives you this folder structure:

[Image: default CodeIgniter folder structure]

We want this:

[Image: target folder structure with a separate public folder]

Step 2: Changing the structure

As the target folder structure shows, I created a public folder. You can create your css, images, and js folders here. Move the index.php located at the root into it. You can then delete the user_guide folder and license.txt.

Edit the index.php that you moved to the public folder, changing

$system_path = 'system';

to

$system_path = '../system';

Then change the application path from

$application_folder = "application";

to

$application_folder = "../application";

Then create a .htaccess file in the root folder to remove index.php from the URL.

RewriteEngine on
RewriteCond $1 !^(index\.php|images|robots\.txt)
RewriteRule ^(.*)$ ci/index.php/$1 [L]

 

DFA-based text filtering and replacement

Posted on October 10, 2012, 23:55

 

#encoding=utf-8
#DFA based text filter
#author=sunjoy
#version=0.3
class GFW(object):
    def __init__(self):
        self.d = {}
    
    #give a list of sensitive words
    def set(self,keywords):
        p = self.d
        q = {}
        k = ''
        for word in keywords:
            word += chr(11)
            p = self.d
            for char in word:
                char = char.lower()
                if p=='':
                    q[k] = {}
                    p = q[k]
                if not (char in p):
                    p[char] = ''
                    q = p
                    k = char
                p = p[char]
        
    
    def replace(self,text,mask):
        """
        >>> gfw = GFW()
        >>> gfw.set(["sexy","girl","love","shit"])
        >>> s = gfw.replace("Shit!,Cherry is a sexy girl. She loves python.","*")
        >>> print(s)
        *!,Cherry is a * *. She *s python.
        """
        p = self.d
        i = 0 
        j = 0
        z = 0
        result = []
        ln = len(text)
        while i+j<ln:
            #print i,j
            t = text[i+j].lower()
            #print hex(ord(t))
            if not (t in p):
                j = 0
                i += 1
                p = self.d
                continue
            p = p[t]
            j+=1
            if chr(11) in p:
                p = self.d
                result.append(text[z:i])
                result.append(mask)
                i = i+j
                z = i
                j = 0
        result.append(text[z:i+j])
        return "".join(result)
        
    def check(self,text):
        """
        >>> gfw = GFW()
        >>> gfw.set(["abd","defz","bcz"])
        >>> print(gfw.check("xabdabczabdxaadefz"))
        [(1, 3, 'abd'), (5, 3, 'bcz'), (8, 3, 'abd'), (14, 4, 'defz')]
        """
        p = self.d
        i = 0 
        j = 0
        result = []
        ln = len(text)
        while i+j<ln:
            t = text[i+j].lower()
            #print i,j,hex(ord(t))
            if not (t in p):
                j = 0
                i += 1
                p = self.d
                continue
            p = p[t]
            j+=1
            #print p,i,j
            if chr(11) in p:
                p = self.d
                result.append((i,j,text[i:i+j]))
                i = i+j
                j = 0
        return result
        
if __name__=="__main__":
    import doctest,sys
    doctest.testmod(sys.modules[__name__])
    

 

 

smallgfw: a DFA-based module for detecting and replacing sensitive words. Usage is as shown in the doctests.

>>> gfw = GFW()
>>> gfw.set(["sexy","girl","love","shit"])  # set the list of sensitive words
>>> s = gfw.replace("shit!,Cherry is a sexy girl. She loves python.","*")
>>> print(s)
*!,Cherry is a * *. She *s python.  # the masked result

>>> gfw = GFW()
>>> gfw.set(["abd","defz","bcz"])
>>> print(gfw.check("xabdabczabdxaadefz"))  # find where the sensitive words occur
[(1, 3, 'abd'), (5, 3, 'bcz'), (8, 3, 'abd'), (14, 4, 'defz')]  # e.g. (5, 3, 'bcz') is the substring of length 3 starting at index 5
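
The timings that follow compare smallgfw with Python's re module, but the benchmark script itself is not shown. A regex baseline might look roughly like the sketch below. The names build_pattern, re_check and re_replace are hypothetical, and this is an assumption about the comparison, not the author's actual test code.

```python
import re

def build_pattern(keywords):
    """Compile one case-insensitive alternation of the escaped keywords."""
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

def re_check(pattern, text):
    """Return (start, length, matched text) tuples, shaped like GFW.check."""
    return [(m.start(), len(m.group()), m.group())
            for m in pattern.finditer(text)]

def re_replace(pattern, text, mask):
    """Replace every keyword occurrence with mask, like GFW.replace."""
    return pattern.sub(mask, text)

if __name__ == "__main__":
    p = build_pattern(["sexy", "girl", "love", "shit"])
    print(re_replace(p, "Shit!,Cherry is a sexy girl. She loves python.", "*"))
    q = build_pattern(["abd", "defz", "bcz"])
    print(re_check(q, "xabdabczabdxaadefz"))
```

On the two doctest inputs this gives the same results as GFW. With overlapping keywords the two approaches may choose different matches.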

---

check 1 times
re cost: 0.0160000324249
smallgfw cost: 0.0160000324249
===================================
check 2 times
re cost: 0.0309998989105
smallgfw cost: 0.0
===================================
check 3 times
re cost: 0.047000169754
smallgfw cost: 0.0149998664856
===================================
check 4 times
re cost: 0.0629999637604
smallgfw cost: 0.0150001049042
===================================
check 5 times
re cost: 0.0789999961853
smallgfw cost: 0.0309998989105
===================================
check 6 times
re cost: 0.0780000686646
smallgfw cost: 0.0469999313354
===================================
check 7 times
re cost: 0.0940001010895
smallgfw cost: 0.0460000038147
===================================
check 8 times
re cost: 0.109999895096
smallgfw cost: 0.047000169754
===================================
check 9 times
re cost: 0.125
smallgfw cost: 0.0620000362396
===================================
check 10 times
re cost: 0.125
smallgfw cost: 0.077999830246
===================================
check 11 times
re cost: 0.172000169754
smallgfw cost: 0.0629999637604
===================================
check 12 times
re cost: 0.171999931335
smallgfw cost: 0.0780000686646
===================================
check 13 times
re cost: 0.18700003624
smallgfw cost: 0.077999830246
===================================
check 14 times
re cost: 0.18799996376
smallgfw cost: 0.0940001010895
===================================
check 15 times
re cost: 0.203000068665
smallgfw cost: 0.0929999351501
===================================
check 16 times
re cost: 0.219000101089
smallgfw cost: 0.109999895096
===================================
check 17 times
re cost: 0.233999967575
smallgfw cost: 0.108999967575
===================================
check 18 times
re cost: 0.25
smallgfw cost: 0.110000133514
===================================
check 19 times
re cost: 0.264999866486
smallgfw cost: 0.110000133514
===================================
check 20 times
re cost: 0.280999898911
smallgfw cost: 0.141000032425
===================================
replace 1 times
re cost: 0.0
smallgfw cost: 0.0150001049042
===================================
replace 2 times
re cost: 0.0309998989105
smallgfw cost: 0.0
===================================
replace 3 times
re cost: 0.0469999313354
smallgfw cost: 0.0160000324249
===================================
replace 4 times
re cost: 0.0620000362396
smallgfw cost: 0.0160000324249
===================================
replace 5 times
re cost: 0.0780000686646
smallgfw cost: 0.0309998989105
===================================
replace 6 times
re cost: 0.0789999961853
smallgfw cost: 0.0460000038147
===================================
replace 7 times
re cost: 0.0940001010895
smallgfw cost: 0.0469999313354
===================================
replace 8 times
re cost: 0.108999967575
smallgfw cost: 0.0469999313354
===================================
replace 9 times
re cost: 0.125
smallgfw cost: 0.0780000686646
===================================
replace 10 times
re cost: 0.141000032425
smallgfw cost: 0.0629999637604
===================================
replace 11 times
re cost: 0.155999898911
smallgfw cost: 0.0780000686646
===================================
replace 12 times
re cost: 0.156000137329
smallgfw cost: 0.077999830246
===================================
replace 13 times
re cost: 0.18799996376
smallgfw cost: 0.0780000686646
===================================
replace 14 times
re cost: 0.203000068665
smallgfw cost: 0.0939998626709
===================================
replace 15 times
re cost: 0.203000068665
smallgfw cost: 0.0940001010895
===================================
replace 16 times
re cost: 0.233999967575
smallgfw cost: 0.0939998626709
===================================
replace 17 times
re cost: 0.234000205994
smallgfw cost: 0.109999895096
===================================
replace 18 times
re cost: 0.25
smallgfw cost: 0.125
===================================
replace 19 times
re cost: 0.25
smallgfw cost: 0.125
===================================
replace 20 times
re cost: 0.296000003815
smallgfw cost: 0.125
===================================

 

After optimization with psyco

check 1 times
re cost: 0.0149998664856
smallgfw cost: 0.0
===================================
check 2 times
re cost: 0.0320000648499
smallgfw cost: 0.0
===================================
check 3 times
re cost: 0.0460000038147
smallgfw cost: 0.0
===================================
check 4 times
re cost: 0.0629999637604
smallgfw cost: 0.0
===================================
check 5 times
re cost: 0.0780000686646
smallgfw cost: 0.0160000324249
===================================
check 6 times
re cost: 0.077999830246
smallgfw cost: 0.0150001049042
===================================
check 7 times
re cost: 0.0940001010895
smallgfw cost: 0.0159997940063
===================================
check 8 times
re cost: 0.109000205994
smallgfw cost: 0.0159997940063
===================================
check 9 times
re cost: 0.125
smallgfw cost: 0.0150001049042
===================================
check 10 times
re cost: 0.125
smallgfw cost: 0.0320000648499
===================================
check 11 times
re cost: 0.139999866486
smallgfw cost: 0.0320000648499
===================================
check 12 times
re cost: 0.155999898911
smallgfw cost: 0.0310001373291
===================================
check 13 times
re cost: 0.171999931335
smallgfw cost: 0.0160000324249
===================================
check 14 times
re cost: 0.203000068665
smallgfw cost: 0.0149998664856
===================================
check 15 times
re cost: 0.219000101089
smallgfw cost: 0.0160000324249
===================================
check 16 times
re cost: 0.233999967575
smallgfw cost: 0.0160000324249
===================================
check 17 times
re cost: 0.233999967575
smallgfw cost: 0.0309998989105
===================================
check 18 times
re cost: 0.25
smallgfw cost: 0.0320000648499
===================================
check 19 times
re cost: 0.265000104904
smallgfw cost: 0.0309998989105
===================================
check 20 times
re cost: 0.28200006485
smallgfw cost: 0.0309998989105
===================================
replace 1 times
re cost: 0.0160000324249
smallgfw cost: 0.0150001049042
===================================
replace 2 times
re cost: 0.0159997940063
smallgfw cost: 0.0150001049042
===================================
replace 3 times
re cost: 0.0320000648499
smallgfw cost: 0.0149998664856
===================================
replace 4 times
re cost: 0.047000169754
smallgfw cost: 0.0
===================================
replace 5 times
re cost: 0.077999830246
smallgfw cost: 0.0
===================================
replace 6 times
re cost: 0.0940001010895
smallgfw cost: 0.0160000324249
===================================
replace 7 times
re cost: 0.0929999351501
smallgfw cost: 0.0160000324249
===================================
replace 8 times
re cost: 0.108999967575
smallgfw cost: 0.0
===================================
replace 9 times
re cost: 0.125
smallgfw cost: 0.0160000324249
===================================
replace 10 times
re cost: 0.141000032425
smallgfw cost: 0.0149998664856
===================================
replace 11 times
re cost: 0.15700006485
smallgfw cost: 0.0150001049042
===================================
replace 12 times
re cost: 0.171999931335
smallgfw cost: 0.0160000324249
===================================
replace 13 times
re cost: 0.18700003624
smallgfw cost: 0.0309998989105
===================================
replace 14 times
re cost: 0.18799996376
smallgfw cost: 0.0310001373291
===================================
replace 15 times
re cost: 0.218999862671
smallgfw cost: 0.0160000324249
===================================
replace 16 times
re cost: 0.21799993515
smallgfw cost: 0.0320000648499
===================================
replace 17 times
re cost: 0.233999967575
smallgfw cost: 0.0310001373291
===================================
replace 18 times
re cost: 0.25
smallgfw cost: 0.0309998989105
===================================
replace 19 times
re cost: 0.296999931335
smallgfw cost: 0.0320000648499
===================================
replace 20 times
re cost: 0.280999898911
smallgfw cost: 0.0310001373291
===================================

 

http://code.google.com/p/smallgfw/

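Implementing text filtering with a DFA

Posted on October 10, 2012, 23:38

DFA and text filtering


Text filtering is an essential feature of most large websites, and text-heavy sites need it even more. Designing an efficient text filtering system is therefore very important.

A brief statement of the requirement: determine which subsets of set A belong to set B. Take javaeye as an example: when a user publishes an article (set A), we need to determine whether it contains any keywords that belong to set B, where B is usually the banned-word list.

At this point, readers new to the problem may think of contains, regular expressions, and similar methods. Unfortunately, none of them is workable here. The only reasonably good algorithm is a DFA.

1. DFA overview:
Anyone who has studied compiler theory knows that, in the lexical analysis phase, turning source text into a set of tokens is done with a deterministic finite automaton (DFA). But DFAs are not used only in lexical analysis; they have a very wide range of uses, not limited to computing.

The basic function of a DFA is to derive the next state from an event and the current state, that is, event + state = next state.
Let's look at a state diagram of the kind you can find anywhere: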


[Figure: state-transition diagram with states S, U, V, Q and inputs a, b]





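Uppercase letters are states and lowercase letters are actions (inputs). We can see that S+a=U, U+a=Q, S+b=V, and so on. In general, the whole set of state transitions can be written as a matrix:
---------------
State \ Input   a       b
S               U       V
U               Q       V
V               U       Q
Q               Q       Q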
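
As a minimal sketch, the matrix above can be written as a nested dictionary and driven one character at a time. The names TRANSITIONS and run_dfa are made up for this illustration, and the start state is assumed to be S.

```python
# Transition table taken from the matrix: TRANSITIONS[state][char] -> next state.
TRANSITIONS = {
    "S": {"a": "U", "b": "V"},
    "U": {"a": "Q", "b": "V"},
    "V": {"a": "U", "b": "Q"},
    "Q": {"a": "Q", "b": "Q"},
}

def run_dfa(text, start="S"):
    """Feed each character of text to the DFA and return the final state."""
    state = start
    for ch in text:
        state = TRANSITIONS[state][ch]  # event + state = next state
    return state

if __name__ == "__main__":
    print(run_dfa("a"))    # S + a
    print(run_dfa("ab"))   # then U + b
    print(run_dfa("aa"))   # S + a, then U + a
```

Each step is a single table lookup, which is why a DFA needs almost no computation per input character.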

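However, a state graph can be represented by many data structures; the matrix above is just a simple example that is easy to understand. The text filtering system described in the rest of this article uses a different data structure to implement the automaton.

2. Text filtering
In a text filtering system that must cope with high concurrency, one important goal is to do as little computation as possible. A DFA involves almost no computation, only state transitions. Building the banned-word list into a state machine with a matrix is cumbersome, so below is a simpler implementation: a tree structure.

Every banned word is essentially a sequence of ASCII codes, and so is the text to be filtered. For example:
Input: A=[101,102,105,97,98,112,110]
Banned-word list:
[102,105]
[98,112]
Our task is to build a DFA from these two banned words, so that the input A can be checked for banned words by making transitions on that DFA.

The tree implementation of this DFA rests on the relationship between an array's index and its value (the double-array trie rests on the same idea).
So 102 can be viewed as an array index, and 105 is an index into the next array that index 102 points to. Nothing follows 105, which means the banned word ends there.

In this way we can construct a tree representation of the DFA.

Then we walk through every byte of the input text and make state transitions in the DFA to determine whether a banned word appears in the input text.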
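
To make the index/value idea concrete, here is a small sketch that uses the example input and banned words from above. It uses nested dictionaries instead of 256-slot arrays to keep it short, so it is an illustration of the idea, not the author's data structure. END, build_trie and find_words are hypothetical names.

```python
END = "end"  # hypothetical marker key: a banned word ends at this node

def build_trie(words):
    """Build a nested-dict trie; every other key is a byte value (0-255)."""
    root = {}
    for word in words:
        node = root
        for byte in word:
            node = node.setdefault(byte, {})
        node[END] = True
    return root

def find_words(data, root):
    """Return (start, length) for every banned word found in data."""
    hits = []
    for start in range(len(data)):
        node = root
        for offset in range(start, len(data)):
            node = node.get(data[offset])  # one state transition per byte
            if node is None:
                break
            if END in node:
                hits.append((start, offset - start + 1))
    return hits

if __name__ == "__main__":
    banned = [bytes([102, 105]), bytes([98, 112])]
    text = bytes([101, 102, 105, 97, 98, 112, 110])
    print(find_words(text, build_trie(banned)))
```

For the sample input this reports one hit starting at index 1 and one at index 4, each of length 2.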

 

# encoding: UTF-8
# @author: ahuaxuan
# @date: 2009-02-20
# Updated to run on Python 3; text.txt and words.txt are read as raw bytes.
from time import time

# A node is [children, is_end]: children is a 256-slot list indexed by byte
# value, and is_end is 1 when a banned word ends at this node. A child slot
# holds None (no transition), 1 (a word ends here and nothing longer follows),
# or another node.
wordTree = [None for x in range(256)]
nodeTree = [wordTree, 0]

def readInputText():
    # Read the whole text to scan as bytes.
    with open('text.txt', 'rb') as f:
        return f.read()

def createWordTree():
    awords = []
    with open('words.txt', 'rb') as f:
        for b in f:
            awords.append(b.strip())

    for word in awords:
        temp = wordTree
        for a in range(0, len(word)):
            index = word[a]  # indexing bytes yields an int in Python 3
            if a < (len(word) - 1):
                if temp[index] is None:
                    temp[index] = [[None for x in range(256)], 0]
                elif temp[index] == 1:
                    # A shorter word ends here: keep the flag, add a subtree.
                    temp[index] = [[None for x in range(256)], 1]
                temp = temp[index][0]
            elif isinstance(temp[index], list):
                # A longer word already passes through this slot: mark the
                # end without discarding its subtree.
                temp[index][1] = 1
            else:
                temp[index] = 1

def searchWord(data):
    temp = nodeTree
    words = []
    word = []
    a = 0
    while a < len(data):
        index = data[a]
        temp = temp[0][index]
        if temp is None:
            # Mismatch: go back to the byte after the current start.
            temp = nodeTree
            a = a - len(word)
            word = []
        elif temp == 1 or temp[1] == 1:
            # A banned word ends here: record it, restart after its first byte.
            word.append(index)
            words.append(word)
            a = a - len(word) + 1
            word = []
            temp = nodeTree
        else:
            word.append(index)
        a = a + 1

    return words

if __name__ == '__main__':
    input2 = readInputText()
    createWordTree()
    begin = time()
    list2 = searchWord(input2)
    print("cost time : ", time() - begin)
    print('Found', len(list2), 'matches')
    counts = {}
    for w in list2:
        # Assumes the files are UTF-8; undecodable bytes are replaced.
        word = bytes(w).decode('utf-8', errors='replace')
        counts[word] = counts.get(word, 0) + 1

    for key, value in counts.items():
        print(key, value)

 

The input text is this article itself (excluding the sample results below).
Sample output:


  1. python 5  
  2. 违禁词 12  
  3. DFA 12  
  4. ahuaxuan 3  

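Of course, the algorithm above is implemented in Python only to make it easy to understand. In practice Python is far too slow: with the same banned-word list and the same input text, the Python version was about 40 times slower than a Java version. In theory, calling a C implementation from Python would be the better fit for this feature.


The bigger drawback of this approach is its high memory usage: many array elements are None, and the space for those references consumes a lot of memory, growing in proportion to the length and number of banned words. A better way is to implement the DFA with a double array, which uses less memory while the basic principle stays the same: the state machine's transitions are driven by the arithmetic relationship between index and value in the two arrays.

The attachment includes a banned-word list and input text for testing; you can run it to see the result.

 

Source: http://ahuaxuan.iteye.com/blog/336577?page=4#comments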

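Flushing the output buffer immediately

Posted on October 6, 2012, 19:19

<?php
set_time_limit(10);     // on most non-Windows systems, time spent in sleep() does not count toward this limit
if (ob_get_level() > 0) {
    ob_end_clean();     // turn off output buffering before the output loop
}

echo str_pad('',1024);     // browsers show nothing until they have received a certain amount of output: 256 bytes for IE, 1024 for Firefox
for($i=1;$i<=100;$i++){
  echo $i.'<br/>';
  flush();    // flush the output buffer
  sleep(1);
}
?>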

 

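Licenses

Posted on October 3, 2012, 22:34

1 Licenses

1.1 BSD

The BSD license gives users a great deal of freedom. Users can do essentially whatever they want: they are free to use and modify the source code, and to redistribute the modified code as open source or proprietary software. The precondition is that when you distribute code under the BSD license, or build your own product on top of BSD-licensed code, you must meet three conditions:

  • If the redistributed product includes source code, the source must carry the BSD license from the original code.
  • If you redistribute only binary libraries/software, the documentation and copyright notices of that library/software must include the BSD license from the original code.
  • The names of the open-source code's authors/organizations and the name of the original product may not be used for marketing.

The BSD license encourages code sharing but requires respect for the authors' copyright. Because it allows users to modify and redistribute the code, and to develop, release, and sell commercial software built on BSD code, it is very friendly to commercial integration. Many companies prefer BSD-licensed projects when choosing open-source products, because they gain full control over the third-party code and can modify it or build on it when necessary.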

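1.2 Apache License 2.0 (APL 2.0)

The Apache License is the license used by Apache, the well-known non-profit open-source organization. It is similar to BSD: it likewise encourages code sharing and respect for the original authors' copyright, and likewise allows code to be modified and redistributed (as open source or commercial software). The conditions are also similar to BSD's:

  • Include a copy of the Apache License from the original code.
  • If you modify the code, state so in the modified files.
  • Derived code (modifications and code derived from the source) must carry the license, trademark, and patent notices from the original code, along with any other notices the original authors require.
  • If the original work includes a NOTICE file, the redistributed product must carry the attribution notices from that file. You may add your own attribution notices to NOTICE, but they must not read as modifying the Apache License.

The Apache License is also friendly to commercial use. Users can modify the code to meet their needs and release or sell the result as an open-source or commercial product.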

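1.3 GPL

Linux, which we all know well, uses the GPL. The GPL is very different from licenses such as BSD and Apache that encourage code reuse. Its starting point is that the code is open source and free to use, and that code which references, modifies, or derives from it is open source and free to use as well; it does not allow modified or derived code to be released and sold as closed-source commercial software. This is why we can use so many free Linux distributions, including those from commercial companies, and the wide variety of free software for Linux developed by individuals, organizations, and commercial software companies. The main point of the GPL is that if a software product uses a GPL-licensed product ("uses" meaning linking against it as a library, modifying its code, or deriving code from it), that software product must also adopt the GPL, that is, it must also be open source and free. This is the so-called "viral" nature of the GPL. Using a GPL-licensed product as a standalone product poses no problem at all, and you also benefit from it being free. Because the GPL strictly requires software that uses GPL libraries to adopt the GPL itself, GPL-licensed open-source code is not suitable as a library or as a basis for further development for commercial software or for departments with confidentiality requirements for their code. Other details, such as the requirement to ship the GPL text when redistributing, are similar to BSD/Apache.

1.4 LGPL

The LGPL is an open-source license derived from the GPL and designed mainly for libraries. Unlike the GPL, which requires any software that uses, modifies, or derives from a GPL library to adopt the GPL, the LGPL allows commercial software to use an LGPL library by linking to it without having to open-source the commercial code. This lets LGPL-licensed open-source code be referenced as a library by commercial software that is then released and sold. However, if you modify LGPL-licensed code or derive from it, all modified code, any additional code touching the modified parts, and the derived code must be released under the LGPL. LGPL-licensed code is therefore well suited to being used as a third-party library by commercial software, but not to commercial software that wants to build on LGPL code through modification and derivation. Both the GPL and the LGPL protect the original authors' intellectual property and prevent others from copying the open-source code to develop similar products. NOTE(dirlt): glibc, for example.

1.5 MIT

MIT is a license as permissive as BSD: the author only wants to retain copyright and imposes no other restrictions. In other words, you must include the original license notice in your distribution, whether you distribute in binary or source form.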

 

Priority queue

Posted on September 26, 2012, 19:00
#!/usr/bin/env python3
import queue
import threading

class Job():
    def __init__(self, priority, description):
        self.priority = priority
        self.description = description
        print("New job:", description)

    def __lt__(self, other):
        return self.priority < other.priority

q = queue.PriorityQueue()
q.put(Job(3, "Mid-level job"))
q.put(Job(10, "Low-level job"))
q.put(Job(1, "Important job"))

def process_job(q):
    while True:
        next_job = q.get()
        print("Processing job:", next_job.description)
        q.task_done()

workers = [threading.Thread(target=process_job, args=(q,)),
           threading.Thread(target=process_job, args=(q,)),
          ]

for w in workers:
    w.daemon = True  # setDaemon() is deprecated in newer Python 3 releases
    w.start()

q.join()

Function pointers

Posted on September 26, 2012, 00:05
#include <stdio.h>
#include <string.h>

int NUM_ADS = 7;
char *ADS[] = {
    "William: SBM GSOH likes sports, TV, dining",
    "Matt: SWM NS likes art, movies, theater",
    "Luis: SLM ND likes books, theater, art",
    "Mike: DWM DS likes trucks, sports and bieber",
    "Peter: SAM likes chess, working out and art",
    "Josh: SJM likes sports, movies and theater",
    "Jed: DBM likes theater, books and dining"
};

int sports_no_bieber(char *s)
{
    return strstr(s, "sports") && !strstr(s, "bieber");
}

int sports_or_workout(char *s)
{
    return strstr(s, "sports") || strstr(s, "working out");
}

int ns_theater(char *s)
{
    return strstr(s, "NS") && strstr(s, "theater");
}

int arts_theater_or_dining(char *s)
{
    return strstr(s, "arts") || strstr(s, "theater") || strstr(s, "dining");
}

void find(int (*match) (char *))
{
    int i;
    puts("Search results: ");
    puts("------------------------------------");

    for (i = 0; i < NUM_ADS; i++)
    {
        if (match(ADS[i]))
        {
            printf("%s\n", ADS[i]);
        }
    }

    puts("------------------------------------");
}

int main()
{
    find(sports_no_bieber);
    find(sports_or_workout);
    find(ns_theater);
    find(arts_theater_or_dining);

    return 0;
}

difflib

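Posted on September 24, 2012, 16:12

difflib is Python's standard-library module for comparing sequences (for example, lists of strings).

Three of its classes are:
1> SequenceMatcher: compares sequences of any type, as long as the elements are hashable (strings work too)
2> Differ: compares sequences of text lines
3> HtmlDiff: renders the comparison result as HTML.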
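
A short sketch of the three classes in use may help. The sample data (old, new and the two short strings) is made up for illustration.

```python
import difflib

old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n"]

# SequenceMatcher: similarity ratio between two sequences (here, two strings).
ratio = difflib.SequenceMatcher(None, "abcd", "abed").ratio()
print(ratio)

# Differ: line-by-line delta; each line is prefixed with "  ", "- ", "+ " or "? ".
delta = list(difflib.Differ().compare(old, new))
print("".join(delta), end="")

# HtmlDiff: the same comparison rendered as an HTML table.
html = difflib.HtmlDiff().make_table(old, new)
print(html[:40])
```

make_table returns only the table markup; make_file wraps it in a complete HTML page.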


Library Example - re

Posted on September 17, 2012, 19:54

1. Finding patterns in text

#!/usr/bin/env python3
import re

pattern = "this"
text = "Does this text match the pattern?"

match = re.search(pattern, text)

s = match.start()
e = match.end()

print('Found "{0:s}"\nin "{1:s}"\nfrom {2:d} to {3:d} ("{4:s}")'.format(
     match.re.pattern, match.string, s, e, text[s:e]))

 

2. Compiling expressions

#!/usr/bin/env python3
import re

# Precompile the patterns
regexes = [ re.compile(p) 
            for p in ["this", "that"]
          ]

text = "Does this text match the pattern?"

print("Text: {0!r}\n".format(text))

for regex in regexes:
    print("Seeking \"{0}\" -> ".format(regex.pattern), end='')

    if regex.search(text):
        print("Match")
    else:
        print("No Match")

3. Multiple matches

The findall() function returns all non-overlapping substrings of the input that match the pattern.

#!/usr/bin/env python3
import re

text = "abbaaabbbaaaaa"
pattern = "ab"

for match in re.findall(pattern, text):
    print("Found \"{0}\"".format(match))

Found "ab"
Found "ab"

finditer() returns an iterator that yields Match instances.

#!/usr/bin/env python3
import re

text = "abbaaabbbbaaaa"
pattern = "ab"

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print("Found \"{0}\" at {1:d}:{2:d}".format(text[s:e], s, e))

Found "ab" at 0:2
Found "ab" at 5:7
 

4. Pattern syntax

#!/usr/bin/env python3
import re

def test_patterns(text, patterns=()):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print("Pattern {0!r} ({1})\n".format(pattern, desc))
        print("  {0!r}".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print("  {0}{1!r}".format(prefix, substr))
        print()
    return

if __name__ == "__main__":
    test_patterns('abbaaabbbbaaaaa',
                  [('ab', "'a' followed by 'b'"),])

    test_patterns('abbaabba',
        [('ab*',     'a followed by zero or more b'),
         ('ab+',     'a followed by one or more b'),
         ('ab?',     'a followed by zero or one b'),
         ('ab{3}',   'a followed by three b'),
         ('ab{2,3}', 'a followed by two to three b')])

Pattern 'ab' ('a' followed by 'b')

  'abbaaabbbbaaaaa'
  'ab'
  .....'ab'

Pattern 'ab*' (a followed by zero or more b)

  'abbaabba'
  'abb'
  ...'a'
  ....'abb'
  .......'a'

Pattern 'ab+' (a followed by one or more b)

  'abbaabba'
  'abb'
  ....'abb'

Pattern 'ab?' (a followed by zero or one b)

  'abbaabba'
  'ab'
  ...'a'
  ....'ab'
  .......'a'

Pattern 'ab{3}' (a followed by three b)

  'abbaabba'

Pattern 'ab{2,3}' (a followed by two to three b)

  'abbaabba'
  'abb'
  ....'abb'
 

Note: for non-greedy matching, append a question mark (?) to the repetition metacharacter.
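
A quick comparison of greedy and non-greedy repetition, as a minimal sketch on a made-up sample string:

```python
import re

text = "abbaabbba"

greedy = re.findall("ab+", text)    # + takes as many b's as it can
lazy = re.findall("ab+?", text)     # +? stops after the first b
lazy_star = re.findall("ab*?", text)  # *? may match zero b's

print(greedy)
print(lazy)
print(lazy_star)
```

The greedy pattern returns the full runs of b's, while the non-greedy forms return the shortest substring that still satisfies the pattern.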

 

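Character sets

A character set is a group of characters, any one of which can match at the corresponding position in the pattern. For example, [ab] matches either a or b.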

test_patterns(
        'abbaabbba',
        [
            ("[ab]",    "either a or b"),
            ("a[ab]+",  "a followed by 1 or more a or b"),
            ("a[ab]+?", "a followed by 1 or more a or b, not greedy"),
        ]
    )

Pattern '[ab]' (either a or b)

  'abbaabbba'
  'a'
  .'b'
  ..'b'
  ...'a'
  ....'a'
  .....'b'
  ......'b'
  .......'b'
  ........'a'

Pattern 'a[ab]+' (a followed by 1 or more a or b)

  'abbaabbba'
  'abbaabbba'

Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

  'abbaabbba'
  'ab'
  ...'aa'