MARISA-TrieをコマンドラインとPythonから使う

概要

Matching Algorithm with Recursively Implemented StorAge (MARISA) という Trie に対する高い空間効率とそれなりの時間効率を実現するデータ構造があります。
動的な更新には対応していませんが

辞書引き（Lookup） : 入力文字列が登録されているかどうかを確認
逆引き（Reverse Lookup） : 入力された ID から登録文字列を復元
Common Prefix Search : 入力文字列の前半部分に一致する登録文字列を検索
Predictive Search : 入力文字列で始まる登録文字列を検索

といった操作が可能です。

今回は marisa-trie 0.2.4 をベースに

コマンドラインプログラム
Python バインディング

から MARISA を触ってみます。

Install MARISA

環境は Amazon Linux とします。

まずはC++のプログラムビルドに必要なツール群をインストールします。

$ sudo yum install -y gcc gcc-c++

次に C++ プログラムのよくある手順でインストールします。

$ wget https://marisa-trie.googlecode.com/files/marisa-0.2.4.tar.gz
$ tar zxf marisa-0.2.4.tar.gz
$ cd marisa-0.2.4
$ ./configure
...
marisa 0.2.4 configuration:
-------------------------------
  HOST:      x86_64-unknown-linux-gnu
  CXX:       g++
  CXXFLAGS:  -g -O2
  LDFLAGS:
  PREFIX:    /usr/local

  SSE2:      no
  SSE3:      no
  SSSE3:     no
  SSE4.1:    no
  SSE4.2:    no
  SSE4a:     no
  POPCNT:    no
$ make
$ make check
...
==================
All 5 tests passed

...
$ sudo make install
$ ls -1 /usr/local/bin/
marisa-benchmark
marisa-build
marisa-common-prefix-search
marisa-dump
marisa-lookup
marisa-predictive-search
marisa-reverse-lookup

最後に Python バインディングをインストールします。

marisa-trie には SWIG ベースのバインディングもありますが、今回は pip ベースでインストールできるサードパーティーの marisa-trie をインストールします。

$ pip install marisa-trie

コマンドラインから MARISA を触る

辞書の構築

まずは辞書を作ります。入力データはスペルチェック用の dict を使います。

$ marisa-build  words.dict
#keys: 479829
#nodes: 618295
size: 1429192

Lookup

入力文字列が登録されているかどうかを確認します。

$ marisa-lookup  words.dict
monkey
141678    monkey

Reverse Lookup

入力された ID から登録文字列を復元します。

$ marisa-reverse-lookup words.dict
141678
141678	monkey

Common Prefix Search
入力文字列の前半部分に一致する登録文字列を検索します。

$ marisa-common-prefix-search words.dict
monkey
5 found
6    m    monkey
191    mo    monkey
2637    mon    monkey
16164    monk    monkey
141678    monkey    monkey

Predictive Search

入力文字列で始まる登録文字列を検索します。

出力は 3語(-n 3)にしぼります。

$ marisa-predictive-search -n 3 words.dict
monk
59 found
16164	monk	monk
141678	monkey	monk
341913	monkey-face	monk

Python から MARISA を触る

次に Python から MARISA を触ります。
先ほどのコマンドラインの操作を Python に読み替えていきます。

辞書の構築

まずは辞書を作ります。入力データはスペルチェック用の dict を使います。

In [1]: import marisa_trie

In [2]: with open("/usr/share/dict/words") as f:
   ...:     trie = marisa_trie.Trie(f.read().splitlines())
   ...:

Lookup

入力文字列が登録されているかどうかを確認します。

以下のいずれかの方法で確認できます。

In [4]: u'awaken' in trie
Out[4]: True

In [5]: trie[u'awaken']
Out[5]: 130909

キーは unicode 型で渡します。string 型で渡すと TypeError が発生します。

In [8]: trie['awaken']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 trie['awaken']

TypeError: Argument 'key' has incorrect type (expected unicode, got str)

存在しないキーを渡すと KeyError が発生します。

In [9]: trie[u'awaken-not-exist']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
 in ()
----> 1 trie[u'awaken-not-exist']

marisa_trie.pyx in marisa_trie.Trie.__getitem__ (src/marisa_trie.cpp:5920)()

marisa_trie.pyx in marisa_trie.Trie.key_id (src/marisa_trie.cpp:5787)()

KeyError: u'awaken-not-exist'

Reverse Lookup

入力された ID から登録文字列を復元します。

In [7]: trie.restore_key(130909)
Out[7]: u'awaken'

Common Prefix Search

入力文字列の前半部分に一致する登録文字列を検索します。

In [10]: trie.prefixes(u'awaken')
Out[10]: [u'a', u'aw', u'awa', u'awake', u'awaken']

イテレーター操作もできます。3件だけ表示させてみましょう。

In [11]: prefixes_iterator = trie.iter_prefixes(u'awaken')

In [12]: for i in range(3):print prefixes_iterator.next()
a
aw
awa

Predictive Search

入力文字列で始まる登録文字列を検索します。

In [13]: trie.keys(u'awaken')
Out[13]:
[u'awaken',
 u'awakener',
 u'awakeners',
 u'awakened',
 u'awakening',
 u'awakeningly',
 u'awakenings',
 u'awakenable',
 u'awakenment',
 u'awakens']

イテレーター操作もできます。3件だけ表示させてみましょう。

In [14]: keys_iterator = trie.iterkeys(u'awaken')

In [15]: for i in range(3):print keys_iterator.next()
awaken
awakener
awakeners

MARISA-TrieをコマンドラインとPythonから使う

概要

Install MARISA

コマンドラインから MARISA を触る

辞書の構築

Lookup

Reverse Lookup

Common Prefix Search
入力文字列の前半部分に一致する登録文字列を検索します。

Predictive Search

Python から MARISA を触る

辞書の構築

Lookup

Reverse Lookup

Common Prefix Search

Predictive Search

References

Leave a comment Cancel reply

概要

Install MARISA

コマンドラインから MARISA を触る

辞書の構築

Lookup

Reverse Lookup

Common Prefix Search 入力文字列の前半部分に一致する登録文字列を検索します。

Predictive Search

Python から MARISA を触る

辞書の構築

Lookup

Reverse Lookup

Common Prefix Search

Predictive Search

References

Share this:

Related

Leave a comment Cancel reply

Common Prefix Search
入力文字列の前半部分に一致する登録文字列を検索します。