ngram - 静かなる名辞
About Python and programming
2020-05-07T20:42:34+09:00
hayataka2049
Hatena::Blog
hatenablog://blog/10328537792367869878
[Common Lisp] n-grams in Common Lisp
hatenablog://entry/10328749687226162587
2017-03-12T06:53:29+09:00
2019-06-16T19:06:22+09:00
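<p> I've taken up Common Lisp as a hobby. As a first exercise, I wrote an n-gram function.</p><p> There are many ways to write it, but I still don't really understand the loop constructs, so I'll use recursion.</p>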
<pre class="code lang-lisp" data-lang="lisp" data-unlink><span class="synSpecial">(</span><span class="synStatement">defun</span> rec-ngram <span class="synSpecial">(</span><span class="synStatement">list</span> n <span class="synType">&optional</span> <span class="synSpecial">(</span>ret-list <span class="synSpecial">'()))</span>
  <span class="synComment">;; stop when fewer than n elements remain; the numbers are compared numerically, since eq is not reliable for them</span>
  <span class="synSpecial">(</span><span class="synStatement">if</span> <span class="synSpecial">(</span><span class="synStatement">></span> n <span class="synSpecial">(</span><span class="synStatement">length</span> <span class="synStatement">list</span><span class="synSpecial">))</span>
      ret-list
      <span class="synSpecial">(</span>rec-ngram <span class="synSpecial">(</span><span class="synStatement">cdr</span> <span class="synStatement">list</span><span class="synSpecial">)</span> n
                 <span class="synSpecial">(</span><span class="synStatement">nconc</span> ret-list
                        <span class="synSpecial">(</span><span class="synStatement">list</span> <span class="synSpecial">(</span><span class="synStatement">subseq</span> <span class="synStatement">list</span> <span class="synConstant">0</span> n<span class="synSpecial">))))))</span>
</pre><p> I can't tell whether this is idiomatic, but it isn't complicated. I wrote it tail-recursively; without that constraint it could be written more simply.</p><p> Usage looks like this.</p>
<pre class="code" data-lang="" data-unlink>* (rec-ngram '("吾輩" "は" "猫" "で" "ある" "。") 2)
(("吾輩" "は") ("は" "猫") ("猫" "で") ("で" "ある") ("ある" "。"))</pre><p> However, recomputing the length of list on every call seemed wasteful, so I compute the total length once up front and then just subtract 1 on each call.</p>
<pre class="code lang-lisp" data-lang="lisp" data-unlink><span class="synSpecial">(</span><span class="synStatement">defun</span> rec-ngram <span class="synSpecial">(</span><span class="synStatement">list</span> n <span class="synType">&optional</span> <span class="synSpecial">(</span><span class="synStatement">list-length</span> <span class="synStatement">nil</span><span class="synSpecial">)</span> <span class="synSpecial">(</span>ret-list <span class="synSpecial">'()))</span>
  <span class="synSpecial">(</span><span class="synStatement">if</span> <span class="synSpecial">(</span><span class="synStatement">eq</span> <span class="synStatement">list-length</span> <span class="synStatement">nil</span><span class="synSpecial">)</span>
      <span class="synSpecial">(</span><span class="synStatement">setq</span> <span class="synStatement">list-length</span> <span class="synSpecial">(</span><span class="synStatement">length</span> <span class="synStatement">list</span><span class="synSpecial">)))</span>
  <span class="synComment">;; test the tracked length here; calling length again would defeat the purpose</span>
  <span class="synSpecial">(</span><span class="synStatement">if</span> <span class="synSpecial">(</span><span class="synStatement">></span> n <span class="synStatement">list-length</span><span class="synSpecial">)</span>
      ret-list
      <span class="synSpecial">(</span>rec-ngram <span class="synSpecial">(</span><span class="synStatement">cdr</span> <span class="synStatement">list</span><span class="synSpecial">)</span> n <span class="synSpecial">(</span><span class="synStatement">-</span> <span class="synStatement">list-length</span> <span class="synConstant">1</span><span class="synSpecial">)</span>
                 <span class="synSpecial">(</span><span class="synStatement">nconc</span> ret-list
                        <span class="synSpecial">(</span><span class="synStatement">list</span> <span class="synSpecial">(</span><span class="synStatement">subseq</span> <span class="synStatement">list</span> <span class="synConstant">0</span> n<span class="synSpecial">))))))</span>
</pre><p> That gives us n-grams, but compared with the <a href="http://hayataka2049.hatenablog.jp/entry/2017/02/06/215619">Python version</a> I wrote earlier, it falls short in functionality.</p>
<ul>
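<li>It cannot take a string as its argument</li>
<li>Getting each n-gram back as a list is not very useful (a single string joined with a separator would be nicer)</li>
</ul><p> For now, I'll wrap it with a few more functions so that it matches the Python version's features.</p>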
<pre class="code lang-lisp" data-lang="lisp" data-unlink><span class="synSpecial">(</span><span class="synStatement">defun</span> ngram <span class="synSpecial">(</span>str <span class="synType">&key</span> <span class="synSpecial">(</span>n <span class="synConstant">2</span><span class="synSpecial">)</span> <span class="synSpecial">(</span>splitter <span class="synConstant">"-*-"</span><span class="synSpecial">))</span>
  <span class="synComment">;; if a string is given, convert it to a list of one-character strings</span>
  <span class="synSpecial">(</span><span class="synStatement">if</span> <span class="synSpecial">(</span><span class="synStatement">typep</span> str <span class="synSpecial">'</span><span class="synIdentifier">string</span><span class="synSpecial">)</span>
      <span class="synSpecial">(</span><span class="synStatement">setq</span> str <span class="synSpecial">(</span><span class="synStatement">mapcar</span> <span class="synType">#'string</span> <span class="synSpecial">(</span><span class="synStatement">concatenate</span> <span class="synSpecial">'</span><span class="synIdentifier">list</span> str<span class="synSpecial">))))</span>
  <span class="synComment">;; join the elements of each n-gram with the chosen separator</span>
  <span class="synSpecial">(</span><span class="synStatement">let</span> <span class="synSpecial">((</span>ngram-result <span class="synSpecial">(</span>rec-ngram str n<span class="synSpecial">)))</span>
    <span class="synSpecial">(</span><span class="synStatement">mapcar</span> <span class="synSpecial">(</span><span class="synStatement">lambda</span> <span class="synSpecial">(</span><span class="synStatement">list</span><span class="synSpecial">)</span>
              <span class="synSpecial">(</span><span class="synStatement">format</span> <span class="synStatement">nil</span> <span class="synSpecial">(</span><span class="synStatement">format</span> <span class="synStatement">nil</span> <span class="synConstant">"~~{~~A~~^~A~~}"</span> splitter<span class="synSpecial">)</span> <span class="synStatement">list</span><span class="synSpecial">))</span>
            ngram-result<span class="synSpecial">)))</span>
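</pre><p> I made n and the separator string keyword arguments. If list-length were always computed in this function, the if in rec-ngram could be dropped, but I didn't bother. Nesting format two levels deep (the inner call builds a control string such as ~{~A~^-*-~}, and the outer call applies it to each n-gram) feels pretty ugly, but I don't understand format directives well yet, so please bear with it. It can be used like this.</p>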
<pre class="code" data-lang="" data-unlink>* (ngram "吾輩は猫である。")
("吾-*-輩" "輩-*-は" "は-*-猫" "猫-*-で" "で-*-あ" "あ-*-る" "る-*-。")
* (ngram '("吾輩" "は" "猫" "で" "ある" "。"))
("吾輩-*-は" "は-*-猫" "猫-*-で" "で-*-ある" "ある-*-。")
* (ngram "吾輩は猫である。" :n 3)
("吾-*-輩-*-は" "輩-*-は-*-猫" "は-*-猫-*-で" "猫-*-で-*-あ" "で-*-あ-*-る" "あ-*-る-*-。")
* (ngram "吾輩は猫である。" :n 3 :splitter "!??!")
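("吾!??!輩!??!は" "輩!??!は!??!猫" "は!??!猫!??!で" "猫!??!で!??!あ" "で!??!あ!??!る" "あ!??!る!??!。")</pre><p> Lisp is fun to write, but compared with Python it takes more code, or rather, it feels lower-level. There are surely plenty of handy features I'm not using yet, so I expect it to get easier as I get used to it. For now, I'll keep playing with it as a hobby.</p>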
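<p> For comparison, the behavior of the ngram wrapper above can be sketched in a few lines of Python. This is only an illustration and not the code from the earlier Python post: the name ngram_sketch and its keyword defaults are assumptions chosen to mirror the Lisp keyword arguments.</p>

```python
def ngram_sketch(seq, n=2, splitter="-*-"):
    """Hypothetical Python counterpart of the Lisp ngram wrapper.

    seq may be a str (treated as characters) or a list of tokens.
    """
    # A str already behaves as a sequence of one-character strings,
    # so slicing works the same way for both input types.
    return [splitter.join(seq[i:i + n]) for i in range(len(seq) - n + 1)]


print(ngram_sketch("吾輩は猫である。"))
print(ngram_sketch(["吾輩", "は", "猫", "で", "ある", "。"], n=3, splitter="!??!"))
```

<p> No type check is needed here because a Python str can be sliced directly, which accounts for much of the difference in length.</p>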
<div class="section">
<h4>Addendum 2017/03/16</h4>
<p> I also wrote a loop version.</p>
<pre class="code lang-lisp" data-lang="lisp" data-unlink><span class="synSpecial">(</span><span class="synStatement">defun</span> loop-ngram <span class="synSpecial">(</span><span class="synStatement">list</span> n<span class="synSpecial">)</span>
  <span class="synSpecial">(</span><span class="synStatement">let</span> <span class="synSpecial">((</span>result-list <span class="synSpecial">'()))</span>
    <span class="synSpecial">(</span><span class="synStatement">dotimes</span> <span class="synSpecial">(</span><span class="synStatement">count</span> <span class="synSpecial">(</span><span class="synStatement">-</span> <span class="synSpecial">(</span><span class="synStatement">length</span> <span class="synStatement">list</span><span class="synSpecial">)</span> n <span class="synConstant">-1</span><span class="synSpecial">)</span> result-list<span class="synSpecial">)</span>
      <span class="synSpecial">(</span><span class="synStatement">setq</span> result-list <span class="synSpecial">(</span><span class="synStatement">cons</span> <span class="synSpecial">(</span><span class="synStatement">subseq</span> <span class="synStatement">list</span> <span class="synStatement">count</span> <span class="synSpecial">(</span><span class="synStatement">+</span> <span class="synStatement">count</span> n<span class="synSpecial">))</span> result-list<span class="synSpecial">)))))</span>
</pre><p> It's simple, but the result comes out in reverse order.</p>
<pre class="code" data-lang="" data-unlink>* (loop-ngram '("吾輩" "は" "猫" "で" "ある" "。") 2)
(("ある" "。") ("で" "ある") ("猫" "で") ("は" "猫") ("吾輩" "は"))</pre><p> If the reversed order bothers you, either reverse the result or build the list with nconc from the start; neither is difficult.</p>
</div>
hayataka2049
[Python] Building n-gram features in Python
hatenablog://entry/10328749687214163875
2017-02-06T21:56:19+09:00
2019-06-16T20:52:17+09:00
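<p> Some will say "package X can do that!", but rather than reading a package's help every time I want to tweak the behavior slightly, or struggling with a package that is just inflexible enough to hurt (for example, I don't want the period ending one sentence joined to the first character of the next, and there's no way to prevent it), writing it myself from the start is faster. I can edit it as much as I like.</p><p> To begin, let's write a function that splits a string, a list of morphemes, or similar into n-grams.</p>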
<pre class="code lang-python" data-lang="python" data-unlink><span class="synStatement">def</span> <span class="synIdentifier">ngram_split</span>(string, n, splitter=<span class="synConstant">"-*-"</span>):
    <span class="synConstant">"""</span>
<span class="synConstant">    string: any sliceable sequence of strings (a str, a list of tokens, ...)</span>
<span class="synConstant">    n: the n of n-gram</span>
<span class="synConstant">    splitter: any string that cannot be confused with the input</span>
<span class="synConstant">    """</span>
    lst = []
    <span class="synComment"># there are len(string) - n + 1 windows; this form also works for n = 1</span>
    <span class="synStatement">for</span> i <span class="synStatement">in</span> <span class="synIdentifier">range</span>(<span class="synIdentifier">len</span>(string) - n + <span class="synConstant">1</span>):
        lst.append(splitter.join(string[i:i+n]))
    <span class="synStatement">return</span> lst
</pre><p> The result looks like this.</p>
<pre class="code lang-python" data-lang="python" data-unlink>>>> <span class="synStatement">for</span> x <span class="synStatement">in</span> ngram_split(<span class="synConstant">"吾輩は猫である。"</span>, <span class="synConstant">3</span>):
...     <span class="synIdentifier">print</span>(x)
...
吾-*-輩-*-は
輩-*-は-*-猫
は-*-猫-*-で
猫-*-で-*-あ
で-*-あ-*-る
あ-*-る-*-。
>>> <span class="synStatement">for</span> x <span class="synStatement">in</span> ngram_split([<span class="synConstant">"吾輩"</span>,<span class="synConstant">"は"</span>,<span class="synConstant">"猫"</span>,<span class="synConstant">"で"</span>,<span class="synConstant">"ある"</span>,<span class="synConstant">"。"</span>], <span class="synConstant">3</span>):
...     <span class="synIdentifier">print</span>(x)
...
吾輩-*-は-*-猫
は-*-猫-*-で
猫-*-で-*-ある
で-*-ある-*-。
</pre><p> Not bad, but a dict seems like a more convenient return value.</p>
<pre class="code lang-python" data-lang="python" data-unlink><span class="synPreProc">from</span> collections <span class="synPreProc">import</span> defaultdict
<span class="synStatement">def</span> <span class="synIdentifier">itr_dict</span>(itr):
    d = defaultdict(<span class="synIdentifier">int</span>)
    <span class="synStatement">for</span> x <span class="synStatement">in</span> itr:
        d[x] += <span class="synConstant">1</span>
    <span class="synStatement">return</span> d
</pre><p> Define a helper like the one above, then:</p>
<pre class="code lang-python" data-lang="python" data-unlink>>>> itr_dict(ngram_split([<span class="synConstant">"吾輩"</span>,<span class="synConstant">"は"</span>,<span class="synConstant">"猫"</span>,<span class="synConstant">"で"</span>,<span class="synConstant">"ある"</span>,<span class="synConstant">"。"</span>], <span class="synConstant">3</span>))
defaultdict(<<span class="synStatement">class</span> <span class="synConstant">'int'</span>>, {<span class="synConstant">'は-*-猫-*-で'</span>: <span class="synConstant">1</span>, <span class="synConstant">'吾輩-*-は-*-猫'</span>: <span class="synConstant">1</span>, <span class="synConstant">'猫-*-で-*-ある'</span>: <span class="synConstant">1</span>, <span class="synConstant">'で-*-ある-*-。'</span>: <span class="synConstant">1</span>})
</pre><p> Using it like this should do the job.</p><br />
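<p> The standard library's collections.Counter does the same counting as itr_dict, so one possible shortcut is sketched below. ngram_split is redefined here only to keep the snippet self-contained, and ngram_counts is a hypothetical helper name that does not appear in the original code.</p>

```python
from collections import Counter


def ngram_split(string, n, splitter="-*-"):
    # Same idea as the function above: slide a window of width n.
    return [splitter.join(string[i:i + n]) for i in range(len(string) - n + 1)]


def ngram_counts(seq, n, splitter="-*-"):
    """Hypothetical helper: map each n-gram to its occurrence count."""
    return Counter(ngram_split(seq, n, splitter))


counts = ngram_counts("一人の下人が、羅生門の下で雨やみを待っていた", 2)
print(counts["の-*-下"])  # this bigram occurs twice in the sentence
```

<p> Counter is a dict subclass, so it can be used wherever the defaultdict was used.</p>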
<p> Now, consider the following data.</p>
<pre class="code lang-python" data-lang="python" data-unlink>data_lst = [<span class="synConstant">"吾輩は猫である"</span>,
<span class="synConstant">"国境の長いトンネルを抜けると雪国であった"</span>,
<span class="synConstant">"恥の多い生涯を送って来ました"</span>,
<span class="synConstant">"一人の下人が、羅生門の下で雨やみを待っていた"</span>,
<span class="synConstant">"幼時から父は、私によく、金閣のことを語った"</span>]
</pre><p> Let's turn this into character 2-gram features. Since we can already build n-grams, it's easy.</p>
<pre class="code lang-python" data-lang="python" data-unlink><span class="synPreProc">from</span> sklearn.feature_extraction <span class="synPreProc">import</span> DictVectorizer
bgram_dict_lst = [itr_dict(ngram_split(x, <span class="synConstant">2</span>)) <span class="synStatement">for</span> x <span class="synStatement">in</span> data_lst]
dict_vectorizer = DictVectorizer()
a = dict_vectorizer.fit_transform(bgram_dict_lst).toarray()
</pre><p> Some of you may be asking what DictVectorizer is, so here is the sklearn documentation.<br />
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html">sklearn.feature_extraction.DictVectorizer - scikit-learn 0.20.1 documentation</a></p><p> To show the result first, this is what gets generated. In short, it takes you from dicts straight to a feature matrix.</p>
<pre class="code lang-python" data-lang="python" data-unlink>>>> a
array([[ <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>],
[ <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>],
[ <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>],
[ <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">2.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>],
[ <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>,
<span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>, <span class="synConstant">1.</span>, <span class="synConstant">0.</span>, <span class="synConstant">0.</span>]])
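</pre><p> You could of course write something like DictVectorizer yourself. But in this kind of data processing, the part you want to tinker with is basically the preprocessing, so I feel the world's total happiness goes up if simple steps like "convert dicts to a matrix" are handed off to a package. Conversely, any preprocessing you need has to be done before (or after) handing the data to DictVectorizer.</p><br />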
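<p> To make concrete what "convert dicts to a matrix" involves, here is a minimal pure-Python sketch. It is not how DictVectorizer is implemented; dicts_to_matrix is a hypothetical name, and sorting the keys to fix the column order is an assumption made for this illustration.</p>

```python
def dicts_to_matrix(dict_lst):
    """Hypothetical helper: list of {feature: count} -> (feature names, rows)."""
    # One column per distinct key; sorting makes the column order reproducible.
    features = sorted({key for d in dict_lst for key in d})
    index = {name: col for col, name in enumerate(features)}
    rows = []
    for d in dict_lst:
        row = [0.0] * len(features)  # features absent from d stay at 0
        for key, value in d.items():
            row[index[key]] = float(value)
        rows.append(row)
    return features, rows


features, rows = dicts_to_matrix([{"a-*-b": 1, "b-*-c": 2}, {"b-*-c": 1, "c-*-d": 1}])
print(features)
print(rows)
```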
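<p> By the way, these features are usable as they are, but dividing each vector by something like the text length to convert it to relative frequencies will probably work better for machine learning.</p>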
<pre class="code lang-python" data-lang="python" data-unlink><span class="synPreProc">import</span> numpy <span class="synStatement">as</span> np
len_array = np.array([[<span class="synIdentifier">len</span>(x)]*a.shape[<span class="synConstant">1</span>] <span class="synStatement">for</span> x <span class="synStatement">in</span> data_lst])
<span class="synComment"># ^ if there is a cleaner way to do this, I'd like to know</span>
std_a = a/len_array
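</pre><p> At this point, feeding the result straight into a machine learning algorithm should work reasonably well. If you want better performance, you should also add feature selection and the like.</p>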
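<p> On the question of a cleaner way to build len_array: NumPy broadcasting lets a column vector of shape (rows, 1) divide every column of a 2-D array, so the lengths do not need to be repeated per column. The sketch below uses a small made-up matrix in place of a, and the variable names are illustrative only.</p>

```python
import numpy as np

# Toy stand-ins for the count matrix and the text lengths (made-up values).
a_demo = np.array([[1.0, 0.0, 2.0],
                   [0.0, 4.0, 0.0]])
lengths = np.array([4, 8])

# Turn the lengths into a column vector so each row is divided by its own length.
rel_freq = a_demo / lengths[:, np.newaxis]
print(rel_freq)
```

<p> Applied to the data above, the same pattern would presumably read a / np.array([len(x) for x in data_lst])[:, np.newaxis].</p>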
hayataka2049