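Pulling elements (headers) out of a TSV (CSV) file, converting their character encoding, and then doing something with them

About generators

When a function returns an ordinary value such as a list, you simply hand it back with return. When you want an object you can loop over, that is, an iterable object, you can use yield instead: it pauses the function and returns the values one at a time.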

# Define a generator function
>>> def hoge():
...     yield 1
...     yield 2
...     yield 3
...
# hoge is a generator function; calling it returns a generator object
>>> hoge()
<generator object hoge at 0x104451dc0>
# To use it, call the function and keep the generator object it returns
>>> h = hoge()
>>> h
<generator object hoge at 0x1044740a0>
# next() pulls out one value at a time (h.next() also works, but only on Python 2)
>>> next(h)
1
>>> next(h)
2
>>> next(h)
3
# Once every value has been consumed, next() raises StopIteration
>>> next(h)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration
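
In practice you rarely call next() by hand: a for loop calls it for you and stops quietly when StopIteration is raised. Here is a minimal sketch of that, written for Python 3 (an assumption, since the session above is Python 2). count_up is a name made up for this example.

```python
def count_up(limit):
    """Yield 1, 2, ..., limit one value at a time (hypothetical helper)."""
    n = 1
    while n <= limit:
        yield n
        n += 1


collected = []
for value in count_up(3):  # the for loop calls next() and handles StopIteration
    collected.append(value)
print(collected)

# next() also takes a default, returned instead of raising StopIteration
exhausted = count_up(0)
print(next(exhausted, None))
```

The default argument of next() is handy when an empty generator is an expected case rather than an error.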

You can also write this as a generator expression, the generator counterpart of a list comprehension. In short, the [] just turns into ().

gen = (x for x in range(10))
next(gen)  # 0
next(gen)  # 1
...
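
To make the difference between the two bracket styles concrete, here is a small sketch for Python 3 (an assumption; the variable names are only for illustration). The list is built all at once, while the generator expression produces values only when asked.

```python
import types

squares_list = [x * x for x in range(5)]  # built eagerly, everything in memory
squares_gen = (x * x for x in range(5))   # built lazily, values produced on demand

print(isinstance(squares_gen, types.GeneratorType))

first = next(squares_gen)  # takes the first value only
rest = list(squares_gen)   # collects whatever has not been consumed yet
print(first, rest)
```

Note that rest does not include the first value: whatever has been taken from a generator is gone.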

The following entries go into more detail and are worth reading:
Reference: Python のジェネレータ (1) - 動作を試す | すぐに忘れる脳みそのためのメモ
Reference: Python3 Advent Calendar 一日目 - Python とジェネレータ関数 - プログラマのネタ帳

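The main topic

Define a function that forcibly converts cp932 data into unicode strings. The file object may be large, so to be safe with memory the function hands the result back as a generator.
The official documentation for force_unicode is Django での Unicode の扱い — Django v1.0 documentation, but it is easy to get stuck if you are not careful. I wrote about that in the supplement at the end.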

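def convert_unicode(_file, encoding='cp932'):
    u"""
    Forcibly convert the rows read from a file into unicode strings
    and return them lazily as a generator.
    """
    from django.utils.encoding import force_unicode
    for row in _file:
        # Without the source encoding as the second argument, force_unicode
        # assumes utf-8 and fails on cp932 data
        # Like a list comprehension, but wrapped in parentheses: a generator expression
        yield (force_unicode(value, encoding) for value in row)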
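
For anyone without that Django version at hand, the same idea can be sketched without Django at all. This is only a rough equivalent for Python 3 (an assumption): it uses the built-in bytes.decode as a stand-in for force_unicode, and decode_rows and raw_rows are names made up for the example.

```python
def decode_rows(rows, encoding='cp932'):
    """Lazily decode every byte-string field of every row (hypothetical helper)."""
    for row in rows:
        # One inner generator per row, so nothing is decoded until it is read
        yield (value.decode(encoding) for value in row)


# Sample data standing in for rows read from a cp932-encoded file
raw_rows = [
    ['名前'.encode('cp932'), '値'.encode('cp932')],
    ['ほげもげ'.encode('cp932'), b'1'],
]

decoded = [list(row) for row in decode_rows(raw_rows)]
print(decoded)
```

Unlike force_unicode, bytes.decode does nothing special for values that are not byte strings, so this sketch only covers the plain case.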

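Next, we actually look through the contents of the uploaded TSV file and run some processing whenever a particular string shows up. The thing that bites you if you are not careful is that a CSV or TSV file is essentially a two-dimensional array (a list of rows, each of which is a list of fields), so the for loops have to be nested.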

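import csv
# request is the Django request object available inside a view
uploaded_file = request.FILES["tsv_file"]
# It is a TSV file, so fields are tab-separated. dialect bundles the formatting rules
# (delimiter, quoting and so on); csv.excel_tab is the tab-delimited Excel dialect
# The csv module documentation has the details. I haven't read it yet myself
tsv_file = csv.reader(uploaded_file, dialect=csv.excel_tab)
# Convert to unicode and get a generator back
unicode_tsv_file = convert_unicode(tsv_file)
# A generator is iterable, so it can be looped over with for
for row in unicode_tsv_file:
    for elem in row:
        # Run some specific processing when a particular unicode string shows up
        if elem == u'ほげもげ':
            print u'hogemoge'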

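Now we can get on with the actual processing!

About destructive operations

Apparently performing a destructive operation on a generator changes its values? Concretely, I used the list function to store each row of the imported TSV file in a list (not quite a cast), and when I wrote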
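
For comparison, here is roughly how the same nested loop might look on Python 3, where csv.reader wants text rather than bytes, so the decoding happens while the file is read. This is a sketch under assumptions: an io.BytesIO object stands in for the uploaded file, and the sample data and the matches list are invented for the example.

```python
import csv
import io

# Stand-in for the uploaded file: cp932-encoded TSV bytes
raw = 'name\tvalue\nほげもげ\t1\n'.encode('cp932')
uploaded_file = io.BytesIO(raw)

# Decode on the fly; the csv docs recommend newline='' for files given to csv.reader
text_file = io.TextIOWrapper(uploaded_file, encoding='cp932', newline='')
reader = csv.reader(text_file, dialect=csv.excel_tab)

matches = []
for row_number, row in enumerate(reader):
    for column_number, elem in enumerate(row):
        # Record where the particular string shows up
        if elem == 'ほげもげ':
            matches.append((row_number, column_number))

print(matches)
```

Here each row is already a plain list of str, so no separate conversion generator is needed.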

for row in unicode_tsv_file:
    print list(row)[1]
    print list(row)[3]

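the second line, the one with [3], failed with IndexError: Out Of Range. Looking at it in pdb, once the list function had been called, row came out as [] at the next step. So I worked around it by storing the list first: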

for row in unicode_tsv_file:
    lst = list(row)
    print lst[1]
    print lst[3]

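Having written all that, though:

@xxxxxxxx A generator changes its internal state every time it is read through __iter__ or next, so it is destructive to begin with
@yyyyyyyyy You just consumed the generator, right? (You iterated it all the way to the end.)

So it looks like the generator really had hit StopIteration: the second list(row) found nothing left to read, returned an empty list, and indexing that with [3] failed.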
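
The behaviour is easy to reproduce in isolation. A minimal sketch for Python 3 (an assumption), with a made-up four-field row:

```python
# A generator expression standing in for one converted row
row = (field for field in ['a', 'b', 'c', 'd'])

first_pass = list(row)   # consumes every value
second_pass = list(row)  # the generator is already exhausted

print(first_pass)
print(second_pass)

try:
    second_pass[3]
    index_failed = False
except IndexError:
    # Same failure as in the post: nothing is left in the second list
    index_failed = True
print(index_failed)
```

Storing the result of list() once, as in the workaround, keeps the values around for as many lookups as needed.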

Addendum

If all you want is to use the first line as the header, csv.DictReader may be more convenient
csv.DictReaderが便利 - atas
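
A minimal sketch of what that could look like, for Python 3 (an assumption) and with invented sample data held in an io.StringIO instead of a real file:

```python
import csv
import io

# Tab-separated text whose first line is the header
tsv_text = 'name\tvalue\nhoge\t1\nmoge\t2\n'

reader = csv.DictReader(io.StringIO(tsv_text), dialect=csv.excel_tab)

# Each row is a mapping keyed by the header names
values_by_name = {row['name']: row['value'] for row in reader}

print(reader.fieldnames)
print(values_by_name)
```

The header row is consumed by DictReader itself, so the loop only ever sees data rows.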

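Supplement: about force_unicode

This is the implementation of force_unicode in Django 1.4. errors='strict' means an error makes it fail with an exception, while 'replace' apparently substitutes the offending characters instead. I hear that digging into this area gets quite messy*1. For now, the usage is simply: pass the string you want to convert as the first argument and the source encoding as the second.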

def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
    """
    Similar to smart_unicode, except that lazy instances are resolved to
    strings, rather than kept as lazy objects.

    If strings_only is True, don't convert (some) non-string-like objects.
    """
    # Handle the common case first, saves 30-40% in performance when s
    # is an instance of unicode. This function gets called often in that
    # setting.
    if isinstance(s, unicode):
        return s
    if strings_only and is_protected_type(s):
        return s
    try:
        if not isinstance(s, basestring,):
            if hasattr(s, '__unicode__'):
                s = unicode(s)
            else:
                try:
                    s = unicode(str(s), encoding, errors)
                except UnicodeEncodeError:
                    if not isinstance(s, Exception):
                        raise
                    # If we get to here, the caller has passed in an Exception
                    # subclass populated with non-ASCII data without special
                    # handling to display as a string. We need to handle this
                    # without raising a further exception. We do an
                    # approximation to what the Exception's standard str()
                    # output should be.
                    s = ' '.join([force_unicode(arg, encoding, strings_only,
                            errors) for arg in s])
        elif not isinstance(s, unicode):
            # Note: We use .decode() here, instead of unicode(s, encoding,
            # errors), so that if s is a SafeString, it ends up being a
            # SafeUnicode at the end.
            s = s.decode(encoding, errors)
    except UnicodeDecodeError, e:
        if not isinstance(s, Exception):
            raise DjangoUnicodeDecodeError(s, *e.args)
        else:
            # If we get to here, the caller has passed in an Exception
            # subclass populated with non-ASCII bytestring data without a
            # working unicode method. Try to handle this without raising a
            # further exception by individually forcing the exception args
            # to unicode.
            s = ' '.join([force_unicode(arg, encoding, strings_only,
                    errors) for arg in s])
    return s
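
Since the function ends up passing errors straight to decode(), the difference between 'strict' and 'replace' can be tried with decode() alone. A small sketch for Python 3 (an assumption), using cp932 bytes deliberately decoded with the wrong encoding:

```python
data = 'ほげ'.encode('cp932')  # cp932 bytes, which are not valid utf-8

try:
    data.decode('utf-8', 'strict')
    strict_failed = False
except UnicodeDecodeError:
    # 'strict' refuses to guess and raises
    strict_failed = True

# 'replace' swaps undecodable bytes for U+FFFD instead of raising
replaced = data.decode('utf-8', 'replace')

# With the right source encoding nothing goes wrong
correct = data.decode('cp932')

print(strict_failed, '\ufffd' in replaced, correct)
```

This is also why the second argument matters in convert_unicode: the default encoding is utf-8, and cp932 data does not survive that.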

*1: How Python handles Unicode, compilation, how Unicode strings are implemented (utf-32 and so on), and things like that