Hateburo: kazeburo hatenablog

SRE / 運用系小姑 / Goを書くPerl Monger

Router::BoomとRouter::Simpleの文字列エンコードまわりの動作

昨日気付いた。

Router::Simpleはいわゆるutf8 flaggedな内部文字列を渡すと、キャプチャしたテキストも内部文字列として得られるけど、Router::Boomはバイナリ列となる。

use Router::Boom;
use Router::Simple;
use Encode;
use Test::More;
use utf8;

subtest 'boom' => sub {
    my $path = '/foobarです';
    my $router = Router::Boom->new();
    $router->add('/:user', 'dispatch_user');
    my @args = $router->match($path);
    is $args[1]->{user}, 'foobarです', 'match';
    ok Encode::is_utf8($args[1]->{user}),'utf8'; 
};

subtest 'simple' => sub {
    my $path = '/foobarです';
    my $router = Router::Simple->new();
    $router->connect('/:user', {action=>'dispatch_user'});
    my $args = $router->match($path);
    is $args->{user}, 'foobarです', 'match';
    ok Encode::is_utf8($args->{user}),'utf8'; 
};

done_testing;

結果

% perl -MTest::Pretty ./boom.pl
  boom
    ✖  match
    #   Failed test 'match'
    #   at /path/to/boom.pl line 12.
    #          got: 'foobarã§ã'
    #     expected: 'foobarです'
    ✖  utf8
    #   Failed test 'utf8'
    #   at /path/to/boom.pl line 13.
  simple
    ✓  match
    ✓  utf8

KossyはRouter::Boom使っているので、得られたテキストをdecodeするようにした

通信先が明確な内部APIなどのURIを構築するときはURI.pmを使わなくても良いんじゃないかな

通信先が明確な内部APIなどのURIを構築するときはURI.pmを使わなくても良いというかURI.pmはあまり速くないので、文字列連結だけで十分だと思います

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw/:all/;
use URI;
use URI::Escape;
use URL::Encode::XS qw/url_encode/;

my $base = q!http://api.example.com!;
my $path = q!/path/to/endpoint!;
my %param = (
    token => 'bar',
    message => 'foo bar baz hoge hogehoge hogehoghoge',
);

cmpthese(timethese(-2, {
    'uri' => sub {
        my $uri = URI->new($base . $path);
        $uri->query_form(
            s_id => 1,
            type => 'foo',
            %param
        );
        $uri->as_string;
    },
    'concat' => sub {
        my @qs = (
            s_id => 1,
            type => 'foo',
            %param
        );
        my $uri = $base . $path . '?';
        while ( @qs ) {
            $uri .= shift(@qs) . '='. uri_escape(shift(@qs)) . '&'
        }
        substr($uri,-1,1,"");
        $uri;

    },
    'concat_xs' => sub {
        my @qs = (
            s_id => 1,
            type => 'foo',
            %param
        );
        my $uri = $base . $path . '?';
        while ( @qs ) {
            $uri .= shift(@qs) . '='. url_encode(shift(@qs)) . '&'
        }
        substr($uri,-1,1,"");
        $uri;
    },
}));

__END__
Benchmark: running concat, concat_xs, uri for at least 2 CPU seconds...
    concat:  2 wallclock secs ( 2.04 usr +  0.00 sys =  2.04 CPU) @ 81818.63/s (n=166910)
 concat_xs:  2 wallclock secs ( 2.17 usr +  0.00 sys =  2.17 CPU) @ 277470.51/s (n=602111)
       uri:  2 wallclock secs ( 2.20 usr +  0.00 sys =  2.20 CPU) @ 25653.18/s (n=56437)
              Rate       uri    concat concat_xs
uri        25653/s        --      -69%      -91%
concat     81819/s      219%        --      -71%
concat_xs 277471/s      982%      239%        --

application/x-www-form-urlencoded パーサーの動作を決める

深淵な理由があって2014年に application/x-www-form-urlencoded のパーサーを作ることになるとして仕様を考える

基本はW3CSPECを参考にしつつ、これまでのアプリケーションとの互換性を保つことを目標とする

  1. application/x-www-form-urlencoded ペイロードを "&" (U+0026) または ";" (U+003B) を使って分割する
  2. name-valueを格納する配列を用意
  3. 分割された文字列を次のように処理する
    1. 文字列の最初の文字が " " (U+0020) であればそれを削除
    2. 文字列に"="が含まれていれば、最初の"="までの文字をnameとし、残りの文字をvalueとする。最初の"="以降に文字がなければvalueは空文字。"="が文字列の最初の文字であればkeyを空文字とする。文字列に"="が含まれていない場合、文字列のすべてをnameとし、valueは空文字列とする。
    3. 全ての "+" (U+002B) を " " (U+0020) に入れ替える
    4. nameとvalueをunescapeし、配列に格納(push)する
  4. 配列を返す

テストデータはこんな感じになるかな

'a=b&c=d'     => ["a","b","c","d"]
'a=b;c=d'     => ["a","b","c","d"]
'a=1&b=2;c=3' => ["a","1","b","2","c","3"]
'a==b&c==d'   => ["a","=b","c","=d"]
'a=b& c=d'    => ["a","b","c","d"]
'a=b; c=d'    => ["a","b","c","d"]
'a=b; c =d'   => ["a","b","c ","d"]
'a=b;c= d '   => ["a","b","c"," d "]
'a=b&+c=d'    => ["a","b"," c","d"]
'a=b&+c+=d'   => ["a","b"," c ","d"]
'a=b&c=+d+'   => ["a","b","c"," d "]
'a=b&%20c=d'  => ["a","b"," c","d"]
'a=b&%20c%20=d' => ["a","b"," c ","d"]
'a=b&c=%20d%20' => ["a","b","c"," d "]
'a&c=d'       => ["a","","c","d"]
'a=b&=d'      => ["a","b","","d"]
'a=b&='       => ["a","b","",""]
'&'           => ["","","",""]
'='           => ["",""]
''            => []

Released POSIX::strftime::Compiler. GNU C library compatible strftime for loggers and servers

I released POSIX::strftime::Compiler v0.10 to CPAN.

https://metacpan.org/release/POSIX-strftime-Compiler
https://github.com/kazeburo/POSIX-strftime-Compiler

POSIX::strftime::Compiler provides GNU C library compatible strftime(3). But this module will not affected by the system locale. This feature is useful when you want to write loggers, servers and portable applications.

For generate same result strings on any locale, POSIX::strftime::Compiler wraps POSIX::strftime and converts some format characters to perl code

$ LC_ALL=ja_JP.UTF-8 perl -Ilib -MPOSIX::strftime::Compiler -E '
say POSIX::strftime::Compiler::strftime(q!%d/%b/%Y:%T %z!,localtime);
say POSIX::strftime(q!%d/%b/%Y:%T %z!,localtime)
'
21/Jan/2014:10:48:12 +0900
21/ 1月/2014:10:48:12 +0900

And POSIX::strftime::Compiler reduce cost of setlocale.

use Benchmark qw/:all/;
use POSIX qw//;
use POSIX::strftime::Compiler;

my $fmt = '%d/%b/%Y:%T';
cmpthese(timethese(-1, {
    'compiler_function' => sub {
        POSIX::strftime::Compiler::strftime($fmt, localtime($t));
    },
    'posix_and_locale' => sub {
        my $old_locale = POSIX::setlocale(&POSIX::LC_ALL);
        POSIX::setlocale(&POSIX::LC_ALL, 'C');
        POSIX::strftime($fmt,localtime($t));
        POSIX::setlocale(&POSIX::LC_ALL, $old_locale);
    },
    'posix' => sub {
        POSIX::strftime($fmt,localtime($t));
    },
}));

benchmark result is

Benchmark: running compiler_function, posix, posix_and_locale for at least 1 CPU seconds...
compiler_function:  1 wallclock secs ( 1.10 usr +  0.00 sys =  1.10 CPU) @ 446836.36/s (n=491520)
     posix:  1 wallclock secs ( 1.01 usr +  0.00 sys =  1.01 CPU) @ 243326.73/s (n=245760)
posix_and_locale:  1 wallclock secs ( 1.10 usr +  0.00 sys =  1.10 CPU) @ 71087.27/s (n=78196)
                      Rate  posix_and_locale             posix compiler_function
posix_and_locale   71087/s                --              -71%              -84%
posix             243327/s              242%                --              -46%
compiler_function 446836/s              529%               84%                --

POSIX::strftime::Compiler's strftime is faster than POSIX::strftime, because P::s::Compiler use sprintf() if all format characters are solved by easy calculation.

I have a plan to use this module in Apache::LogFormat::Compiler.

Apache::LogFormat::Compiler v0.22 has been released. Fixed bug around DST

Apache::LogFormat::Compiler v0.22 has been released

cpan: https://metacpan.org/release/Apache-LogFormat-Compiler
github: https://github.com/kazeburo/Apache-LogFormat-Compiler

Fixed bug around Daylight Saving Time(DST).
Older version shows incorrect timezone offset, not incorporating DST.

use Time::Local;
use POSIX;
use HTTP::Message::PSGI;
use Apache::LogFormat::Compiler;
use HTTP::Request::Common;

local $ENV{TZ} = 'America/New_York';
POSIX::tzset;
my $time = timelocal(0, 0, 1, 3, 11 - 1, 2013); 

my $log_handler = Apache::LogFormat::Compiler->new();
my $req = req_to_psgi(GET "/");
my $res = [200,[],[q!OK!]];

print $log_handler->log_line($req,$res,0,2,$time);
print $log_handler->log_line($req,$res,0,2,$time+3599);
print $log_handler->log_line($req,$res,0,2,$time+3600);

__DATA__
127.0.0.1 - - [03/Nov/2013:01:00:00 -0400] "GET / HTTP/1.1" 200 0 "-" "-"
127.0.0.1 - - [03/Nov/2013:01:59:59 -0400] "GET / HTTP/1.1" 200 0 "-" "-"
127.0.0.1 - - [03/Nov/2013:01:00:00 -0500] "GET / HTTP/1.1" 200 0 "-" "-"

Many Thanks dex4er.

Apache::LogFormat::Compilerのいわゆるサマータイムのバグを直しました。サマータイムのないJSTには影響ありません

POSIX::tzset and Windows

When changing timezone in perl script. POSIX::tzset is required.

local $ENV{TZ} = 'Asia/Tokyo';
POSIX::tzset();
localtime();


But Windows does not support this.

  • old Windows dies with "not implemented" error.
  • newer Windows does not die. But timezone is not changed. tzset is supported only for subprocess


If you want to use tzset in tests. it's recommended to skip on windows like these.

eval {
    POSIX::tzset;
    die q!tzset is implemented on this Cygwin. But Windows can't change tz inside script! if $^O eq 'cygwin';
    die q!tzset is implemented on this Windows. But Windows can't change tz inside script! if $^O eq 'MSWin32';
};
if ( $@ ) {
    plan skip_all => $@;
}

see http://api.metacpan.org/source/KAZEBURO/Apache-LogFormat-Compiler-0.22/t/04_tz.t

Starlet / How to listen to Unix Domain Socket without Server::Starter

Define $ENV{SERVER_STARTER_PORT} in your script.

if (-S $socket) {
    warn "removing existing socket file:$socket";
    unlink $socket
        or die "failed to remove existing socket file:$socket:$!";
}
unlink $socket;
my $sock = IO::Socket::UNIX->new(
    Listen => Socket::SOMAXCONN(),
    Local  => $socket,
) or die "failed to listen to file $socket:$!";
$ENV{SERVER_STARTER_PORT} = $socket."=".$sock->fileno;

my $loader = Plack::Loader->load(
    'Starlet',
    max_workers => 10,
);
$loader->run($app);