GSoC Midterms, CPAN and a British Tar

First of all, all our GSoC students have passed their midterms! That is really awesome!

That means that we already have something that works when it comes to HTTP::UserAgent, and I guess that the other projects also have something deliverable.
When you install HTTP::UserAgent today, you not only get modules that provide the HTTP::Request, HTTP::Response and HTTP::Message classes, you also get http-download, http-request and http-dump, which are very nice command line tools like the ones known from Perl 5’s LWP::UserAgent distro.

In the past weeks HTTP::UserAgent was already capable of fetching simple sites, even though it was slow: it took about 20 to 30 seconds to fetch a few hundred bytes. We also had problems with chunked transfer encoding, where we don’t get the entire site at once. And finally, when you fetch text like HTML you want to get a string in the right encoding, but we always tried to decode it as utf8.

The first thing we fixed was actually the chunked transfer encoding, since my pet project P6CPAN needs to fetch the MIRRORED.BY file, which is just too big to fit into a single chunk. Fortunately, I already had a prototype in panda that does exactly that, which made it easy to port it to the sane setup in HTTP::UserAgent.
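To illustrate what reassembling a chunked body involves, here is a minimal sketch in Python. It is not the code from panda or HTTP::UserAgent; the function name decode_chunked is made up for this example, and it assumes the complete body is already in memory and ignores trailers.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a body that was sent with Transfer-Encoding: chunked."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a line holding its size in hexadecimal.
        eol = body.index(b"\r\n", pos)
        # Anything after ';' on that line is a chunk extension; ignore it.
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # last chunk; optional trailers are ignored in this sketch
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that terminates the chunk data
    return bytes(out)


# Hypothetical sample data, three chunks followed by the terminating chunk.
raw = b"4\r\nWiki\r\n5\r\npedia\r\nE\r\n in\r\n\r\nchunks.\r\n0\r\n\r\n"
decoded = decode_chunked(raw)
```

Note that the chunk data itself may contain CRLF sequences, which is why the size line has to be trusted rather than searching for the next line ending.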

After that we took a look at the encoding of strings. If we get a meaningful Content-Type header, we pull out the charset and use that to decode the body into a string, and we fall back to utf8 when the charset or the header is missing.
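The charset lookup with its utf8 fallback could look roughly like the following Python sketch. The helper names charset_of and decode_body are assumptions made for this illustration, not part of HTTP::UserAgent, and an unknown charset name would still raise an error here.

```python
def charset_of(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    if content_type:
        # Parameters follow the media type, separated by semicolons.
        for param in content_type.split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset" and value.strip():
                return value.strip().strip('"')
    return default


def decode_body(body: bytes, content_type) -> str:
    """Decode the raw body using the announced charset, else utf-8."""
    return body.decode(charset_of(content_type))


# Hypothetical header value and body, just to exercise the helpers.
text = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
```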

Fixing the poor performance was in fact not really tricky. I always suspected the IO::Socket.lines method, because it does a lot of magic to find the next delimiter in the buffer we have received so far. And since we do not actually care about lines when we want to receive the document, we do not need that kind of overhead.
The answer was simple and shows the sanity and power of Perl 6. We turned the call to .lines and the line-based regex matches into a grammar, which can match on lines like a gazillion times faster. The grammar is only in charge of parsing the headers, which meant that we needed a way to split our buffer at CRLFCRLF. I hope that this will land in the spec and in rakudo’s guts as Blob.split in the near future.
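The idea of splitting the buffer at CRLFCRLF and only parsing the header part can be sketched in Python as follows. This uses plain byte splitting instead of a grammar, so it only illustrates the split, not the approach taken in HTTP::UserAgent; split_response is a made-up name.

```python
def split_response(buf: bytes):
    """Split a raw HTTP response into status line, headers and body bytes."""
    # The first empty line (CRLFCRLF) separates the header block from the body.
    head, sep, body = buf.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("header block is not complete yet")
    # Only the header block is treated as text; the body stays binary.
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical response, only for demonstration.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html; charset=utf-8\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"hello")
status, headers, body = split_response(sample)
```

Because the body is never scanned for line endings, the cost of receiving a document no longer depends on how many lines it contains.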

So what about CPAN? As of today we can fetch the list of mirrors using the latest work on HTTP::UserAgent. We can also fetch the gzipped p6dists.json, p6provides.json and p6binaries.json files to know what is up there on CPAN:

$ panda --github --cpan search NativeCall
Resources on github:
NativeCall * Call native libraries
GTK::Simple * Simple GTK 3 binding using NativeCall
Crypt::Bcrypt 0.5.0 An implementation of bcrypt using NativeCall
Resources on CPAN:
NativeCall v1

Now that is basically it. When we fetch a tarball of, say, NativeCall, we get a tar.gz file. We can gunzip this tarball, but then what? There is no module in the Perl 6 world that can peek into tar files.
That is my current project: porting Archive::Tar from Perl 5 to Perl 6 at: https://github.com/FROGGS/p6-Archive-Tar
This is interesting in several ways. First, this is my first attempt to port a widely used Perl 5 module. Second, it will help v5 to pass more tests and to satisfy some dependencies of Perl 5 dists running on v5. And last but not least: it shows where manipulating Bufs is not ideal in rakudo. A sub called subbuf-rw got implemented and specced last week; it does for Bufs what substr-rw does for strings, or what the substr sub does in Perl 5 when used in lvalue context. Buf.split will hopefully land, and perhaps more. The mix of strings and meant-to-be-binary data in the Perl 5 version of Archive::Tar is the major problem when porting it, since in Perl 6 these are two different types that do not mix well. They probably should not, which I hope will lead to a bit more sanity there.
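To show why in-place buffer editing matters for tar files, here is a small Python sketch that writes and reads fixed-width fields of a 512-byte tar header block. It is not code from Archive::Tar; the helpers write_field and read_field and the file name are hypothetical, the slice assignment merely plays the role that subbuf-rw plays for a Buf, and the sketch skips the checksum and all other fields. The offsets used (name at byte 0 with 100 bytes, size at byte 124 with 12 bytes in octal) follow the usual tar header layout.

```python
BLOCK = 512  # tar headers are fixed-size blocks


def write_field(header: bytearray, offset: int, width: int, value: bytes) -> None:
    """Overwrite one fixed-width field in place, padding with NUL bytes."""
    if len(value) > width:
        raise ValueError("value does not fit into the field")
    # Slice assignment edits the buffer in place, like an lvalue substr.
    header[offset:offset + width] = value.ljust(width, b"\0")


def read_field(header: bytearray, offset: int, width: int) -> bytes:
    """Read one field and drop the NUL or space padding at its end."""
    return bytes(header[offset:offset + width]).rstrip(b"\0 ")


header = bytearray(BLOCK)
# Text has to be encoded explicitly before it goes into the binary block.
write_field(header, 0, 100, "lib/NativeCall.pm6".encode("ascii"))
write_field(header, 124, 12, b"%011o" % 1234)  # size is stored as octal digits

name = read_field(header, 0, 100).decode("ascii")
size = int(read_field(header, 124, 12), 8)
```

The explicit encode and decode calls mark exactly the places where the Perl 5 code can blur strings and binary data, and where a port has to decide which of the two types it is dealing with.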
