Friday, October 20, 2023

Java: streaming WARC file with jwat-warc

I had to write some Java code that reads WARC data piped in through stdin.
Here's a minimal working example using the jwat-warc library:
https://mvnrepository.com/artifact/org.jwat/jwat-warc

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.jwat.warc.WarcReader;
import org.jwat.warc.WarcReaderFactory;
import org.jwat.warc.WarcRecord;

And the minimal code piece:

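InputStream stdin = System.in;
WarcReader warcReader = WarcReaderFactory.getReader(stdin);
WarcRecord record;

while ((record = warcReader.getNextRecord()) != null) {
    InputStream contentStream = record.getPayloadContent();
    if (contentStream == null) {
        continue; // record has no payload, so there is nothing to read
    }
    // Read the payload to the end, even if the content itself isn't needed.
    // Note: InputStreamReader uses the platform default charset here.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(contentStream))) {
        StringBuilder builder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }

        // Some records (e.g. warcinfo) carry no WARC-Target-URI, so guard against null.
        String uri = record.getHeader("WARC-Target-URI") != null
                ? record.getHeader("WARC-Target-URI").value : "";
        String html = builder.toString();
        System.out.println(uri + "\t" + html.length());
    }
}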

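You have to read each record's payload to the end, as the code above does; you can't skip a record without reading it. Judging by the stack trace, getNextRecord() closes the previous record, which tries to skip() over any unread payload bytes, and stdin fed by a pipe can't seek, so it throws the following exception: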

java.io.IOException: Illegal seek
        at java.base/java.io.FileInputStream.skip0(Native Method)
        at java.base/java.io.FileInputStream.skip(Unknown Source)
        at java.base/java.io.BufferedInputStream.implSkip(Unknown Source)
        at java.base/java.io.BufferedInputStream.skip(Unknown Source)
        at java.base/java.io.FilterInputStream.skip(Unknown Source)
        at java.base/java.io.PushbackInputStream.skip(Unknown Source)
        at org.jwat.common.ByteCountingPushBackInputStream.skip(ByteCountingPushBackInputStream.java:134)
        at org.jwat.common.FixedLengthInputStream.skip(FixedLengthInputStream.java:115)
        at org.jwat.common.FixedLengthInputStream.close(FixedLengthInputStream.java:58)
        at java.base/java.io.BufferedInputStream.close(Unknown Source)
        at org.jwat.common.Payload.close(Payload.java:267)
        at org.jwat.warc.WarcRecord.close(WarcRecord.java:445)
        at org.jwat.warc.WarcReaderUncompressed.getNextRecord(WarcReaderUncompressed.java:123)
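
If you want to ignore a record's content, one option is to read and discard the payload instead of leaving it unread. The sketch below is only an illustration: StreamDrain and drain() are hypothetical names (not part of jwat-warc), and the ByteArrayInputStream stands in for the stream returned by record.getPayloadContent(). It relies on the observation above that a fully read payload does not trigger the exception.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamDrain {
    /**
     * Reads and discards everything left in the stream using read() only,
     * never skip(), and returns the number of bytes discarded.
     */
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a record payload; inside the WARC loop this would be
        // the (non-null) stream from record.getPayloadContent().
        InputStream payload = new ByteArrayInputStream(new byte[20000]);
        System.out.println(drain(payload)); // prints 20000
    }
}
```

In the record loop, calling drain(contentStream) on a record you don't care about should leave nothing for close() to skip before the next getNextRecord() call.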
