I had to write some Java code that reads WARC records piped in through stdin.
Here's a minimal working example using the jwat-warc library (https://mvnrepository.com/artifact/org.jwat/jwat-warc). First the imports:
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.jwat.warc.WarcReader;
import org.jwat.warc.WarcReaderFactory;
import org.jwat.warc.WarcRecord;
And the minimal code piece:
// Runs inside a method that declares "throws IOException".
InputStream stdin = System.in;
WarcReader warcReader = WarcReaderFactory.getReader(stdin);
WarcRecord record;
while ((record = warcReader.getNextRecord()) != null) {
    InputStream contentStream = record.getPayloadContent();
    // Read the payload to the end; closing the reader also closes the payload stream.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(contentStream))) {
        StringBuilder builder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
        // Not every record has a target URI (warcinfo records do not), so guard against null.
        String uri = record.getHeader("WARC-Target-URI") != null
                ? record.getHeader("WARC-Target-URI").value
                : "";
        String html = builder.toString();
        System.out.println(uri + "\t" + html.length());
    }
}
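The line-based loop above decodes with the platform default charset and rewrites every line ending as "\n", so html.length() can differ from the payload as stored. If that matters, a sketch like the following reads the stream in one go. It assumes Java 9 or newer (for InputStream.readAllBytes) and that the caller knows the charset; PayloadText and readAll are hypothetical names, not part of jwat-warc.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of jwat-warc.
final class PayloadText {
    private PayloadText() {}

    // Reads the stream to its end, closes it, and decodes the bytes with the
    // given charset. Line endings are kept exactly as they appear in the stream.
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), charset);
        }
    }
}
```

In the loop this would stand in for the BufferedReader block, for example: String html = PayloadText.readAll(contentStream, StandardCharsets.UTF_8);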
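You need to read the full content of every WARC record, as the code above does; you can't skip a record without reading it. As the stack trace shows, when getNextRecord() closes a record whose payload is still unread, jwat-warc calls skip() on the underlying stream to discard the rest, and stdin does not support that when it is a pipe, so it throws the following exception: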
java.io.IOException: Illegal seek
    at java.base/java.io.FileInputStream.skip0(Native Method)
    at java.base/java.io.FileInputStream.skip(Unknown Source)
    at java.base/java.io.BufferedInputStream.implSkip(Unknown Source)
    at java.base/java.io.BufferedInputStream.skip(Unknown Source)
    at java.base/java.io.FilterInputStream.skip(Unknown Source)
    at java.base/java.io.PushbackInputStream.skip(Unknown Source)
    at org.jwat.common.ByteCountingPushBackInputStream.skip(ByteCountingPushBackInputStream.java:134)
    at org.jwat.common.FixedLengthInputStream.skip(FixedLengthInputStream.java:115)
    at org.jwat.common.FixedLengthInputStream.close(FixedLengthInputStream.java:58)
    at java.base/java.io.BufferedInputStream.close(Unknown Source)
    at org.jwat.common.Payload.close(Payload.java:267)
    at org.jwat.warc.WarcRecord.close(WarcRecord.java:445)
    at org.jwat.warc.WarcReaderUncompressed.getNextRecord(WarcReaderUncompressed.java:123)
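For records whose content you don't need, the payload can be read and thrown away instead of collected into a string. The sketch below is plain java.io with no jwat-warc calls; StreamDrain and drain are hypothetical names. It assumes, as described above, that reading the payload stream to its end before the next getNextRecord() call is what avoids the skip() on stdin. I have only reasoned about this from the stack trace, so treat it as a starting point.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of jwat-warc.
final class StreamDrain {
    private StreamDrain() {}

    // Reads and discards everything left in the stream using read(), never
    // skip(), so it also works on streams that cannot seek.
    // Returns the number of bytes discarded.
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }
}
```

Inside the loop, a record to be ignored would be handled with something like StreamDrain.drain(record.getPayloadContent()), provided the returned stream is non-null, before moving on to the next record.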