I had to write some Java code that reads WARC data piped in through stdin.
Here is a minimal working example, using the jwat-warc library:
https://mvnrepository.com/artifact/org.jwat/jwat-warc
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.jwat.warc.WarcReader;
import org.jwat.warc.WarcReaderFactory;
import org.jwat.warc.WarcRecord;
And the minimal code itself, which belongs inside a method that declares throws IOException:
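InputStream stdin = System.in;
WarcReader warcReader = WarcReaderFactory.getReader(stdin);
WarcRecord record;
while ((record = warcReader.getNextRecord()) != null) {
    InputStream contentStream = record.getPayloadContent();
    // Skip records that lack a payload or a target URI (warcinfo records, for example).
    if (contentStream == null || record.getHeader("WARC-Target-URI") == null) {
        continue;
    }
    // Read the payload as text, line by line; closing the reader also closes the stream.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(contentStream))) {
        StringBuilder builder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
        String uri = record.getHeader("WARC-Target-URI").value;
        String html = builder.toString();
        // Print the URI and the payload length in characters, tab-separated.
        System.out.println(uri + "\t" + html.length());
    }
}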
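One caveat with the loop above: reading line by line rewrites every line ending as "\n", and InputStreamReader without an explicit charset uses the platform default, so the reported length can differ from one machine to another. Below is a rough sketch of an alternative that uses only the JDK. PayloadReader and readAll are hypothetical names of my own, not part of jwat, and decoding as UTF-8 is an assumption; real WARC payloads can be in any encoding, or binary.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper (not part of jwat): drains a stream into a String.
final class PayloadReader {

    private PayloadReader() {
    }

    // Reads every byte from the stream and decodes it with the given charset.
    // Line endings are preserved exactly as they appear in the input.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

With this in place, the BufferedReader block in the loop could be replaced by a single call such as PayloadReader.readAll(contentStream, StandardCharsets.UTF_8), which also needs an import of java.nio.charset.StandardCharsets.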