Protehnica: December 2023

Thursday, December 14, 2023

Java: streaming WARC files

In a previous post, I published a minimal Java example that reads WARC data using the "jwat-warc" library. After that, I wanted to compare the performance of the two major WARC parsing libraries ("jwat-warc" and "jwarc") in the context of a more complex process.

The difficulty in doing this is that the major libraries have different approaches to how they expect you to consume the data.

e.g.

"jwat-warc" exposes a stream of the underlying HTML, while "jwarc" exposes a stream containing both the underlying HTTP headers and the HTML.
"jwat-warc" exposes the HTTP response code, while with "jwarc" it has to be extracted from the headers.
etc.

Here is the abstract class I came up with to represent a fully loaded WARC record, independent of the underlying implementation, alongside two static factory methods for constructing an XWarcRecord instance from the library-specific WarcRecord instance:

public abstract class XWarcRecord {
    // Cached values, populated by the implementations.
    protected String _uri;
    protected String _payload;

    public abstract String responseCode();
    public abstract String payload();
    public abstract String uri();

    public static XWarcRecord from(org.jwat.warc.WarcRecord r) {
        return new XWarcRecord_JWat(r);
    }

    public static XWarcRecord from(org.netpreserve.jwarc.WarcRecord r) {
        return new XWarcRecord_JWarc(r);
    }
}

And the particular implementations, with some optimizations to defer as much processing as possible to the time when a particular data point is needed:

I. "jwat-warc"

For some reason, if you don't read the payload stream manually, the reader throws an error when advancing to the next record. So the payload has to be read in the constructor.

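import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.jwat.warc.WarcRecord;

public class XWarcRecord_JWat extends XWarcRecord {

    private final WarcRecord r;

    XWarcRecord_JWat(WarcRecord r) {
        this.r = r;
        // Read the payload eagerly, before the reader advances to the next record.
        try {
            InputStream contentStream = r.getPayloadContent();
            // StandardCharsets.UTF_8 stands in for the original Charsets.UTF_8
            // (same charset, no extra dependency).
            this._payload = new String(contentStream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException | NullPointerException e) {
            // No readable payload for this record.
            this._payload = null;
        }
    }

    @Override
    public String responseCode() {
        // Records without an HTTP header (e.g. non-response records) report "0".
        if (r.getHttpHeader() == null) {
            return "0";
        }
        return r.getHttpHeader().statusCodeStr;
    }

    @Override
    public String payload() {
        return _payload;
    }

    @Override
    public String uri() {
        // Look up the WARC header only on first access.
        if (_uri == null) {
            _uri = r.getHeader("WARC-Target-URI").value;
        }
        return _uri;
    }
}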

II. "jwarc"

The stream it exposes includes both the HTTP headers and the HTML (or other content), so we have to separate them manually.

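import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.netpreserve.jwarc.MessageBody;
import org.netpreserve.jwarc.WarcRecord;

public class XWarcRecord_JWarc extends XWarcRecord {

    private static final Pattern P = Pattern.compile("HTTP/\\d\\.\\d\\s+(\\d{3})");

    protected String _headers = null;
    protected String _responseCode = null;

    private final WarcRecord r;

    XWarcRecord_JWarc(WarcRecord r) {
        this.r = r;
    }

    @Override
    public String responseCode() {
        if (_responseCode == null) {
            // Parse the body only once: a second pass over the already
            // consumed stream would overwrite the headers and payload.
            if (_headers == null) {
                _parseContent();
            }
            _responseCode = _responseCode(this._headers);
        }
        return _responseCode;
    }

    @Override
    public String payload() {
        if (this._payload == null) {
            this._parseContent();
        }
        return this._payload;
    }

    @Override
    public String uri() {
        if (_uri == null) {
            _uri = r.headers().first("WARC-Target-URI").orElse(null);
        }
        return _uri;
    }

    // Splits the record body into the HTTP header block and the payload.
    private void _parseContent() {
        List<String> h = new ArrayList<>();
        List<String> p = new ArrayList<>();

        try (MessageBody body = r.body();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(body.stream(), StandardCharsets.UTF_8))) {
            String line;

            // Header lines run up to the first empty line.
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    break;
                }
                h.add(line);
            }

            // Everything after the empty line is the payload.
            while ((line = reader.readLine()) != null) {
                p.add(line);
            }

            this._headers = String.join("\n", h);
            this._payload = String.join("\n", p);
        } catch (IOException | NullPointerException e) {
            throw new RuntimeException(e);
        }
    }

    // Extracts the three-digit status code from the HTTP status line, or "0" if none is found.
    private static String _responseCode(String input) {
        Matcher matcher = P.matcher(input);
        if (matcher.find()) {
            return matcher.group(1);
        } else {
            return "0";
        }
    }
}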

There was no observed performance difference between the two, but coming up with a solution to abstract away the underlying implementation was an interesting exercise.
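The header/payload split and the status-code extraction used in the "jwarc" wrapper above can be tried in isolation, without either WARC library. The following is a minimal sketch that applies the same logic to a plain string; the class name HttpBodySplitDemo and its method names are made up for this illustration and are not part of "jwarc" or "jwat-warc".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of either WARC library.
class HttpBodySplitDemo {

    private static final Pattern STATUS = Pattern.compile("HTTP/\\d\\.\\d\\s+(\\d{3})");

    // Returns a two-element array: [0] = header block, [1] = payload.
    static String[] split(String raw) {
        List<String> headers = new ArrayList<>();
        List<String> payload = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            // Header lines run up to (and consume) the first empty line.
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                headers.add(line);
            }
            // The rest of the input is the payload.
            while ((line = reader.readLine()) != null) {
                payload.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] { String.join("\n", headers), String.join("\n", payload) };
    }

    // Returns the three-digit status code from the status line, or "0" if none is found.
    static String statusCode(String headers) {
        Matcher m = STATUS.matcher(headers);
        return m.find() ? m.group(1) : "0";
    }
}
```

For the input "HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n\r\n<html>\r\n</html>", split returns the two header lines as the first element and "<html>\n</html>" as the second, and statusCode on the header block returns "404". Reading line by line and re-joining with "\n" normalizes line endings, so the payload is not byte-identical to the original. Also, the pattern expects a "major.minor" version, so a status line such as "HTTP/2 200" would yield "0".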

Monday, December 11, 2023

Throttle vs Debounce

The following page offers a very nice JavaScript illustration of the difference between throttle and debounce: https://web.archive.org/web/20220128120157/http://demo.nimius.net/debounce_throttle/

The terms make sense in a context where you want to control when an "effect" is triggered, based on the timing of its "cause" (I am using "cause" and "effect" in a very general sense).

You "throttle" when you want successive effects to be spaced at least X time apart.
You "debounce" when you want the effect to be triggered only after the cause has "cooled off" for long enough (X time).
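To make the two definitions concrete, here is a minimal sketch that simulates both on a list of event timestamps instead of using real timers. It assumes timestamps in milliseconds, sorted in ascending order, and a leading-edge throttle (the first event fires immediately); the class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation for illustration; works on timestamps, not real timers.
class ThrottleDebounceDemo {

    // Throttle: an event triggers the effect only if at least `interval` ms
    // have passed since the last triggered effect. Returns the firing times.
    static List<Long> throttle(List<Long> events, long interval) {
        List<Long> fired = new ArrayList<>();
        Long last = null;
        for (long t : events) {
            if (last == null || t - last >= interval) {
                fired.add(t);
                last = t;
            }
        }
        return fired;
    }

    // Debounce: the effect fires `wait` ms after an event, but only if no
    // other event arrives before then. Returns the firing times.
    static List<Long> debounce(List<Long> events, long wait) {
        List<Long> fired = new ArrayList<>();
        for (int i = 0; i < events.size(); i++) {
            long t = events.get(i);
            boolean isLast = i == events.size() - 1;
            if (isLast || events.get(i + 1) - t >= wait) {
                fired.add(t + wait);
            }
        }
        return fired;
    }
}
```

With events at 0, 50, 100, 150, 400 and 450 ms and X = 100 ms, throttle fires at 0, 100 and 400, while debounce fires at 250 and 550: once after the first burst has gone quiet, and once after the second.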

Friday, December 8, 2023

Install Node.js / npm on Linux

The easiest way to manage Node.js / npm on Linux is by using the Node Version Manager:
https://github.com/nvm-sh/nvm

I. Install NVM

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

II. Use NVM to install the latest LTS release of Node.js / npm

nvm install --lts

III. List versions

nvm ls

This will output a mix of installed and available versions (including all LTS releases):

N/A
default -> lts/* (-> N/A)
iojs -> N/A (default)
node -> stable (-> N/A) (default)
unstable -> N/A (default)
lts/* -> lts/iron (-> N/A)
lts/argon -> v4.9.1 (-> N/A)
lts/boron -> v6.17.1 (-> N/A)
lts/carbon -> v8.17.0 (-> N/A)
lts/dubnium -> v10.24.1 (-> N/A)
lts/erbium -> v12.22.12 (-> N/A)
lts/fermium -> v14.21.3 (-> N/A)
lts/gallium -> v16.20.2 (-> N/A)
lts/hydrogen -> v18.19.0 (-> N/A)
lts/iron -> v20.10.0 (-> N/A)

When installing and uninstalling specific versions, you can use either the numeric version or the release designation (e.g. v16.20.2 and lts/gallium are interchangeable).

IV. Uninstall a specific version

nvm uninstall lts/iron

V. Install a specific version

nvm install lts/hydrogen

VI. Use a specific version

nvm use lts/gallium

If you're running an older version of Linux, you may only have access to older Node.js versions, because of a dependency on the GNU C Library (glibc).

Trying to run anything newer than lts/gallium on Amazon Linux 2 will throw the following:

node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)

node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)

You can update npm independently of Node.js, but this too has a limit. Gallium comes with npm version 8. At the time of this writing, npm reports that version 10 is available, but you can only upgrade as far as version 9:

npm install -g npm@9

(It will complain about incompatible versions if you try to install version 10).