Protehnica: October 2023

Monday, October 30, 2023

AWS CLI sets wrong file type, file gets downloaded with the wrong extension

I recently experienced the following issue using the AWS CLI:

I uploaded a .csv.gz file to S3.
I generated a presigned link.
The presigned link served the file with the wrong extension (.csv.csv instead of .csv.gz).

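I think this is a bug in how the file type gets detected when a file has multiple extensions. Since the AWS CLI guesses the MIME type of uploaded files by default, the CLI rather than S3 itself is the likely culprit.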
Thankfully, this can be easily solved, both for existing files, and for future uploads:

For existing files: I clicked on the object in the web interface, and scrolled down to Metadata.

Sure enough, the "Content-Type" key had the wrong value (it was "text/csv").
I clicked the "Edit" button, and manually changed it to the correct type, namely "application/x-gzip". The existing presigned link also reflected the change.

For future uploads: setting the content type explicitly ensures I will always get the desired content type, e.g.: --content-type application/x-gzip
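
As a rough sketch, both fixes could be wrapped in small shell functions like the ones below. The function names, the AWS_CLI override variable and the bucket/key arguments are my own assumptions for illustration. The `aws s3 cp` command and its `--content-type` and `--metadata-directive` options are part of the AWS CLI.

```shell
#!/usr/bin/env bash

# Hypothetical helpers (the names are illustrative, not part of the AWS CLI).
# AWS_CLI can be overridden, e.g. AWS_CLI=echo to print the command instead of running it.

# Upload a gzip file with an explicit Content-Type instead of the guessed one.
upload_gzip() {
    local file="${1}" bucket="${2}"
    "${AWS_CLI:-aws}" s3 cp "${file}" "s3://${bucket}/$(basename "${file}")" \
        --content-type application/x-gzip
}

# Rewrite the Content-Type of an existing object by copying it onto itself.
fix_gzip_content_type() {
    local bucket="${1}" key="${2}"
    "${AWS_CLI:-aws}" s3 cp "s3://${bucket}/${key}" "s3://${bucket}/${key}" \
        --content-type application/x-gzip --metadata-directive REPLACE
}
```

Usage would look like `upload_gzip ./data.csv.gz my-bucket`. Note that `--metadata-directive REPLACE` replaces the object's metadata as a whole, so any other custom metadata would have to be passed again.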

Saturday, October 21, 2023

Bash: "get or default"

A useful parameter expansion for assigning a default value to a variable if an optional input (e.g. $1) is missing or empty.

declare PARAMETER_VALUE="${1:-DEFAULT_VALUE}"
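
A small sketch of how this behaves inside a function. The function names and the "world" default are made up for illustration. It also shows the related `${1-default}` form, which only falls back when the parameter is unset, not when it is empty.

```shell
#!/usr/bin/env bash

# Hypothetical function: greets the given name, or "world" if none is given.
greet() {
    # ":-" falls back when $1 is unset OR empty.
    local name="${1:-world}"
    echo "Hello, ${name}"
}

# Hypothetical function using "-" alone, which keeps an explicitly empty value.
greet_keep_empty() {
    local name="${1-world}"
    echo "Hello, ${name}"
}
```

Calling `greet` with no argument or with an empty string uses the default, while `greet_keep_empty ""` keeps the empty string.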

Friday, October 20, 2023

Java: streaming WARC file with jwat-warc

I had to write some Java code that reads WARC records piped in through stdin.
Here's some minimal working code, using the jwat-warc library:
https://mvnrepository.com/artifact/org.jwat/jwat-warc

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.jwat.warc.WarcReader;
import org.jwat.warc.WarcReaderFactory;
import org.jwat.warc.WarcRecord;

And the minimal code piece:

InputStream stdin = System.in;
WarcReader warcReader = WarcReaderFactory.getReader(stdin);
WarcRecord record;

// Iterate over the records one by one as they arrive on stdin.
while ((record = warcReader.getNextRecord()) != null) {
    InputStream contentStream = record.getPayloadContent();
    // Read the payload all the way to the end before moving to the next record.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(contentStream))) {
        StringBuilder builder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }

        // Print the target URI and the payload length, separated by a tab.
        String uri = record.getHeader("WARC-Target-URI").value;
        String html = builder.toString();
        System.out.println(uri + "\t" + html.length());
    }
}

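You need to read the full content of each WARC record, as I did (so you can't just skip records without reading them), or else it will throw the exception below. The stack trace shows why: when the reader closes a record, it skips over the unread rest of the payload, and a piped stdin does not support seeking.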

java.io.IOException: Illegal seek
    at java.base/java.io.FileInputStream.skip0(Native Method)
    at java.base/java.io.FileInputStream.skip(Unknown Source)
    at java.base/java.io.BufferedInputStream.implSkip(Unknown Source)
    at java.base/java.io.BufferedInputStream.skip(Unknown Source)
    at java.base/java.io.FilterInputStream.skip(Unknown Source)
    at java.base/java.io.PushbackInputStream.skip(Unknown Source)
    at org.jwat.common.ByteCountingPushBackInputStream.skip(ByteCountingPushBackInputStream.java:134)
    at org.jwat.common.FixedLengthInputStream.skip(FixedLengthInputStream.java:115)
    at org.jwat.common.FixedLengthInputStream.close(FixedLengthInputStream.java:58)
    at java.base/java.io.BufferedInputStream.close(Unknown Source)
    at org.jwat.common.Payload.close(Payload.java:267)
    at org.jwat.warc.WarcRecord.close(WarcRecord.java:445)
    at org.jwat.warc.WarcReaderUncompressed.getNextRecord(WarcReaderUncompressed.java:123)

Java 21 on Ubuntu 22.04 LTS and Amazon Linux

I wanted to update the JRE to version 21 on Ubuntu 22.04 LTS, and on Amazon Linux 2.
I decided to go with the Adoptium® Eclipse Temurin™ OpenJDK release just because they make it really convenient to add apt and yum repositories.

The documentation page lists the steps needed to set up the repositories:
https://adoptium.net/installation/linux/

Here you can also find all the RPM-based Linux distributions they support:
https://packages.adoptium.net/ui/repos/tree/General/rpm

UPX Linux .so: "CantPackException: bad e_shoff"

I was trying to compress a .so file I had built from Go, but UPX threw an error:

upx --ultra-brute --lzma libname.so
upx: libname.so: CantPackException: bad e_shoff

After compression, the file could no longer be read by the following nm command, which lists exposed functions in the given library:

nm -D libname.so | grep my_function_name
nm: libname.so: file format not recognized

There's an issue from 2021 on the UPX GitHub issue tracker that still appears to be unresolved, and which clarifies the problem:
https://github.com/upx/upx/issues/506#issuecomment-1168570219

This style of layout of the address space in the shared library, having 4 [PT_]LOAD segments [...] requires that the upx runtime de-compression stub be significantly enhanced from the upx stub that handles shared libraries with only 2 PT_LOAD segments (one R E and one RW). Upgrading the upx stub has been in progress for a while, and the code is getting close, but is not yet complete.

Running the command suggested in that thread confirms that my .so file also contains 4 LOAD segments, alongside the other program headers:

readelf --segments libname.so

Elf file type is DYN (Shared object file)
Entry point 0x0
There are 10 program headers, starting at offset 64

Program Headers:
Type           Offset             VirtAddr           PhysAddr
                FileSiz            MemSiz              Flags Align
LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                0x0000000000042f98 0x0000000000042f98 R      0x1000
LOAD           0x0000000000043000 0x0000000000043000 0x0000000000043000
                0x000000000012faa9 0x000000000012faa9 R E    0x1000
LOAD           0x0000000000173000 0x0000000000173000 0x0000000000173000
                0x00000000002ec30c 0x00000000002ec30c R      0x1000
LOAD           0x0000000000460208 0x0000000000461208 0x0000000000461208
                0x000000000014ee74 0x0000000000182988 RW     0x1000
DYNAMIC        0x0000000000560dc8 0x0000000000561dc8 0x0000000000561dc8
                0x00000000000001f0 0x00000000000001f0 RW     0x8
NOTE           0x0000000000000270 0x0000000000000270 0x0000000000000270
                0x0000000000000088 0x0000000000000088 R      0x4
TLS            0x0000000000460208 0x0000000000461208 0x0000000000461208
                0x0000000000000000 0x0000000000000008 R      0x8
GNU_EH_FRAME   0x000000000045e8d0 0x000000000045e8d0 0x000000000045e8d0
                0x00000000000001ac 0x00000000000001ac R      0x4
GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                0x0000000000000000 0x0000000000000000 RW     0x10
GNU_RELRO      0x0000000000460208 0x0000000000461208 0x0000000000461208
                0x0000000000100df8 0x0000000000100df8 R      0x1
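
The check above can be scripted with a small filter. This is only a sketch: the function name is my own, and it simply counts lines whose first field is LOAD in readelf's program header listing read from stdin.

```shell
#!/usr/bin/env bash

# Hypothetical helper: counts LOAD program headers in the output of
# `readelf --segments`, read from stdin. Prints 0 if there are none.
count_load_segments() {
    awk '$1 == "LOAD" { n++ } END { print n + 0 }'
}
```

Usage: `readelf --segments libname.so | count_load_segments`. Going by the quoted issue comment, a result of 2 is the layout the current UPX stub handles, while 4 matches the unsupported layout shown above.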

The only solution is to wait for UPX maintainers to address this.

Thursday, October 19, 2023

Gradle 8.4: "Convention type has been deprecated"

I upgraded Gradle to version 8.4 and started getting the following types of messages:

The org.gradle.api.plugins.ApplicationPluginConvention type has been deprecated. This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for further information: https://docs.gradle.org/8.4/userguide/upgrading_version_8.html#application_convention_deprecation
The org.gradle.api.plugins.Convention type has been deprecated. This is scheduled to be removed in Gradle 9.0. Consult the upgrading guide for further information: https://docs.gradle.org/8.4/userguide/upgrading_version_8.html#deprecated_access_to_conventions

My existing build.gradle files set some configuration options at the top level, which is now deprecated. The changes I had to make were largely cosmetic: I just had to group those options into their own blocks:

I. Pre 8.4:

sourceCompatibility = 1.17
targetCompatibility = 1.17
mainClassName = "run.Main"
applicationDefaultJvmArgs = ["-Xmx1g"]

II. Post 8.4:

java {
    sourceCompatibility = 1.17
    targetCompatibility = 1.17
}

application {
    mainClass.set("run.Main")
    applicationDefaultJvmArgs = ["-Xmx1g"]
}

Tuesday, October 17, 2023

Remove duplicate pictures

Recently I wanted to remove duplicate photos I had based on image content, not just file hash.
This program, AllDup by MTSD, did a great job.

https://www.alldup.de/en_download_alldup.php

Sunday, October 15, 2023

"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results."

I was playing with Hugging Face transformers and kept getting the warning "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results." I finally found a solution in a Stack Overflow reply, credited at the end:

To fix this, first add this code after loading pre-trained tokenizer:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
Then pass this in generate method like this:
gen_ids = model.generate(**encodings, pad_token_id=tokenizer.pad_token_id, max_new_tokens=200)

In short, there are two additions/changes you need to make:

When initializing your tokenizer, set:
tokenizer.pad_token = tokenizer.eos_token
When using the model to generate an output, pass the following as a parameter to model.generate:
pad_token_id=tokenizer.pad_token_id

Thank you, user Shital Shah on Stack Overflow:

https://stackoverflow.com/questions/74682597/fine-tuning-gpt2-attention-mask-and-pad-token-id-errors/76549607#76549607

"A decoder-only architecture is being used, but right-padding was detected!"

I was playing with Hugging Face transformers and kept getting the warning "A decoder-only architecture is being used, but right-padding was detected!" I finally found a solution in a Stack Overflow reply, credited at the end:

Padding in this context is referring to the "tokenizer.eos_token", and you are currently padding to the right of the user input and the error is saying that for correct results add padding to the left. You need to do this:
new_user_input_ids = tokenizer.encode(tokenizer.eos_token + input(">> User:"), return_tensors='pt')

While I originally thought it was about setting the parameter padding_side='left', it turned out to be about the order in which you concatenate the input and the eos_token.

Thank you, user Travis Thayer on Stack Overflow:
https://stackoverflow.com/questions/74748116/huggingface-automodelforcasuallm-decoder-only-architecture-warning-even-after/74972288#74972288

Wednesday, October 11, 2023

Update apt packages

#!/usr/bin/env bash
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo apt-get autoremove
sudo apt-get clean

Wednesday, October 4, 2023

Install specific Go version on Linux

A bash script for installing a specific Go version on Linux (AMD64).

Expects a numeric version as a parameter (e.g. 1.21.1).
Downloads & extracts the Go archive under /usr/local/lib/go$VERSION
Creates a symlink for ./bin/go under /usr/local/bin/go

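#!/usr/bin/env bash

declare VERSION="${1}"

# Without a version, the script would look for a nameless "go" archive.
if [[ -z "${VERSION}" ]]; then
    echo "Usage: ${0} VERSION (e.g. 1.21.1)" >&2
    exit 1
fi

declare PROGRAM="go${VERSION}"
declare ARCHIVE="${PROGRAM}.linux-amd64.tar.gz"
declare LIB_DIR="/usr/local/lib"
declare BIN_DIR="/usr/local/bin"
declare INSTALL="${LIB_DIR}/${PROGRAM}"
declare SYMLINK="${BIN_DIR}/go"

if [[ ! -d "${INSTALL}" ]]; then
    # --timestamping skips the download if the local copy is up to date.
    wget --timestamping "https://go.dev/dl/${ARCHIVE}"

    if [[ ! -f "${ARCHIVE}" ]]; then
        echo "File not found: ${ARCHIVE}"
        exit 1
    fi

    # The archive unpacks into ./go; rename it so versions can live side by side.
    sudo tar -xvf "${ARCHIVE}"
    sudo mv -f go "${PROGRAM}"
    sudo mv -f "${PROGRAM}" "${LIB_DIR}"
else
    echo "${INSTALL} already exists"
fi

# Point the go command at the requested version.
sudo rm -f "${SYMLINK}"
sudo ln -s "${LIB_DIR}/${PROGRAM}/bin/go" "${SYMLINK}"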
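
Since the script expects a numeric version, a stricter check could be added before anything is downloaded. The helper below is only a sketch: its name and the accepted pattern (two or three dot-separated numbers, so release candidates like 1.21rc2 are rejected) are my own assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only for versions shaped like 1.20 or 1.21.1.
is_go_version() {
    [[ "${1}" =~ ^[0-9]+\.[0-9]+(\.[0-9]+)?$ ]]
}
```

It could be used near the top of the script as `is_go_version "${VERSION}" || exit 1`.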

Monday, October 2, 2023

Sub7

Interview with Sub7 creator, Mobman:
- https://twitter.com/DarkCoderSc/status/1681208015255379968
- https://darkcodersc.medium.com/a-malware-retrospective-subseven-d86fed0c88bf

Born and raised in Craiova, Romania, Mobman was drawn to the world of software and malware at an early age. His fascination led him to the creation of the infamous SubSeven Remote Access Trojan, a feat achieved under a pseudonym inspired by his enduring favorite band, B.U.G. Mafia. As he reflected, “The nickname was inspired from my favorite band (still to this day!), the Romanian rap group called B.U.G. Mafia. I wanted to pick something mob-related and mobman just had a nice ring to it.”

Sub7 fun fact: mobman used to write feature ideas in notebooks
https://twitter.com/xillwillx/status/1708766696985575772