Bytes-Util
What’s this
A Java library that enables easing parsing of binary encoding schemes.
Motivation
Nowadays there are still a lot of systems which communicates by non-standard home-grown binary protocols. Implementing such protocols is always a pain as I feel during my daily work, due to data structures or types not natively supported or hard to implement in Java, like unsigned integral types, little-endian numbers or binary coded decimals, just to name a few. So I make this library to ease the process of parsing and make programmers focus more on their real work.
Quick Start
Consider the following definition of a data packet:
Field Name | Length | Type |
---|---|---|
Header Mark | 1 | 1-byte unsigned integer |
Packet Type | 1 | 1-byte unsigned integer |
Sequence Number | 2 | 2-byte unsigned integer |
End Mark | 1 | 1-byte unsigned integer |
And declaration of the following Java class:
import io.github.zhtmf.DataPacket;
import io.github.zhtmf.annotations.modifiers.Order;
import io.github.zhtmf.annotations.modifiers.Unsigned;
import io.github.zhtmf.annotations.types.BYTE;
import io.github.zhtmf.annotations.types.SHORT;
@Unsigned
public class MyPacket extends DataPacket{
@BYTE
@Order(0)
public int headerMark;
@BYTE
@Order(1)
public int type;
@SHORT
@Order(2)
public int sequenceNumber;
@BYTE
@Order(3)
public byte endMark;
}
And coding serialization/deserialization become as simple:
InputStream in = ...; //socket input stream, or other input streams.
MyPacket packet = new MyPacket();
packet.deserialize(in); //fields will be filled with values in the stream.
OutputStream out = ...; //write to socket or other forms of output streams.
MyPacket packet = new MyPacket();
packet.headerMark = 0xEB;
packet.type = 0x01;
packet.sequenceNumber = 255;
packet.endMark = 0x07;
packet.serialize(out);
As you can see from the basic example above, to use this library users need to:
- Define a class which subclasses
DataPacket
- Use
Order
as well as other annotations to define properties of instance fields which corresponds directly with fields in original binary schemes. - Use inheritied methods like
DataPacket#serialize
andDataPacket#deserialize
to handle serialization/deserialization.
Basic Concepts
Abstract Data Types
This library defines the following pseudo data types which serves as an abstraction over various definitions in binary encoding schemes, and their annotations can be found under the package io.github.zhtmf.annotations.types
.
Name | Meaning |
---|---|
BYTE | 1-byte integer |
SHORT | 2-byte integer |
INT | 4-byte integer |
INT3 | 3-byte integer |
INT5 | 5-byte integer |
INT6 | 6-byte integer |
INT7 | 7-byte integer |
BCD | Binary-Coded Decimal |
RAW | sequences of bytes that do not fall in the categories above and used as-is |
CHAR | sequences of bytes which are interpreted as human readable text |
LONG | 8-bye integer |
BIT | Numbers that stored as groupings of less than 8 bits |
FIXED | Real data type that has a fixed number of digits after the radix point, aka. the Q format. |
And following conversion between data types above and Java types are defined:
byte/Byte | short/Short | int/Integer | long/Long | Enums | String | char/Character | boolean/Boolean | byte[] | int[] | java.util.Date | BigInteger | Double/double | BigDecimal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BYTE | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ||||||||
SHORT | ⚪ | ⚪ | ⚪ | ⚪ | ||||||||||
INT | ⚪ | ⚪ | ⚪ | ⚪ | ||||||||||
INT3 | ⚪ | ⚪ | ||||||||||||
INT5 | ⚪ | |||||||||||||
INT6 | ⚪ | |||||||||||||
INT7 | ⚪ | |||||||||||||
BCD | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ||||||||
RAW | ⚪ | ⚪ | ||||||||||||
CHAR | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ | |||||
LONG | ⚪ | ⚪ | ⚪ | ⚪ | ||||||||||
BIT | ⚪ | ⚪ | ⚪ | |||||||||||
FIXED | ⚪ | ⚪ |
In addition, an instance field can also be another Data Packet
. It will be automatically handled by calling its serialize
or deserialize
method and does not need to be annotated with data type annotations above.
Most of the conversions are intuitive but conversion from numeric types to enum
s deserves a separate chapter in later part of this document.
Modifiers
Modifiers are another set of annotations which describes properties of data types above, they can be found under the package io.github.zhtmf.annotations.modifiers
.
Name | Meaning |
---|---|
Order | Order of this field in the original encoding scheme. This is a must because reflection API in Java does not guarantee the order of Field objects returned by Class.getDeclaredFields . |
Signed/Unsigned | Specifies that a single field or all fields in a class should be interpreted as signed or unsigned (default). |
BigEndian/LittleEndian | Specifies that a single field or all fields in a class should be interpreted as big-endian (default) or little-endian. |
CHARSET | Charset for all CHAR type fields in a class or for a single field. |
DatePattern | Date pattern string for a java.util.Date field. |
ListLength | length of java.util.List when handling with a list of basic data types. |
Length | length of CHAR arrays or RAW arrays. |
EndsWith | Indicates that a CHAR field is of indeterministic length and its end is marked by a specific sequence of bytes. |
ListEndsWith | Indicates that length of a list is neither static nor calculated but depends on external conditions at runtime. It enables using a ModifierHandler to encapsulate additional logic. Users can refer to external resources or even modify the list itself within this handler. See how this library handles length of the ConstantPool field in a Java class file in the related example as an example for how to use this modifier. |
These annotations can be found under the package io.github.zhtmf.annotations.modifiers
. They can be specified both at type level and at field level. And with not surprise, field level annotation overrides same annotation at type level.
Handler
Handler is a mechanism to express dynamic logic in a binary encoding, namely fields whose content depends on value another field or some other conditions and can only be determined at runtime.
There are three types of handlers as listed below:
|Type |Assosiated Annotations |Usage
|—-|——|—–
|Length handler |Length, ListLength|dynamic length
|Conditional handler|Conditional |conditionally omitting some fields
|Entity handler |Variant |custom instantiation logic of “sub” DataPacket
s.
Their usage are best explained through a small example, consider the following encoding scheme:
Field Name | Length | Type | Value |
---|---|---|---|
Header | 1 | 1-byte unsigned integer | 0xF0 |
Packet Length | 4 | 4-byte unsigned integer | |
Identifier | 4 | 4-byte array | 0x01000000 |
Sequence Number | 2 | 2-byte unsigned integer | |
CRC Flag | 1 | 1-byte integer | 0x00: with crc 0x01: without crc |
Body | multiple types, depends on body header | ||
CRC Value | 4 | 4-byte array | If CrcFlag is 0, this part is ommited, otherwise crc32 encoding of the Body part. |
Ending | 1 | 1-byte integer | 0xFF |
body of type 1:
Field Name | Length | Type | Value |
---|---|---|---|
Body Header | 2 | 2-byte unsigned integer | 0x5109 |
Device ID | 4 | 4-byte unsigned integer | |
IP | 4 | 4-byte array |
body of type 2:
Field Name | Length | Type | Value |
---|---|---|---|
Body Header | 2 | 2-byte unsigned integer | 0x55DE |
Files Count | 1 | 1-byte unsigned integer | |
File Name List | 100*N | 100-character string per file name |
There are three parts in this binary encoding scheme that cannot be statically defined:
- The CRC part can be omitted, if the crcFlag is 0.
- The body part can be either of the two types and the body part itself has a “header” which states what kinds of data follows.
- In body type 2, its length is indeterministic, which relies on how many file names are in the list.
These problems makes it dynamic and we use handlers to express the logics.
import java.io.IOException;
import java.io.InputStream;
import io.github.zhtmf.DataPacket;
import io.github.zhtmf.annotations.modifiers.Length;
import io.github.zhtmf.annotations.modifiers.LittleEndian;
import io.github.zhtmf.annotations.modifiers.Order;
import io.github.zhtmf.annotations.modifiers.Unsigned;
import io.github.zhtmf.annotations.modifiers.Variant;
import io.github.zhtmf.annotations.types.BYTE;
import io.github.zhtmf.annotations.types.INT;
import io.github.zhtmf.annotations.types.RAW;
import io.github.zhtmf.converters.auxiliary.EntityHandler;
import io.github.zhtmf.converters.auxiliary.ModifierHandler;
@Unsigned
@LittleEndian
public class DynamicPacket extends DataPacket{
@Order(0)
@BYTE
public int header;
@Order(1)
@INT
public long length;
@Order(2)
@RAW
@Length(4)
public byte[] identifier;
@Order(3)
@SHORT
public int sequenceNumber;
@Order(4)
@BYTE
public byte crcFlag;
@Order(5)
@Variant(BodyHandler.class)
public Body body;
@Order(6)
@RAW
@Length(handler=CrcHandler.class)
public int[] crc;
@Order(7)
@BYTE
public int ending;
private static final class CrcHandler extends ModifierHandler<Integer>{
@Override
public Integer handleDeserialize0(String fieldName, Object entity, InputStream is) {
return ((DynamicPacket)entity).crcFlag==0 ? 0 : 4;
}
@Override
public Integer handleSerialize0(String fieldName, Object entity) {
return ((DynamicPacket)entity).crcFlag==0 ? 0 : 4;
}
}
private static final class BodyHandler extends EntityHandler{
@Override
public DataPacket handle0(String fieldName, Object entity, InputStream is) throws IOException {
byte b1 = (byte) is.read();
byte b2 = (byte) is.read();
int header = b2 <<8 | b1;
if(header==0x5109) {
return new BodyType1();
}else if(header==0x55DE) {
return new BodyType2();
}else {
throw new Error("unknown body");
}
}
}
}
Firstly , the identifier
field is annotated with Length(4)
, this is how this library defines static length.
Secondly, the crc
field is also annotated with Length
, however because its length depends on value of another field crcFlag
so we use a handler to wrap the branching logic, namely is the CrcHandler
class.
The CrcHandler
class is a subclass of ModifierHandler<Integer>
, which defines two abstract methods. These two methods are provided with name of the field in question, the (incomplete) entity object as well as the input stream (only for deserialization) as a context for custom processing. Within the handler body, we use the crcFlag
field which is already known to us at this point to determine length of the crc
array.
This part can also be implemented using Conditional
which is only introduced in later version of this library.
Thirdly comes the body
part. This one is more complicated than the Crc
field in that different types of body are not only different in length but also has different layout. It is possible to use a lot of handlers to handle all possibilities of data layout for this part but your code will soon be unmaintainable when new body types are added.
This library adopts a different approach. Instance field for the body can be declared as a common base class which also inherits DataPacket
and it will be instantiated as one of the concrete subclasses:
@Unsigned
@LittleEndian
abstract class Body extends DataPacket{
@Order(0)
@SHORT
public int bodyHeader;
}
@Unsigned
@LittleEndian
class BodyType1 extends Body{
@Order(0)
@INT
public long deviceId;
@Order(1)
@RAW(4)
public int[] IP;
}
@Unsigned
@LittleEndian
class BodyType2 extends Body{
@Order(0)
@CHAR(100)
@Length(type=DataType.BYTE)
public List<String> fileList;
}
The BodyHandler
is a subclass of ModifierHandler
which simplifies the interface. Within the handler method body, we peek the body header and return instances of proper subclass of Body
. Normally such peeking is problematic because data for later processing will be consumed prematurely. But in this library, the InputStream
passed as parameter is a special one which is automatically rewound after the handler method returns.
Also @Length(type=DataType.BYTE)
in BodyType2
means “write length of the list into the stream as a BYTE prior to serializing list content and read in list length prior to deserialize list itself”, this is another usage of Length
annotation which provides support for the “Length-Value” pattern. Such usage is especially useful in Java because length of a List
is implicit in the list itself and using another field to record the length is fairly redundant.
Annotations and handler class definitions can be found under package io.github.zhtmf.annotations.modifiers
and io.github.zhtmf.converters.auxiliary
.
Enums
This library supports conversion directly to java enums. Although it requires some effort. User-defined enum types should meet one of the following requirements:
- Implements
NumericEnum
orStringEnum
interface. - Returns distinct strings (numeric strings when mapping to numeric data types) from
toString
method of each enum member. - Enum members are mapped implicitly as integers of their corresponding ordinals.
As in the following example:
private enum NEnum1 implements NumericEnum{
FLAG1 {
@Override
public long getValue() {
return 0;
}
},
FLAG2 {
@Override
public long getValue() {
return 1;
}
};
@Override
public abstract long getValue();
}
@BYTE
@Order(0)
public NEnum1 enumFlags1;
The library deserializes 0
as NEnum1.FLAG1 and 1
as NEnum1.FLAG2. Similarly, NEnum1.FLAG1 is serialized as a byte value 0
and NEnum1.FLAG2 as 1
.
Alternatively the definition can be like the following:
private enum NEnum1 {
FLAG1 {
@Override
public String toString() {
return "0";
}
},
FLAG2 {
@Override
public String toString() {
return "1";
}
};
}
This class does not implement the interface but returns distinct values as strings from toString
method.
Or the definition can be as simply as :
private enum NEnum1 {
FLAG1,
FLAG2;
}
If the encoding scheme only defines 0
and 1
for this field and they are deserialized as FLAG1
and FLAG2
respectively according to their ordinals.
Script
Admittedly, writing handlers is cumbersome and making the code unnecessarily lengthy. So this library introduces a script engine to simplify handlers.
For example the following handler definition ( an excerpt from MySQL connector example):
@Order(11)
@RAW
@Length(handler=RestPluginLength.class)
public byte[] restPluginProvidedData;
public static class RestPluginLength extends ModifierHandler<Integer>{
@Override
public Integer handleDeserialize0(String fieldName, Object entity, InputStream is) throws IOException {
HandshakeV10 v10 = (HandshakeV10)entity;
return Math.max(13, v10.authPluginDataLen-8);
}
@Override
public Integer handleSerialize0(String fieldName, Object entity) {
HandshakeV10 v10 = (HandshakeV10)entity;
return Math.max(13, v10.authPluginDataLen-8);
}
}
can be simplified as
@Order(11)
@RAW
@Length(scripts = @Script("a=entity.authPluginDataLen-8;a<13 ? 13 : a;"))
public byte[] restPluginProvidedData;
Using script sacrifices some performance and script cannot be used for Entity Handlers as this script engine does not aim to be a full-fledged programming language and users should write a real handler when expressing complex logic.
The script does not possess its own annotation but associates with existing modifier annotations through their scripts
property, check java doc of these annotations and the marker annotation io.github.zhtmf.annotations.modifiers.Script
for more information.
Refer to another document here for syntax of this script.
Length Calculation
There is a third method, length()
in DataPacket
, which calculates length in byte for this entity class as if it was serialized into a destination. It is useful for some encoding schemes which defines a “length” field.
Note that this method is NOT a constant-time operation but it does not calculate by doing a serialization.
From 1.1.3, a new method offset
has been added to ModifierHandler
class. This method returns an int
telling the current offset (position) in bytes relative to beginning of current DataPacket object. This is useful to implement some data structures (mainly strings) which does not have a deterministic length but occupies all remaining spaces in a data packet.
User-defined Conversion
Library supports user defined conversion by using the annotation of UserDefined
, as in the following example:
@Order(0)
@UserDefined(Converter1.class)
@Length(8)
public Timestamp ts;
As this library only “natively” supports java.util.Date, conversion to a java.sql.Timestamp requires some effort. By using this annotation, you should clearly defines how many bytes this conversion process requires (in this example it is 8) and specifies a handler class to do the actual conversion work:\
public static class Converter1 extends TypeConverter<Timestamp>{
@Override
public void serialize(Timestamp obj, Output context) throws IOException {
context.writeLong(((Timestamp)obj).getTime());
}
@Override
public Timestamp deserialize(Input input) throws IOException{
return new Timestamp(input.readLong());
}
}
io.github.zhtmf.TypeConverter
is a dedicated base class for custom conversion. Output
and Input
are utility classes which serves as an abstraction over the underlying stream and many other functions this library provides.