1770850620 english chat2026-devblog

Chat2026 part 10: continuing application design

More security hardening

Server side hardening

One security issue I've noticed with the current design is that an outside observer can see the certificates of both the client and the server. The observer can therefore collect important metadata and easily map people's social networks. One way to hide this is to connect through a regular HTTPS server that enables the CONNECT method only to a given host. For example, you can use this to proxy to your SSH server through your HTTPS web server. If you are using Apache:

/etc/apache2/conf-available/proxy.conf:

ProxyRequests On
AllowConnect 22
<Proxy "*">
	Require all denied
</Proxy>

<ProxyMatch "127.0.0.1:22">
	Require all granted
</ProxyMatch>

After enabling the proxy and proxy_connect modules and this configuration, Apache will allow connecting to the SSH server at localhost.

Then using proxytunnel -E -p yourdomain.com:443 -d 127.0.0.1:22 as the proxy command for the SSH connection lets you reach the SSH server via the web server. This is especially useful when you are behind restrictive firewalls that let nothing out but the most common protocols, such as those on ports 80 and 443. But it can also be used to masquerade the chat program's traffic as regular web traffic.

So the chat program will need to support this kind of proxying to cover its tracks. This requires a change to the access method structure described before:

struct TlsAccessMethod {
    host_name: String,
    port_number: u16,
    server_name: String, // to be used for TLS SNI
}

enum AccessMethodKind {
    Tls(TlsAccessMethod),
    HttpsProxy {
        https_proxy: TlsAccessMethod,
        device: TlsAccessMethod,
    }
}

struct AccessMethod {
    method_kind: AccessMethodKind,
    last_updated: Timestamp,
}

In the HTTPS proxy case, https_proxy specifies how to reach the HTTPS server, while device specifies how to reach the device. Here the host_name and port_number fields of device are used in the HTTP CONNECT request, while server_name is used for the nested TLS connection that is established to the device.
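A minimal sketch of the CONNECT exchange, assuming the outer TLS connection to the HTTPS server is already up. The function names are made up for illustration and are not part of the design. It only builds the request text and checks the status line of the proxy's reply:

```rust
/// Builds the request that asks the HTTPS server to open a tunnel to the device.
/// `host_name` and `port_number` come from the `device` access method.
fn build_connect_request(host_name: &str, port_number: u16) -> String {
    format!(
        "CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}:{port}\r\n\r\n",
        host = host_name,
        port = port_number
    )
}

/// Extracts the status code from a status line such as
/// "HTTP/1.1 200 Connection established".
fn parse_status_code(status_line: &str) -> Option<u16> {
    let mut parts = status_line.split_whitespace();
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// The tunnel is usable only if the proxy answered with a 2xx status.
fn tunnel_established(status_line: &str) -> bool {
    match parse_status_code(status_line) {
        Some(code) => code >= 200 && code < 300,
        None => false,
    }
}
```

Once tunnel_established returns true, the nested TLS handshake with the device (using its server_name) would start on the same byte stream.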

Another way to harden is to use the Encrypted ClientHello (ECH) feature of TLS. However, an observer can see that it's being used and can block it. But if everyone does it, then it's hiding in the crowd (like when everyone uses TLS). The main obstacle is that rustls does not support ECH on the server side as of writing.

Client side hardening

The user may also decide to use a SOCKS proxy to avoid connecting to someone directly. Tor provides such a SOCKS proxy, so we don't need to implement our own. But the SOCKS protocol needs to be supported on the client side. It's a simple protocol, so I might be able to implement it on my own, but there are probably some good Rust crates for this.
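To show how small the protocol is, here is a sketch of the client side messages of SOCKS5 (RFC 1928) without authentication. The function names are hypothetical, and a real client would also have to read the rest of the proxy's reply (the bound address), which is skipped here:

```rust
/// SOCKS5 greeting that offers only the "no authentication" method.
fn socks5_greeting() -> [u8; 3] {
    [0x05, 0x01, 0x00] // version 5, one method offered, method 0x00
}

/// SOCKS5 CONNECT request that hands the host name to the proxy,
/// so the name is resolved on the proxy side.
/// Returns None if the name is empty or longer than 255 bytes.
fn socks5_connect_request(host_name: &str, port_number: u16) -> Option<Vec<u8>> {
    let name = host_name.as_bytes();
    if name.is_empty() || name.len() > 255 {
        return None;
    }
    // version 5, command 1 (CONNECT), reserved 0, address type 3 (domain name)
    let mut request = vec![0x05, 0x01, 0x00, 0x03, name.len() as u8];
    request.extend_from_slice(name);
    // Port in network byte order.
    request.push((port_number >> 8) as u8);
    request.push((port_number & 0xff) as u8);
    Some(request)
}

/// Checks the first two bytes of the proxy's reply: version 5, status 0 (succeeded).
fn socks5_reply_ok(reply: &[u8]) -> bool {
    reply.len() >= 2 && reply[0] == 0x05 && reply[1] == 0x00
}
```

The client sends the greeting, reads the two byte method selection, sends the CONNECT request, and after a successful reply the socket behaves like a direct connection to the target.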

Connection management

The node listens for connections and it also tries to make connections to other nodes.

We don't have much control over which connections come in. But if a connection mutually authenticates, then we accept it and handle it.

The node also makes outgoing connections. To do this, it basically goes through the DeviceAccessStates and attempts to connect to access modes that aren't connected yet. In my original idea, if the connection fails, it keeps retrying at increasing intervals. For example, it retries first after 1 minute, then after 2 more minutes, then 4, then 8, and so on, so the attempts fall at 1, 3, 7, 15, ... minutes. If a node is offline for just over 255 minutes (4 hours and 15 minutes) and then comes online, clients won't try to connect until a further 256 minutes (4 hours 16 minutes) have passed. This is not good. Perhaps it would make sense to let the user who creates the access method specify what retry characteristics are needed. As a result I add 3 new fields to the access method structure:

    struct AccessMethod {
        method_kind: AccessMethodKind,
        first_retry_period_secs: u32, // Period of first retry after the connection breaks.
        subsequent_retry_period_secs: u32, // Period of repeated retry after connection failure.
        max_attempts: u32, // Number of attempts before the access mode is declared dead.
        last_updated: Timestamp,
    }

first_retry_period_secs applies only when an established connection breaks: it is the wait before the first attempt to reestablish the connection. subsequent_retry_period_secs applies only when a connection attempt fails. So if the first connection attempt fails right away, this timeout applies. And max_attempts indicates how many connection failures we can tolerate before giving up.
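The rule can be sketched as a small function. RetryPolicy, RetryEvent and next_delay_secs are names invented for this example, and I'm assuming failed_attempts counts the failures since the last successful connection, including the one that just happened:

```rust
/// The three retry fields of the access method, pulled out for the example.
struct RetryPolicy {
    first_retry_period_secs: u32,
    subsequent_retry_period_secs: u32,
    max_attempts: u32,
}

/// What just happened to the connection (hypothetical helper type).
enum RetryEvent {
    /// An established connection broke.
    ConnectionBroke,
    /// A connection attempt failed.
    AttemptFailed,
}

impl RetryPolicy {
    /// Seconds to wait before the next attempt,
    /// or None when the access mode should be declared dead.
    fn next_delay_secs(&self, event: RetryEvent, failed_attempts: u32) -> Option<u32> {
        match event {
            RetryEvent::ConnectionBroke => Some(self.first_retry_period_secs),
            RetryEvent::AttemptFailed => {
                if failed_attempts >= self.max_attempts {
                    None
                } else {
                    Some(self.subsequent_retry_period_secs)
                }
            }
        }
    }
}
```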

For simple network hiccups first_retry_period_secs can be small, such as 1 minute. If the DHCP lease expired and the node needs a new IP, or the modem needed a restart, this period should be enough for the connection to come back. A good value for subsequent_retry_period_secs would be 10 minutes: if the node is offline, we try to reestablish the connection every 10 minutes. The default value for max_attempts would be 1000. If we make an attempt every 10 minutes, then we run out of attempts in roughly a week (1000 times 10 minutes is about 6.9 days). If no connection can be made by then, we can give up.
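The arithmetic behind both schedules can be checked with a few lines. This is only a back-of-the-envelope sketch with made-up function names: it compares the doubling schedule from the original idea with the fixed period schedule, assuming the first attempt happens first_retry_period_secs after the connection breaks and every later one subsequent_retry_period_secs after the previous failure:

```rust
/// Minutes since the first failure at which each retry fires
/// when the wait doubles after every failed attempt.
fn doubling_retry_times(attempts: u32) -> Vec<u64> {
    let mut times = Vec::new();
    let mut wait = 1u64;
    let mut elapsed = 0u64;
    for _ in 0..attempts {
        elapsed += wait;
        times.push(elapsed);
        wait *= 2;
    }
    times
}

/// With the doubling schedule: how many minutes pass between a node
/// coming online at minute `online_at` and the next retry.
fn wait_after_coming_online(online_at: u64) -> u64 {
    let mut wait = 1u64;
    let mut next_attempt = 1u64;
    while next_attempt < online_at {
        wait *= 2;
        next_attempt += wait;
    }
    next_attempt - online_at
}

/// With the fixed schedule: seconds from the connection breaking
/// until the last allowed attempt has been made.
fn fixed_schedule_total_secs(first: u32, subsequent: u32, max_attempts: u32) -> u64 {
    if max_attempts == 0 {
        return 0;
    }
    first as u64 + (max_attempts as u64 - 1) * subsequent as u64
}
```

With the doubling schedule a node that comes back at minute 256 waits another 255 minutes for the next retry. With a 60 second first period, 600 second subsequent period and 1000 attempts, the fixed schedule never waits more than 10 minutes and gives up after 599460 seconds, a little under 7 days.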

Of course there should be an option to allow the user to make an attempt explicitly. That would clear the failure counter. Hence we add an API:

    fn UserData::clear_failed_connection_counter(&mut self, state: Id<DeviceAccessState>); // To be used when the user explicitly wants to reconnect to this state.

Each device is supposed to have a different certificate. But there can be multiple active connections between devices. So we need to find out whether two connections that use identical certificates in fact come from the same device, or the key was stolen and someone else is trying to interfere. My idea is that when a device starts, it generates 256 bits of random data, the "device random". Upon each connection, the first message the two sides exchange is this device random. So if a device makes multiple connections to another device, the other side sees the device random on each, and if they are the same, it concludes that it's the same device. If another instance of the same device is started, it has a different device random, so the other side determines that the connection is from a different instance. This can simply mean that the user copied the data and started the client twice, but it can also mean a compromised private key.

When this happens, the new connection is put in a pending state and all the existing connections for this device are pinged. If there is a response, the duplicate is confirmed. An alert is sent to the new connection, and that connection is then torn down. An alert is also sent over the existing connection, but that one is maintained. A new device random can also simply be due to the application having been restarted; in that case the old connection is dead and won't answer.
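The decision logic can be sketched like this. Verdict, PendingOutcome, classify and resolve_pending are names I made up for the example. What exactly happens to the dead connection after a restart is my assumption: the sketch simply lets the new connection replace it.

```rust
type DeviceRandom = [u8; 32];

/// What to do with a new connection that presents a known certificate.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// No live connection uses this certificate yet.
    Accept,
    /// Same device random as an existing connection: the same running instance.
    SameInstance,
    /// Different device random: keep the connection pending and ping the existing ones.
    PendingPingExisting,
}

/// `existing` holds the device randoms of the live connections with the same certificate.
fn classify(existing: &[DeviceRandom], incoming: &DeviceRandom) -> Verdict {
    if existing.is_empty() {
        Verdict::Accept
    } else if existing.iter().any(|r| r == incoming) {
        Verdict::SameInstance
    } else {
        Verdict::PendingPingExisting
    }
}

/// Result of pinging the existing connections for a pending one.
#[derive(Debug, PartialEq)]
enum PendingOutcome {
    /// An old connection answered: alert both sides and tear down the new connection.
    DuplicateRejectNew,
    /// Nobody answered: the application was restarted (assumption: the new connection replaces the old).
    RestartReplaceOld,
}

fn resolve_pending(existing_answered_ping: bool) -> PendingOutcome {
    if existing_answered_ping {
        PendingOutcome::DuplicateRejectNew
    } else {
        PendingOutcome::RestartReplaceOld
    }
}
```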

In order to keep the connection alive, the connection manager regularly sends packets to the other side.

Then the connection manager also takes care of catching up devices of contacts if they weren't updated (such as when a rarely used device comes online).

When a new, previously unseen device (one with a new certificate) connects for the first time, a new ContactDeviceInfo is created for it on the spot, with the last_updated member set to the current timestamp.

Perhaps sending a message is actually a catch up. You create a message and the connection manager will take care of delivering it to the devices of contacts. So no direct sending operation is involved.

So based on this the following run-time structures are devised:

    type DeviceRandom = [u8; 32];

    enum ConnectionStatus {
        NotConnected,
        Connected,
    }

    struct ConnectionState {
        status: ConnectionStatus,
        device_random: DeviceRandom,
    }

    struct ConnectionManager {
        connections: Map<CertificateId, ConnectionState>,
        pending_connections: Vec<ConnectionState>,
    }

    struct RuntimeData {
        my_dev_random: DeviceRandom,
        conn_mgr: ConnectionManager,
    }

Implementation plan

Depending on how many friends you have, a node doesn't need to maintain a whole lot of connections. So I think I will use threaded I/O rather than committing to async. The connection manager would have a main thread that does all the management jobs I described above. On top of that it would have a listener thread that listens for connections and spawns threads for each client that connects. Each client would have a pair of threads: one is constantly blocked on receiving, reads messages from the client and forwards them to the connection manager; the other waits for messages from the connection manager and sends them to the client as they arrive. I've never done this two-threads-per-socket arrangement in Rust so far, so I need to check the documentation to see whether it is possible... It seems a TcpStream can be cloned with try_clone, so I can do this.
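A rough sketch of the two threads per socket idea, using only the standard library. It works on a plain TcpStream and moves raw byte chunks around. The TLS layer and message framing are left out, and spawn_connection_threads is a name I made up for the example:

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Spawns the reader and writer thread for one connection.
/// Incoming data is forwarded to the connection manager through `to_manager`;
/// the returned sender is what the manager uses to queue outgoing data.
fn spawn_connection_threads(
    stream: TcpStream,
    to_manager: Sender<Vec<u8>>,
) -> io::Result<Sender<Vec<u8>>> {
    // try_clone gives a second handle to the same socket.
    let mut read_half = stream.try_clone()?;
    let mut write_half = stream;
    let (to_peer, from_manager) = mpsc::channel::<Vec<u8>>();

    // Reader thread: blocked on the socket, forwards whatever arrives.
    thread::spawn(move || {
        let mut buf = [0u8; 4096];
        loop {
            match read_half.read(&mut buf) {
                Ok(0) | Err(_) => break, // peer closed the connection or the read failed
                Ok(n) => {
                    if to_manager.send(buf[..n].to_vec()).is_err() {
                        break; // the manager is gone
                    }
                }
            }
        }
    });

    // Writer thread: blocked on the channel, writes each chunk to the socket.
    thread::spawn(move || {
        for data in from_manager {
            if write_half.write_all(&data).is_err() {
                break;
            }
        }
    });

    Ok(to_peer)
}
```

The listener thread would call this for every accepted connection and hand the returned sender to the connection manager. Dropping that sender ends the writer thread.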

Synchronization between own devices

The corresponding user CA cert is always among the approved root certificates, therefore the connection manager can accept connections from other devices of the same user. However, this case is special in that the user can synchronize data between their own devices. The data synchronization itself is a complex problem.

A passive node can only make connections to active nodes, so if another passive device is added, the ability to connect doesn't get worse. The same is true when the user has an active node and adds another active node: active nodes can be connected to and they can make connections as well. But when the user has an active node and adds a passive node, that new passive node cannot talk to the passive nodes that could previously reach the active node. Therefore synchronization of contacts from an active to a passive node doesn't make any sense.

I would also consider having a kind of client mode, where a user's own passive node merely acts as a client to access and manage the active node remotely.

I'm also considering adding a possibility for passive nodes to ask active nodes to relay for them. In that case the access modes of the relayed node become the same as those of the relaying node; the only difference is the server name. But I think this won't be part of the prototype right now.

Online status based on connections

When at least one of the contact's devices is online (that is, we have a connection to it), the contact appears online. Sending a new message will send it to all devices of that contact. Actually, when you write a message to someone you just create a message, then the connection manager takes care of actually delivering it.
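As a sketch of how this could look in the connection manager: a contact is online if any of its devices is connected, and delivery is just catching each connected device up to the latest message. DeviceDelivery and the delivered counter are my assumptions for the example, not part of the structures above:

```rust
use std::collections::HashMap;
use std::ops::Range;

#[derive(Clone, Copy, PartialEq, Debug)]
enum ConnectionStatus {
    NotConnected,
    Connected,
}

/// Hypothetical per-device bookkeeping: the connection status and
/// how many messages of the conversation this device already received.
struct DeviceDelivery {
    status: ConnectionStatus,
    delivered: usize,
}

/// The contact appears online if at least one device is connected.
fn contact_is_online(devices: &HashMap<String, DeviceDelivery>) -> bool {
    devices
        .values()
        .any(|d| d.status == ConnectionStatus::Connected)
}

/// For every connected device that is behind, the range of message
/// indices that still has to be sent to it (sorted by device name).
fn pending_deliveries(
    devices: &HashMap<String, DeviceDelivery>,
    message_count: usize,
) -> Vec<(String, Range<usize>)> {
    let mut pending: Vec<(String, Range<usize>)> = devices
        .iter()
        .filter(|(_, d)| d.status == ConnectionStatus::Connected && d.delivered < message_count)
        .map(|(name, d)| (name.clone(), d.delivered..message_count))
        .collect();
    pending.sort_by(|a, b| a.0.cmp(&b.0));
    pending
}
```

Devices that are offline are simply skipped and get caught up the next time they connect.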

Although the initial prototype won't support voice and video calls yet, it's still worth thinking about how they are going to be implemented. I guess placing a call to a contact causes all their devices to ring, and the first one that answers is the one that receives the call. But this will be done later.

So this was after a few days of thinking. Things are starting to get complex. I think if I manage to implement and test the connection manager, then the majority of the essential work is done. But as usual the devil is in the details. I will need to make a good user interface too. And it's also said that 80% of the work is done in 20% of the time it takes to do something; the rest of the time is spent ironing out corner cases and bugs.

In the next post I'm thinking about what the user interface should look like.