Hi guys, 

I'm having an issue that I have narrowed down to is_first_hop(). I can apply a workaround, but I don't know whether I'm doing it correctly or whether the problem is caused by a misconfiguration somewhere.

Let's say we have the following flow:



Client (NAT) TLS -> Kamailio1 (only Public IP) UDP -> Kamailio2  (only Public IP) UDP -> FreeSWITCH (only Public IP).


Regarding this setup, let's focus only on Kamailio1. For the sake of a clear example, I have drawn a small draw.io diagram:








From the above flow, let's stick to the 200 OK replies that Kam1 receives from Kam2 (marked in blue in the screenshot).



First 200 OK (replying to initial INVITE):

SIP/2.0 200 OK
Via: SIP/2.0/UDP KAM1_PUB_IP;rport=5060;branch=z9hG4bK1e2b.e1307b519c9b5f0015343e13f35aeace.0;i=3
Via: SIP/2.0/TLS [2607:fb90:489b:6e13:85f8:596b:b86b:c831]:54744;received=172.58.17.149;branch=z9hG4bK.jmXSnh-7x;rport=43842
Record-Route: <sip:KAM2_PUB_IP;lr=on;ftag=7vidaJ3Hw;did=36e.d441>
Record-Route: <sip:KAM1_PUB_IP;r2=on;lr=on;ftag=7vidaJ3Hw;did=36e.a792;nat=yes>
Record-Route: <sip:KAM1_PUB_FQDN:443;transport=tls;r2=on;lr=on;ftag=7vidaJ3Hw;did=36e.a792;nat=yes>
From: "Joel Test 1" <sip:8bd2a0aba14541789bb7269800646458@MY_DOMAIN>;tag=7vidaJ3Hw
To: "Joel Test 2" <sip:e78f2617b0d345d3bdb7b6780ece903c@MY_DOMAIN>;tag=vceg6N0m5ypHa
Call-ID: 3ezoQGF1kp
CSeq: 21 INVITE
Contact: <sip:e78f2617b0d345d3bdb7b6780ece903c@FS_PUB_IP:6061;transport=udp>
User-Agent: TP MEDIA 2.0
Allow: INVITE, ACK, BYE, CANCEL, OPTIONS, MESSAGE, INFO, UPDATE, REGISTER, REFER, NOTIFY
Supported: timer, path, replaces
Allow-Events: talk, hold, conference, refer
Content-Type: application/sdp
Content-Disposition: session
Content-Length: 358
Remote-Party-ID: "e78f2617b0d345d3bdb7b6780ece903c" <sip:e78f2617b0d345d3bdb7b6780ece903c@MY_DOMAIN>;party=calling;privacy=off;screen=no

v=0
o=TP 1529576021 1529576022 IN IP4 FS_PUB_IP
s=TP
c=IN IP4 FS_PUB_IP
t=0 0
m=audio 25484 RTP/AVP 96 101
a=rtpmap:96 opus/48000/2
a=fmtp:96 useinbandfec=1; maxplaybackrate=8000; sprop-maxcapturerate=8000
a=rtpmap:101 telephone-event/48000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
a=rtcp:25485 IN IP4 FS_PUB_IP



Second 200 OK (replying to in-dialog INVITE with updated SDP):

SIP/2.0 200 OK
Via: SIP/2.0/UDP KAM1_PUB_IP;rport=5060;branch=z9hG4bKed2b.0006a6a159e800129a62b4415fdd64e6.0;i=7
Via: SIP/2.0/TLS 192.168.30.63:54752;received=A.B.C.D;branch=z9hG4bK.iKO2iYIgK;rport=27819
From: "Joel Test 1" <sip:8bd2a0aba14541789bb7269800646458@MY_DOMAIN>;tag=7vidaJ3Hw
To: "Joel Test 2" <sip:e78f2617b0d345d3bdb7b6780ece903c@MY_DOMAIN>;tag=vceg6N0m5ypHa
Call-ID: 3ezoQGF1kp
CSeq: 22 INVITE
Contact: <sip:e78f2617b0d345d3bdb7b6780ece903c@FS_PUB_IP:6061;transport=udp>
User-Agent: TP MEDIA 2.0
Accept: application/sdp
Allow: INVITE, ACK, BYE, CANCEL, OPTIONS, MESSAGE, INFO, UPDATE, REGISTER, REFER, NOTIFY
Supported: timer, path, replaces
Content-Type: application/sdp
Content-Disposition: session
Content-Length: 358

v=0
o=TP 1529576021 1529576022 IN IP4 FS_PUB_IP
s=TP
c=IN IP4 FS_PUB_IP
t=0 0
m=audio 25484 RTP/AVP 96 101
a=rtpmap:96 opus/48000/2
a=fmtp:96 useinbandfec=1; maxplaybackrate=8000; sprop-maxcapturerate=8000
a=rtpmap:101 telephone-event/48000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
a=rtcp:25485 IN IP4 FS_PUB_IP




Now here comes the problem. I have the following in my kam1 config:



route[NATMANAGE] {

    ...

    if (is_reply()) {
        if (isbflagset(FLB_NATB)) {
            if (is_first_hop()) {
                fix_nated_contact();
            }
        }
    }

    ...

}



So, on the first 200 OK, when it reaches that part of the config:

1- is_reply() -> OK
2- isbflagset(FLB_NATB) -> OK (because NAT was detected on the initial request)
3- is_first_hop() -> FAIL

fix_nated_contact() is NOT applied.

(This is the correct and expected behavior.)




Now, on the second 200 OK, again in that part of the config:

1- is_reply() -> OK
2- isbflagset(FLB_NATB) -> OK
3- is_first_hop() -> OK

fix_nated_contact() is applied, so the Contact is rewritten and the client sends the ACK with incorrect information, leading to another set of issues...





4.30.  is_first_hop()
The function returns true if the proxy is first hop after the original sender. For incoming SIP requests, it means there is only one Via header. For incoming SIP replies, it means that top Record-Route URI is 'myself' and source address is not matching it (to avoid detecting in case of local loops). Note that it does not detect spirals, which can have the condition for replies true also in the case of additional SIP reply receival.


So going back to the examples:

first 200 OK:

1- "top Record-Route URI is 'myself'" -> FAIL (the top Record-Route is KAM2_PUB_IP)

So we are NOT the first hop; we do nothing and forward the reply to the client.


second 200 OK:

1- "top Record-Route URI is 'myself'" -> No Record-Route headers are present, yet is_first_hop() returns true, so we enter the condition and modify the Contact with fix_nated_contact().
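
One way to check this reading is to log what each reply looks like when it reaches the NAT block. The sketch below is only illustrative: it assumes the xlog and textops modules are loaded (for xlog() and is_present_hf()), and the log wording is made up for this example.

```
if (is_reply() && isbflagset(FLB_NATB)) {
    # Show whether the reply carries any Record-Route header at all
    if (is_present_hf("Record-Route")) {
        xlog("L_INFO", "reply $rs to $rm (CSeq $cs) from $si: top RR is $hdr(Record-Route)\n");
    } else {
        xlog("L_INFO", "reply $rs to $rm (CSeq $cs) from $si: no Record-Route\n");
    }
    # Show the decision is_first_hop() takes for this reply
    if (is_first_hop()) {
        xlog("L_INFO", "is_first_hop() is true for the reply from $si\n");
    }
}
```

If the analysis above is right, the reply with CSeq 21 should log KAM2_PUB_IP as top Record-Route and no first-hop line, while the reply with CSeq 22 should log "no Record-Route" followed by the first-hop line.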



Now to the real topic. I have the following workaround:

...
if (is_reply()) {
    if (isbflagset(FLB_NATB)) {
        if (is_first_hop()) {
            if (!ds_is_from_list()) { # <-- Check to see if the reply is coming from our internal servers
                fix_nated_contact();
            }
        }
    }
}
...
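
If the dispatcher module were not available on Kam1, the same guard could be sketched with a plain source-address check. This is only a sketch under assumptions: KAM2_PUB_IP is the placeholder used in the traces above and has to be replaced with the real address, and it assumes Kam2 is the only internal server that sends replies to Kam1.

```
if (is_reply() && isbflagset(FLB_NATB) && is_first_hop()) {
    # $si is the source IP address the reply was received from.
    # Assumption: a reply coming from KAM2_PUB_IP never comes from a NATed client.
    if ($si != "KAM2_PUB_IP") {
        fix_nated_contact();
    }
}
```

Compared with ds_is_from_list(), this hard-codes a single address, so the dispatcher check is likely the better fit once there are several internal servers.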



That works correctly, but I would like to understand why it is needed.


I hope I have explained the scenario clearly enough to get to my questions:

1- Is it correct for is_first_hop() to detect the second 200 OK as a first hop when it isn't? The behavior matches the documentation, so I can't tell. Going strictly by the header checks, it works as described. Going by the concept of kam1 actually being the first hop for that 200 OK, the check would need another condition to exclude the 200 OK (or rather, any reply) to an in-dialog INVITE.

2- Do you guys consider the workaround reasonable? In my opinion I would rather not have to add it, but I'm not sure.

3- Am I missing something super standard that avoids all of this? I'm starting to go crazy trying and comparing different things to understand this, and I want to make sure it's not just an is_first_hop() bug.


Sorry for such a long email, but I think that describing the scenario and flow was required.

Any input is more than welcome!


Thanks!
Joel.