libunistring is old, slow, messy code and uncompilable on new systems. remove the sole user of it, and replace it with inline utf8 decoder implementation from http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
improves performance and portability as libunistring is not needed. --- @Peter, Hugh: Would it be ok for me to push this?
modules/websocket/Makefile | 2 +- modules/websocket/README | 1 - modules/websocket/doc/websocket_admin.xml | 3 -- modules/websocket/utf8_decode.h | 52 +++++++++++++++++++++++++++++++ modules/websocket/ws_frame.c | 4 +-- 5 files changed, 55 insertions(+), 7 deletions(-) create mode 100644 modules/websocket/utf8_decode.h
diff --git a/modules/websocket/Makefile b/modules/websocket/Makefile index bb7c809..c686a82 100644 --- a/modules/websocket/Makefile +++ b/modules/websocket/Makefile @@ -27,7 +27,7 @@ else # E.g.: make TLS_HOOKS=1 TLS_EXTRA_LIBS="-lz -lkrb5" endif
-LIBS+= $(TLS_EXTRA_LIBS) -lunistring +LIBS+= $(TLS_EXTRA_LIBS)
# Static linking, if you'd like to use TLS and WEBSOCKET at the same time # diff --git a/modules/websocket/README b/modules/websocket/README index 49d8693..bdba3e4 100644 --- a/modules/websocket/README +++ b/modules/websocket/README @@ -316,7 +316,6 @@ onreply_route[WS_REPLY] { The following libraries must be installed before running Kamailio with this module loaded: * OpenSSL. - * GNU libunistring.
4. Parameters
diff --git a/modules/websocket/doc/websocket_admin.xml b/modules/websocket/doc/websocket_admin.xml index fa7d300..e40dc09 100644 --- a/modules/websocket/doc/websocket_admin.xml +++ b/modules/websocket/doc/websocket_admin.xml @@ -262,9 +262,6 @@ onreply_route[WS_REPLY] { <listitem> <para><emphasis>OpenSSL</emphasis>.</para> </listitem> - <listitem> - <para><emphasis>GNU libunistring</emphasis>.</para> - </listitem> </itemizedlist> </para> </section> diff --git a/modules/websocket/utf8_decode.h b/modules/websocket/utf8_decode.h new file mode 100644 index 0000000..b274fe7 --- /dev/null +++ b/modules/websocket/utf8_decode.h @@ -0,0 +1,52 @@ +#include <stdint.h> +#include <stddef.h> + +// Copyright (c) 2008-2010 Bjoern Hoehrmann bjoern@hoehrmann.de +// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details. + +#define UTF8_ACCEPT 0 +#define UTF8_REJECT 12 + +static const uint8_t utf8d[] = { + // The first part of the table maps bytes to character classes that + // to reduce the size of the transition table and create bitmasks. + 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, + 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, + 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, + 10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8, + + // The second part is a transition table that maps a combination + // of a state of the automaton and a character class to a state. + 0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12, + 12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12, + 12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12, + 12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12, + 12,36,12,12,12,12,12,12,12,12,12,12, +}; + +static inline uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) +{ + uint32_t type = utf8d[byte]; + + *codep = (*state != UTF8_ACCEPT) ? + (byte & 0x3fu) | (*codep << 6) : + (0xff >> type) & (byte); + + *state = utf8d[256 + *state + type]; + return *state; +} + +static inline int IsUTF8(uint8_t* s, size_t len) +{ + uint32_t codepoint, state = 0; + + while (len--) + decode(&state, &codepoint, *s++); + + return state == UTF8_ACCEPT; +} + diff --git a/modules/websocket/ws_frame.c b/modules/websocket/ws_frame.c index a3a4cef..3562437 100644 --- a/modules/websocket/ws_frame.c +++ b/modules/websocket/ws_frame.c @@ -22,7 +22,7 @@ */
#include <limits.h> -#include <unistr.h> +#include "utf8_decode.h" #include "../../events.h" #include "../../receive.h" #include "../../stats.h" @@ -695,7 +695,7 @@ int ws_frame_transmit(void *data) frame.fin = 1; /* Can't be sure whether this message is UTF-8 or not so check to see if it "might" be UTF-8 and send as binary if it definitely isn't */ - frame.opcode = (u8_check((uint8_t *) wsev->buf, wsev->len) == NULL) ? + frame.opcode = IsUTF8((uint8_t *) wsev->buf, wsev->len) ? OPCODE_TEXT_FRAME : OPCODE_BINARY_FRAME; frame.payload_len = wsev->len; frame.payload_data = wsev->buf;
Hello,
getting rid of libunistring is good, indeed, thanks.
One thing that has to be cared with the websocket module is the need to link against libssl, so if the new included code is gpl we need to have an exception for it to comply with official debian packaging requirements.
The page lists a different license than gpl, it seems to be MIT, which I guess is all fine, being compatible with GPL and allowing linking with libssl. But I wanted to highlight in case anyone else has different opinion/more details.
Cheers, Daniel
On 20/12/13 11:47, Timo Teräs wrote:
libunistring is old, slow, messy code and uncompilable on new systems. remove the sole user of it, and replace it with inline utf8 decoder implementation from http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
improves performance and portability as libunistring is not needed.
@Peter, Hugh: Would it be ok for me to push this?
modules/websocket/Makefile | 2 +- modules/websocket/README | 1 - modules/websocket/doc/websocket_admin.xml | 3 -- modules/websocket/utf8_decode.h | 52 +++++++++++++++++++++++++++++++ modules/websocket/ws_frame.c | 4 +-- 5 files changed, 55 insertions(+), 7 deletions(-) create mode 100644 modules/websocket/utf8_decode.h
diff --git a/modules/websocket/Makefile b/modules/websocket/Makefile index bb7c809..c686a82 100644 --- a/modules/websocket/Makefile +++ b/modules/websocket/Makefile @@ -27,7 +27,7 @@ else # E.g.: make TLS_HOOKS=1 TLS_EXTRA_LIBS="-lz -lkrb5" endif
-LIBS+= $(TLS_EXTRA_LIBS) -lunistring +LIBS+= $(TLS_EXTRA_LIBS)
# Static linking, if you'd like to use TLS and WEBSOCKET at the same time # diff --git a/modules/websocket/README b/modules/websocket/README index 49d8693..bdba3e4 100644 --- a/modules/websocket/README +++ b/modules/websocket/README @@ -316,7 +316,6 @@ onreply_route[WS_REPLY] { The following libraries must be installed before running Kamailio with this module loaded: * OpenSSL.
* GNU libunistring.
- Parameters
diff --git a/modules/websocket/doc/websocket_admin.xml b/modules/websocket/doc/websocket_admin.xml index fa7d300..e40dc09 100644 --- a/modules/websocket/doc/websocket_admin.xml +++ b/modules/websocket/doc/websocket_admin.xml @@ -262,9 +262,6 @@ onreply_route[WS_REPLY] { <listitem> <para><emphasis>OpenSSL</emphasis>.</para> </listitem>
<listitem>
<para><emphasis>GNU libunistring</emphasis>.</para>
</listitem>
</itemizedlist> </para> </section>
diff --git a/modules/websocket/utf8_decode.h b/modules/websocket/utf8_decode.h new file mode 100644 index 0000000..b274fe7 --- /dev/null +++ b/modules/websocket/utf8_decode.h @@ -0,0 +1,52 @@ +#include <stdint.h> +#include <stddef.h>
+// Copyright (c) 2008-2010 Bjoern Hoehrmann bjoern@hoehrmann.de +// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
+#define UTF8_ACCEPT 0 +#define UTF8_REJECT 12
+static const uint8_t utf8d[] = {
- // The first part of the table maps bytes to character classes that
- // to reduce the size of the transition table and create bitmasks.
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
- 10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,
- // The second part is a transition table that maps a combination
- // of a state of the automaton and a character class to a state.
- 0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
- 12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
- 12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
- 12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
- 12,36,12,12,12,12,12,12,12,12,12,12,
+};
+static inline uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) +{
- uint32_t type = utf8d[byte];
- *codep = (*state != UTF8_ACCEPT) ?
- (byte & 0x3fu) | (*codep << 6) :
- (0xff >> type) & (byte);
- *state = utf8d[256 + *state + type];
- return *state;
+}
+static inline int IsUTF8(uint8_t* s, size_t len) +{
- uint32_t codepoint, state = 0;
- while (len--)
- decode(&state, &codepoint, *s++);
- return state == UTF8_ACCEPT;
+}
diff --git a/modules/websocket/ws_frame.c b/modules/websocket/ws_frame.c index a3a4cef..3562437 100644 --- a/modules/websocket/ws_frame.c +++ b/modules/websocket/ws_frame.c @@ -22,7 +22,7 @@ */
#include <limits.h> -#include <unistr.h> +#include "utf8_decode.h" #include "../../events.h" #include "../../receive.h" #include "../../stats.h" @@ -695,7 +695,7 @@ int ws_frame_transmit(void *data) frame.fin = 1; /* Can't be sure whether this message is UTF-8 or not so check to see if it "might" be UTF-8 and send as binary if it definitely isn't */
- frame.opcode = (u8_check((uint8_t *) wsev->buf, wsev->len) == NULL) ?
- frame.opcode = IsUTF8((uint8_t *) wsev->buf, wsev->len) ? OPCODE_TEXT_FRAME : OPCODE_BINARY_FRAME; frame.payload_len = wsev->len; frame.payload_data = wsev->buf;
Hi,
Yes - it is MIT license in the webpage. Might be a good thing to mention that in the created header file too.
- Timo
On Fri, 20 Dec 2013 12:08:10 +0100 Daniel-Constantin Mierla miconda@gmail.com wrote:
getting rid of libunistring is good, indeed, thanks.
One thing that has to be cared with the websocket module is the need to link against libssl, so if the new included code is gpl we need to have an exception for it to comply with official debian packaging requirements.
The page lists a different license than gpl, it seems to be MIT, which I guess is all fine, being compatible with GPL and allowing linking with libssl. But I wanted to highlight in case anyone else has different opinion/more details.
Cheers, Daniel
On 20/12/13 11:47, Timo Teräs wrote:
libunistring is old, slow, messy code and uncompilable on new systems. remove the sole user of it, and replace it with inline utf8 decoder implementation from http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
improves performance and portability as libunistring is not needed.
@Peter, Hugh: Would it be ok for me to push this?
modules/websocket/Makefile | 2 +- modules/websocket/README | 1 - modules/websocket/doc/websocket_admin.xml | 3 -- modules/websocket/utf8_decode.h | 52 +++++++++++++++++++++++++++++++ modules/websocket/ws_frame.c | 4 +-- 5 files changed, 55 insertions(+), 7 deletions(-) create mode 100644 modules/websocket/utf8_decode.h
diff --git a/modules/websocket/Makefile b/modules/websocket/Makefile index bb7c809..c686a82 100644 --- a/modules/websocket/Makefile +++ b/modules/websocket/Makefile @@ -27,7 +27,7 @@ else # E.g.: make TLS_HOOKS=1 TLS_EXTRA_LIBS="-lz -lkrb5" endif
-LIBS+= $(TLS_EXTRA_LIBS) -lunistring +LIBS+= $(TLS_EXTRA_LIBS)
# Static linking, if you'd like to use TLS and WEBSOCKET at the same time # diff --git a/modules/websocket/README b/modules/websocket/README index 49d8693..bdba3e4 100644 --- a/modules/websocket/README +++ b/modules/websocket/README @@ -316,7 +316,6 @@ onreply_route[WS_REPLY] { The following libraries must be installed before running Kamailio with this module loaded: * OpenSSL.
* GNU libunistring.
- Parameters
diff --git a/modules/websocket/doc/websocket_admin.xml b/modules/websocket/doc/websocket_admin.xml index fa7d300..e40dc09 100644 --- a/modules/websocket/doc/websocket_admin.xml +++ b/modules/websocket/doc/websocket_admin.xml @@ -262,9 +262,6 @@ onreply_route[WS_REPLY] { <listitem> <para><emphasis>OpenSSL</emphasis>.</para> </listitem>
<listitem>
<para><emphasis>GNU libunistring</emphasis>.</para>
</listitem>
</itemizedlist> </para> </section>
diff --git a/modules/websocket/utf8_decode.h b/modules/websocket/utf8_decode.h new file mode 100644 index 0000000..b274fe7 --- /dev/null +++ b/modules/websocket/utf8_decode.h @@ -0,0 +1,52 @@ +#include <stdint.h> +#include <stddef.h>
+// Copyright (c) 2008-2010 Bjoern Hoehrmann bjoern@hoehrmann.de +// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
+#define UTF8_ACCEPT 0 +#define UTF8_REJECT 12
+static const uint8_t utf8d[] = {
- // The first part of the table maps bytes to character classes
that
- // to reduce the size of the transition table and create
bitmasks.
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
- 10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3,
11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8, +
- // The second part is a transition table that maps a combination
- // of a state of the automaton and a character class to a state.
- 0,12,24,36,60,96,84,12,12,12,48,72,
12,12,12,12,12,12,12,12,12,12,12,12,
- 12, 0,12,12,12,12,12, 0,12, 0,12,12,
12,24,12,12,12,12,12,24,12,24,12,12,
- 12,12,12,12,12,12,12,24,12,12,12,12,
12,24,12,12,12,12,12,12,12,24,12,12,
- 12,12,12,12,12,12,12,36,12,36,12,12,
12,36,12,12,12,12,12,36,12,36,12,12,
- 12,36,12,12,12,12,12,12,12,12,12,12,
+};
+static inline uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) +{
- uint32_t type = utf8d[byte];
- *codep = (*state != UTF8_ACCEPT) ?
- (byte & 0x3fu) | (*codep << 6) :
- (0xff >> type) & (byte);
- *state = utf8d[256 + *state + type];
- return *state;
+}
+static inline int IsUTF8(uint8_t* s, size_t len) +{
- uint32_t codepoint, state = 0;
- while (len--)
- decode(&state, &codepoint, *s++);
- return state == UTF8_ACCEPT;
+}
diff --git a/modules/websocket/ws_frame.c b/modules/websocket/ws_frame.c index a3a4cef..3562437 100644 --- a/modules/websocket/ws_frame.c +++ b/modules/websocket/ws_frame.c @@ -22,7 +22,7 @@ */
#include <limits.h> -#include <unistr.h> +#include "utf8_decode.h" #include "../../events.h" #include "../../receive.h" #include "../../stats.h" @@ -695,7 +695,7 @@ int ws_frame_transmit(void *data) frame.fin = 1; /* Can't be sure whether this message is UTF-8 or not so check to see if it "might" be UTF-8 and send as binary if it definitely isn't */
- frame.opcode = (u8_check((uint8_t *) wsev->buf, wsev->len)
== NULL) ?
- frame.opcode = IsUTF8((uint8_t *) wsev->buf, wsev->len) ? OPCODE_TEXT_FRAME :
OPCODE_BINARY_FRAME; frame.payload_len = wsev->len; frame.payload_data = wsev->buf;
Hello,
One of the reasons I used libunistring for this detection is that for pretty much all of the code fragments I found online for doing this in a "simple" way I saw people pointing out flaws in those algorithms. Can you confirm that this code doesn't have any of those flaws and is guaranteed to work in all cases (has this implementation been stubbed out and tested fully by someone here)?
What new systems does libunistring not work on? I have only ever had problems with it on older OS versions which didn't contain it at all.
Does this really improve performance? Only a tiny, tiny, subset of libunistring is used. As a result it doesn't really matter if libunistring in general is slow, just whether or not the one function used from libunistring is slow.
It also looks like the code style (particularly indentation) of that patch doesn't really match the style from that file. If that really is the case it is probably worth aligning the patch to the style in that file before applying and committing it.
Regards,
Peter
On 20 December 2013 11:29, Timo Teras timo.teras@iki.fi wrote:
Hi,
Yes - it is MIT license in the webpage. Might be a good thing to mention that in the created header file too.
- Timo
On Fri, 20 Dec 2013 12:08:10 +0100 Daniel-Constantin Mierla miconda@gmail.com wrote:
getting rid of libunistring is good, indeed, thanks.
One thing that has to be cared with the websocket module is the need to link against libssl, so if the new included code is gpl we need to have an exception for it to comply with official debian packaging requirements.
The page lists a different license than gpl, it seems to be MIT, which I guess is all fine, being compatible with GPL and allowing linking with libssl. But I wanted to highlight in case anyone else has different opinion/more details.
Cheers, Daniel
On 20/12/13 11:47, Timo Teräs wrote:
libunistring is old, slow, messy code and uncompilable on new systems. remove the sole user of it, and replace it with inline utf8 decoder implementation from http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
improves performance and portability as libunistring is not needed.
@Peter, Hugh: Would it be ok for me to push this?
modules/websocket/Makefile | 2 +- modules/websocket/README | 1 - modules/websocket/doc/websocket_admin.xml | 3 -- modules/websocket/utf8_decode.h | 52 +++++++++++++++++++++++++++++++ modules/websocket/ws_frame.c | 4 +-- 5 files changed, 55 insertions(+), 7 deletions(-) create mode 100644 modules/websocket/utf8_decode.h
diff --git a/modules/websocket/Makefile b/modules/websocket/Makefile index bb7c809..c686a82 100644 --- a/modules/websocket/Makefile +++ b/modules/websocket/Makefile @@ -27,7 +27,7 @@ else # E.g.: make TLS_HOOKS=1 TLS_EXTRA_LIBS="-lz -lkrb5" endif
-LIBS+= $(TLS_EXTRA_LIBS) -lunistring +LIBS+= $(TLS_EXTRA_LIBS)
# Static linking, if you'd like to use TLS and WEBSOCKET at the same time # diff --git a/modules/websocket/README b/modules/websocket/README index 49d8693..bdba3e4 100644 --- a/modules/websocket/README +++ b/modules/websocket/README @@ -316,7 +316,6 @@ onreply_route[WS_REPLY] { The following libraries must be installed before running Kamailio with this module loaded: * OpenSSL.
* GNU libunistring.
- Parameters
diff --git a/modules/websocket/doc/websocket_admin.xml b/modules/websocket/doc/websocket_admin.xml index fa7d300..e40dc09 100644 --- a/modules/websocket/doc/websocket_admin.xml +++ b/modules/websocket/doc/websocket_admin.xml @@ -262,9 +262,6 @@ onreply_route[WS_REPLY] { <listitem> <para><emphasis>OpenSSL</emphasis>.</para> </listitem>
<listitem>
<para><emphasis>GNU libunistring</emphasis>.</para>
</listitem> </itemizedlist> </para>
</section>
diff --git a/modules/websocket/utf8_decode.h b/modules/websocket/utf8_decode.h new file mode 100644 index 0000000..b274fe7 --- /dev/null +++ b/modules/websocket/utf8_decode.h @@ -0,0 +1,52 @@ +#include <stdint.h> +#include <stddef.h>
+// Copyright (c) 2008-2010 Bjoern Hoehrmann bjoern@hoehrmann.de +// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
+#define UTF8_ACCEPT 0 +#define UTF8_REJECT 12
+static const uint8_t utf8d[] = {
- // The first part of the table maps bytes to character classes
that
- // to reduce the size of the transition table and create
bitmasks.
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
- 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
- 10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3,
11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8, +
- // The second part is a transition table that maps a combination
- // of a state of the automaton and a character class to a state.
- 0,12,24,36,60,96,84,12,12,12,48,72,
12,12,12,12,12,12,12,12,12,12,12,12,
- 12, 0,12,12,12,12,12, 0,12, 0,12,12,
12,24,12,12,12,12,12,24,12,24,12,12,
- 12,12,12,12,12,12,12,24,12,12,12,12,
12,24,12,12,12,12,12,12,12,24,12,12,
- 12,12,12,12,12,12,12,36,12,36,12,12,
12,36,12,12,12,12,12,36,12,36,12,12,
- 12,36,12,12,12,12,12,12,12,12,12,12,
+};
+static inline uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) +{
- uint32_t type = utf8d[byte];
- *codep = (*state != UTF8_ACCEPT) ?
- (byte & 0x3fu) | (*codep << 6) :
- (0xff >> type) & (byte);
- *state = utf8d[256 + *state + type];
- return *state;
+}
+static inline int IsUTF8(uint8_t* s, size_t len) +{
- uint32_t codepoint, state = 0;
- while (len--)
- decode(&state, &codepoint, *s++);
- return state == UTF8_ACCEPT;
+}
diff --git a/modules/websocket/ws_frame.c b/modules/websocket/ws_frame.c index a3a4cef..3562437 100644 --- a/modules/websocket/ws_frame.c +++ b/modules/websocket/ws_frame.c @@ -22,7 +22,7 @@ */
#include <limits.h> -#include <unistr.h> +#include "utf8_decode.h" #include "../../events.h" #include "../../receive.h" #include "../../stats.h" @@ -695,7 +695,7 @@ int ws_frame_transmit(void *data) frame.fin = 1; /* Can't be sure whether this message is UTF-8 or not so check to see if it "might" be UTF-8 and send as binary if it definitely isn't */
- frame.opcode = (u8_check((uint8_t *) wsev->buf, wsev->len)
== NULL) ?
- frame.opcode = IsUTF8((uint8_t *) wsev->buf, wsev->len) ? OPCODE_TEXT_FRAME :
OPCODE_BINARY_FRAME; frame.payload_len = wsev->len; frame.payload_data = wsev->buf;
On Fri, 20 Dec 2013 13:56:08 +0000 Peter Dunkley peter.dunkley@crocodile-rcs.com wrote:
One of the reasons I used libunistring for this detection is that for pretty much all of the code fragments I found online for doing this in a "simple" way I saw people pointing out flaws in those algorithms. Can you confirm that this code doesn't have any of those flaws and is guaranteed to work in all cases (has this implementation been stubbed out and tested fully by someone here)?
Did you read the document on the URL it refers to? It is quite thorough explanation of what it does, it's correctness and speed. It also explains that the motivation for implementing it was because all those snippets in the internet are seriously flawed.
What new systems does libunistring not work on? I have only ever had problems with it on older OS versions which didn't contain it at all.
It embeds an ancient gnulib version, and is not updateable easily. This makes it fail on systems using musl c-library. gnulib started to support it properly only few years ago. gnulib also at various places wants to touch libc internals so it can be considered a problem if porting to new systems.
Also the autoconfigury is not cannot be regenerated with new autotools.
Does this really improve performance? Only a tiny, tiny, subset of libunistring is used. As a result it doesn't really matter if libunistring in general is slow, just whether or not the one function used from libunistring is slow.
I'm referring specifically to the function you use. Please check the URL for performance comparison. While libunistring is not benchmarked separately, one can see with 0.1 second look at libunistring's implementation that it will be slower in performance and is likely something close to iconv()'s implementation.
It also looks like the code style (particularly indentation) of that patch doesn't really match the style from that file. If that really is the case it is probably worth aligning the patch to the style in that file before applying and committing it.
Sure. Indentation is easy to fix.
Thanks, Timo
On 20 December 2013 14:09, Timo Teras timo.teras@iki.fi wrote:
On Fri, 20 Dec 2013 13:56:08 +0000 Peter Dunkley peter.dunkley@crocodile-rcs.com wrote:
One of the reasons I used libunistring for this detection is that for pretty much all of the code fragments I found online for doing this in a "simple" way I saw people pointing out flaws in those algorithms. Can you confirm that this code doesn't have any of those flaws and is guaranteed to work in all cases (has this implementation been stubbed out and tested fully by someone here)?
Did you read the document on the URL it refers to? It is quite thorough explanation of what it does, it's correctness and speed. It also explains that the motivation for implementing it was because all those snippets in the internet are seriously flawed.
I did see the document. It looks good. Does you have first hand
experience of whether that code is valid and correct or not?
I personally tend to trust a release library from GNU somewhat more than code on a web-site - however good that web-site and documentation looks.
Does this really improve performance? Only a tiny, tiny, subset of
libunistring is used. As a result it doesn't really matter if libunistring in general is slow, just whether or not the one function used from libunistring is slow.
I'm referring specifically to the function you use. Please check the URL for performance comparison. While libunistring is not benchmarked separately, one can see with 0.1 second look at libunistring's implementation that it will be slower in performance and is likely something close to iconv()'s implementation.
I don't have any objection to this change as long as you are 100% sure that the algorithm from that web-page works correctly. The algorithm looks good, the web-page looks good, the documentation looks good, but I'd really prefer it if the implementation was explicitly and fully tested before the libunistring call is replaced.
Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect.
Regards,
Peter
Was there any resolution on this topic? I would like to get rid of the unnecessary dependency, the code looked fine at a quick check -- if it is just about utf8 encoding/decoding.
Eventually it can be made a compile time switch with defines for both options, keep the code for both cases and be able to easily switch from one to another.
Cheers, Daniel
On 20/12/13 15:45, Peter Dunkley wrote:
On 20 December 2013 14:09, Timo Teras <timo.teras@iki.fi mailto:timo.teras@iki.fi> wrote:
On Fri, 20 Dec 2013 13:56:08 +0000 Peter Dunkley <peter.dunkley@crocodile-rcs.com <mailto:peter.dunkley@crocodile-rcs.com>> wrote: > One of the reasons I used libunistring for this detection is that for > pretty much all of the code fragments I found online for doing this > in a "simple" way I saw people pointing out flaws in those > algorithms. Can you confirm that this code doesn't have any of those > flaws and is guaranteed to work in all cases (has this implementation > been stubbed out and tested fully by someone here)? Did you read the document on the URL it refers to? It is quite thorough explanation of what it does, it's correctness and speed. It also explains that the motivation for implementing it was because all those snippets in the internet are seriously flawed.
I did see the document. It looks good. Does you have first hand experience of whether that code is valid and correct or not?
I personally tend to trust a release library from GNU somewhat more than code on a web-site - however good that web-site and documentation looks.
> Does this really improve performance? Only a tiny, tiny, subset of > libunistring is used. As a result it doesn't really matter if > libunistring in general is slow, just whether or not the one function > used from libunistring is slow. I'm referring specifically to the function you use. Please check the URL for performance comparison. While libunistring is not benchmarked separately, one can see with 0.1 second look at libunistring's implementation that it will be slower in performance and is likely something close to iconv()'s implementation.
I don't have any objection to this change as long as you are 100% sure that the algorithm from that web-page works correctly. The algorithm looks good, the web-page looks good, the documentation looks good, but I'd really prefer it if the implementation was explicitly and fully tested before the libunistring call is replaced.
Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect.
Regards,
Peter
Peter Dunkley Technical Director Crocodile RCS Ltd
On Tue, 04 Feb 2014 21:34:35 +0100 Daniel-Constantin Mierla miconda@gmail.com wrote:
Was there any resolution on this topic? I would like to get rid of the unnecessary dependency, the code looked fine at a quick check -- if it is just about utf8 encoding/decoding.
Eventually it can be made a compile time switch with defines for both options, keep the code for both cases and be able to easily switch from one to another.
I think the patch was not 'blessed' yet.
Peter asked for testing results along the lines of:
Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect.
The calling the function with every possible utf-8 string is impossible test plan.
Dunno. Perhaps you want to push the patch with dictator hat on, or some sane test plan can be made. For now, I just applied the patch to my local builds and forgot this.
- Timo
Thanks for update. Can you wrap your patch in some define, so it can be applied while keeping the libunistring as alternative? Something like:
#ifdef USE_UTF8_EMBEDDED // your code here #else // libunistring function call #endif
Then it can be pushed without any problem to the master branch, allowing me (and others) to test it easily. After that we can re-evaluate to remove libunistring completely or maybe just make the embedded version default.
Cheers, Daniel
On 05/02/14 07:45, Timo Teras wrote:
On Tue, 04 Feb 2014 21:34:35 +0100 Daniel-Constantin Mierla miconda@gmail.com wrote:
Was there any resolution on this topic? I would like to get rid of the unnecessary dependency, the code looked fine at a quick check -- if it is just about utf8 encoding/decoding.
Eventually it can be made a compile time switch with defines for both options, keep the code for both cases and be able to easily switch from one to another.
I think the patch was not 'blessed' yet.
Peter asked for testing results along the lines of:
Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect.
The calling the function with every possible utf-8 string is impossible test plan.
Dunno. Perhaps you want to push the patch with dictator hat on, or some sane test plan can be made. For now, I just applied the patch to my local builds and forgot this.
- Timo
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
Hello,
I have no problem in principle with the patch, I am just wary of changing something that works. I do Kamailio builds on four (very different) Linux variants (CentOS 6, Fedora 19, Ubuntu 13.10, Rasbian (Debian Wheezy on Raspberry Pi)) and have never had a problem with the libunistring dependency for WebSockets.
The code in the patch looks straight forward, and the information appears compelling, my question was just about verification and testing of this code copied from the webpage before using it to replace something that currently works.
Regards,
Peter
On 5 February 2014 08:05, Daniel-Constantin Mierla miconda@gmail.comwrote:
Thanks for update. Can you wrap your patch in some define, so it can be applied while keeping the libunistring as alternative? Something like:
#ifdef USE_UTF8_EMBEDDED // your code here #else // libunistring function call #endif
Then it can be pushed without any problem to the master branch, allowing me (and others) to test it easily. After that we can re-evaluate to remove libunistring completely or maybe just make the embedded version default.
Cheers, Daniel
On 05/02/14 07:45, Timo Teras wrote:
On Tue, 04 Feb 2014 21:34:35 +0100 Daniel-Constantin Mierla miconda@gmail.com wrote:
Was there any resolution on this topic? I would like to get rid of
the unnecessary dependency, the code looked fine at a quick check -- if it is just about utf8 encoding/decoding.
Eventually it can be made a compile time switch with defines for both options, keep the code for both cases and be able to easily switch from one to another.
I think the patch was not 'blessed' yet.
Peter asked for testing results along the lines of:
Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect.
The calling the function with every possible utf-8 string is impossible test plan.
Dunno. Perhaps you want to push the patch with dictator hat on, or some sane test plan can be made. For now, I just applied the patch to my local builds and forgot this.
- Timo
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
-- Daniel-Constantin Mierla - http://www.asipto.com http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
Hello,
I wrapped the patch within defines and pushed it to master branch. I plan to test it very soon as I need it in some old systems.
To enable it, websocket module has to be compuled with EMBEDDED_UTF8_DECODE=1, e.g.:
make modules EMBEDDED_UTF8_DECODE=1 modules=modules/websocket
Cheers, Daniel
On 05/02/14 10:23, Peter Dunkley wrote:
Hello,
I have no problem in principle with the patch, I am just wary of changing something that works. I do Kamailio builds on four (very different) Linux variants (CentOS 6, Fedora 19, Ubuntu 13.10, Rasbian (Debian Wheezy on Raspberry Pi)) and have never had a problem with the libunistring dependency for WebSockets.
The code in the patch looks straight forward, and the information appears compelling, my question was just about verification and testing of this code copied from the webpage before using it to replace something that currently works.
Regards,
Peter
On 5 February 2014 08:05, Daniel-Constantin Mierla <miconda@gmail.com mailto:miconda@gmail.com> wrote:
Thanks for update. Can you wrap your patch in some define, so it can be applied while keeping the libunistring as alternative? Something like: #ifdef USE_UTF8_EMBEDDED // your code here #else // libunistring function call #endif Then it can be pushed without any problem to the master branch, allowing me (and others) to test it easily. After that we can re-evaluate to remove libunistring completely or maybe just make the embedded version default. Cheers, Daniel On 05/02/14 07:45, Timo Teras wrote: On Tue, 04 Feb 2014 21:34:35 +0100 Daniel-Constantin Mierla <miconda@gmail.com <mailto:miconda@gmail.com>> wrote: Was there any resolution on this topic? I would like to get rid of the unnecessary dependency, the code looked fine at a quick check -- if it is just about utf8 encoding/decoding. Eventually it can be made a compile time switch with defines for both options, keep the code for both cases and be able to easily switch from one to another. I think the patch was not 'blessed' yet. Peter asked for testing results along the lines of: Something as simple as a loop through all possible values calling your function, the libunistring function, and comparing the results would be perfect. The calling the function with every possible utf-8 string is impossible test plan. Dunno. Perhaps you want to push the patch with dictator hat on, or some sane test plan can be made. For now, I just applied the patch to my local builds and forgot this. - Timo _______________________________________________ sr-dev mailing list sr-dev@lists.sip-router.org <mailto:sr-dev@lists.sip-router.org> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev -- Daniel-Constantin Mierla - http://www.asipto.com http://twitter.com/#!/miconda <http://twitter.com/#%21/miconda> - http://www.linkedin.com/in/miconda
-- Peter Dunkley Technical Director Crocodile RCS Ltd