1.

Why is it said that the length() method of the String class doesn't return accurate results?

Answer»
  • The length() method returns the number of UTF-16 code units in the string, not the number of characters. To see where the confusion comes from, we need to look at what a code unit is.
  • Java uses UTF-16 to represent Strings. To understand this encoding, we need two Unicode terms:
    • Code Point: An integer that identifies a character in the Unicode code space.
    • Code Unit: A fixed-size bit sequence (16 bits in UTF-16) used to encode code points. A single code point may need one or more code units.
  • The Unicode code space is divided logically into 17 planes, the first of which is the Basic Multilingual Plane (BMP). The BMP covers the most commonly used characters, U+0000 to U+FFFF. The remaining characters, U+10000 to U+10FFFF, are called supplementary characters because they sit in the other 16 planes. Under UTF-16:
    • Code points from the first plane are encoded using one 16-bit code unit.
    • Code points from the remaining planes are encoded using two 16-bit code units (a surrogate pair).
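The one-unit versus two-unit rule can be checked with the standard Character class. The following is only a minimal sketch: the class name CodeUnitDemo, the helper codeUnitsFor, and the two sample code points (U+0041 and U+1F600) are chosen here for illustration and are not part of the original answer.

```java
public class CodeUnitDemo {

    // Returns how many UTF-16 code units (Java chars) one code point needs.
    public static int codeUnitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int latinA = 0x0041;    // U+0041, inside the BMP
        int emoji = 0x1F600;    // U+1F600, a supplementary character

        System.out.println(codeUnitsFor(latinA)); // prints 1
        System.out.println(codeUnitsFor(emoji));  // prints 2

        // A supplementary code point is stored as two chars (a surrogate pair).
        char[] units = Character.toChars(emoji);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // prints D83D DE00
    }
}
```

Character.isSupplementaryCodePoint(int) can be used in the same way to test whether a code point lies outside the BMP.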

Now, if a string contains a supplementary character, length() counts that character as 2 code units, so the result is larger than the number of characters a reader would expect.

In other words, a string holding a single supplementary character has a length of two. This agrees with the Java documentation, which defines length() as the number of Unicode code units in the string, but it does not agree with the number of characters actually present, and that is the inaccuracy.
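The mismatch can be seen by comparing length() with codePointCount(), which counts code points rather than code units. This is a small sketch under assumed names: the class LengthDemo, the helper codePoints, and the sample string are illustrative only.

```java
public class LengthDemo {

    // Counts Unicode code points instead of UTF-16 code units.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, built from its code point to avoid source-encoding issues.
        String s = "A" + new String(Character.toChars(0x1F600));

        System.out.println(s.length());    // prints 3 (1 unit for 'A' + 2 for the emoji)
        System.out.println(codePoints(s)); // prints 2
    }
}
```

When the number of code points is what matters, codePointCount() (or s.codePoints().count()) is the safer choice; length() remains the right call when indexing into the underlying char sequence.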


