Why is it said that the length() method of String class doesn't return accurate results?

Answer»

The length() method returns the number of UTF-16 code units (char values) in the string, not the number of characters a reader would count. Let's look at what code units are and where the confusion comes from.
We know that Java uses UTF-16 for String representation. To follow the argument, we need to understand two Unicode terms:
- Code Point: An integer that identifies a character in the Unicode code space.
- Code Unit: A fixed-size bit sequence used to encode code points (16 bits in UTF-16). One or more code units may be needed to represent a single code point.
The Unicode code space is divided logically into 17 planes, and the first plane is called the Basic Multilingual Plane (BMP). The BMP covers the code points U+0000 to U+FFFF, which include most commonly used characters. The remaining code points, U+10000 to U+10FFFF, lie in the other 16 planes and are called supplementary characters. Under the UTF-16 scheme:
- Code points from the first plane are encoded using one 16-bit code unit.
- Code points from the remaining planes are encoded using two 16-bit code units (a surrogate pair).
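
The two encoding rules above can be checked with a minimal sketch using the standard Character class. The class name Utf16Units and the helper methods codeUnitsFor and planeOf are hypothetical names chosen for illustration; the sample code points (U+0041 and U+1F600) are assumptions as well, not taken from the original answer.

```java
class Utf16Units {
    // Number of UTF-16 code units (Java char values) needed for one code point.
    static int codeUnitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Each plane holds 0x10000 code points, so the plane index is the value above the low 16 bits.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int bmp = 0x0041;            // LATIN CAPITAL LETTER A, in the BMP
        int supplementary = 0x1F600; // GRINNING FACE, outside the BMP

        // Prints "plane 0, units 1"
        System.out.println("plane " + planeOf(bmp) + ", units " + codeUnitsFor(bmp));
        // Prints "plane 1, units 2"
        System.out.println("plane " + planeOf(supplementary) + ", units " + codeUnitsFor(supplementary));

        // The two code units of the supplementary character form a surrogate pair.
        char[] units = Character.toChars(supplementary);
        // Prints "D83D DE00"
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
    }
}
```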

Now, if a string contains supplementary characters, length() counts each of them as 2 code units, so the result is larger than the number of characters the caller expects.

In other words, a string holding 1 supplementary character has a length of two, even though it contains a single character. Notice the inaccuracy here? This is the behaviour described in the Java documentation, so it is expected, but it is inaccurate if what you want is the number of characters.
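
The mismatch can be seen in a short sketch that compares length() with String.codePointCount, which counts code points instead of code units. The class name LengthVsCodePoints, the helper codePoints, and the sample strings are hypothetical and only for illustration.

```java
class LengthVsCodePoints {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String plain = "Java";
        // "Hi" followed by U+1F600, written as its surrogate pair.
        String withSupplementary = "Hi\uD83D\uDE00";

        // Prints "4 4": every character is in the BMP, so both counts agree.
        System.out.println(plain.length() + " " + codePoints(plain));
        // Prints "4 3": the supplementary character takes two code units.
        System.out.println(withSupplementary.length() + " " + codePoints(withSupplementary));
    }
}
```

When the number of characters matters more than the storage size, counting code points this way is one option. Note that a code point count can still differ from what a user perceives as one character, for example with combining marks.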
