NULL vs Empty Strings – Why Oracle Was Right and Apple Is Not

Try it yourself: Google “how to check for empty string.” I got about 16 million results. How can something that seems so straightforward lead to so many questions and probably even more answers?

The solution differs by language, but most answers will look similar to this:

if string != null and string.length() > 0 {
    ...
}

Why can we not simply say this instead?

if string is filled {
    ...
}

(pseudocode, forget about the exact syntax)
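
For readers who want something runnable, here is a minimal Python sketch of both forms. The function names `is_filled_verbose` and `is_filled` are made up for illustration; Python happens to treat both None and "" as falsy, which is what makes the short form possible in this particular language.

```python
from typing import Optional


def is_filled_verbose(s: Optional[str]) -> bool:
    # The "two checks" form: rule out None first, then the zero-length string.
    return s is not None and len(s) > 0


def is_filled(s: Optional[str]) -> bool:
    # Python treats both None and "" as falsy, so a single truth test
    # covers both representations of "empty".
    return bool(s)
```

Both helpers give the same answer for every input; the second one just hides the fact that two representations exist.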

If you think the first form is good enough, just pick one of the answers for your favorite language and stick with it. If you’re interested in the future of programming languages, as I am, read on.

Hopefully, your language of choice already supports some sort of shortcut to check for an empty string. But it probably still has two distinct representations for an empty string, which can confuse all of us from time to time:

  • the value null (maybe written as nil, none, etc.)
  • a string of length zero

Strings are one of the most common data structures we pass around in our programs, so you would expect programming languages to have solved this “” versus null issue by now. Unfortunately, even newcomer Swift, although quite articulate about optional versus mandatory values, doesn’t provide a solution in this regard. Swift gives us optional values, but even if you declare a Swift string as optional, you still have to deal with two representations of an empty string: “” or nil. So you still have to check for both. Even worse, a string declared as mandatory can still be empty!

If you think the distinction between a string of length zero and a null value has any significant meaning, please tweet or email me or write your real-world example below. Things are either present or they are not. A user will either enter a value in a given field or keep it empty. A function returns a result or it doesn’t. A value in a database is either there or not. There’s typically nothing in between.

Time Waster

We can waste a lot of time on this issue:

  • You’re never 100% sure what it is wise to return from a function in the case of no result.
  • You always have to check for both situations, just to be sure.
  • You can forget one of the two checks, resulting in unnecessary bugs.
  • The extra code obscures the pure intention of an empty or not-empty check.
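
The "forgotten check" bug from the list above might look like the following Python sketch. The greeting functions are hypothetical and exist only to show how a zero-length string slips past a null-only check.

```python
from typing import Optional


def greeting_buggy(name: Optional[str]) -> str:
    # Only the None case is handled; "" slips through.
    if name is None:
        return "Hello, stranger!"
    return "Hello, " + name + "!"


def greeting(name: Optional[str]) -> str:
    # Both representations of "no name" are handled the same way.
    if not name:
        return "Hello, stranger!"
    return "Hello, " + name + "!"
```

With an empty name, the buggy version produces a greeting with nothing between the comma and the exclamation mark, while the second version falls back to the default for both None and "".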

Not convinced? Check out: https://www.rosettacode.org/wiki/Empty_string

History

Let’s see how we got into this mess, because there is some history behind it.

In older programming languages like COBOL, strings were typically fixed-size. Coming from a world of punch cards and tapes, that made a lot of sense. However, because variable-size strings are more flexible, they eventually became the standard. Strings now had to be allocated on the heap, because variable-size values are inconvenient on the stack and impossible inside fixed record structures. From then on, an empty string could be represented in two ways: by not allocating anything on the heap (null), or by assigning zero characters (“”) to the string.

Object-oriented programming languages eventually turned these allocated strings into objects, giving them methods to handle and manipulate their content. Compare that to other value types like integers, floats, and booleans, which were typically kept as native as possible, because making them into heap-allocated objects would mean a significant performance penalty.

Believe me, it is not intentional that we are stuck now with two representations for empty strings. It was just born out of technical necessity. That may be fine in a language like C. But hey, it is 2017 now – we can do better. Remember the trouble with ending strings using a ‘\0’ character. Exactly!

Why Strings are Values

In the theory of entities, attributes and relationships (EAR-modeling) entities and relationships are there to register facts. Values (assigned to attributes) play an entirely different role. While we can refer to entities using something like a record-ID, values reference from the model back to the outside world. To take some examples, a name can be a reference to a person, a date value refers to an abstract value on a calendar, and numbers refer to a virtual number scale. Given this definition, strings are also values (1). And therefore it makes sense to expect so-called value semantics for strings too.

Of course, a string is a sequence of characters. And, you could argue, that makes it totally different from a number. But remember that numbers are also sequences of symbols (digits in this case). The only difference is that we can fit most numbers within 32 or 64 bits. For that reason we don’t need to allocate them in separate memory structures like we do with strings. But that is not an adequate reason to say that strings are not values.

It’s interesting to compare the theoretical definition of values with the typical implementation of strings and integers in many popular programming languages.

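| Property                     | Values/attributes (theory) | Integers                | Strings                                                        |
|------------------------------|----------------------------|-------------------------|----------------------------------------------------------------|
| Sharable                     | No                         | No                      | Yes, although we try to prevent this                           |
| Support for optionality      | Yes                        | No, in many languages   | Yes, but typically in two ways (“” and null)                   |
| Assignment behavior          | Value itself is copied     | Value itself is copied  | Depending on the language                                      |
| Comparison behavior (equals) | Values are compared        | Values are compared     | Depending on the language and operator (equals(), ==, ===)     |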

The last column says it all. Strings being wrapped as objects may have benefits, but it gives them “entity” behavior. And languages have been struggling with this for decades now. Hence the introduction of band-aid features like const, final, immutability, etc.
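
To make the difference between “entity” behavior and value semantics concrete, here is a small Python sketch. Both classes are hypothetical wrappers invented for this illustration; they are not how any particular language implements its strings.

```python
from dataclasses import dataclass


class EntityText:
    # No __eq__ defined: Python falls back to identity comparison,
    # so two instances with the same content are not equal.
    def __init__(self, text: str) -> None:
        self.text = text


@dataclass(frozen=True)
class ValueText:
    # A frozen dataclass generates __eq__ from its fields and blocks
    # reassignment of them, which approximates value semantics.
    text: str
```

Two `EntityText("abc")` instances compare as different objects, while two `ValueText("abc")` instances compare as equal. The `frozen=True` flag plays the role of the band-aid features mentioned above: it is bolted on to get value-like behavior from something that is an object underneath.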

The fact that we have two ways of representing empty strings is even weirder when you consider that in some languages intrinsic values such as integers, floats, etc., cannot represent an empty value at all. That’s why a lot of code uses the value 0 or -1 to represent an empty integer field, which, again, can lead to bugs because both could actually be valid values.
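
The sentinel problem can be sketched in a few lines of Python. The two parser functions are hypothetical; they assume the input field is either blank or a valid integer.

```python
from typing import Optional


def parse_count_sentinel(field: str) -> int:
    # Uses 0 to mean "field left empty", which collides with a real 0.
    return int(field) if field.strip() else 0


def parse_count(field: str) -> Optional[int]:
    # Uses None for "field left empty", so 0 stays a genuine value.
    return int(field) if field.strip() else None
```

With the sentinel version, a blank field and a field containing "0" become indistinguishable after parsing. The optional version keeps them apart without inventing a magic number.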

The Solution

We do not have to look far for solutions. Just two examples (sorry if I forgot your language of choice):

  • The Oracle VARCHAR2 type automatically converts every instance of “” into NULL. This guarantees that there is only one way an empty string can be represented.
  • String classes in C++ typically cannot be NULL because their instances are not pointers themselves.
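
The Oracle approach can be imitated at the boundary of an application. The following Python sketch is only an approximation of the idea, not of how the database does it; `normalize` is a hypothetical helper, not part of any library.

```python
from typing import Optional


def normalize(s: Optional[str]) -> Optional[str]:
    # Collapse the zero-length string into None so that "empty"
    # has exactly one representation past this point.
    return s if s else None
```

If every incoming string passes through such a function, downstream code only ever needs a single `is None` check.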

I can also imagine a more conceptual approach toward strings and values in general. We could forbid constants like “” because they have no meaning. Isn’t that the same as not being able to write an integer constant with zero digits?

If you’re developing a programming language, consider what fits your language style. But my main takeaways would be:

  • Understand that strings are values just like integers, dates, etc.
  • Go for value semantics, even if you would like strings to have object-like methods.
  • Please, please, prevent the situation of having two different values for empty strings.
  • Support optionality for every type of value (I would like to credit Swift for that).

The bottom line is to avoid too many technical considerations when designing a language. Let the compiler do the hard work.

References

1) Very long strings like the text of a document should maybe not be regarded as “values.” But that’s a different subject.

Credits

R. Sato (@raysato), thanks for allowing me to use your 0 vs null photo. Some pictures are really worth a thousand words.

9 thoughts on “NULL vs Empty Strings – Why Oracle Was Right and Apple Is Not”


  1. I understand your arguments and conclusion, but they are based on the assumption that “” and null always represent the same state. With this I disagree.

    A real-world example is lazy loading/caching of complex, slow-to-calculate strings.

    1. Many languages never had this distinction. I don’t think that was ever regarded as a deficiency.

      It does, of course, depend on what your goal is. If I were to develop a new general-purpose language, I would make a strict distinction between values and reference-type objects, because that is a more uniform model for all values (strings or otherwise). That implies getting rid of nil versus “”.

      But of course, it’s like with pointers. I think it is good we got rid of low-level pointers like in C++. But C++ is still a good candidate for some software development.

  2. There is a semantic difference.
    Consider the variable myHobbies:

    nil → no hobbies were written into that variable (this does not mean I have no hobbies)

    “” → I explicitly state that I have no hobbies.

    The difference boils down to me explicitly stating that something is nonexistent or empty vs. a random variable sitting around in the code somewhere without an assigned value.

    1. In the thirty years that I have been making software, I have never come across such a functional requirement. But if this distinction were indeed required, I would probably add a separate boolean so as not to confuse my fellow developers.

  3. Adding a boolean in this situation is like a snake biting its own tail; it is only a null check in disguise. In your blog post, your opening argument is to prevent code like:

    if string != null and string.length() > 0 {

    }

    So you would now be replacing it by:

    if isStringValueAvailable and string.length() > 0 {

    }

    Where is the improvement in this? It actually adds redundant data by adding another variable.

    A better way would be to allow the variable to be null and declare it with a @Nullable annotation.
    This way your colleague would also be able to see it. If he does not see it, then hopefully his development tools will. IntelliJ IDEA, for example, evaluates these annotations and will slap your wrist the moment you access the variable without a check. In many cases it also deduces these situations from unannotated code.

    This would not be possible if you add a boolean. Also, there is no guarantee that your colleague will notice the boolean, especially with auto-completion in place, thus causing a potential NPE (hopefully during development).

    I would rather trust the tooling to identify these situations.

    1. Thanks for your response.

      We might have a few misunderstandings here.

      I agree with your vision on @Nullable and use of tools given the programming languages and tools we have now. My article, however, is about future programming languages.

      The boolean was only in response to your specific example concerning caching. What you write is not what I propose in general. But it makes me curious what you would do in case you have to cache a number (let’s say an integer) that can also be empty.

      Let me try again. This is my premise:
      1. Strings are values just like integers, dates, etc.
      2. It’s very common that values need to be optional.
      3. It’s not very common that you have to distinguish between two types of a value being empty.
      4. I don’t know of any language in which you can have this distinction for numbers. So you can ask yourself why strings are so special.
      5. Some, especially older, languages (like COBOL, BASIC, C++ string classes) do not even have the distinction and don’t see that as a deficiency.

      If you still disagree, I’m curious what point you disagree with.

      1. 1. Strings are values just like integers, dates, etc.

        I think this depends on the level of abstraction that you compare them on.
        Academically they could be considered values.

        Technically strings are complex types usually represented by an array of chars in most languages.

        On the hardware level, strings are most often a sequence of bytes, either null-terminated or length-prefixed. This is simply a matter of specification. Typically, CPUs do not dictate the structure, format, or encoding of strings.

        Therefore strings are not a single value, but a sequence of values.

        4. I don’t know of any language in which you can have this distinction for numbers. So you can ask yourself why strings are so special.

        In Java, for every primitive type there also exists a wrapper class.
        boolean -> Boolean
        char -> Character
        long -> Long
        int -> Integer

        Object references to these elements can of course be null again.

        BTW: Merry Xmas

  4. @Hendrick Kay

    1. True. And I prefer to look at it from a conceptual viewpoint for multiple reasons.

    4. Still different because it is not possible to make an instance of Integer with zero digits, while strings can have zero characters apart from being null.

    1. 4. Still different because it is not possible to make an instance of Integer with zero digits, while strings can have zero characters apart from being null.

      I would say that strings and integers should not be compared, as they are generally very different. But if you want to go this way, then I would respond that for integers the technical equivalent of an empty string is zero. An empty string does not alter an existing string on concatenation, just like a zero does not alter the value of an existing integer on addition.
