This blog post summarizes, with bite-size bullet points, some knowledge on Haskell string types that I think is important to recognize when writing Haskell. The intended audience include inexperienced Haskell programmers, and experienced Haskell programmers who feel like refreshing their memory. The bullet list hopefully makes things reasonably easy to digest, and helps you eat Haskell string types for breakfast.
Haskell has five commonly used string types: String, strict and lazy Text, and strict and lazy ByteString1. Some recommand
against calling ByteString a string type, The following is
roughly in the order of String, Text, ByteString.
-
Stringis, in some sense, the āofficialā string type, since it is the only string type inbase(and of course, the only one inPrelude). -
Stringis unfortunately highly inefficient, because it is quite simply a linked list (specifically, a cons-list) of (boxed)Chars, each of which represents a unicode code point. This means eachCharin aStringis in a cons-cell with a(:)tag, a pointer to theCharon the heap, and a pointer to the next cons-cell.How many bytes a
Stringuses per character on a 64-bit machine is left as an exercise, but it is certainly a lot more than what is theoretically required to encode a unicode code point: between 1 and 4 bytes. It also suffers from poor locality for obvious reasons. -
Stringis convenient when dealing with certain parts ofbase, for instanceshowanddisplayException, as well as the filepath and directory libraries, but there are few other legitimate use cases ofString. For manybasefunctions that work onStrings, such asreverse,takeWhileandisInfixOf, there are corresponding versions inData.Text. -
Text, likeString, also represents a sequence of unicode characters, but is much more tightly packed. Thus it can often be a drop-in replacement forString. - Strict
Textis a (strict) array of bytes. LazyTextis a (lazy) cons-list of strictTexts, where each strictTextis called a chunk, and the lazy cons-list is also known as the spine. Internally, strictTextis UTF-16 encoded, so each character occupies either 2 or 4 bytes (plus some constant overhead for eachText).Textuses UTF-16 rather than the more popular UTF-8, because although UTF-8 is more space-efficient for ASCII characters, its arithmetic is relatively more complex, and converting bytes to/fromChars is more expensive. Furthermore, the space saving of UTF-8 is insignificant for smallTextvalues. See Text/UTF-8: Aftermath for more details.- The specific encoding
Textuses internally, whether UTF-8 or UTF-16, does not usually affect the way most users use the library. Unless you are using stuff in theInternalmodules, the unit you deal with isChar. - If you prefer a
Texttype that internally uses UTF-8 (for example, if you have a lot of interaction with UTF-8 encoded data), consider text-short or core-text.
-
If you are familiar with Java, it might be helpful to think of lazy
TextasLinkedList<ArrayList<Byte>>(except that theLinkedListis a lazy cons-list), where eachArrayList<Byte>is a chunk, and theLinkedListis the spine. This gets the benefits of bothLinkedListandArrayList: it supports efficient O(1) concatenation2 becauseLinkedListconcatenation is O(1); it also is more space efficient and has better locality than a flatLinkedList<Byte>, becauseArrayListis stored contiguously in memory and thereās no pointer from a node to the next.-
Some have argued that it is better to use a strict list/rope of chunks for lazy
Text, which would be more comparable to JavaāsLinkedList<ArrayList<Byte>>. The advantage is that this can potentially unify strict and lazyTextinto a single type, because it is both strict, and efficient in concatenation (and similar operations) like lazyTextis. The disadvantage, however, is that there doesnāt exist (at least not yet, as far as I know) an efficient implementation ofLinkedList(or any data structure) in Haskell whose performance is competitive in terms of constant factors to raw byte arrays (which is what backs strictText), and it may well be impossible.Besides, there are legitimate use cases, though perhaps rare, that require the
Textto be lazy, for instance when you want to perform lazy IO without bothering to use conduit or pipes (you most likely should, though).
-
-
Many operations on
Textare subject to fusion. The goal of fusion is to avoid materializing intermediate results when performing a chain of operations. The core idea of fusion is similar to improving list concatenation using difference lists, and improving a chain offmapusing Coyoneda, and can be summarized as: delay, delay, delay. Delaying the construction of a data structure is key for avoiding materializing the intermediate results. One way to delay an operation on a data structure is to turn the data structure into an arrow (e.g., function or Kleisli arrow), turn the operation into composition (e.g., function composition or Kleisli composition), and finally, apply the arrow to some argument to get the result only when needed.As an example, there are at least two ways to turn a Haskell list into functions, and they are related in an interesting way:
- Representing the list inductively (a.k.a. build/foldr fusion system):
newtype List a = List { build :: forall b. (a -> b -> b) -> b -> b }This represents a list in terms of fold. It is equivalent to
Mu (ListF a)whereMu fis the least fixed point off, andListF ais the base functor of[a], i.e.,data ListF a b = Cons a b | Nil - Representing the list coinductively (a.k.a. stream fusion):
data List a = forall s. List (s -> Step a s) s data Step a s = Yield a s | DoneThis represents a list in terms of unfold. It is equivalent to
Nu (ListF a)whereNu fis the greatest fixed point off.
So these two approaches are dual of each other. In both cases, a list is represented in a ādelayedā form, and operations such as
mapsimply āremembersā the function to be mapped, rather than actually performing the mapping and generating a new list. The pros and cons of the two approaches is out of the scope of this blog post, but this paper has more details if you are interested.In the case of
Text, it employs stream fusion. Each eligible operationopis turned intofromStream . opOnStream . toStream, whereopOnStreamis the streaming version ofop, and a rewrite rule is used to canceltoStream . fromStreaminto the identity function. - Representing the list inductively (a.k.a. build/foldr fusion system):
-
ByteString, unlikeStringandText, has little to do withCharor unicode. It is just a sequence of bytes. To convert aByteStringto/fromText, you need to specify which encoding you intend to use, and then use the conversion functions inData.Text.Encoding. - Like
Text, strictByteStringis backed by a byte array, and lazyByteStringis backed by a cons-list of chunks, each of which is a strictByteString. One difference is that strictByteStringuses aForeignPtr Word8as the underlying data structure. This makes it directly usable in FFI because it has the same memory layout as anunsigned char *in C (note that acharin C is a byte, not a unicode code point).- To pass a
ByteStringto a C function, you just need to do two things: (1) make a copy of it (because otherwise the C function may mutate it), and (2) append a null (0x00) byte, since C strings are null-terminated. This is exactly whatData.ByteString.useAsCStringdoes.
- To pass a
-
Functions in
Data.ByteString.Char8behave as if each byte is aChar, and thatās why you donāt need to (and donāt get to) specify an encoding to use any function in this module. It works well if yourByteStringcontains ASCII characters only, but probably not otherwise; for example,Char8.unpack (Char8.pack "ššš") /= "ššš". Use theChar8module if you are certain that you are dealing with ASCII characters, or if otherwise you know what you are doing. - When using
Data.Text.IO.hGetLineandData.Text.IO.hGetContents, make sure theHandlehas the correct encoding. Youād otherwise get an āinvalid argument (invalid byte sequence)ā error. Since these methods rely on using the correct encoding, it is often wise to use their counterparts inData.ByteString.ByteString, having no assumption on the encoding, is more suitable for exchanging information across machines. After reading theByteStrings in, you can then decode them using the appropriate decoders and handle errors properly.- This is probably why Aesonās
encodeanddecodefunctions work withByteStringrather thanText, even though JSON is supposed to be textual and human readable. JSON is often exchanged over the network, and since data are sent or received over the network as sequences of bytes, youād needByteString. Furthermore, although the overwhelming majority of JSON texts are UTF-8 encoded, this is not a guarantee, so usingTextprematurely can be risky. - That being said, the Aeson library does provide
Text-based encoder and decoder in theData.Aeson.Textmodule. - On a side node, Aesonās
encodefunction returns a lazyByteString, because the result is assembled piece by piece, and lazyByteString, like lazyText, has better asymptotics for operations like concatenation.
- This is probably why Aesonās
-
Data.ByteString.getLineandData.ByteString.hGetLineboth read until EOF or a newline byte, i.e., 0x0A. Beware if your data is not UTF-8 encoded. For example, in UTF-16BE and UTF-16LE, the newline character has two bytes, and is encoded as 0x00 0x0A and 0x0A 0x00, respectively. Arguably, these two functions should only be exposed from theChar8module, because a newline is a character. -
If you have a function that takes a lazy
Textas input, consider taking instead a stream that producesText(e.g.,ConduitT i Text m ()in conduit, orProducer' Text m ()in pipes). Because you can easily convert a lazyTextinto a stream, but the converse is not true because the stream may be effectful. The same goes forByteString. -
ByteStringused to employ stream fusion likeTextdoes, but currently fusion is not used inByteString, whether strict or lazy. Hereās the authorās explanation of why fusion is no longer used. - The bytestring package also has a
ShortByteStringtype. It can be a good option if you need to keep a large number of smallByteStrings around.
Further Reading
- Hereās a few other blog posts on Haskell string types: String Types by FP Complete, Untangling Haskellās Strings by Monday Morning Haskell, and Haskell String Types by Alexey Shmalko.
- To learn more about fixed points, recursion schemes,
Mu,Nuand what not, check out the recursion-schemes and the yaya library, and my previous blog post on this topic. - To learn more about
ShortByteString, check out Mark Karpovās blog post, Short ByteString and Text.
-
I should remark that
ByteStringis technically not a string type, because a string is a sequence of characters, andByteStringis a sequence of bytes - it has no notion of characters. So āByteStringā is perhaps not the most fitting name. It is, however, appropriate to discussByteStringtogether withStringandText, since the correct choice among these types is often a point of confusion.Ā ↩ -
Concatenation of two lazy
Texts is not actually O(1) because Haskellās cons-list doesnāt support O(1) concatenation.It is O(c1) where c1 is the number of chunks in the firstText. It is still a much better asymptotic than concatenating two strictTexts, which is O(n1 + n2) where n1 and n2 are the lengths of the twoTexts.Ā ↩