This blog post summarizes, with bite-size bullet points, some knowledge on Haskell string types that I think is important to recognize when writing Haskell. The intended audience include inexperienced Haskell programmers, and experienced Haskell programmers who feel like refreshing their memory. The bullet list hopefully makes things reasonably easy to digest, and helps you eat Haskell string types for breakfast.
Haskell has five commonly used string types: String
, strict and lazy Text
, and strict and lazy ByteString
1. Some recommand
against calling ByteString
a string type, The following is
roughly in the order of String
, Text
, ByteString
.
-
String
is, in some sense, the âofficialâ string type, since it is the only string type inbase
(and of course, the only one inPrelude
). -
String
is unfortunately highly inefficient, because it is quite simply a linked list (specifically, a cons-list) of (boxed)Char
s, each of which represents a unicode code point. This means eachChar
in aString
is in a cons-cell with a(:)
tag, a pointer to theChar
on the heap, and a pointer to the next cons-cell.How many bytes a
String
uses per character on a 64-bit machine is left as an exercise, but it is certainly a lot more than what is theoretically required to encode a unicode code point: between 1 and 4 bytes. It also suffers from poor locality for obvious reasons. -
String
is convenient when dealing with certain parts ofbase
, for instanceshow
anddisplayException
, as well as the filepath and directory libraries, but there are few other legitimate use cases ofString
. For manybase
functions that work onString
s, such asreverse
,takeWhile
andisInfixOf
, there are corresponding versions inData.Text
. -
Text
, likeString
, also represents a sequence of unicode characters, but is much more tightly packed. Thus it can often be a drop-in replacement forString
. - Strict
Text
is a (strict) array of bytes. LazyText
is a (lazy) cons-list of strictText
s, where each strictText
is called a chunk, and the lazy cons-list is also known as the spine. Internally, strictText
is UTF-16 encoded, so each character occupies either 2 or 4 bytes (plus some constant overhead for eachText
).Text
uses UTF-16 rather than the more popular UTF-8, because although UTF-8 is more space-efficient for ASCII characters, its arithmetic is relatively more complex, and converting bytes to/fromChar
s is more expensive. Furthermore, the space saving of UTF-8 is insignificant for smallText
values. See Text/UTF-8: Aftermath for more details.- The specific encoding
Text
uses internally, whether UTF-8 or UTF-16, does not usually affect the way most users use the library. Unless you are using stuff in theInternal
modules, the unit you deal with isChar
. - If you prefer a
Text
type that internally uses UTF-8 (for example, if you have a lot of interaction with UTF-8 encoded data), consider text-short or core-text.
-
If you are familiar with Java, it might be helpful to think of lazy
Text
asLinkedList<ArrayList<Byte>>
(except that theLinkedList
is a lazy cons-list), where eachArrayList<Byte>
is a chunk, and theLinkedList
is the spine. This gets the benefits of bothLinkedList
andArrayList
: it supports efficient O(1) concatenation2 becauseLinkedList
concatenation is O(1); it also is more space efficient and has better locality than a flatLinkedList<Byte>
, becauseArrayList
is stored contiguously in memory and thereâs no pointer from a node to the next.-
Some have argued that it is better to use a strict list/rope of chunks for lazy
Text
, which would be more comparable to JavaâsLinkedList<ArrayList<Byte>>
. The advantage is that this can potentially unify strict and lazyText
into a single type, because it is both strict, and efficient in concatenation (and similar operations) like lazyText
is. The disadvantage, however, is that there doesnât exist (at least not yet, as far as I know) an efficient implementation ofLinkedList
(or any data structure) in Haskell whose performance is competitive in terms of constant factors to raw byte arrays (which is what backs strictText
), and it may well be impossible.Besides, there are legitimate use cases, though perhaps rare, that require the
Text
to be lazy, for instance when you want to perform lazy IO without bothering to use conduit or pipes (you most likely should, though).
-
-
Many operations on
Text
are subject to fusion. The goal of fusion is to avoid materializing intermediate results when performing a chain of operations. The core idea of fusion is similar to improving list concatenation using difference lists, and improving a chain offmap
using Coyoneda, and can be summarized as: delay, delay, delay. Delaying the construction of a data structure is key for avoiding materializing the intermediate results. One way to delay an operation on a data structure is to turn the data structure into an arrow (e.g., function or Kleisli arrow), turn the operation into composition (e.g., function composition or Kleisli composition), and finally, apply the arrow to some argument to get the result only when needed.As an example, there are at least two ways to turn a Haskell list into functions, and they are related in an interesting way:
- Representing the list inductively (a.k.a. build/foldr fusion system):
newtype List a = List { build :: forall b. (a -> b -> b) -> b -> b }
This represents a list in terms of fold. It is equivalent to
Mu (ListF a)
whereMu f
is the least fixed point off
, andListF a
is the base functor of[a]
, i.e.,data ListF a b = Cons a b | Nil
- Representing the list coinductively (a.k.a. stream fusion):
data List a = forall s. List (s -> Step a s) s data Step a s = Yield a s | Done
This represents a list in terms of unfold. It is equivalent to
Nu (ListF a)
whereNu f
is the greatest fixed point off
.
So these two approaches are dual of each other. In both cases, a list is represented in a âdelayedâ form, and operations such as
map
simply âremembersâ the function to be mapped, rather than actually performing the mapping and generating a new list. The pros and cons of the two approaches is out of the scope of this blog post, but this paper has more details if you are interested.In the case of
Text
, it employs stream fusion. Each eligible operationop
is turned intofromStream . opOnStream . toStream
, whereopOnStream
is the streaming version ofop
, and a rewrite rule is used to canceltoStream . fromStream
into the identity function. - Representing the list inductively (a.k.a. build/foldr fusion system):
-
ByteString
, unlikeString
andText
, has little to do withChar
or unicode. It is just a sequence of bytes. To convert aByteString
to/fromText
, you need to specify which encoding you intend to use, and then use the conversion functions inData.Text.Encoding
. - Like
Text
, strictByteString
is backed by a byte array, and lazyByteString
is backed by a cons-list of chunks, each of which is a strictByteString
. One difference is that strictByteString
uses aForeignPtr Word8
as the underlying data structure. This makes it directly usable in FFI because it has the same memory layout as anunsigned char *
in C (note that achar
in C is a byte, not a unicode code point).- To pass a
ByteString
to a C function, you just need to do two things: (1) make a copy of it (because otherwise the C function may mutate it), and (2) append a null (0x00) byte, since C strings are null-terminated. This is exactly whatData.ByteString.useAsCString
does.
- To pass a
-
Functions in
Data.ByteString.Char8
behave as if each byte is aChar
, and thatâs why you donât need to (and donât get to) specify an encoding to use any function in this module. It works well if yourByteString
contains ASCII characters only, but probably not otherwise; for example,Char8.unpack (Char8.pack "đđđ") /= "đđđ"
. Use theChar8
module if you are certain that you are dealing with ASCII characters, or if otherwise you know what you are doing. - When using
Data.Text.IO.hGetLine
andData.Text.IO.hGetContents
, make sure theHandle
has the correct encoding. Youâd otherwise get an âinvalid argument (invalid byte sequence)â error. Since these methods rely on using the correct encoding, it is often wise to use their counterparts inData.ByteString
.ByteString
, having no assumption on the encoding, is more suitable for exchanging information across machines. After reading theByteString
s in, you can then decode them using the appropriate decoders and handle errors properly.- This is probably why Aesonâs
encode
anddecode
functions work withByteString
rather thanText
, even though JSON is supposed to be textual and human readable. JSON is often exchanged over the network, and since data are sent or received over the network as sequences of bytes, youâd needByteString
. Furthermore, although the overwhelming majority of JSON texts are UTF-8 encoded, this is not a guarantee, so usingText
prematurely can be risky. - That being said, the Aeson library does provide
Text
-based encoder and decoder in theData.Aeson.Text
module. - On a side node, Aesonâs
encode
function returns a lazyByteString
, because the result is assembled piece by piece, and lazyByteString
, like lazyText
, has better asymptotics for operations like concatenation.
- This is probably why Aesonâs
-
Data.ByteString.getLine
andData.ByteString.hGetLine
both read until EOF or a newline byte, i.e., 0x0A. Beware if your data is not UTF-8 encoded. For example, in UTF-16BE and UTF-16LE, the newline character has two bytes, and is encoded as 0x00 0x0A and 0x0A 0x00, respectively. Arguably, these two functions should only be exposed from theChar8
module, because a newline is a character. -
If you have a function that takes a lazy
Text
as input, consider taking instead a stream that producesText
(e.g.,ConduitT i Text m ()
in conduit, orProducer' Text m ()
in pipes). Because you can easily convert a lazyText
into a stream, but the converse is not true because the stream may be effectful. The same goes forByteString
. -
ByteString
used to employ stream fusion likeText
does, but currently fusion is not used inByteString
, whether strict or lazy. Hereâs the authorâs explanation of why fusion is no longer used. - The bytestring package also has a
ShortByteString
type. It can be a good option if you need to keep a large number of smallByteString
s around.
Further Reading
- Hereâs a few other blog posts on Haskell string types: String Types by FP Complete, Untangling Haskellâs Strings by Monday Morning Haskell, and Haskell String Types by Alexey Shmalko.
- To learn more about fixed points, recursion schemes,
Mu
,Nu
and what not, check out the recursion-schemes and the yaya library, and my previous blog post on this topic. - To learn more about
ShortByteString
, check out Mark Karpovâs blog post, Short ByteString and Text.
-
I should remark that
ByteString
is technically not a string type, because a string is a sequence of characters, andByteString
is a sequence of bytes - it has no notion of characters. So âByteStringâ is perhaps not the most fitting name. It is, however, appropriate to discussByteString
together withString
andText
, since the correct choice among these types is often a point of confusion. ↩ -
Concatenation of two lazy
Text
s is not actually O(1) because Haskellâs cons-list doesnât support O(1) concatenation.It is O(c1) where c1 is the number of chunks in the firstText
. It is still a much better asymptotic than concatenating two strictText
s, which is O(n1 + n2) where n1 and n2 are the lengths of the twoText
s. ↩