Chapter 4: Encoding & Evolution
With Avro, if you were to add a field that has no default value, new readers wouldn't be able to read data written by old writers, so you would break _________ compatibility.
Backward (compatibility)
__________ is when newer code can read data that was written by older code
Backward compatibility
__________ is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it.
Backward compatibility
With Avro, _________ compatibility means that you have a new version of the schema as reader and an old version as writer
Backward (compatibility)
It is generally a __________ idea to use your language's built-in encoding for anything other than very transient purposes
Bad
JSON is less verbose than XML, but both still use a lot of space compared to __________ formats.
Binary
Apache Thrift and Protocol Buffers (protobuf) are __________ libraries.
Binary Encoding
BJSON and MessagePack are examples of __________ for JSON.
Binary encodings
XML and JSON have good support for Unicode character strings (i.e., human-readable text), but they don't support __________ (sequences of bytes without a character encoding); people get around this limitation by encoding these types as text using Base64. A schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but increases the data size by 33%.
Binary strings
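The 33% figure follows from Base64 mapping every 3 input bytes to 4 output characters. A minimal sketch using Python's standard `base64` module; the 300-byte payload is an arbitrary example value, not something from the chapter:

```python
import base64

# Arbitrary example payload: 300 raw bytes with no character encoding.
raw = bytes(range(256)) + bytes(44)

# Base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(raw)

# The text form can be embedded in JSON or XML and decoded back losslessly.
decoded = base64.b64decode(encoded)

growth = len(encoded) / len(raw) - 1
print(len(raw), len(encoded), f"{growth:.0%}")  # 300 400 33%
```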
With Avro, changing the _________ of a field is possible, provided that Avro can convert the type.
Datatype
Protocol Buffers has __________ binary encoding format(s).
One
One advantage of Avro's approach, compared to Protocol Buffers and Thrift, is that the schema doesn't contain any _________.
Tag numbers
With Apache Avro, an encoded record does not contain any information that identifies fields or data types; the encoding simply consists of __________ concatenated together.
(Field) Values
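To make "values concatenated together" concrete, here is a rough sketch of an Avro-style encoder for a record with one string and one long. The list-of-pairs `SCHEMA` and the helper names are assumptions made for illustration; a real application would use an Avro library and a JSON or IDL schema.

```python
def encode_long(n: int) -> bytes:
    """Avro-style long: zigzag, then variable-length base-128 (64-bit range assumed)."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro-style string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}

# Hypothetical schema: an ordered list of (field name, type) pairs.
SCHEMA = [("userName", "string"), ("favoriteNumber", "long")]


def encode_record(record: dict, schema: list) -> bytes:
    """Concatenate field values in schema order; no names, tags, or types are written."""
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in schema)


encoded = encode_record({"userName": "Martin", "favoriteNumber": 1337}, SCHEMA)
print(encoded.hex(" "))  # 0c 4d 61 72 74 69 6e f2 14
```

Nothing in the nine output bytes says where one field ends and the next begins, which is why a reader needs the writer's schema to decode them.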
With Avro, changing the name of a field is possible but a little tricky: the reader's schema can contain aliases for field names, so it can match an old writer's schema field name against the aliases. This means that changing a field name is strictly _________ compatible.
Backward
__________ is another binary encoding format that is a result of Thrift not being a good fit for Hadoop's use cases
Apache Avro
When a data format or schema changes, a corresponding change to __________ often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field).
Application code
Apache Avro's __________ schema language is intended for human editing
Avro IDL
With Avro, adding a branch to a union type is strictly _________ compatible.
Backward
A downside of __________ for encoding data is that in order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn allows them to do terrible things such as remotely executing arbitrary code.
Built-in libraries
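Python's `pickle` is one such built-in encoding, and a harmless sketch shows why decoding untrusted bytes is risky. The `Innocent` class is hypothetical, and `len` stands in for whatever callable an attacker might name:

```python
import pickle


class Innocent:
    """Hypothetical class whose pickled form names an arbitrary callable."""

    def __reduce__(self):
        # pickle records "call len('abc')" as the way to rebuild this object.
        # An attacker could name any importable callable here instead of len.
        return (len, ("abc",))


payload = pickle.dumps(Innocent())

# Decoding runs the callable named in the byte stream; no Innocent instance comes back.
result = pickle.loads(payload)
print(result)  # 3, the return value of len("abc")
```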
A downside of __________ for encoding data is that the encoding is often tied to a particular programming language, and reading the data in another language is very difficult, precluding integrating your systems with those of other organizations (which may use different languages)
Built-in libraries
In large applications, code changes often cannot happen instantaneously. This means that old and new versions of the code, and old and new data formats, may all _________ in the system at the same time.
Coexist
The Thrift __________ encoding is semantically equivalent to BinaryProtocol; the difference is that it packs the field type and tag number into a single byte and uses variable-length integers
CompactProtocol
With Avro, a database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema _________.
Compatibility
The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same — they only need to be __________.
Compatible
In most cases, a change to an application's features also requires a change to __________: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way
data that it stores
When using Avro to send records over a network connection, two processes communicating over a bidirectional network connection can negotiate the schema version on _________ and then use that schema for the lifetime of the connection; therefore both processes know which version of the schema the other process is using.
Connection setup
Avro is friendlier to _________ schemas. For instance, if you encode a database's contents using an Avro schema (i.e., generate a record schema for each database table, where each column becomes a field in that record) and the database schema changes, you can just generate a new Avro schema from the updated database schema; the data export process does not need to pay attention to the schema change. Since the fields are identified by name in Avro, the updated writer's schema can still be matched up with the older reader's schema.
Dynamically generated
JSON, XML, and CSV remain popular, especially as __________ formats (i.e., for sending data from one organization to another.)
Data interchange (format)
In Thrift and Protocol Buffers schemas, the __________ annotation allows the parser to determine how many bytes it needs to skip; this maintains forward compatibility: old code can read records that were written by new code
Datatype
When changing the __________ of a field in a Thrift or Protocol Buffers schema, there is a risk that values will lose precision or get truncated (e.g., if you change a 32-bit integer into a 64-bit integer, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won't fit in 32 bits, it will be truncated.)
Datatype
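The truncation can be sketched with Python's `struct` module by keeping only the low 32 bits of a 64-bit value. The value `2**32 + 5` is an arbitrary example, and real generated code may behave differently by language:

```python
import struct

big = 2**32 + 5  # example value written by new code as a 64-bit integer

# New code serializes the field as a little-endian signed 64-bit integer.
wire = struct.pack("<q", big)

# Old code still holds the field in a 32-bit signed variable,
# so only the low four bytes survive.
truncated = struct.unpack("<i", wire[:4])[0]

print(big, truncated)  # 4294967301 5
```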
The translation from a byte sequence to an in-memory representation is called __________ (also known as parsing, deserialization, unmarshalling).
Decoding
With Avro, to maintain compatibility, you may only add or remove a field that has a _________.
Default value
For Avro, if the code reading the data expects some field, but the writer's schema does not contain a field of that name, it is filled in with a __________ declared in the reader's schema.
Default value.
A downside of built-in libraries is that __________ (e.g., CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.
Efficiency
Once you get into the terabytes, the choice of data format can have a big impact on __________.
Efficiency
In Thrift and Protocol Buffers, an __________ is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype.
Encoded Record
The translation from the in-memory representation of data to a byte sequence is called __________ (also known as serialization or marshalling)
Encoding
__________ is the idea that we should aim to build systems that make it easy to adapt to change.
Evolvability
To parse an encoded record using Apache Avro, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the __________ as the code that wrote the data.
Exact same schema
(T/F) An Apache Avro IDL schema contains tag numbers, similar to Thrift and Protocol Buffers
F
__________ are like aliases for fields: they are a compact way of saying what field we're talking about, without having to spell out the field name.
Field Tags
For Avro, it's no problem if the writer's schema and the reader's schema have their fields in a different order; the schema resolution matches the fields by __________.
Field name.
If you were using Thrift or Protocol Buffers for handling dynamically generated schemas, the _________ would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to _________.
Field tags
The __________ are critical to the meaning of the encoded data
Field tags
You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a __________, since that would make all existing encoded data invalid (i.e., break compatibility)
Field's tag
With Avro, _________ compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader.
Forward (compatibility)
With Avro, if you were to remove a field that has no default value, old readers wouldn't be able to read data written by new writers, so you would break _________ compatibility
Forward (compatibility)
__________ can be tricky, because it requires older code to ignore additions made by a newer version of the code
Forward compatibility
__________ is when older code can read data that was written by newer code
Forward compatibility
For the binary encodings of JSON, it's not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of __________.
Human-readability
For forward compatibility, if old code (which doesn't know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn't recognize, it can simply __________ that field.
Ignore
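How old code can skip an unknown tag is easier to see in a toy format. The layout below (1 byte tag, 1 byte type code, then the value) is an assumption for illustration only, not the real Thrift or Protocol Buffers wire format; it also assumes strings shorter than 256 bytes.

```python
import struct

# Toy wire format (illustrative assumption): 1 byte tag, 1 byte type code, then the value.
TYPE_INT32, TYPE_STRING = 0, 1


def encode_field(tag: int, value) -> bytes:
    if isinstance(value, int):
        return bytes([tag, TYPE_INT32]) + struct.pack(">i", value)
    data = value.encode("utf-8")
    return bytes([tag, TYPE_STRING, len(data)]) + data


def decode(buf: bytes, known_tags: dict) -> dict:
    """Decode a record, skipping any field whose tag the reader does not know."""
    record, pos = {}, 0
    while pos < len(buf):
        tag, type_code = buf[pos], buf[pos + 1]
        pos += 2
        if type_code == TYPE_INT32:
            size = 4
        else:
            size = buf[pos]  # string length prefix
            pos += 1
        payload = buf[pos:pos + size]
        pos += size  # the type annotation told us how many bytes to step over
        if tag not in known_tags:
            continue  # unknown tag written by newer code: ignore it
        if type_code == TYPE_INT32:
            record[known_tags[tag]] = struct.unpack(">i", payload)[0]
        else:
            record[known_tags[tag]] = payload.decode("utf-8")
    return record


# New code writes tags 1 and 2 plus a newly added tag 3.
new_data = encode_field(1, "Martin") + encode_field(2, 1337) + encode_field(3, "Berlin")

# Old code only knows tags 1 and 2.
old_reader_tags = {1: "user_name", 2: "favorite_number"}
print(decode(new_data, old_reader_tags))  # {'user_name': 'Martin', 'favorite_number': 1337}
```

A reader that also knows tag 3 decodes the same bytes and sees the extra field, so both versions can share the data.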
For Avro, if a field appears in the writer's schema but not in the reader's schema, it is __________.
Ignored
Built-in libraries for encoding data are convenient because they allow __________ objects to be saved and restored with minimal additional code.
In-memory
In Thrift, you would describe the schema in the Thrift __________ (IDL).
Interface definition language
For data that is used only __________, there is less pressure to use a lowest-common-denominator encoding format; you could choose a format that is more compact or faster to parse.
Internally within your organization
Apache Avro's __________-based schema language is intended to be easily machine-readable
JSON
__________ distinguishes strings and numbers, but it doesn't distinguish integers and floating-point numbers, and it doesn't specify a precision
JSON
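One practical consequence: the same JSON number can come out differently depending on the parser. The sketch below uses Python's `json` module and simulates a parser that stores every number as an IEEE 754 double (as JavaScript does) by passing `parse_int=float`; the `id` field and its value are made-up examples.

```python
import json

text = '{"id": 9007199254740993}'  # 2**53 + 1

# Python's parser keeps integers exact by default.
exact = json.loads(text)["id"]

# Simulate a double-only parser by asking json to parse integers with float().
as_double = json.loads(text, parse_int=float)["id"]

print(exact, int(as_double))  # 9007199254740993 9007199254740992
```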
__________'s popularity is mainly due to its built-in support in web browsers and its simplicity relative to XML
JSON
In __________, programs work with data that is kept in objects, structures, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers)
Memory
In Thrift and Protocol Buffers, if a field value is not set, it is simply __________ from the encoded record
Omitted
In Thrift and Protocol Buffers, you can add new fields to the schema, provided that you give each field a __________
New tag number
With Avro, `union { null, long, string } field` indicates that the `field` can be a _________. You can only use null if it is one of the branches of the union. It helps prevent bugs by being explicit about what can and cannot be null.
Number, or a string, or null.
A common use case for Avro is storing a large file containing millions of records, all encoded with the same schema. The writer of that file can just include the writer's schema once at the beginning of the file, so that the reader knows which schema was used to encode the records in the file. Avro specifies a file format (_________) to do this.
Object container files
For Protocol Buffers, an __________ field can be changed (without breaking compatibility) into a repeated (multi-valued) field; new code reading old data sees a list with zero or one elements; old code reading new data sees only the last element in the list.
Optional
To maintain backward and forward compatibility with Thrift and Protocol Buffer schemas, you can only remove a field that is __________ and can never use the same tag number again.
Optional
To maintain backward compatibility with Thrift and Protocol Buffers schemas, every field you add after the initial deployment of the schema must be __________ or have a default value.
Optional
Avro doesn't have _________ and _________ markers in the same way as Protocol Buffers and Thrift do (it has union types and default values instead).
Optional (and) required
For sending data from one organization to another, as long as people agree on what the format is, it often doesn't matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything __________ most other concerns.
Outweighs
__________ does the bit packing slightly differently, but is very similar to Thrift's CompactProtocol
Protocol Buffers
__________ does not have a list or array datatype, instead it has a repeated marker for fields (which is a third option alongside required and optional).
Protocol Buffers
__________, a binary encoding library, was originally developed at Google
Protocol Buffers
With Apache Avro, when an application wants to decode some data, it is expecting the data to be in some schema, which is known as the __________.
Reader's schema
RPC stands for:
Remote Procedure Call
For Protocol Buffers, the encoding of a __________ field is just what the name suggests: the same field tag simply appears multiple times in the encoded record.
Repeated
REST stands for:
Representational State Transfer
For backwards compatibility with Thrift and Protocol Buffer schemas, if you add a new field, you cannot make it __________, since the check would fail if new code read data written by old code, because the old code will not have written the new field that you added.
Required
A __________ allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability
Rolling upgrade
With server-side applications, a __________ (also known as a staged rollout) is the process of deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all nodes.
Rolling upgrade
Both Thrift and Protocol Buffers require a __________ for any data that is encoded
Schema
Some of the binary encoding formats for JSON and XML extend the set of data types (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. Since they don't prescribe a __________, they need to include all the object field names within the encoded data.
Schema
__________ is the idea that schemas inevitably need to change over time.
Schema evolution
When a program wants to write data to a file or send it over the network, it has to encode it as some kind of self-contained __________ (for example, a JSON document). Since a pointer won't make sense to any other process, this representation looks quite different from the data structures that are normally used in memory.
Sequence of bytes
Thrift's and Protocol Buffers' support for code generation is useful in _________ languages, since it allows efficient in-memory structures to be used for decoded data and allows type checking and autocompletion in IDEs when writing programs that access the data structures.
Statically typed (languages)
(T/F) Many programming languages come with built-in support for encoding in-memory objects into byte sequences
T
In __________, each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and, where required, a length indication (length of a string, number of items in a list). The strings that appear in the data are also encoded as ASCII (or rather, UTF-8). Further, there are no field names in the encoding; instead, the encoded data contains field tags, which are numbers that appear in the schema definition.
Thrift
__________ has a dedicated list datatype, which is parameterized with the datatype of the list elements. This does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.
Thrift
__________ has two different binary encoding formats called BinaryProtocol and CompactProtocol
Thrift
__________, a binary encoding library, was originally developed at Facebook
Thrift
__________ and __________ each come with a code generation tool that takes a schema definition and produces classes that implement the schema in various programming languages. Your application code can call the generated code to encode or decode records of the schema.
Thrift; Protocol Buffers
Programs usually work with data in at least __________ different representations
Two
With Avro, null is not automatically an acceptable value for a field; if you want to allow a field to be null, you have to use a _________ type.
Union
For backward compatibility in Thrift and Protocol Buffers schemas, as long as each field has a __________, new code can always read old data, because the tag numbers still have the same meaning.
Unique tag number
When using Avro in a database, different records may be written using different writer's schemas. The simplest way to know which writer's schema was used to encode a record is to include a _________ number at the beginning of every record, and keep a list of schema versions in your database.
Version
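A minimal sketch of the version-prefix idea. The one-byte prefix, the `SCHEMA_VERSIONS` registry, and the field lists are all assumptions for illustration; the chapter does not prescribe a layout.

```python
# Hypothetical registry of writer's schemas, keyed by version number.
SCHEMA_VERSIONS = {
    1: ["userName"],
    2: ["userName", "favoriteNumber"],
}


def wrap(version: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the version of the schema that wrote it."""
    return bytes([version]) + payload


def unwrap(stored: bytes):
    """Return the writer's schema looked up by version, plus the record bytes."""
    version = stored[0]
    return SCHEMA_VERSIONS[version], stored[1:]


stored = wrap(2, b"\x0cMartin\xf2\x14")
writer_schema, record_bytes = unwrap(stored)
print(writer_schema)  # ['userName', 'favoriteNumber']
```

The reader fetches the writer's schema for that version and then decodes `record_bytes` with it.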
A downside of built-in libraries for encoding is that __________ data is often an afterthought in these libraries: they often neglect the inconvenient problems of forward and backwards compatibility.
Versioning
With Apache Avro, when an application wants to encode some data, it encodes the data using whatever version of the schema it knows about; this is known as the __________
Writer's schema
When data is decoded (read), the Avro library resolves the differences by looking at the writer's schema and reader's schema side by side and translating the data from the __________ into the __________. The Avro specification defines how this resolution works.
Writer's schema (into the) reader's schema
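The resolution rules from the Avro cards (match by field name, ignore writer-only fields, fill reader-only fields from defaults) can be sketched on plain dictionaries. The `(name, default)` schema shape and the `NO_DEFAULT` sentinel are simplifications made up for this example, not Avro's real schema representation or API.

```python
NO_DEFAULT = object()  # sentinel for "no default declared" (None may be a real default)


def resolve(writer_record: dict, reader_schema: list) -> dict:
    """Translate a record decoded with the writer's schema into the reader's schema."""
    result = {}
    for name, default in reader_schema:
        if name in writer_record:
            result[name] = writer_record[name]  # matched by field name; order is irrelevant
        elif default is not NO_DEFAULT:
            result[name] = default  # reader expects it, writer lacks it: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that only the writer knows about are never copied, i.e. they are ignored.
    return result


# Writer has userID and photoURL; reader wants a different order plus favoriteColor.
writer_record = {"userID": 42, "userName": "Martin", "photoURL": "http://example.com/p.png"}
reader_schema = [("userName", NO_DEFAULT), ("userID", NO_DEFAULT), ("favoriteColor", None)]

print(resolve(writer_record, reader_schema))
# {'userName': 'Martin', 'userID': 42, 'favoriteColor': None}
```

A reader field with neither a written value nor a default raises an error, which is the compatibility break described in the cards on adding or removing fields without defaults.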
__________ is often criticized for being too verbose and unnecessarily complicated
XML
In __________, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema).
XML (and CSV)
In Thrift and Protocol Buffers, marking a field as __________ enables a runtime check that fails if the field is not set, which can be useful for catching bugs.
required
For Thrift and Protocol Buffers, each field in the schema must be marked with either __________ or __________
required or optional