Chapter 4: Encoding & Evolution

With Avro, if you were to add a field that has no default value, new readers wouldn't be able to read data written by old writers, so you would break _________ compatibility.

Backward (compatibility)

__________ is when newer code can read data that was written by older code

Backward compatibility

__________ is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it.

Backward compatibility

With Avro, _________ compatibility means that you have a new version of the schema as reader and an old version as writer

Backward (compatibility)

It is generally a __________ idea to use your language's built-in encoding for anything other than very transient purposes

Bad

JSON is less verbose than XML, but both still use a lot of space compared to __________ formats.

Binary

Apache Thrift and Protocol Buffers (protobuf) are __________ libraries.

Binary Encoding

BJSON and MessagePack are examples of __________ for JSON.

Binary encodings

XML and JSON have good support for Unicode character strings (i.e., human-readable text), but they don't support __________ (sequences of bytes without a character encoding); people get around this limitation by encoding these types as text using Base64. A schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but increases the data size by 33%.

Binary strings
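
The 33% figure follows from Base64 emitting 4 output characters for every 3 input bytes. A minimal sketch using Python's standard `base64` module (the 3000-byte random payload is just an illustrative assumption):

```python
import base64
import os

def base64_overhead(raw: bytes) -> float:
    """Return encoded_size / raw_size for standard Base64."""
    return len(base64.b64encode(raw)) / len(raw)

raw = os.urandom(3000)                    # an arbitrary binary string (length divisible by 3)
encoded = base64.b64encode(raw)           # text-safe representation of the same bytes
assert base64.b64decode(encoded) == raw   # the round trip is lossless

ratio = base64_overhead(raw)              # 4 output bytes per 3 input bytes
print(len(raw), len(encoded), round(ratio, 3))
```

For inputs whose length is not a multiple of 3, padding makes the overhead slightly larger than one third.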

With Avro, changing the _________ of a field is possible, provided that Avro can convert the type.

Datatype

Protocol Buffers has __________ binary encoding format(s).

One

One advantage of Avro's approach, compared to Protocol Buffers and Thrift, is that the schema doesn't contain any _________.

Tag numbers

With Apache Avro, an encoded record does not contain any information that identifies fields or data types; the encoding simply consists of __________ concatenated together.

(Field) Values

With Avro, changing the name of a field is possible but a little tricky: the reader's schema can contain aliases for field names, so it can match an old writer's schema field name against the aliases. This means that changing a field name is strictly _________ compatible.

Backward

__________ is another binary encoding format that is a result of Thrift not being a good fit for Hadoop's use cases

Apache Avro

When a data format or schema changes, a corresponding change to __________ often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field).

Application code

Apache Avro's __________ schema language is intended for human editing

Avro IDL

With Avro, adding a branch to a union type is strictly _________ compatible.

Backward

A downside of __________ for encoding data is that in order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn allows them to do terrible things such as remotely executing arbitrary code.

Built-in libraries
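
As a harmless illustration of why this matters, here is a sketch using Python's built-in `pickle` module. The class name `Innocent` is hypothetical; the point is only that the encoded bytes, not the decoding program, decide which callable runs at load time.

```python
import pickle

class Innocent:
    """Looks like plain data, but tells pickle how it should be 'rebuilt'."""

    def __reduce__(self):
        # On load, pickle calls sorted([3, 1, 2]) instead of rebuilding Innocent.
        # A hostile payload could name any importable callable here instead.
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Innocent())
result = pickle.loads(payload)   # the decoder runs the callable chosen by the payload

print(type(result).__name__, result)
```

The decoded value is not an `Innocent` instance at all, which is why byte sequences from untrusted sources should not be passed to a decoder like this.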

A downside of __________ for encoding data is that the encoding is often tied to a particular programming language, and reading the data in another language is very difficult, precluding integrating your systems with those of other organizations (which may use different languages)

Built-in libraries

In a large application, code changes often cannot happen instantaneously. This means that old and new versions of the code, and old and new data formats, may all _________ in the system at the same time.

Coexist

The Thrift __________ encoding is semantically equivalent to BinaryProtocol; the difference is that it packs the field type and tag number into a single byte and uses variable-length integers

CompactProtocol
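
The variable-length integer idea can be sketched as follows. This is a generic illustration of 7-bits-per-byte varints and zigzag mapping, not a reimplementation of either library's full wire format; the function names are made up for this example.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 data bits per byte, top bit set means 'more bytes follow'."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)   # more bytes follow
        else:
            out.append(low7)          # last byte
            return bytes(out)

def zigzag64(n: int) -> int:
    """Map a signed 64-bit value to unsigned so small negatives stay small."""
    return (n << 1) ^ (n >> 63)

plain = encode_varint(1337)                # plain varint
zigzagged = encode_varint(zigzag64(1337))  # zigzag first, then varint

print(plain.hex(), zigzagged.hex())        # b90a f214
```

Either way, the number 1337 fits in two bytes instead of a fixed eight.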

With Avro, a database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema _________.

Compatibility

The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same — they only need to be __________.

Compatible

In most cases, a change to an application's features also requires a change to __________: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way

data that it stores

When using Avro to send records over a network connection, two processes communicating over a bidirectional network connection can negotiate the schema version on _________ and then use that schema for the lifetime of the connection; therefore both processes know which version of the schema the other process is using.

Connection setup

Avro is friendlier to _________ schemas. For instance, if you export a database's contents using an Avro schema (i.e., generate a record schema for each database table, with each column becoming a field in that record) and the database schema changes, you can just generate a new Avro schema from the updated database schema; the data export process does not need to pay attention to the schema change. Since the fields are identified by name in Avro, the updated writer's schema can still be matched up with the older reader's schema.

Dynamically generated

JSON, XML, and CSV remain popular, especially as __________ formats (i.e., for sending data from one organization to another.)

Data interchange (format)

In Thrift and Protocol Buffers schemas, the __________ annotation allows the parser to determine how many bytes it needs to skip; this maintains forward compatibility: old code can read records that were written by new code

Datatype

When changing the __________ of a field in a Thrift or Protocol Buffers schema, there is a risk that values will lose precision or get truncated (e.g., if you change a 32-bit integer into a 64-bit integer, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won't fit in 32 bits, it will be truncated.)

Datatype
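
The truncation can be simulated in a few lines. Python integers are arbitrary precision, so this sketch masks to 32 bits by hand; `as_int32` is a hypothetical helper standing in for old code that stores the value in a 32-bit variable.

```python
def as_int32(value: int) -> int:
    """Keep only the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 0x1_0000_0000 if low >= 0x8000_0000 else low

small = as_int32(1337)        # fits in 32 bits: unchanged
big = as_int32(2**40 + 5)     # high bits are silently dropped
wrapped = as_int32(2**31)     # one past the largest int32 wraps to negative

print(small, big, wrapped)    # 1337 5 -2147483648
```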

The translation from a byte sequence to an in-memory representation is called __________ (also known as parsing, deserialization, unmarshalling).

Decoding

With Avro, to maintain compatibility, you may only add or remove a field that has a _________.

Default value

For Avro, if the code reading the data expects some field, but the writer's schema does not contain a field of that name, it is filled in with a __________ declared in the reader's schema.

Default value.

A downside of built-in libraries is that __________ (e.g., CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.

Efficiency

Once you get into the terabytes, the choice of data format can have a big impact on __________.

Efficiency

In Thrift and Protocol Buffers, an __________ is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype.

Encoded Record

The translation from the in-memory representation of data to a byte sequence is called __________ (also known as serialization or marshalling)

Encoding

__________ is the idea that we should aim to build systems that make it easy to adapt to change.

Evolvability

To parse an encoded record using Apache Avro, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the __________ as the code that wrote the data.

Exact same schema
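
A toy encoder shows why the schema must match. This sketch supports only two types, `long` and `string`, writes values back to back with no tags or field names, and is not the Avro library; the schema representation (a list of name/type pairs) is an assumption made for the example.

```python
def _write_long(n, out):
    z = (n << 1) ^ (n >> 63)              # zigzag, then 7-bit variable-length groups
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def _read_long(buf, pos):
    shift = z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos       # undo zigzag

def encode(schema, record):
    out = bytearray()
    for name, ftype in schema:            # field order comes only from the schema
        if ftype == "long":
            _write_long(record[name], out)
        elif ftype == "string":
            data = record[name].encode("utf-8")
            _write_long(len(data), out)   # length prefix, then the bytes
            out += data
        else:
            raise ValueError(f"unsupported type {ftype!r}")
    return bytes(out)

def decode(schema, buf):
    pos, record = 0, {}
    for name, ftype in schema:
        if ftype == "long":
            record[name], pos = _read_long(buf, pos)
        elif ftype == "string":
            n, pos = _read_long(buf, pos)
            record[name] = buf[pos:pos + n].decode("utf-8")
            pos += n
        else:
            raise ValueError(f"unsupported type {ftype!r}")
    return record

schema = [("userName", "string"), ("favoriteNumber", "long")]
record = {"userName": "Martin", "favoriteNumber": 1337}
data = encode(schema, record)

same = decode(schema, data)                   # same schema: correct
wrong = decode(list(reversed(schema)), data)  # different field order: silently misread
print(data.hex(), same == record, wrong == record)
```

Nothing in the bytes says which field is which, so a reader with a different schema gets wrong values without any error being raised.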

(T/F) An Apache Avro IDL schema contains tag numbers, similar to Thrift and Protocol Buffers

F

__________ are like aliases for fields: they are a compact way of saying what field we're talking about, without having to spell out the field name.

Field Tags

For Avro, it's no problem if the writer's schema and the reader's schema have their fields in a different order; the schema resolution matches the fields by __________.

Field name.

If you were using Thrift or Protocol Buffers for handling dynamically generated schemas, the _________ would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to _________.

Field tags

The __________ are critical to the meaning of the encoded data

Field tags

You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a __________, since that would make all existing encoded data invalid (i.e., break compatibility)

Field's tag

With Avro, _________ compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader.

Forward (compatibility)

With Avro, if you were to remove a field that has no default value, old readers wouldn't be able to read data written by new writers, so you would break _________ compatibility

Forward (compatibility)

__________ can be tricky, because it requires older code to ignore additions made by a newer version of the code

Forward compatibility

__________ is when older code can read data that was written by newer code

Forward compatibility

For the binary encodings of JSON, it's not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of __________.

Human-readability

For forward compatibility, if old code (which doesn't know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn't recognize, it can simply __________ that field.

Ignore
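
The skipping behaviour can be sketched with a toy tagged format. The layout here (one tag byte, one type byte, then the payload) and the type codes are invented for illustration and are not the real Thrift or Protocol Buffers wire format; what carries over is that the type annotation tells the reader how many bytes to skip.

```python
import struct

INT64, STRING = 0, 1   # hypothetical type codes for this toy format

def encode_fields(fields):
    """fields: list of (tag, type_code, value) tuples."""
    out = bytearray()
    for tag, type_code, value in fields:
        out += bytes([tag, type_code])
        if type_code == INT64:
            out += struct.pack(">q", value)               # fixed 8 bytes
        elif type_code == STRING:
            data = value.encode("utf-8")
            out += struct.pack(">H", len(data)) + data    # 2-byte length prefix
        else:
            raise ValueError(f"unknown type code {type_code}")
    return bytes(out)

def decode_fields(buf, known_tags):
    """known_tags: {tag: field_name} from the reader's version of the schema."""
    record, pos = {}, 0
    while pos < len(buf):
        tag, type_code = buf[pos], buf[pos + 1]
        pos += 2
        if type_code == INT64:
            size = 8
        elif type_code == STRING:
            (size,) = struct.unpack_from(">H", buf, pos)
            pos += 2
        else:
            raise ValueError(f"unknown type code {type_code}")
        payload = buf[pos:pos + size]
        pos += size                       # advance past the field either way
        if tag not in known_tags:
            continue                      # unrecognized tag: ignore the field
        if type_code == INT64:
            record[known_tags[tag]] = struct.unpack(">q", payload)[0]
        else:
            record[known_tags[tag]] = payload.decode("utf-8")
    return record

# New code writes an extra field with tag 4; old code only knows tags 1 and 2.
new_data = encode_fields([(1, STRING, "Martin"), (2, INT64, 1337), (4, STRING, "new field")])
old_reader = decode_fields(new_data, {1: "userName", 2: "favoriteNumber"})
print(old_reader)
```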

For Avro, if a field appears in the writer's schema but not in the reader's schema, it is __________.

Ignored

Built-in libraries for encoding data are convenient because they allow __________ objects to be saved and restored with minimal additional code.

In-memory

In Thrift, you would describe the schema in the Thrift __________ (IDL).

Interface definition language

For data that is used only __________, there is less pressure to use a lowest-common-denominator encoding format; you could choose a format that is more compact or faster to parse.

Internally within your organization

Apache Avro's __________-based schema language is intended to be easily machine-readable

JSON

__________ distinguishes strings and numbers, but it doesn't distinguish integers and floating-point numbers, and it doesn't specify a precision

JSON
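
One consequence can be shown with Python's standard `json` module. The JSON text carries the digits exactly, but what a reader gets depends on the number type its parser chooses; here `parse_int=float` simulates a parser that uses IEEE 754 doubles for every number.

```python
import json

big = 2**53 + 1                     # not exactly representable as a double
text = json.dumps({"id": big})      # the JSON text itself keeps every digit

exact = json.loads(text)["id"]                        # Python's parser picks int
as_double = json.loads(text, parse_int=float)["id"]   # a double-only parser

print(text)
print(exact == big, int(as_double) == big)
```

The same document decodes to different values in the two cases, because the format itself does not say which precision applies.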

__________'s popularity is mainly due to its built-in support in web browsers and simplicity relative to XML

JSON

In __________, programs work with data that is kept in objects, structures, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers)

Memory

In Thrift and Protocol Buffers, if a field value is not set, it is simply __________ from the encoded record

Omitted

In Thrift and Protocol Buffers, you can add new fields to the schema, provided that you give each field a __________

New tag number

With Avro, `union { null, long, string } field` indicates that the `field` can be a _________. You can only use null if it is one of the branches of the union. It helps prevent bugs by being explicit about what can and cannot be null.

Number, or a string, or null.

A common use case for Avro is for storing a large file containing millions of records, all encoded with the same schema. The writer of that file can just include the writer's schema once at the beginning of the file to let the reader know which schema was used to encode the records in the file. Avro specifies a file format (_________) to do this.

Object container files

For Protocol Buffers, an __________ field can be changed (without breaking compatibility) into a repeated (multi-valued) field; new code reading old data sees a list with zero or one elements; old code reading new data sees only the last element in the list.

Optional

To maintain backward and forward compatibility with Thrift and Protocol Buffers schemas, you can only remove a field that is __________ and can never use the same tag number again.

Optional

To maintain backward compatibility with Thrift and Protocol Buffers schemas, every field you add after the initial deployment of the schema must be __________ or have a default value.

Optional

Avro doesn't have _________ and _________ markers in the same way as Protocol Buffers and Thrift do (it has union types and default values instead).

Optional (and) required

For sending data from one organization to another, as long as people agree on what the format is, it often doesn't matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything __________ most other concerns.

Outweighs

__________ does the bit packing slightly differently, but is very similar to Thrift's CompactProtocol

Protocol Buffers

__________ does not have a list or array datatype, instead it has a repeated marker for fields (which is a third option alongside required and optional).

Protocol Buffers

__________, a binary encoding library, was originally developed at Google

Protocol Buffers

With Apache Avro, when an application wants to decode some data, it is expecting the data to be in some schema, which is known as the __________.

Reader's schema

RPC stands for:

Remote Procedure Calls

For Protocol Buffers, the encoding of a __________ field is just what the name says: the same field tag simply appears multiple times in the encoded record.

Repeated

REST stands for:

Representational State Transfer

For backward compatibility with Thrift and Protocol Buffers schemas, if you add a new field, you cannot make it __________, since the check would fail if new code read data written by old code, because the old code will not have written the new field that you added.

Required

A __________ allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability

Rolling upgrade

With server-side applications, a __________ (also known as a staged rollout) is the process of deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all nodes.

Rolling upgrade

Both Thrift and Protocol Buffers require a __________ for any data that is encoded

Schema

Some of the binary encoding formats for JSON and XML extend the set of data types (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. Since they don't prescribe a __________, they need to include all the object field names within the encoded data.

Schema

__________ is the idea that schemas inevitably need to change over time.

Schema evolution

When a program wants to write data to a file or send it over the network, it has to encode it as some kind of self-contained __________ (for example, a JSON document). Since a pointer won't make sense to any other process, this representation looks quite different from the data structures that are normally used in memory.

Sequence of bytes

Thrift's and Protocol Buffers' support for code generation is useful in _________ languages, since it allows efficient in-memory structures to be used for decoded data and allows type checking and autocompletion in IDEs when writing programs that access the data structures.

Statically typed (languages)

(T/F) Many programming languages come with built-in support for encoding in-memory objects into byte sequences

T

In __________, each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and, where required, a length indication (length of a string, number of items in a list). The strings that appear in the data are also encoded as ASCII (or rather, UTF-8). Further, there are no field names in the encoding; instead, the encoded data contains field tags, which are numbers that appear in the schema definition.

Thrift

__________ has a dedicated list datatype, which is parameterized with the datatype of the list elements. This does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.

Thrift

__________ has two different binary encoding formats called BinaryProtocol and CompactProtocol

Thrift

__________, a binary encoding library, was originally developed at Facebook

Thrift

__________ and __________ each come with a code generation tool that takes a schema definition and produces classes that implement the schema in various programming languages. Your application code can call the generated code to encode or decode records of the schema.

Thrift; Protocol Buffers

Programs usually work with data in at least __________ different representations

Two

With Avro, null is not an acceptable default for a field; if you want to allow a field to be null, you have to use a _________.

Union

For backward compatibility in Thrift and Protocol Buffers schemas, as long as each field has a __________, new code can always read old data, because the tag numbers still have the same meaning.

Unique tag number

When using Avro in a database, different records may be written using different writer's schemas. The simplest solution to know which writer's schema was used to encode a record is to include a _________ number at the beginning of every record, and keep a list of schema versions in your database.

Version
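
The version-prefix idea might look like the following sketch. The `SCHEMAS` table, the one-byte version prefix, and the JSON-encoded record body are all simplifying assumptions for illustration; a real system would keep the schema list in the database and use Avro's binary encoding for the body.

```python
import json

# Hypothetical list of schema versions, keyed by version number.
SCHEMAS = {
    1: ["userName", "favoriteNumber"],
    2: ["userName", "favoriteNumber", "interests"],
}

def write_record(version, values):
    """Prefix the record with the version of the schema it was written with."""
    if len(values) != len(SCHEMAS[version]):
        raise ValueError("values do not match the writer's schema")
    return bytes([version]) + json.dumps(values).encode("utf-8")

def read_record(buf):
    """Look up the writer's schema from the version prefix, then decode."""
    version = buf[0]
    writer_schema = SCHEMAS[version]
    values = json.loads(buf[1:].decode("utf-8"))
    return version, dict(zip(writer_schema, values))

old_row = write_record(1, ["Martin", 1337])
new_row = write_record(2, ["Ada", 42, ["hacking"]])

print(read_record(old_row))
print(read_record(new_row))
```

Both rows can live side by side in the same store, because each one says which schema it was written with.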

A downside of built-in libraries for encoding is that __________ data is often an afterthought in these libraries: they often neglect the inconvenient problems of forward and backwards compatibility.

Versioning

With Apache Avro, when an application wants to encode some data, it encodes the data using whatever version of the schema it knows about; this is known as the __________

Writer's schema

When data is decoded (read), the Avro library resolves the differences by looking at the writer's schema and reader's schema side by side and translating the data from the __________ into the __________; the Avro specification defines how this resolution works.

Writer's schema (into the) reader's schema
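
A much-simplified version of this resolution can be sketched in plain Python. It covers only matching fields by name, ignoring writer-only fields, and filling reader-only fields from defaults; the schema representation and the `NO_DEFAULT` sentinel are assumptions for this example, and the real rules in the Avro specification also cover type promotion, aliases, and unions.

```python
NO_DEFAULT = object()   # sentinel: the reader's schema declares no default for this field

def resolve(writer_schema, reader_schema, writer_values):
    """
    writer_schema: field names in the order they were written
    reader_schema: list of (name, default) pairs
    writer_values: decoded values, in writer_schema order
    """
    written = dict(zip(writer_schema, writer_values))
    record = {}
    for name, default in reader_schema:
        if name in written:
            record[name] = written[name]      # matched by field name, order irrelevant
        elif default is not NO_DEFAULT:
            record[name] = default            # missing in writer: use reader's default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record                             # writer-only fields are simply ignored

writer_schema = ["userName", "favoriteNumber", "photoURL"]
reader_schema = [("favoriteNumber", NO_DEFAULT), ("userName", NO_DEFAULT), ("country", "unknown")]

resolved = resolve(writer_schema, reader_schema, ["Martin", 1337, "http://example.com/m.png"])
print(resolved)
```

A reader field with neither a written value nor a default raises an error, which mirrors why adding a field without a default breaks backward compatibility.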

__________ is often criticized for being too verbose and unnecessarily complicated

XML

In __________, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema).

XML (and CSV)

In Thrift and Protocol Buffers, marking a field as __________ enables a runtime check that fails if the field is not set, which can be useful for catching bugs.

required

For Thrift and Protocol Buffers, each field in the schema must be marked with either __________ or __________

required or optional

