Structured and Unstructured Data and NoSQL Databases


Examples of semi-structured data include

JSON and XML are forms of semi-structured data.

There are two paths to data distribution:

replication and sharding

Relational databases saw off the challenge by stressing their role as an integration mechanism,

supported by a mostly standard language of data manipulation (SQL) and a growing professional divide between application developers and database administrators.

Replication

takes the same data and copies it over multiple nodes

materialized views

which are views that are computed in advance and cached on disk. Materialized views are effective for data that is read heavily but can stand being somewhat stale.

"NoSQL" is a

"open-source, distributed, nonrelational databases." The talks there [NoSQL Debrief] were from Voldemort, Cassandra, Dynomite, HBase, Hypertable, CouchDB, and MongoDB—but the term has never been confined to that original septet.

Storing data to disk is called what?

"persisting" the data because the data will still be there (i.e. "persist") after the computer is restarted.

With NoSQL, storing data is more casual. Explain how data is stored in a key-value store, a document database, a column-family database, and a graph database.

A key-value store allows you to store any data you like under a key. A document database effectively does the same thing, since it makes no restrictions on the structure of the documents you store. Column-family databases allow you to store any data under any column you like. Graph databases allow you to freely add new edges and freely add properties to nodes and edges as you wish.

it's true that aggregate-oriented databases don't have ...

ACID transactions that span multiple aggregates. Instead, they support atomic manipulation of a single aggregate at a time. This means that if we need to manipulate multiple aggregates in an atomic way, we have to manage that ourselves in the application code

Difference between graph and relational databases

Although relational databases can implement relationships using foreign keys, the joins required to navigate around can get quite expensive—which means performance is often poor for highly connected data models. Graph databases make traversal along the relationships very cheap. A large part of this is because graph databases shift most of the work of navigating relationships from query time to insert time. This naturally pays off for situations where querying performance is more important than insert speed

broad options to deal with write inconsistencies

At one end, we can ensure that whenever we write data, the replicas coordinate to ensure we avoid a conflict. This can give us just as strong a guarantee as a master, albeit at the cost of network traffic to coordinate the writes. We don't need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes
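
The majority rule above comes down to simple arithmetic. A minimal sketch, assuming a strict majority is required; the function names are illustrative and not taken from any particular database:

```python
def write_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerable_failures(replicas: int) -> int:
    """Replica nodes that can be lost while a majority is still reachable."""
    return replicas - write_quorum(replicas)


for n in (3, 4, 5):
    print(n, "replicas: quorum", write_quorum(n), "survives", tolerable_failures(n), "failures")
```

Note that going from 3 to 4 replicas raises the quorum without letting the cluster survive any more failures.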

Because of its relative simplicity, structured data is well suited to the ______________________________________________.

Because of its relative simplicity, structured data is well suited to the relative limitations of relational database systems.

Many databases, even key-value stores, provide ways to make these relationships visible to the database. How do document stores do this? How does Riak?

Document stores make the content of the aggregate available to the database to form indexes and queries. Riak, a key-value store, allows you to put link information in metadata, supporting partial retrieval and link-walking capability.

Aggregate is a term that comes from

Domain-Driven Design

Column-oriented

Each column family defines a record type (e.g., customer profiles) with rows for each of the records. You then think of a row as the join of records in all column families.

Row-oriented

Each row is an aggregate (for example, customer with the ID of 1234) with column families representing useful chunks of data (profile, order history) within that aggregate.

Structured data examples

Examples of structured data include financial data such as accounting transactions, address details, demographic information, star ratings by customers, machine logs, location data from smart phones and smart devices, etc.

The Value of Relational Databases

Getting at persistent data; concurrency; integration; a (mostly) standard model

Unstructured data can also generate structured insights, and a lot more. Elaborate.

Going beyond raw statistics, unstructured data can (with the right NoSQL database) provide more advanced insights, like customer sentiment. It can also provide enough structure so that non-text assets can be queried, allowing you to run facial recognition analysis on photographs, for instance.

Which databases are most useful under the single-server distribution model?

Graph databases are the obvious category here—these work best in a single-server configuration. If your data usage is mostly about processing aggregates, then a single-server document or key-value store may well be worthwhile because it's easier on application developers.

Replication and sharding are strategies that can be combined. How?

If we use both master-slave replication and sharding (see Figure 4.4), this means that we have multiple masters, but each data item only has a single master. Depending on your configuration, you may choose a node to be a master for some data and slaves for others, or you may dedicate nodes for master or slave duties.

Limitations of relational databases

In particular, the values in a relational tuple have to be simple: they cannot contain any structure, such as a nested record or a list. This limitation isn't true for in-memory data structures, which can take on much richer structures than relations. As a result, if you want to use a richer in-memory data structure, you have to translate it to a relational representation to store it on disk. Hence the impedance mismatch: two different representations that require translation.
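
The translation step can be sketched as follows. This is only an illustration: the `order` record, its field names, and the `to_rows` helper are hypothetical, not taken from the source.

```python
# A rich in-memory structure: one order containing a nested list of line items.
order = {
    "id": 99,
    "customer_id": 1234,
    "line_items": [
        {"product": "book", "quantity": 2},
        {"product": "pen", "quantity": 5},
    ],
}


def to_rows(order: dict) -> tuple:
    """Flatten the nested order into simple rows for two relational tables."""
    order_row = {"id": order["id"], "customer_id": order["customer_id"]}
    item_rows = [
        {"order_id": order["id"], "line_no": n, **item}
        for n, item in enumerate(order["line_items"], start=1)
    ]
    return order_row, item_rows


order_row, item_rows = to_rows(order)
print(order_row)
print(item_rows)
```

Reading the order back requires the reverse translation (a join on `order_id`), which is the mismatch in practice.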

In the relational model, a tuple ___________________________ and a relation is a ___________________________.

In the relational model, a tuple is a set of name-value pairs and a relation is a set of tuples
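
One way to picture this definition, as a rough sketch rather than how any database stores data: a tuple becomes a frozen set of name-value pairs and a relation a set of those. `make_tuple` and `select` are illustrative names.

```python
def make_tuple(**fields):
    """A relational tuple modelled as an immutable set of name-value pairs."""
    return frozenset(fields.items())


# A relation is a set of tuples, so the duplicate entry collapses.
customers = {
    make_tuple(id=1, name="Ann"),
    make_tuple(id=2, name="Bob"),
    make_tuple(id=1, name="Ann"),
}


def select(relation, predicate):
    """Operate on a relation and return a relation."""
    return {t for t in relation if predicate(dict(t))}


print(len(customers))
print(select(customers, lambda row: row["id"] == 1))
```

In this model a list value such as `make_tuple(tags=["a", "b"])` raises `TypeError`, because lists are not hashable. That loosely mirrors the rule that tuple values must be simple.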

Sharding

Often, a busy data store is busy because different people are accessing different parts of the dataset. In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique called sharding.

aggregate-ignorant

Relational databases have no concept of aggregate within their data model

There are two styles of distributing data: ..

• Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.
A system may use either or both techniques.
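
A minimal sketch of the sharding half, assuming keys are routed by a hash (one possible placement rule among several). `shard_for` and `ShardedStore` are hypothetical names, and each "server" is just an in-process dict.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard is the single source for its subset of the keys."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
store.put("customer:1234", {"name": "Ann"})
print(store.get("customer:1234"))
```

Simple modulo routing like this forces most keys to move when the number of shards changes, which is one reason the step to sharding is not taken lightly.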

Despite the fact that sharding is made much easier with aggregates, it's still not a step to be taken lightly. Why?

Some databases are intended from the beginning to use sharding, in which case it's wise to run them on a cluster from the very beginning of development, and certainly in production. Other databases use sharding as a deliberate step up from a single-server configuration, in which case it's best to start single-server and only use sharding once your load projections clearly indicate that you are running out of headroom.

The biggest complication with a peer-to-peer replication cluster is ...

The biggest complication is, again, consistency. When you can write to two different places, you run the risk that two people will attempt to update the same record at the same time—a write-write conflict. Inconsistencies on read lead to problems but at least they are relatively transient. Inconsistent writes are forever.

single server

The first and the simplest distribution option is the one we would most often recommend—no distribution at all. Run the database on a single machine that handles all the reads and writes to the data store. We prefer this option because it eliminates all the complexities that the other options introduce; it's easy for operations people to manage and easy for application developers to reason about.

Two rough strategies for building a materialized view

The first is the eager approach where you update the materialized view at the same time you update the base data for it. The application database (p. 7) approach is valuable here as it makes it easier to ensure that any updates to base data also update materialized views. The second is to run batch jobs that update the materialized views at regular intervals.
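
The eager approach can be sketched like this, with plain Python containers standing in for the database. `OrderStore` and its fields are hypothetical names.

```python
class OrderStore:
    """Base data plus a precomputed view that is updated on every write."""

    def __init__(self):
        self.orders = []             # base data: (customer_id, amount)
        self.total_by_customer = {}  # materialized view

    def add_order(self, customer_id: int, amount: int) -> None:
        # Eager approach: update the view in the same step as the base data.
        self.orders.append((customer_id, amount))
        current = self.total_by_customer.get(customer_id, 0)
        self.total_by_customer[customer_id] = current + amount


store = OrderStore()
store.add_order(1234, 40)
store.add_order(1234, 10)
store.add_order(5678, 7)
print(store.total_by_customer)
```

Reads of `total_by_customer` never scan `orders`, at the price of extra work on each write.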

Unstructured data can be, and is, stored in a _________________________.

Unstructured data can be, and is, stored in a number of places

Despite this blurriness, the general distinction still holds. With key-value databases, ... and with document databases, ...

With key-value databases, we expect to mostly look up aggregates using a key. With document databases, we mostly expect to submit some form of query based on the internal structure of the document
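
The two access styles look roughly like this, using a dict as a stand-in for either kind of store. `get_by_key`, `find`, and the sample documents are illustrative only.

```python
documents = {
    "order:1": {"customer": "Ann", "status": "shipped", "total": 40},
    "order:2": {"customer": "Bob", "status": "open", "total": 15},
    "order:3": {"customer": "Ann", "status": "open", "total": 22},
}


def get_by_key(store: dict, key: str):
    """Key-value style: fetch the whole aggregate by its key."""
    return store.get(key)


def find(store: dict, **criteria) -> list:
    """Document style: query on fields inside each document."""
    return sorted(
        key
        for key, doc in store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


print(get_by_key(documents, "order:2"))
print(find(documents, customer="Ann", status="open"))
```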

Master-Slave Replication

With master-slave distribution, you replicate data across multiple nodes. One node is designated as the master, or primary. This master is the authoritative source for the data and is usually responsible for processing any updates to that data. The other nodes are slaves, or secondaries.

In some cases you can think of the master-slave system as a single-server store with a hot backup ...

You get the convenience of the single-server configuration but with greater resilience, which is particularly handy if you want to be able to handle server failures gracefully.

If you update multiple aggregates at once, you have to deal yourself with a failure partway through. Relational databases help you with this by

allowing you to modify multiple records in a single transaction, providing ACID guarantees while altering many rows.

Graph databases

are motivated by a different frustration with relational databases and thus have an opposite model—small records with complex interconnections, something like

These databases with a Bigtable-style data model are often referred to as ___________________.

column stores, but that name has been around for a while to describe a different animal. NoSQL and abandoning relational databases.

Sharding is particularly valuable for performance

because it can improve both read and write performance.

NoSQL databases emerged around the turn of the century to deal with ...

big data and the mismatch between how programs store data in RAM and on the hard drive.

How can a master be appointed?

can be appointed manually or automatically

Using peer-to-peer replication and sharding is a common strategy for ...

column-family databases.

Aggregate orientation recognizes that often you want to operate on data in units that have a more ___________________ than a set of tuples

complex structure. As we'll see, key-value, document, and column-family databases all make use of this more complex record.

Large data estates can be housed in a ______________?

data warehouse - as long as the info continues to meet the rigid database schema

A schemaless store also makes it easier to deal with nonuniform data:

data where each record has a different set of fields

Polyglot persistence: new opportunities for enterprises, but also problems. Examples of problems:

Decisions, organizational change, immaturity, and dealing with the eventual consistency paradigm.

Unstructured data

doesn't have a predefined data model. Therefore, it's not as easily categorized into the predefined tables and rows of a relational database. Satellite imagery, audio files, video files, or even emails may have common features, but these rich data types aren't easily ingested, processed, or analyzed with conventional database related systems.

Structured data is easily _____________ and _________________________________________.

easily stored and retrieved in a traditional relational database where the management system applies logic to ensure information is in the correct format as it is written to disk.

Difference between structured and unstructured data. Start with structured.

every record adheres to a predefined data model; if incoming data fails to meet those definitions, it cannot be saved without correction or truncation. As a result, structured data may often be very text-heavy. This does have the advantage of being extremely easy to parse and search using conventional software.

Master-slave replication is most helpful for what?

for scaling when you have a read-intensive dataset. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves.

Depending on your distribution model, you can get a data store that will give you the ability to ...

handle larger quantities of data, the ability to process a greater read or write traffic, or more availability in the face of network slowdowns or breakages. These are often important benefits, but they come at a cost. Running over a cluster introduces complexity—so it's not something to do unless the benefits are compelling.

Aggregate orientation fits well with scaling out because

the aggregate is a natural unit to use for distribution.

implicit schema

is a set of assumptions about the data's structure in the code that manipulates the data.
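
For instance, the small function below never declares a schema, yet it only works on records shaped the way it expects. The field names are hypothetical.

```python
def order_total(order: dict) -> int:
    # Implicit schema: assumes a "line_items" list whose entries each
    # carry "price" and "quantity" fields.
    return sum(item["price"] * item["quantity"] for item in order["line_items"])


print(order_total({"line_items": [{"price": 10, "quantity": 2}, {"price": 5, "quantity": 1}]}))
```

A record that names the field `qty` instead of `quantity` is accepted by a schemaless store but fails here with `KeyError`.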

The reason that this third category exists (between structured and unstructured data) is ...

because semi-structured data is considerably easier to analyse than unstructured data. Many Big Data solutions and tools have the ability to 'read' and process either JSON or XML. This reduces the complexity of analysing semi-structured data, compared to unstructured data.
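
As a small illustration of why tools can 'read' such data, Python's standard `json` module turns a JSON string into named fields without any schema being declared first. The message content below is made up.

```python
import json

raw = (
    '{"sender": "ann@example.com", "sent": "2024-01-03T10:15:00", '
    '"tags": ["invoice", "urgent"], "body": "Free text goes here."}'
)

message = json.loads(raw)   # the markers in the text recover the structure
fields = sorted(message)    # field names are discovered from the data itself

print(fields)
print(message["tags"])
```

The `body` value is still unstructured text. Only the surrounding fields are directly queryable.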

metadata

is data about data. It provides additional information about a specific set of data. From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions.

Apache Graph

is optimized for storing relationships between nodes (structured?)

Column-family databases: example with peer-to-peer replication and sharding

In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them. A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. Should a node fail, then the shards on that node will be built on the other nodes.
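
A rough sketch of a replication factor of 3. It assumes a simple "next nodes in order" placement rule, which is only one of many possible schemes, and `replicas_for` is a hypothetical helper.

```python
def replicas_for(shard: int, num_nodes: int, replication_factor: int = 3) -> list:
    """Return the node indexes that hold a copy of the given shard."""
    if replication_factor > num_nodes:
        raise ValueError("not enough nodes for the requested replication factor")
    return [(shard + i) % num_nodes for i in range(replication_factor)]


nodes = 5
for shard in range(nodes):
    print("shard", shard, "lives on nodes", replicas_for(shard, nodes))

# If node 0 fails, every shard it held still has two live copies.
failed = 0
for shard in range(nodes):
    survivors = [n for n in replicas_for(shard, nodes) if n != failed]
    assert len(survivors) >= 2
```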

The fundamental data model of a graph database is very simple:

nodes connected by edges (also called arcs). Beyond this essential characteristic there is a lot of variation in data models—in particular, what mechanisms you have to store data in your nodes and edges.

Sharding

puts different data on different nodes.

What is the second advantage of master-slave replication?

read resilience: Should the master fail, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed. However, having slaves as replicates of the master does speed up recovery after a failure of the master since a slave can be appointed a new master very quickly.

The four distribution models

single server, sharding, master-slave replication, and peer-to-peer replication

When references are needed, we could ...

switch to document stores and then query inside the documents, or even change the data for the key-value store to split the value object into Customer and Order objects and then maintain these objects' references to each other.

Manual appointing

typically means that when you configure your cluster, you configure one node as the master

An important aspect of relationships between aggregates is how they handle updates. Aggregate-oriented databases treat the aggregate as the ...

unit of data-retrieval. Consequently, atomicity is only supported within the contents of a single aggregate.

Master-slave replication reduces the chance of ...

update conflicts but peer-to-peer replication avoids loading all writes onto a single point of failure.

In a set of photographs, for example, metadata could describe what?

when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by Big Data solutions for initial analysis.

Real Time BI or Real Time Analytics

where enterprises don't have to rely on end-of-the-day batch runs to populate data warehouse tables and generate analytics; now they can fill in this type of data, for multiple types of requirements, when the order is placed by the customer.

shared database integration

where multiple applications store their data in a single database

auto-sharding

where the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard. This can make it much easier to use sharding in an application.

automatic appointment

you create a cluster of nodes and they elect one of themselves to be the master. Apart from simpler configuration, automatic appointment means that the cluster can automatically appoint a new master when a master fails, reducing downtime.

Is master-slave replication a good scheme for datasets with heavy write traffic?

No. The master still has to process every update, so it isn't such a good scheme for datasets with heavy write traffic, although offloading the read traffic will help a bit with handling the write load.

In practice, the line between key-value and document gets a bit blurry.

People often put an ID field in a document database to do a key-value style lookup. Databases classified as key-value databases may allow you structures for data beyond just an opaque aggregate.

A tuple is a limited data structure that captures ...

It captures a set of values, so you cannot nest one tuple within another to get nested records, nor can you put a list of values or tuples within another. This simplicity underpins the relational model—it allows us to think of all operations as operating on and returning tuples.

Peer-to-Peer Replication

Master-slave replication helps with read scalability but doesn't help with scalability of writes. It provides resilience against failure of a slave, but not of a master. Essentially, the master is still a bottleneck and a single point of failure. Peer-to-peer replication (see Figure 4.3) attacks these problems by not having a master. All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't prevent access to the data store.

For analyzing complex data types, or for advanced data analysis, ...

NoSQL databases offer a way to more efficiently manage and search across disparate data sets.

a schemaless database shifts the schema into the application code that accesses it. These problems can be reduced with a couple of approaches

One is to encapsulate all database interaction within a single application and integrate it with other applications using web services. This fits in well with many people's current preference for using web services for integration. Another approach is to clearly delineate different areas of an aggregate for access by different applications. These could be different sections in a document database or different column families in a column-family database.

A kind of NoSQL database that uses a distribution model similar to relational databases but offers a different data model that makes it better at ...

handling data with complex relationships.

Data is

helping us understand what is happening and what has happened. Telling the narrative in a way that is easily understood is sometimes the more challenging part.

For people using a database, the data model describes

how we interact with the data in the database. This is distinct from a storage model, which describes how the database stores and manipulates the data internally.

The vital

if sometimes inconvenient, fact is that whenever we write a program that accesses data, that program almost always relies on some form of implicit schema.

document database

imposes limits on what we can place in it, defining allowable structures and types. In return, however, we get more flexibility in access.

Graph databases are ideal for

capturing any data consisting of complex relationships such as social networks, product preferences, or eligibility rules.

Relational databases use what to handle consistency?

use ACID transactions (p. 19) to handle consistency across the whole database. This inherently clashes with a cluster environment, so NoSQL databases offer a range of options for consistency and distribution.

A data model

is the model through which we perceive and manipulate our data

Master-slave replication comes with some alluring benefits, but it also comes with an inevitable dark side: inconsistency.

You have the danger that different clients, reading different slaves, will see different values because the changes haven't all propagated to the slaves. In the worst case, that can mean that a client cannot read a write it just made. Even if you use master-slave replication just for hot backup this can be a concern, because if the master fails, any updates not passed on to the backup are lost.
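
The stale-read problem can be reproduced in a toy model where propagation to the slaves is an explicit, delayed step. `MasterSlaveStore` is a hypothetical in-memory stand-in, not a real replication protocol.

```python
class MasterSlaveStore:
    """Writes go to the master; reads are served by slaves."""

    def __init__(self, num_slaves: int = 2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []  # updates not yet passed on to the slaves

    def write(self, key, value) -> None:
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key, slave: int = 0):
        return self.slaves[slave].get(key)

    def propagate(self) -> None:
        """Apply all pending updates to every slave."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()


store = MasterSlaveStore()
store.write("room:101", "booked")
print(store.read("room:101"))  # the client cannot yet read its own write
store.propagate()
print(store.read("room:101"))
```

Anything still in `pending` when the master fails is exactly the set of updates that would be lost.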

The relational data model organizes data into

a structure of tables and rows, or more properly, relations and tuples.

Define a data structure

is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

Aggregate-oriented databases work best when most data interaction is done with the ...

same aggregate; aggregate-ignorant databases are better when interactions use data organized in many different formations.

A view

is like a relational table (it is a relation) but it's defined by computation over the base tables. When you access a view, the database computes the data in the view—a handy form of encapsulation.

Tableau

is one of the best tools available for plotting data and clarifying narratives, both for ourselves and for others.

MongoDB

is optimized to store documents (semi-structured data).

"NoSQL" is applied to a database, it refers to an

ill-defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL.

Sharding benefits and non-benefits

It's useful to put aggregates together if you think they may be read in sequence. Sharding is particularly valuable for performance because it can improve both read and write performance. Sharding does little to improve resilience when used alone. Despite the fact that sharding is made much easier with aggregates, it's still not a step to be taken lightly. In any case, the step from a single node to sharding is going to be tricky.

Column-family models

divide the aggregate into column families, allowing the database to treat them as units of data within the row aggregate. This imposes some structure on the aggregate but allows the database to take advantage of that structure to improve its accessibility.

Polyglot persistence (definition)

using different data stores in different circumstances. Instead of just picking a relational database because everyone does, we need to understand the nature of the data we're storing and how we want to manipulate it. The result is that most organizations will have a mix of data storage technologies for different circumstances.

With a peer-to-peer replication cluster, you can ...

ride over node failures without losing access to data. Furthermore, you can easily add nodes to improve your performance. There's much to like here, but there are complications.

Explain storing data in a relational database

you first have to define a schema—a defined structure for the database which says what tables exist, which columns exist, and what data types each column can hold. Before you store some data, you have to have the schema defined for it.

In order to get read resilience....

you need to ensure that the read and write paths into your application are different, so that you can handle a failure in the write path and still read. This includes such things as putting the reads and writes through separate database connections—a facility that is not often supported by database interaction libraries. As with any feature, you cannot be sure you have read resilience without good tests that disable the writes and check that reads still occur.

Replication comes in two forms:

• Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.

NoSQL databases operate without a...

schema, allowing you to freely add fields to database records without having to define any changes in structure first. This is particularly useful when dealing with nonuniform data and custom fields, which forced relational databases to use names like customField6 or custom field tables that are awkward to process and understand.

Cassandra uses the terms ...

"wide" and "skinny." Skinny rows have few columns with the same columns used across the many different rows. In this case, the column family defines a record type, each row is a record, and each column is a field. A wide row has many columns (perhaps thousands), with rows having very different columns. A wide column family models a list, with each column being one element in that list.

ACID stands for ...

Atomic, Consistent, Isolated, and Durable

Characteristics of NoSQL

• They don't use the relational data model, and thus don't use the SQL language.
• They tend to be designed to run on a cluster.
• They tend to be open source.
• They don't have a fixed schema, allowing you to store any data in any record.

The key-value data model

treats the aggregate as an opaque whole, which means you can only do key lookup for the whole aggregate— you cannot run a query nor retrieve a part of the aggregate.

Polyglot Persistence

using multiple data storage technologies, chosen based upon the way data is being used by individual applications. Why store binary images in relational database, when there are better storage systems?

Where is unstructured data stored?

usually in data lakes, NoSQL databases, applications, and data warehouses

The backing store can be organized in all sorts of ways. Name some.

1. For many productivity applications (such as word processors), it's a file in the file system of the operating system.
2. For most enterprise applications, however, the backing store is a database. The database allows more flexibility than a file system in storing large amounts of data in a way that allows an application program to get at small bits of that information quickly and easily.

Big data is not the only reason we see project teams considering NoSQL databases. What is the other?

An equally important reason is the old frustration with the impedance mismatch problem. The big data concerns have created an opportunity for people to think freshly about their data storage needs, and some development teams see that using a NoSQL database can help their productivity by simplifying their database access even if they have no need to scale beyond a single machine.

What is the real point of ACID?

ACID is a rather contrived acronym; the real point is the atomicity: Many rows spanning many tables are updated as a single operation. This operation either succeeds or fails in its entirety, and concurrent operations are isolated from each other so they cannot see a partial update.

Aggregates have an important consequence for transactions. Relational databases allow you to manipulate any combination of rows from any tables in a single transaction. Such transactions are called

ACID transactions

Example of a process-driven application

An inventory control system that maintains stock levels against product SKUs is an ideal example because it operates using concrete information. The logic built on top of the database may be complex, but the records themselves are very simple.

Examples of unstructured data

Other examples of unstructured data include photos, video and audio files, text files, social media content, satellite imagery, presentations, PDFs, open-ended survey responses, websites and call center transcripts/recordings.

Structured data definition 2

Data that is the easiest to search and organize, because it is usually contained in rows and columns and its elements can be mapped into fixed pre-defined fields, is known as structured data. It can follow a data model that a database designer creates, and entities can be grouped together to form relations. This makes structured data easy to store, analyze, and search. Until recently it was the only data easily usable for businesses.

Examples of semi-structured data

Email messages are a good example. While the actual content is unstructured, it does contain structured data such as name and email address of sender and recipient, time sent, etc. Another example is a digital photograph. The image itself is unstructured, but if the photo was taken on a smart phone, for example, it would be date and time stamped, geo tagged, and would have a device ID. Once stored, the photo could also be given tags that would provide a structure, such as 'dog' or 'pet.'

The common characteristics of NoSQL databases are

Not using the relational model; running well on clusters; open-source; built for 21st-century web estates; schemaless.

Remember, there are two primary reasons for considering NoSQL

One is to handle data access with sizes and performance that demand a cluster; the other is to improve the productivity of application development by using a more convenient data interaction style.

The most important result of the rise of NoSQL is

Polyglot Persistence

The line between key-value and document gets a bit blurry. Give an example.

Riak allows you to add metadata to aggregates for indexing and interaggregate links, Redis allows you to break down the aggregate into lists or sets

This terminology is as established by Google Bigtable and HBase

Since the database knows about these common groupings of data, it can use this information for its storage and access behavior. Even though a document database declares some structure to the database, each document is still seen as a single unit. Column families give a two-dimensional quality to column-family databases.

Five things SQL databases do

Store persistent data; application integration; a mostly standard model; concurrency control; reporting.

Structured data is best suited to ______________________________ that rely on _______________________________________.

Structured data is best suited to process-driven applications that rely on specific information presented in a known, consistent format

Unstructured data and applications powered by ...

unstructured data tend to be more ambiguous: email clients that store messages of varying lengths that may include attachments, or presentation software that blends text, graphics, and multimedia content. Potentially high-value information is held in these assets, but it cannot be retrieved using regular text queries from a traditional relational database.

Most computer architectures have the notion of two areas of memory:

a fast volatile "main memory" and a larger but slower "backing store."

Clustered relational databases, such as the Oracle RAC or Microsoft SQL Server, work on the concept of

a shared disk subsystem. They use a cluster-aware file system that writes to a highly available disk subsystem—but this means the cluster still has the disk subsystem as a single point of failure.

Key-value, document, and column-family databases share a common characteristic of their data models, which we call

aggregate orientation

We said earlier on that key-value and document databases were strongly

aggregate-oriented. What we meant by this was that we think of these databases as primarily constructed through aggregates. Both of these types of databases consist of lots of aggregates with each aggregate having a key or ID that's used to get at the data.

"NoSQL"

first made its appearance in the late 90s as the name of an open-source relational database [Strozzi NoSQL]. Led by Carlo Strozzi, this database stores its tables as ASCII files, each tuple represented by a line with fields separated by tabs. The name comes from the fact that the database doesn't use SQL as a query language. Instead, the database is manipulated through shell scripts that can be combined into the usual UNIX pipelines. Other than the terminological coincidence, Strozzi's NoSQL had no influence on the databases we describe in this book.

Data warehouses and data lakes have become important ...

for big data analytics, providing a way to increase overall capacity using low-cost commodity storage.

Which influential NoSQL systems came out of Google and Amazon?

Google: Bigtable. Amazon: Dynamo.

A row in Cassandra only occurs ...

in one column family, but that column family may contain supercolumns—columns that contain nested columns. The supercolumns in Cassandra are the best equivalent to the classic Bigtable column families.

In Domain-Driven Design, an aggregate ...

is a collection of related objects that we wish to treat as a unit. In particular, it is a unit for data manipulation and management of consistency. Typically, we like to update aggregates with atomic operations and communicate with our data storage in terms of aggregates

Semi-structured data (definition 2)

is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is also known as a self-describing structure.
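
A minimal sketch of what "self-describing" can look like in practice, using JSON as the semi-structured format. The records and field names below are made up for illustration; the point is only that each record carries its own field names, and two records in the same collection need not share the same fields.

```python
import json

# Hypothetical sample data: two records with different fields.
raw = '''
[
  {"name": "Ann", "email": "ann@example.com"},
  {"name": "Bob", "phones": ["555-0100", "555-0101"], "address": {"city": "Oslo"}}
]
'''

records = json.loads(raw)

# Each record names its own fields, so no table schema is needed to read it.
fields_per_record = [sorted(record.keys()) for record in records]

# Missing fields are simply absent; the reader has to handle that case.
cities = [record.get("address", {}).get("city") for record in records]
```

Here the first record has no address, so looking up its city yields `None` rather than an error.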

Unstructured data (definition 2)

is data that cannot be contained in a row-column database and doesn't have an associated data model. Think of the text of an email message. The lack of structure makes unstructured data more difficult to search, manage, and analyse, which is why companies widely discarded it until the recent proliferation of artificial intelligence and machine learning algorithms made it easier to process.

Difference between structured and unstructured data, starting with unstructured: unstructured data ...

is more ambiguous. Without a predefined data model, you can store a far broader range of rich data including images, sound, video, and text. As the scope for storage increases, and as data becomes more complex and dynamic, so too does the difficulty with which you can search and analyze that information. Thankfully, there are modern data management platforms, such as MongoDB Atlas, that make it easier to store and process large amounts of unstructured data.

A consequence of wide column families?

is that a column family may define a sort order for its columns. This way we can access orders by their order key and access ranges of orders by their keys.
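
A rough sketch of why a sort order on columns makes range access cheap. `WideRow` is a hypothetical in-memory stand-in for one row of a wide column family, not the API of any real database, and the date-prefixed order keys are an assumption made for the example.

```python
from bisect import bisect_left, bisect_right


class WideRow:
    """One row of a hypothetical column family whose columns stay sorted by key."""

    def __init__(self):
        self._keys = []    # column keys, kept in sorted order
        self._values = {}  # column key -> value

    def put(self, column_key, value):
        if column_key not in self._values:
            # Insert the new key at its sorted position.
            self._keys.insert(bisect_left(self._keys, column_key), column_key)
        self._values[column_key] = value

    def get(self, column_key):
        return self._values[column_key]

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


orders = WideRow()
for key, total in [("20240105-0003", 30), ("20240101-0001", 10), ("20240103-0002", 20)]:
    orders.put(key, total)

# Bounds are compared as plain strings against the sorted column keys.
early_orders = orders.scan("20240101", "20240104")
```

Because the keys are already sorted, a range lookup is two binary searches and a slice rather than a scan over every column.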

The advantage of opacity (in a key-value store)

is that we can store whatever we like in the aggregate. The database may impose some general size limit, but other than that we have complete freedom.

Each NoSQL solution has a different model that it uses, which we put into four categories widely used in the NoSQL ecosystem:

key-value, document, column-family, and graph.

All operations in SQL consume and return relations, which ...

leads to the mathematically elegant relational algebra.

The document model

makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since the document has no schema, the database cannot act much on the structure of the document to optimize the storage and retrieval of parts of the aggregate.

Semi-structured data

A mix between structured and unstructured data. It has some defining or consistent characteristics but doesn't conform to a structure as rigid as that expected of a relational database. There are therefore some organizational properties, such as semantic tags or metadata, that make it easier to organize, but there's still fluidity in the data.

The "NoSQL" that we recognize today traces back ...

to a meetup on June 11, 2009 in San Francisco organized by Johan Oskarsson, a software developer based in London.

Column-family databases

organize their columns into column families. Each column has to be part of a single column family, and the column acts as a unit of access, with the assumption that data for a particular column family will usually be accessed together.

In order to make this polyglot world work, ....

our view is that organizations also need to shift from integration databases to application databases

Relational databases have been a successful technology for twenty years, providing ...

persistence, concurrency control, and an integration mechanism.

Structured data (also known as relational data)

refers to data that fits a predefined data model. It can be easily mapped into designated fields. A US ZIP code can be stored as a five digit string (e.g. 90210), a State as a two-character abbreviation (e.g. CA), etc.

NoSQL databases are designed to hold what data?

semi-structured data

The linear, controlled nature of structured data is best suited to

statistical-type big data analytics using similarly structured query language (SQL). If you want to know which product line sells best during the summer months or which manufacturing component is likely to fail next, a regular relational database will perform adequately.

What kinds of projects are candidates for polyglot persistence?

Strategic projects that need rapid time to market and/or are data intensive.

The two models differ in that in a key-value database, the aggregate is ... In contrast, a document database is able to see ...

the aggregate is opaque to the database: just some big blob of mostly meaningless bits. In contrast, a document database is able to see a structure in the aggregate.
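
The contrast can be sketched with two toy in-memory stores. Both classes are hypothetical stand-ins written for this example, not the interface of Riak, Redis, MongoDB, or any other product. The key-value store only ever sees bytes, while the document store keeps the parsed structure and can therefore query on a field.

```python
import json


class KeyValueStore:
    """Stores opaque bytes under a key; lookup is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob: bytes):
        self._data[key] = blob

    def get(self, key) -> bytes:
        return self._data[key]


class DocumentStore:
    """Stores parsed documents, so it can query on fields inside them."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document: dict):
        self._docs[key] = document

    def get(self, key) -> dict:
        return self._docs[key]

    def find(self, field, value):
        # Possible only because the store can see inside each aggregate.
        return [k for k, d in self._docs.items() if d.get(field) == value]


order = {"customer": "ann", "total": 42}

kv = KeyValueStore()
kv.put("order:1", json.dumps(order).encode("utf-8"))
# The application, not the store, has to decode the blob.
decoded = json.loads(kv.get("order:1"))

docs = DocumentStore()
docs.put("order:1", order)
docs.put("order:2", {"customer": "bob", "total": 7})
matches = docs.find("customer", "ann")
```

With the key-value store, finding every order for one customer would mean fetching and decoding each blob in application code; the document store can answer the same question itself.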

For application developers, the biggest frustration has been what's commonly called the impedance mismatch:

the difference between the relational model and the in-memory data structures
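
One way to picture the mismatch, as a sketch under assumed names: a nested in-memory order has to be split into flat rows for two hypothetical tables (orders and order lines) before it can be stored relationally, and stitched back together when it is read.

```python
# In-memory aggregate: one order with nested line items (sample data).
order = {
    "id": 99,
    "customer": "ann",
    "items": [
        {"product": "book", "qty": 2},
        {"product": "pen", "qty": 5},
    ],
}


def to_rows(order):
    """Split one nested order into rows for two hypothetical tables."""
    order_row = (order["id"], order["customer"])
    item_rows = [
        (order["id"], line_no, item["product"], item["qty"])
        for line_no, item in enumerate(order["items"], start=1)
    ]
    return order_row, item_rows


def from_rows(order_row, item_rows):
    """Reassemble the nested structure from the flat rows."""
    order_id, customer = order_row
    items = [
        {"product": product, "qty": qty}
        for _, _, product, qty in sorted(item_rows, key=lambda row: row[1])
    ]
    return {"id": order_id, "customer": customer, "items": items}


order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```

The translation code in both directions is the cost of the mismatch; an aggregate-oriented store would instead keep the nested order as a single unit.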

In conversation, the term "data model" often means

the model of the specific data in an application.

[Daigneau]

where applications would communicate over HTTP. Web services enabled a new form of a widely used communication mechanism—a challenger to using the SQL with shared databases. (Much of this work was done under the banner of "Service-Oriented Architecture"—a term most notable for its lack of a consistent meaning.)

storage model

which describes how the database stores and manipulates the data internally. In an ideal world, we should be ignorant of the storage model, but in practice we need at least some inkling of it, primarily to achieve decent performance.

a different approach is to treat your database as an application database

which is only directly accessed by a single application codebase that's looked after by a single team. With an application database, only the team using the application needs to know about the database structure, which makes it much easier to maintain and evolve the schema. Since the application team controls both the database and the application code, the responsibility for database integrity can be put in the application code.

integration database

with multiple applications, usually developed by separate teams, storing their data in a common database. This improves communication because all the applications are operating on a consistent set of persistent data.

an application database

gives you more freedom in choosing a database. Since there is a decoupling between your internal database and the services with which you talk to the outside world, the outside world doesn't have to care how you store your data, allowing you to consider nonrelational options.

