DDIA_Reading_Notes
郭旭升 Lv6

Time

  • 2023/02/25
    (How time flies!)

Structure of the book

This book is divided into three parts.

  • Part I. Foundations of Data Systems: discusses the fundamental ideas that underpin the design of data-intensive applications.
  • Part II. Distributed Data: moves from data stored on one machine to data that is distributed across multiple machines.
  • Part III. Derived Data: discusses systems that derive some datasets from other datasets.

Current reading progress

General ideas of part I

The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of
machines:

  1. Chapter 1 introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like reliability, scalability, and maintainability, and how we can try to achieve these goals.

  2. Chapter 2 compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of
    view. We will see how different models are appropriate to different situations.

  3. Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads,
    and choosing the right one can have a huge effect on performance.

  4. Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.

Part I - Chapter 1

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)
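
As a small illustration of the cache building block from the list above, here is a minimal sketch in Python. The function `expensive_lookup` and the counter `call_count` are made up for this note (they are not from the book); the cache itself is the standard library's `functools.lru_cache`.

```python
import functools
import time

call_count = 0  # hypothetical counter: how often the slow path really runs


@functools.lru_cache(maxsize=128)
def expensive_lookup(key: str) -> str:
    """Stand-in for an expensive operation, e.g. a slow database query."""
    global call_count
    call_count += 1
    time.sleep(0.01)  # simulate the cost of the real operation
    return key.upper()


first = expensive_lookup("user:42")   # computed, then remembered
second = expensive_lookup("user:42")  # answered from the cache
```

The second call never reaches the function body, which is the whole point: reads get faster at the price of keeping earlier results in memory.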

Reliability

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

For software, typical expectations include:

  • The application performs the function that the user expected.
  • It can tolerate the user making mistakes or using the software in unexpected ways.
  • Its performance is good enough for the required use case, under the expected load and data volume.
  • The system prevents any unauthorized access and abuse.

Scalability

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

  • Main concepts:
  1. Describing Load:
    Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web
    server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average
    case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

  2. Describing Performance
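
To make the load parameters from item 1 more concrete, here is a rough sketch that derives a few of them from a list of request events. The `Event` record and the `load_parameters` function are my own hypothetical names, and a real system would read these numbers from its monitoring rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Event:
    second: int      # time of the request, in whole seconds (assumed format)
    kind: str        # "read" or "write"
    cache_hit: bool  # whether the cache answered the request


def load_parameters(events):
    """Summarize a non-empty list of events into a few load parameters."""
    if not events:
        raise ValueError("need at least one event")

    per_second = {}
    for e in events:
        per_second[e.second] = per_second.get(e.second, 0) + 1

    # Length of the observed time window, counting idle seconds too.
    window = max(per_second) - min(per_second) + 1
    reads = sum(1 for e in events if e.kind == "read")
    writes = sum(1 for e in events if e.kind == "write")
    hits = sum(1 for e in events if e.cache_hit)

    return {
        "avg_requests_per_second": len(events) / window,    # the average case
        "peak_requests_per_second": max(per_second.values()),  # the extreme case
        "read_write_ratio": reads / writes if writes else float("inf"),
        "cache_hit_rate": hits / len(events),
    }


sample = [
    Event(0, "read", True),
    Event(0, "read", False),
    Event(0, "write", False),
    Event(1, "read", True),
]
params = load_parameters(sample)
```

Reporting both the average and the peak reflects the remark above that sometimes the average case matters and sometimes a small number of extreme cases dominates.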

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a
human analyzes the capacity and decides to add more machines to the system).
An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.
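
The decision an elastic system makes automatically can be sketched as a simple sizing rule. This is only an illustration under assumptions of mine: the function `desired_machines`, the per-machine capacity, and the minimum and maximum bounds are all hypothetical, and real autoscalers use more signals than a single request rate.

```python
import math


def desired_machines(requests_per_second: float,
                     capacity_per_machine: float,
                     min_machines: int = 1,
                     max_machines: int = 10) -> int:
    """Return how many machines to run for the observed load.

    All parameters are illustrative assumptions, not values from the book.
    """
    if capacity_per_machine <= 0:
        raise ValueError("capacity_per_machine must be positive")
    # Enough machines to cover the load, rounded up.
    needed = math.ceil(requests_per_second / capacity_per_machine)
    # Clamp to the allowed range so the fleet never vanishes or explodes.
    return max(min_machines, min(max_machines, needed))


normal = desired_machines(250, 100)   # 250 req/s at 100 req/s per machine
idle = desired_machines(0, 100)       # no traffic: stay at the minimum
spike = desired_machines(5000, 100)   # huge spike: capped at the maximum
```

In a manually scaled system, a human would run the same kind of calculation by hand and decide when to add machines.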

Maintainability

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Paying particular attention to three design principles for software systems will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.

  • Operability
    Make it easy for operations teams to keep the system running smoothly.
  • Simplicity
    Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
  • Evolvability
    Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

Part I - Chapter 2 - Data Models and Query Languages

The limits of my language mean the limits of my world.
—Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
Couldn't agree more! Keep learning, so that language does not become a barrier!
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on
how we think about the problem that we are solving.

What’s good?

  • Technology can be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters.
  • Fortunately, behind the rapid changes in technology, there are enduring principles
    that remain true, no matter which version of a particular tool you are using. If you
    understand those principles, you’re in a position to see where each tool fits in, how to
    make good use of it, and how to avoid its pitfalls.

Q&A (General ideas about this book!)

What is a data-intensive application?

An application is data-intensive if data is its primary challenge (the quantity of data, the complexity of data, or the speed at which it is changing), as opposed to compute-intensive, where CPU cycles are the bottleneck.

What will you learn from this book?

• You want to learn how to make data systems scalable, for example, to support
web or mobile apps with millions of users.

• You need to make applications highly available (minimizing downtime) and
operationally robust.

• You are looking for ways of making systems easier to maintain in the long run,
even as they grow and as requirements and technologies change.

• You have a natural curiosity for the way things work and want to know what
goes on inside major websites and online services. This book breaks down the
internals of various databases and data processing systems, and it’s great fun to
explore the bright thinking that went into their design.

Scope of this book?

The book discusses the various principles and trade-offs that are fundamental to data systems, and explores the different design decisions taken by different products.

References

https://github.com/ept/ddia-references
