Preface

We wrote this book for data scientists and data engineers familiar with Python and pandas who are looking to handle larger-scale problems than their current tooling allows. Current PySpark users will find that some of this material overlaps with their existing knowledge of PySpark, but we hope they still find it helpful, and not just for getting away from the Java Virtual Machine (JVM).

If you are not familiar with Python, some excellent O’Reilly titles include Learning Python and Python for Data Analysis. If you and your team are more frequent users of JVM languages (such as Java or Scala), while we are a bit biased, we’d encourage you to check out Apache Spark along with Learning Spark (O’Reilly) and High Performance Spark (O'Reilly).

This book is primarily focused on data science and related tasks because, in our opinion, that is where Dask excels the most. If you have a more general problem that Dask does not seem to be quite the right fit for, we would (with a bit of bias again) encourage you to check out Scaling Python with Ray (O’Reilly), which has less of a data science focus.

A Note on Responsibility

As the saying goes, with great power comes great responsibility. Dask and tools like it enable you to process more data and build more complex models. It’s essential not to get carried away with collecting data simply for the sake of it, and to stop to ask yourself if including a new field in your model might have some unintended real-world implications. You don’t have to search very hard to find stories of well-meaning engineers and data scientists accidentally building models or tools that had devastating impacts, such as increased auditing of minorities, hiring algorithms discriminating based on gender, or subtler things like biases in word embeddings (a way to represent the meanings of words as vectors). Please use your newfound powers with such potential consequences in mind, for one never wants to end up in a textbook for the wrong reasons.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Tip	This element signifies a tip or suggestion.

Note	This element signifies a general note.

Warning

This element indicates a warning or caution.

Online Figures

Print readers can find larger, color versions of some figures at https://oreil.ly/SPWD-figures. Links to each figure also appear in their captions.

License

Once published in print and excluding O’Reilly’s distinctive design elements (i.e., cover art, design format, “look and feel”) or O’Reilly’s trademarks, service marks, and trade names, this book is available under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. We’d like to thank O’Reilly for allowing us to make this book available under a Creative Commons license and hope that you will choose to support this book (and us) by purchasing several copies (it makes an excellent gift for whichever holiday season is coming up next).

Using Code Examples

The Scaling Python Machine Learning GitHub repo contains the majority of the examples in this book. They are mainly under the dask directory, with more esoteric parts (such as the cross-platform CUDA container) found in separate top-level directories.

If you have a technical question or a problem using the code examples, please email support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Scaling Python with Dask by Holden Karau and Mika Kimmins (O’Reilly). Copyright 2023 Holden Karau and Mika Kimmins, 978-1-098-11987-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

Note	For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/scaling-python-dask.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Follow us on Twitter: https://twitter.com/oreillymedia

Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

This is a book written by two trans immigrants living in America at a time when the walls can feel like they’re closing in. We choose to dedicate this book to those fighting for a more just world in whichever way, however small—thank you. To all those we lost or didn’t get to meet, we miss you. To those we have yet to meet, we are excited to meet you.

This book would not exist if not for the communities it is built on. From the Dask community to the PyData community, thank you. Thank you to all the early readers and reviewers for your contributions and guidance. These reviewers include Ruben Berenguel, Tom Drabas, Adam Breindel, Kevin Kho, John Iannone, Joseph Gnanaprakasam, Jess Males, and many more. A special thanks to Ann Spencer for reviewing the early proposals of what eventually became this and Scaling Python with Ray. Any remaining mistakes are entirely our fault, sometimes going against reviewers' advice.^[1]

Holden would also like to thank her wife and partners for putting up with her long in-the-bathtub writing sessions. A special thank you to Timbit for guarding the house and generally giving Holden a reason to get out of bed (albeit often a bit too early for her taste).

Mika would additionally like to thank Holden for her mentorship and help, and give a shout-out to her colleagues at the Harvard data science department for providing her with unlimited free coffee.

1. We are sometimes stubborn to a fault.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

preface.asciidoc

preface.asciidoc

Preface

A Note on Responsibility

Conventions Used in This Book

Online Figures

License

Using Code Examples

O’Reilly Online Learning

How to Contact Us

Acknowledgments

Files

preface.asciidoc

Latest commit

History

preface.asciidoc

File metadata and controls

Preface

A Note on Responsibility

Conventions Used in This Book

Online Figures

License

Using Code Examples

O’Reilly Online Learning

How to Contact Us

Acknowledgments