Gradient Flow #32: Data Cascades, Demand for Data Engineers, Exploiting ML models

Ben Lorica 罗瑞卡

Apr 08, 2021


This edition has 428 words, which will take you about 2 minutes to read.

“I would believe only in a god who could dance.” - Friedrich Nietzsche.

Data Exchange podcast

Machine Learning in Healthcare I speak with Parisa Rashidi, Associate Professor at the Department of Biomedical Engineering and Director of the Intelligent Health Lab at the University of Florida.
Data quality is the key to great AI products and services Abe Gong is the CEO and co-founder of Superconductive, a startup founded by the team behind the Great Expectations (GE) open source project. GE is one of a growing number of tools that aim to improve data quality through validation and testing.

Featured Virtual Event

I am co-chair of Ray Summit, a FREE conference that brings together developers, engineers, data scientists, and architects interested in machine learning, AI, and other compute-intensive applications. The Ray community and ecosystem have significantly expanded since last year’s conference and we have another outstanding series of keynotes, talks, and tutorials for you.

Data & Machine Learning tools and infrastructure

Data Cascades: Why we need feedback channels throughout the machine learning lifecycle I wrote a post on an important new study from Google Research. Underinvestment in data quality and related issues can have compounding consequences and add to technical debt over time. This important qualitative study is based on interviews with data practitioners in high-stakes domains such as healthcare, humanitarian settings, and climate change. We all need to incentivize people to focus more on tasks related to data quality, DataOps, and overall “data excellence”. As one of the participants of the study observed: “Everyone wants to do the model work, not the data work.”
How Amazon uses Ray for petabyte-scale data processing I am really looking forward to this technical presentation at the upcoming Ray Summit.
Prisma Migrate is now GA Prisma Migrate is the database schema migration tool of Prisma, an open source, next-generation ORM.
Julia 1.6 Highlights
Spark NLP 3.0 Release Notes
Exploiting machine learning pickle files Pickle is a popular Python module for serializing and deserializing objects and is widely used for ML models. The author describes an open source tool that can be used to reverse engineer, test, and create malicious pickle files.
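To make the pickle risk concrete, here is a minimal sketch using only Python's standard library. It does not use the tool described in the post. The class names `Payload` and `NoGlobalsUnpickler` and the helper `restricted_loads` are hypothetical names chosen for illustration, and the harmless built-in `len` stands in for whatever callable an attacker might pick. It shows that loading a pickle can invoke a callable chosen by whoever wrote the file, and that a restricted unpickler (following the "restricting globals" pattern in the Python documentation) can refuse such files.

```python
import io
import pickle


class Payload:
    """Hypothetical stand-in for a tampered 'model' object."""

    def __reduce__(self):
        # pickle stores (callable, args) and calls callable(*args) on load.
        # len is harmless; an attacker could name any importable callable.
        return (len, ("abc",))


blob = pickle.dumps(Payload())

# Loading runs len("abc"); we get its return value, not a Payload instance.
loaded = pickle.loads(blob)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global lookup (illustrative policy)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Load a pickle made only of plain containers, strings, and numbers."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


# Plain data needs no global lookups, so it still loads.
plain = restricted_loads(pickle.dumps({"weights": [0.1, 0.2]}))

# The tampered blob references builtins.len, so it is rejected.
try:
    restricted_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Forbidding all globals is too strict for most real model files, which reference library classes. An allowlist inside `find_class` is the usual compromise, and it is still no substitute for loading pickles only from sources you trust.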

[Image: Flatiron Building in Shanghai, by SGL]

Funding Updates

Streamlit announces $35M Series B Streamlit is a great tool for building and sharing machine learning applications.
Flatfile raises $35M Series A
Snorkel announces $35M Series B and launches Application Studio, a visual builder for machine learning applications.

Recommendations

One Simple Chart: Data Engineering jobs in the U.S. In addition, interest in reinforcement learning remains robust, and I compare the demand for TensorFlow and PyTorch.
The Algorithms That Make Instacart Roll A wonderful overview from one of the leading grocery delivery companies in the US.
A conversation about machine learning in mental health applications Set aside the hype about full automation. As in other domains, near-term applications aim to augment and assist therapists. Also see this recent survey paper on conversational agents for mental health.
Baserow: self-hosted Airtable alternative Also see the GitHub repo.

Closing Short: If you enjoy Japanese food or sushi, you’ll enjoy this illuminating documentary from NHK.

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.

Ben Lorica edits the Gradient Flow newsletter. He is co-chair of the Ray Summit, external chair of the NLP Summit, and host of the Data Exchange podcast. You can follow him on Twitter @BigData. This newsletter is produced by Gradient Flow.