Gradient Flow #45: Top Places to Work for Data Scientists; Model Serving; Tuning Language Models

Subscribe • Previous Issues

“There's no sense in being precise when you don't even know what you're talking about.” - John von Neumann

Data Exchange podcast

  • Deploying Machine Learning Models Safely and Systematically  Hamel Husain is a Staff Machine Learning Engineer at GitHub and a core developer for fastai. 

  • Machine Learning in Astronomy and Physics    Dr. Viviana Acquaviva, Associate Professor at the CUNY Graduate Center, is an Astrophysicist with a strong interest in Data Science and Machine Learning.

  • Large-scale machine learning and AI on multi-modal data    Bob Friday is VP and CTO at Mist Systems, a Juniper company. His team uses data, machine learning, and AI to “optimize user experiences and simplify operations across the wireless access, wired access, and SD-WAN domains”. They’ve deployed deep learning models for anomaly detection, as well as virtual assistants that provide insight and guidance to IT staff via a conversational interface.

[Photo by Paul Skorupskas on Unsplash.]

Data & Machine Learning Tools and Infrastructure

  • Immediate 3X serving speed up with Ray Serve   Ray Serve is quietly becoming one of the more popular open source libraries for model serving. Learn how Wildlife Studios - one of the largest mobile gaming companies in the world - successfully deployed Ray Serve to deliver in-game offers. 

  • cleanlab: machine learning with noisy labels   An open source library for confident learning, an approach that involves pruning noisy data (as opposed to fixing label errors), and ranking examples to train with confidence.

  • Designing data ingestion pipelines   ML practitioners understand that scaling data ingestion pipelines is crucial, and that inefficiencies at this stage can cripple training throughput. Through the lens of deep learning for recommendation systems, a team from Facebook and Stanford presents an architecture for end-to-end training data ingestion. 

  • Zingg    We live in an age where companies have data in disparate systems. In this context, scalable entity resolution and master data management systems bring tremendous benefits to downstream analytic and machine learning applications. Zingg is a new open source library for large-scale entity resolution. It’s built on top of Apache Spark.

Recommendations


If you enjoyed this newsletter please support our work by encouraging your friends and colleagues to subscribe:


Ben Lorica edits the Gradient Flow newsletter. He is co-chair of the Ray Summit, external chair of the NLP Summit, and host of the Data Exchange podcast. You can follow him on Twitter @BigData. This newsletter is produced by Gradient Flow.

Gradient Flow #44: 2021 NLP Industry Survey Results; No-Code Landscape


“Life is like an ice cream cone. It’s melting so you better eat it before it’s too late.” - Lila.GPT3

FREE Report: 2021 NLP Industry Survey Results

Read this new report and learn how companies are building language applications today. The 2021 NLP Industry Survey was a global online survey that drew a total of 655 respondents from over fifty countries.



Data Exchange podcast

  • How To Lead In Data Science   We speak with Jike Chong and Yue Cathy Chang about their newly released field guide for data scientists at all career stages. They offer specific guidance for those who are looking to manage teams or work towards a seat at their company’s top leadership table. 

  • The unreasonable effectiveness of multiple dispatch   My annual check-in on the state of Julia with Viral Shah, Co-founder and CEO of Julia Computing.


Data & Machine Learning Tools and Infrastructure

  • Hugging Face NLP Datasets  An open source library that provides one-line data loaders for many public datasets, and efficient data pre-processing.

  • TextDistance   A Python library for comparing distance between two or more sequences using Hamming, Levenshtein, and many other algorithms.

  • Merlion, from Salesforce, is a new open source machine learning library for time series. I’ve been using Greykite, but I’m always on the lookout for time series tools and can’t wait to try this one out.

  • Introducing Ant Ray Serving    An online service framework that provides users with a serverless platform to publish Java/Python code as online services. Ant Ray Serving usage within Ant Group has reached an impressive 60,000 cores and 5,000 nodes. As of June 2021, Ant Group had the largest Ray cluster in production (200,000+ CPU cores).

  • Performance Improvements in Databricks SQL    Powered by Photon, a new native vectorized engine, DB SQL lets you operate a multi-cloud lakehouse architecture.
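
The TextDistance item above mentions Hamming and Levenshtein distances. As a rough illustration of what those two measures compute, here is a minimal pure-Python sketch. It is not the TextDistance implementation or its API; the function names `hamming` and `levenshtein` are chosen here for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Hamming distance only applies to sequences of equal length, while Levenshtein distance also handles insertions and deletions, which is why it is the more common choice for fuzzy string matching.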

[Image: Half Dome, Yosemite National Park from Wikimedia.]

Recommendations

  • Taking Low-Code and No-Code Development to the Next Level   In this new post with Assaf Araki of Intel Capital, we describe exciting new tools that aim to boost developer productivity and agility, while also expanding the pool of people who can build software applications.

  • The Future is Big Graphs   Graphs have become a key abstraction in today's data-processing pipelines. This ACM survey - particularly the section on Ecosystems - will help you get up to speed on tools for working with graphs. 



Gradient Flow #43: Graph Databases; Language Understanding; Program Synthesis


“Done is better than perfect.” - Sheryl Sandberg

Data Exchange podcast

  • The State of Data Journalism    A conversation with Tara Kelly, Data Editor at DataJournalism.com (DJC), an organization created by the European Journalism Centre. DJC provides journalists and media groups with free resources, materials, online video courses, and community forums. 

  • Why Graph Databases and Graph Analytics are hot again   Our friend Paco Nathan has been doing a lot of work with graphs and as such he’s had to immerse himself in the world of graph data management. This conversation is focused on what’s new with graph databases, use cases of graph databases, graph analytics, and graph neural networks.


Featured FREE Virtual conference

I’m once again the co-chair of the NLP Summit and we have another great lineup for you this year. We have speakers and case studies from leading organizations including Hugging Face, Stanford NLP and Stanza, Spark NLP, Morgan Stanley, Microsoft Research, Eleuther, and AI21 Labs - creators of the largest language model available to developers.



Data & Machine Learning Tools and Infrastructure

  • The Data Lakehouse :: FAQ    A data management paradigm that we first introduced last year is quietly and steadily gaining traction.

  • Data Validation Tool   In our soon-to-be-released Data Engineering Survey, respondents cited Data Quality and Data Validation among the key challenges facing their data teams. This newly open sourced library from Google is a Python tool that provides an automated and repeatable solution for data validation across different environments. 

  • Darts   An open source Python library for easy manipulation and forecasting of time series. Among other things, Darts lowers the barrier to using deep learning models for forecasting and allows you to train on many (thousands or more) possibly multi-dimensional time series.

  • Whale: Scaling Deep Learning Model Training to the Trillions

  • River   An open source Python library for online machine learning.
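
To make "online machine learning" in the River item concrete: an online model updates itself one observation at a time instead of retraining on a full dataset. The sketch below is a hypothetical, standard-library-only illustration of that idea using stochastic gradient descent on a linear model. The class `OnlineLinearRegression` is invented for this example and is not River's code; the `learn_one` / `predict_one` method names simply follow the one-sample-at-a-time style that River uses.

```python
class OnlineLinearRegression:
    """Linear regression trained one observation at a time with SGD."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.weights = {}  # feature name -> weight, created lazily
        self.bias = 0.0

    def predict_one(self, x):
        # x is a dict of feature name -> value.
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One gradient step on the squared error for this single sample.
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error
        return self


# Simulated stream drawn from y = 2x + 1; the model never sees a full batch.
model = OnlineLinearRegression(lr=0.1)
for step in range(3000):
    x = (step % 11) / 10
    model.learn_one({"x": x}, 2 * x + 1)
```

Because each update touches only the current sample, memory use stays constant no matter how long the stream runs, which is the main appeal of this style of learning for data that arrives continuously.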

[Image: Examples of real-world situations that can cause a model to degrade]

Recommendations


Closing short: Solo Band



Gradient Flow #42: Data Quality; Oscilloscope for Deep Learning; Feature Stores


“The slow philosophy is not about doing everything in tortoise mode. It's less about the speed and more about investing the right amount of time and attention in the problem so you solve it.” - Carl Honoré

Data Exchange podcast

[Image: Dana King’s “Monumental Reckoning”, by BL]

Data & Machine Learning Tools and Infrastructure

  • Data Quality Unpacked   Kenn So (of Shasta Ventures) and I list some new solutions and startups, and we describe key features to look for in a data quality solution.

  • How Ikigai Labs Serves Interactive AI Workflows at Scale using Ray Serve  “Ray Serve can serve not only the various deep learning models, but also arbitrary Python code in a distributed manner. Since one of the biggest missions in the Ikigai data pipeline is to run user’s arbitrary Python code at scale with interactivity, Ray Serve provided answers to many challenges we faced as it enabled us to serve users’ code with real-time interaction.”

  • What feature stores are and how they are used today    A short overview from a group of UC Berkeley PhD students. A recent VLDB tutorial by a team from Stanford, Apple, and Uber also contains a good description of feature stores. The VLDB tutorial instructors predict that next-generation feature stores will provide native support for embeddings (derived data in the form of low-dimensional, learned continuous vector representations). This would require tools for searching and querying embeddings as well as support for versioning, provenance, and downstream quality metrics.

  • Jurassic-1 Jumbo is the largest model ever made available to developers   J1-Jumbo is a 178B-parameter model, and J1-Large is a 7B-parameter model.

  • Hora is an approximate nearest neighbor search algorithm written in Rust that comes with a Python API.

  • HashiCorp State of Cloud Strategy Survey “76% are already multi-cloud.”  As I pointed out in a short post last year, surveys consistently show that a vast majority of respondents work at companies that use multiple clouds.
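
The feature store item above mentions searching and querying embeddings, and Hora addresses the approximate version of that problem. For reference, the exact (brute-force) search that approximate methods try to speed up can be sketched in a few lines of standard-library Python. The helper names `cosine_similarity` and `top_k` and the toy `embeddings` dictionary are made up for this illustration and do not come from any of the tools listed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_k(query, embeddings, k=2):
    """Return the keys of the k stored embeddings most similar to query."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [key for key, _ in scored[:k]]


# Toy store: entity id -> embedding vector (hypothetical data).
embeddings = {
    "user_a": [1.0, 0.0],
    "user_b": [0.9, 0.1],
    "user_c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.0], embeddings, k=2)
```

This scan compares the query against every stored vector, so its cost grows linearly with the number of embeddings; approximate nearest neighbor indexes trade a little accuracy to avoid that full scan.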

[Image: A Sample of Data Quality Startups]

Recommendations


Closing Short: Dance, Dance, Dance



Gradient Flow #41: What’s New in Data Engineering; MLOps Anti-Patterns


“Time matters most when decisions are irreversible.”  - Peter Bernstein

Data Exchange podcast

[Image: Langkawi Sky Bridge from pxfuel.]

Data & Machine Learning Tools and Infrastructure

[Image: The Dahlia Garden by BL]

Recommendations


Closing Short: #Mesmerizing

