Gradient Flow: ML in Finance, Disinformation, AI Superpowers

Subscribe • Previous Issues

This edition has 870 words which will take you about 5 minutes to read.

“We’re going to need technological solutions, but I don’t think they’re going to solve the problem.”  - Hany Farid

Data Exchange podcast

  • Identifying and mitigating liabilities and risks associated with AI   As AI and machine learning become more widely deployed, lawyers and technologists need to collaborate more closely to identify and mitigate the liabilities and risks that come with them.  Andrew Burt is the Chief Legal Officer at Immuta and co-founder and Managing Partner of BNH.ai, the first law firm run by lawyers and technologists focused on helping companies identify and mitigate those risks.

  • How machine learning is being used in quantitative finance  Quants use ML alongside other modeling techniques. While ML clearly makes sense for extracting signals from unstructured and nontraditional sources of data, I’ve long wondered how much quants use modern tools like deep learning or reinforcement learning. In this episode, Arun Verma, Head of Quantitative Research Solutions at Bloomberg, describes the growing use of ML in finance.


Machine Learning tools and infrastructure


Virtual Conferences

  • Hany Farid at the Spark+AI Summit   The battle between teams that generate and detect deepfakes has profound implications for the general public and for policy makers. Hany Farid is considered by many to be the “father of digital forensics”, a field that now finds itself at the center of the battle against deepfakes. At this year’s conference, he will provide an overview of how deepfakes are created and describe emerging techniques for detecting them. This is a FREE event, register here.

  • Pulsar Summit   A free virtual conference centered on Apache Pulsar, a fast-growing messaging system originally developed by Yahoo. This must-attend event for data engineers and data architects takes place June 17-18.

  • The Future of Transfer Learning in NLP  A survey talk by Thomas Wolf of Hugging Face, one of the most popular natural language processing libraries, and one of my favorite open source tools. The beauty of Hugging Face is that it makes the complex research Thomas describes in his talk available to developers: its engineers learn and implement cutting-edge research, and developers benefit through easy-to-use libraries. Learn how to fine-tune language models by attending the free event - The Road to AutoML - register here.


Work and Hiring

[Image: Entering the Vortex 1 by Dean Wampler.]

Recommendations

  • Verification Handbook  A free ebook with contributions from journalists who have had to wade through user-generated content. The list of verification tools and strategies in this book is worth scanning. We are awash in disinformation and we are about to start a contentious phase in the Presidential election campaign in the US. 

  • Computer algorithms that scan everything spit out flawed tenant screening reports   You’ve heard me talk about the importance of model governance and of tools for managing risks in ML; that’s precisely because models are being used in many important settings. Granted, this example is less about models and more about data unification, but companies in this space probably depend on entity resolution tools that combine rules and models.

  • Symbolic Mathematics Finally Yields to Neural Networks  An accessible overview of recent research into the use of machine learning to solve complex integrals and differential equations. While there is much to be excited about, these are preliminary results: the deep learning based solver tackled only equations in one variable that involve elementary functions. The results are promising enough that I expect more researchers to focus on applications of ML to symbolic math. Look for AI to augment research mathematicians - not just Wall Street quants and traders - in the near future!

  • Protesters sing Lean on Me  The late Bill Withers is watching over all of us.


If you enjoyed this newsletter please support our work by encouraging your friends and colleagues to subscribe:


Ben Lorica edits the Gradient Flow newsletter. He is the Program Chair of the Spark+AI Summit, co-chair of the Ray Summit, and host of the Data Exchange podcast. You can follow him on Twitter @BigData.  This newsletter is produced by Gradient Flow.

Gradient Flow: Scalability, Privacy, and AutoML


This edition has 716 words which will take you about 4 minutes to read.

“Hope is the thing with feathers...” - Emily Dickinson

Data Exchange podcast

  • Improving performance and scalability of data science libraries  Wes McKinney created the pandas project in 2008 and over time it has become one of the most popular tools in data science. In this episode we discussed his approach to growing an open source project, and his current focus on sustaining the development of Apache Arrow. 

  • Understanding machine learning model governance  As machine learning becomes widely deployed, organizations will need to develop processes and tools to ensure that models behave as intended. Harish Doddi describes how companies can work towards having the right set of controls and validation steps in place.  

  • High-quality Transcripts  We are happy to announce that we are beginning to produce high-quality transcripts for some episodes. Our transcripts are PDF files that are free to download. The growing collection of transcripts can be found here.


Machine Learning tools and infrastructure


Virtual Conferences

  • Presto, virtual book tour  Our friend and colleague Paco Nathan is hosting a free online panel with the authors of “Presto: The Definitive Guide”.  Presto is a very popular open source, distributed SQL engine that came out of Facebook. This takes place on May 27th, register here.

  • The road to AutoML  Hear from experts building solutions for AutoML’s key building blocks - hyperparameter tuning and neural architecture search. This free virtual event takes place June 10th, register here.

  • Nate Silver at the Spark+AI Summit   With the US Presidential elections taking place later this year, we are happy to have Nate Silver, the leading election forecaster in the US. He will give a technical talk on the nuts and bolts of building complex election forecasting models, and how to convey probabilities and uncertainty to the general public. This is a FREE event, register here.


Work and Hiring

  • My First Year as a Freelance AI Engineer  A really good overview peppered with practical advice useful to all aspiring freelancers (not just engineers).

  • SQL Interview Questions  Originally written for people interviewing for data analyst or data scientist positions, this is also a handy guide for ETL/data engineers.   

  • The reason Zoom calls drain your energy.  I’m hopeful that over time, companies and managers will limit video calls, or end them earlier to give people breaks in between. We all seem to accept that virtual conferences have to be shorter than live events, so why should video calls be any different?


Recommendations

  • The Shrink Next Door  A gripping podcast serial from Bloomberg’s Joe Nocera, who for many years wrote a business column for the NYTimes.

  • You are messing with magic   The advertising industry assumes that ad companies always benefit from more data and fancier models. This essay highlights that selection effects are frequently stronger than the advertising effect alone.

  • Flash Crash  My first job post-academia was as a lead quant in a small hedge fund. I’ve since followed the industry from afar and I have read my share of books about traders and trading. This newly released book is going to be a classic, and the advance endorsements that prompted me to read it are well deserved.

  • Michael Franti's virtual concerts  This popular SF Bay Area musician decided to ride out the initial part of the pandemic in Bali, Indonesia.    
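
For readers who want to see the selection-effect argument from the advertising essay in numbers, here is a minimal Python sketch. Everything in it is an assumption made for illustration (the `simulate_campaign` function, the baseline conversion scale, the one-point true lift); none of it is taken from the essay. Ads are shown more often to users who were already likely to buy, so a naive comparison of exposed and unexposed users overstates what the ad itself did.

```python
import random


def simulate_campaign(n_users=100_000, true_lift=0.01, seed=0):
    """Return the naive lift estimate from a toy targeted-ad campaign.

    All numbers are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    shown = not_shown = 0
    bought_shown = bought_not_shown = 0
    for _ in range(n_users):
        intent = rng.random()              # latent purchase intent in [0, 1)
        sees_ad = rng.random() < intent    # targeting favors high-intent users
        # Purchase probability: baseline driven by intent, plus the true ad lift.
        p_buy = 0.2 * intent + (true_lift if sees_ad else 0.0)
        buys = rng.random() < p_buy
        if sees_ad:
            shown += 1
            bought_shown += buys
        else:
            not_shown += 1
            bought_not_shown += buys
    # Naive estimate: conversion rate of exposed minus unexposed users.
    return bought_shown / shown - bought_not_shown / not_shown


if __name__ == "__main__":
    naive = simulate_campaign()
    print(f"true lift: 0.010, naive estimate: {naive:.3f}")
```

In this toy setup the naive estimate comes out several times larger than the true lift, because exposed users had higher intent to begin with. A randomized holdout group is the usual way to remove that gap.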



Deep Learning Platform, TinyML, Privacy ↔ Contact Tracing


Data Exchange podcast

  • An open source platform for training deep learning models  Evan Sparks describes the newly open sourced Determined Training Platform for training deep learning models. It includes components for distributed training and hyperparameter tuning, experiment tracking and tools for collaboration and governance, a scheduler specialized for DL workflows, and more.  

  • Why TinyML will be huge  Pete Warden explains the enormous impact of deep learning on embedded systems in tiny devices, and why he has chosen to focus on building tools for ultra-low-power systems.

  • Human-in-the-loop machine learning  Rob Munro describes his long standing interest in ML and NLP, and building real-world human-in-the-loop systems. He shares insights about creating an effective customer-centered view of ML-based products.


Machine Learning tools and infrastructure

  • Introducing RaySGD   Distributed model training is difficult to set up and expensive to run. RaySGD is a new library that makes distributed PyTorch and TensorFlow model training simple and cheap.

  • A system for massively parallel hyperparameter tuning   An interesting new paper from the group of researchers behind Hyperband and other state-of-the-art hyperparameter tuning algorithms. Their new algorithm outperforms existing state-of-the-art hyperparameter optimization methods, and is suitable for massive parallelism.

  • Machine learning and microcontrollers   Google’s Pete Warden has a great series of screencast videos on how to enable ultra-low power machine learning at the edge.  No machine learning or microcontroller experience is necessary, and you can train models small enough to fit into any environment.

  • Reinforcement Learning in Public Policy   A group of researchers from Harvard and Salesforce Research used RLlib to derive AI-driven tax policies based on economic simulations. RL is used to model interactions between different players in the economy, including workers and governments.
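
Hyperband, mentioned above, is built around successive halving: many configurations get a small training budget, and only the best fraction is promoted to a larger one. The sketch below is a plain synchronous version of that building block, not the parallel algorithm from the paper. The `successive_halving` and `toy_loss` functions, the budget schedule, and the learning-rate search space are all assumptions chosen for illustration.

```python
import math
import random


def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Synchronous successive halving: keep the best 1/eta of configs per rung.

    configs  -- list of candidate hyperparameter settings
    evaluate -- hypothetical callback (config, budget) -> loss, lower is better
    """
    if eta < 2:
        raise ValueError("eta must be at least 2")
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving config at the current budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Promote the top 1/eta (at least one) to the next, larger budget.
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_loss(lr, budget):
    """Made-up objective: best near lr = 0.1, less noisy with more budget."""
    noise = random.gauss(0, 0.1 / budget)
    return (math.log10(lr) + 1) ** 2 + noise


if __name__ == "__main__":
    random.seed(0)
    candidates = [10 ** random.uniform(-4, 0) for _ in range(27)]
    best = successive_halving(candidates, toy_loss)
    print(f"best learning rate found: {best:.4f}")
```

Because weak configurations are dropped early, most of the compute goes to the few candidates that still look promising at each rung.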


COVID-19

  • Moscow's Facial Recognition Tech Will Outlast the Coronavirus   Here’s a compelling, high-stakes reminder that computer vision and other technology needed for contact tracing can also be used for mass surveillance.

  • Data collection and unification for forecasting epidemics  AI is the headline in this 60 Minutes segment, but I believe the key is combining multiple data sources. A similar data unification project in the UK (“Data can save us from COVID-19”) has produced real-time dashboards that give senior policy makers the information they need to make sound decisions.

  • Epidemic Modeling 103  Bruno Gonçalves describes how you can add confidence intervals and stochastic effects to your COVID-19 models, to address common limitations of an epidemic model. Bruno was a recent guest on the Data Exchange podcast.

  • What Happens Next?  Speaking of epidemic simulations, the visualizations on this site are great teaching tools.  Check out these playable simulations to gain a greater understanding of what’s ahead in the next few months and years.
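
To make the idea of stochastic effects in an epidemic model concrete, here is a minimal sketch of a chain-binomial SIR simulation in plain Python. It is a generic textbook-style construction, not code from the Epidemic Modeling series; the function names and every parameter value (population, `beta`, `gamma`, number of runs) are assumptions. Repeating the simulation gives an empirical spread for the epidemic peak, a simple stand-in for a formal confidence interval.

```python
import random


def simulate_sir(population, infected, beta, gamma, days, rng):
    """One stochastic (chain-binomial) SIR run; returns daily infected counts."""
    s, i, r = population - infected, infected, 0
    history = [i]
    for _ in range(days):
        # Each susceptible person is infected today with probability beta * i / N.
        p_infect = min(1.0, beta * i / population)
        new_cases = sum(rng.random() < p_infect for _ in range(s))
        # Each infected person recovers today with probability gamma.
        recoveries = sum(rng.random() < gamma for _ in range(i))
        s, i, r = s - new_cases, i + new_cases - recoveries, r + recoveries
        history.append(i)
    return history


def peak_interval(runs, lower=0.05, upper=0.95):
    """Empirical interval for the peak number infected across simulated runs."""
    peaks = sorted(max(run) for run in runs)
    low = peaks[int(lower * (len(peaks) - 1))]
    high = peaks[int(upper * (len(peaks) - 1))]
    return low, high


if __name__ == "__main__":
    rng = random.Random(42)
    runs = [simulate_sir(500, 5, beta=0.3, gamma=0.1, days=100, rng=rng)
            for _ in range(100)]
    low, high = peak_interval(runs)
    print(f"most runs peak between {low} and {high} people infected")
```

Identical parameters can produce noticeably different outbreaks from run to run, which is exactly the variation a single deterministic curve hides.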


Virtual Conferences

Here are updates on events I’m involved with:

  • The Future of Machine Learning and AI   Two award-winning researchers - Michael Jordan and Ion Stoica - will speak on May 13 on the interplay between machine learning, decision science, and economics, and on the growing importance of distributed computing. Register here.

  • Automatic Forecasting   At the recent MLOps Virtual conference which I hosted, Perry Stephenson described how he uses the Databricks Platform to develop custom forecasting models for many different groups within Atlassian. He provides a very practical approach for how one can build, deploy, and manage many different ML models.

  • Understand the Ray ecosystem in a few minutes   I recorded this brief video to explain why Ray is generating buzz among machine learning enthusiasts and Python developers.


Work and Hiring


Recommendations



Modeling Epidemics, the Future of AI, and Alternative History

Data Exchange podcast

Machine Learning tools and infrastructure

  • Understanding the Ray Ecosystem and Community  In this new post I co-wrote with Ion Stoica, we explain the reasons behind the growing popularity of Ray.

  • Demystifying AI Infrastructure  This recent post by my friend Assaf Araki contains a landscape map that brings greater clarity to the AI ecosystem. The article charts the layers of the AI technical stack and the vendors within each layer.

  • TensorFlow vs. PyTorch chart  A revealing look at the frequency of TensorFlow vs. PyTorch listings in recent job postings.

  • PyCaret  This month marks the release of version 1.0 (“the first stable release”) of this easy-to-use wrapper for scikit-learn, XGBoost, Microsoft LightGBM, spaCy and other libraries.

  • KDNuggets survey  The results of this survey from KDNuggets reveal varied expectations about the use and impact of AutoML over time, segmented by background and country of origin.

  • Apache Pulsar user survey  The first user survey from the Apache Pulsar PMC team tracks Pulsar’s adoption rate and hot features, and takes a look at real-time streaming applications.

  • MLflow Model Registry  As companies begin relying on machine learning, they need to be able to keep track of model versions, dependencies, permissions and other related assets. The newly announced MLflow Model Registry on the Databricks platform is poised to become a central hub for ML models in companies with teams that use machine learning in a variety of applications.


COVID-19

  • COVID-19 forecasts  Cornell uses machine learning to predict COVID-19 activity in Chinese provinces in real time, leveraging internet search data and news alerts, and combining them with estimates from mechanistic models.

  • Unemployment visualization  A striking visualization of the spike in the national unemployment rate in the US over recent decades.

  • Countrywide lockdown works  A simple analysis - no sophisticated models - of daily death registry data for a sample of 1,161 Italian municipalities in the seven regions most severely hit by COVID-19.

  • Misinformation during a pandemic  This interesting new study from the University of Chicago examines the effects of news coverage of the novel coronavirus using two cable news programs in the US (Hannity and Tucker Carlson Tonight).

  • COVID-19 resource hub from Databricks  Seven COVID-19 data sets are updated regularly and made available on the Databricks Community Edition.


Virtual Conferences

Here’s an update on events I’m involved with:

  • Ray Summit Connect  The kickoff event for Ray Summit Connect is on May 13th and features two award-winning speakers - Michael Jordan and Ion Stoica - presenting on “The Future of Machine Learning and AI”. Register here.

  • MLOps Virtual Event  I’m taking part in an April 30th virtual event on MLOps. Databricks is sending key thought leaders to speak at this event: Matei Zaharia, Sean Owen, Clemens Mewald. Register here.

  • Spark+AI Summit  The acclaimed Spark+AI Summit is going virtual June 22 - 26 and will be FREE! That’s 200+ sessions on data, machine learning, and AI, plus keynote speakers like Nate Silver. Register here.

Work and hiring


Recommendations


Life on Lockdown, Next-gen Simulation Tools, and the Misinformation Apocalypse

Next-generation simulation software will incorporate deep reinforcement learning

Chris Nicholson is the founder and CEO of Pathmind, a startup applying deep reinforcement learning (DRL) to business simulations. Through previews from Pathmind, I’ve already seen early glimpses of how DRL is being incorporated into simulation modeling software. I expect this to be an arena where RL will be extensively used (albeit in the background).

Business at the speed of AI: Lessons from Shopify

Solmaz Shahalizadeh is VP and Head of Data Science and Data Platform Engineering at Shopify, and she has played a critical role in helping Shopify scale its data and machine learning infrastructure.

How deep learning is being used for search and information retrieval

Edo Liberty is the founder of Hypercube, a startup building tools for deploying deep learning models in search and information retrieval involving large collections. When I spoke at AI Week in Tel Aviv last November, several friends encouraged me to learn more about Hypercube - I’m glad I took their advice!

COVID-19

  • Life on lockdown in China: a New Yorker piece by one of my favorite writers, Peter Hessler.

  • The power of data in a pandemic: data unification, refinement, and cleaning for a data platform that will provide UK organizations responsible for coordinating the response with secure, reliable and timely data.

  • Epidemic Modeling 101: My friend Bruno Gonçalves explains fundamental concepts in mathematical biology that are relevant to modeling COVID-19.

  • A recent paper investigates how air temperature and humidity influence the transmission of COVID-19. My home region of Southeast Asia will be an important test case for the role of these factors.
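
As a concrete companion to the fundamentals covered in Epidemic Modeling 101, here is a minimal sketch of the textbook SIR (susceptible, infected, recovered) model, integrated with forward Euler steps in plain Python. It is a generic illustration rather than code from Bruno’s post, and the function name `sir_curve` and all parameter values are assumptions. The ratio `beta / gamma` is the basic reproduction number: infections grow when it is above 1 and die out when it is below 1.

```python
def sir_curve(population, infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward Euler; returns daily (S, I, R)."""
    s, i, r = float(population - infected), float(infected), 0.0
    steps_per_day = round(1 / dt)
    daily = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        daily.append((s, i, r))
    return daily


if __name__ == "__main__":
    beta, gamma = 0.3, 0.1  # illustrative values, not fitted to any data
    curve = sir_curve(1_000_000, 10, beta, gamma, days=200)
    peak_day = max(range(len(curve)), key=lambda d: curve[d][1])
    print(f"R0 = {beta / gamma:.1f}, infections peak on day {peak_day}")
```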

Enterprise applications of reinforcement learning

The success of reinforcement learning in game play (Atari, Go, multiplayer video games) has led to considerable interest from industrial data scientists and machine learning engineers. I recently wrote a post describing use cases in recommendations, personalization, and business simulation modeling.

Machine Learning tools and infrastructure

  • Open source database management systems: Andy Pavlo takes a look at the code repo sizes of some popular systems.

  • New JAX swing: Created by researchers at Google Brain, JAX seems to be taking hold in the machine learning research community. DeepMind recently released two new libraries - RLax (RL on JAX) and Haiku (a simple DL library on JAX).

  • Ray Summit has been postponed until the Fall. In the meantime, enjoy an amazing series of virtual conferences beginning in mid-May on the theme “Scalable machine learning, scalable Python, for everyone”.  The first event features a talk on the state of AI and ML by Michael Jordan.

  • A Tour of End-to-End Machine Learning Platforms

  • Easy Distributed Scikit-Learn with Ray: Ameer Haj Ali of RISELab  describes an easy way to scale your scikit-learn applications to a cluster with Ray’s implementation of joblib’s backend.

Work and hiring

Recommendations
