Gradient Flow #40: Data Augmentation in NLP, Temporal Knowledge Bases, Storage for AI

Jul 29, 2021

“Discovering you were wrong is an update, not a failure, and your worldview is a living document meant to be revised.” - Julia Galef

Data Exchange podcast

Storage Technologies for a Multi-cloud World: Brad King is the CTO of Scality, a company that builds software-defined file and object storage systems for hybrid and multi-cloud settings. Storage and compute are the basic building blocks of (cloud) computing platforms, and this conversation covers the key considerations and recent innovations in storage technologies that data engineers, architects, and machine learning professionals need to know.
The Rise of Data Augmentation for Language Models: I spoke with Steven Feng, a graduate student, and Ed Hovy, a research professor, both at the Language Technologies Institute at Carnegie Mellon University. We discussed their recent survey paper on Data Augmentation Approaches in NLP, an active field of research on techniques for increasing the diversity of training examples without explicitly collecting new data.

Temporal: An open source orchestration and workflow management framework that enables teams to write scalable and reliable applications. It excels at orchestrating services but is also well-suited for data pipelines.
CueObserve: An open source anomaly detection tool for data in your SQL data warehouses and databases.
QuestDB: Another high-performance, open source, time-series database that supports SQL. For readers who work in open source, I also recommend the story of how this project came about.
How TikTok’s recommender works: Is the most crucial element your location, shares, likes, or follows? According to this WSJ investigation, how long you linger over a piece of content is the key to TikTok’s recommender. The investigation also gauged how quickly TikTok’s model learns users' interests and sends them down a rabbit hole of similar content. As earlier investigations into YouTube’s recommender found, extreme content is used to optimize watch time.
Time-aware language models: A team of researchers from Google developed a new approach to pretraining language models that facilitates the acquisition of temporal knowledge.

10 major trends of China's AI industry in 2021: Notes from the recent World Artificial Intelligence Conference. You might need to use machine translation to read this article. (via the China AI newsletter)
Why Wait - Top Data Trends: I was a recent guest on my friend Dhruba Borthakur’s new podcast. Dhruba is the creator of RocksDB, the popular open source embeddable database.
The Benchmark Lottery in Machine Learning: A much-needed paper on the importance of understanding the individual, community, social, and political pressures that influence why some ML benchmarks become canonical and others do not.
2021 Silicon Valley Software Engineering Talent Report: Average salaries from the report: backend ($140K), frontend ($128K), full stack ($134K), DevOps ($171K), mobile ($147K), data scientist ($141K), security engineer ($170K).
Speaking of engineering talent, here’s Dropbox’s Engineering Career Framework.
What’s bad about Julia? A good post from an avid Julia user: “Learning why you may not want to choose to use a tool is just as important as learning why you may.”

Closing short:

#TokyoOlympics @NBCOlympics

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.

Ben Lorica edits the Gradient Flow newsletter. He is co-chair of the Ray Summit, external chair of the NLP Summit, and host of the Data Exchange podcast. You can follow him on Twitter @BigData. This newsletter is produced by Gradient Flow.