Gradient Flow #40: Data Augmentation in NLP, Temporal Knowledge Bases, Storage for AI


“Discovering you were wrong is an update, not a failure, and your worldview is a living document meant to be revised.” - Julia Galef

Data Exchange podcast

  • Storage Technologies for a Multi-cloud World: Brad King is the CTO of Scality, a company that builds software-defined file and object storage systems for hybrid and multi-cloud settings. Storage and compute are the basic building blocks of (cloud) computing platforms, and this conversation highlights the important considerations and recent innovations in storage technologies that data engineers, architects, and machine learning professionals need to know.

  • The Rise of Data Augmentation for Language Models: I spoke with Steven Feng, a graduate student, and Ed Hovy, a research professor, both at the Language Technologies Institute of Carnegie Mellon University. We discussed their recent survey paper on Data Augmentation Approaches in NLP, an active field of research on techniques for increasing the diversity of training examples without explicitly collecting new data.
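
For readers who have not tried data augmentation on text, here is a minimal sketch of the idea: generate extra training sentences by lightly perturbing existing ones. It uses two simple token-level operations, random swap and random deletion. The function names, the whitespace tokenizer, and the 10% deletion rate are illustrative assumptions on my part, not techniques taken from the survey or endorsed by its authors.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_variants=4, seed=0):
    """Produce n_variants perturbed copies of a sentence (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    tokens = sentence.split()  # naive whitespace tokenization (an assumption)
    variants = []
    for k in range(n_variants):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, 1, rng)
        else:
            new_tokens = random_deletion(tokens, 0.1, rng)
        variants.append(" ".join(new_tokens))
    return variants


if __name__ == "__main__":
    for variant in augment("the storage layer keeps every object durable"):
        print(variant)
```

Perturbations like these can change the meaning or label of a sentence, so in practice you would check the variants against your task before adding them to a training set.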

[Image by Ludomił Sawicki from Unsplash.]

Data & Machine Learning Tools and Infrastructure

  • Temporal: An open source orchestration and workflow management framework that enables teams to write scalable and reliable applications. It excels at orchestrating services but is also well-suited for data pipelines.

  • CueObserve, an open source anomaly detection tool for data in your SQL data warehouses and databases

  • QuestDB: Another high-performance, open source, time-series database that supports SQL. For readers who work in open source, I also recommend the story of how this project came about.

  • How TikTok’s recommender works: Is the most crucial element your location, shares, likes, or follows? According to this WSJ investigation, how long you linger over a piece of content is the key to TikTok’s recommender. The investigation also gauged how quickly TikTok’s model learns users' interests and sends them down a rabbit hole of similar content. As with earlier investigations into YouTube’s recommendation system, the finding is that extreme content is used to optimize watch time.

  • Time-aware language models: A team of researchers from Google developed a new approach to pretraining language models that facilitates the acquisition of temporal knowledge.
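
To make the time-aware idea concrete, one simple way to expose time to a language model is to prepend each document's timestamp to its text before training, so the model can condition on when something was written. The sketch below shows only that preprocessing step. The helper names, the record layout, and the exact prefix string are my own illustrative assumptions and should not be read as the researchers' implementation.

```python
def add_time_prefix(text, year):
    """Prepend a year marker to a piece of text (prefix format is an assumption)."""
    return f"year: {year} text: {text}"


def build_training_examples(records):
    """Turn (text, year) records into time-prefixed strings for pretraining.

    `records` is a hypothetical list of (text, year) tuples.
    """
    return [add_time_prefix(text, year) for text, year in records]


if __name__ == "__main__":
    # Placeholder records, invented purely for illustration.
    records = [
        ("The team released version one of the product.", 2015),
        ("The team released version two of the product.", 2019),
    ]
    for example in build_training_examples(records):
        print(example)
```

At query time the same prefix can be attached to a prompt to ask what was true in a given year.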

[Image by Yoichi Aihara from Pixabay.]

Recommendations


Closing short:


If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.


Ben Lorica edits the Gradient Flow newsletter. He is co-chair of the Ray Summit, external chair of the NLP Summit, and host of the Data Exchange podcast. You can follow him on Twitter @BigData. This newsletter is produced by Gradient Flow.
