Welcome to the application page for research positions in the SNAP group under Prof. Jure Leskovec, Autumn Quarter 2024-25!
Our group has open positions for Research Assistants and students interested in independent studies and research (CS191, CS195, CS199, CS399). These positions are available to Stanford University students only. Below are some of the possible research projects. All projects are high-impact: participants conduct research on real-world problems and data, often leading to research publications or open-source software. Positions are often extended over several quarters. We are looking for highly motivated students with any combination of the following skills: machine learning, data mining, network analysis, algorithms, and computer systems.
Please apply by filling out and submitting the form below. Apply early, as the selection process starts soon after this announcement is posted. Thanks for your interest!
If you have any questions, please contact Lata Nair at lnairp24@stanford.edu.
Keywords: Foundation Model, Relational Data, Structured Data, Large Model Pre-training
Much of the world's most valuable data is stored in tables in relational databases (RDBs), yet building AI/ML models over tabular data still requires manual feature engineering. We recently proposed Relational Deep Learning (RDL, https://proceedings.mlr.press/v235/fey24a.html, https://relbench.stanford.edu), which revolutionizes AI over structured data. Next, we are excited to explore foundation models (FMs) for RDL. Drawing on the success of LLMs in NLP and VLMs in vision, we seek to create transformer-based models trained on diverse, multimodal, large-scale structured data, with the goal of easy and flexible adaptation to custom databases and custom predictive tasks. The core of this project is the development of a novel generative framework and architecture components that fully utilize the relational structure of databases. To drive this effort, we have partnered with SAP, a leading database company with deep expertise in industrial database systems and use cases.
This project will span the entire foundation model development pipeline: designing the model architecture and pre-training tasks, setting up large-scale training runs, and running extensive evaluations, including LLM baselines (GPT-4 API calls, LLaMA fine-tuning, etc.). We are seeking highly motivated students with experience in machine learning and deep learning (e.g., CS224W, CS224N, CS231N, CS229) and strong knowledge of the latest language modeling techniques. Students who join this project will have the opportunity to contribute to multiple aspects of the project.
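For context, RDL represents a relational database as a heterogeneous graph: rows become nodes and primary-foreign key links become edges, so a graph neural network or transformer can learn over the schema directly. Below is a minimal, purely illustrative sketch of that conversion using PyTorch Geometric's HeteroData; the customer/order tables, features, and sizes are hypothetical placeholders, not the project's code.

```python
# Minimal sketch (assumed setup, not the project's actual pipeline):
# turning two toy relational tables into a heterogeneous graph.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# One node per row of each table; x holds placeholder row features.
data["customer"].x = torch.randn(100, 16)  # 100 customers, 16 features
data["order"].x = torch.randn(500, 8)      # 500 orders, 8 features

# Primary-key/foreign-key links become edges: each order -> its customer.
fk = torch.randint(0, 100, (500,))  # fake foreign-key column
data["order", "placed_by", "customer"].edge_index = torch.stack(
    [torch.arange(500), fk]
)

print(data)  # a heterogeneous graph a GNN or graph transformer can train on
```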
Keywords: LLM Training, Foundation Model, AI for Science, Protein Folding, Virtual Cell
We are developing the largest open-source language model for science: a large language model for genomic sequences (DNA, RNA, and proteins), trained on a cluster of 1,024 H100 GPUs (1/8th of the compute budget of GPT-4). We will open-source not only the model but also, for the first time, the full training implementation and details. We are looking for two types of students:
a) ML systems: help develop various aspects of our large-scale system, such as data management, distributed communication, and the machine learning architecture, as well as run and manage experiments on the cluster.
b) Applied ML and biology: help with various aspects of post-training, such as creating fine-tuned models, aligning the model with preference optimization, building sequence-to-structure modules competitive with AlphaFold, and generating data for self-distillation.
We are looking for highly motivated students who have experience in machine learning, natural language processing, and ML systems and engineering (courses such as CS224W, CS224N, CS231N, and CS229 are helpful). A strong background in PyTorch is recommended. Experience with CUDA and distributed computing is a major plus.
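To give a flavor of the ML-systems side, here is a hedged, minimal sketch of a data-parallel training step using plain PyTorch DDP. The tiny stand-in model over a 4-letter DNA vocabulary and all hyperparameters are placeholders, not the project's actual stack.

```python
# Minimal single-node DDP sketch; launch with: torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # one process per GPU under torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Placeholder stand-in for a genomic LM: embed A/C/G/T, predict next token.
    model = torch.nn.Sequential(
        torch.nn.Embedding(4, 128),
        torch.nn.Linear(128, 4),
    ).cuda()
    model = DDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        tokens = torch.randint(0, 4, (8, 512), device="cuda")  # fake DNA batch
        logits = model(tokens[:, :-1])
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, 4), tokens[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across GPUs here
        opt.step()
        if rank == 0:
            print(step, loss.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A run at the scale described above would, of course, layer model/tensor parallelism, checkpointing, and high-throughput data loading on top of this skeleton.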
Keywords: LLM Agent, Computational Biology
We are excited about the potential of AI agents in biology: they could automate biological discovery, uncover novel hypotheses, and much more. However, a significant challenge in developing these agents is their inability to perform experiments and analyses, which are crucial for making real scientific discoveries. Our goal is to build the necessary infrastructure for an AI agent tailored to biology. Students involved in this project will be responsible for implementing tool interfaces, building agents, and potentially fine-tuning an LLM for biological applications.
We are seeking highly motivated students with experience in machine learning and deep learning (e.g., CS224W, CS224N, CS231N, CS229) who are proficient with LLM application frameworks such as LangChain and are familiar with the latest research in AI agents.
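As a concrete (purely illustrative) example of what a "tool interface" might look like, here is a stub tool in LangChain's @tool style; the blast_search function and its behavior are hypothetical placeholders for the kind of wrappers this project would build.

```python
# Hedged sketch of a biology tool interface, assuming LangChain.
from langchain_core.tools import tool

@tool
def blast_search(sequence: str, database: str = "nr") -> str:
    """Run a (mock) BLAST search for a protein or nucleotide sequence."""
    # A real implementation would call NCBI BLAST or a local aligner;
    # this stub just lets an agent's tool-calling loop be exercised.
    return f"Top hit for {sequence[:10]}... in {database}: hypothetical protein XP_000001"

# A tool-calling LLM could then be bound to this interface, e.g.:
# llm.bind_tools([blast_search])
```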
Keywords: Knowledge Graphs, Graph Transformers, Data Processing, Information Retrieval
Recent advances in natural language processing using large language models have opened unprecedented opportunities for deep learning. Foundation models and transformer architectures have extended seamlessly to domains such as image generation, biology, and medicine, handling diverse data types. This expansion has spurred the integration of such models into graph-related tasks. Although graph transformers and graph attention mechanisms have been introduced, they do not yet fully harness the inherent structure of knowledge graphs. In this project, we aim to address this gap by developing a graph transformer for knowledge graph reasoning. We will build this model from scratch, which includes data collection and processing, model design, experimentation with various databases, and model evaluation, among other tasks. We will develop a robust, computationally efficient, and versatile architecture capable of adapting to a range of tasks, including but not limited to information retrieval, answering complex queries, drug repurposing, and product recommendation. The successful deployment of our model could have significant implications for database management, social network analysis, and biology.
We are looking for highly motivated students who have a background in machine learning and deep learning (e.g., CS224W, CS230, CS231N) and experience in data processing. Strong coding skills and proficiency in PyTorch are required. Experience with distributed computing is a plus.
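To illustrate the kind of building block involved, here is a minimal sketch of attention constrained by graph structure: standard multi-head attention masked so each entity attends only to its neighbors in the knowledge graph. Relation-type embeddings and structural encodings, which a real design would need, are omitted; all names and sizes are placeholders.

```python
# Hedged sketch of graph-masked attention in plain PyTorch.
import torch
import torch.nn as nn

class GraphMaskedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, dim) entity embeddings
        # adj: (num_nodes, num_nodes) boolean adjacency; True = edge exists
        mask = ~adj  # MultiheadAttention blocks positions marked True
        out, _ = self.attn(x[None], x[None], x[None], attn_mask=mask)
        return self.norm(x + out[0])  # residual + layer norm

# Toy usage: 5 entities, self-loops plus one edge.
x = torch.randn(5, 64)
adj = torch.eye(5, dtype=torch.bool)
adj[0, 1] = adj[1, 0] = True
layer = GraphMaskedAttention(64)
print(layer(x, adj).shape)  # torch.Size([5, 64])
```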
Keywords: Foundation Model, AI for Science, Genomics
As efforts to model virtual cells advance, gaps remain in defining robust cell embeddings. Key challenges include 1) removing unwanted variation (e.g., batch effects), 2) accurately reflecting biological structure, and 3) ensuring utility in downstream analyses and applications. In this project, we aim to build the next generation of evaluation frameworks for cell embeddings and to develop novel techniques, such as machine unlearning, to enhance zero-shot embeddings from foundation models.
We are looking for highly motivated students who have experience in machine learning and machine learning for biology (courses such as CS273B, CS224W, CS224N, CS231N, and CS229 are helpful). A strong background in graph ML is preferred. Experience with bioinformatics or computational chemistry is a plus.
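As a rough illustration of what such an evaluation measures, the sketch below scores a random placeholder embedding on two axes using silhouette scores: preserving cell-type structure while mixing out batch labels. Actual benchmarks would use curated datasets and richer metrics; everything here is synthetic.

```python
# Hedged sketch of a two-axis cell-embedding evaluation, assuming scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))           # placeholder cell embeddings
cell_type = rng.integers(0, 5, size=1000)   # placeholder cell-type labels
batch = rng.integers(0, 3, size=1000)       # placeholder batch labels

bio_score = silhouette_score(emb, cell_type)  # higher = cell types separate well
batch_score = silhouette_score(emb, batch)    # lower = batches are well mixed
print(f"bio conservation: {bio_score:.3f}, batch separation: {batch_score:.3f}")
```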
Keywords: LLMs, Agents, Multiagent, Prompt/Tool Optimization
The development of large language model (LLM) pipelines has moved beyond simple single-agent techniques. Modern systems often involve multiple agents working together and drawing on different data sources to complete tasks. Optimizing these agent systems presents challenges that require careful planning and execution to ensure the agents work well together. Our goal is to develop innovative techniques for building LLM agent systems capable of handling complex, real-world tasks: multiple agents, jointly optimized to collaborate effectively, using tools to enhance reasoning and draw on general knowledge. Think of such a multi-agent system as a large neural network, where our focus is on training the "network" of agents rather than individual parameters. We aim to refine how agents work together, creating a system that adapts quickly and seamlessly to new tasks.
We are looking for highly motivated students with experience in machine learning (ML), natural language processing (NLP), ML systems, and engineering. A strong background in PyTorch is recommended. Experience with LLM agent frameworks (e.g., DSPy, LangChain) and distributed computing would also be beneficial.
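To make the "network of agents" analogy concrete, here is a hedged sketch in DSPy (one of the frameworks named above, assuming a recent version): two prompted modules composed into a pipeline whose prompts and demonstrations, rather than weights, are what an optimizer tunes. The model name and signatures are placeholders.

```python
# Hedged DSPy sketch: a two-module pipeline treated as a trainable "network".
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder LM

class ResearchPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Each module is an "agent": a prompted LM call with a typed signature.
        self.plan = dspy.ChainOfThought("task -> plan")
        self.answer = dspy.ChainOfThought("task, plan -> answer")

    def forward(self, task: str):
        plan = self.plan(task=task).plan     # first agent drafts a plan
        return self.answer(task=task, plan=plan)  # second agent executes it

# A prompt optimizer (e.g., dspy.BootstrapFewShot) would then "train" the
# pipeline on examples, analogous to backprop through a neural network.
```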
Go to the application form.