Tables¶

The TTable in SNAP is a table data structure for storing tabular data, which can easily be converted into the SNAP graph. The TTable is much more efficient than other tabular data structures and functions seamlessly inside the SNAP universe. The TTable can easily store hundreds of millions of rows and perform complex data manipulation operations.

TTable objects can be easily loaded from CSV and TSV files. TTable objects can store integers, strings, and floats in its columns. Each column can store exactly one data type, and each column has its own name, a string.

This tutorial will cover:

how to create a TTable,
how to save and load a TTable,
how to perform columnwise operations on TTables,
how to perform rowwise operations on TTables,
how to join two TTables together,
and how to extract information from a TTable.

Creating a `TTable`¶

To create a TTable object in SNAP, you must first define the:

Context: The Context holds information behind-the-scenes about the mappings between integers and strings (which reduces memory usage). You don’t have to do anything with the Context except create it and use it as a parameter when creating your TTable. Many TTable objects can share the same Context.
Schema: The Schema defines the column names in the table and their data types, which, unlike in other Python packages, you must specify up-front. Each column can be either an integer column, a float column, or a string column. Each column can hold values of only that data type.

For this tutorial, let’s assume the table below is a tab-separated file with the following columns and values, called ‘student_grades.tsv’:

Student Grades¶
StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

To turn this into a SNAP table, we must create a Context and Schema:

>>> import snap

>>> context = snap.TTableContext()
>>> schema = snap.Schema()
>>> schema.Add(snap.TStrTAttrPr("StudentID", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Midterm1", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Midterm2", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Final", snap.atInt))

As you can see, defining the Context simply requires initializing an object of type TTableContext. That’s all you have to do for the Context!

For the Schema, you must first initialize an object of the SNAP Schema type, and then use the Add() method to create column types for the TTable you want to build. The Add() method takes one parameter, a SNAP TStrTAttrPr, which is a pair consisting of a string and an attribute. An attribute in SNAP is used to represent different data types using an integer key; you don’t have to worry about this, but just remember that the Schema requires this data type for the columns. There are always 2 components of a TStrTAttrPr: the name of the column, which is a string, and the type of data that the column with that name will hold. The options are atInt (integer attribute), atFlt (float attribute), and atStr (string attribute). Since our columns are type integer, we will use atInt for all of them.

We now have the building blocks for a TTable with four columns and a context! Next, we’ll show how to create a TTable from these components, plus a path to a file that we want to make a TTable from. TTable objects can be created from comma-separated files (CSV) and tab-separated files (TSV). Here’s an example:

>>> filename = "/path/to/student_grades.tsv"
>>> grade_table = snap.TTable.LoadSS(schema, filename, context, "\t", snap.TBool(True))

For the filename, we simply use the path to that file on the local machine. Then, to create a table, we use the function TTable.LoadSS(). This function takes in 5 parameters:

The Schema that we made before, which should correspond to the number and types of columns in the TSV file
The name of the path to the file, as a string
The Context created earlier
The separator used in the file (“t” for tab separated, “,” for comma separated, etc.)
A snap.TBool boolean value indicating whether or not the file has a ‘title line,’ that is, a beginning line of column names or other text that is not commented out with a #. Remember that your Schema already has column names, so you don’t want to include them from your CSV or TSV since they’ll throw an error! In our example above, we did have column names in our TSV, so we set this boolean to True.

Now we’ve successfully created a TTable in SNAP! Recall that you can accommodate any table by changing the Schema for the number and type of columns that you need.

Saving and Loading a `TTable` with Binary Format¶

Next, we’ll demonstrate how to save a TTable and load one from binary. TTable objects can be saved in binary format because this saves space (in fact, it’s orders of magnitude more efficient than saving it as text). To save a TTable to binary format, you use the following:

>>> outfile = "/path/to/grade_table.bin"
>>> FOut = snap.TFOut(outfile)
>>> table.Save(FOut)
>>> FOut.Flush()

The four steps are:

Create a path to the file you want to save your TTable to.
Create a TFOut object. A SNAP TFout object allows writing the contents of a file to the specified pathname.
Save the table to your TFOut object (here, named FOut) using the Save() function.
Flush your TFOut object. This flushes the write buffer for the stream, meaning that it has been cleared of the contents of our table and it can be used again for further saving operations.

Once we’ve saved a TTable object to binary format, we can also load TTable objects from their binary format as follows:

>>> context = snap.TTableContext()
>>> outfile = "/path/to/grade_table.bin"
>>> FIn = snap.TFIn(outfile)
>>> table = snap.TTable.Load(FIn, context)

Again, the four steps of loading a TTable from binary format are:

Create a Context object for the TTable. This is necessary when loading a TTable that has been stored in binary format.
Provide the pathname where the binary file currently resides.
Create an TFIn object with the pathname to the binary file. The SNAP FIn object is used to read the contents of a binary file and parse it back into a more complex data structure. It takes the pathname as a parameter.
Finally, create the TTable using the Load() method, which takes two parameters: the TFIn object we just made, and the context that was created in Step 1.

We’ve now covered the basics of how to create, save, and load TTable objects!

Columnwise `TTable` Operations¶

Now that we know how to create a TTable, let’s investigate different column operations that are supported by TTable objects. These column operations allow us to take two or more columns and create a new column via some operation. These include addition, subtraction, multiplication, division, modulo division, maximum, minimum, and concatenation. They are united by their function names, which are all of the form .ColFunc(), where Func is the operation name. There is also one more advanced function, AggregateCols(), that allows us to do other operations like count, first, last, mean, and median.

Let’s do an example by taking our table from above and performing some basic operations. Here is the original for reference:

Student Grades¶
StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

Let’s say we wanted to know the total number of points that each student earned across the two midterms. To do this, we want to use the ColAdd() function, which looks like table.ColAdd(Attr1, Attr2, NewColName.

In the ColAdd() function, we provide three parameters: the first two are the columns we want to add together, using their string names, and the third is the name of the column we want to create that will hold the sums of the first two columns. This is true for all ColFunc() functions. Since we want to get the sum over the midterm scores, we will add together Midterm1 and Midterm2:

>>> grade_table.ColAdd("Midterm1", "Midterm2", "MidScoreSum")

Which yields:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	MidScoreSum
101	79	86	88	165
102	84	80	79	164
103	56	76	80	132
104	90	92	96	182
105	92	85	87	177
106	87	95	92	182
107	94	90	91	184
108	76	88	81	164

Let’s say now that we wanted a column that gave the average of the midterm scores. In this case, we’d use the AggregateCols() method to create a new column with the mean of the midterm columns, row by row. The AggregateCols() has parameters table.AggregateCols(AggAttrs, AggOp, NewColName where AggAttrs is the list of columns you’re working with (it can be more than two), and AggOp is the operation you want to perform from the options: aaSum, aaCount, aaMin, aaMax, aaFirst, aaLast, aaMean, aaMedian. We’ll choose aaMean for our purposes here. Last, you’ll again provide the string name of the new column you’d like to create!

Here is the code for getting the mean over the midterm scores:

>>> AggAttrs = snap.TStrV()
>>> AggAttrs.Add("Midterm1")
>>> AggAttrs.Add("Midterm2")
>>> grade_table.AggregateCols(AggAttrs, snap.aaMean, "MidtermMean")

With the result:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	MidScoreSum	MidtermMean
101	79	86	88	165	82.5
102	84	80	79	164	82
103	56	76	80	132	66
104	90	92	96	182	91
105	92	85	87	177	88.5
106	87	95	92	182	91
107	94	90	91	184	92
108	76	88	81	164	82

A similar methodology can be used for all of the column operation functions for TTable objects.

One important feature of this function group is: If the third parameter passed is an empty string, i.e.:

>>> table.ColDiv("Col1", "Col2", "")

then the results will overwrite the values in the column of the first parameter. In this case, the results of dividing Col1 values by Col2 values would replace the values in Col1.

Rowwise Table Operations¶

The operations shown above focused on creating new data from some combination of two pre-existing columns. Now, we’ll look at operations that summarize or elucidate information about the table: namely, the Group(), Aggregate(), AggregateCols(), Select(), and Unique() functions. These methods affect the table in different ways. Here, we will describe the use cases of the most important features.

First, we will investigate the Select() function family, which consists of SelectAtomicIntConst(), SelectAtomicFltConst(), SelectAtomicStrConst(), SelectAtomic(), and Select(). You will usually use the first four, as Select() is utilized for complex, layered selecting parameters.

First, let’s look at SelectAtomic***Const() functions, which allows you to select rows based on their value in a single column. For example, perhaps you want to select students who had final scores of 90 or above. Here are the general parameters of SelectAtomic***Const() (insert Int, Flt, or Str depending on the type): table.SelectAtomicIntConst(Column, Val, Cmp, SelectedTable.

Column is the column we want to select on. This would be final scores in the example above. Val is the value we want to compare to, which is 90 in the example above. Cmp is the comparator we want to use, with choices of less then (LT), less than or equal to (LTE), equal to (EQ), not equal to (NEQ), greater than or equal to (GTE), greater than (GT), substring of (SUBSTR), or superstring of (SUPERSTR). In the example above, we want to use greater than or equal to (GTE). Finally, we need to provide a SelectedTable, the table that we want add the selected rows to. Generally, using a new blank table is the right option.

Here’s the code to select only rows where the final score is greater than or equal to 90. Let’s assume we’ve greater a new blank TTable called ‘above_90_table’:

>>> grade_table.SelectAtomicIntConst("Final", 90, snap.GTE, above_90_table)

Let’s now look at the Group() and Unique() functions. The Group() function allows us to create a new column to label each column according to shared attributes by using Group(GroupByAttrs, GroupAttrName, Ordered=True. Let’s now look at the Group() and Unique() functions. The Group() function allows us to create a new column to label each column according to shared attributes by using Group(GroupByAttrs, GroupAttrName, Ordered=True.

Here, GroupByAttrs are the columns we want to group with respect to, where their values are the same. GroupAttrName will be the name of the new column with the labels. Let’s say we wanted to group students by their midterm mean score. As we can see above, two students scored an average 91, and two students scored an average 82, so we will see some groups developed. Let’s write the code for this operation:

>>> groupAttrs = snap.TStrV()
>>> groupAttrs.Add("MidtermMean")
>>> table.Group(groupAttrs, "MeanGroups", snap.TBool(True))

Which yields:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	MidScoreSum	MidtermMean	MeanGroups
101	79	86	88	165	82.5	0
102	84	80	79	164	82	1
103	56	76	80	132	66	2
104	90	92	96	182	91	3
105	92	85	87	177	88.5	4
106	87	95	92	182	91	3
107	94	90	91	184	92	5
108	76	88	81	164	82	1

Another related method is Unique(). Rather than assigning the same labels to rows with similar values, any rows with the same sought-after values will be deleted so there are no remaining duplicates, using the paramaters Unique(Attrs, Ordered=True)(). Here, Attrs is simply the attributes that need to be equal in order for us to consider them duplicates.

Let’s try this on the original table, and instead of grouping by the midterm mean, we’ll use Unique() to keep only students with a unique midterm mean score:

>>> attrs = snap.TStrV()
>>> attrs.Add("MidtermMean", snap.TBool(True))
>>> table.Unique(attrs)

Which would instead yield:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	MidScoreSum	MidtermMean
101	79	86	88	165	82.5
102	84	80	79	164	82
103	56	76	80	132	66
104	90	92	96	182	91
105	92	85	87	177	88.5

Students 106 and 108 have been removed because they had the same midterm mean score as students before them. Remember that Unique() goes from top to bottom row, so earlier rows will be preserved.

Now, let’s investigate the Aggregate() method, which allows us to aggregate statistics for each row based on values in certain columns. For example, we might want to add a column telling us how many instances of the AuthorID in each row exist in the dataset. Aggregate() is invoked using parameters Aggregate(GroupByAttrs, AggOp, ValAttr, ResAttr, Ordered=True.

The Aggregate method takes:

GroupByAttrs: The attributes (columns) that you want to aggregate with respect to. This will need to be a vector of strings that you create in advance.
AggOp: The operation you want to aggregate by: options are aaSum, aaCount, aaMin, aaMax, aaFirst, aaLast, aaMean, or aaMedian.
ValAttr: Which attribute (column) we want to aggregate over.
ResAttr: The name of the column where the result of the aggregation will be stored.
Ordered: Whether to treat grouping keys as ordered or unordered.

To make all this more concrete, let’s say we wanted to find the maximum final score over all students based on a particular mean midterm score. That is, for students with the same midterm score, we will add a value to their row indicating the highest final score achieved by someone with their same score. Here’s how we would use Aggregate() to do so:

>>> GroupBy = snap.TStrV()
>>> GroupBy.Add("MidtermMean")
>>> PapAuthT.Aggregate(GroupBy, snap.aaMax, "Final", "MaxFinal", snap.TBool(False))

Here, we use a variable GroupBy to hold a vector of strings representing the columns we want to group with respect to, that is, the MidtermMean column. We then use Aggregate() with the snap.aaCount function to count the number of times each mean appears in the dataset, and store the count in a new column called MeanCount. Here is what the result will look like:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	MidScoreSum	MidtermMean	MaxFinal
101	79	86	88	165	82.5 88
102	84	80	79	164	82	81
103	56	76	80	132	66	80
104	90	92	96	182	91	96
105	92	85	87	177	88.5 87
106	87	95	92	182	91	96
107	94	90	91	184	92	91
108	76	88	81	164	82	81

As you can see, the MaxFinal values indicate the highest final score value for students with the same midterm mean. Notably, we see that students 102 and 108 have the same value, because they have the same midterm score, and their value is the maximum of either of their final scores (81 being higher than 79). The same occurred for students 104 and 106.

Two Table Operations¶

Some SNAP TTable operations help us to combine two different tables into a single table according to various rules. These functions include Intersection, Union, Join, and Minus. They work as follows:

Intersection(): creates a new table from all rows that appear in both original tables. Returns a new table.
Union(): creates a new table from all rows that appear in either original table. Returns a new table. UnionAll has a similar function, but retains duplicates of rows across the tables.
Minus(): creates a new table from all rows in the first table not present in the second table. Returns a new table.
Join(): a more customizable function, Join equi-joins two tables based on one attribute in the first table. Columns from the second table will be added to the first where the value of the desired attribute in the first table matches the value of the desired attribute in the second. Does not return a new table, but rather updates the original table with columns from the second table.
SimJoin(): a function that performs an equi-join if the distance between two rows is less than the specified threshold.

Let’s go back to our original grade table with four columns: StudentID, Midterm1, Midterm2, and Final. Let’s say we have another table that lists the student IDs of these students, plus a column with their names:

Student Names¶
ID	Name
101	Will
102	Amira
103	Todd
104	Yang
105	Cathy
106	Shubash
107	Nicolo
108	Maria

Let’s say we want to incorporate the Name column into our original table. We can do this using the Join() function, with parameters Join(Attr1, PTable, Attr2.

Here, Attr1 is the column we want to join on from the first table, PTable is the second table we want to join with, and Attr2 is the column we want to join on from the second table.

To combine our two tables, we would use:

>>> combined_table = grade_table.Join("StudentID", name_table, "ID")

Which will create a new table called ‘combined_table’ as so:

Student Grades¶
StudentID	Midterm1	Midterm2	Final	Name
101	79	86	88	Will
102	84	80	79	Amira
103	56	76	80	Todd
104	90	92	96	Yang
105	92	85	87	Cathy
106	87	95	92	Shubash
107	94	90	91	Nicolo
108	76	88	81	Maria

Getting Information from Tables¶

SNAP has many functions to get information from TTable objects, in the form of vectors or basic data types. Some of the most useful get functions include:

GetNumRows()
GetSchema()
GetIntVal(), GetFltVal(), and GetStrVal()
GetIntValAtRowIdx(), GetFltValAtRowIdx(), and GetStrValAtRowIdx()
ReadIntCol(), ReadFltCol(), and ReadStrCol()

These functions are relatively straightforward, and will assist with obtaining pieces of information and summary statistics from the TTable. The Val functions return single values, and the Col functions return vectors of entire column values.

Tables¶

Creating a `TTable`¶

Saving and Loading a `TTable` with Binary Format¶

Columnwise `TTable` Operations¶

Rowwise Table Operations¶

Two Table Operations¶

Getting Information from Tables¶

Table of Contents

Previous topic

Next topic

This Page

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final	MidScoreSum
101	79	86	88	165
102	84	80	79	164
103	56	76	80	132
104	90	92	96	182
105	92	85	87	177
106	87	95	92	182
107	94	90	91	184
108	76	88	81	164

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final	MidScoreSum
101	79	86	88	165
102	84	80	79	164
103	56	76	80	132
104	90	92	96	182
105	92	85	87	177
106	87	95	92	182
107	94	90	91	184
108	76	88	81	164

Tables¶

Creating a TTable¶

Saving and Loading a TTable with Binary Format¶

Columnwise TTable Operations¶

Rowwise Table Operations¶

Two Table Operations¶

Getting Information from Tables¶

Creating a `TTable`¶

Saving and Loading a `TTable` with Binary Format¶

Columnwise `TTable` Operations¶

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final
101	79	86	88
102	84	80	79
103	56	76	80
104	90	92	96
105	92	85	87
106	87	95	92
107	94	90	91
108	76	88	81

StudentID	Midterm1	Midterm2	Final	MidScoreSum
101	79	86	88	165
102	84	80	79	164
103	56	76	80	132
104	90	92	96	182
105	92	85	87	177
106	87	95	92	182
107	94	90	91	184
108	76	88	81	164