Tables¶
The TTable
in SNAP is a table data structure for storing tabular data, which can easily be converted into the SNAP graph. The TTable
is much more efficient than other tabular data structures and functions seamlessly inside the SNAP universe. The TTable
can easily store hundreds of millions of rows and perform complex data manipulation operations.
TTable
objects can be easily loaded from CSV and TSV files. TTable
objects can store integers, strings, and floats in its columns. Each column can store exactly one data type, and each column has its own name, a string.
This tutorial will cover:
how to create a TTable,
how to save and load a TTable,
how to perform columnwise operations on TTables,
how to perform rowwise operations on TTables,
how to join two TTables together,
and how to extract information from a TTable.
Creating a TTable
¶
To create a TTable
object in SNAP, you must first define the:
Context: The Context holds information behind-the-scenes about the mappings between integers and strings (which reduces memory usage). You don’t have to do anything with the Context except create it and use it as a parameter when creating your
TTable
. ManyTTable
objects can share the same Context.Schema: The Schema defines the column names in the table and their data types, which, unlike in other Python packages, you must specify up-front. Each column can be either an integer column, a float column, or a string column. Each column can hold values of only that data type.
For this tutorial, let’s assume the table below is a tab-separated file with the following columns and values, called ‘student_grades.tsv’:
StudentID |
Midterm1 |
Midterm2 |
Final |
---|---|---|---|
101 |
79 |
86 |
88 |
102 |
84 |
80 |
79 |
103 |
56 |
76 |
80 |
104 |
90 |
92 |
96 |
105 |
92 |
85 |
87 |
106 |
87 |
95 |
92 |
107 |
94 |
90 |
91 |
108 |
76 |
88 |
81 |
To turn this into a SNAP table, we must create a Context and Schema:
>>> import snap
>>> context = snap.TTableContext()
>>> schema = snap.Schema()
>>> schema.Add(snap.TStrTAttrPr("StudentID", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Midterm1", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Midterm2", snap.atInt))
>>> schema.Add(snap.TStrTAttrPr("Final", snap.atInt))
As you can see, defining the Context simply requires initializing an object of type TTableContext. That’s all you have to do for the Context!
For the Schema, you must first initialize an object of the SNAP Schema type, and then use the Add()
method to create column types for the TTable
you want to build. The Add() method takes one parameter, a SNAP TStrTAttrPr, which is a pair consisting of a string and an attribute. An attribute in SNAP is used to represent different data types using an integer key; you don’t have to worry about this, but just remember that the Schema requires this data type for the columns. There are always 2 components of a TStrTAttrPr: the name of the column, which is a string, and the type of data that the column with that name will hold. The options are atInt (integer attribute), atFlt (float attribute), and atStr (string attribute). Since our columns are type integer, we will use atInt for all of them.
We now have the building blocks for a TTable
with four columns and a context! Next, we’ll show how to create a TTable
from these components, plus a path to a file that we want to make a TTable
from. TTable
objects can be created from comma-separated files (CSV) and tab-separated files (TSV). Here’s an example:
>>> filename = "/path/to/student_grades.tsv"
>>> grade_table = snap.TTable.LoadSS(schema, filename, context, "\t", snap.TBool(True))
For the filename, we simply use the path to that file on the local machine. Then, to create a table, we use the function TTable.LoadSS()
. This function takes in 5 parameters:
The Schema that we made before, which should correspond to the number and types of columns in the TSV file
The name of the path to the file, as a string
The Context created earlier
The separator used in the file (“t” for tab separated, “,” for comma separated, etc.)
A snap.TBool boolean value indicating whether or not the file has a ‘title line,’ that is, a beginning line of column names or other text that is not commented out with a #. Remember that your Schema already has column names, so you don’t want to include them from your CSV or TSV since they’ll throw an error! In our example above, we did have column names in our TSV, so we set this boolean to True.
Now we’ve successfully created a TTable
in SNAP! Recall that you can accommodate any table by changing the Schema for the number and type of columns that you need.
Saving and Loading a TTable
with Binary Format¶
Next, we’ll demonstrate how to save a TTable
and load one from binary. TTable
objects can be saved in binary format because this saves space (in fact, it’s orders of magnitude more efficient than saving it as text). To save a TTable
to binary format, you use the following:
>>> outfile = "/path/to/grade_table.bin"
>>> FOut = snap.TFOut(outfile)
>>> table.Save(FOut)
>>> FOut.Flush()
The four steps are:
Create a path to the file you want to save your
TTable
to.Create a TFOut object. A SNAP TFout object allows writing the contents of a file to the specified pathname.
Save the table to your TFOut object (here, named FOut) using the
Save()
function.Flush your TFOut object. This flushes the write buffer for the stream, meaning that it has been cleared of the contents of our table and it can be used again for further saving operations.
Once we’ve saved a TTable
object to binary format, we can also load TTable
objects from their binary format as follows:
>>> context = snap.TTableContext()
>>> outfile = "/path/to/grade_table.bin"
>>> FIn = snap.TFIn(outfile)
>>> table = snap.TTable.Load(FIn, context)
Again, the four steps of loading a TTable
from binary format are:
Create a Context object for the
TTable
. This is necessary when loading aTTable
that has been stored in binary format.Provide the pathname where the binary file currently resides.
Create an TFIn object with the pathname to the binary file. The SNAP FIn object is used to read the contents of a binary file and parse it back into a more complex data structure. It takes the pathname as a parameter.
Finally, create the
TTable
using theLoad()
method, which takes two parameters: the TFIn object we just made, and the context that was created in Step 1.
We’ve now covered the basics of how to create, save, and load TTable
objects!
Columnwise TTable
Operations¶
Now that we know how to create a TTable
, let’s investigate different column operations that are supported by TTable
objects. These column operations allow us to take two or more columns and create a new column via some operation. These include addition, subtraction, multiplication, division, modulo division, maximum, minimum, and concatenation. They are united by their function names, which are all of the form .ColFunc(), where Func is the operation name. There is also one more advanced function, AggregateCols()
, that allows us to do other operations like count, first, last, mean, and median.
Let’s do an example by taking our table from above and performing some basic operations. Here is the original for reference:
StudentID |
Midterm1 |
Midterm2 |
Final |
---|---|---|---|
101 |
79 |
86 |
88 |
102 |
84 |
80 |
79 |
103 |
56 |
76 |
80 |
104 |
90 |
92 |
96 |
105 |
92 |
85 |
87 |
106 |
87 |
95 |
92 |
107 |
94 |
90 |
91 |
108 |
76 |
88 |
81 |
Let’s say we wanted to know the total number of points that each student earned across the two midterms. To do this, we want to use the ColAdd()
function, which looks like table.ColAdd(Attr1, Attr2, NewColName.
In the ColAdd()
function, we provide three parameters: the first two are the columns we want to add together, using their string names, and the third is the name of the column we want to create that will hold the sums of the first two columns. This is true for all ColFunc() functions. Since we want to get the sum over the midterm scores, we will add together Midterm1 and Midterm2:
>>> grade_table.ColAdd("Midterm1", "Midterm2", "MidScoreSum")
Which yields:
StudentID |
Midterm1 |
Midterm2 |
Final |
MidScoreSum |
---|---|---|---|---|
101 |
79 |
86 |
88 |
165 |
102 |
84 |
80 |
79 |
164 |
103 |
56 |
76 |
80 |
132 |
104 |
90 |
92 |
96 |
182 |
105 |
92 |
85 |
87 |
177 |
106 |
87 |
95 |
92 |
182 |
107 |
94 |
90 |
91 |
184 |
108 |
76 |
88 |
81 |
164 |
Let’s say now that we wanted a column that gave the average of the midterm scores. In this case, we’d use the AggregateCols()
method to create a new column with the mean of the midterm columns, row by row. The AggregateCols()
has parameters table.AggregateCols(AggAttrs, AggOp, NewColName where AggAttrs is the list of columns you’re working with (it can be more than two), and AggOp is the operation you want to perform from the options: aaSum, aaCount, aaMin, aaMax, aaFirst, aaLast, aaMean, aaMedian. We’ll choose aaMean for our purposes here. Last, you’ll again provide the string name of the new column you’d like to create!
Here is the code for getting the mean over the midterm scores:
>>> AggAttrs = snap.TStrV()
>>> AggAttrs.Add("Midterm1")
>>> AggAttrs.Add("Midterm2")
>>> grade_table.AggregateCols(AggAttrs, snap.aaMean, "MidtermMean")
With the result:
StudentID |
Midterm1 |
Midterm2 |
Final |
MidScoreSum |
MidtermMean |
---|---|---|---|---|---|
101 |
79 |
86 |
88 |
165 |
82.5 |
102 |
84 |
80 |
79 |
164 |
82 |
103 |
56 |
76 |
80 |
132 |
66 |
104 |
90 |
92 |
96 |
182 |
91 |
105 |
92 |
85 |
87 |
177 |
88.5 |
106 |
87 |
95 |
92 |
182 |
91 |
107 |
94 |
90 |
91 |
184 |
92 |
108 |
76 |
88 |
81 |
164 |
82 |
A similar methodology can be used for all of the column operation functions for TTable
objects.
One important feature of this function group is: If the third parameter passed is an empty string, i.e.:
>>> table.ColDiv("Col1", "Col2", "")
then the results will overwrite the values in the column of the first parameter. In this case, the results of dividing Col1 values by Col2 values would replace the values in Col1.
Rowwise Table Operations¶
The operations shown above focused on creating new data from some combination of two pre-existing columns. Now, we’ll look at operations that summarize or elucidate information about the table: namely, the Group(), Aggregate(), AggregateCols(), Select(), and Unique() functions. These methods affect the table in different ways. Here, we will describe the use cases of the most important features.
First, we will investigate the Select()
function family, which consists of SelectAtomicIntConst()
, SelectAtomicFltConst()
, SelectAtomicStrConst()
, SelectAtomic()
, and Select()
. You will usually use the first four, as Select()
is utilized for complex, layered selecting parameters.
First, let’s look at SelectAtomic***Const()
functions, which allows you to select rows based on their value in a single column. For example, perhaps you want to select students who had final scores of 90 or above. Here are the general parameters of SelectAtomic***Const()
(insert Int, Flt, or Str depending on the type): table.SelectAtomicIntConst(Column, Val, Cmp, SelectedTable.
Column is the column we want to select on. This would be final scores in the example above. Val is the value we want to compare to, which is 90 in the example above. Cmp is the comparator we want to use, with choices of less then (LT), less than or equal to (LTE), equal to (EQ), not equal to (NEQ), greater than or equal to (GTE), greater than (GT), substring of (SUBSTR), or superstring of (SUPERSTR). In the example above, we want to use greater than or equal to (GTE). Finally, we need to provide a SelectedTable, the table that we want add the selected rows to. Generally, using a new blank table is the right option.
Here’s the code to select only rows where the final score is greater than or equal to 90. Let’s assume we’ve greater a new blank TTable
called ‘above_90_table’:
>>> grade_table.SelectAtomicIntConst("Final", 90, snap.GTE, above_90_table)
Let’s now look at the Group()
and Unique()
functions. The Group()
function allows us to create a new column to label each column according to shared attributes by using Group(GroupByAttrs, GroupAttrName, Ordered=True.
Let’s now look at the Group()
and Unique()
functions. The Group()
function allows us to create a new column to label each column according to shared attributes by using Group(GroupByAttrs, GroupAttrName, Ordered=True.
Here, GroupByAttrs are the columns we want to group with respect to, where their values are the same. GroupAttrName will be the name of the new column with the labels. Let’s say we wanted to group students by their midterm mean score. As we can see above, two students scored an average 91, and two students scored an average 82, so we will see some groups developed. Let’s write the code for this operation:
>>> groupAttrs = snap.TStrV()
>>> groupAttrs.Add("MidtermMean")
>>> table.Group(groupAttrs, "MeanGroups", snap.TBool(True))
Which yields:
StudentID |
Midterm1 |
Midterm2 |
Final |
MidScoreSum |
MidtermMean |
MeanGroups |
---|---|---|---|---|---|---|
101 |
79 |
86 |
88 |
165 |
82.5 |
0 |
102 |
84 |
80 |
79 |
164 |
82 |
1 |
103 |
56 |
76 |
80 |
132 |
66 |
2 |
104 |
90 |
92 |
96 |
182 |
91 |
3 |
105 |
92 |
85 |
87 |
177 |
88.5 |
4 |
106 |
87 |
95 |
92 |
182 |
91 |
3 |
107 |
94 |
90 |
91 |
184 |
92 |
5 |
108 |
76 |
88 |
81 |
164 |
82 |
1 |
Another related method is Unique()
. Rather than assigning the same labels to rows with similar values, any rows with the same sought-after values will be deleted so there are no remaining duplicates, using the paramaters Unique(Attrs, Ordered=True)()
.
Here, Attrs is simply the attributes that need to be equal in order for us to consider them duplicates.
Let’s try this on the original table, and instead of grouping by the midterm mean, we’ll use Unique()
to keep only students with a unique midterm mean score:
>>> attrs = snap.TStrV()
>>> attrs.Add("MidtermMean", snap.TBool(True))
>>> table.Unique(attrs)
Which would instead yield:
StudentID |
Midterm1 |
Midterm2 |
Final |
MidScoreSum |
MidtermMean |
---|---|---|---|---|---|
101 |
79 |
86 |
88 |
165 |
82.5 |
102 |
84 |
80 |
79 |
164 |
82 |
103 |
56 |
76 |
80 |
132 |
66 |
104 |
90 |
92 |
96 |
182 |
91 |
105 |
92 |
85 |
87 |
177 |
88.5 |
Students 106 and 108 have been removed because they had the same midterm mean score as students before them. Remember that Unique() goes from top to bottom row, so earlier rows will be preserved.
Now, let’s investigate the Aggregate()
method, which allows us to aggregate statistics for each row based on values in certain columns. For example, we might want to add a column telling us how many instances of the AuthorID in each row exist in the dataset. Aggregate()
is invoked using parameters Aggregate(GroupByAttrs, AggOp, ValAttr, ResAttr, Ordered=True.
The Aggregate method takes:
GroupByAttrs: The attributes (columns) that you want to aggregate with respect to. This will need to be a vector of strings that you create in advance.
AggOp: The operation you want to aggregate by: options are aaSum, aaCount, aaMin, aaMax, aaFirst, aaLast, aaMean, or aaMedian.
ValAttr: Which attribute (column) we want to aggregate over.
ResAttr: The name of the column where the result of the aggregation will be stored.
Ordered: Whether to treat grouping keys as ordered or unordered.
To make all this more concrete, let’s say we wanted to find the maximum final score over all students based on a particular mean midterm score. That is, for students with the same midterm score, we will add a value to their row indicating the highest final score achieved by someone with their same score. Here’s how we would use Aggregate() to do so:
>>> GroupBy = snap.TStrV()
>>> GroupBy.Add("MidtermMean")
>>> PapAuthT.Aggregate(GroupBy, snap.aaMax, "Final", "MaxFinal", snap.TBool(False))
Here, we use a variable GroupBy to hold a vector of strings representing the columns we want to group with respect to, that is, the MidtermMean column. We then use Aggregate()
with the snap.aaCount function to count the number of times each mean appears in the dataset, and store the count in a new column called MeanCount. Here is what the result will look like:
StudentID |
Midterm1 |
Midterm2 |
Final |
MidScoreSum |
MidtermMean |
MaxFinal |
---|---|---|---|---|---|---|
101 |
79 |
86 |
88 |
165 |
82.5 88 |
|
102 |
84 |
80 |
79 |
164 |
82 |
81 |
103 |
56 |
76 |
80 |
132 |
66 |
80 |
104 |
90 |
92 |
96 |
182 |
91 |
96 |
105 |
92 |
85 |
87 |
177 |
88.5 87 |
|
106 |
87 |
95 |
92 |
182 |
91 |
96 |
107 |
94 |
90 |
91 |
184 |
92 |
91 |
108 |
76 |
88 |
81 |
164 |
82 |
81 |
As you can see, the MaxFinal values indicate the highest final score value for students with the same midterm mean. Notably, we see that students 102 and 108 have the same value, because they have the same midterm score, and their value is the maximum of either of their final scores (81 being higher than 79). The same occurred for students 104 and 106.
Two Table Operations¶
Some SNAP TTable
operations help us to combine two different tables into a single table according to various rules. These functions include Intersection, Union, Join, and Minus. They work as follows:
Intersection()
: creates a new table from all rows that appear in both original tables. Returns a new table.Union()
: creates a new table from all rows that appear in either original table. Returns a new table. UnionAll has a similar function, but retains duplicates of rows across the tables.Minus()
: creates a new table from all rows in the first table not present in the second table. Returns a new table.Join()
: a more customizable function, Join equi-joins two tables based on one attribute in the first table. Columns from the second table will be added to the first where the value of the desired attribute in the first table matches the value of the desired attribute in the second. Does not return a new table, but rather updates the original table with columns from the second table.SimJoin()
: a function that performs an equi-join if the distance between two rows is less than the specified threshold.
Let’s go back to our original grade table with four columns: StudentID, Midterm1, Midterm2, and Final. Let’s say we have another table that lists the student IDs of these students, plus a column with their names:
ID |
Name |
---|---|
101 |
Will |
102 |
Amira |
103 |
Todd |
104 |
Yang |
105 |
Cathy |
106 |
Shubash |
107 |
Nicolo |
108 |
Maria |
Let’s say we want to incorporate the Name column into our original table. We can do this using the Join()
function, with parameters Join(Attr1, PTable, Attr2.
Here, Attr1 is the column we want to join on from the first table, PTable is the second table we want to join with, and Attr2 is the column we want to join on from the second table.
To combine our two tables, we would use:
>>> combined_table = grade_table.Join("StudentID", name_table, "ID")
Which will create a new table called ‘combined_table’ as so:
StudentID |
Midterm1 |
Midterm2 |
Final |
Name |
---|---|---|---|---|
101 |
79 |
86 |
88 |
Will |
102 |
84 |
80 |
79 |
Amira |
103 |
56 |
76 |
80 |
Todd |
104 |
90 |
92 |
96 |
Yang |
105 |
92 |
85 |
87 |
Cathy |
106 |
87 |
95 |
92 |
Shubash |
107 |
94 |
90 |
91 |
Nicolo |
108 |
76 |
88 |
81 |
Maria |
Getting Information from Tables¶
SNAP has many functions to get information from TTable
objects, in the form of vectors or basic data types. Some of the most useful get functions include:
GetNumRows()
GetSchema()
GetIntVal()
,GetFltVal()
, andGetStrVal()
GetIntValAtRowIdx()
,GetFltValAtRowIdx()
, andGetStrValAtRowIdx()
ReadIntCol()
,ReadFltCol()
, andReadStrCol()
These functions are relatively straightforward, and will assist with obtaining pieces of information and summary statistics from the TTable
. The Val functions return single values, and the Col functions return vectors of entire column values.