Vectorizers

A vectorizer transforms a list of tokens into a vector on which mathematical computations can be performed. The vectorizers are generic packages that must be instantiated with a sparse array package that describes the final vector. The first step is to instantiate a sparse array with a Row_Type and a Column_Type, which represent the indices of the vector, and a Value_Type, which represents the value of a cell. For example, a simple counter vector can be declared as follows:

package Counter_Arrays is
   new SCI.Sparse.COO_Arrays (Row_Type    => Positive,
                              Column_Type => Positive,
                              Value_Type  => Natural);

The sparse array only records the cells that have a value. The Row_Type can be used to represent a document and the Column_Type to identify a token; the cell value then holds the number of occurrences of that token in that document.

Counters

The SCI.Vectorizers.Counters package is a vectorizer that counts the occurrences of tokens and builds a vector of these counters. The Vectorizer_Type contains an ordered map of tokens, represented by Token_Type, and associates each of them with a column index within the final vector. When a token is recorded for a document, the vectorizer looks it up in the map, finds the associated column and increments the counter for that document and token. If the token is not found, it is inserted in the map and associated with a new column in the vector. For example, the package can be instantiated as follows:

package Token_Counters is
   new SCI.Vectorizers.Counters (Token_Type => Unbounded_String,
                                 Arrays => Counter_Arrays,
                                 "<" => "<");

An instance of the vectorizer is declared and configured:

 V : Token_Counters.Vectorizer_Type;
 ...
    V.Counters.Default := 0;

The vectorizer is filled with documents and tokens by calling the Add_Token procedure, with the Row identifying the document. Because the vector cell can be any Ada private type, an Increment function must be provided that takes the current value and returns the incremented one.

 function Increment (Value : in Natural) return Natural is (Value + 1);
 Token_Counters.Add_Token (Into      => V,
                           Token     => Item,
                           Increment => Increment'Access);

Once the vectorizer instance has been filled, V.Counters contains the cells that count the occurrences of each token per scanned document. The result can be used to compute similarities between documents (the rows).
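
The lookup-or-insert behaviour described in this section can be illustrated with a small standalone sketch. It does not use the SCI packages: Count_Sketch, Token_Maps, Cell_Maps, Cell_Key and the local Add_Token are hypothetical names built only on the standard Ada containers, and the real vectorizer may be implemented differently.

```ada
with Ada.Text_IO;
with Ada.Containers.Ordered_Maps;
with Ada.Containers.Indefinite_Ordered_Maps;

--  Illustration only: hypothetical names, not the SCI API.
procedure Count_Sketch is

   --  Maps each token to the column assigned on its first appearance.
   package Token_Maps is
      new Ada.Containers.Indefinite_Ordered_Maps (Key_Type     => String,
                                                  Element_Type => Positive);

   --  Coordinates of a cell; only cells with a value are stored.
   type Cell_Key is record
      Row    : Positive;
      Column : Positive;
   end record;

   function "<" (Left, Right : Cell_Key) return Boolean is
     (Left.Row < Right.Row
      or else (Left.Row = Right.Row and then Left.Column < Right.Column));

   package Cell_Maps is
      new Ada.Containers.Ordered_Maps (Key_Type     => Cell_Key,
                                       Element_Type => Natural);

   Tokens : Token_Maps.Map;
   Cells  : Cell_Maps.Map;

   procedure Add_Token (Row : in Positive; Token : in String) is
      Pos    : constant Token_Maps.Cursor := Tokens.Find (Token);
      Column : Positive;
   begin
      if Token_Maps.Has_Element (Pos) then
         Column := Token_Maps.Element (Pos);
      else
         --  Unknown token: give it the next free column.
         Column := Positive (Tokens.Length) + 1;
         Tokens.Insert (Token, Column);
      end if;
      declare
         Key  : constant Cell_Key := (Row => Row, Column => Column);
         Cell : constant Cell_Maps.Cursor := Cells.Find (Key);
      begin
         if Cell_Maps.Has_Element (Cell) then
            Cells.Replace_Element (Cell, Cell_Maps.Element (Cell) + 1);
         else
            Cells.Insert (Key, 1);
         end if;
      end;
   end Add_Token;

begin
   --  Document 1 holds "ada", "is", "ada"; document 2 holds "is".
   Add_Token (1, "ada");
   Add_Token (1, "is");
   Add_Token (1, "ada");
   Add_Token (2, "is");

   for C in Cells.Iterate loop
      Ada.Text_IO.Put_Line
        ("row" & Positive'Image (Cell_Maps.Key (C).Row)
         & " column" & Positive'Image (Cell_Maps.Key (C).Column)
         & " count" & Natural'Image (Cell_Maps.Element (C)));
   end loop;
end Count_Sketch;
```

In this sketch "ada" is assigned column 1 and "is" column 2, so three cells are stored: (1, 1) with count 2, (1, 2) with count 1 and (2, 2) with count 1. The cell (2, 1) is never created, which is the point of using a sparse representation.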

Indefinite counters

The SCI.Vectorizers.Indefinite_Counters package is similar to the SCI.Vectorizers.Counters package but allows indefinite types to be used for the Token_Type. For example, a String can be used as the Token_Type as follows:

package Token_Counters is
   new SCI.Vectorizers.Indefinite_Counters (Token_Type => String,
                                            Arrays => Counter_Arrays,
                                            "<" => "<");

Transformers

The SCI.Vectorizers.Transformers package transforms a count matrix into a normalized tf or tf-idf representation. Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.
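
For reference, the usual textbook form of this weighting can be written as below. It is given here as an assumption: the text does not state which variant (for example a smoothed idf or an extra normalization step) the package actually applies. Here n_{t,d} is the count of token t in document d, N is the number of documents and df(t) is the number of documents that contain t.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```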

The Frequency_Type defines the floating-point type used to represent the frequency. A Convert function must be provided to convert the counter value stored in the sparse array into a Frequency_Type. The transformer is then instantiated:

function To_Float (Value : Natural) return Float is (Float (Value));
package Counter_Transformers is
   new SCI.Vectorizers.Transformers (Frequency_Type => Float,
                                     Arrays => Counter_Arrays,
                                     Convert => To_Float);

Given the counters computed by the SCI.Vectorizers.Counters package, the tf-idf values are computed as follows:

F : Counter_Transformers.Frequency_Arrays.Array_Type;
...
   Counter_Transformers.TIDF (From => V.Counters, Into => F);
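
To make the arithmetic concrete, the following standalone sketch computes tf-idf weights for a tiny hand-written count matrix. It assumes the plain textbook weighting (count divided by the row total, multiplied by the natural logarithm of the number of documents divided by the document frequency), which may differ from what the package computes. It uses dense arrays and hypothetical names (TFIDF_Sketch, Count_Matrix, Frequency_Matrix) and does not call the SCI packages.

```ada
with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;

--  Illustration only: textbook tf-idf on a dense matrix, not the SCI API.
procedure TFIDF_Sketch is

   use Ada.Numerics.Elementary_Functions;

   type Count_Matrix is
      array (Positive range <>, Positive range <>) of Natural;
   type Frequency_Matrix is
      array (Positive range <>, Positive range <>) of Float;

   --  Hypothetical counts: 2 documents (rows) and 3 tokens (columns).
   Counts : constant Count_Matrix (1 .. 2, 1 .. 3) :=
     (1 => (2, 1, 0),
      2 => (0, 1, 3));

   Doc_Count : constant Float := Float (Counts'Length (1));
   Weights   : Frequency_Matrix (Counts'Range (1), Counts'Range (2)) :=
     (others => (others => 0.0));

begin
   for Col in Counts'Range (2) loop
      declare
         Doc_Freq : Natural := 0;
      begin
         --  Document frequency: number of rows in which the token appears.
         for Row in Counts'Range (1) loop
            if Counts (Row, Col) > 0 then
               Doc_Freq := Doc_Freq + 1;
            end if;
         end loop;

         for Row in Counts'Range (1) loop
            --  A non-zero count guarantees Doc_Freq and Row_Total are >= 1.
            if Counts (Row, Col) > 0 then
               declare
                  Row_Total : Natural := 0;
               begin
                  for K in Counts'Range (2) loop
                     Row_Total := Row_Total + Counts (Row, K);
                  end loop;
                  --  tf = count / row total, idf = ln (N / df).
                  Weights (Row, Col) :=
                    (Float (Counts (Row, Col)) / Float (Row_Total))
                    * Log (Doc_Count / Float (Doc_Freq));
               end;
            end if;
         end loop;
      end;
   end loop;

   --  Print one line of weights per document.
   for Row in Weights'Range (1) loop
      for Col in Weights'Range (2) loop
         Ada.Text_IO.Put (Float'Image (Weights (Row, Col)));
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end TFIDF_Sketch;
```

With this weighting, the second token appears in both documents, so its idf is ln (2 / 2) = 0 and its weight is zero in every row. Tokens that occur in only one document keep a positive weight.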