
## Week 1 (Jan - 05)

During the first lecture, Dr. Abbott introduced the course and gave us the whole picture of what we are going to do during this quarter. He also explained some of the possible projects for a master's thesis.

Currently, I am working on my presentation for the next class. After I talked to my adviser, Dr. Crespi, we decided to start building a background for my thesis while I am completing the CS 590 course.

My plan is to study learning problems in one application (we haven't decided yet which application I will use). As a first step, he asked me to study probabilistic automata and the algorithm that Jim Crutchfield has introduced for building an epsilon machine (a probabilistic and also deterministic automaton) from the history of the application, for learning and prediction purposes.

## Week 2 (Jan - 12)

This week I am going to talk briefly about the speech recognition problem and the HMM, which has been used to model this problem.

For more information about the HMM model, you can take a look at this Tutorial.

What I did during the past week was basically some research on the speech recognition problem, to get familiar with it as a first step. I also studied HMMs from the tutorial given above and learned about the three basic problems of HMMs, although the solution to the third one isn't completely clear to me yet and I need more time to study that part again.
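To make the first of those three basic problems concrete for myself, here is a small sketch of the forward algorithm, which computes the probability of an observation sequence given a model. The two-state, two-symbol model below is my own toy example, not taken from the tutorial.

```python
# Forward algorithm for HMM evaluation: compute P(O | model).
# Toy model: 2 hidden states, 2 observable symbols (0 and 1).

def forward(A, B, pi, obs):
    """A[i][j]: transition prob, B[i][k]: emission prob, pi[i]: initial prob."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(O | model) = sum_i alpha_T(i)
    return sum(alpha)

A  = [[0.7, 0.3], [0.4, 0.6]]   # state-transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities for symbols 0 and 1
pi = [0.5, 0.5]                 # initial state distribution

print(forward(A, B, pi, [0, 1, 0]))   # -> 0.099375
```

The naive alternative, summing over all state paths explicitly, would take exponential time; the forward recursion reduces it to O(T * n^2).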

There is one more algorithm that I got familiar with during the last week, called CSSR (Causal-State Splitting Reconstruction), which was introduced by Jim Crutchfield. The algorithm is basically a method for building an epsilon machine; an epsilon machine is a finite, deterministic, probabilistic automaton in which each state, called a causal state, has some special properties. One major property of causal states is that two histories (two series of past data) leave you in the same causal state if they leave you with the same distribution over future data. Another important property is that the causal states themselves form a Markov process.

The authors of the paper believe that their algorithm can be applied in any domain where HMMs have proved their value.

You can check the paper here – An Algorithm for Pattern Discovery in Time Series

Since it is a long paper, I couldn't read the whole thing; I only studied their algorithm for building the epsilon machine.

## Week 3 (Jan - 19)

I have continued searching for and reading papers about the CSSR algorithm. I wanted to figure out which applications the algorithm has been used for, in order to find a good application for my thesis, and also to understand the algorithm better and check its performance in practice.

**What is the Causal-State Splitting Reconstruction (CSSR) algorithm?**

Since I couldn't figure out how to write mathematical notation on the wiki (although it is supposed to be the same as LaTeX, it didn't work for me), I have provided a PDF copy of a description of the CSSR algorithm: CSSR
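To check my own understanding, I wrote a much simplified toy version of the CSSR idea (mine, not the full algorithm): estimate P(next symbol | history) for every history up to some length, then group histories whose estimated distributions are close into the same candidate causal state. The real algorithm uses a proper hypothesis test (e.g. Kolmogorov-Smirnov) and an incremental split-and-reconstruct procedure instead of this naive grouping.

```python
# Simplified CSSR sketch: group histories by their next-symbol distribution.
from collections import defaultdict

def next_symbol_dists(seq, L_max, alphabet):
    """Empirical P(next | history) for all histories of length 1..L_max."""
    counts = defaultdict(lambda: defaultdict(int))
    for L in range(1, L_max + 1):
        for t in range(L, len(seq)):
            counts[seq[t - L:t]][seq[t]] += 1
    dists = {}
    for h, c in counts.items():
        total = sum(c.values())
        dists[h] = tuple(c[a] / total for a in alphabet)
    return dists

def group_states(dists, tol=0.05):
    """Put histories with (approximately) equal distributions in one state."""
    states = []   # each state: (representative distribution, member histories)
    for h, d in sorted(dists.items()):
        for rep, members in states:
            if max(abs(x - y) for x, y in zip(d, rep)) < tol:
                members.append(h)
                break
        else:
            states.append((d, [h]))
    return states

# A period-2 process "ababab..." should yield two causal states:
# one reached after 'a' (next is always 'b') and one after 'b'.
seq = "ab" * 200
states = group_states(next_symbol_dists(seq, L_max=2, alphabet="ab"))
print(len(states))   # -> 2
```

Note how the histories "a" and "ba" end up in the same state, exactly the "same future distribution" property described above.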

I also checked some papers that have applied the CSSR algorithm to NLP tasks. NLP studies the problems of automated generation and understanding of natural human languages. The results they got from using this algorithm are actually a little below the best system used before, but they still think that if they add more features they can improve the performance using CSSR.

## Week 4 (Jan - 26)

During the last week I prepared a presentation based on what I have read so far about the CSSR algorithm. One of my concerns was to find a good application that shows the applicability of the algorithm in practice, but I still haven't found something that gets my attention.

Another question that I had to answer is how we can define the similarity between two probabilistic automata; in other words, what is the distance measure between two probabilistic automata? I found this paper as an answer to the question: Lp Distance and Equivalence of Probabilistic Automata
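By definition, the Lp distance between two probabilistic automata A and B is (sum over all strings x of |P_A(x) - P_B(x)|^p)^(1/p). To get a feel for it, here is my own brute-force sketch (this is NOT the paper's efficient algorithm): it enumerates all binary strings of a fixed length generated by two tiny Markov models and computes the distance directly.

```python
# Brute-force Lp distance between two tiny string distributions.
from itertools import product

def string_prob(pi, T, s):
    """Probability of string s under a simple Markov model:
    pi[a] = P(first symbol = a), T[a][b] = P(b follows a)."""
    p = pi[s[0]]
    for a, b in zip(s, s[1:]):
        p *= T[a][b]
    return p

def lp_distance(modelA, modelB, n, p=2):
    """Lp distance restricted to strings of length n (brute force)."""
    total = 0.0
    for s in product("01", repeat=n):
        diff = abs(string_prob(*modelA, s) - string_prob(*modelB, s))
        total += diff ** p
    return total ** (1.0 / p)

# Model A: sticky (symbols tend to repeat); model B: uniform coin flips.
A = ({"0": 0.5, "1": 0.5}, {"0": {"0": 0.9, "1": 0.1},
                            "1": {"0": 0.1, "1": 0.9}})
B = ({"0": 0.5, "1": 0.5}, {"0": {"0": 0.5, "1": 0.5},
                            "1": {"0": 0.5, "1": 0.5}})

print(lp_distance(A, A, n=5))   # identical models: distance 0
print(lp_distance(A, B, n=5))   # different models: positive distance
```

The brute force is exponential in the string length, which is exactly why an efficient algorithm such as the paper's matters.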

The authors of the paper give an efficient algorithm for computing the distance between two probabilistic automata for even p, and prove that the problem is NP-hard for all odd values of p using a reduction from the Max-Clique problem.

A problem closely related to computing a distance between two probabilistic automata is testing for their equivalence.

I also looked through another paper which used a different measure, called the Kullback-Leibler divergence, to calculate the distance between two probabilistic automata, but neither paper is completely clear to me yet and I need to read them more carefully.
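For reference, the Kullback-Leibler divergence is D(P || Q) = sum_x P(x) * log(P(x) / Q(x)). A quick sketch of my own on two next-symbol distributions; note that, unlike the Lp distance, it is not symmetric, so it is not a true metric.

```python
# Kullback-Leibler divergence between two discrete distributions.
from math import log

def kl(P, Q):
    """D(P || Q); terms with P(x) = 0 contribute nothing by convention."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]
print(kl(P, Q), kl(Q, P))   # the two directions differ
```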

## Week 5 (Feb - 02)

Based on the research I have done about the CSSR algorithm, NLP is one of the areas where using the CSSR algorithm is likely to improve performance.

The paper that I read this week is about using the CSSR algorithm for Named Entity Recognition (NER) (link to the paper). The NER task consists of detecting names referring to entities such as persons, locations, organizations, etc. in a text. NE detection is needed for Named Entity Classification, which is used in many NLP applications such as Question Answering, Information Retrieval, etc. In this paper they only do the detection, not the classification.

They used the "B-I-O" approach to tag the words. Each word gets a B, I, or O tag: B is the tag for a word where an NE begins, I the tag for a word that is part of an NE already begun, and O the tag for a word that is not part of any NE.
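A toy example of my own to illustrate the scheme: the sentence and tags below are invented, and the small helper shows how the entities are recovered from a B-I-O tagging.

```python
# B-I-O tagging example: "United Nations" is one entity, "Geneva" another.
sentence = ["The", "United", "Nations", "met", "in", "Geneva"]
tags     = ["O",   "B",      "I",       "O",   "O",  "B"]

def extract_entities(words, tags):
    """Recover the named entities from a B-I-O tag sequence."""
    entities, current = [], []
    for w, t in zip(words, tags):
        if t == "B":                 # a new entity begins here
            if current:
                entities.append(" ".join(current))
            current = [w]
        elif t == "I" and current:   # the current entity continues
            current.append(w)
        else:                        # O: close any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(extract_entities(sentence, tags))   # -> ['United Nations', 'Geneva']
```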

They used CSSR to learn an automaton for the NE structure; once they have it, it can be used to detect the NEs in untagged text.

To learn the automaton, they used different kinds of information about the words, for example orthographic and morphosyntactic features, the position of the word in the sentence, etc. With this information they translate the words into a closed set of symbols, which is in fact the alphabet of the automaton, and the translated sentence is the sequence that the machine has to learn.
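Here is one way that translation could look. The feature set and symbol names below are my own guesses for illustration, not the exact ones from the paper: each word maps to a symbol from a small closed alphabet based on simple orthographic and positional cues, and the resulting symbol string is what CSSR would learn from.

```python
# Hypothetical word-to-symbol translation for a CSSR alphabet.
def word_to_symbol(word, position):
    if word[0].isupper() and position == 0:
        return "F"   # capitalized but sentence-initial (ambiguous cue)
    if word[0].isupper():
        return "C"   # capitalized inside the sentence (strong NE cue)
    if word.isdigit():
        return "N"   # number
    return "l"       # plain lowercase word

sentence = ["The", "United", "Nations", "met", "in", "1945"]
print("".join(word_to_symbol(w, i) for i, w in enumerate(sentence)))
# -> "FCCllN"
```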

**Some of the challenges in their approach:**

1. The CSSR algorithm is conceived to model stationary processes, but NE patterns do not belong to this category. So what they did is regard a text sequence as a stationary process in which NEs occur once in a while.

2. To find the most likely tag for each word, they used the Viterbi algorithm: for each word in a sentence, the possible states the machine could reach if the current word had tag B, I, or O are considered, the probabilities of these paths are computed, and the one with the highest probability for each tag is recorded.

3. When tagging the NEs in a text, it is possible to find symbol sequences that haven't been seen during training. In this case the automaton goes to a state called the sink state, which receives all the unseen transitions and contains all the unseen suffixes.
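The Viterbi search in point 2 can be sketched as follows. All the probabilities below are numbers I made up for illustration: for every word we keep, per tag, the most probable path ending in that tag, then trace back the overall best path.

```python
# Minimal Viterbi decoder over the three B-I-O tags (toy probabilities).
def viterbi(obs_probs, trans, start):
    """obs_probs: list of {tag: P(word | tag)}, one dict per word;
    trans[a][b] = P(tag b | tag a); start[t] = P(first tag = t)."""
    tags = list(start)
    best = {t: start[t] * obs_probs[0][t] for t in tags}
    back = []
    for probs in obs_probs[1:]:
        prev, best, ptr = best, {}, {}
        for t in tags:
            p, a = max((prev[a] * trans[a][t], a) for a in tags)
            best[t] = p * probs[t]   # best path prob ending in tag t
            ptr[t] = a               # which tag it came from
        back.append(ptr)
    tag = max(best, key=best.get)    # trace back the best path
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

trans = {"O": {"O": 0.8, "B": 0.2, "I": 0.0},   # I can only follow B or I
         "B": {"O": 0.3, "B": 0.1, "I": 0.6},
         "I": {"O": 0.5, "B": 0.1, "I": 0.4}}
start = {"O": 0.8, "B": 0.2, "I": 0.0}
obs = [{"O": 0.9, "B": 0.1, "I": 0.1},   # "The"
       {"O": 0.1, "B": 0.7, "I": 0.3},   # "United"
       {"O": 0.1, "B": 0.2, "I": 0.7},   # "Nations"
       {"O": 0.9, "B": 0.1, "I": 0.1}]   # "met"
print(viterbi(obs, trans, start))   # -> ['O', 'B', 'I', 'O']
```

Setting trans["O"]["I"] = 0 encodes the B-I-O constraint that an I tag can never follow an O tag.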

## Week 6 (Feb - 09)

Based on the last meeting that I had with Dr. Crespi, my thesis will be "learning probabilistic automata, with a focus on the CSSR algorithm, and applying it to an application to detect anomalies".

During this week I have been thinking about implementing the algorithm, and a lot of questions came to my mind. First of all, which data set do I want to use? What is our alphabet? What maximum length of history will I consider? And so on…

I am taking the CS 560 course this quarter, and for the first assignment she gave us a pose detection problem; we used a simple ANN to learn the patterns and then used it to determine the pose of a new image. So I thought the CSSR algorithm is another way to detect the patterns, and since it might perform even better, it would be a good idea to test it. By doing this, even if I don't get a good result, I will understand the algorithm clearly.

Now my goal is to use four categories of face images, with up, left, right, and front poses, and use CSSR to learn these patterns. This lets me answer some of the questions. Assuming that I do some preprocessing on the images and convert them to grayscale, my alphabet will be the numbers from 0 through 255, because each image will be a set of pixels, each with a value between 0 and 255. Actually, I need to think about this more: maybe converting each image to a binary image, to get an alphabet of just two symbols, 0 and 1, would work better. But I need to consider how much information I would lose with the smaller alphabet. Does this affect the performance significantly?
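A sketch of the two alphabet choices I am weighing, on a made-up 4x4 "image": the full 256-symbol grayscale alphabet versus a 2-symbol alphabet obtained by thresholding at the midpoint.

```python
# Toy 4x4 grayscale "image", values in 0..255 (invented data).
image = [200,  30, 210,  25,
          40, 220,  35, 215,
         205,  45, 225,  20,
          50, 230,  15, 235]

grayscale_seq = image                                # alphabet of 256 symbols
binary_seq = [1 if v >= 128 else 0 for v in image]   # alphabet {0, 1}

print(binary_seq)
# -> [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
```

The binary sequence keeps the coarse light/dark structure but throws away all gradations within each half of the range, which is exactly the information-loss trade-off I need to evaluate.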

The CSSR algorithm builds a machine based on one observation, and each image will be represented by an array of numbers. Now let's say I have 10 images for the front pose: do I have to consider each image separately as one observation, build a machine for each, and then combine all 10 machines together, or should I first concatenate all the images into one single observation that contains all 10 images and build one machine? My plan is to do the second, since it seems more meaningful to me. I also thought that I have to train and expand the machine until increasing the length of the history no longer gives a new distribution over the future, so the question of the maximum history length has been answered.

## Week 7 (Feb - 16)

## Week 8 (Feb - 23)

I had put my name on the list to present today and I was working on my presentation, but I was not satisfied with what I had done. There are some questions that I'd like to ask my adviser before presenting those concepts to the class, so I removed my name from the list.

## Week 9 (Mar - 01)

I have been working on my prospectus, and I will present a draft of it during the class.

## Week 10 (Mar - 08)

During this week I have been looking for papers related to my subject in order to complete my prospectus. I looked through two papers: one on anomaly detection via optimal symbolic observation of physical processes, and one on automatic detection in massive moving objects.

In anomaly detection applications, the objective is to detect abnormal conditions occurring within the monitored system by analyzing observable process data. To this end, models are often constructed to characterize normal and abnormal behaviors. A model may represent the possible and unacceptable operational dynamics of a monitored process. Finite state machine (FSM) representations of system components can be formulated and composed to describe the relevant operations of the integrated system.
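A minimal sketch of my own (not from the paper) of how such an FSM-based monitor can work: the machine encodes the acceptable operational dynamics, and any observed event with no defined transition from the current state is flagged as an anomaly.

```python
# Toy FSM monitor: undefined transitions are reported as anomalies.
NORMAL = {                     # state -> {event: next_state}
    "idle":    {"start": "running"},
    "running": {"pause": "idle", "stop": "done"},
    "done":    {},             # terminal state: no acceptable events
}

def monitor(events, state="idle"):
    """Replay observed events against the model of normal behavior."""
    for e in events:
        if e not in NORMAL[state]:
            return ("anomaly", state, e)   # unacceptable transition observed
        state = NORMAL[state][e]
    return ("ok", state, None)

print(monitor(["start", "pause", "start", "stop"]))  # ('ok', 'done', None)
print(monitor(["start", "stop", "start"]))           # anomaly: event after 'done'
```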

To detect anomalies, we need not only a set of sensors to retrieve process data but also an observer to integrate and analyze the collected process information. Thus, optimizing sensor configurations and rigorously synthesizing their corresponding observers are important design goals.

Let's look at an example:

There is a large number of vessels sailing near the American coasts. Manually tracing enormous numbers of moving objects and identifying suspicious ones is not realistic. Therefore, it is highly desirable to develop automated tools that can evaluate the behavior of all maritime vessels and flag the suspicious ones.

In the paper, they study this problem and propose an effective and scalable classification method for detecting anomalous behavior in moving objects.

#### Movement Motifs

They assume the input data consists of a set of labeled movement paths, P = {p_1, p_2, ...}, where each path is a sequence of time-related points of an object. For each object there is also a set of non-spatiotemporal attributes that describe non-motion-related properties, such as "size" and "type".

It is necessary to extract movement motifs at a certain higher abstraction level (not at the precision of seconds and inches) in order to perform efficient and meaningful analysis.

Consider the two ship movements shown above: they share similar movements except for an extra loop in the dashed path. To make semantically meaningful comparisons between moving objects, we extract movement motifs at a proper level for further classification.

A movement motif is a prototypical movement pattern of an object, often pre-defined by domain experts: for example, straight line, right turn, U-turn, loop, etc. Let there be M defined motifs {m_1, m_2, ..., m_M}. A movement path is then transformed into a sequence of motifs together with other pieces of information.

They also assume that the motif extractor is able to extract some spatiotemporal information. Using this information, they can derive a set of attributes: duration, average speed, etc.
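To picture what that transformation produces, here is an invented example of mine (motif names and attribute values are made up): the raw point sequence becomes a short sequence of motifs, each annotated with derived spatiotemporal attributes.

```python
# Raw path: (timestamp in s, x, y) points -- too fine-grained to compare.
raw_path = [(0, 0.0, 0.0), (60, 1.0, 0.1), (120, 2.0, 0.0),
            (180, 2.5, 1.0), (240, 2.0, 2.0)]

# After motif extraction: each motif carries derived attributes
# such as duration and average speed.
motif_path = [
    {"motif": "straight-line", "duration": 120, "avg_speed": 0.017},
    {"motif": "right-turn",    "duration": 120, "avg_speed": 0.012},
]

print([m["motif"] for m in motif_path])   # -> ['straight-line', 'right-turn']
```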

Three ships are shown in the picture above. They are moving in an area that contains an important landmark. The left and right ships have the same movement shapes, except that the left one makes its circle around the landmark. This extra piece of information, combined with the movement motif, can be crucial in decision making: if we also knew that the left ship made that extra move late at night at very low speed, it would be a possible case of abnormal movement.

A vector of such information can be fed into some learning machine, but since the number of distinct values for each attribute is very large, they used a different approach. They propose an extraction process which smooths out the high-granularity values and extracts higher-level features. For example, the exact distance from a ship to a landmark is unimportant; it is enough to know that it is "rather close to it" in comparison with most other ships.

They then apply a clustering method to extract these high-level features. Once we have the clustered information, giving a vector of high-level features, we can classify the movement of an object.
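A sketch of that feature-extraction step in my own simplified form (the paper's clustering method may well differ): rather than feeding raw high-granularity values to a learner, cluster them and use the cluster label ("close to the landmark" vs. "far") as the feature.

```python
# Tiny 1-D k-means used to discretize exact distances into coarse labels.
def kmeans_1d(values, k=2, iters=20):
    """Returns a cluster label per value and the final cluster centers."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]  # spread init
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):                      # move centers to member means
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

# Exact distances (in km, invented) from each ship to the landmark:
distances = [0.4, 0.6, 0.5, 9.8, 10.2, 0.3, 11.0]
labels, centers = kmeans_1d(distances)
features = ["close" if centers[l] == min(centers) else "far" for l in labels]
print(features)
# -> ['close', 'close', 'close', 'far', 'far', 'close', 'far']
```

These coarse labels, combined with the motif sequence, would form the feature vector that the final classifier consumes.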