NETtalk (artificial neural network)

Training

The training dataset was a 20,008-word subset of the Brown Corpus, with a manually annotated phoneme and stress marker for each letter. The development process was described in a 1993 interview: creating the training dataset took three months (250 person-hours), whereas training the network took only a few days.^[3]^[4]

After the network had been trained successfully on this dataset, the authors tried it on a phonological transcription of an interview with a young Latino boy from a barrio in Los Angeles. The resulting network reproduced his Spanish accent.^[2]^: 115

The original NETtalk was implemented on a Ridge 32, which took 0.275 seconds per learning step (one forward and one backward pass). Training NETtalk became a benchmark for testing the efficiency of backpropagation programs. For example, an implementation on the Connection Machine-1 (with 16,384 processors) achieved a 52x speedup, and an implementation on a 10-cell Warp achieved a 340x speedup.^[5]^[6]

The following table compiles benchmark scores as of 1988.^[5]^[6]^[7] Speed is measured in millions of connections per second (MCPS). For example, the original NETtalk on the Ridge 32 took 0.275 seconds per forward-backward pass over its 18,629 weights, giving $\frac{18629/10^{6}}{0.275} \approx 0.068$ MCPS. The last column is normalized to the MicroVax: despite its "Relative Time" label, it gives each system's speed as a multiple of the MicroVax's, so larger values are faster.

Performance Comparison (as of 1988)
System	MCPS	Relative Time
MicroVax	0.008	1
Sun 3/75	0.01	1.3
VAX-11 780	0.027	3.4
Sun 160 with FPA	0.034	4.2
DEC VAX 8600	0.06	7.5
Ridge 32	0.07	8.8
Convex C-1	1.8	225
16,384-core CM-1	2.6	325
Cray-2	7	860
65,536-core CM-1	13	1600
10-cell Warp	17	2100
10-cell iWarp	36	4500
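
The MCPS figure and the last column of the table can be reproduced with simple arithmetic. The following is a minimal Python sketch, assuming that one "connection" corresponds to one of the 18,629 adjustable weights and that the last column is each system's MCPS divided by the MicroVax's; the function names are illustrative and do not come from the cited benchmarks.

```python
# Number of adjustable weights in NETtalk, as stated in the article.
NUM_CONNECTIONS = 18_629


def mcps(seconds_per_step: float, connections: int = NUM_CONNECTIONS) -> float:
    """Millions of connections processed per second for one forward-backward pass."""
    return connections / 1e6 / seconds_per_step


def relative_to_baseline(system_mcps: float, baseline_mcps: float) -> float:
    """Speed of a system as a multiple of a baseline system's speed."""
    return system_mcps / baseline_mcps


# Ridge 32: 0.275 seconds per learning step.
print(round(mcps(0.275), 3))

# 10-cell iWarp (36 MCPS) relative to the MicroVax (0.008 MCPS).
print(round(relative_to_baseline(36, 0.008)))
```

Under these assumptions the Ridge 32 figure rounds to 0.068 MCPS, and the iWarp ratio comes out at 4500, matching the table.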

Architecture

The network had three layers and 18,629 adjustable weights, large by the standards of 1986. There were worries that it would overfit the dataset, but it was trained successfully.^[2]

The input of the network has 203 units, divided into 7 groups of 29 units each. Each group is a one-hot encoding of one character. There are 29 possible characters: 26 letters, comma, period, and word boundary (whitespace). To produce the pronunciation of a single character, the network takes the character itself, as well as 3 characters before and 3 characters after it.
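
The input encoding described above can be sketched in Python as follows. This is an illustration only: the ordering of the 29 symbols, the treatment of positions outside the text as word boundaries, and the name `encode_window` are assumptions, not details taken from the original implementation.

```python
import string

# 29 symbols: 26 letters, comma, period, and a word-boundary marker (space).
# The ordering of the symbols is an assumption made for this sketch.
ALPHABET = string.ascii_lowercase + ",. "
WINDOW = 7  # 3 characters before, the target character, 3 characters after


def encode_window(text: str, center: int) -> list[int]:
    """One-hot encode the 7-character window centred on text[center]."""
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        # Assumption: positions outside the text count as word boundaries.
        ch = text[pos].lower() if 0 <= pos < len(text) else " "
        group = [0] * len(ALPHABET)
        # Raises ValueError for characters outside the 29-symbol alphabet.
        group[ALPHABET.index(ch)] = 1
        vector.extend(group)
    return vector


x = encode_window("a cat sat.", 3)  # window centred on the "a" in "cat"
print(len(x), sum(x))
```

The resulting vector has 7 × 29 = 203 entries, exactly one of which is set in each group of 29.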

The hidden layer has 80 units.

The output layer has 26 units: 21 encode articulatory features of phonemes (point of articulation, voicing, vowel height, etc.), and 5 encode stress and syllable boundaries.
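
The layer sizes can be checked against the stated total of 18,629 adjustable weights. The short Python sketch below assumes that adjacent layers are fully connected with no direct input-to-output connections. Reading the remaining 309 parameters as one bias (threshold) per unit is an interpretation that happens to match the arithmetic, not something stated in the text above.

```python
INPUT_UNITS = 7 * 29    # 7 groups of 29 one-hot units = 203
HIDDEN_UNITS = 80
OUTPUT_UNITS = 21 + 5   # articulatory features + stress/syllable units = 26

# Weights between fully connected adjacent layers (assumption: no skip connections).
connection_weights = INPUT_UNITS * HIDDEN_UNITS + HIDDEN_UNITS * OUTPUT_UNITS

# Assumption: one extra adjustable parameter (bias/threshold) per unit.
total_units = INPUT_UNITS + HIDDEN_UNITS + OUTPUT_UNITS

print(connection_weights)                # 18320
print(total_units)                       # 309
print(connection_weights + total_units)  # 18629
```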

Sejnowski studied the learned representation in the network, and found that phonemes that sound similar are clustered together in representation space. The output of the network degrades, but remains understandable, when some hidden neurons are removed.^[8]
